Regular Expressions in C

Regular expressions are used to search for strings in text. A regular expression is a string of ordinary characters and metacharacters, where a metacharacter has a special meaning instead of standing for itself. A regular expression denotes a set of strings: exactly those strings that it matches.

In a typical use, the requirements identify a set of strings of interest, which may be infinite. This set is described by a single regular expression, which then matches the strings of interest in the input.

The term “regular expression” is often abbreviated as “regex”, and sometimes as “regexp”.

Regular Expressions
Regular Expression API in C
Example Program
See also

1.0 Regular Expressions

By default, regcomp interprets a pattern as a Basic Regular Expression (BRE). Passing the REG_EXTENDED flag to regcomp selects the Extended Regular Expression (ERE) syntax instead. ERE adds operators such as +, ? and |, and some constructs are written differently: grouping and intervals are \( \) and \{ \} in BRE, but ( ) and { } in ERE. ERE is usually the more convenient choice. One caveat is that POSIX has traditionally specified back-references only for BRE; back-references in ERE work as an extension in some implementations, such as the GNU C library.

Regular Expressions
Regular Expression	Description
c	Any character, c, except for special characters, matches itself.
\c	For any special character, c, the meaning is turned off and c is matched.
^	Anchors to the beginning of the line.
$	Anchors to the end of the line.
.	Any single character.
[…]	Any one of the characters inside brackets. Ranges like a-e are OK.
[[:lower:]]	Any one of the lowercase letters (for C locale and ASCII character coding, a-z).
[[:upper:]]	Any one of the uppercase letters (for C locale and ASCII character coding, A-Z).
[[:alpha:]]	Any one of the alphabetic characters (from the union of [[:lower:]] and [[:upper:]]).
[[:digit:]]	Any one of the digits, [0-9]. Perl-style shorthands such as \d, \l, \u and \x are not part of the POSIX syntax.
[^[:digit:]]	Any one character that is not a digit, [^0-9]. The Perl-style \D is not part of the POSIX syntax.
[[:alnum:]]	Any one of the alphanumeric characters (from the union of [[:alpha:]] and [[:digit:]]).
\w	Alphanumeric and underscore (“_”), same as [[:alnum:]_] (GNU extension)
\W	Non-word characters, same as [^[:alnum:]_] (GNU extension)
\b	Word boundary (GNU extension)
[[:punct:]]	Any one of the punctuation characters (for C locale and ASCII character coding, from ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~)
[[:graph:]]	Any one of the graphical characters (from the union of [[:alnum:]] and [[:punct:]])
[[:space:]], \s	Any one of the space characters (for C locale and ASCII character encoding, from tab, newline, vertical tab, form feed, carriage return, and space). \s is a GNU extension.
\S	Non-space characters, same as [^[:space:]] (GNU extension)
[[:print:]]	Any one of the printable characters ([[:graph:]] and space).
[[:blank:]]	One of the blank characters (space and tab).
[[:cntrl:]]	Any one of the control characters (for ASCII, octal 000 through octal 037, and octal 177 (DEL)).
[[:xdigit:]]	Any one of the hexadecimal digits (0-9, a-f and A-F)
[^…]	Any character not in …
r*	r is matched 0 or more times.
r+	r is matched 1 or more times (ERE)
r?	r is matched 0 or 1 time (ERE)
r{n}	r is matched exactly n times (ERE; written r\{n\} in BRE)
r{n,}	r is matched at least n times (ERE; written r\{n,\} in BRE)
r{,m}	r is matched at most m times (GNU extension; the portable form is r{0,m})
r{n,m}	r is matched at least n times but not more than m times (ERE; written r\{n,m\} in BRE)
r1r2	r1 followed by r2 is matched.
r1|r2	Either r1 or r2 is matched. (ERE)
( )	Parentheses define a marked subexpression (ERE; written \( \) in BRE). A string matched by a subexpression can be recalled later using the construct \n, where n is a digit from 1 to 9.
\n	Back-reference. String matched by the nth subexpression, earlier in the regular expression. n is a digit from 1 to 9. Specified by POSIX for BRE; available in ERE as an extension in some implementations, including the GNU C library.

2.0 Regular Expression API in C

#include <regex.h>

int regcomp (regex_t *preg, const char *regex, int cflags);
int regexec (const regex_t *preg, const char *string, size_t nmatch,
             regmatch_t pmatch[], int eflags);

size_t regerror (int errcode, const regex_t *preg, char *errbuf,
                 size_t errbuf_size);

void regfree (regex_t *preg);

Before a regular expression can be used in a C program, it must be compiled into a form suitable for matching. This is done with the regcomp function. regcomp takes the regular expression as a null-terminated string, regex, along with flags, cflags, and stores the compiled regular expression in the regex_t structure pointed to by preg, which the caller provides. The compiled form is what is passed to later regexec calls. regcomp returns zero on success and a non-zero error code on failure.

cflags is the bitwise OR of zero or more flags. The important flags are REG_EXTENDED, REG_ICASE, REG_NEWLINE and REG_NOSUB. REG_EXTENDED selects the ERE syntax; without it, the BRE syntax is used. REG_ICASE makes matching case insensitive. REG_NEWLINE makes newline act as a line separator: operators that match any character, such as . and a non-matching bracket list, do not match a newline, and ^ and $ also match just after and just before a newline within the string. REG_NOSUB means that the positions of matches need not be reported; if it is used, the nmatch and pmatch parameters are ignored in subsequent regexec calls.

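regexec matches the compiled regular expression pointed to by preg against a null-terminated string, passed as the second parameter, string. The parameter pmatch is used for returning the locations of matches, and nmatch is the number of elements in pmatch. eflags is the bitwise OR of zero or more of REG_NOTBOL and REG_NOTEOL, which tell regexec that the start or the end of string is not the beginning or the end of a line. pmatch is an array of regmatch_t structures, defined as,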

typedef struct {
    regoff_t rm_so;
    regoff_t rm_eo;
} regmatch_t;

An rm_so value of -1 marks an unused entry: the corresponding subexpression either does not exist or did not take part in the match. Later entries may still hold valid data. If rm_so is not -1, it is the start offset of the matched substring within the given string. Similarly, rm_eo is the offset of the first character after the matched substring. When several matches are possible, the one that starts earliest in the string is reported and, among the matches starting there, the longest one (leftmost-longest matching).

The data for a match of the entire regular expression is returned in pmatch[0]. In subsequent pmatch entries, data for the sub-expressions (given in parentheses in the regular expression) are returned. The size of the pmatch array should be at least 1, for getting the offset values for the entire regular expression. If you want offset data for N sub-expressions, the size of pmatch should be N+1. All unused entries are returned with rm_so set to -1. If you are not interested in any of the offset values, you can pass 0 for nmatch and NULL for pmatch, or use the flag REG_NOSUB in the previous regcomp call.

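regexec returns zero for a successful match and REG_NOMATCH if no match is found. Some implementations may also return other non-zero error codes, such as REG_ESPACE when memory runs out.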

regerror provides the error string for an error code returned by an earlier regcomp or regexec call. The input parameters are errcode, the error code returned by that call, and preg, the pointer to the compiled regular expression that was used. The error string is returned in errbuf, which must be allocated and provided by the caller; errbuf_size is the size of errbuf. regerror returns the size of the buffer required to hold the complete error string, including the terminating null. If errbuf_size is smaller than that, the string is truncated to fit errbuf and is still null-terminated.

One way to use regerror is to call it first with errbuf as NULL and errbuf_size as zero to get the required size, then allocate a buffer of that size and call regerror a second time to get the error string.

regfree frees the memory associated with preg, where preg is the pointer to the structure used in the previous regcomp call. regfree does not free preg itself; it frees the undocumented internal data fields associated with the regex_t structure. regfree must be called for preg once the regexec and regerror calls that use it are done, and before preg is reused in a new regcomp call; otherwise that memory is leaked.

3.0 Example Program

We have an example program, find-str, which takes in a regular expression and input filename as arguments and lists lines with matched patterns, somewhat like the grep program. However, find-str lists the strings matched and their offset from the beginning of line for each matched line. First, the regular expression is compiled with the regcomp call and then pattern matching is done using the regexec call.

/*
 *
 *   find-str.c: find string in file
 *
 */

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <regex.h>

#define ARRAY_SIZE(array) (sizeof((array)) / sizeof((array)[0]))

char buf [1024];    // current input line
char errbuf [256];  // message from regerror

int main (int argc, char **argv)
{
    FILE *fp;
    regex_t     regex;
    regmatch_t  pmatch [3]; // Entire match and up to 2 sub-expressions
    regoff_t    offset, length;
    int ret;

    if (argc != 3) {
        fprintf (stderr, "Usage: find-str pattern file\n");
        exit (EXIT_FAILURE);
    }

    if ((fp = fopen (argv [2], "r")) == NULL) {
        perror ("fopen");
        exit (EXIT_FAILURE);
    }

    // Extended Regular Expressions, case insensitive search.
    // The pattern is used directly from argv [1]; no fixed-size copy is needed.
    ret = regcomp (&regex, argv [1], REG_EXTENDED | REG_ICASE | REG_NEWLINE);
    if (ret != 0) {
        (void) regerror (ret, &regex, errbuf, sizeof (errbuf));
        fprintf (stderr, "Error: regcomp: %s\n", errbuf);
        exit (EXIT_FAILURE);
    }

    printf ("Matches:\n");

    while (fgets (buf, sizeof (buf), fp)) {
        char *s = buf;
        for (int i = 0; ; i++) {
            // After the first match, s is no longer at the beginning of the line
            int eflags = (i == 0) ? 0 : REG_NOTBOL;

            ret = regexec (&regex, s, ARRAY_SIZE (pmatch), pmatch, eflags);
            if (ret != 0) {
                if (ret != REG_NOMATCH) {
                    (void) regerror (ret, &regex, errbuf, sizeof (errbuf));
                    fprintf (stderr, "Error: regexec: %s\n", errbuf);
                    exit (EXIT_FAILURE);
                }
                break;
            }
            if (i == 0)
                printf ("\n%s", buf);
            offset = pmatch [0].rm_so + (s - buf);
            length = pmatch [0].rm_eo - pmatch [0].rm_so;
            printf ("#%d:\n", i);
            printf ("offset = %jd; length = %jd\n", (intmax_t) offset,
                (intmax_t) length);
            // The precision argument of %.*s must be an int
            printf ("substring = \"%.*s\"\n", (int) length, s + pmatch [0].rm_so);

            for (size_t j = 0; j < ARRAY_SIZE (pmatch); j++) {
                if (pmatch [j].rm_so == -1)
                    continue;  // this sub-expression took no part in the match
                offset = pmatch [j].rm_so + (s - buf);
                length = pmatch [j].rm_eo - pmatch [j].rm_so;
                printf ("\toffset = %jd; length = %jd\n", (intmax_t) offset,
                    (intmax_t) length);
                printf ("\tsubstring = \"%.*s\"\n", (int) length,
                    s + pmatch [j].rm_so);
                printf ("\t[%zu] %jd %jd \n", j, (intmax_t) pmatch [j].rm_so,
                    (intmax_t) pmatch [j].rm_eo);
            }
            if (pmatch [0].rm_eo == 0)
                break;  // empty match at s: stop, or the loop would never advance
            s += pmatch [0].rm_eo;
        }
    }

    // free internal storage fields associated with regex
    regfree (&regex);

    if (fclose (fp)) {
        perror ("fclose");
        exit (EXIT_FAILURE);
    }

    exit (EXIT_SUCCESS);
}

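The printf statement that prints the matched substring, shown here in its simplest form,

printf("substring = \"%.*s\"\n", length, s + pmatch[0].rm_so);

is a little unusual because there are two arguments after the format string, but only one of them is printed. It is an example of printf's dynamic precision capability. The * in “%.*s” means that the precision, which for %s is the maximum number of characters to print, is taken from the argument list, in this case from length. That argument must have type int, so length needs a cast to int on systems where regoff_t is a wider type.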

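We can compile and run the above program as below. Here, the regular expression, passed as the first argument, is “(\b\w+\b)(\s+\1\b)+”, which looks for lines with consecutive duplicate words. It relies on \b, \w, \s and a back-reference inside an ERE, all of which are extensions provided by the GNU C library, so it may not work with other regcomp implementations. Since regcomp is called with the REG_ICASE flag, matching is case insensitive.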

$ gcc find-str.c -o find-str
$ 
$ cat input
She she sells sea sea shells Shells shells by the sea shore Shore.
A quick abcrown fox jumps jumps over a abclazy dog.
$ 
$ ./find-str "(\b\w+\b)(\s+\1\b)+" input
Matches:

She she sells sea sea shells Shells shells by the sea shore Shore.
#0:
offset = 0; length = 7
substring = "She she"
	offset = 0; length = 7
	substring = "She she"
	[0] 0 7 
	offset = 0; length = 3
	substring = "She"
	[1] 0 3 
	offset = 3; length = 4
	substring = " she"
	[2] 3 7 
#1:
offset = 14; length = 7
substring = "sea sea"
	offset = 14; length = 7
	substring = "sea sea"
	[0] 7 14 
	offset = 14; length = 3
	substring = "sea"
	[1] 7 10 
	offset = 17; length = 4
	substring = " sea"
	[2] 10 14 
#2:
offset = 22; length = 20
substring = "shells Shells shells"
	offset = 22; length = 20
	substring = "shells Shells shells"
	[0] 1 21 
	offset = 22; length = 6
	substring = "shells"
	[1] 1 7 
	offset = 35; length = 7
	substring = " shells"
	[2] 14 21 
#3:
offset = 54; length = 11
substring = "shore Shore"
	offset = 54; length = 11
	substring = "shore Shore"
	[0] 12 23 
	offset = 54; length = 5
	substring = "shore"
	[1] 12 17 
	offset = 59; length = 6
	substring = " Shore"
	[2] 17 23 

A quick abcrown fox jumps jumps over a abclazy dog.
#0:
offset = 20; length = 11
substring = "jumps jumps"
	offset = 20; length = 11
	substring = "jumps jumps"
	[0] 20 31 
	offset = 20; length = 5
	substring = "jumps"
	[1] 20 25 
	offset = 25; length = 6
	substring = " jumps"
	[2] 25 31 
$