Regular expressions

This section provides a brief overview over regular expressions. In the context of refdb, we have to deal with two flavors of regular expressions: Unix-style and SQL. The former are more important as we use them to write queries. The latter are used sparingly, e.g. to search the filenames of databases.

The difference between a search and a regular expression search is that the latter allows some "fuzziness" in the search string. The former requires that the search string and the search result match character by character. In simple words, regular expressions allow to search for strings which are similar to some extent, and you can exactly specify to which extent.

Unix-style regular expressions

Regular expressions distinguish between regular characters and special characters (meta characters). The simplest regular expressions actually don't look like regular expressions, as the following example shows:

foo

This will search for the string "foo" at any position in the target elements. This would find strings like "foobar", "lifoo", or "lifoobar". That is: if there are no meta characters, a simple string match is attempted, however at any position in the element. This is different from search strategies in some other databases where a full match or a left-match is attempted by default.

We can now replace one "o" in the above sample with a meta character. We use the "." (dot) which matches any single character, including a newline, at that position:

f.o

This will find strings like "fao", "fdo", but as well all strings of the previous example.

Another very common meta character is the "*", which matches zero or more instances of the previous character. Thus,

fo*

will now find things like "fo", "foooo", but also "fbar" and "lifooobar". The meta character "+" is similar, but requires at least one instance of the previous character:

fo+

This would retrieve all strings of the last example except "fbar" as this contains the "o" zero times.

fo?

The questionmark meta character will retrieve either zero or one instances of the previous character. This would match "f" and "fo", but not "foo".

The meta characters "^" and "$" are important to determine the relation of the search string to the line start or line end:

^foo

This will match "foo" only if it is located at the line start. Similarly,

foo$

will find "foo" only when it is located at the line end. If you combine these two like in the next example:

^foo$

"foo" will be found only if this is the complete element, starting and ending the line.

The following list briefly explains some more terms which are helpful in regular expressions.

()

Use the round brackets to group characters to a sequence. This is particularly useful with the above mentioned metacharacters *, +, and ?.

(foo)*

This will match zero or more instances of the sequence "foo". It will find e.g. "foo" and "foofoo", but not "fofo".

[]

matches any single character between the brackets.

[0-9]

This will match any digit. Continuous ranges of characters can be indicated with a dash, as seen here.

[^]

matches any single character except the ones between the brackets

[^abc]

This will match any character except "a", "b", and "c".

\

The backslash escapes the following meta character and treats it as a literal character.

\.

This will match only the dot instead of any single character.

\{n,m\}

This will find n to m repeats of the previous character.

fo\{2,3\}

This regular expression will find "foo" and "fooo", but not "fo" or "foooo".

For further information about regular expressions, see the regex chapter in the MySQL documentation.

SQL regular expressions

SQL regular expressions are much simpler, as there are only two metacharacters:

%

matches any string

_ (underscore)

matches any single character