An In-depth Look at Regular Expressions Part 1: Languages

by | Jan 25, 2016

Most programmers have used, or are at least aware of, regular expressions and how they work. But, fundamentally, what is a regular expression? Well, to understand that, we’ll have to understand a few core concepts.

First, we need to know what an alphabet is. I’m sure you know what an alphabet is in terms of everyday life, but, for our sake, we’ll define an alphabet as a collection of terminal symbols (characters).
e.g. Σ = {λ, a, b} (you can think of λ as just an empty character).

Simple enough, right? But a necessary building block for the next concept.

A language is a collection of words/strings put together from elements of an alphabet.
e.g. L = {ab, ba, a}
So the language L contains only the words “ab”, “ba”, and “a”.

But suppose we have a language with an unknown or infinite number of words. e.g. L = {λ, a, aa, aaa, …} (for our purposes, λ is just a visual representation of the empty string, “”)
We’d want a better way of describing a language like this. This is where grammars come in (grammars can be used in much more familiar ways as well, which we’ll explore later on).

A grammar requires:
– a collection of non-terminal symbols, e.g. N = {S, A, B},
– a starting symbol, which is usually defined as S,
– an alphabet, e.g. Σ = {λ, a, b},
– and a collection of rules

So the grammar for the language {λ, a, aa, aaa, …} would look something like this:
Let N = {S, A} where S is that start symbol,
Σ = {λ, a},
and P = {
S -> λ | A
A -> a | aA
}

So starting with S, we use the first production rule in P to replace S with λ or A, then we use the second rule in P to replace A with a or aA, and we continue to apply that rule to replace any A in the same manner until only terminal symbols (characters) are left. So the progression for each word in the laguage would look like this:
{ S => λ,
S => A => a,
S => A => aA => aa,
S => A => aA => aaA => aaa,
… }

Let’s look at a more real world example, say, english.
So we have an alphabet Σ = {λ, a, A, b, B, c, C, … x, X, y, Y, z, Z, , .},
Our non-terminals N = {<start>, <paragraph>, <sentence>, <subject>, <verb>, <object>},
let <start> be our start symbol,
and our production rules P = {
<start> -> <paragraph> <start> | λ
<paragraph> -> <sentence> <paragraph> | λ
<sentence> -> <subject> <verb> <object>. | <subject> <verb>.
<subject> -> I | You | They
<verb> -> see | eat
<subject> -> me | you | them | cake
}

Note: usually we’d want to have some way of distinguishing between space characters that are in the language and the space characters used for making the production rules readable, but if you know english, it’s pretty clear which is which in this example.

Now this is a VERY simplified version of the english language, but it gives a somewhat real-world view of how grammars work.
Each element of the language contains some number of paragraphs, which each contain some number of sentences, which each contain different kinds of words in specific orders.

So strings that are in the English language include:
– “I see you.”
– “I see cake. I eat cake.”
– “I eat. You see me. They see. I see cake. I eat”
And strings that aren’t in the inglish language include:
– “I see They.”
– “I see you” (no period)
– “I you.”
– “Me eat cake.”

So finally we get to the question. What is a regular expression?
Now I still can’t answer this fully, but knowing what a language is, I can say that a regular expression is a representation of a specific kind of language called a regular language. And in part 2, we’ll get into what exactly a regular language is.

And since the article is about regular expressions, let’s see a regular expression for our simplified English language:
(((I|You|They) (see|eat) (me|you|them|cake).)* )*

Recent Posts