----------------------
Formal language theory  [Chapter 7]
----------------------

Basic definitions:

- Alphabet: any finite, non-empty set of atomic symbols (meaning "compound"
  symbols like "ab" are not allowed) (e.g., {a,b,c}, {0,1}, {+}).
- String: any finite sequence of symbols; the empty sequence is denoted "e"
  and called the "empty string" (e.g., a, ab, cccc are strings over {a,b,c};
  e, ++++ are strings over {+}).
- Language: any set of strings (can be empty, finite, or infinite)
  (e.g., {bab, bbabb, bbbabbb, ...} is a language over {a,b,c}).
- The _length_ of a string s is the number of symbols in s, denoted |s|
  (e.g., |bba| = 3, |+| = 1, |e| = 0).
- The _concatenation_ of two strings s and t is denoted st (sometimes s.t)
  and consists of every symbol of s followed by every symbol of t
  (e.g., bba.bb = bbabb, e.+++ = +++).
- The _reversal_ of a string x, denoted (x)^R, is the string obtained by
  listing the symbols of x in reverse order (e.g., (bba)^R = abb).

Basic conventions:

- Alphabets cannot contain the symbol "e", which is reserved for the empty
  string.
- Two strings s, t are equal iff |s| = |t| and the i-th symbol of s equals
  the i-th symbol of t for 1 <= i <= |s|.
- For an alphabet S, let S^n denote the set of all strings of length n over
  S, and S^* denote the set of all strings over S.
  E.g., {0,1}^3 = {000,001,010,011,100,101,110,111},
        {+}^* = {e,+,++,+++,++++,...},
        {a,b,c}^0 = {e}.

Operations on languages:

- Complementation: ~L = S^* - L (where S is the alphabet of L)
- Union: L union L' = {x: x in L or x in L'}
- Concatenation: L o L' = {xy: x in L and y in L'}
- Kleene star: L^* = {x1 x2 ... xn : n >= 0 is an integer and
  x1, ..., xn in L} (n = 0 gives e), i.e., the union of L^k over all k >= 0
- Language exponentiation:
          { {e}           if k = 0
    L^k = {
          { L^{k-1} o L   if k > 0
- Reversal: Rev(L) = {(x)^R : x in L}

-------------------
Regular expressions
-------------------

- Regular expressions describe sets of strings using a small number of basic
  operations.
- The set of regular expressions (regexps) over an alphabet S is defined as
  follows, assuming that S does not contain the symbols "{}" and "e":
    . {} (empty set symbol), e (empty string symbol) are regexps;
    . every element of S is a regexp;
    . if R and S are regexps, then so are:
          R+S (union) -- lowest precedence,
          RS (concatenation),
          R* (star, also called "Kleene star") -- highest precedence.
- For each regular expression R, we define the language described by R
  (L(R)) as follows:
    . L({}) = {} (the empty set of strings)
    . L(e) = {e} (the set that contains only the empty string)
    . L(a) = {a} for every symbol a in S
    . L(R+S) = L(R) U L(S) (set union)
    . L(RS) = { rs : r in L(R) and s in L(S) }
    . L(R*) = { s in S* : s = e or s = s1...sk for some k>=1 and some
                strings s1 in L(R), ..., sk in L(R) }
            = all possible concatenations of 0 or more strings from L(R)
            = [intuitively] L(e+R+RR+RRR+RRRR+...)
- Examples:
    . L(a+b) = {a,b}
    . L(ab) = {ab}
    . L((a+b)a) = {aa,ba} = L(aa+ba)
    . L(a*) = {e,a,aa,aaa,...}
      (zero or more repetitions of "a")
    . L(aa*) = {a,aa,aaa,...} = L(a*a)
      (one or more repetitions of "a")
    . L((ab)*) = {e,ab,abab,ababab,...}
      (zero or more repetitions of "ab")
    . L((a+b)*) = {e,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}
      (zero or more repetitions of a's or b's, i.e., every string of a's
      and b's)
    . L(a*+b*) = {e,a,b,aa,bb,aaa,bbb,aaaa,bbbb,...}
      (every string consisting entirely of a's or entirely of b's)
    . L((a+b)(a+b)*) = {a,b,aa,ab,ba,bb,...}
      (every nonempty string of a's and b's)
    . L(a(ba+c)*) = {a,aba,ac,ababa,abac,acba,acc,...}
- Examples of regular expressions:
    1. All strings of 0's and 1's that have the same first and last symbol:
           e + 0 + 1 + 0(0+1)*0 + 1(0+1)*1
    2. L' = {all strings of 0's and 1's that contain at least one 0}:
           (0+1)*0(0+1)*   or   1*0(0+1)*   or   (0+1)*01*
- Regular expressions R and S are equivalent (denoted R == S) iff they
  represent the same language, i.e., L(R) = L(S). For example,
  1*0(0+1)* == (0+1)*01*. This is not completely obvious and needs proof.
- General properties of regular expressions:
    . R+S == S+R                [commutativity]
    . (R+S)+T == R+(S+T)        [associativity]
    . (RS)T == R(ST)            [associativity]
    . R(S+T) == RS+RT           [left distributivity]
    . (S+T)R == SR+TR           [right distributivity]
    . R+{} == R                 [identity]
    . eR == R == Re             [identity]
    . {}R == {} == R{}          [annihilator]
    . R** == R*                 [idempotence]
- Example of proof of equivalence: We prove that (0+1)*01* == 1*0(0+1)*, by
  double inclusion.

  L((0+1)*01*) subset of L(1*0(0+1)*):
    Let s be an arbitrary string in L((0+1)*01*). Then s = r0t for some
    strings r in L((0+1)*) and t in L(1*). Consider two cases.
    Case 1: If r contains no 0, then r also belongs to L(1*). Since t
      belongs to L((0+1)*) (every string of 0's and 1's does), s = r0t for
      some string r in L(1*) and some string t in L((0+1)*), so s belongs
      to L(1*0(0+1)*).
    Case 2: If r contains at least one 0, then r contains a first 0, so we
      can write r = u0v for some string u in L(1*) and some string v in
      L((0+1)*) (either u or v, or both, may be empty). But then
      s = u0v0t = u0w (where w = v0t) for some string u in L(1*) and w in
      L((0+1)*), so s belongs to L(1*0(0+1)*).
    In both cases, s belongs to L(1*0(0+1)*).

  L(1*0(0+1)*) subset of L((0+1)*01*):
    A similar argument works, splitting on the last 0 instead of the
    first. (Note that this is not always the case: in general, each
    direction of the proof can involve different kinds of arguments and
    must be done separately.)

---------------------
Finite state automata
---------------------

- Simple models of computing devices used to analyze strings. A F.S.A. has
  a fixed, finite set of "states", one of which is the "initial state" and
  some of which are "accepting" (or "final") states, as well as
  "transitions" from one state to another for each possible symbol of a
  string. The F.S.A. starts in its initial state and processes a string
  symbol-by-symbol: for each symbol processed, the F.S.A. moves to the
  state that its transitions give for its current state and that input
  symbol. Once the entire string has been processed, the F.S.A.
  will either be in an accepting state (in which case the string is
  "accepted") or not (in which case the string is "rejected").
- Example: Simplified control mechanism for a vending machine that accepts
  only nickels (5c), dimes (10c) and quarters (25c), where everything costs
  exactly 30c and no change is ever given.
  Alphabet S = {n,d,q} (for "nickel", "dime", "quarter"),
  set of states = {0,5,10,15,20,25,30} (for the amount of money put in so
  far; no need to keep track of excess since no change will be provided),
  and transitions are defined by the following table (state across the
  top, input symbol down the side), with the initial state being "0" and
  the only accepting state being "30":

        |  0   5  10  15  20  25  30
     ---+----------------------------
      n |  5  10  15  20  25  30  30
      d | 10  15  20  25  30  30  30
      q | 25  30  30  30  30  30  30

  Computation of the F.S.A. on an input such as "dndd" proceeds as follows:
      state 0  -> process 'd' -> state 10 -> process 'n' ->
      state 15 -> process 'd' -> state 25 -> process 'd' -> state 30.
  Since the last state is accepting, the string "dndd" is accepted.
- Formal definition: A F.S.A. is a quintuple (Q,S,d,s,F) where
      Q is a finite set of states
      S is a finite alphabet (Q intersect S is empty)
      d : Q x S -> Q is a transition function that gives the next state for
          each possible state in Q and symbol in S
      s in Q is the initial state
      F subset of Q is the set of accepting states
- The transition function gives the new state for each state and single
  input symbol. The extended transition function d*(q,w) gives the state
  the F.S.A. reaches after processing string w starting from state q. It
  can be defined recursively, as follows:

                { q               if w = e (empty),
      d*(q,w) = {
                { d(d*(q,w'),a)   if w = w'a for some w' in S* and a in S.

  Example: d*(5,ndn) = d(d*(5,nd),n)
                     = d(d(d*(5,n),d),n)
                     = d(d(d(d*(5,e),n),d),n)
                     = d(d(d(5,n),d),n)
                     = d(d(10,d),n)
                     = d(20,n)
                     = 25.
- A string w is "accepted" by a F.S.A. A iff d*(s,w) is in F; otherwise, w
  is "rejected". The language accepted by a F.S.A.
  A is defined as L(A) = { w in S* : A accepts w (i.e., d*(s,w) is in F) }.
- Example: Come up with F.S.A. that recognizes
      L = { w in {a,b}* : w contains an even number of a's }.
  Use states that represent information about string processed so far. In
  this case, only need to remember if number of a's seen so far is even or
  odd, so only need two states "even" and "odd". Initial state should be
  "even" (since before reading any symbol, number of a's processed so far
  = 0 is even) and set of accepting states is simply {"even"}.

  To represent transition function, transition diagrams are a useful
  notation. Each state represented by a node (labelled with state),
  transitions represented by directed edges labelled with input symbol
  (i.e., d(q,a) = q' represented by edge from q to q' labelled with a).
  Initial state has "dangling" in-edge, accepting states have double
  circles for nodes (in ASCII picture below, accepting states will be boxed
  and non-accepting states will have no box or circle).

           b                        b
          /-\                      /-\
          | v          a           | v
        +------+   -------->
    --> | even |                   odd
        +------+   <--------
                       a
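
  The definitions above can be tried out with a short program. What follows
  is only a minimal sketch in Python, not part of the original notes: the
  names run_fsa, vending_delta and even_delta are made up for illustration,
  and the transition function d is assumed to be stored as a dictionary
  keyed by (state, symbol) pairs. The loop computes d*(q,w) iteratively,
  which gives the same state as the recursive definition.

```python
def run_fsa(delta, start, accepting, w):
    """Return (d*(start, w), accepted?) for the F.S.A. described by delta.

    delta     -- dict mapping (state, symbol) to the next state (assumed encoding)
    start     -- the state to start from
    accepting -- set of accepting states
    w         -- input string, processed one symbol at a time
    """
    state = start
    for symbol in w:
        state = delta[(state, symbol)]
    return state, state in accepting


# Vending machine from the notes: the state is the amount inserted so far,
# capped at 30 because no change is given.
COINS = {"n": 5, "d": 10, "q": 25}
vending_delta = {
    (amount, coin): min(amount + value, 30)
    for amount in range(0, 35, 5)
    for coin, value in COINS.items()
}

# Automaton for strings over {a,b} with an even number of a's.
even_delta = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

print(run_fsa(vending_delta, 0, {30}, "dndd"))        # (30, True)
print(run_fsa(vending_delta, 5, {30}, "ndn"))         # (25, False)
print(run_fsa(even_delta, "even", {"even"}, "abab"))  # ('even', True)
```

  The second call reproduces the worked example d*(5,ndn) = 25 by starting
  from state 5 instead of the initial state. A dictionary is just one
  convenient encoding of the transition table; a nested list or a function
  would work equally well.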