Equivalence of Regular Expressions and FSA (Chapter 7, cont'd)
--------------------------------------------------------------

- We can use a R.E. to denote a specific language. Similarly, a FSA can be
  used to denote a language.

- We show that for every R.E. R there is a FSA M such that the language
  accepted by M is the same as the language denoted by R, i.e. L(R) = L(M).

- Also, we show that for every FSA M there is a R.E. R that denotes the
  language accepted by M, i.e. L(M) = L(R).

- These two facts together imply that R.E. and FSA have the same *power*.

R.E. to FSA.

- We use the definition of a R.E. and induction:

  . Basis:
      if R = empty then M contains only one state and no accepting state.
      if R = e then M contains only one state, which is also an accepting
      state.
      if R = a for some a in S then M contains two states (q_0 and q_1),
      where q_0 is the initial state, q_0 -a-> q_1, and q_1 is accepting.

  . Ind. Step:
      Suppose R = (S + T), where S and T are regular expressions (not the
      alphabet) s.t. for each of them there is a FSA that accepts the
      language it denotes. Call these FSA M and M', where L(M) = L(S) and
      L(M') = L(T). Then L(R) = L(M) union L(M') is accepted by some FSA
      (as we saw for the union operation).
      Similarly, if R = ST then L(R) = L(M) o L(M') is accepted by some FSA,
      and if R = S^* then L(R) = L(M)^* is accepted by some FSA.

- Example: construct a FSA that accepts the language denoted by (1(00)+0)^*

  First consider 00. The FSA for this is q_1 -0-> q_2 -0-> q_3, where q_1 is
  the initial state and q_3 is accepting.
  Then for 1(00) we get q_0 -1-> q_1 -0-> q_2 -0-> q_3, where q_0 is the
  initial state and q_3 is accepting.
  For 0 we have the FSA q'_0 -0-> q'_1, with q'_1 being accepting.
  To take the union of these, we add a new initial state s with two
  e-transitions, to q_0 and q'_0; the accepting states are q_3 and q'_1.
  To make the Kleene star, we add e-transitions from q_3 and q'_1 back to s,
  and we make s accepting so that the empty string is accepted.

FSA to R.E.

- Without loss of generality, suppose that M is a DFSA with states
  Q = {1..n}.
- For every two states i and j in Q and every integer k >= 0 we define a set
  of strings L^k_{i,j}: all those strings x in S^* such that, if we start at
  state i (in M) and the input is x, then the computation of M over x ends
  in state j and does not pass through any intermediate state larger than k.
  (The endpoints i and j themselves may be larger than k.)

- We want to prove that the following predicate:

      P(k): for each i,j in Q, there is a R.E. R^k_{i,j} that denotes
            L^k_{i,j}

  holds for all k >= 0. (For k > n we have L^k_{i,j} = L^n_{i,j}, so only
  0 <= k <= n matters.)

- Why does proving this solve the problem?

- Because: assume that f_1,...,f_t are the accepting states of M. Then the
  language accepted by M is the set of all strings that take M from the
  initial state s to one of f_1,...,f_t. So:

             { empty                                                  if t = 0
      L(M) = {
             { L^n_{s,f_1} union L^n_{s,f_2} union ... union L^n_{s,f_t}  if t > 0

  and since each L^n_{s,f_i} is denoted by R^n_{s,f_i}, L(M) is denoted by:

      R = R^n_{s,f_1} + R^n_{s,f_2} + ... + R^n_{s,f_t},  if t > 0, and
      R = empty,                                          if t = 0.

- To prove P(k), first we write a recursive relation for L^k_{i,j}:

                  { {a in S: d(i,a) = j}                if i <> j
      L^0_{i,j} = {
                  { {e} union {a in S: d(i,a) = j}      if i = j

      for 0 <= k < n:
      L^{k+1}_{i,j} = L^k_{i,j} union
                      (L^k_{i,k+1} o (L^k_{k+1,k+1})^* o L^k_{k+1,j})

- Here is an explanation: for L^0_{i,j}, if i <> j the only strings that
  take M from i to j without going through any other state (because k = 0)
  are the single symbols on the transition arrow(s) from i to j. If i = j,
  you can go from i to i without going through any other state by reading
  the empty string (e), or by reading one of the symbols that take M from i
  to i again.
  For L^{k+1}_{i,j}, the strings that take M from i to j using intermediate
  states from {1..k+1} can be divided into two groups: those that do NOT
  take M through k+1 at all, and those that take M through k+1 at least
  once. The first group are by definition those that take M through states
  {1..k} only, i.e. L^k_{i,j}.
  For the second group, we divide the computation of each string into 3
  parts: the first part is from state i up to the point that M goes to k+1
  for the first time. This gives L^k_{i,k+1}. The second part is a
  concatenation of zero or more strings that take M from k+1 back to k+1
  without using states higher than k, i.e. (L^k_{k+1,k+1})^*. The last part
  is from state k+1 to state j without going through any state higher than
  k, i.e. L^k_{k+1,j}.

- Having this recursive definition for L, it is easy to prove P(k) by
  induction.

  Basis: k = 0. In this case, L^0_{i,j} contains those symbols that take M
  from i to j (and e as well if i = j). So R^0_{i,j} is a_1 + a_2 + ... + a_m,
  where a_1,...,a_m are the symbols that take M from i to j, with e added as
  one more term if i = j. If i <> j and there are no such symbols, then
  R^0_{i,j} = empty.

  Ind. Step: Assume that P(m) holds for some m with 0 <= m < n (the range on
  which the recurrence is defined). To prove P(m+1) we must give a R.E.
  R^{m+1}_{i,j} for L^{m+1}_{i,j}. From the definition of L^{m+1}_{i,j}, we
  get:

      R^{m+1}_{i,j} = R^m_{i,j} + R^m_{i,m+1}(R^m_{m+1,m+1})^* R^m_{m+1,j}

- Example: consider the following FSA with q_1 being the start state and
  q_2 being the accepting state:

      q_1 -0-> q_1,  q_1 -1-> q_2,  q_2 -0-> q_2,  q_2 -1-> q_1

           | k = 0 | k = 1         | k = 2
  -----------------------------------------------------------------------
  R^k_{1,1}| e + 0 | 0^*           | 0^* + 0^*1(0 + 10^*1)^*10^*
  R^k_{1,2}| 1     | 0^*1          | 0^*1 + 0^*1(0 + 10^*1)^*(e + 0 + 10^*1)
  R^k_{2,1}| 1     | 10^*          | 10^* + (e + 0 + 10^*1)(e + 0 + 10^*1)^*10^*
  R^k_{2,2}| e + 0 | e + 0 + 10^*1 | (e+0+10^*1)+(e+0+10^*1)(0+10^*1)^*(e+0+10^*1)

  Since q_1 is the start state and q_2 is the only accepting state, L(M) is
  denoted by R^2_{1,2}, which simplifies to 0^*1(0 + 10^*1)^*.

Regular Languages
-----------------

- We have shown that a language is accepted by a FSA iff there is a R.E.
  that denotes it.

- A language is called regular iff it is denoted by some R.E., or
  equivalently, iff it is accepted by a FSA (deterministic or
  non-deterministic).

- This property of regular languages can be helpful for building a R.E. for
  a language from a FSA for it (or vice versa).

  Example: Suppose that L and L' are two regular languages denoted by two
  R.E.'s R and R', respectively. To construct a R.E.
  for the language L'' = (L intersection L'), one way is to first construct
  a FSA M for L and a FSA M' for L'. We can do this using R and R'. Then,
  using the technique we had in Theorem 7.22 of the text, we construct a FSA
  M'' for L''. Finally, we construct a R.E. from M''.

Non-regular languages
---------------------

- It might seem that we can use R.E. to represent every language.

- But this is not true! That is, there are languages that are not regular.
  For example, the language {0^n1^n: n >= 0} is not regular. Intuitively,
  since each FSA has a finite number of states, it has a bounded memory
  (states are the *memory* of a FSA). If there is a FSA M for this language
  with X states, and if we choose n large enough (any n larger than X will
  do), then there is no way for M to *remember* how many 0's it has seen in
  order to compare with the number of 1's that it sees.

- These arguments can be made formal in the form of a theorem:

- Theorem (Pumping Lemma): Let L be a regular language over S. Then there is
  some n in N s.t. every string x in L with length at least n has the
  following property: there are strings u,v,w in S^* such that:

  . x = u v w
  . v <> e
  . u and v together have length at *most* n, i.e. |uv| <= n
  . u v^k w in L, for all k >= 0

- Example: we want to show that L = {0^m 1^m: m >= 0} is not regular.

  By way of contradiction, assume that L is regular, and let n be the number
  stated in the Pumping Lemma. This means that every string in L with length
  at least n, and in particular x = 0^n 1^n (which has length 2n), has the
  properties listed above. So x = u v w, with v <> e, |u v| <= n, and
  u v^k w in L for all k >= 0. Since |u v| <= n, it follows that u and v
  only contain 0's, and therefore v = 0^i for some i >= 1 (since v <> e).
  But then u v^0 w = u w has n - i 0's and n 1's, i.e. fewer 0's than 1's,
  and therefore cannot be in L. The Pumping Lemma, however, says that u w is
  in L, a contradiction.
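
- The case analysis in the example above can be checked mechanically for
  small n. The following is only a sketch in Python, not part of the text:
  the names in_L and pumpable_splits are made up for illustration, and the
  check tries only k = 0..max_k, so it is a sanity check on the argument and
  not a proof. For x = 0^n 1^n it enumerates every split x = u v w with
  v <> e and |uv| <= n and keeps those splits for which every pumped string
  u v^k w it tries is still in L.

```python
def in_L(x):
    """Membership test for L = {0^m 1^m : m >= 0}."""
    m = len(x) // 2
    return x == "0" * m + "1" * m


def pumpable_splits(n, max_k=3):
    """Return all splits (u, v, w) of x = 0^n 1^n with v <> e and |uv| <= n
    such that u v^k w is in L for every k in 0..max_k (hypothetical helper)."""
    x = "0" * n + "1" * n
    survivors = []
    for i in range(n + 1):             # i = |u|
        for j in range(i + 1, n + 1):  # j = |uv|, so v is non-empty
            u, v, w = x[:i], x[i:j], x[j:]
            if all(in_L(u + v * k + w) for k in range(max_k + 1)):
                survivors.append((u, v, w))
    return survivors


if __name__ == "__main__":
    for n in range(1, 8):
        # Every allowed split has v = 0^i with i >= 1, so k = 0 already fails.
        print(n, pumpable_splits(n))
```

  No split survives for any n tried, which matches the proof: whatever n the
  Pumping Lemma supplies, the string 0^n 1^n has no decomposition with the
  required properties.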