----------------------
Formal language theory  [Chapter 7]
----------------------

Basic definitions:
  - Alphabet: any finite, non-empty set of atomic symbols (meaning
    "compound" symbols like "ab" are not allowed)
    (e.g., {a,b,c}, {0,1}, {+}).
  - String: any finite sequence of symbols; the empty sequence is denoted
    "e" and called the "empty string" (e.g., a, ab, cccc are strings over
    {a,b,c}; e, ++++ are strings over {+}).
  - Language: any set of strings (can be empty, finite, or infinite)
    (e.g., {bab, bbabb, bbbabbb, ...} is a language over {a,b,c}).
  - The _length_ of a string s is the number of symbols in s, denoted |s|
    (e.g., |bba| = 3, |+| = 1, |e| = 0).
  - The _concatenation_ of two strings s and t is denoted st (sometimes s.t)
    and consists of every symbol of s followed by every symbol of t
    (e.g., bba.bb = bbabb, e.+++ = +++).
  - The _reversal_ of a string x, denoted (x)^R, is the string obtained by
    listing the symbols of x in reverse order (e.g., (bba)^R = abb).
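
  A minimal Python sketch of the definitions above, assuming strings are
  modelled as Python str values with "" standing for e.  The names EMPTY,
  is_string_over and reverse are illustrative only.

```python
EMPTY = ""  # stands for the empty string e

def is_string_over(s, alphabet):
    """True iff every symbol of s belongs to the given alphabet."""
    return all(symbol in alphabet for symbol in s)

def reverse(s):
    """(s)^R: the symbols of s listed in reverse order."""
    return s[::-1]

print(is_string_over("cccc", {"a", "b", "c"}))  # cccc is a string over {a,b,c}
print(len("bba"), len("+"), len(EMPTY))         # the lengths |bba|, |+|, |e|
print("bba" + "bb")                             # the concatenation bba.bb
print(EMPTY + "+++")                            # the concatenation e.+++
print(reverse("bba"))                           # the reversal (bba)^R
```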


Basic conventions:
  - Alphabets cannot contain the symbol "e" for the empty string.
  - Two strings s, t are equal iff |s| = |t| and the i-th symbol of s = the
    i-th symbol of t for 1 <= i <= |s|.
  - For an alphabet S, let S^n denote the set of all strings of length n
    over S, and S^* denote the set of all strings over S.
    E.g., {0,1}^3 = {000,001,010,011,100,101,110,111},
    {+}^* = {e,+,++,+++,++++,...}, {a,b,c}^0 = {e}.
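
  The sets S^n can be enumerated mechanically.  The sketch below assumes
  symbols are single characters; strings_of_length and strings_up_to are
  illustrative names.  Since S^* is infinite, only a finite part of it
  (all strings up to a chosen length) can be listed.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """S^n: all strings of length n over the alphabet, in sorted order."""
    return ["".join(symbols) for symbols in product(sorted(alphabet), repeat=n)]

def strings_up_to(alphabet, max_len):
    """The finite part of S^* containing all strings of length 0..max_len."""
    result = []
    for n in range(max_len + 1):
        result.extend(strings_of_length(alphabet, n))
    return result

print(strings_of_length({"0", "1"}, 3))
print(strings_of_length({"a", "b", "c"}, 0))  # only the empty string
print(strings_up_to({"+"}, 4))
```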

Operations on languages:
  - Complementation: ~L = S^* - L  (where S is the alphabet of L)
  - Union: L union L' = {x: x in L or x in L'}
  - Concatenation: L o L' = {xy: x in L and y in L'}
  - Kleene star: L^* = {x1 x2 ... xn : n >= 0 is an integer and each xi in L}
    (any number of strings from L concatenated; n = 0 gives e)
  - Language exponentiation: L^k = { {e}            if k = 0 
                                   { L^{k-1} o L    if k > 0
  - Reversal: Rev(L) = {(x)^R : x in L }
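
  For finite languages these operations map directly onto Python sets.  A
  minimal sketch, assuming languages are sets of str; the function names are
  illustrative.  Union is the built-in set operator |.  L^* is infinite
  whenever L contains a nonempty string, so star_up_to lists only the
  strings of L^* up to a chosen length.

```python
def concat(L1, L2):
    """L1 o L2: every string of L1 followed by every string of L2."""
    return {x + y for x in L1 for y in L2}

def power(L, k):
    """L^k, built as L^0 = {e} and L^k = L^(k-1) o L."""
    result = {""}
    for _ in range(k):
        result = concat(result, L)
    return result

def star_up_to(L, max_len):
    """The strings of L^* whose length is at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string from L.
        longer = {w for w in concat(frontier, L) if len(w) <= max_len}
        frontier = longer - result
        result |= frontier
    return result

def rev(L):
    """Rev(L): the reversal of every string in L."""
    return {x[::-1] for x in L}

print(sorted(power({"a", "b"}, 2)))
print(sorted(star_up_to({"aa", "b"}, 3)))
print(sorted(rev({"ab", "c"})))
```

  Note that star_up_to({"aa", "b"}, 3) contains "aab", which mixes two
  different strings of L.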

-------------------
Regular expressions
-------------------
  - Regular expressions describe sets of strings using a small number of
    basic operations.
  - The set of regular expressions (regexps) over an alphabet S is defined
    as follows, assuming that S does not contain symbols "{}" and "e":
      . {} (empty set symbol), e (empty string symbol) are regexps;
      . every element of S is a regexp;
      . if R and S are regexps (in this context S names a regexp, not the
        alphabet), then so are:
          R+S (union) -- lowest precedence,
          RS (concatenation),
          R* (star, also called "Kleene star") -- highest precedence.
  - For each regular expression R, we define the language described by R
    (L(R)) as follows:
      . L({}) = {} (the empty set of strings)
      . L(e) = {e} (the set that contains only the empty string)
      . L(a) = {a} for every symbol a in S
      . L(R+S) = L(R) U L(S) (set union)
      . L(RS) = { rs : r in L(R) and s in L(S) }
      . L(R*) = { s in S* : s = e or s = s1...sk for some k>=1 and some
                            strings s1 in L(R), ..., sk in L(R) }
              = all possible concatenations of 0 or more strings from L(R)
              = [intuitively] L(e+R+RR+RRR+RRRR+...)
  - Examples:
      . L(a+b) = {a,b}
      . L(ab) = {ab}
      . L((a+b)a) = {aa,ba} = L(aa+ba)
      . L(a*) = {e,a,aa,aaa,...}
        (zero or more repetitions of "a")
      . L(aa*) = {a,aa,aaa,...} = L(a*a)
        (one or more repetitions of "a")
      . L((ab)*) = {e,ab,abab,ababab,...}
        (zero or more repetitions of "ab")
      . L((a+b)*) = {e,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}
        (zero or more repetitions of a's or b's, i.e.,
        every string of a's and b's)
      . L(a*+b*) = {e,a,b,aa,bb,aaa,bbb,aaaa,bbbb,...}
        (every string consisting entirely of a's or entirely of b's)
      . L((a+b)(a+b)*) = {a,b,aa,ab,ba,bb,...}
        (every nonempty string of a's and b's)
      . L(a(ba+c)*) = {a,aba,ac,ababa,abac,acba,acc,...}
  - Examples of writing a regular expression for a given language:
     1. All strings of 0's and 1's that have the same first and last symbol:
        e + 0 + 1 + 0(0+1)*0 + 1(0+1)*1
     2. L' = {all strings of 0's and 1's that contain at least one 0}:
        (0+1)*0(0+1)*  or  1*0(0+1)*  or  (0+1)*01*
  - Regular expressions R and S are equivalent (denoted R == S) iff they
    represent the same language, i.e., L(R) = L(S).  For example,
    1*0(0+1)* == (0+1)*01*.  This is not completely obvious and needs proof.
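
  The inductive definition of L(R) can be turned into a small evaluator.
  The sketch below assumes a hypothetical encoding of regexps as nested
  tuples; the tags "empty", "eps", "sym", "alt", "cat", "star" and the
  functions lang and agree_up_to are illustrative names.  Because L(R) may
  be infinite, lang(r, n) returns only the strings of length at most n.
  Two regexps that agree up to some bound are not thereby proved
  equivalent; the check is evidence only.

```python
# Encoding:  ("empty",) for {}     ("eps",) for e       ("sym", "a") for a
#            ("alt", R, S) for R+S ("cat", R, S) for RS ("star", R) for R*

def lang(r, n):
    """The set of strings in L(r) whose length is at most n."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if n >= 1 else set()
    if kind == "alt":
        return lang(r[1], n) | lang(r[2], n)
    if kind == "cat":
        return {x + y for x in lang(r[1], n) for y in lang(r[2], n)
                if len(x) + len(y) <= n}
    if kind == "star":
        base = lang(r[1], n) - {""}   # e adds nothing new under star
        result, frontier = {""}, {""}
        while frontier:
            longer = {x + y for x in frontier for y in base
                      if len(x) + len(y) <= n}
            frontier = longer - result
            result |= frontier
        return result
    raise ValueError("unknown regexp node: %r" % (kind,))

def agree_up_to(r1, r2, n):
    """True iff L(r1) and L(r2) contain the same strings of length <= n."""
    return lang(r1, n) == lang(r2, n)

zero, one = ("sym", "0"), ("sym", "1")
any01 = ("star", ("alt", zero, one))                    # (0+1)*
left = ("cat", ("star", one), ("cat", zero, any01))     # 1*0(0+1)*
right = ("cat", any01, ("cat", zero, ("star", one)))    # (0+1)*01*
print(agree_up_to(left, right, 8))
```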

  - General properties of regular expressions:
      . R+S == S+R  [commutativity]
      . (R+S)+T == R+(S+T)  [associativity]
      . (RS)T == R(ST)  [associativity]
      . R(S+T) == RS+RT  [left distributivity]
      . (S+T)R == SR+TR  [right distributivity]
      . R+{} == R  [identity]
      . eR == R == Re  [identity]
      . {}R == {} == R{}  [annihilator]
      . R** == R*  [idempotence]

  - Example of proof of equivalence:
    We prove that (0+1)*01* == 1*0(0+1)*, by double inclusion.
    L((0+1)*01*) subset of L(1*0(0+1)*):
        Let s be an arbitrary string in L((0+1)*01*).  Then, s = r0t for
        some strings r in L((0+1)*) and t in L(1*).  Consider two cases.
        Case 1: If r contains no 0, then r also belongs to L(1*).  Since t
            belongs to L((0+1)*) (every string of 0's and 1's does), s = r0t
            for some string r in L(1*) and some string t in L((0+1)*) so s
            belongs to L(1*0(0+1)*).
        Case 2: If r contains at least one 0, then r contains a first 0 so
            we can write r = u0v for some string u in L(1*) and some string
            v in L((0+1)*) (either u or v, or both, may be empty).  But
            then, s = u0v0t = u0w (where w = v0t) for some string u in L(1*)
            and w in L((0+1)*), so s belongs to L(1*0(0+1)*).
        In both cases, s belongs to L(1*0(0+1)*).
    L(1*0(0+1)*) subset of L((0+1)*01*): a similar argument works, splitting
        at the last 0 instead of the first.  (Note that this is not always
        the case: in general, each direction of the proof can involve
        different kinds of arguments and must be done separately.)
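
  The two decompositions used in the proof can be tried out on small
  strings.  A minimal sketch, assuming Python's re module, where union is
  written | rather than +; the function names are illustrative.  It checks,
  for all strings of 0's and 1's up to a bound, that both regexps match
  exactly the strings containing a 0, and that cutting at the first
  (respectively last) 0 yields pieces of the required form.  This
  illustrates the proof; it does not replace it.

```python
import re
from itertools import product

LEFT = re.compile(r"(0|1)*01*")    # (0+1)*01*
RIGHT = re.compile(r"1*0(0|1)*")   # 1*0(0+1)*

def split_at_first_zero(s):
    """Write s = u0w with u in L(1*), cutting at the first 0."""
    i = s.index("0")
    return s[:i], s[i + 1:]

def split_at_last_zero(s):
    """Write s = r0t with t in L(1*), cutting at the last 0."""
    i = s.rindex("0")
    return s[:i], s[i + 1:]

def check(max_len):
    """True iff the claims hold for every string of length <= max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            s = "".join(symbols)
            in_left = LEFT.fullmatch(s) is not None
            in_right = RIGHT.fullmatch(s) is not None
            if in_left != in_right or in_left != ("0" in s):
                return False
            if "0" in s:
                u, w = split_at_first_zero(s)
                r, t = split_at_last_zero(s)
                if "0" in u or "0" in t:
                    return False
                if u + "0" + w != s or r + "0" + t != s:
                    return False
    return True

print(split_at_first_zero("1101001"))
print(split_at_last_zero("1101001"))
print(check(10))
```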

---------------------
Finite state automata
---------------------
  - Simple models of computing devices used to analyze strings.  A F.S.A.
    has a fixed, finite set of "states", one of which is the "initial state"
    and some of which are "accepting" (or "final") states, as well as
    "transitions" from one state to another for each possible symbol of a
    string.  The F.S.A. starts in its initial state and processes a string
    symbol-by-symbol: for each symbol processed, the F.S.A. moves to the
    state that its transitions specify for the current state and that
    symbol.  Once the entire string has been processed, the F.S.A. will
    either be in an accepting state (in which case the string is "accepted")
    or not (in which case the string is "rejected").
  - Example:
    Simplified control mechanism for a vending machine that accepts only
    nickels (5c), dimes (10c) and quarters (25c), where everything costs
    exactly 30c and no change is ever given.
    Alphabet S = {n,d,q} (for "nickel", "dime", "quarter"), set of states =
    {0,5,10,15,20,25,30} (for amount of money put in so far; no need to keep
    track of excess since no change will be provided), and transitions are
    defined by the following table (state across the top, input symbol down the
    side), with the initial state being "0" and the only accepting state
    being "30":
          |  0  5 10 15 20 25 30
       ---+----------------------
        n |  5 10 15 20 25 30 30
        d | 10 15 20 25 30 30 30
        q | 25 30 30 30 30 30 30

    Computation of F.S.A. on input such as "dndd" proceeds as follows:
    state 0 -> process 'd' -> state 10 -> process 'n' -> state 15 -> process
    'd' -> state 25 -> process 'd' -> state 30.  Since the last state is
    accepting, the string "dndd" is accepted.
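
    The computation above can be reproduced with a short simulation.  A
    minimal Python sketch; the names VALUE, step, run and accepts are
    illustrative.  It relies on the observation that every entry of the
    table equals the current amount plus the coin's value, capped at 30.

```python
VALUE = {"n": 5, "d": 10, "q": 25}   # coin values in cents

def step(state, coin):
    """One table lookup: add the coin's value, never exceeding 30."""
    return min(state + VALUE[coin], 30)

def run(word, start=0):
    """The list of states visited while processing word from start."""
    trace = [start]
    for coin in word:
        trace.append(step(trace[-1], coin))
    return trace

def accepts(word):
    """True iff the machine ends in the accepting state 30."""
    return run(word)[-1] == 30

print(run("dndd"))       # states visited on the input dndd
print(accepts("dndd"))
print(accepts("dd"))     # only 20c inserted
```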

  - Formal definition: A F.S.A. is a quintuple (Q,S,d,s,F) where
        Q is a finite set of states
        S is a finite alphabet (Q intersect S is empty)
        d : Q x S -> Q is a transition function that gives the next state
            for each possible state in Q and symbol in S
        s in Q is the initial state
        F subset of Q is the set of accepting states
  - Transition function gives new state for each state and single input
    symbol.  Extended transition function d*(q,w) gives new state for F.S.A.
    after processing string w starting from state q.  It can be defined
    recursively, as follows:
                  { q              if w = e (empty),
        d*(q,w) = {
                  { d(d*(q,w'),a)  if w = w'a for some w' in S* and a in S.
    Example: d*(5,ndn) = d(d*(5,nd),n) = d(d(d*(5,n),d),n) =
    d(d(d(d*(5,e),n),d),n) = d(d(d(5,n),d),n) = d(d(10,d),n) = d(20,n) = 25.
  - A string w is "accepted" by a F.S.A. A iff d*(s,w) is in F; otherwise, w
    is "rejected".  The language accepted by a F.S.A. A is defined as
    L(A) = { w in S* : A accepts w (i.e., d*(s,w) is in F) }.
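
    The recursive definition of d* translates almost line for line into
    code.  A minimal Python sketch, assuming an F.S.A. is given as a tuple
    (Q, S, d, s, F) with d stored as a dictionary keyed by (state, symbol)
    pairs; the names delta_star, accepts and VENDING are illustrative.  The
    vending machine serves as the test case.

```python
def delta_star(delta, q, w):
    """Extended transition function d*(q, w), by recursion on w."""
    if w == "":
        return q                       # case w = e
    prefix, a = w[:-1], w[-1]          # case w = w'a
    return delta[(delta_star(delta, q, prefix), a)]

def accepts(fsa, w):
    """True iff d*(s, w) is an accepting state."""
    _states, _alphabet, delta, start, accepting = fsa
    return delta_star(delta, start, w) in accepting

# The vending machine as a quintuple (Q, S, d, s, F).
VALUE = {"n": 5, "d": 10, "q": 25}
Q = {0, 5, 10, 15, 20, 25, 30}
DELTA = {(q, coin): min(q + v, 30) for q in Q for coin, v in VALUE.items()}
VENDING = (Q, set(VALUE), DELTA, 0, {30})

print(delta_star(DELTA, 5, "ndn"))
print(accepts(VENDING, "dndd"))
```

    The first call reproduces the worked example d*(5,ndn) = 25.  The
    recursion mirrors the definition; for long inputs a simple loop over
    the symbols of w avoids deep recursion.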

  - Example: Come up with F.S.A. that recognizes L = { w in {a,b}* : w
    contains an even number of a's }.  Use states that represent information
    about string processed so far.  In this case, only need to remember if
    number of a's seen so far is even or odd, so only need two states "even"
    and "odd".  Initial state should be "even" (since before reading any
    symbol, number of a's processed so far = 0 is even) and set of accepting
    states is simply {"even"}.
    To represent transition function, transition diagrams are a useful
    notation.  Each state represented by a node (labelled with state),
    transitions represented by directed edges labelled with input symbol
    (i.e., d(q,a) = q' represented by edge from q to q' labelled with a).
    Initial state has "dangling" in-edge, accepting states have double
    circles for nodes (in ASCII picture below, accepting states will be
    boxed and non-accepting states will have no box or circle).
                    a         _
            ----  ----->    |/ \
        -->|even|        odd    | b
            ----  <-----     \_/
           /  |\     a
           \___/
             b
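
    The two-state machine in the diagram can be written down and tested
    directly.  A minimal Python sketch; DELTA, accepts and
    matches_definition are illustrative names.  The last function compares
    the machine against the definition of L (an even number of a's) on all
    strings up to a given length, which is a sanity check rather than a
    proof of correctness.

```python
from itertools import product

# Transitions read off the diagram: a switches state, b stays put.
DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
START = "even"
ACCEPTING = {"even"}

def accepts(w):
    """Run the machine on w and report whether it ends in 'even'."""
    state = START
    for symbol in w:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING

def matches_definition(max_len):
    """True iff accepts(w) agrees with 'w has an even number of a's'."""
    for n in range(max_len + 1):
        for symbols in product("ab", repeat=n):
            w = "".join(symbols)
            if accepts(w) != (w.count("a") % 2 == 0):
                return False
    return True

print(accepts("abba"), accepts("ab"), accepts(""))
print(matches_definition(8))
```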