Master's degree in Electronics (D.E.A.)


Abstract/Summary:

Morphological analysis attempts to recognize the words a text is composed of and to associate a set of linguistic information with each of them. The morphological analysis of Arabic faces several problems which make it harder and more critical than the morphological analysis of other languages.

  • Arabic agglutinates articles, prepositions and conjunctions at the beginning of the word, and pronouns at the end of the word. Words therefore cannot be identified simply by cutting the text at separators (spaces, punctuation, etc.). A segmentation of words must take place in order to subdivide the "morphological units" into proclitics, radical and enclitics.

  • Because of the agglutination, certain Arabic words change form. A radical may have different forms, depending on whether or not it is agglutinated to an affix and on which affix is agglutinated to it. This is the problem of multigraphy.

  • In Arabic, a text can be fully vowelized, partially vowelized or have no vowels at all. The Morphological Analyzer must be able to handle all of these texts and to restore the vowels of unvowelized texts.

    To meet these demands, it became necessary to set up dictionaries as well as various rules.
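    The three kinds of text mentioned above can be told apart mechanically. The following is a minimal sketch, not the method of the thesis: it assumes that vowels are encoded as the Unicode Arabic marks U+064B to U+0652, and it uses a deliberately crude, hypothetical criterion (a text counts as "full" only if every letter carries a mark).

```python
# Arabic vowel and related marks, Unicode U+064B..U+0652 (assumed encoding).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))
TATWEEL = "\u0640"  # elongation character, not a letter


def is_arabic_letter(ch):
    """True for characters in the basic Arabic letter range."""
    return "\u0621" <= ch <= "\u064A" and ch != TATWEEL


def vowelization_level(text):
    """Classify a text as 'none', 'partial' or 'full' (hypothetical labels).

    Assumption: 'full' means every letter carries at least one mark,
    which is stricter than ordinary orthographic practice.
    """
    letters = 0
    marked = 0
    after_unmarked_letter = False
    for ch in text:
        if is_arabic_letter(ch):
            letters += 1
            after_unmarked_letter = True
        elif ch in DIACRITICS:
            # Count each letter once, even if it carries several marks.
            if after_unmarked_letter:
                marked += 1
                after_unmarked_letter = False
        else:
            after_unmarked_letter = False
    if letters == 0 or marked == 0:
        return "none"
    return "full" if marked == letters else "partial"
```

    A real analyzer would need a finer criterion, since long vowels and word-final letters are often left unmarked even in carefully vowelized text.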

    Schematically, the Morphological Analyzer proceeds in two stages. During the first stage, the "morphological units" are identified, then cut up into proclitics, radical and enclitics. For each word, all possible cut-ups are searched for. This stage is called the generation phase, or identification of word candidates.
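    The generation phase can be illustrated as follows. This is a sketch under stated assumptions: the lists PROCLITICS and ENCLITICS are tiny, hypothetical samples (the thesis relies on purpose-built dictionaries), and the function name generate_cutups is invented for the example. Every split whose prefix and suffix appear in the lists is kept, however implausible; filtering is left to the second stage.

```python
# Tiny illustrative clitic lists (assumptions, far from complete).
# The empty string stands for "no proclitic" / "no enclitic".
PROCLITICS = ["", "و", "ف", "ب", "ل", "ال", "وال", "بال"]
ENCLITICS = ["", "ه", "ها", "هم", "ك", "نا"]


def generate_cutups(unit):
    """Return every (proclitic, radical, enclitic) split of a unit.

    No linguistic validation is done here: the goal is to over-generate
    candidates, which a later phase is expected to purge.
    """
    cutups = []
    for pro in PROCLITICS:
        if not unit.startswith(pro):
            continue
        rest = unit[len(pro):]
        for enc in ENCLITICS:
            if not rest.endswith(enc):
                continue
            radical = rest[:len(rest) - len(enc)]
            if radical:  # a cut-up must keep a non-empty radical
                cutups.append((pro, radical, enc))
    return cutups


if __name__ == "__main__":
    # "and his book": one of the candidates is و + كتاب + ه
    for cutup in generate_cutups("وكتابه"):
        print(cutup)
```

    Even with such short lists a single unit yields several candidates, which is why an elimination phase is needed.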

    During the second stage, all illegally generated cut-ups are searched for and eliminated. This is the intruder elimination or purification phase. Several steps are necessary to carry out this work. First, the text is cut up into "morphological units" by means of an automaton. Thereafter, each "morphological unit" is analyzed separately. It is vowelized so that vowelized and unvowelized texts can be analyzed by the same algorithm; it is then cut up into proclitics, radical and enclitics with the help of a specially designed dictionary. Subsequently, rewriting rules resolve the problem of multigraphy. These steps generate a large set of cut-ups, which are then vowelized; a great number of them are illicit and are eliminated during the next step.
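    To make the idea of rewriting rules for multigraphy concrete, here is a hedged sketch. The two rules shown are well-known Arabic spelling alternations chosen for illustration; they are not taken from the thesis, whose actual rule set is not given here, and the function name candidate_radicals is hypothetical. Given a radical as it appears in a cut-up, the function proposes the spellings under which it might be listed in a dictionary.

```python
TA_MARBUTA = "\u0629"    # ة, word-final form
TA = "\u062A"            # ت, form taken before an enclitic
ALEF = "\u0627"          # ا
ALEF_MAQSURA = "\u0649"  # ى, word-final form


def candidate_radicals(radical, enclitic):
    """Return possible dictionary spellings of a radical (illustrative rules).

    The surface form always comes first; rewritten forms follow.
    Over-generation is intended: wrong guesses are expected to fail
    the later dictionary lookup.
    """
    forms = [radical]
    if enclitic:
        # Rule 1 (illustrative): final ة is written ت before an enclitic,
        # e.g. مدرسة + ه is written مدرسته.
        if radical.endswith(TA):
            forms.append(radical[:-1] + TA_MARBUTA)
        # Rule 2 (illustrative): final ى may be written ا before an
        # enclitic, e.g. مستشفى + ه is written مستشفاه.
        if radical.endswith(ALEF):
            forms.append(radical[:-1] + ALEF_MAQSURA)
    return forms
```

    For the cut-up مدرست + ه, the function returns both مدرست and مدرسة; only the second would be expected to survive a dictionary search.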

    The generated radicals are searched for in a dictionary. For the cut-ups that are not eliminated, the generated vowels are compared with those of the text, where the text provides any. Thereafter, heuristics make it possible to check compatibility between the proclitics and enclitics of a cut-up, then between the proclitics and the radical, and between the radical and the enclitics. The rules of vocalic harmony then make it possible to eliminate other invalid cut-ups.
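    The comparison between generated vowels and the vowels present in the text might look like the sketch below. It assumes Unicode marks U+064B to U+0652 for the vowels; the helper names split_marks and vowels_compatible are hypothetical. A generated vowelization is accepted when the letters match and every mark written in the text also appears, on the same letter, in the generated form; letters the text leaves unmarked impose no constraint.

```python
# Arabic vowel and related marks, Unicode U+064B..U+0652 (assumed encoding).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def split_marks(word):
    """Split a word into [letter, marks] pairs, attaching marks to the
    letter that precedes them."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])
    return pairs


def vowels_compatible(text_word, generated_word):
    """True if the generated vowelization does not contradict the text."""
    text_pairs = split_marks(text_word)
    gen_pairs = split_marks(generated_word)
    if len(text_pairs) != len(gen_pairs):
        return False
    for (t_letter, t_marks), (g_letter, g_marks) in zip(text_pairs, gen_pairs):
        if t_letter != g_letter:
            return False
        # Marks are compared as sets so that their order does not matter.
        if not set(t_marks) <= set(g_marks):
            return False
    return True
```

    With this check, a text word written كَتب (one vowel given) is compatible with the generated form كَتَبَ but not with كُتُب, while the fully unvowelized كتب is compatible with both.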

    Finally, once linguistic information has been attributed to each cut-up and each vowelization, the validity of the rewriting rules applied during the first stage is checked. The remaining cut-ups, along with their linguistic information, are then transmitted to the syntactic analysis. If no cut-up is left, the word entered is considered erroneous and is given default grammatical values. It is then transmitted either to the syntactic analyzer or to a corrector (human or machine).

    Paris, September 1989.

