Master's degree in Electronics (D.E.A.)
- Thesis Title: Morphological Analysis of Written Arabic
- University: Université Paris XI Orsay,
  Institut National des Sciences et Techniques Nucléaires
  (CEN-Saclay), France
- Research lab: CNRS (Centre National de la Recherche Scientifique),
  Unité de Recherche Associée No. 962
  (Informatique Droit Linguistique)
- Supervisor: Dr. F. Debili
- Keywords: Natural Language Processing, Morphological Analysis,
  Agglutination, Vocalic Harmony, Vowelization, Transliteration,
  Multigraphies, Dictionary, Linguistic Information, Natural
  Language Querying.
Abstract/Summary:
Morphological Analysis attempts to recognize the words that make up a
text and to associate a set of linguistic information with each of them.
The Morphological Analysis of Arabic faces several problems that make it
more difficult and more delicate than the Morphological Analysis of
other languages.
Arabic agglutinates articles, prepositions and conjunctions at the
beginning of a word, and pronouns at its end. Words therefore cannot be
identified simply by cutting the text at separators (spaces,
punctuation, etc.). Each "morphological unit" must be segmented into
proclitics, a radical and enclitics.
Because of agglutination, certain Arabic words change form. A radical
may have different forms, depending on whether or not it is agglutinated
to an affix, and on which affix is agglutinated to it. This is the
problem of multigraphy.
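One common instance of multigraphy can be sketched in code. The example
below is only an illustration, not the rule set of the thesis: it
assumes a Buckwalter-style ASCII transliteration (in which `p` stands
for ta marbuta and `t` for ta), and the function name `dictionary_forms`
is hypothetical. In written Arabic, a final ta marbuta is written as ta
when an enclitic follows, so a radical found in front of an enclitic may
have a different spelling in the dictionary.

```python
def dictionary_forms(radical, enclitic):
    """Return the spellings under which a radical may be listed.

    Assumption: Buckwalter-style transliteration, 'p' = ta marbuta.
    Only one illustrative rewriting rule is implemented here.
    """
    forms = [radical]  # the radical as it appears in the text
    # Before an enclitic, ta marbuta ('p') is written as ta ('t'),
    # so a radical ending in 't' may stand for a form ending in 'p'.
    if enclitic and radical.endswith("t"):
        forms.append(radical[:-1] + "p")
    return forms


if __name__ == "__main__":
    print(dictionary_forms("mdrst", "h"))
    print(dictionary_forms("mdrsp", ""))
```

Both spellings are kept as candidates because a radical may also end in
a genuine ta; a later dictionary lookup is what would decide between
them.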
In Arabic, a text can be fully vowelized, partially vowelized or have
no vowels at all. The Morphological Analyzer must be able to handle all
of these texts and to restore the vowels of unvowelized texts.
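A minimal sketch of how one algorithm might accept all three kinds of
text: a form read from the text is compared with a fully vowelized
candidate, and the candidate is rejected only if it contradicts a vowel
that the text actually carries. The sketch assumes a Buckwalter-style
ASCII transliteration in which the characters `a i u o F N K ~` are
diacritics; the names `split_marks` and `compatible` are hypothetical
and are not taken from the thesis.

```python
# Assumption: Buckwalter-style diacritics (short vowels, sukun,
# tanwin and shadda).
DIACRITICS = set("aiuoFNK~")


def split_marks(word):
    """Split a word into (letter, diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)  # attach mark to its letter
        else:
            pairs.append((ch, ""))
    return pairs


def compatible(text_form, candidate):
    """True if a vowelized candidate agrees with the form in the text.

    The letters must be identical, and every diacritic present in the
    text must also be present on the same letter of the candidate.
    Letters left unvowelized in the text accept any vowelization.
    """
    text_pairs = split_marks(text_form)
    cand_pairs = split_marks(candidate)
    if len(text_pairs) != len(cand_pairs):
        return False
    for (t_letter, t_marks), (c_letter, c_marks) in zip(text_pairs, cand_pairs):
        if t_letter != c_letter:
            return False
        if any(mark not in c_marks for mark in t_marks):
            return False
    return True


if __name__ == "__main__":
    print(compatible("ktb", "kataba"))   # unvowelized text
    print(compatible("kutb", "kataba"))  # partially vowelized text
```

With this check, an unvowelized form such as `ktb` accepts every
vowelization of the same letters, while the partially vowelized `kutb`
accepts `kutiba` but rejects `kataba`.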
To meet these requirements, it was necessary to build dictionaries as
well as various sets of rules.
Schematically, the Morphological Analyzer proceeds in two stages:
During the first stage, the "morphological units" are identified, then
cut up into proclitics, radical and enclitics. For each word, all
possible cut-ups are searched for. This stage is called the generation
phase, or the identification of word candidates.
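The generation phase can be illustrated by the following sketch, which
enumerates every cut-up of a "morphological unit" without judging
whether it is legal. The clitic inventories are tiny, hypothetical
samples written in an assumed Buckwalter-style ASCII transliteration,
and `generate_cutups` is an illustrative name; the thesis relies on a
specially designed dictionary instead.

```python
# Hypothetical sample inventories; the empty string means "no clitic".
PROCLITICS = ["", "w", "b", "l", "Al", "wAl", "bAl"]
ENCLITICS = ["", "h", "hA", "hm"]


def generate_cutups(unit):
    """Return every (proclitic, radical, enclitic) split of a unit.

    No validity check is made here: over-generation is expected and
    is meant to be corrected by a later elimination stage.
    """
    cutups = []
    for pro in PROCLITICS:
        if not unit.startswith(pro):
            continue
        rest = unit[len(pro):]
        for enc in ENCLITICS:
            if not rest.endswith(enc):
                continue
            radical = rest[:len(rest) - len(enc)]
            if radical:  # a cut-up must keep a non-empty radical
                cutups.append((pro, radical, enc))
    return cutups


if __name__ == "__main__":
    for cutup in generate_cutups("wktAbh"):
        print(cutup)
```

For the unit `wktAbh`, this sketch yields four candidates, with and
without the proclitic `w` and with and without the enclitic `h`; only
some of them can be legal.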
During the second stage, all illegally generated cut-ups are searched
for and eliminated. This is the intruder elimination, or purification,
phase. Several steps are necessary in order to carry out this work.
First, the text is cut up into "morphological units" by means of an
automaton. Thereafter, each "morphological unit" is analyzed separately.
It is vowelized, so that vowelized and unvowelized texts can be analyzed
by the same algorithm; it is then cut up into proclitics, radical and
enclitics using a specially designed dictionary. Subsequently, rewriting
rules resolve the problem of multigraphy.
The above-mentioned steps generate a large set of cut-ups, which are
then vowelized; a great number of them are illicit and are eliminated
during the next step.
The generated radicals are looked up in a dictionary. For the cut-ups
that are not eliminated, the generated vowels are compared with those of
the text, where the text has any. Thereafter, heuristics check the
compatibility between the proclitics and enclitics of a cut-up, then
between the proclitics and the radical, and between the radical and the
enclitics. The rules of vocalic harmony then eliminate further invalid
cut-ups.
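The dictionary lookup and the three compatibility checks can be
sketched as a simple filter. Everything in the example is an assumption
made for illustration: the lexicon, the compatibility tables, the
category labels and the name `purify` are hypothetical, and the forms
are written in an assumed Buckwalter-style ASCII transliteration. The
vowel comparison and the rules of vocalic harmony are left out.

```python
# Hypothetical lexicon: radical -> grammatical category.
LEXICON = {"ktAb": "noun", "ktb": "verb"}

# Hypothetical tables: categories each clitic may attach to.
PROCLITIC_OK = {
    "": {"noun", "verb"},
    "w": {"noun", "verb"},
    "Al": {"noun"},
    "wAl": {"noun"},
}
ENCLITIC_OK = {"": {"noun", "verb"}, "h": {"noun", "verb"}}

# Illustrative proclitic/enclitic pairs that cannot co-occur
# (here: definite article together with a pronoun enclitic).
INCOMPATIBLE_PAIRS = {("Al", "h"), ("wAl", "h")}


def purify(cutups):
    """Keep only cut-ups that pass the lookup and the three checks."""
    kept = []
    for pro, radical, enc in cutups:
        category = LEXICON.get(radical)
        if category is None:
            continue  # radical not found in the dictionary
        if (pro, enc) in INCOMPATIBLE_PAIRS:
            continue  # proclitic and enclitic exclude each other
        if category not in PROCLITIC_OK.get(pro, set()):
            continue  # proclitic cannot precede this radical
        if category not in ENCLITIC_OK.get(enc, set()):
            continue  # enclitic cannot follow this radical
        kept.append((pro, radical, enc, category))
    return kept


if __name__ == "__main__":
    candidates = [
        ("", "wktAbh", ""),
        ("w", "ktAb", "h"),
        ("Al", "ktAb", "h"),
        ("Al", "ktb", ""),
    ]
    print(purify(candidates))
```

The checks are ordered from cheapest to most specific, so that most
intruders are discarded by the dictionary lookup alone.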
Finally, once linguistic information has been attributed to each
cut-up and each vowelization, the validity of the rewriting rules
applied during the first stage is checked.
The remaining cut-ups, along with their linguistic information, are then
transmitted to the syntactic analysis. If no cut-up is left, the word
entered is considered erroneous and is given default grammatical values.
It is then transmitted either to the syntactic analyzer or to a
corrector (human or machine).
Paris, September 1989.