Algorithms for Language Reconstruction
Genetically related languages originate from a common
proto-language. In the absence of historical records,
proto-languages have to be reconstructed from surviving
cognates, that is, words that existed in the proto-language
and are still present in some form in its descendants.
Language reconstruction methods have so far been largely
based on informal and intuitive criteria. In this thesis, I
present techniques and algorithms for performing various
stages of the reconstruction process automatically.
The thesis is divided into three main parts that correspond
to the principal steps of language reconstruction. The
first part presents a new algorithm for the alignment of
cognates, which is sufficiently general to align any two
phonetic strings that exhibit some affinity. The second
part introduces a method of identifying cognates directly
from the vocabularies of related languages on the basis of
phonetic and semantic similarity. The third part describes
an approach to determining recurrent sound
correspondences in bilingual wordlists by inducing models
similar to those developed for statistical machine
translation.
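
To give a concrete sense of what aligning two phonetic strings involves,
the following is a minimal sketch of a generic dynamic-programming
alignment with unit costs. It is not the algorithm developed in the
thesis: the function name `align`, the gap symbol, and the cost values
are all assumptions chosen for illustration.

```python
def align(a, b, gap="-", sub_cost=1, gap_cost=1):
    """Globally align two segment strings by dynamic programming.

    Illustrative only: unit costs stand in for a real phonetic
    similarity measure (an assumption, not the thesis's scoring).
    Returns (total cost, aligned a, aligned b).
    """
    n, m = len(a), len(b)
    # dist[i][j] = minimal cost of aligning a[:i] with b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dist[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if a[i - 1] == b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j - 1] + step,   # match or substitution
                dist[i - 1][j] + gap_cost,   # segment of a left unmatched
                dist[i][j - 1] + gap_cost,   # segment of b left unmatched
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = 0 if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else sub_cost
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap_cost:
            top.append(a[i - 1])
            bottom.append(gap)
            i -= 1
        else:
            top.append(gap)
            bottom.append(b[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

For example, `align("pater", "father")` pairs the two words segment by
segment and reports a total cost of 2 (one substitution and one gap). A
phonetically informed aligner would presumably replace the unit
substitution cost with a graded measure of similarity between sounds.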
The proposed solutions are firmly grounded in computer
science and incorporate recent advances in computational
linguistics, articulatory phonetics, and bioinformatics.
The applications of the new techniques are not limited to
diachronic phonology, but extend to other areas of
computational linguistics, such as machine translation.