Algorithms for Language Reconstruction

Genetically related languages originate from a common proto-language. In the absence of historical records, proto-languages have to be reconstructed from surviving cognates, that is, words that existed in the proto-language and are still present in some form in its descendants. Language reconstruction methods have so far been largely based on informal and intuitive criteria. In this thesis, I present techniques and algorithms for performing various stages of the reconstruction process automatically.

The thesis is divided into three main parts that correspond to the principal steps of language reconstruction. The first part presents a new algorithm for the alignment of cognates, which is sufficiently general to align any two phonetic strings that exhibit some affinity. The second part introduces a method of identifying cognates directly from the vocabularies of related languages on the basis of phonetic and semantic similarity. The third part describes an approach to the determination of recurrent sound correspondences in bilingual wordlists by inducing models similar to those developed for statistical machine translation.
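
To make the notion of aligning two phonetic strings concrete, the following is a minimal sketch of pairwise alignment by dynamic programming. It is not the algorithm presented in the thesis: the function name `align`, the uniform substitution cost, and the gap cost of 1.0 are assumptions chosen for illustration only. A linguistically motivated aligner would replace `sub_cost` with a measure of phonetic similarity between segments.

```python
def align(a, b, sub_cost=None, gap=1.0):
    """Globally align two segment sequences; return (cost, aligned pairs).

    Illustrative sketch only: uniform costs stand in for a real
    phonetic similarity function (an assumption, not the thesis's method).
    """
    if sub_cost is None:
        # Hypothetical default: identical segments cost 0, all others cost 1.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0

    n, m = len(a), len(b)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Borders: aligning a prefix against the empty string costs only gaps.
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + gap
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + gap

    # Fill the table with the cheapest of substitution, deletion, insertion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                dist[i - 1][j] + gap,
                dist[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1])):
            pairs.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or dist[i][j] == dist[i - 1][j] + gap):
            pairs.append((a[i - 1], "-"))  # segment of a aligned with a gap
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))  # segment of b aligned with a gap
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    cost, pairs = align("abc", "ac")
    print(cost, pairs)
```

Both arguments may be strings or lists of segments, so multi-character phonetic symbols can be passed as list elements. Passing a different `sub_cost` changes which alignment is considered optimal without altering the rest of the procedure.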

The proposed solutions are firmly grounded in computer science and incorporate recent advances in computational linguistics, articulatory phonetics, and bioinformatics. The applications of the new techniques are not limited to diachronic phonology, but extend to other areas of computational linguistics, such as machine translation.