A Fast Method for Parallel Document Identification
We present a fast method
to identify homogeneous parallel documents.
The method counts the identical low-frequency words
shared by candidate parallel documents.
The candidate with the most shared low-frequency words is
selected as the parallel document.
The method achieved 99.96\% accuracy when tested on the EUROPARL
corpus of parliamentary proceedings,
failing only in anomalous cases of truncated or otherwise distorted
documents.
While other work has shown similar performance on this type of
dataset,
our approach is faster and does not require training.
Apart from proposing an efficient method
for parallel document identification in a restricted domain,
this paper furnishes evidence that parliamentary
proceedings may be inappropriate for testing parallel document
identification systems in general.
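
The selection rule described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the regex tokenizer, the function names `low_freq_words` and `pick_parallel`, and the definition of a low-frequency word as one occurring at most `max_count` times within a single document are all assumptions made for illustration.

```python
import re
from collections import Counter


def low_freq_words(text, max_count=1):
    """Return the set of tokens that occur at most max_count times in text.

    Assumption: "low-frequency" is measured within one document, and
    tokens are lowercased runs of word characters.
    """
    counts = Counter(re.findall(r"\w+", text.lower()))
    return {word for word, n in counts.items() if n <= max_count}


def pick_parallel(source, candidates, max_count=1):
    """Return the index of the candidate sharing the most identical
    low-frequency words with the source document.

    Ties are resolved in favor of the earliest candidate.
    """
    src_words = low_freq_words(source, max_count)
    scores = [len(src_words & low_freq_words(c, max_count)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    source = "Session 1999 opened by Mr Prodi in Strasbourg on agenda item 42"
    candidates = [
        "La sesión fue abierta el lunes",
        "Sitzung 1999 eröffnet von Herrn Prodi in Straßburg Punkt 42",
        "Herr Präsident, die Aussprache ist geschlossen",
    ]
    print(pick_parallel(source, candidates))
```

In this toy input the tokens that survive translation unchanged (numbers and proper names) are what the rule relies on. No model is fitted at any point, which is consistent with the claim that the approach requires no training.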