Europarl Corpus

The Europarl corpus is a parallel corpus of several European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. It was extracted from the proceedings of the European Parliament, with a view to providing sentence-aligned text for statistical machine translation systems. Alignment was performed automatically. See the website and paper listed below for more information.

If your use of this corpus leads to a publication, it is considered polite to cite the following paper, which can be downloaded from the URL given below it:

Philipp Koehn (2005) 'Europarl: A Parallel Corpus for Statistical Machine Translation', MT Summit 2005.

http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl-mtsummit05.pdf

Availability

The corpus is in the europarl directory in the `usual place': /ufs/corpora/ (Unix) or \\corpora\corpora\ (Windows); see the Introduction.
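
Because the corpus is sentence aligned, a typical first step is to read two language versions side by side. The following is a minimal sketch in Python, not a description of the local layout. It assumes the local copy provides a pair of plain-text files with one sentence per line, where line N of one file corresponds to line N of the other. The function name read_aligned_pairs and the default encoding are assumptions made for illustration; check the README in the europarl directory for the actual file names, layout and encoding.

```python
from itertools import zip_longest

# Sentinel used to detect that one file ended before the other.
_MISSING = object()


def read_aligned_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes one sentence per line and that line N of src_path is the
    translation of line N of tgt_path. The encoding is an assumption;
    pass a different value if the local files are not UTF-8.
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        pairs = zip_longest(src, tgt, fillvalue=_MISSING)
        for line_no, (src_line, tgt_line) in enumerate(pairs, start=1):
            if src_line is _MISSING or tgt_line is _MISSING:
                # Files of different length cannot be line aligned.
                raise ValueError(
                    f"files differ in length at line {line_no}"
                )
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")
```

The function is a generator, so a large file pair can be processed one sentence pair at a time without loading everything into memory. Raising an error on a length mismatch is a deliberate choice here: silently truncating to the shorter file could hide a misaligned pair of files.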

Documentation

See http://www.statmt.org/europarl/; there is also a README file in the europarl directory.

Contact: doug@essex.ac.uk (Doug Arnold)