Europarl Corpus

The Europarl corpus is a parallel corpus of several European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. It was extracted from the proceedings of the European Parliament, with a view to providing sentence-aligned text for statistical machine translation systems. Alignment was performed automatically. See the website and paper listed below for more information.
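
To illustrate what sentence-aligned text means in practice, here is a minimal Python sketch. It assumes a layout this README does not specify: two plain-text sources, one per language, with one sentence per line, where line N on one side translates line N on the other. The function name read_aligned_pairs and the sample sentences are hypothetical and are not part of the corpus distribution.

```python
from itertools import zip_longest


def read_aligned_pairs(src_lines, tgt_lines):
    """Yield (source, target) sentence pairs from two line-aligned iterables.

    Assumption: line N of src_lines is the translation of line N of tgt_lines.
    """
    for i, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        # zip_longest pads the shorter side with None, which signals misalignment.
        if src is None or tgt is None:
            raise ValueError(f"inputs differ in length at line {i}")
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty; they carry no training signal.
        if src and tgt:
            yield src, tgt


# Hypothetical sample data standing in for two open file handles.
english = ["Resumption of the session\n", "\n", "Thank you.\n"]
french = ["Reprise de la session\n", "\n", "Merci.\n"]
pairs = list(read_aligned_pairs(english, french))
```

With real data, the two lists would be replaced by open file objects. Because the alignment was produced automatically, individual pairs may be imperfect, so downstream filtering may still be worthwhile.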

If you use this corpus in work that leads to a publication, it is considered polite to cite the following paper, which can be downloaded at the following URL:

Philipp Koehn (2005) 'Europarl: A Parallel Corpus for Statistical Machine Translation', MT Summit 2005.


