Corpus Linguistics

[Please note: these pages are no longer maintained and may be out of date.]

These pages have been created as part of the W3-Corpora Project at the University of Essex.

What is a Corpus?

The word "corpus", derived from the Latin word meaning "body", may be used to refer to any text in written or spoken form. In modern linguistics, however, the term refers to a large collection of texts, presented in machine-readable form, that represents a sample of a particular variety or use of language(s). Other definitions, broader or stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson, or read more about different kinds of corpora in the Systematic Dictionary of Corpus Linguistics.

Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora, however, have been enriched with some kind of linguistic information, here called mark-up or annotation.
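The difference between raw and annotated text can be illustrated with a minimal Python sketch. The sentence, the word_TAG notation, and the tag names (DET, NOUN, VERB, PREP) are assumptions made up for this example; real corpora use their own mark-up schemes and tagsets.

```python
# Raw corpus text: plain text with no additional information.
raw_text = "The cat sat on the mat"

# The same sentence with hypothetical part-of-speech annotation,
# written in a simple word_TAG format (tag names are illustrative only).
annotated_text = "The_DET cat_NOUN sat_VERB on_PREP the_DET mat_NOUN"


def parse_annotated(text):
    """Split whitespace-separated word_TAG tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def strip_annotation(text):
    """Recover the raw text by discarding the tags."""
    return " ".join(word for word, _tag in parse_annotated(text))


tokens = parse_annotated(annotated_text)

# Annotation makes linguistic queries possible, e.g. "find all nouns".
nouns = [word for word, tag in tokens if tag == "NOUN"]
```

With annotation of this kind, a query such as "all nouns" becomes a simple filter over the tags, whereas the raw text alone offers only the word forms.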

Types of corpora

There are many different kinds of corpora. They can contain written or spoken (transcribed) language, modern or old texts, and texts from one language or several languages. The texts can be whole books, newspapers, journals, speeches, etc., or consist of extracts of varying length. The kinds of texts included, and the way they are combined, vary between corpora and corpus types.

'General corpora' consist of general texts, that is, texts that do not belong to a single text type, subject field, or register. An example of a general corpus is the British National Corpus. Some corpora contain texts that are sampled (chosen) from a particular variety of a language, for example from a particular dialect or a particular subject area. These corpora are sometimes called 'Sublanguage Corpora'.

Corpora can consist of texts in one language (or language variety) only or of texts in more than one language. If the texts are the same in all languages, i.e. translations of one another, the corpus is called a Parallel Corpus. A Comparable Corpus is a collection of "similar" texts.
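As a rough illustration of a parallel corpus, the following Python sketch stores the same short text in two languages and pairs up corresponding sentences. The data layout, the example sentences, and the function name aligned_pairs are assumptions invented for this example, not the format of any particular corpus.

```python
# A toy parallel corpus: the same text in two languages, segment by segment.
# Language codes and sentences are illustrative only.
parallel_corpus = {
    "en": ["The cat is sleeping.", "I like books."],
    "fr": ["Le chat dort.", "J'aime les livres."],
}


def aligned_pairs(corpus, source, target):
    """Pair each source-language segment with its translation."""
    src, tgt = corpus[source], corpus[target]
    if len(src) != len(tgt):
        # Segments can only be paired one-to-one if the counts match.
        raise ValueError("parallel texts must have the same number of segments")
    return list(zip(src, tgt))


pairs = aligned_pairs(parallel_corpus, "en", "fr")
for english, french in pairs:
    print(english, "<->", french)
```

A comparable corpus would not support this kind of one-to-one pairing, since its texts are only similar rather than translations of each other.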

For a list of various corpora, click HERE

Corpora serve as the basis for a number of research tasks within the field of Corpus Linguistics.



W3-Corpora project.1998 Contact us.