The word "corpus", derived from the Latin word meaning "body", may be
used to refer to any text in written or spoken form.
However, in modern Linguistics the term usually refers to
large collections of texts that are presented in machine-readable form
and that represent a sample of a particular variety or use of language(s).
Other definitions, broader or stricter,
exist. See, for example, the definition in the book "Corpus
Linguistics" by Tony McEnery and Andrew Wilson or read more about
different kinds of corpora in
the Systematic Dictionary of Corpus Linguistics.
Computer-readable corpora can consist of raw text only,
i.e. plain text with no additional information. Many corpora have been provided
with some kind of linguistic information, here called mark-up or
Types of corpora
There are many different kinds of corpora. They can contain written or spoken
(transcribed) language, modern or old texts, texts from one language or several
languages. The texts can be whole books, newspapers, journals, speeches, etc., or
consist of extracts of varying length. The kind of texts included and the
combination of different texts vary between different corpora and corpus types.
'General corpora' consist of general texts, that is, texts that do not belong to a single
text type, subject field, or register. An example of a general corpus is the British National Corpus.
Some corpora contain texts that are sampled (chosen) from a particular variety
of a language, for example, from a particular dialect or from a particular
subject area. These corpora are sometimes called 'Sublanguage Corpora'.
Corpora can consist of texts in one language (or language variety) only or of
texts in more than one language. If the texts are the same in all languages,
i.e. translations, the corpus is called a Parallel Corpus. A Comparable
Corpus is a collection of "similar" text
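The idea of a Parallel Corpus can be illustrated with a small data structure. The following is only a minimal sketch in Python, not a standard corpus format: the class name AlignedSegment, the function is_parallel, and the two English/German sentence pairs are all hypothetical and chosen purely for illustration. It assumes that a parallel corpus stores, for every segment, the same sentence in each of its languages.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class AlignedSegment:
    """One aligned unit: the same sentence in each language (hypothetical type)."""
    texts: Dict[str, str]  # language code -> sentence text


def is_parallel(segments: List[AlignedSegment], languages: Set[str]) -> bool:
    """Return True if every segment has a non-empty text in every given language."""
    return all(
        languages.issubset(seg.texts)
        and all(seg.texts[lang].strip() for lang in languages)
        for seg in segments
    )


# Illustrative toy data, not taken from any real corpus.
corpus = [
    AlignedSegment({"en": "The cat sleeps.", "de": "Die Katze schläft."}),
    AlignedSegment({"en": "It is raining.", "de": "Es regnet."}),
]

print(is_parallel(corpus, {"en", "de"}))
print(is_parallel(corpus, {"en", "fr"}))
```

In this sketch a corpus in one language only would simply be a list of segments with a single language code each, whereas a parallel corpus requires that no segment lacks a translation.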
Corpora serve as the basis for a number of research tasks
within the field of Corpus Linguistics.