Corpus Linguistics

[Please note: these pages are no longer maintained and may be out of date.]

These pages have been created as part of the
W3-Corpora Project
at the
University of Essex.


The use of collections of text in language study is not a new idea. In the Middle Ages, work began on making lists of all the words in particular texts, together with their contexts - what we today call concordancing. Other scholars counted word frequencies from single texts or from collections of texts and produced lists of the most frequent words. Areas where corpora were used include language acquisition, syntax, semantics, and comparative linguistics, among others. Even if the term 'corpus linguistics' was not used, much of the work was similar to the kind of corpus-based research we do today, with one great exception - they did not use computers.
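The two techniques just described - concordancing and frequency listing - are easy to sketch in modern terms. The following Python fragment is a minimal illustration (the sample text is invented, not taken from any real corpus): it produces keyword-in-context (KWIC) lines and a list of the most frequent words.

```python
import re
from collections import Counter

def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines: each occurrence of the
    keyword shown with `width` characters of context on either side."""
    lines = []
    for m in re.finditer(r'\b%s\b' % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append('%s[%s]%s' % (left, m.group(0), right))
    return lines

def frequency_list(text, n=10):
    """Count word frequencies and return the n most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# Invented sample text for illustration only.
sample = ("The use of corpora is not new. A corpus is a collection of texts, "
          "and corpora have long been used to study language.")

for line in concordance(sample, "corpora"):
    print(line)
print(frequency_list(sample, 5))
```

The medieval compilers did the same work by hand; the computer merely makes both operations instantaneous and repeatable over millions of words.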

You can learn more about early corpus linguistics HERE (external link). We will move on to look at some important stages in the development of corpus linguistics by focusing on some central corpora. The presentation below is not an exhaustive account of every corpus or every stage, but is merely meant to help you get familiar with some key corpora and concepts.

The first generation

Today, corpus linguistics is closely connected to the use of computers; so closely, actually, that the term 'Corpus Linguistics' for many scholars today means 'the use of collections of COMPUTER-READABLE text for language study'.

The Brown Corpus - worthy of imitation

The first modern, electronically readable corpus was The Brown Corpus of Standard American English. The corpus consists of one million words of American English texts printed in 1961. To make the corpus a good standard reference, the texts were sampled in different proportions from 15 different text categories: Press (reportage, editorial, reviews), Skills and Hobbies, Religious, Learned/scientific, Fiction (various subcategories), etc.

Today, this corpus is considered small and slightly dated. The corpus is, however, still used. Much of its usefulness lies in the fact that the Brown corpus layout has been copied by other corpus compilers. The LOB (Lancaster-Oslo/Bergen) corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown corpus. They both consist of 1 million words of written language (500 texts of 2,000 words each), sampled in the same 15 categories as the Brown Corpus.

The availability of corpora that are so similar in structure is a valuable resource for researchers interested in, for example, comparing different language varieties.

For a long time, the Brown and LOB corpora were the only easily available computer readable corpora. Much research within the field of corpus linguistics has therefore been based on these corpora.

The London-Lund Corpus of Spoken British English

Another important "small" corpus is the London-Lund Corpus of Spoken British English (LLC). The corpus was the first computer-readable corpus of spoken language, and it consists of 100 spoken texts of approximately 5,000 words each. The texts are classified into different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. The texts are orthographically transcribed and have been provided with detailed prosodic marking.

Big is beautiful?

BoE and BNC

The first-generation corpora, of 500,000 and 1 million words, proved to be very useful in many ways and have been used for a number of research tasks (links to be added here). It soon turned out, however, that for certain tasks larger collections of text were needed. Dictionary makers, for example, wanted large, up-to-date collections of text where it would be possible to find not only rare words but also new words entering the language.

In 1980, COBUILD started to collect a corpus of texts on computer for dictionary making and language study (learn more here). The compilers of the Collins Cobuild English Language Dictionary (1987) had daily access to a corpus of approximately 20 million words. New texts were added to the corpus, and in 1991 it was launched as the Bank of English (BoE). More and more data has been added to the BoE, and the latest release (1996) contains some 320 million words! New material is constantly added to the corpus to make it "reflect the mainstream of current English today". A corpus of this kind, which through the new additions 'monitors' changes in the language, is called a monitor corpus. Some people prefer not to use the term corpus for text collections that are not finite but constantly changing/growing.

In 1995 another large corpus was released: the British National Corpus (BNC). This corpus consists of some 100 million words. Like the BoE, it contains both written and spoken material, but unlike the BoE, the BNC is finite - no more texts were added after its completion. The BNC texts were selected according to carefully pre-defined selection criteria, with targets set for the amount of text to be included from different text types (learn more HERE). The texts have been encoded with mark-up providing information about the texts, their authors, and speakers.

Specialized corpora

Historical corpora

The use of collections of text in the study of language is, as we have seen, not a new invention. Among those involved in historical linguistics were some who soon saw the potential usefulness of computerised historical corpora. A diachronic corpus with English texts from different periods was compiled at the University of Helsinki. The Helsinki Corpus of English Texts contains texts from the Old, Middle and Early Modern English periods, 1.5 million words in total.

Another historical corpus is the recently released Lampeter Corpus of Early Modern English Tracts. This collection consists of "[P]amphlets and tracts published in the century between 1640 and 1740" from six different domains. The Lampeter Corpus can be seen as one example of a corpus covering a more specialized area.

Corpora for Special Purposes

The corpora described above are general collections of text, collected to be used for research in various fields. There is a large, and growing, number of highly specialized corpora that are created for a special purpose. Many of these are used for work on spoken language systems. Examples include the Air Traffic Control Corpus, ATC0, created to be used "in the area of robust speech recognition in domains similar to air traffic control", and the TRAINS Spoken Dialogue Corpus, collected as part of a project set up to create "a conversationally proficient planning assistant" (railroad freight system).

A number of highly specialized corpora are held at the Centre for Spoken Language Understanding, CSLU, in Oregon. These corpora are specialized in a different way from the ones mentioned above. They are not restricted to a particular subject field, but are called specialized because of their content. Many of the corpora/databases consist of recordings of people asked to perform a particular task over the telephone, such as saying and spelling their name or repeating certain words/phrases/numbers/letters (read more HERE).

International/multilingual Corpora

As we have seen above, there is a great variety of corpora in English. So far much corpus work has indeed concerned the English language, for various reasons. There are, however, a growing number of corpora available in other languages as well. Some of them are monolingual corpora - collections of text from one language. Here the Oslo Corpus of Bosnian text and the Contemporary Portuguese Corpus can be mentioned as two examples.

A number of multilingual corpora also exist. Many of these are parallel corpora: corpora containing the same text in several languages. These corpora are often used in the field of Machine Translation. The English-Norwegian Parallel Corpus is one example, the English Turkish Aligned Parallel Corpora another.
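At its simplest, a parallel corpus can be modelled as aligned sentence pairs. The sketch below illustrates the idea in Python; the two-sentence "corpus" and its Norwegian translations are invented for illustration and are not taken from the English-Norwegian Parallel Corpus or any other real resource.

```python
# Two tiny, invented sentence lists standing in for a parallel corpus.
english = ["The cat sleeps.", "The dog barks."]
norwegian = ["Katten sover.", "Hunden bjeffer."]

# Sentence alignment: pair sentence i in one language with sentence i
# in the other.  Real parallel corpora store such alignments explicitly.
aligned = list(zip(english, norwegian))

def translations_of(sentence, pairs):
    """Look up the aligned counterpart(s) of an English sentence."""
    return [target for source, target in pairs if source == sentence]

print(translations_of("The dog barks.", aligned))  # → ['Hunden bjeffer.']
```

It is exactly this kind of alignment, scaled up to millions of sentence pairs, that makes parallel corpora useful as training and reference material in Machine Translation.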

The Linguistic Data Consortium (LDC) holds a collection of telephone conversations in various languages: CALLFRIEND and CALLHOME.


The increased availability and use of the Internet have made great amounts of text readily available in electronic format. Apart from all the web-pages containing information of different kinds, it is also possible to find whole collections of text. These include on-line newspapers and journals (example), sites where whole books can be found on-line (example), and dictionaries and word-lists of various kinds.

Although these collections may not be considered corpora for one reason or another (see definition of corpus), they can be analysed with corpus linguistic tools and methods. This is an area which has not yet been explored in detail, although some attempts have been made at using the Internet as one big corpus.

Further information about collections of text available on the Internet can be found on the Related Sites page.

Ongoing projects

ICE: the International Corpus of English

In twenty centres around the world, compilers are busy collecting material for the ICE corpora. Each ICE corpus will consist of 1 million words (written and spoken) of a national variety of English. The first ICE corpus to be completed is the British component, ICE-GB. On its own, each ICE corpus will be a small but valuable resource for learning about a particular variety of English. As a whole, the 20 corpora will be useful for variational studies of various kinds. You can learn more about the ICE project at the ICE-GB site.

ICLE: the International Corpus of Learner English

Like ICE (see above) ICLE is an international project involving several countries. Unlike ICE, however, the ICLE corpora do not consist of native speaker language. Instead they are corpora of English language produced by learners in the different countries. This will constitute a valuable resource for research on second language acquisition.
You can read about some of the areas where the ICLE corpora are used HERE (external link) or in the book Learner English on Computer.


The number and diversity of corpus-related research projects and groups are great. Below is a small sample to give you an understanding of the scope and variety. You can find more information by following the links on the Related Sites page.
  • AMALGAM Automatic Mapping Among Lexico-Grammatical Annotation
    "an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora" (home page)
  • The Canterbury Tales Project
    "aims to make available ... full transcripts of the ... Canterbury Tales" (home page).
  • CSLU: The Center for Spoken Language Understanding
    "a multidisciplinary center for research in the area of spoken language understanding" (home page).
  • ETAP : Creating and annotating a parallel corpus for the recognition of translation equivalents
    This project, run at the University of Uppsala, Sweden, aims to develop a computerized multilingual corpus based on Swedish source text with translations into Dutch, English, Finnish, French, German, Italian and Spanish. (home page)
  • TELRI
    TELRI is an initiative, funded by the European Commission, meant to facilitate work in the field of Natural Language Processing (NLP) by, among other things, supplying various language resources. Read more on the home page.

What next?

Interest in computerised corpora and corpus linguistics is growing. More and more universities offer courses in corpus linguistics and/or use corpora in their teaching and research. The number and diversity of corpora being compiled are great, and corpora are used in many projects. It is not possible to go into detail and present all the corpora, courses, and projects here; this has been meant as a brief introduction. More information can be found by browsing the net and reading journals and books. The electronic mailing list Corpora can be a good starting point for anyone who wishes to learn about what currently goes on within the field of corpus linguistics.

Working with corpora


W3-Corpora project, 1998. Contact us.