Corpus Linguistics

[Please note: these pages are no longer maintained and may be out of date.]






These pages have been created as part of the W3-Corpora Project at the University of Essex.
 


Using corpora

Which corpus should I choose?

The choice of corpus largely determines what results you will get and what those results can tell you. When deciding which corpus to use, there are certain points worth considering.

  • What kind of material do I want?
  • How much data do I want?
  • What is available?

What kind of material do I want?

The kind of material you want will vary with the kind of study you intend to perform. Some primary points to consider are:
  • medium (written, spoken or both)
  • text type (fiction, non-fiction, scientific writing, children's books, spoken conversation, radio broadcasts, etc.)
  • time (produced in the 20th century, in the 1990s, in the Middle English period, etc.)
In the List of Corpora you will find corpora of various kinds under certain sub-headings (spoken, historical, etc.).
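
The three points above can be treated as simple selection criteria. The following is only a minimal sketch: the catalogue entries, the `CorpusInfo` fields and the `select` function are invented for illustration and do not describe the actual List of Corpora.

```python
from dataclasses import dataclass


@dataclass
class CorpusInfo:
    name: str
    medium: str      # "written", "spoken" or "both"
    text_type: str
    period: str


# Hypothetical catalogue entries, invented for illustration only.
catalogue = [
    CorpusInfo("Corpus A", "written", "fiction", "1990s"),
    CorpusInfo("Corpus B", "spoken", "conversation", "1990s"),
    CorpusInfo("Corpus C", "both", "mixed", "20th century"),
    CorpusInfo("Corpus D", "written", "scientific writing", "Middle English"),
]


def select(catalogue, medium=None, text_type=None, period=None):
    """Return the corpora that match every criterion that was given.

    A corpus whose medium is "both" matches a request for written or spoken.
    """
    result = []
    for c in catalogue:
        if medium is not None and c.medium not in (medium, "both"):
            continue
        if text_type is not None and c.text_type != text_type:
            continue
        if period is not None and c.period != period:
            continue
        result.append(c)
    return result


# Names of all corpora that contain spoken material.
spoken = [c.name for c in select(catalogue, medium="spoken")]
```

Leaving a criterion out means "no preference", so a first pass can start broad and be narrowed once you see how many candidates remain.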

How much data do I want?

How much data you want depends on your study. If you want to make extensive claims about the language as a whole, you will want large amounts of (representative) data. Similarly, if you want to make statistical calculations, you will probably also need large amounts of data. If you are interested in finding an example or two of how a particular word or phrase can be used, you do not need much data at all, as long as you can find your example in it. There are no fixed rules for how large a corpus you have to use, or how many examples of something you have to find, for one kind of study or another. Generally speaking, it is important to have 'enough' data, and how much is 'enough' has to be decided for each study individually.

Big or small? Which do I choose?

The bigger the corpus, the more data. However, it is important to remember that not even a very big corpus can include all varieties of a language. On the other hand, a small corpus only contains a small sample of the language as a whole. But maybe it is the kind of sample you need?

A point that is easy to forget is that a big corpus can give you too much data. If you want to study modal verbs and use the BNC, you might be overwhelmed to find that there are about 250,000 occurrences of the modal 'will' alone. If you want to study a phenomenon in detail, it might be better to use a small corpus, or a subcorpus created from a large corpus. A small corpus can be more convenient to use, but it is important to keep in mind that it might be a restricted sample, a sample from only a subset of the language, or a small, not necessarily representative sample of the language as a whole.
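
To make the idea of cutting the data down to size concrete, here is a minimal sketch in Python. It assumes a toy corpus held in memory as a list of texts with metadata; the texts, the metadata fields and the helper names (`count_word`, `subcorpus`, `sample_hits`) are all invented for illustration and are not part of any real corpus tool. It shows two ways of reducing the material: restricting the search to a subcorpus, and drawing a random sample of the hits.

```python
import random
import re
from collections import Counter

# Hypothetical toy corpus: each text has some metadata and its raw content.
corpus = [
    {"medium": "written", "text_type": "fiction",
     "text": "She will go. He will not stay."},
    {"medium": "spoken", "text_type": "conversation",
     "text": "I will call you, will I?"},
    {"medium": "written", "text_type": "news",
     "text": "The council may decide that it will act."},
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_word(texts, word):
    """Count the occurrences of `word` across a list of corpus texts."""
    return sum(Counter(tokenize(t["text"]))[word] for t in texts)


def subcorpus(texts, **criteria):
    """Keep only the texts whose metadata match every given criterion."""
    return [t for t in texts
            if all(t.get(key) == value for key, value in criteria.items())]


def sample_hits(texts, word, n, seed=0):
    """Return at most `n` randomly chosen hits as (text index, token index)."""
    hits = [(i, j)
            for i, t in enumerate(texts)
            for j, token in enumerate(tokenize(t["text"]))
            if token == word]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(hits, min(n, len(hits)))


total = count_word(corpus, "will")                               # whole corpus
written_only = count_word(subcorpus(corpus, medium="written"), "will")
manageable = sample_hits(corpus, "will", n=2)                    # 2 random hits
```

Note that a random sample of hits keeps the spread of the full corpus, whereas a subcorpus deliberately narrows it; which of the two is appropriate depends on the claims the study is meant to support.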

What is available?

A very important question to consider when setting out to make a corpus-based study is 'what is available?'. There are a number of corpora, but not all of them are
  • publicly available
  • readily available
Publicly available corpora are those which anyone can use for free. Most corpora are not publicly available. Some are available to anyone who buys a copy or a licence to use them, at a cost that may vary from a few pounds (to cover administrative costs) to several hundred pounds. Some corpora are not available to anyone but their owners, and are therefore impossible to obtain.

By readily available we mean corpora which are ready to be used at once. What is readily available varies between institutions. Some have corpora installed on their network, or stored on CD-ROMs. These are then available to anyone who has access to that network or CD-ROM and knows how to use the corpus. Other institutions do not have access to any corpora, or not to the corpora that are needed for the particular task or study. When this is the case, the options are to try to get access to the corpus, or to use some other data or method.

Getting a corpus usually means acquiring it (buying, downloading or compiling it), installing it, and finding the right tools to use with it. This can be a time-consuming, complicated and costly procedure. Some corpora can be accessed online, freely or at a cost. You will find a list of such corpora here.

Tools

There are a number of different programs and search engines available for use with corpora, and some are presented on the 'tools' page (to be added).




W3-Corpora project. 1998 Contact us.