Using corpora
Which corpus should I choose?
The choice of corpus is very important for the kind of results you will get and
what the results can tell you. When deciding which corpus to use, there are
certain points that are good to consider.
- What kind of material do I want?
- How much data do I want?
- What is available?
What kind of material do I want?
What kind of material you want will vary with the kind of study you intend to
perform. Some primary points to consider can be:
- medium (written, spoken or both)
- text type (fiction, non-fiction, scientific writing, children's books,
spoken conversational, radio broadcasts etc.)
- time (produced in the 20th century, in the 1990s, Middle English, etc.)
In the List of Corpora you will find
corpora of various kinds under certain sub-headings (spoken, historical, etc.).
How much data do I want?
How much data you want depends on your study. If you want to make extensive
claims about the language as a whole, you will want large amounts of
(representative) data. Similarly, if you want to make statistical calculations,
you will probably also need large amounts of data. If you are interested in
finding an example or two of how a particular word/phrase can be used, you do
not need much data at all, as long as you can find your example in it.
There are no fixed rules for how large a corpus you have to use, or how many
examples of something you have to find, for one kind of study or another.
Generally speaking, it is important to have 'enough' data, and how much is
'enough' has to be decided separately for each study.
Big or small? Which do I choose?
The bigger the corpus, the more data. However, it is important to remember that
not even a very big corpus can include all varieties of a language. On the other
hand, a small corpus only contains a small sample of the language as a whole.
But maybe it is the kind of sample you need?
A point that can be easy to forget is that when using a big corpus you can
get too much data. If you want to study modal verbs and use the BNC, you might
be overwhelmed to find that there are about 250,000 occurrences of the modal
'will' alone. If you want to study a phenomenon in detail, it might be better to
use a small corpus, or a subcorpus created from a large corpus. A small corpus
can be more convenient to use, but then it is important to keep in mind that it
might be a restricted sample, a sample from only a subset of the language, or a
small, not necessarily representative sample of the language as a whole.
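As a rough illustration of the point above, the following Python sketch shows two ways of cutting a large result set down to size: restricting the search to a subcorpus, and drawing a random sample of the hits. The texts, the metadata field and the function names (find_hits, thin, subcorpus) are all invented for this example and do not belong to the BNC or to any particular corpus tool. The search matches word forms only, so it cannot tell the modal 'will' from the noun 'will'.

```python
import random
import re

# Toy corpus: three invented texts with invented metadata (assumption).
texts = {
    "a01": "I will go tomorrow. She will not come.",
    "b02": "They will see. Will you help?",
    "c03": "He left a will behind.",
}
metadata = {
    "a01": {"medium": "written"},
    "b02": {"medium": "spoken"},
    "c03": {"medium": "written"},
}

WORD = re.compile(r"[A-Za-z']+")


def find_hits(corpus, word):
    """Return (text_id, token_index) for every occurrence of `word`.

    Matching is on the lower-cased word form only; no part-of-speech
    information is used.
    """
    hits = []
    for text_id, text in corpus.items():
        tokens = [t.lower() for t in WORD.findall(text)]
        for i, tok in enumerate(tokens):
            if tok == word:
                hits.append((text_id, i))
    return hits


def subcorpus(corpus, meta, **criteria):
    """Keep only the texts whose metadata match all the given criteria."""
    return {
        text_id: text
        for text_id, text in corpus.items()
        if all(meta[text_id].get(k) == v for k, v in criteria.items())
    }


def thin(hits, n, seed=0):
    """Return a reproducible random sample of at most `n` hits."""
    if len(hits) <= n:
        return list(hits)
    rng = random.Random(seed)
    return sorted(rng.sample(hits, n))


all_hits = find_hits(texts, "will")
written = subcorpus(texts, metadata, medium="written")
written_hits = find_hits(written, "will")
sample = thin(all_hits, 2)

print(len(all_hits), "hits in the whole corpus")
print(len(written_hits), "hits in the written subcorpus")
print("random sample of hits:", sample)
```

Fixing the seed means that the same sample is drawn each time, which makes it possible to go back to the same set of examples later. A random sample keeps the spread of the full corpus, whereas a subcorpus deliberately narrows the material to one kind of text.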
What is available?
A very important question to consider when setting out to make a corpus-based
study is 'what is available?'. There are a number of corpora, but not all of them
are
- publicly available
- readily available
Publicly available corpora are those which anyone can use for free.
Most corpora are not publicly available. Some are available to anyone who buys
a copy or a licence to use them, at a cost that may vary from a few pounds
(to cover administrative costs) to several hundred pounds. Some corpora are not
available to anyone but their owners, and are therefore impossible to obtain.
By readily available we here mean corpora which are ready to be used
at once. What is readily available varies between different institutions. Some
have corpora installed on their network, or stored on CD-ROMs. These are then
available to anyone who has access to that network/CD-ROM and knows how to use
the corpus. Other institutions do not have access to any corpora, or not to the
corpora that are needed for the particular task/study. When this is the case, the
options are to try to get access to the corpus, or to use some other data or
method.
Getting a corpus usually means acquiring it (buying, downloading,
compiling), installing it, and finding the right tools to use with it. This can
be a time-consuming, complicated and costly procedure. Some corpora can be
accessed online, freely or at a cost. You will find a list of such corpora
here.
Tools
There are a number of different programs and search engines available for use
with corpora, and some are presented on the 'tools' page (to be added).