The Corpus of Contemporary American English (COCA)


See the coca directory in the `usual place': /ufs/corpora/ (unix), \\corpora\corpora\ (Windows) (see the Introduction).

Documentation; there is also a README file in the coca directory.


The corpus is represented in three forms:

 1. database format -- in the db/ folder
 2. text format -- in the text/ folder
 3. wlp format -- in the wlp/ folder (=word, lemma, part-of-speech tag)

1. Files in the database format consist of purely numeric data, in lines that look like this:

	4000161 276530393       9100161

Unless you know what you are doing, these files will not be much use to you.

2. Files in text format consist of long lines of tokenized text with minimal markup. `Tokenized' means, for example, that punctuation is separated from the words, so that a sentence ending with the word "soils" appears as "soils ." rather than "soils.". Minimal markup means there is not much more than "<p>" as a paragraph marker.
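
As a rough illustration of what this means in practice, the following Python sketch splits one line of text-format data into paragraphs at the "<p>" marker and then into tokens at whitespace. The function name and the sample string are invented for illustration, and the sketch assumes that tokens are separated by whitespace and that "<p>" is the only markup that matters, which may not hold for every file.

```python
def paragraphs_to_tokens(line):
    """Split one line of text-format data into paragraphs of tokens.

    Assumes "<p>" marks paragraph boundaries and that tokens are
    whitespace-separated; any other markup is kept as ordinary tokens.
    """
    paragraphs = []
    for chunk in line.split("<p>"):
        tokens = chunk.split()
        if tokens:  # skip empty chunks, e.g. before a leading "<p>"
            paragraphs.append(tokens)
    return paragraphs


# Invented sample, not taken from the corpus.
sample = "<p> Plants grow in soils . <p> Some soils are sandy ."
for tokens in paragraphs_to_tokens(sample):
    print(tokens)
```

Because the punctuation is already separated out, a plain whitespace split is enough here; no further tokenization step should be needed.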

3. Files in wlp format (=word, lemma, part-of-speech tag) consist of lines with three tab-separated fields, like this:

		animals animal  nn2

indicating that the word-form `animals' is a form of the lemma `animal' with the part-of-speech tag `nn2'.

Reading the first column of a wlp file gives you essentially the same text as the corresponding text file.
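
A minimal Python sketch of reading this format is given below. It assumes that every useful line has exactly three tab-separated fields and simply skips anything else; the function name `read_wlp` and the sample data are invented for illustration (only the `animals' line comes from the description above, and the tag on the second sample line is a guess).

```python
import io


def read_wlp(lines):
    """Yield (word, lemma, pos) triples from an iterable of wlp lines.

    Lines without exactly three tab-separated fields are skipped;
    whether such lines occur in the real files is an open assumption.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            word, lemma, pos = (field.strip() for field in fields)
            yield word, lemma, pos


# Invented sample standing in for an open wlp file.
sample = io.StringIO("animals\tanimal\tnn2\nlive\tlive\tvv0\n")
triples = list(read_wlp(sample))

# Joining the first column recovers the running text.
words = " ".join(word for word, _, _ in triples)
print(words)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, e.g. the result of `open(path, encoding=...)` with whatever encoding the files turn out to use.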

The files in each format are divided into subdirectories according to genre, e.g. for the wlp directory we have:

 - wlp_academic_rpe -- academic texts
 - wlp_fiction_awq  -- fiction texts
 - wlp_magazine_qim -- magazine texts
 - wlp_newspaper_lsp -- newspaper texts
 - wlp_spoken_kde -- spoken texts

Within each of these, material is divided by year. So, for example, we have:
 -  wlp/wlp_academic_rpe/wlp_acad_1990.txt  -- texts from 1990 in the academic genre in
    wlp format
 -  text/text_academic_rpe/w_acad_1990.txt  -- the same texts in full text format
 -  db/db_academic_rpe/db_acad_1990.txt -- the same in database format
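
To work through one genre year by year, something like the following Python sketch could be used. It assumes that every file name in a genre directory ends in an underscore, a four-digit year and ".txt", which is confirmed only for the 1990 examples above; the function name and the throwaway demonstration directory (including the 1991 file) are invented for illustration.

```python
import re
import tempfile
from pathlib import Path


def files_by_year(genre_dir):
    """Map year -> path for files whose names end in _<4-digit year>.txt."""
    result = {}
    for path in sorted(Path(genre_dir).glob("*.txt")):
        match = re.search(r"_(\d{4})\.txt$", path.name)
        if match:
            result[int(match.group(1))] = path
    return result


# Demonstration on a temporary directory that mimics the layout above;
# on a real system genre_dir would point into the coca directory,
# e.g. .../coca/wlp/wlp_academic_rpe.
with tempfile.TemporaryDirectory() as tmp:
    genre_dir = Path(tmp) / "wlp" / "wlp_academic_rpe"
    genre_dir.mkdir(parents=True)
    for year in (1990, 1991):
        (genre_dir / f"wlp_acad_{year}.txt").write_text("")
    found = files_by_year(genre_dir)
    years = sorted(found)
    names = [found[year].name for year in years]
    print(years, names)
```

Listing the directory rather than constructing file names avoids having to know the abbreviations used for the other genres.
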

There is also a `shared' directory with the following contents:
  - shared/lexicon/lexicon.txt  -- a list of tokens, with lemmas, to be used with the
    database format.
  - shared/sources/coca-sources.txt -- a list of the sources the corpus was compiled from
    with information about year, genre, sub-genre, etc.
  - shared/subgenreCodes.txt -- what the sub-genre codes mean

See for more introductory information.

The following gives an idea of the folder structure in tree form:

  db/
    db_academic_rpe/ ...
    db_fiction_awq/ ...
    db_magazine_qim/ ...
    db_newspaper_lsp/ ...
    db_spoken_kde/ ...
  wlp/
    wlp_academic_rpe/ ...
    wlp_fiction_awq/ ...
    wlp_magazine_qim/ ...
    wlp_newspaper_lsp/ ...
    wlp_spoken_kde/ ...
  text/
    text_academic_rpe/ ...
    text_fiction_awq/ ...
    text_magazine_qim/ ...
    text_newspaper_lsp/ ...
    text_spoken_kde/ ...

Contact: (Doug Arnold)