The Corpus of Contemporary American English (COCA)
See the coca directory in the `usual place': /ufs/corpora/ (unix), \\corpora\corpora\ (Windows) (see the Introduction).
Website: http://corpus.byu.edu/coca/. There is also a README file in the coca directory.
The corpus is represented in three forms:
1. database format -- in the db/ folder
2. text format -- in the text/ folder
3. wlp format -- in the wlp/ folder (=word, lemma, part-of-speech tag)

1. Files in the database format consist of purely numeric data, in lines that look like this:

    4000161 276530393 9100161

Unless you know what you are doing, these files will not be much use to you.
2. Files in text format consist of long lines of tokenized text with minimal markup. `Tokenized' means, e.g., that punctuation is separated out, so that a sentence ending with the word "soils" appears as "soils ." rather than "soils.". Minimal markup means there is not much more than "<p>" as a paragraph marker.
3. Files in wlp format (=word, lemma, part-of-speech tag) consist of lines with three tab-separated fields, like this:

    animals animal nn2

indicating that the word-form `animals' is a form of the lemma `animal' with the part-of-speech tag `nn2'.
Reading the first column of a wlp file gives you the text in essentially the same way as the corresponding text file.
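As a rough illustration of the text and wlp formats described above, here is a minimal Python sketch. It assumes the wlp lines are exactly three tab-separated fields and that "<p>" is the only markup in the text format. The names Token, read_wlp and split_paragraphs are made up for this example, and the sample data is invented apart from the `animals' line; real files may contain other line types that need extra handling.

```python
import io
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One line of a wlp file (hypothetical helper type)."""
    word: str
    lemma: str
    pos: str


def read_wlp(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for every line with exactly three tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            # Assumption: anything else (e.g. blank lines) is simply skipped.
            continue
        yield Token(*fields)


def split_paragraphs(text_line: str) -> List[List[str]]:
    """Split one line of a text-format file into paragraphs of tokens.

    Assumes "<p>" marks paragraph boundaries and tokens are space-separated.
    """
    paragraphs = [chunk.split() for chunk in text_line.split("<p>")]
    return [tokens for tokens in paragraphs if tokens]


# Invented sample data standing in for a real wlp file.
sample_wlp = io.StringIO("animals\tanimal\tnn2\nsoils\tsoil\tnn2\n")
tokens = list(read_wlp(sample_wlp))

# Joining the first column gives back the running (tokenized) text.
words = " ".join(t.word for t in tokens)

# Invented sample line standing in for a real text-format file.
paragraphs = split_paragraphs("<p> rich soils . <p> next paragraph")
```

To use this on a real file, pass an open file object to read_wlp instead of the StringIO sample. The file encoding is not stated here, so it may need to be set explicitly when opening.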
The files in each format are divided into subdirectories according to genre, e.g. for the wlp directory we have:

- wlp_academic_rpe -- academic texts
- wlp_fiction_awq -- fiction texts
- wlp_magazine_qim -- magazine texts
- wlp_newspaper_lsp -- newspaper texts
- wlp_spoken_kde -- spoken texts

Within each of these, material is divided by year. So, for example, we have:
- wlp/wlp_academic_rpe/wlp_acad_1990.txt -- texts from 1990 in the academic genre in wlp format
- text/text_academic_rpe/w_acad_1990.txt -- the same texts in full text format
- db/db_academic_rpe/db_acad_1990.txt -- the same in database format

There is also a 'shared' directory with the following contents:
- shared/lexicon/lexicon.txt -- a list of tokens, with lemmas, to be used with the database format.
- shared/sources/coca-sources.txt -- a list of the sources the corpus was compiled from, with information about year, genre, sub-genre, etc.
- shared/subgenreCodes.txt -- what the sub-genre codes mean
See http://corpus.byu.edu/full-text/intro.asp for more introductory information.
The following gives an idea of the folder structure in tree form:
db/
    db_academic_rpe/
        db_acad_1990.txt
        db_acad_1991.txt
        ...
        db_acad_2012.txt
    db_fiction_awq/
        ...
    db_magazine_qim/
        ...
    db_newspaper_lsp/
        ...
    db_spoken_kde/
        ...
wlp/
    wlp_academic_rpe/
        ...
    wlp_fiction_awq/
        ...
    wlp_magazine_qim/
        ...
    wlp_newspaper_lsp/
        ...
    wlp_spoken_kde/
        ...
text/
    text_academic_rpe/
        ...
    text_fiction_awq/
        ...
    text_magazine_qim/
        ...
    text_newspaper_lsp/
        ...
    text_spoken_kde/
        ...
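Given this layout, the per-year files in one genre directory can be collected with a few lines of Python. The following is only a sketch: it assumes every file name follows the <prefix>_<genre>_<year>.txt pattern seen in the academic examples (wlp_acad_1990.txt, w_acad_1990.txt, db_acad_1990.txt). The function name files_by_year is made up for this example, and the demo runs on a throwaway directory rather than the real corpus location.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict

# Assumed naming pattern: <prefix>_<genre>_<year>.txt, e.g. wlp_acad_1990.txt
YEAR_FILE = re.compile(r"^(?P<prefix>[a-z]+)_(?P<genre>[a-z]+)_(?P<year>\d{4})\.txt$")


def files_by_year(genre_dir: Path) -> Dict[int, Path]:
    """Map year -> file for each file in genre_dir that matches the pattern."""
    found: Dict[int, Path] = {}
    for path in sorted(genre_dir.glob("*.txt")):
        match = YEAR_FILE.match(path.name)
        if match:
            found[int(match.group("year"))] = path
    return found


# Demo on a temporary directory that mimics wlp/wlp_academic_rpe/.
with tempfile.TemporaryDirectory() as tmp:
    genre_dir = Path(tmp) / "wlp" / "wlp_academic_rpe"
    genre_dir.mkdir(parents=True)
    for year in (1990, 1991, 2012):
        (genre_dir / f"wlp_acad_{year}.txt").touch()
    (genre_dir / "notes.txt").touch()  # ignored: does not match the pattern

    demo = files_by_year(genre_dir)
    demo_years = sorted(demo)
    demo_first_name = demo[1990].name
```

On the real corpus, genre_dir would be something like the coca directory followed by wlp/wlp_academic_rpe. Files whose names do not match the assumed pattern are silently skipped.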
Contact: email@example.com (Doug Arnold)