The corpus consists of one million words of American English texts printed in 1961. The texts were sampled from 15 different text categories to make the corpus a good standard reference. Today the corpus is considered small and slightly dated, but it is still used. Much of its usefulness lies in the fact that the Brown corpus layout has been copied by other corpus compilers. The LOB corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown corpus. The availability of corpora that are so similar in structure is a valuable resource for researchers interested in, for example, comparing different language varieties.
For a long time, the Brown and LOB corpora were almost the only easily available computer-readable corpora. Much research within the field of corpus linguistics has therefore been carried out using these data. Because the same data have been studied from different angles, in different kinds of studies, researchers can compare their findings without having to take into account possible variation caused by the use of different data.
At the University of Freiburg, Germany, researchers are compiling new versions of the LOB and Brown corpora with texts from 1991. This will undoubtedly be a valuable resource for studies of language change in a near diachronic perspective.
The Brown corpus consists of 500 texts, each consisting of just over 2,000 words. The texts were sampled from 15 different text categories. The number of texts in each category varies (see below).
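The figures above can be checked against the stated total of one million words with a little arithmetic. The following is only a rough sketch: the constant and function names are made up for illustration, and the sample length is taken as a nominal 2,000 words, although the actual samples run slightly over.

```python
# Rough size check for the Brown corpus, using the figures quoted in the text.
# All names below are illustrative, not part of any corpus tool.
NUM_TEXTS = 500          # number of text samples in the corpus
WORDS_PER_TEXT = 2_000   # nominal sample length (assumption: real samples are slightly longer)
NUM_CATEGORIES = 15      # number of text categories


def nominal_corpus_size(num_texts: int, words_per_text: int) -> int:
    """Return the nominal word count for equally sized samples."""
    return num_texts * words_per_text


total_words = nominal_corpus_size(NUM_TEXTS, WORDS_PER_TEXT)
mean_texts_per_category = NUM_TEXTS / NUM_CATEGORIES

print(f"Nominal corpus size: {total_words:,} words")
print(f"Mean texts per category: {mean_texts_per_category:.1f}")
```

The mean is only an average: as noted, the number of texts differs from category to category.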
More comprehensive information about the Brown corpus can be found in the Brown Corpus Manual (external link).
The corpus is available through ICAME in various formats. For information about licensing and costs, click here. It is also possible to access the corpus online at the LDC site (after registering for a guest account).