Release 1 of the SUSANNE Corpus was completed on 1992.09.06.
Release 2, dated 1993.06.02, corrected a number of errors found in Release 1.
It also contains one minor change in annotation conventions: in the parse field, from Release 2 onwards all node labels are written within square brackets (Release 1 contained a redundant distinction between square brackets for ordinary nodes and angle brackets for "trace" nodes, which are distinguished in several other ways).
Release 3 of 1994.04.04 corrects further errors which came to light during the process of finalizing the MS of the book ENGLISH FOR THE COMPUTER. One proofreading technique applied in the creation of Release 3 was to read through the entire Corpus text printed in a format which used indentation to display the parse-field bracketing structure, in order to catch structural errors such as inappropriate placement of postmodifier constituents within parse trees.
Release 4 corrects a further handful of errors, discovered while checking the proofs of ENGLISH FOR THE COMPUTER and by other means.
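
The indentation-based display used in proofreading Release 3 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a parse is available as one string of square-bracketed constituents such as "[S [N the dog] [V barked]]". This simplified notation, the function name indent_parse, and the step parameter are hypothetical; they are not the actual SUSANNE parse-field format or the tooling used to produce the Corpus.

```python
import re

# Hypothetical token pattern: an opening bracket with its node label,
# a closing bracket, or a word. This is NOT the real SUSANNE field syntax.
TOKEN = re.compile(r"\[[^\[\]\s]*|\]|[^\[\]\s]+")


def indent_parse(parse: str, step: int = 2) -> str:
    """Print each constituent on its own line, indented by nesting depth."""
    lines = []
    depth = 0
    for tok in TOKEN.findall(parse):
        if tok.startswith("["):
            # Opening bracket plus label: emit at current depth, then nest.
            lines.append(" " * (depth * step) + tok)
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without matching opener")
            lines.append(" " * (depth * step) + tok)
        else:
            # An ordinary word inside the current constituent.
            lines.append(" " * (depth * step) + tok)
    if depth != 0:
        raise ValueError("unclosed bracket in parse string")
    return "\n".join(lines)


if __name__ == "__main__":
    print(indent_parse("[S [N the dog] [V barked]]"))
```

In a layout of this kind, a constituent attached at the wrong level appears at the wrong indentation, which may make a misplaced postmodifier easier to spot than it would be in a flat bracketed string.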
The SUSANNE Corpus has been created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.
The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
The SUSANNE scheme may be likened to a "Linnaean taxonomy" of the grammatical domain: its aim (comparable to that of Linnaeus's eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers' linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for NLP researchers to register everything that occurs in real-life usage systematically and unambiguously, and for researchers at different sites to exchange empirical grammatical data without misunderstandings over local uses of analytic terminology.
The SUSANNE analytic scheme is defined in detail in a book by myself, ENGLISH FOR THE COMPUTER, published by Oxford University Press on 23 February 1995. The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these.
The finished SUSANNE parsing scheme has been developed on the basis of samples of both British and American English. It is oriented chiefly towards written language.
The SUSANNE Corpus itself comprises an approximately 130,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme. The original motives for producing this database included that of providing better statistics for probabilistic parsing.
The SUSANNE scheme may be unparalleled in the extent to which its categories have been refined and tested through detailed consideration of the almost endless small quirks of the texts to which they have been applied, and in the degree of precision to which the resulting guidelines for using the categories have been documented -- thus defining analytic standards which permit annotation of future material to be extremely self-consistent. Accordingly the SUSANNE Corpus is offered to the research community primarily as a demonstration of the application of the parsing scheme, evidencing the fact that the scheme has survived the test of experience rather than being a merely aprioristic system. The SUSANNE Corpus functions, as it were, like a collection of type specimens appended to a botanical taxonomy.
The SUSANNE Corpus consists of 64 files (apart from the documentation file), each containing an annotated version of one 2000+ word text from the Brown Corpus. Files average about 83 kilobytes in size, so the entire Corpus totals about 5.3 megabytes.
The file names are those of the respective Brown texts.
Sixteen texts are drawn from each of the following Brown genre categories:
For more information (as well as a copy of the corpus), see the Documentation file available with the corpus.