English-language corpora
The following list provides
essential information on some of the most widely used corpora in
English linguistics. Most corpora are sold for off-line use, so
some of the resources mentioned in Sections 1 to 4 below may not be
available on all campuses. Section 5 lists corpora that offer free
online access.
1. Some important general-purpose corpora of present-day English (British and American English):
ANC (American National Corpus) [still being compiled]
- will contain 100 million words of American English, comparable to BNC
- written and spoken data from 1990 onwards, broad coverage of different genres and text-types
http://americannationalcorpus.org
BNC (British National Corpus)
- contains about 100 million words of text samples of varying length, including ca. 10 million words of transcribed speech
- data mainly from the late 1980s and 1990s (some earlier)
- for free online search facilities see 5. below
www.natcorp.ox.ac.uk
CSAE (Corpus of Spoken American English) / Santa Barbara Corpus of Spoken American English
- first electronic corpus of spoken American English
- samples different kinds of naturally occurring speech
(spontaneous dialogues, monologues, speeches, radio broadcasts, etc.)
http://www.linguistics.ucsb.edu/research/sbcorpus.html
BROWN Corpus
- the pioneer digital corpus of present-day English assembled in the early 1960s
- one million words of edited written American English (500 samples of
text of ca. 2,000 words each, all originally published in 1961)
- various genres (e.g. press reportage, fiction, government documents)
http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM
LOB (Lancaster – Oslo/Bergen) Corpus
- written texts published in Britain in 1961, sampled in the same way as Brown
http://khnt.hit.uib.no/icame/manuals/lob/INDEX.HTM
F-LOB (Freiburg – Lancaster-Oslo/Bergen Corpus)
- written texts published in Britain in 1991, sampled in the same way as Brown
http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM
FROWN (Freiburg-Brown Corpus) of American English
- written texts published in the US in 1992, sampled in the same way as Brown
http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM
LLC (London-Lund Corpus)
- contains 100 texts of 5,000 words each, i.e. 500,000 words of spoken British English
- various genres (e.g. spontaneous dialogues, radio broadcasts)
- compiled in two phases: 1975 to 1981 and 1985 to 1988
http://khnt.hit.uib.no/icame/manuals/LONDLUND/INDEX.HTM
2. Corpora documenting varieties other than, or in addition to, British and American English
ACE (Australian Corpus of English)
- written texts published in Australia in 1986, sampled in the same way as Brown
http://khnt.hit.uib.no/icame/manuals/ace/INDEX.HTM
FRED (Freiburg English Dialects)
- a specialised corpus of nine British English dialects
- about 2.45 million words; 370 texts, 300 hours of speech
- mainly recorded between 1968 and 2000
- the majority of informants are non-mobile old rural male speakers
http://www.freidok.uni-freiburg.de/freidok/volltexte/2006/2489/pdf/Userguide_neu.pdf
ICE (International Corpus of English)
- a range of million-word corpora of different Englishes, representing
native- and official-language national varieties of English
- 60 per cent spoken texts, 40 per cent written texts
- the following varieties of English are currently available:
ICE East Africa, Great Britain, Hong Kong, India, Jamaica, New Zealand,
Philippines, Singapore, Australia [access restricted], Ireland [access
restricted]
http://www.ucl.ac.uk/english-usage/ice/
Kolhapur Corpus of Indian English
- written texts published in India in 1978, sampled in the same way as Brown
http://khnt.hit.uib.no/icame/manuals/kolhapur/INDEX.HTM
WSC (Wellington Corpus of Spoken New Zealand English)
- 1 million words of spoken New Zealand English
- consists of 2,000-word extracts (75 per cent informal, 12 per cent
formal and 12 per cent semi-formal speech; dialogues and monologues)
- collected between 1988 and 1994
http://khnt.hit.uib.no/icame/manuals/wsc/INDEX.HTM
WWC (Wellington Corpus of Written New Zealand English)
- written texts published in New Zealand between 1986 and 1990, sampled in the same way as Brown
http://khnt.hit.uib.no/icame/manuals/wellman/INDEX.HTM
3. Specialised Corpora
CSPAE (Corpus of Spoken Professional American English)
- two subcorpora of 1 million words each
- staff meetings in educational settings and White House press conferences
- data from 1994-98
http://www.athel.com/cpsa.html
COLT (Bergen Corpus of London Teenage English)
- corpus of spontaneous speech of London teenagers aged 13-17
- contains the original sound recordings and part-of-speech tagged orthographic transcripts of the conversations
http://khnt.hit.uib.no/icame/manuals/COLT/COLT.PDF
ICLE (International Corpus of Learner English)
- 2.5 million words of English
- L1 English subcorpus (LOCNESS) and learner sub-corpora
- contains student essays of about 500-1000 words from ca. 20 different mother tongue backgrounds
http://cecl.fltr.ucl.ac.be/
4. Diachronic Corpora
DCPSE (Diachronic Corpus of Present-Day Spoken English)
- contains spoken material from two Modern British English corpora:
400,000 words from ICE-GB (collected in the early 1990s) and 400,000
words from the London-Lund Corpus (late 1960s to early 1980s)
- sociolinguistic information on texts, speakers and authors
- offers a playback facility for listening to the samples
- unique resource for examining recent change in the grammar of spoken English
http://www.ucl.ac.uk/english-usage/projects/dcpse/
Helsinki Corpus of English Texts
- about 1.6 million words
- divided into three main periods: Old, Middle and Early Modern English
http://khnt.hit.uib.no/icame/manuals/HC/INDEX.HTM
5. Online resources
BNC (British National Corpus)
illustrative trial searches (upper limit of 50 returns) at: http://www.natcorp.ox.ac.uk/
“BNC_view” [website hosted by Mark Davies, Brigham Young
University] allows full searches with much expanded functionality: http://corpus.byu.edu/bnc/
Bank of English (Collins-COBUILD)
- a 500+ million-word mixed-genre resource:
http://www.collins.co.uk/Corpus/CorpusSearch.aspx
http://www.collins.co.uk/books.aspx?group=153
MICASE (Michigan Corpus of Academic Spoken English)
http://quod.lib.umich.edu/m/micase/
References and further reading
Baker, Paul, Andrew Hardie & Tony McEnery. 2006. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.
Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge: CUP.
Also consult:
David Lee’s “Bookmarks for Corpus-Based Linguists,” at http://devoted.to/corpora
AntConc (developed by Laurence
Anthony, of Waseda University, Japan) is a sturdy, easy-to-install,
easy-to-use and freely downloadable text search and concordancing
program. See http://www.antlab.sci.waseda.ac.jp/antconc_index.html for more information.
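For readers unfamiliar with concordancing, the following is a minimal sketch in Python of a keyword-in-context (KWIC) display, the kind of output a concordancer typically produces. It is an illustration only and does not reproduce AntConc's implementation; the function name `kwic`, its `width` parameter and the sample sentence are hypothetical.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Up to `width` characters of context on either side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, not taken from any of the corpora listed above.
sample = "The corpus contains written texts. Each corpus was sampled in the same way."
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Each hit is printed on its own line with the search word bracketed and aligned, so that recurrent patterns to its left and right are easy to scan.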
© This survey was compiled by
Friederike Müller, University of Freiburg, on the basis of
information available as of 1 September 2007.