English-language corpora

Download the corpus list as pdf-document here

The following list provides essential information on some of the most widely used corpora in English linguistics. Most corpora are sold for off-line use, so some of the resources listed in sections 1 to 4 below may not be available on all campuses. Section 5 lists corpora that offer free online access.
  1. Some important general-purpose corpora of present-day English (British and American English):

    ANC (American National Corpus) [still being compiled]
    - will contain 100 million words of American English, comparable to BNC
    - written and spoken data from 1990 onwards, broad coverage of different genres and text-types
    http://americannationalcorpus.org

    BNC (British National Corpus)
    - contains about 100 million words of text samples of varying length, including ca. 10 million words of transcribed speech
    - data mainly from the late 1980s and 1990s (some earlier)
    - for free online search facilities see 5. below
    www.natcorp.ox.ac.uk

    CSAE (Corpus of Spoken American English)/ Santa Barbara Corpus of Spoken American English
    - first electronic corpus of spoken American English
    - samples different kinds of naturally occurring speech (spontaneous dialogues, monologues, speeches, radio broadcasts, etc.)
    http://www.linguistics.ucsb.edu/research/sbcorpus.html

    BROWN Corpus
    - the pioneering digital corpus of present-day English, assembled in the early 1960s
    - one million words of edited written American English (500 samples of text of ca. 2,000 words each, all originally published in 1961)
    - various genres (e.g. press reportage, fiction, government documents)
    http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM

    LOB (Lancaster – Oslo/Bergen) Corpus
    - written texts published in Britain in 1961, sampled in the same way as Brown
    http://khnt.hit.uib.no/icame/manuals/lob/INDEX.HTM

    F-LOB (Freiburg – Lancaster-Oslo/Bergen Corpus)
    - written texts published in Britain in 1991, sampled in the same way as Brown
    http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM

    FROWN (Freiburg-Brown Corpus) of American English
    - written texts published in the US in 1992, sampled in the same way as Brown
    http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM

    LLC (London-Lund Corpus)
    - contains 100 texts of 5,000 words each, totalling 500,000 words of spoken British English
    - various genres (e.g. spontaneous dialogues, radio broadcasts)
    - compiled in two periods, 1975 to 1981 and 1985 to 1988
    http://khnt.hit.uib.no/icame/manuals/LONDLUND/INDEX.HTM

  2. Corpora documenting varieties other than, or in addition to, British and American English

    ACE (Australian Corpus of English)
    - written texts published in Australia in 1986, sampled in the same way as Brown
    http://khnt.hit.uib.no/icame/manuals/ace/INDEX.HTM

    FRED (Freiburg English Dialects)
    - a specialised corpus of nine British English dialects
    - about 2.45 million words; 370 texts, 300 hours of speech
    - mainly recorded between 1968 and 2000
    - the majority of informants are non-mobile old rural male speakers
    http://www.freidok.uni-freiburg.de/freidok/volltexte/2006/2489/pdf/Userguide_neu.pdf

    ICE (International Corpus of English)
    - a range of million-word corpora of different Englishes, representing native- and official-language national varieties of English
    - 60 per cent spoken texts, 40 per cent written texts
    - the following varieties of English are currently available:
    ICE East Africa, Great Britain, Hong Kong, India, Jamaica, New Zealand, Philippines, Singapore, Australia [access restricted], Ireland [access restricted]
    http://www.ucl.ac.uk/english-usage/ice/

    Kolhapur Corpus of Indian English
    - written texts published in India in 1978, sampled in the same way as Brown
    http://khnt.hit.uib.no/icame/manuals/kolhapur/INDEX.HTM

    WSC (Wellington Corpus of Spoken New Zealand English)
    - 1 million words of spoken New Zealand English
    - consists of 2,000-word extracts (75 per cent informal, 12 per cent formal and 12 per cent semi-formal speech; dialogues and monologues)
    - collected between 1988 and 1994
    http://khnt.hit.uib.no/icame/manuals/wsc/INDEX.HTM

    WWC (Wellington Corpus of Written New Zealand English)
    - written texts published in New Zealand between 1986 and 1990, sampled in the same way as Brown
    http://khnt.hit.uib.no/icame/manuals/wellman/INDEX.HTM

  3. Specialised Corpora

    CSPAE (Corpus of Spoken Professional American English)
    - two subcorpora of 1 million words each
    - staff meetings in educational settings and White House press conferences
    - data from 1994-98
    http://www.athel.com/cpsa.html

    COLT (Bergen Corpus of London Teenage English)
    - corpus of spontaneous speech of London teenagers aged 13-17
    - contains the original sound recordings and part-of-speech tagged orthographic transcripts of the conversations
    http://khnt.hit.uib.no/icame/manuals/COLT/COLT.PDF

    ICLE (International Corpus of Learner English)
    - 2.5 million words of English
    - L1 English subcorpus (LOCNESS) and learner sub-corpora
    - contains student essays of about 500-1000 words from ca. 20 different mother tongue backgrounds
    http://cecl.fltr.ucl.ac.be/


  4. Diachronic Corpora

    DCPSE (Diachronic Corpus of Present-Day Spoken English)
    - contains spoken material from two Modern British English corpora: 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s to early 1980s)
    - sociolinguistic information on texts, speakers and authors
    - offers a playback facility for listening to the samples
    - unique resource for examining recent change in the grammar of spoken English
    http://www.ucl.ac.uk/english-usage/projects/dcpse/

    Helsinki Corpus of English Texts
    - about 1.6 million words
    - divided into three main periods: Old, Middle and Early Modern English
    http://khnt.hit.uib.no/icame/manuals/HC/INDEX.HTM


  5. Online resources

    BNC (British National Corpus)
    illustrative trial searches (upper limit of 50 returns) at: http://www.natcorp.ox.ac.uk/
    “BNC_view” [website hosted by Mark Davies, Brigham Young University] allows full searches with much expanded functionality: http://corpus.byu.edu/bnc/

    Bank of English (Collins-COBUILD)
    - a 500+ million-word mixed-genre resource:
    http://www.collins.co.uk/Corpus/CorpusSearch.aspx
    http://www.collins.co.uk/books.aspx?group=153

    MICASE (Michigan Corpus of Academic Spoken English)
    http://quod.lib.umich.edu/m/micase/


References and further reading

Baker, Paul, Andrew Hardie & Tony McEnery. 2006. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.
Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press.

Also consult:
David Lee’s “Bookmarks for Corpus-Based Linguists,” at http://devoted.to/corpora
AntConc (developed by Laurence Anthony, of Waseda University, Japan) is a sturdy, easy-to-install, easy-to-use and freely downloadable text search and concordancing program. See http://www.antlab.sci.waseda.ac.jp/antconc_index.html for more information.
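
For readers new to concordancing, the following is a minimal sketch of a keyword-in-context (KWIC) search in Python. It is not AntConc's implementation and does not reflect its features; the function name kwic, the width parameter and the sample sentence are all assumptions made for illustration. It only shows the basic idea of listing every occurrence of a search word together with a fixed amount of text on either side.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration only; matching is case-insensitive
    and context is measured in characters, not words.
    """
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so the keyword lines up in a single column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Invented sample text, not taken from any corpus listed above.
sample = "The corpus is small. A corpus of speech differs from a written corpus."

for line in kwic(sample, "corpus", width=20):
    print(line)
```

Dedicated tools add much more on top of this (sorting by left or right context, regular-expression queries, frequency lists), so a sketch like this is useful mainly for understanding what a concordance line is.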

© This survey was compiled by Friederike Müller, University of Freiburg, on the basis of information available as of 1 September 2007.
