English corpora
|
This page offers short descriptions of the most widely known English language corpora. To find out more about any of these, click on the corpus title. This will take you to the homepage (or manual) of the corpus.
|
BROWN Corpus
Developer: |
Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island |
Collection date: |
1960s |
Size:
|
1 million words |
Contents: |
written language; 500 text samples of approx. 2,000 words; 15 text categories |
Annotation: |
untagged and tagged versions (POS tagging) |
Availability: |
ICAME CD |
CSPAE - Corpus of Spoken Professional American English
Developer: |
Michael Barlow at Athelstan and Rice University, Houston, TX |
Collection date: |
1994-1998 |
Size:
|
2 million words (2 sub-corpora, approx. 1 million words each) |
Contents: |
academic discourse and White House briefings; short interchanges by 400 speakers |
Annotation: |
untagged and tagged versions; POS tagging based on the CLAWS tagset developed at Lancaster University |
Availability: |
from Athelstan |
FROWN - Freiburg BROWN Corpus of American English
Developer: |
Christian Mair at the University of Freiburg |
Collection date: |
1990s |
Size:
|
1 million words |
Contents: |
matches the original Brown corpus |
Annotation: |
untagged |
Availability: |
ICAME CD |
MICASE - Michigan Corpus of Academic Spoken English
Developer: |
R. C. Simpson, S. L. Briggs, J. Ovens, and J. M. Swales at the English Language Institute, University of Michigan |
Collection date: |
since 1997, ongoing |
Size:
|
1.7 million words |
Contents: |
transcripts and audio files of academic speech |
Annotation: |
discourse annotation |
Availability:
|
freely available on the Web |
SUSANNE Corpus
Developer: |
Geoffrey Sampson at the University of Essex |
Collection date: |
1960s |
Size:
|
130,000 words |
Contents: |
subset of BROWN Corpus |
Annotation: |
POS tagged and syntactically parsed subset of BROWN |
Availability:
|
freely available on the Web |
|
ACE - Australian Corpus of English
Developer: |
Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney |
Collection date: |
1986 |
Size:
|
1 million words; 500 text samples of approx. 2,000 words |
Contents: |
written and spoken language; modelled on LOB and BROWN |
Annotation: |
untagged |
Availability: |
ICAME CD |
|
Bank of English (Collins Cobuild)
Developer: |
John Sinclair and his team at the University of Birmingham and HarperCollins |
Collection date: |
1980s |
Size:
|
more than 450 million words in 2005, growing |
Contents: |
one of the largest English corpora; a 'monitor' corpus (i.e. continually growing); originally collected as a basis for the COBUILD dictionary and continually expanded since then; originally containing 75% written and 25% spoken language, 70% British, 20% American, 5% other varieties; contains entire texts rather than samples; covers a wide cross-section of contemporary English |
Annotation:
|
POS tagged |
Availability: |
free search in Collins WordBanks online, a 56 million word subset of the BoE |
BNC - British National Corpus
COLT - Bergen Corpus of London Teenage Language
Developer: |
University of Bergen, Norway |
Collection date: |
1993 |
Size:
|
500,000 words |
Contents: |
transcripts of spoken language of London teenagers (COLT is part of the BNC) |
Annotation: |
POS tagging |
Availability: |
ICAME CD |
CHRISTINE Corpus
Developer: |
Geoffrey Sampson at the University of Essex |
Collection date: |
1990s |
Size:
|
100,000 words |
Contents: |
informal spoken language (taken from BNC) |
Annotation: |
POS tagged and syntactically parsed subset of spoken part of BNC |
Availability:
|
freely available on the Web |
FLOB - Freiburg-LOB Corpus of British English
Developer: |
Christian Mair at the University of Freiburg |
Collection date: |
1990s |
Size:
|
1 million words |
Contents: |
matches the original LOB corpus |
Annotation: |
untagged |
Availability: |
ICAME CD |
ICE-GB - International Corpus of English, British Component
Developer: |
co-ordinated by Gerald Nelson at University College London |
Collection date: |
1990-93 |
Size:
|
1 million words |
Contents: |
written and spoken language covering a variety of genres; the aim of the International Corpus of English (ICE) project was to build comparable corpora of 15 regional varieties of English for comparative studies of English worldwide |
Annotation: |
textual markup, discourse annotation, POS tagging, syntactic parsing |
Availability: |
on CD |
Lancaster Parsed Corpus
Developer: |
Roger Garside, Geoffrey Leech and Tamas Varadi at the University of Lancaster |
Collection date: |
1978 |
Size:
|
140,000 words |
Contents: |
parsed subcorpus of the LOB |
Annotation: |
POS tagging, syntactic parsing |
Availability: |
ICAME CD |
LLC - London-Lund Corpus of Spoken English
Developer: |
Randolph Quirk and Sidney Greenbaum at University College London; Jan Svartvik at Lund University |
Collection date: |
1960s, 1975-81, 1985-88 |
Size:
|
500,000 words |
Contents: |
spoken language; based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University) |
Annotation: |
prosodic and discourse annotation |
Availability: |
ICAME CD |
LOB - Lancaster/Oslo-Bergen Corpus
Developer: |
compiled under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen |
Collection date: |
1970-1978 |
Size:
|
1 million words |
Contents: |
written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus |
Annotation: |
untagged and tagged versions; POS tagging (CLAWS tagset) |
Availability: |
ICAME CD |
POW - Polytechnic of Wales Corpus
Developer: |
The Computational Linguistics Unit at University of Wales College of Cardiff |
Collection date: |
1980s |
Size:
|
65,000 words |
Contents: |
transcripts of spoken language of children |
Annotation: |
POS tagging, syntactic parsing |
Availability: |
ICAME CD |
SEC - Lancaster/IBM Spoken English Corpus
Developer: |
University of Lancaster and IBM Scientific Centre |
Collection date: |
1984-87 |
Size:
|
52,000 words |
Contents: |
spoken language; transcripts of radio broadcasts, recordings made at the University of Lancaster, and Open University tapes |
Annotation: |
prosodic markup, POS tagged with CLAWS |
Availability: |
ICAME CD |
|
ICE-EA - International Corpus of English, East African Component
|
ICE - International Corpus of English, Indian Component
Kolhapur Corpus
Developer: |
S. K. Verma at University of Lancaster and Shivaji University, Kolhapur |
Collection date: |
1978 |
Size:
|
1 million words, 500 text samples of approx. 2,000 words |
Contents: |
written language; modelled on BROWN and LOB |
Annotation:
|
untagged |
Availability: |
ICAME CD |
|
ICE - International Corpus of English, New Zealand Component
Wellington Corpus
Developer: |
Laurie Bauer at Victoria University, Wellington |
Collection date: |
1986-90 |
Size:
|
1 million words; 500 text samples of approx. 2,000 words |
Contents: |
written language; modelled on BROWN and LOB |
Annotation:
|
untagged |
Availability: |
ICAME CD |
Wellington Corpus of Spoken New Zealand English
Developer: |
Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington |
Collection date: |
1988-94 |
Size:
|
1 million words; 500 text samples of approx. 2,000 words |
Contents: |
spoken language; formal, semi-formal and informal speech |
Annotation:
|
discourse markup |
Availability: |
ICAME CD |
|
ICE - International Corpus of English, Philippine Component
|
ICE - International Corpus of English, Indian Component
|
English as a Lingua Franca
|
|
VOICE - Vienna Oxford International Corpus of English
Developer: |
Barbara Seidlhofer at the University of Vienna |
Collection date: |
since 2001 (ongoing) |
Size:
|
250,000 words to date, to be extended |
Contents: |
spoken English; interactions in English as a lingua franca; unscripted, largely face-to-face communication among competent non-native speakers, including private and public dialogues, private and public group discussions, casual conversations, and one-to-one interviews |
Annotation:
|
conversational markup |
ELFA - English as a Lingua Franca in Academic Settings
Developer: |
Anna Mauranen at Tampere University |
Collection date: |
ongoing |
Size:
|
0.5 million words |
Contents: |
spoken academic English involving non-native speakers; includes various speech events (e.g. lectures, workshops, seminars, presentations) |
Annotation:
|
|
|
ARCHER Corpus - A Representative Corpus of Historical English Registers
Developer: |
Northern Arizona University in co-operation with the Universities of Uppsala, Helsinki and Freiburg |
Sampling period: |
1650-1990 |
Size:
|
1.7 million words |
Contents: |
1,037 texts; 10 registers (e.g. drama, letters, science prose), including British and American English; sampled from 7 historical periods covering Early Modern English; speech-based, popular, and specialist/academic written registers |
Annotation: |
POS tagged |
Availability: |
|
CEECS - Corpus of Early English Correspondence Sampler
Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki |
Sampling period: |
1418-1680 |
Size:
|
450,000 words |
Contents: |
personal letters |
Annotation: |
|
Availability: |
ICAME CD |
Helsinki Corpus of English Texts: Diachronic Part
Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki |
Sampling period: |
ca. 750 to 1700 |
Size:
|
1.6 million words |
Contents: |
Old, Middle and Early Modern English texts |
Annotation: |
|
Availability: |
ICAME CD |
Helsinki Corpus of Older Scots
Developer: |
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki |
Sampling period: |
1450-1700 |
Size:
|
830,000 words |
Contents: |
Old, Middle and Early Modern English texts |
Annotation: |
untagged |
Availability: |
ICAME CD |
Lampeter Corpus of Early Modern English Tracts
Developer: |
Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz |
Sampling period: |
1640-1740 |
Size:
|
1.1 million words |
Contents: |
non-literary prose texts |
Annotation: |
textual markup |
Availability: |
ICAME CD |
Newdigate Newsletters Corpus
Developer: |
Philip Hines, Jr., Norfolk, Virginia |
Sampling period: |
1692 |
Size:
|
750,000 words |
Contents: |
a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire) |
Annotation: |
untagged |
Availability: |
ICAME CD |
|
Books and other websites containing descriptions of corpora
|
|
- Kennedy, Graeme (1998): Introduction to Corpus Linguistics. London: Longman.
- Meyer, Charles (2002): English corpus linguistics: an introduction. Cambridge: CUP.
|
|