semtracks Corpora Directory

The corpora listed below have been made freely available for research purposes by members of the academic community. Some of them can be queried through a web interface; others can be downloaded and processed locally.

We have also made available some of our own corpora. More information about them (e.g. how to access them) can be found here. For a comprehensive online introduction to corpus linguistics (in German), go to Noah Bubenhofer's Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge.



Available languages (or language subtypes) are: Ancient Greek, Arabic, Aramaic, Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hungarian, Icelandic, Irish, Italian, Japanese, Korean, Kurdish, Latin, Latvian, Lithuanian, Maltese, Middle English, Middle French, Middle High German, Norwegian, Old English, Old French, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Walloon, Welsh
Available language mode values are: monolingual, multilingual – comparative, multilingual – parallel
Available language medium values are: spoken, written
Available availability types are: Linguistic Data Consortium (LDC), downloadable from corpus website, query interface on corpus website
Available annotations are: error annotation, morphology, none, part of speech, pragmatics, semantics, syntax, various
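The field values above can be used to filter the entries that follow. As a minimal sketch, assuming the directory has been saved as plain text with one blank-line-separated block per corpus (the function names `parse_entries` and `select` are hypothetical and not part of the site), the records could be parsed and filtered like this:

```python
import re

# Field labels exactly as they appear in each directory entry.
FIELDS = ("language(s)", "language mode", "language medium",
          "availability", "annotation(s)", "size")

# Three entries copied from the directory, used as sample input.
SAMPLE = """\
Brown Corpus
The Brown Corpus contains informative and imaginative prose texts from the year 1961.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 1 mio. words.

Hansards
The "Hansards" parallel corpus consists of official records ("hansards") of the 36th Canadian Parliament.
language(s): English, French
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 1.3 mio. words.

Prague Dependency Treebank (PDT)
The PDT contains dependency trees.
language(s): Czech
language mode: monolingual
language medium: written
availability: Linguistic Data Consortium (LDC), downloadable from corpus website
annotation(s): part of speech, syntax
size: 2 mio. words.
"""


def parse_entries(text):
    """Split the directory text into one dict per corpus entry."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue
        entry = {"name": lines[0], "description": ""}
        for line in lines[1:]:
            key, sep, value = line.partition(":")
            if sep and key in FIELDS:
                value = value.strip()
                if key == "size":
                    # Sizes are free text ("1 mio. words.", "128 dialogues.").
                    entry[key] = value
                else:
                    # All other fields are comma-separated value lists.
                    entry[key] = [v.strip() for v in value.split(",") if v.strip()]
            else:
                # Anything that is not a known field belongs to the description.
                entry["description"] = (entry["description"] + " " + line).strip()
        # Skip blocks (e.g. introductory paragraphs) that carry no fields.
        if any(field in entry for field in FIELDS):
            entries.append(entry)
    return entries


def select(entries, criteria):
    """Keep entries whose fields contain every requested value."""
    return [e for e in entries
            if all(value in e.get(field, []) for field, value in criteria.items())]


entries = parse_entries(SAMPLE)
parallel = select(entries, {"language mode": "multilingual – parallel"})
print([e["name"] for e in parallel])  # prints ['Hansards']
```

Entries with empty fields (for example a missing size) simply produce an empty value, so they are never matched by a filter on that field.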


A Corpus of Plagiarised Short Answers
The corpus contains answers to five different questions from the domain of Computer Science. The answers represent one of four levels of plagiarism: near copy, light revision, heavy revision, non-plagiarism.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 19559 words

Archiv für gesprochenes Deutsch (AGD)
The corpus contains examples of different varieties of German.
language(s): German
language mode: monolingual
language medium: spoken
availability: query interface on corpus website
annotation(s): none
size: 8000 recordings freely available.

British National Corpus (BNC)
The British National Corpus (BNC) contains 90% written and 10% orthographically transcribed spoken text from the later part of the 20th century. It is in XML format and has its own query software, XAIRA.
language(s): English
language mode: monolingual
language medium: spoken, written
availability: query interface on corpus website
annotation(s): part of speech
size: 100 mio. words.

British National Corpus (BNC) Baby
BNC Baby is a subcorpus of the British National Corpus (BNC) with texts from the genres fiction, newspapers, academic writing and spoken conversation.
language(s): English
language mode: monolingual
language medium: spoken, written
availability: query interface on corpus website
annotation(s): part of speech
size: 4 x 1 mio. words.

British National Corpus (BNC) Sampler
BNC Sampler is a subcorpus of the British National Corpus (BNC) which maintains the latter's genre proportions.
language(s): English
language mode: monolingual
language medium: spoken, written
availability: query interface on corpus website
annotation(s): part of speech
size: 2 mio. words.

Brown Corpus
The Brown Corpus contains informative and imaginative prose texts from the year 1961.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 1 mio. words.

Congressional speech data
The corpus contains transcripts of U.S. Congressional floor debates.
language(s): English
language mode: monolingual
language medium: spoken
availability: downloadable from corpus website
annotation(s): none
size:

Corpora of the United Nations
This parallel corpus is made up of document collections from the United Nations.
language(s): Arabic, Chinese, English, French, Russian, Spanish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 3 mio. words per language.

Corpus Berliner Zeitung
This is a balanced corpus with texts from 1994 to 2005.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 252 mio. words.

Corpus de Bitextes Anglais-Français (BAF)
This parallel corpus features mainly institutional texts.
language(s): English, French
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 0.4 mio. words.

Corpus del Español
The corpus features Spanish texts written from the 1200s to the 1900s.
language(s): Spanish
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 100 mio. words.

Corpus der Potsdamer Neuesten Nachrichten
This is a balanced corpus with texts from 2003 to 2005.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 15 mio. words.

Corpus do Português
This historical corpus contains Portuguese texts from the 1300s to the 1900s.
language(s): Portuguese
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 45 mio. words.

Corpus Gesprochene Sprache
This is a balanced corpus with texts from the 20th century.
language(s): German
language mode: monolingual
language medium: spoken
availability: query interface on corpus website
annotation(s): part of speech
size: 2.5 mio. words.

Corpus Jüdischer Periodika
This is a balanced corpus with texts dating from the period between 1887 and 1938.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 26 mio. words.

Corpus of Contemporary American English (COCA)
The corpus contains spoken data, fiction, popular magazines, newspapers and academic texts. 20 mio. words are added each year.
language(s): English
language mode: monolingual
language medium: spoken, written
availability: query interface on corpus website
annotation(s): part of speech
size: 400 mio. words.

CORpus of tagged Political Speeches (CORPS)
This is a corpus of political speeches tagged with audience reactions such as "applause" or "laughter".
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 2.2 mio. words.

Corpus Search, Management and Analysis System (COSMAS) II
COSMAS II contains the 86 corpora of the Institut für Deutsche Sprache (IDS) in Mannheim.
language(s): German
language mode:
language medium:
availability: query interface on corpus website
annotation(s): various
size: 3 bio. words.

DDR-Korpus
This is a balanced corpus with texts written between 1949 and 1990.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 9 mio. words.

deWaC
The corpus consists of the content of websites in the .de domain. It is part-of-speech-tagged and lemmatized.
language(s): German
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 1.7 bio. words.

Dutch Open Subtitles Treebank (Alpino Treebank)
This is a treebank with texts from OpenSubtitles.
language(s): Dutch
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size:

DWDS-Ergänzungscorpus
This is a balanced corpus with texts from 1990 to 2000.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 1000 mio. words.

DWDS-Kernkorpus
This is a balanced corpus with texts from the 20th century.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 100 mio. words.

Ein fehlerannotiertes Lernerkorpus des Deutschen als Fremdsprache (Falko)
The corpus contains text summaries and essays written by students of German. More information about its compilation can be found here.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): error annotation, part of speech
size:

EUconst
The corpus contains bitext in 22 languages.
language(s): Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Slovak, Slovenian, Spanish, Swedish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 3 mio. words.

European Medicines Agency (EMEA) Corpus
The corpus is made up of PDF documents from the European Medicines Agency.
language(s): Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 327 mio. words.

European Parliament Proceedings Parallel Corpus 1996-2009
The Europarl Corpus is a parallel corpus that features between 34 and 55 mio. words per language. The following language pairs are available: Danish-EN, German-EN, Greek-EN, Spanish-EN, Finnish-EN, French-EN, Italian-EN, Dutch-EN, Portuguese-EN, Swedish-EN.
language(s): Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 34-55 mio. words per language.

Hansards
The "Hansards" parallel corpus consists of official records ("hansards") of the 36th Canadian Parliament.
language(s): English, French
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 1.3 mio. words.

HCRC Map Task Corpus
The corpus contains dialogues from a map task that have been recorded, transcribed, and annotated for a wide range of behaviours. The digitised forms of the maps themselves are also available.
language(s): English
language mode: monolingual
language medium: spoken
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 128 dialogues.

International Corpus of English (ICE)
The ICE was compiled to capture national and regional varieties of English (Canada, East Africa, Great Britain, Hong Kong, India, Ireland, Jamaica, New Zealand, Philippines, Singapore, etc.). Each variety subcorpus consists of spoken and written English produced after 1989.
language(s): English
language mode: multilingual – comparative
language medium: spoken, written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 1 mio. words each.

International Corpus of Learner English (ICLE)
The corpus contains essays written by advanced learners of English as a foreign language.
language(s): English
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): various
size: 3 mio. words.

itWaC
The corpus was constructed from the content of websites in the .it domain.
language(s): Italian
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 2 bio. words.

Joint Research Centre-Acquis Communautaire (JRC-Acquis)
The JRC-Acquis contains the total body of European Union (EU) law applicable in the EU Member States from 1950 to today.
language(s): Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 1 bio. words.

Juilland-D-Corpus
This is a balanced corpus with texts written between 1920 and 1939.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 0.5 mio. words.

KDEdoc
The KDEdoc corpus features data in 24 languages.
language(s): Danish, Dutch, English, Estonian, French, German, Hungarian, Italian, Japanese, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Walloon
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 3.7 mio. words.

Lancaster-Oslo/Bergen Corpus of British English (LOB)
The LOB Corpus was designed to be the British English equivalent to the Brown Corpus. As such, it also contains informative and imaginative prose texts from 1961.
language(s): English
language mode: monolingual
language medium: written
availability:
annotation(s):
size: 1 mio. words.

London-Lund Corpus of Spoken English (LLC)
The London-Lund Corpus of Spoken English contains speech data from two corpora: Survey of English Usage (SEU) and Survey of Spoken English (SSE).
language(s): English
language mode: monolingual
language medium: spoken
availability:
annotation(s):
size: 1 mio. words.

Movie Review Data
The corpus comprises movie-review documents annotated with their subjectivity status (subjective or objective) and sentiment polarity (positive or negative).
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): semantics
size:

Multi-Domain Sentiment Dataset
This corpus contains Amazon.com reviews from various product types.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): semantics
size:

Multi-Perspective Question Answering (MPQA) Opinion Corpus
The MPQA corpus features news articles manually annotated for opinions and other private states (beliefs, emotions, sentiments, speculations, etc.).
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): semantics
size:

MultiSemCor
This is a subset of the Brown Corpus in which 0.2 mio. words are tagged with their WordNet senses.
language(s): English, Italian
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, semantics
size: 0.7 mio. words.

National University of Singapore (NUS) SMS Corpus
The corpus contains short messages written mostly by Singaporean university students.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 10000 messages.

Open American National Corpus (OANC)
The ANC aims to be comparable in genres and size to the British National Corpus (BNC). It contains texts as well as transcripts of spoken data from 1990 to today.
language(s): English
language mode: monolingual
language medium: spoken, written
availability: downloadable from corpus website
annotation(s): part of speech
size: 14 mio. words.

OpenOffice Corpus
The corpus contains the OpenOffice documentation in six languages.
language(s): English, French, German, Japanese, Spanish, Swedish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 2.6 mio. words.

OpenSubtitles
This parallel corpus contains subtitles in 30 languages.
language(s): Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hungarian, Icelandic, Italian, Japanese, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 149 mio. words.

Oxford Text Archive (OTA)
The OTA features a range of literary texts in more than 25 languages.
language(s): Ancient Greek, Arabic, Aramaic, Chinese, Croatian, Czech, English, French, Galician, German, Greek, Irish, Italian, Japanese, Kurdish, Latin, Middle English, Middle French, Middle High German, Old English, Old French, Polish, Portuguese, Russian, Serbian, Slovenian, Spanish, Swedish, Turkish, Welsh
language mode: multilingual – comparative
language medium: written
availability: downloadable from corpus website
annotation(s): various
size:

Penn Treebank
The Penn Treebank is a corpus of parsed sentences. Its original data derives from four sources: Wall Street Journal, Brown Corpus, Switchboard, ATIS.
language(s): English
language mode: monolingual
language medium: written
availability: Linguistic Data Consortium (LDC)
annotation(s): part of speech, syntax
size: 1 mio. words.

PHP Corpus
PHP Corpus is a corpus comprising PHP manuals in 22 languages.
language(s): Chinese, Czech, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 3.3 mio. words.

Prague Dependency Treebank (PDT)
The PDT contains dependency trees.
language(s): Czech
language mode: monolingual
language medium: written
availability: Linguistic Data Consortium (LDC), downloadable from corpus website
annotation(s): part of speech, syntax
size: 2 mio. words.

Prague English Dependency Treebank (PEDT)
The PEDT contains dependency trees derived from the "Wall Street Journal" corpus of the Penn Treebank.
language(s): English
language mode: monolingual
language medium: written
availability: Linguistic Data Consortium (LDC)
annotation(s): part of speech, syntax
size: 12440 trees

PukWaC
This corpus derives its data from the same source as the ukWaC. In addition, it contains full dependency parses.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 2 bio. words.

Reuters Corpus, Volume 1 (RCV1)
The RCV1 contains English news stories from 1996 to 1997.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 810000 news stories.

Reuters Corpus, Volume 2 (RCV2)
The RCV2 comprises news stories from 1996 to 1997 in several different languages (however, it is not a parallel corpus).
language(s): Chinese, Danish, Dutch, French, German, Italian, Japanese, Norwegian, Portuguese, Russian, Spanish, Swedish
language mode: multilingual – comparative
language medium: written
availability: downloadable from corpus website
annotation(s): none
size: 487000 news stories.

Schweizer Textkorpus (CHTK)
The CHTK contains texts from the 20th century.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 20 mio. words.

Spoken Turkish Corpus (demo version)
This corpus consists of recordings from radio conversations. The full version will be launched in October 2010.
language(s): Turkish
language mode: monolingual
language medium: spoken
availability:
annotation(s):
size:

Stockholm MULtilingual Treebank (SMULTRON)
SMULTRON is a parallel treebank. Sources: "Sophie's World", Press release ABB quarterly report Q2 2005, The Rainforest Alliance's Banana Certification Program, and SEB annual report 2004.
language(s): English, German, Swedish
language mode: multilingual – parallel
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 1000 sentences.

Tübinger Baumbank des Deutschen / Spontansprache (TüBa-D/S)
The treebank contains spontaneous dialogues.
language(s): German
language mode: monolingual
language medium: spoken
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 0.36 mio. words.

Tübinger Baumbank des Deutschen / Zeitungskorpus (TüBa-D/Z)
The treebank contains sentences from the German newspaper "die tageszeitung" ("taz").
language(s): German
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 0.7 mio. words.

Tübinger Baumbank des Englischen / Spontansprache (TüBa-E/S)
The treebank contains spontaneous dialogues.
language(s): English
language mode: monolingual
language medium: spoken
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 0.31 mio. words.

Tübinger Baumbank des Japanischen / Spontansprache (TüBa-J/S)
The treebank contains spontaneous dialogues.
language(s): Japanese
language mode: monolingual
language medium: spoken
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 0.16 mio. words.

Tübinger Partiell Geparstes Korpus des Deutschen / Schriftsprache (TüPP-D/Z)
This corpus contains instances from the German newspaper "die tageszeitung" ("taz").
language(s): German
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 200 mio. words.

Tagesspiegel-Corpus
This is a balanced corpus with texts from 1996 to 2005.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 170 mio. words.

TIGER-Baumbank für das Deutsche
The TIGER treebank features newspaper texts from the "Frankfurter Rundschau".
language(s): German
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 0.9 mio. words.

TIME Magazine Corpus
The corpus contains TIME Magazine texts from 1923 to the present.
language(s): English
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 100 mio. words.

ukWaC
The corpus was constructed from the content of websites in the .uk domain.
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech
size: 2 bio. words.

Vienna-Oxford International Corpus of English (VOICE)
VOICE contains spoken interactions of 1250 speakers of English as a lingua franca (ELF). The interactions include interviews, press conferences, seminar discussions, panels, and question-answer sessions.
language(s): English
language mode: monolingual
language medium: spoken
availability: query interface on corpus website
annotation(s): none
size:

WaCkypedia_EN
The corpus contains text from the English version of Wikipedia, syntactically annotated with the MaltParser (a dependency parser).
language(s): English
language mode: monolingual
language medium: written
availability: downloadable from corpus website
annotation(s): part of speech, syntax
size: 800 mio. words.

ZEIT-Corpus
This is a balanced corpus with texts from two periods: 1946 to 1988 and 1996 to 2007.
language(s): German
language mode: monolingual
language medium: written
availability: query interface on corpus website
annotation(s): part of speech
size: 160 mio. words.