If you are interested in having the library purchase a dataset or getting access to a corpus that is not listed here, please email librarian Andrea Kingston with the relevant information.
Annotated links mainly meant for linguists and language teachers who work with corpora. Originally created by David Lee at the University of Wollongong (Australia), now maintained by independent researcher Martin Weisser.
Allows you to search for longer strings of words in the Google Books corpus and shows links, by year, to the books in which they appear. Offers the same corpora that are available in the Ngram Viewer.
The Corpus of Contemporary American English (COCA) is the only large and "representative" corpus of American English. The corpus contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages.
The River Campus Libraries provide access to downloadable files of the full COCA corpus for members of the UR community (login required). See the link above this one for the online version of COCA.
Charts the frequencies of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. Texts are in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.
You'll need to create a login, selecting “University of Rochester, River Campus Libraries” (start typing and the name will appear). Your account will be approved by the UR Linguistics Librarian, usually within one business day, after which you will have LDC access.
The MPI for Psycholinguistics stores the central digital archive for the DoBeS endangered languages documentation program and the NGT corpus, a collection of data from deaf signers using Dutch sign language (NGT; Nederlandse Gebarentaal).
Repository of electronic preprints, known as e-prints, of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance. Owned and operated by Cornell University.
A growing corpus of peer-reviewed research articles, all of which are covered by an open-access license that allows free distribution and reuse of the full-text article.
Designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond. This corpus shows what people are saying about coronavirus in online newspapers and magazines in 20 different English-speaking countries.
You'll need to create a login, selecting “University of Rochester, River Campus Libraries” (start typing and the name will appear). Your account will be approved by the UR Linguistics Librarian within one business day, after which you will have LDC access.
A repository of cross-linguistic phonological inventory data that have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3175 segment types found in 2186 distinct languages.
The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.
DocNow responds to the public's use of social media for chronicling historically significant events as well as demand from scholars, students, and archivists, among others, seeking a user-friendly means of collecting and preserving this type of digital content.
Click on Tools to access:
Hydrator - "Rehydrate" your Tweet ID sets into full tweets with metadata.
Tweet Catalog - A catalog of publicly shared Tweet ID sets. Add yours here!
Twarc - Archive Twitter JSON using this command-line tool.
Diff Engine - Track changes in news articles through their RSS feeds.
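As a sketch of how the Twarc tool above is typically used (assuming twarc has been installed via pip and Twitter API credentials have been set up with `twarc configure`; the file name `tweet_ids.txt` is a hypothetical example):

```shell
# Install twarc and configure your Twitter API credentials first:
#   pip install twarc
#   twarc configure
# Hydrate a file of Tweet IDs (one ID per line) into full tweets
# with metadata, saved as line-delimited JSON:
twarc hydrate tweet_ids.txt > tweets.jsonl
```

The resulting `tweets.jsonl` file contains one JSON object per tweet, which pairs naturally with the Tweet ID sets shared in the Tweet Catalog.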
A forum where academic linguists can discuss linguistic issues and exchange linguistic information. Overseen by the Department of Linguistics at Indiana University.