If you are interested in having the library purchase a dataset or getting access to a corpus that is not listed here, please email librarian Andrea Kingston with the relevant information.
Annotated links mainly meant for linguists and language teachers who work with corpora. Originally created by David Lee at the University of Wollongong (Australia), now maintained by independent researcher Martin Weisser.
Allows you to search for longer strings of words in the Google Books corpus and shows links, by year, to the books in which they appear. Offers the same corpora that are available in the Ngram Viewer.
The Corpus of Contemporary American English (COCA) is the only large and "representative" corpus of American English. The corpus contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages.
The River Campus Libraries provide access to downloadable files of the full COCA corpus for members of the UR community (login required). See the link above this one for the online version of COCA.
Charts the frequencies of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. Texts are in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.
You'll need to create a login, selecting “University of Rochester, River Campus Libraries” (start typing and the name will appear). Your account will be approved by the UR Linguistics Librarian, usually within one business day, after which you will have LDC access.
The MPI for Psycholinguistics stores the central digital archive for the DoBeS endangered languages documentation program and the NGT corpus, a collection of data from deaf signers using Dutch sign language (NGT; Nederlandse Gebarentaal).
Repository of electronic preprints, known as e-prints, of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance. Owned and operated by Cornell University.
A growing corpus of peer-reviewed research articles, all of which are covered by an open-access license that allows free distribution and reuse of the full-text article.
Designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond. This corpus shows what people are saying about coronavirus in online newspapers and magazines in 20 different English-speaking countries.
You'll need to create a login, selecting “University of Rochester, River Campus Libraries” (start typing and the name will appear). Your account will be approved by the UR Linguistics Librarian within one business day, after which you will have LDC access.
A repository of cross-linguistic phonological inventory data that have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3175 segment types found in 2186 distinct languages.
The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.
DocNow responds to the public's use of social media for chronicling historically significant events as well as demand from scholars, students, and archivists, among others, seeking a user-friendly means of collecting and preserving this type of digital content.
Click on Tools to access:
Hydrator - "Rehydrate" your Tweet ID sets into full tweets with metadata.
Tweet Catalog - A catalog of publicly shared Tweet ID sets. Add yours here!
Twarc - Archive Twitter JSON using this command-line tool.
Diff Engine - Track changes in news articles through their RSS feeds.
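As a sketch of how the Twarc tool above is typically used (assuming twarc has been installed via pip and Twitter API credentials have been set up with `twarc configure`; the file name `tweet_ids.txt` is a hypothetical example):

```shell
# Install twarc and configure your Twitter API credentials first:
#   pip install twarc
#   twarc configure
# Hydrate a file of Tweet IDs (one ID per line) into full tweets
# with metadata, saved as line-delimited JSON:
twarc hydrate tweet_ids.txt > tweets.jsonl
```

The resulting `tweets.jsonl` file contains one JSON object per tweet, which pairs naturally with the Tweet ID sets shared in the Tweet Catalog.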
A forum where academic linguists can discuss linguistic issues and exchange linguistic information. Overseen by the Department of Linguistics at Indiana University.