If you are interested in having the library purchase a dataset or getting access to a corpus that is not listed here, email me with the relevant information.
- Bookmarks for Corpus-based Linguists by David Lee at the University of Wollongong (Australia)
- CALPER Corpus Portal - Pennsylvania State University Center for Advanced Language Proficiency, Education and Research
- corpus.byu.edu - Brigham Young University
- Google Books Ngram Viewer - Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed from 1500 to the present. Texts are in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.
- International Computer Archive of Modern and Medieval English (ICAME)
- International Corpus of English (ICE)
- Linguistic Data Consortium (LDC) - Create a login and select “University of Rochester, River Campus Libraries” (start typing and the name will appear). The Linguistics librarian will approve your account within one business day, after which you will have LDC access.
- The Linguist List: Texts & Corpora
- Max Planck Institute for Psycholinguistics
- OLAC: Open Language Archives Community
- arXiv Bulk Data Access - Repository of electronic preprints, known as e-prints, of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance. Owned and operated by Cornell University.
- BioMed Central (includes Chemistry Central and Springer Open) - A growing corpus of peer-reviewed research articles (over 315,000 as of December 2017), all covered by an open access license agreement that allows free distribution and re-use of the full-text article, including the highly structured XML ver
Phonetic & Phonological Data
- Linguistic Data Consortium (LDC) - Create a login and select “University of Rochester, River Campus Libraries” (start typing and the name will appear). The Linguistics librarian will approve your account within one business day, after which you will have LDC access.
- PHOIBLE - A repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3175 segment types found in 2186 distinct languages.
- UCLA Phonetics Lab Archive - The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.
- WALS Online - The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.
- Documenting the Now - DocNow responds to the public's use of social media for chronicling historically significant events as well as demand from scholars, students, and archivists, among others, seeking a user-friendly means of collecting and preserving this type of digital content. Click on Tools to access:
- Hydrator - "Rehydrate" your Tweet ID sets into full tweets with metadata.
- Tweet Catalog - A catalog of publicly shared Tweet ID sets. Add yours here!
- Twarc - Archive Twitter JSON using this command line tool.
- Diff Engine - Track changes in news articles through their RSS feeds.
Modern Languages & Cultures Librarian
Subjects: Chinese Language and Culture, Comparative Literature, East Asian Studies, French Language and Culture, German Language and Culture, Italian Language and Culture, Japanese Language and Culture, Korean Language and Culture, Languages and Cultures, Literary Translation Studies, Portuguese Language and Culture, Russian Language and Culture, Spanish & Latin American Language and Culture, Writing Program