
* Linguistics: Corpora & Datasets



If you are interested in having the library purchase a dataset or getting access to a corpus that is not listed here, email me with the relevant information.


General Corpora

Medical/Scientific Corpora

  • arXiv Bulk Data Access - Repository of electronic preprints, known as e-prints, of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance. Owned and operated by Cornell University.
  • BioMed Central (includes Chemistry Central and Springer Open) - A growing corpus of articles (over 315,000 in December 2017) of peer-reviewed research, all of which are covered by an open access license agreement which allows free distribution and re-use of the full-text article, including the highly structured XML ver

Phonetic & Phonological Data

  • Linguistic Data Consortium (LDC) - To get access, create a login and select “University of Rochester, River Campus Libraries” (start typing and the name will appear). The Linguistics librarian will approve your account within one business day, after which you will have LDC access.
  • PHOIBLE - A repository of cross-linguistic phonological inventory data, extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 (2019) includes 3,020 inventories containing 3,175 segment types found in 2,186 distinct languages.
  • UCLA Phonetics Lab Archive - The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.
  • WALS Online - The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.

Social Media

  • Documenting the Now - DocNow responds to the public's use of social media for chronicling historically significant events as well as demand from scholars, students, and archivists, among others, seeking a user-friendly means of collecting and preserving this type of digital content. Click on Tools to access:
    • Hydrator - "Rehydrate" your Tweet ID sets into full tweets with metadata.
    • Tweet Catalog - A catalog of publicly shared Tweet ID sets. Add yours here!
    • Twarc - Archive Twitter JSON using this command-line tool.
    • Diff Engine - Track changes in news articles through their RSS feeds.
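
If you are new to Tweet ID sets, the preparation step before rehydration can be sketched in Python. This is only a minimal sketch and is not part of the DocNow tools: the function names `clean_tweet_ids` and `batch_ids` are hypothetical, and the default batch size of 100 is an assumption, so check the tool or service you use for its actual limit. The sketch deduplicates a list of IDs and groups them into batches; the rehydration itself is left to Hydrator or Twarc.

```python
from typing import Iterable, Iterator, List


def clean_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Return unique, numeric tweet IDs in first-seen order."""
    seen = set()
    ids = []
    for line in lines:
        candidate = line.strip()
        # Skip blank lines, non-numeric entries, and IDs already seen.
        if not candidate.isdigit() or candidate in seen:
            continue
        seen.add(candidate)
        ids.append(candidate)
    return ids


def batch_ids(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs.

    The default of 100 is an assumed limit, not one taken from any tool.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Hypothetical sample; in practice, read one ID per line from a text file.
    sample = ["1001", "1002", "", "1001", "not-an-id", "1003"]
    for batch in batch_ids(clean_tweet_ids(sample), size=2):
        print(batch)
```

Tweet IDs are kept as strings rather than integers so that they survive a round trip through spreadsheets and JSON without losing digits.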

Modern Languages & Cultures Librarian

Kristen Totleben
Rush Rhees Library, room 106