Google Books Ngram Viewer - Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. Texts are in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.
Linguistic Data Consortium (LDC) - You'll need to create a login, select “University of Rochester, River Campus Libraries” (start typing and the name will appear). Your account will be approved by the Linguistics librarian within one business day, after which you will have LDC access.
arXiv Bulk Data Access - Repository of electronic preprints, known as e-prints, of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance. Owned and operated by Cornell University.
BioMed Central (includes Chemistry Central and Springer Open) - A growing corpus of peer-reviewed research articles (over 315,000 in December 2017), all of which are covered by an open access license agreement that allows free distribution and re-use of the full-text article, including the highly structured XML ver
Phonetic & Phonological Data
Linguistic Data Consortium (LDC) - You'll need to create a login, select “University of Rochester, River Campus Libraries” (start typing and the name will appear). Your account will be approved by the Linguistics librarian within one business day, after which you will have LDC access.
PHOIBLE - A repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 (2019) includes 3,020 inventories that contain 3,175 segment types found in 2,186 distinct languages.
UCLA Phonetics Lab Archive - The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.
WALS Online - The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.
Documenting the Now - DocNow responds to the public's use of social media for chronicling historically significant events as well as demand from scholars, students, and archivists, among others, seeking a user-friendly means of collecting and preserving this type of digital content. Click on Tools to access:
Hydrator - "Rehydrate" your Tweet ID sets into full tweets with metadata.
Tweet Catalog - A catalog of publicly shared Tweet ID sets. Add yours here!
Twarc - Archive Twitter JSON using this command-line tool.
Diff Engine - Track changes in news articles through their RSS feeds.
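
Tweet ID sets like those in the Tweet Catalog usually arrive as plain text files with one numeric ID per line, and it can help to tidy such a file before handing it to a hydration tool. The sketch below is only an illustration of that preparation step and is not part of any DocNow tool: the function names `clean_tweet_ids` and `batches` are hypothetical, and the default batch size of 100 is an assumption that should be checked against the limits of whichever tool or API you use.

```python
from typing import Iterable, Iterator, List


def clean_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Return unique, purely numeric IDs in first-seen order (hypothetical helper)."""
    seen = set()
    ids = []
    for line in lines:
        candidate = line.strip()
        # Skip blank lines, header rows, and anything that is not a plain number,
        # as well as IDs that have already been collected.
        if not candidate.isdigit() or candidate in seen:
            continue
        seen.add(candidate)
        ids.append(candidate)
    return ids


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs.

    The default of 100 is an assumed batch size, not a documented limit.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Made-up sample values standing in for the lines of an ID file.
    sample = ["1001", "1002", "", "id", "1001", " 1003 "]
    ids = clean_tweet_ids(sample)
    print(ids)
    print(list(batches(ids, size=2)))
```

IDs are kept as strings rather than converted to integers so that they pass through unchanged to whatever tool reads them next.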