LibGuides: HathiTrust at the University of Rochester: Research Data

Research Datasets

The HTRC Extracted Features Dataset (1.0) contains page-level features for 13.7 million public-domain and in-copyright volumes, including

part-of-speech tagged term token counts
header/footer identification
marginal character counts
line information for each page, such as the number of lines with text and counts of the characters that begin and end lines

Find out more at the HTRC Extracted Features Dataset webpage, which includes full documentation, a sample dataset, and links for downloading the data.
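
As a rough illustration of how page-level features like those listed above can be used, the following Python sketch sums part-of-speech tagged token counts across a volume while ignoring headers and footers. The record layout and field names used here (pages, header, body, footer, tokenPosCount, lineCount) are assumptions made for this example and are deliberately simplified; consult the dataset documentation for the actual schema before working with downloaded files.

```python
from collections import Counter

# Hypothetical, simplified volume record. The field names below
# (pages, header/body/footer, tokenPosCount, lineCount) are assumptions
# for illustration only; check the dataset documentation for the real schema.
volume = {
    "id": "example.volume",
    "pages": [
        {
            "seq": "00000001",
            "header": {"lineCount": 1, "tokenPosCount": {"CHAPTER": {"NN": 1}}},
            "body": {
                "lineCount": 2,
                "tokenPosCount": {"the": {"DT": 2}, "whale": {"NN": 1}},
            },
            "footer": {"lineCount": 0, "tokenPosCount": {}},
        },
        {
            "seq": "00000002",
            "header": {"lineCount": 0, "tokenPosCount": {}},
            "body": {
                "lineCount": 1,
                "tokenPosCount": {"whale": {"NN": 2}, "swims": {"VBZ": 1}},
            },
            "footer": {"lineCount": 1, "tokenPosCount": {"2": {"CD": 1}}},
        },
    ],
}


def body_token_counts(vol):
    """Sum (token, POS tag) counts over page bodies, skipping headers and footers."""
    totals = Counter()
    for page in vol["pages"]:
        for token, tags in page["body"]["tokenPosCount"].items():
            for tag, count in tags.items():
                # Lowercase tokens so "Whale" and "whale" are counted together.
                totals[(token.lower(), tag)] += count
    return totals


def body_line_count(vol):
    """Total number of body lines with text across all pages."""
    return sum(page["body"]["lineCount"] for page in vol["pages"])


counts = body_token_counts(volume)
print(counts.most_common(3))
print(body_line_count(volume))
```

Because headers and footers are stored separately from the body in this sketch, running heads and page numbers (such as "CHAPTER" or "2" in the sample) stay out of the word counts.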

The HathiTrust Research Center links to additional tools, datasets, and information about workshops.

Bookworm

The HathiTrust Research Center's Bookworm tool charts trends in word use from 1500 to 2015 across hundreds of thousands of texts in HathiTrust. Filters are available for subject classification, fiction/non-fiction, genre, language, format, page and word counts, and publication information. Controls let you choose the date range, the metric to plot, and whether searches are case-sensitive.

Bookworms based on other text collections are available at http://bookworm.culturomics.org/.

	without login	with UR login
Search full-text of all volumes
View full-text of non-copyright volumes
Download single page of non-copyright volumes (PDF image)
Download full volume of non-copyright volumes (PDF image)	X
Search within collections
Create and save your own collections	X

Chart courtesy of Syracuse University Libraries
