HathiTrust at the University of Rochester

Research Datasets

The HTRC Extracted Features Dataset (1.0) contains page-level features for 13.7 million public-domain and in-copyright volumes, including

  • part-of-speech tagged term token counts
  • header/footer identification
  • marginal character counts
  • line information for each page, such as number of lines with text and a count of characters for starting and ending lines
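As a rough illustration of how page-level features like these might be used, the sketch below collapses part-of-speech-tagged counts into per-term and per-tag totals. The record is invented for the example, and its field names (`tokenPosCount`, `beginLineChars`, `endLineChars`, and so on) are assumptions chosen to mirror the list above; the dataset's documentation is the authority on the actual file layout.

```python
from collections import Counter

# Hypothetical page record. The field names and values are illustrative
# assumptions, not the dataset's confirmed schema.
sample_page = {
    "seq": "00000001",
    "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
    "body": {
        "lineCount": 3,
        "tokenPosCount": {
            "the": {"DT": 4},
            "whale": {"NN": 2},
            "swims": {"VBZ": 1, "NNS": 1},
        },
        "beginLineChars": {"T": 2, "w": 1},
        "endLineChars": {"e": 1, ".": 2},
    },
    "footer": {"tokenPosCount": {"12": {"CD": 1}}},
}


def token_counts(page, section="body"):
    """Collapse part-of-speech-tagged counts into one count per term."""
    counts = Counter()
    for term, pos_map in page[section]["tokenPosCount"].items():
        counts[term] += sum(pos_map.values())
    return counts


def pos_totals(page, section="body"):
    """Total the counts for each part-of-speech tag in one page section."""
    counts = Counter()
    for pos_map in page[section]["tokenPosCount"].values():
        counts.update(pos_map)
    return counts


body_terms = token_counts(sample_page)
body_pos = pos_totals(sample_page)
print(body_terms.most_common(2))
print(dict(body_pos))
```

Keeping the header, body, and footer as separate sections means running heads and page numbers can be left out of word counts by reading only the body.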

Find out more at the HTRC Extracted Features Dataset webpage, which includes full documentation, a sample dataset, and links for downloading the data.

The HathiTrust Research Center links to additional tools, datasets, and information about workshops.


The HathiTrust Research Center's Bookworm tool charts trends in word use from 1500 to 2015 across hundreds of thousands of texts in HathiTrust. Filters are available for subject classification, fiction/non-fiction, genre, language, format, page and word counts, and publication information. Controls let you choose date ranges, metrics, and case sensitivity.

Bookworms based on other text collections are available at