
Selected Digitized Books Data Package

The text files in this dataset are derived from the Selected Digitized Books collection, which is a growing collection of selected books and other materials from the Library of Congress General Collections.

illustration of man working a printing press

About this dataset

About: This dataset comprises 166,218 .txt and JSON files containing full text from 90,414 books in the Selected Digitized Books collection on loc.gov. The text was created as part of digitization workflows using Optical Character Recognition (OCR) technologies. The dataset was created using the loc.gov JSON/YAML API to fetch the metadata and an internal workflow processing and data management application to pull the associated full text from an LCCN. The metadata comprises all of the selected digitized books (as of 2022-08-26) that had a date associated with the item record.
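The metadata step described above can be approximated against the public loc.gov JSON API. The following is a minimal sketch, not the workflow the dataset creators used: it builds a paginated request URL and extracts item identifiers from a parsed response. The collection slug passed in the usage note, the helper names `collection_url` and `item_ids`, and the assumption that each result carries an `id` field are illustrative; `fo=json`, `c` (results per page), and `sp` (page number) are query parameters of the loc.gov API.

```python
from urllib.parse import urlencode

# Base path for loc.gov collections.
BASE = "https://www.loc.gov/collections/"


def collection_url(slug, page=1, per_page=100):
    """Build a loc.gov API URL that returns one page of collection items as JSON."""
    query = urlencode({"fo": "json", "c": per_page, "sp": page})
    return f"{BASE}{slug}/?{query}"


def item_ids(response):
    """Collect the 'id' of each result in a parsed API response (a dict).

    Assumes the response has a top-level 'results' list; results without
    an 'id' key are skipped rather than raising an error.
    """
    return [r["id"] for r in response.get("results", []) if "id" in r]
```

For example, `collection_url("selected-digitized-books", page=2)` yields a URL that can be fetched with any HTTP client and parsed with `json.loads` before being passed to `item_ids`. The slug shown here is an assumption and should be checked against the collection's address on loc.gov.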

Source collection: Selected Digitized Books collection. This is a growing collection of selected books and other materials from the Library of Congress General Collections that have been made openly available. Most of the materials in this collection were published in the United States prior to the 1930s and are in English. The collection features thousands of works of fiction, including books intended for children, young adults, and other audiences. There are also some materials in foreign languages that were published in other countries.

Link to the image: https://www.loc.gov/resource/gdcmassbookdig.littleadventures00alls/?sp=25

What's included?

The data package includes:

  • A folder of 166,218 .txt and JSON files holding the full text of 90,414 selected digitized books.
    • As of 2022-09-27 there are 1,894 items from the metadata that do not have a corresponding full text extract. These will be updated as they are made available.
  • metadata.json: a JSON file containing the metadata for all 90,414 selected digitized books
  • metadata.csv: a CSV transformation of the original JSON metadata
  • manifest.txt: a text file listing the image id, MD5 hash, and location of the images in the dataset
  • README.md: technical overview of how the dataset was created
  • Data cover sheet: a more substantive overview of the data and the collection from which it is derived
  • sample data: 1,000 randomly selected items from the 90,414 set and their corresponding full text extracts have been provided as sample data. Included with this are a metadata.csv, metadata.json, and manifest.txt.
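Because manifest.txt records an MD5 hash for each entry, a downloaded file can be checked for corruption by recomputing its hash locally. The sketch below shows only that comparison. The layout of manifest.txt is not described here, so reading the expected hash out of the manifest is left to the reader; the function names `md5_of` and `matches` are hypothetical.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_md5):
    """Compare a file's MD5 with an expected hex string taken from the manifest.

    The expected value is stripped and lowercased so that whitespace or
    uppercase hex in the manifest (an assumption) does not cause a mismatch.
    """
    return md5_of(path) == expected_md5.strip().lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.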

Rights Statement

The books in this collection are in the public domain and are free to use and reuse.

Credit Line: Library of Congress

More about Copyright and other Restrictions.

For guidance about compiling full citations consult Citing Primary Sources.

Creator and Contributor Information

Dataset creator: Chase Dooley

README and Cover sheet creators and contributors: Eileen J. Manchester, Meghan Ferriter, Mark Cooper

Download the data

Sample the data

Get a sense of the corpus by downloading 1,000 OCRed text files randomly selected from the total 166,218 text files comprising this dataset.
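A comparable random sample can be drawn locally from a list of item identifiers. This is an illustrative sketch only; how the published 1,000-item sample was selected is not documented here, and the function name `sample_items` and the fixed seed are assumptions.

```python
import random


def sample_items(item_ids, k=1000, seed=0):
    """Draw k distinct items without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(list(item_ids), k)
```

Fixing the seed makes the selection repeatable, which helps when results computed on a sample need to be reproduced later.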

Download the OCRed text

  • To bulk download 166,218 .txt and JSON files containing full text from 90,414 selected digitized books (59.3 GB), please email [email protected] or consult the dataset's manifest.

Download the documentation

Download the metadata

Questions and Feedback

For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Main Reading Room via the Library's Ask a Librarian service at https://ask.loc.gov/history-humanities-social-sciences/.

For questions about download and access, please email the LC Labs Team at [email protected].