
Selected Digitized Books Data Package

This dataset comprises 84,058 files containing full text from 90,414 books in the Selected Digitized Books collection on loc.gov. The text was created as part of digitization workflows using Optical Character Recognition (OCR) technologies.

Image: illustration of a man using a printing press, captioned "Pulling the Great Archimedean Lever" (from Little adventures in newspaperdom)

About this dataset

The dataset was created using the loc.gov JSON/YAML API to fetch the metadata and an internal workflow processing and data management application to pull the associated full text for each item by its LCCN (Library of Congress Control Number). The metadata covers all of the selected digitized books (as of 2022-08-26) that had a date associated with the item record.
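
As a rough illustration of what a metadata request to the loc.gov API can look like, the sketch below builds a paged JSON query URL. It is not the workflow used to build this package. The collection URL is an assumption, and the query parameters (fo for the response format, c for results per page, sp for the page number) should be checked against the current loc.gov API documentation.

```python
from urllib.parse import urlencode

# Assumed address of the source collection; verify it on loc.gov before use.
COLLECTION_URL = "https://www.loc.gov/collections/selected-digitized-books/"


def build_query_url(base_url, page=1, per_page=100):
    """Return a URL asking the loc.gov API for one page of results as JSON."""
    params = {"fo": "json", "c": per_page, "sp": page}
    return f"{base_url}?{urlencode(params)}"


# Build (but do not send) a request for the second page of results.
print(build_query_url(COLLECTION_URL, page=2))
```

The function only assembles the URL, so it can be inspected or tested without any network access.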


Metadata: 90,414 records
Metadata formats: .csv, .json
Data files: 84,058 full text files (.txt, .json)

Data package documentation

Included in this data package is comprehensive documentation of source data or collection provenance, the contents of the data package, and how the data package was created. Here are some particular sections of interest as well as a link to the full documentation:

View the documentation

Dataset at a glance

How to access and use this data package

There are two main options for accessing and using this data package: (1) downloading files directly from this page and (2) using Python for more advanced use.

Direct downloads

The following list outlines the contents of this data package. Many of the individual files in the data package are linked directly on this page, so you can download and use them immediately. Zipped files are available for bulk download of all or part of the data package.

Sample the data
  • sample-data.zip (240.4 MB) - 1,000 OCRed text and JSON files randomly selected from the set. Included with this are a metadata.csv, metadata.json, and manifest.txt.
  • sample-data/metadata.json (3.6 MB) - A JSON file containing the metadata for the 1,000 sample items
  • sample-data/metadata.csv (1.3 MB) - A CSV transformation of the sample JSON metadata
  • sample-data/manifest.html - For downloading individual OCRed text files, this is a simple page that lists the file id, item id, MD5 hash (base64), file size, and URL of each text and JSON file
  • sample-data/manifest.json (147.8 KB) - A JSON file listing the file id, item id, MD5 hash (base64), file size, and URL of each OCRed text file
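
A quick way to get oriented in the sample metadata is to list its columns and count its rows. The sketch below does this with the standard library only. The column names in the demo data are hypothetical stand-ins, not the actual fields of metadata.csv; in practice you would pass an open handle to the downloaded file instead.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column names, row count) for an open CSV file."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)
    return reader.fieldnames, row_count


# Stand-in for: open("sample-data/metadata.csv", newline="", encoding="utf-8")
# The columns below are hypothetical and only illustrate the call.
demo = io.StringIO(
    "item_id,title,date\n"
    "001,Example book,1901\n"
    "002,Another book,1912\n"
)
columns, rows = summarize_csv(demo)
print(columns, rows)
```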
Download the documentation
  • README.html - An overview of the source data or collection provenance, the contents of the data package, and how the data package was created.
  • README.md (14.0 KB) - README as a Markdown text file
  • README.pdf (20.0 KB) - README as a PDF file
  • dpp.html - The data processing plan
  • dpp.md (6.0 KB) - The data processing plan as a Markdown text file
  • dpp.pdf (10.6 KB) - The data processing plan as a PDF file
Download the metadata
Download the OCRed text
  • manifest.html - For downloading individual OCRed text files, this is a simple page that lists the file id, item id, MD5 hash (base64), file size, and URL of each OCRed text file. For bulk downloads, refer to the Using Python section that follows.
  • manifest.txt (11.0 MB) - A text file listing the file id, item id, MD5 hash (base64), file size, and URL of each OCRed text file
  • manifest.json (12.1 MB) - A JSON file listing the file id, item id, MD5 hash (base64), file size, and URL of each OCRed text file
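
Because the manifests record each file's MD5 hash in base64, a downloaded file can be checked against its manifest entry. The sketch below shows one way to compute a hash in that form. The helper names are illustrative, and it assumes the manifest value is the standard base64 encoding of the raw 16-byte MD5 digest, which is worth confirming against one known file first.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Return the MD5 digest of data, encoded as base64 text."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches_manifest(data: bytes, expected_md5_b64: str) -> bool:
    """Compare a file's contents against the hash listed in a manifest."""
    return md5_base64(data) == expected_md5_b64


# In practice, data would come from Path("some-file.txt").read_bytes().
sample = b"example OCR text"
print(matches_manifest(sample, md5_base64(sample)))
```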

Using Python

While direct downloads are more convenient for most activities, users familiar with Python can perform more advanced and complex tasks programmatically.

For your convenience we developed a number of Jupyter Notebooks to help get you started.

View the Python notebook for this data package

Bulk downloads using Python

For bulk downloads, refer to this Python script for downloading files in bulk. Sample command for this data package:

Download all OCR'd text data

python bulk_download.py --package "https://data.labs.loc.gov/digitized-books/" --out "output/digitized-books/"
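
If you only need a handful of files, a short script over the URLs listed in a manifest may be enough. The following is a minimal sketch under stated assumptions, not the implementation of bulk_download.py: the helper names are hypothetical, the example URL is a placeholder, and there is no retry, resume, or rate limiting.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def local_path_for(url: str, out_dir: str) -> Path:
    """Map a file URL to a path inside out_dir, keeping only the file name."""
    name = Path(urlparse(url).path).name
    return Path(out_dir) / name


def download(url: str, out_dir: str) -> Path:
    """Fetch one file and write it under out_dir."""
    target = local_path_for(url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        target.write_bytes(response.read())
    return target


# Placeholder URL; real URLs would be read from manifest.json or manifest.txt.
print(local_path_for("https://example.org/files/abc123.txt", "output/digitized-books"))
```

Calling download() for each URL in a manifest, ideally with a short pause between requests, would fetch the files one by one.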

Dataset details

Source collection

Selected Digitized Books collection

The Selected Digitized Books collection is a growing collection of selected books and other materials from the Library of Congress General Collections that have been made openly available. Most of the materials in this collection were published in the United States prior to the 1930s and are in English. The collection features thousands of works of fiction, including books intended for children, young adults, and other audiences. There are also some materials in foreign languages that were published in other countries.

Rights statement The books in this collection are in the public domain and are free to use and reuse.

Credit Line: Library of Congress

More about Copyright and other Restrictions.

For guidance about compiling full citations, consult Citing Primary Sources.
Date created 2022-09-27
Date updated 2024-04-03
Creators & contributors
Dataset creator:
Chase Dooley
README and Cover sheet creators and contributors:
Eileen J. Manchester
Meghan Ferriter
Mark Cooper
Cite this dataset
Chicago citation style:
Library of Congress. Selected Digitized Books Data Package. [Washington, D.C.: Library of Congress, 2022] Software, E-Resource. https://data.labs.loc.gov/digitized-books/.
APA citation style:
Library of Congress. (2022) Selected Digitized Books Data Package. [Washington, D.C.: Library of Congress] [Software, E-Resource] Retrieved from the Library of Congress, https://data.labs.loc.gov/digitized-books/.
MLA citation style:
Library of Congress. Selected Digitized Books Data Package. [Washington, D.C.: Library of Congress, 2022] Software, E-Resource. Retrieved from the Library of Congress, <data.labs.loc.gov/digitized-books/>.
Curatorial questions For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Main Reading Room via the Library's Ask a Librarian service at https://ask.loc.gov/history-humanities-social-sciences/.
Access questions For questions and technical issues about download and access, please submit a ticket on GitHub or email the LC Labs Team at [email protected].