
Selected Digitized Books Data Package

Illustration of a man using a printing press with caption: Pulling the Great Archimedean Lever — Little adventures in newspaperdom

About this dataset

This dataset comprises 84,058 files containing full text from 90,414 books in the Selected Digitized Books collection on loc.gov. The text was created as part of digitization workflows using Optical Character Recognition (OCR) technologies. To build the dataset, the metadata was fetched with the loc.gov JSON/YAML API, and the associated full text was pulled by Library of Congress Control Number (LCCN) using an internal workflow processing and data management application. The metadata covers all of the selected digitized books (as of 2022-08-26) that had a date associated with the item record.
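As a rough illustration of fetching metadata from the loc.gov JSON API, the sketch below builds a paged query URL and extracts item identifiers from a decoded response. It is not the workflow the Library used to build this package. The collection URL, the helper names, and the assumption that each result carries an `id` field are illustrative; `fo=json`, `c`, and `sp` are the API's output-format, page-size, and page-number query parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public URL of the source collection on loc.gov.
COLLECTION_URL = "https://www.loc.gov/collections/selected-digitized-books/"


def build_query_url(base_url, page=1, per_page=100):
    """Return a loc.gov API URL asking for one page of results as JSON."""
    # fo=json selects JSON output; c is items per page; sp is the page number.
    params = {"fo": "json", "c": per_page, "sp": page}
    return base_url + "?" + urlencode(params)


def fetch_page(url, timeout=30):
    """Download and decode one page of results (makes a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def item_ids(page_data):
    """Extract item identifiers from one decoded page of results."""
    # Hypothetical assumption: each entry in "results" has an "id" key.
    return [item["id"] for item in page_data.get("results", []) if "id" in item]
```

Calling `fetch_page(build_query_url(COLLECTION_URL))` would request the first page; for anything beyond a quick look, the prepared metadata files in this package are likely the more practical starting point.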

View source collection Browse collection items

Metadata	Metadata formats	Data files
90,414 records	.csv, .json	84,058 full text files (.txt, .json)

Data package documentation

Included in this data package is comprehensive documentation of source data or collection provenance, the contents of the data package, and how the data package was created. Here are some particular sections of interest as well as a link to the full documentation:

View the documentation

Dataset at a glance

How to access and use this data package

There are two main options for accessing and using this data package: (1) downloading files directly from this page and (2) using Python for more advanced tasks.

Direct downloads

The following list outlines the contents of this data package. Many of the individual files in the package are linked directly on this page, so you can download them and use them immediately. Zipped files are available for bulk download of all or part of the data package.

Sample the data	sample-data.zip (240.4 MB) - 1,000 OCRed text and JSON files randomly selected from the set. Included with this are a metadata.csv, metadata.json, and manifest.txt.
	sample-data/metadata.json (3.6 MB) - A JSON file containing the metadata for the 1,000 sample items
	sample-data/metadata.csv (1.3 MB) - A CSV transformation of the sample JSON metadata
	sample-data/manifest.html - For downloading individual OCRed text files, this is a simple page that lists each text and JSON file's file id, item id, MD5 hash (base64), file size, and URL
	sample-data/manifest.json (147.8 KB) - A JSON file listing each OCRed text file's file id, item id, MD5 hash (base64), file size, and URL
Download the documentation	README.html - An overview of the source data or collection provenance, the contents of the data package, and how the data package was created
	README.md (14.0 KB) - README as a Markdown text file
	README.pdf (20.0 KB) - README as a PDF file
	dpp.html - The data processing plan
	dpp.md (6.0 KB) - The data processing plan as a Markdown text file
	dpp.pdf (10.6 KB) - The data processing plan as a PDF file
Download the metadata	metadata.json (323.9 MB) - A JSON file containing the metadata for all 90,414 selected digitized books
	metadata.jsonl (325.0 MB) - A JSON Lines version of the JSON data, with one record per line, useful for processing large files
	metadata.csv (118.8 MB) - A CSV transformation of the original JSON metadata
	README.html#dataset-field-descriptions - Metadata field descriptions
Download the OCRed text	manifest.html - For downloading individual OCRed text files, this is a simple page that lists each OCRed text file's file id, item id, MD5 hash (base64), file size, and URL. For bulk downloads, refer to the Using Python section that follows.
	manifest.txt (11.0 MB) - A text file listing each OCRed text file's file id, item id, MD5 hash (base64), file size, and URL
	manifest.json (12.1 MB) - A JSON file listing each OCRed text file's file id, item id, MD5 hash (base64), file size, and URL
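Because metadata.jsonl holds one record per line, it can be read one record at a time instead of loading the whole file into memory. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes each non-blank line is a complete JSON value.

```python
import json


def iter_records(path):
    """Yield one metadata record at a time from a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip any blank lines
                yield json.loads(line)


def count_records(path):
    """Count records without holding them all in memory."""
    return sum(1 for _ in iter_records(path))
```

For example, `count_records("metadata.jsonl")` walks the file once; the same generator can feed a filter or an aggregation without changing its memory footprint.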

Using Python

While direct downloads are more convenient for most activities, users who are comfortable writing Python can perform more advanced and complex tasks programmatically.

For your convenience we developed a number of Jupyter Notebooks to help get you started.

View the Python notebook for this data package

Bulk downloads using Python

For bulk downloads, refer to this Python script for downloading files in bulk. Sample commands for this data package:

Download all OCR'd text data


    python bulk_download.py --package "https://data.labs.loc.gov/digitized-books/" --out "output/digitized-books/"
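Since the manifest lists an MD5 hash (base64) for each file, a downloaded file can be checked against its manifest entry. The sketch below assumes the manifest value is the base64 encoding of the raw 16-byte MD5 digest, which is the usual meaning of that label; the function names are illustrative and are not part of bulk_download.py.

```python
import base64
import hashlib


def md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_manifest(path, expected_md5_base64):
    """Compare a local file's hash with the value taken from the manifest."""
    return md5_base64(path) == expected_md5_base64.strip()
```

A mismatch usually means an incomplete or corrupted download, so fetching the file again is a reasonable first step.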

Dataset details

Source collection	Selected Digitized Books collection. The Selected Digitized Books collection is a growing collection of selected books and other materials from the Library of Congress General Collections that have been made openly available. Most of the materials in this collection were published in the United States prior to the 1930s and are in English. The collection features thousands of works of fiction, including books intended for children, young adults, and other audiences. There are also some materials in foreign languages that were published in other countries.
Rights statement	The books in this collection are in the public domain and are free to use and reuse. Credit Line: Library of Congress. More about Copyright and other Restrictions. For guidance about compiling full citations, consult Citing Primary Sources.
Date created	2022-09-27
Date updated	2024-04-03
Creators & contributors	Dataset creator: Chase Dooley
	README and cover sheet creators and contributors: Eileen J. Manchester, Meghan Ferriter, Mark Cooper
Cite this dataset	Chicago citation style: Library of Congress. Selected Digitized Books Data Package. [Washington, D.C.: Library of Congress, 2022] Software, E-Resource. https://data.labs.loc.gov/digitized-books/.
	APA citation style: Library of Congress. (2022) Selected Digitized Books Data Package. [Washington, D.C.: Library of Congress] [Software, E-Resource] Retrieved from the Library of Congress, https://data.labs.loc.gov/digitized-books/.
	MLA citation style: Library of Congress. Selected Digitized Books Data Package. [Washington, D.C.: Library of Congress, 2022] Software, E-Resource. Retrieved from the Library of Congress, <https://data.labs.loc.gov/digitized-books/>.
Curatorial questions	For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Main Reading Room via the Library's Ask a Librarian service at https://ask.loc.gov/history-humanities-social-sciences/.
Access questions	For questions and technical issues about download and access, please submit a ticket on GitHub or email the LC Labs Team at [email protected].