The text files in this dataset are derived from the Selected Digitized Books collection, which is a growing collection of selected books and other materials from the Library of Congress General Collections.
About: This dataset comprises 166,218 .txt and JSON files containing full text from 90,414 books in the Selected Digitized Books collection on loc.gov. The text was created as part of digitization workflows using Optical Character Recognition (OCR) technologies. The dataset was created using the loc.gov JSON/YAML API to fetch the metadata and an internal workflow processing and data management application to pull the associated full text from an LCCN. The metadata comprises all of the selected digitized books (as of 2022-08-26) that had a date associated with the item record.
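As a rough illustration of the metadata step described above, the following Python sketch builds a paged request against the loc.gov JSON API and keeps only item records that have a date. It is not the workflow the dataset creator used. The collection URL, the helper names, and the assumption that each result carries a `date` field are illustrative; the internal application that pulled the full text is not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public URL of the collection on loc.gov.
COLLECTION_URL = "https://www.loc.gov/collections/selected-digitized-books/"


def build_page_url(page, per_page=100):
    """Return the URL for one page of collection results in JSON form.

    fo=json asks for JSON output, c sets results per page, sp selects the page.
    """
    query = urlencode({"fo": "json", "c": per_page, "sp": page})
    return f"{COLLECTION_URL}?{query}"


def items_with_date(results):
    """Keep only item records with a non-empty 'date' field (assumed field name)."""
    return [item for item in results if item.get("date")]


def fetch_page(page, per_page=100):
    """Fetch one page of results and return its dated items (needs network access)."""
    with urlopen(build_page_url(page, per_page)) as response:
        payload = json.load(response)
    return items_with_date(payload.get("results", []))
```

Calling `fetch_page(1)` would request the first page of the collection. Walking through all pages, and pausing between requests to respect rate limits, is left out of this sketch.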
Source collection: Selected Digitized Books collection. This is a growing collection of selected books and other materials from the Library of Congress General Collections that have been made openly available. Most of the materials in this collection were published in the United States prior to the 1930s and are in English. The collection features thousands of works of fiction, including books intended for children, young adults, and other audiences. There are also some materials in foreign languages that were published in other countries.
Link to image on the right: https://www.loc.gov/resource/gdcmassbookdig.littleadventures00alls/?sp=25
The data package includes:
The books in this collection are in the public domain and are free to use and reuse.
Credit Line: Library of Congress
More about Copyright and other Restrictions.
For guidance about compiling full citations consult Citing Primary Sources.
Dataset creator: Chase Dooley
README and Cover sheet creators and contributors: Eileen J. Manchester, Meghan Ferriter, Mark Cooper
Get a sense of the corpus by downloading 1,000 OCRed text files randomly selected from the total 166,218 text files comprising this dataset.
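For readers who have downloaded the full dataset and want to draw a comparable sample themselves, a minimal Python sketch follows. It assumes the text files sit somewhere under a local directory and end in `.txt`. The function names, the directory layout, and the optional seed are illustrative, and this is not the procedure used to produce the sample offered above.

```python
import random
from pathlib import Path


def list_text_files(root):
    """Return all .txt files under root, sorted so results are reproducible."""
    return sorted(Path(root).rglob("*.txt"))


def sample_paths(paths, k=1000, seed=None):
    """Pick up to k distinct entries at random, without replacement.

    If fewer than k entries exist, all of them are returned in random order.
    """
    rng = random.Random(seed)
    ordered = sorted(paths)
    return rng.sample(ordered, min(k, len(ordered)))
```

For example, `sample_paths(list_text_files("dataset/"), k=1000, seed=42)` would return 1,000 file paths from a hypothetical `dataset/` directory. Fixing the seed makes the selection repeatable.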
For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Main Reading Room via the Library's Ask a Librarian service at https://ask.loc.gov/history-humanities-social-sciences/.
For questions about download and access, please email the LC Labs Team at [email protected].