The goal of providing this data as a data package is to make the entirety of the collection that is currently accessible to the public available in a format that is both comprehensive and more easily digestible for computational and traditional research purposes. Note that the source collection, Selected Digitized Books, continues to develop and can be found on loc.gov.
The text data in this dataset was created from the images of the Selected Digitized Books using optical character recognition (OCR) technologies. This dataset focuses on full text from 90,414 selected digitized books. The original query used against the Library of Congress's API to generate the data is listed in the Compilation Methods section further down in the document.
The data package includes metadata.csv, metadata.json, and manifest.txt.
The data package will be made publicly available through an S3/CloudFront distribution on data.labs.loc.gov.
The full text in this dataset is derived from the Selected Digitized Books collection on loc.gov. This page provides contextual information to situate the content of the dataset in relation to the source material presented on loc.gov.
This dataset was created using the LOC JSON/YAML API to fetch the metadata and an internal workflow processing and data management application to pull the full text associated with each LCCN. The metadata comprises all of the selected digitized books (as of 2022-08-26) that had a date associated with them. The LOC API returns at most 100,000 objects for a query. As of 2022-08-26, there were over 118,000 selected digitized books, but only 90,414 had a date associated with them. To get around the API's maximum result limitation, only items with a date were gathered for this initial release. The additional items will be added in future updates to the dataset. The two queries used to gather this initial data were:
Dates before 1900: https://www.loc.gov/collections/selected-digitized-books/?c=150&dates=1000/1899&fa=access-restricted:false&fo=json
Dates from 1900 on: https://www.loc.gov/collections/selected-digitized-books/?c=150&dates=1900/2099&fa=access-restricted:false&fo=json
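The two queries above can be reproduced and paged through with a short script. The following is a minimal sketch, not the workflow actually used to build the dataset. The function names (build_query_url, fetch_json, iter_results) are hypothetical, and the sketch assumes that the "sp" parameter selects a page and that each JSON response contains a "results" list and a "pagination" object with a "next" URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Collection endpoint taken from the two queries listed above.
BASE_URL = "https://www.loc.gov/collections/selected-digitized-books/"


def build_query_url(date_range, per_page=150, page=1):
    """Build a query URL in the same shape as the two queries listed above."""
    params = {
        "c": per_page,                    # results per page
        "dates": date_range,              # e.g. "1000/1899"
        "fa": "access-restricted:false",  # facet filter from the source queries
        "fo": "json",                     # request JSON output
    }
    if page > 1:
        params["sp"] = page               # assumption: "sp" selects the page
    # Keep "/" and ":" unescaped so the URL matches the documented form.
    return BASE_URL + "?" + urlencode(params, safe="/:")


def fetch_json(url):
    """Fetch one page of results over HTTP and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_results(first_url, fetch=fetch_json):
    """Yield every result, following pagination links until none remain."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        # Assumption: the response has pagination.next, empty on the last page.
        url = (page.get("pagination") or {}).get("next")
```

Calling iter_results(build_query_url("1000/1899")) and then iter_results(build_query_url("1900/2099")) would walk both date ranges in turn. Passing a custom fetch function makes it possible to add rate limiting or to test the paging logic without network access.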
As noted above, 1,894 of the 90,414 items in the metadata do not currently have an extract associated with them. This set will be updated when those become available.
No preprocessing was done in order to create the dataset. All work that was done is described in the Compilation Methods section above.
The data package provides the following Content Advisory:
Please note that terminology in historical materials and in Library descriptions does not always match the language preferred by members of the communities depicted, and may include negative stereotypes or words that offend.
For questions or more information about this material, please contact Prints and Photographs Division staff through the Ask a Librarian service.
Please refer to the Content Advisory in the previous section.
The books in this collection are in the public domain and are free to use and reuse.
Credit Line: Library of Congress
More about Copyright and other Restrictions.
For guidance about compiling full citations consult Citing Primary Sources.