Selected Digitized Books Data Package README

Version information

Version 1.1. Last updated 2024-04-17.

Content Advisory

Please note that terminology in historical materials and in Library descriptions does not always match the language preferred by members of the communities depicted, and may include negative stereotypes or words that offend.

About the source data or collection

Brief description & background of collection

This is a growing collection of selected books and other materials from the Library of Congress General Collections that can be made openly available. Most of the materials in this collection were published in the United States and are in English. The collection features tens of thousands of works of nonfiction, fiction, and poetry. These works cover a wide range of subjects and topics, including American history, travel, sports, cooking, agriculture, children's literature, philosophy, government publications including speeches and addresses, local history and genealogy, film, and many more esoteric subjects from beekeeping to spiritualism. There are also some materials in foreign languages that were published in other countries. The materials in this collection can be read online or downloaded.

Original format

Printed books

Library of Congress reading room

Main Reading Room, https://www.loc.gov/rr/main/

Contact

For more information please contact the specialists in the Library's Main Reading Room at https://ask.loc.gov/history-humanities-social-sciences/.

Metadata type

MARC

Scale of description

Books in this collection have been digitized and described individually (at the item level). In some cases, multi-part works such as multi-volume books or serial publications are represented by a single descriptive record that relates several digitized resources.

Rights information

The books in this collection are in the public domain and are free to use and reuse.

Credit Line: Library of Congress

More about Copyright and other Restrictions.

For guidance about compiling full citations consult Citing Primary Sources.

Digitization information

This broad collection contains materials from a variety of digitization sources. Please direct questions about specific items to the specialists in the Main Reading Room.

Digitization was completed over more than a decade, representing multiple phases of priorities for what to select to digitize from the General Collections. Priority was often driven by identifying, where possible, works that were not already available online from other sources, and that may not be held by other libraries. Additionally, selection was driven by subject classification (genealogy, local history, other subjects) or to assist with support for moving collections to remote storage. For several years there was an emphasis on children’s and young adult literature (class PZ). As such, the collection is not a random sample of what was published in the US from the Library's General Collections, but represents the aggregation of several areas of digitization and preservation priorities and provides a unique cross section of the holdings of the Library.

About this exploratory data package

Selected Digitized Books consists of books and other materials from the Library of Congress General Collections that can be made openly available. Most of the materials in this collection were published in the United States and are in English. The collection features thousands of works of fiction, including books intended for children, young adults, and other audiences. There are also some materials in foreign languages that were published in other countries. The materials in this collection can be read online or downloaded.

Please note: the Selected Digitized Books digital collection continues to grow. Therefore, the text in this dataset may not constitute the entirety of text that could be derived from what may be available on loc.gov.

What's included?

The data package contains:

Computational readiness and possible uses

The text data available in this dataset was created from the images of the Selected Digitized Books using optical character recognition (OCR) technologies. The corpus lends itself to computational text analysis methods including, but not limited to, keyword analysis, named entity recognition, sentiment analysis, and topic modeling.
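As a minimal illustration of the keyword analysis mentioned above, the following Python sketch counts whole-word occurrences of a few terms in a block of text. The function name keyword_counts and the sample sentence are hypothetical and not part of the data package; real OCR text will usually need additional cleanup.

```python
import re
from collections import Counter


def keyword_counts(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    # Tokenize on runs of letters and apostrophes; digits, punctuation,
    # and stray OCR symbols act as separators and are discarded.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {kw: counts[kw.lower()] for kw in keywords}


# Hypothetical sample text standing in for one book's OCR output.
sample = "The bee-keeper kept bees. Bees need care; the BEES thrive."
print(keyword_counts(sample, ["bees", "care", "honey"]))
```

The same function can be applied to each full-text file in turn and the results aggregated per book or per decade.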

How was it created?

This dataset was created using the LOC JSON/YAML API to fetch the metadata and an internal workflow processing and data management application to pull the associated full text from an LCCN. The metadata comprises all of the selected digitized books (as of 2022-08-26) that had a date associated with them. The LOC API has a maximum result size of 100,000 objects. As of 2022-08-26, there were over 118,000 selected digitized books, but only 90,414 had a date associated with them. To work around the API's maximum result limitation, only items with a date were gathered for this initial release. The additional items will be added in future updates to the dataset. The two queries that were used to gather this initial data were:

As noted above, 1,894 items from the 90,414 items in the metadata do not currently have an extract associated with them. This set will be updated when those become available.
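For readers who want to retrieve comparable metadata themselves, the Python sketch below builds request URLs for the loc.gov JSON API. It is not a reconstruction of the queries used for this package. The collection URL, the use of the fo, dates, c, and sp query parameters, and the year ranges are all assumptions made for illustration; splitting requests by date range is simply one possible way to keep each result set under the 100,000-object limit. The sketch only constructs URLs and makes no network requests.

```python
from urllib.parse import urlencode

# Assumed collection URL on loc.gov; it is not stated in this README.
BASE_URL = "https://www.loc.gov/collections/selected-digitized-books/"


def build_query_url(start_year, end_year, page=1, per_page=100):
    """Build a JSON API URL for one page of results within a date range."""
    params = {
        "fo": "json",                         # ask for a JSON response
        "dates": f"{start_year}/{end_year}",  # assumed date-range filter
        "c": per_page,                        # results per page
        "sp": page,                           # page number, starting at 1
    }
    # Keep "/" unescaped so the date range stays readable in the URL.
    return BASE_URL + "?" + urlencode(params, safe="/")


# Hypothetical year ranges, chosen only to show the splitting idea.
for url in (build_query_url(1000, 1899), build_query_url(1900, 2022)):
    print(url)
```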

Dataset field descriptions

The data fields that follow are directly translated from the metadata.json file. The JSON file is highly nested in nature, and that nested structure is not strictly carried over into the CSV. The CSV data fields contain the top level keys and, where applicable, one nested level below. In these cases, the field names are signified by the top level key.secondary key; for example: item.call_number.
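The naming convention described above can be sketched in Python as follows. This is an illustration under assumptions, not the script used to produce the CSV: the function name flatten_one_level and the sample record are hypothetical, and the exact treatment of deeper nesting in the real CSV may differ.

```python
def flatten_one_level(record):
    """Flatten a nested dict by one level, joining keys with a dot."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Promote each second-level key to "top.secondary".
            # Anything nested deeper is kept as a single value.
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Hypothetical record; the values are invented for illustration.
record = {
    "title": "Example title",
    "item": {"call_number": ["PZ3 .E9"], "notes": {"a": 1}},
}
print(flatten_one_level(record))
```

Here the nested item.call_number value becomes its own column, while the deeper item.notes dictionary stays in one cell.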

All values in each column are stored as they would be found in the JSON metadata. This means that if a column's value is a list or array, it is stored as a string representation of that value. For example, the aka field's value is in list format: ['http://www.loc.gov/item/2015651359/', 'http://www.loc.gov/pictures/item/2015651359/', 'http://www.loc.gov/pictures/collection/stereo/item/2015651359/', 'http://hdl.loc.gov/loc.pnp/stereo.1s04563', 'http://hdl.loc.gov/loc.pnp/stereo.2s04563', 'http://www.loc.gov/resource/stereo.1s04563/', 'http://www.loc.gov/resource/stereo.2s04563/', 'http://lccn.loc.gov/2015651359']
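When reading the CSV, such cells arrive as plain strings and need to be converted back into lists before use. The Python sketch below shows one way to do this with the standard library's ast.literal_eval. It assumes the string follows the single-quoted list syntax shown in the aka example; the function name parse_list_field is hypothetical.

```python
import ast


def parse_list_field(cell):
    """Convert a stringified list back to a list; return other values unchanged."""
    if isinstance(cell, str) and cell.strip().startswith("["):
        try:
            value = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            # Not a valid literal (for example, a title that begins with "[").
            return cell
        return value if isinstance(value, list) else cell
    return cell


# Shortened version of the aka example above.
cell = "['http://www.loc.gov/item/2015651359/', 'http://lccn.loc.gov/2015651359']"
urls = parse_list_field(cell)
print(len(urls), urls[-1])
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for data read from a file. json.loads would reject this format because the strings are single-quoted.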

Each of the fields described below appears for a result under the content.results section of the API response for this query. Please note that not all elements appear for each result. Elements appearing in only some results have been marked with an asterisk.

The following item subfields of the content.results section are mainly for display of the item on the loc.gov website. These subfields may pull information from target-specific interpretations of MARC records.

Rights Statement

The books in this collection are in the public domain and are free to use and reuse.

Credit Line: Library of Congress

More about Copyright and other Restrictions.

For guidance about compiling full citations consult Citing Primary Sources.

Creator and contributor information

Creator: Chase Dooley

Contributors: Eileen J. Manchester, Meghan Ferriter, Mark Cooper

Contact information

Please contact [email protected] with any questions or suggestions!