Data for Exploration

Learn How To Find or Use Our Data

Explore the many ways the Library of Congress provides machine-readable access to its digital collections.

Library data guides and learning resources

Data Exploration Jupyter Notebooks The data-exploration repository includes Jupyter notebooks and example scripts using openly available Library of Congress Digital Collections or records.
- Accessing images for image analysis on Loc.gov
- Using the Loc.gov JSON to grab WWI Sheet Music
- Cats or dogs? An example of exploring the Chronicling America API
- Search the Library of Congress from your browser in one step
- Change image URLs to rotate and resize images on loc.gov
- Extracting location data from the loc.gov API for geovisualization with the Historic American Engineering Record
- Exploring the Meme Generator Metadata demonstrates some of the basic things that can be done with the set of data from memegenerator.
- Exploring the GIPHY.com Metadata demonstrates an intermediate approach to exploring the GIPHY.com data set produced by the Library of Congress.
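Several of the notebooks listed above rely on two URL conventions: adding `fo=json` to a loc.gov address to request a JSON response, and editing the size and rotation segments of an IIIF image URL. The following is a minimal sketch of both, not code taken from the notebooks. The helper names `loc_json_url` and `iiif_variant` and the identifier `EXAMPLE-ID` are hypothetical, and the image URL is assumed to follow the IIIF Image API pattern `.../{region}/{size}/{rotation}/{quality}.{format}`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def loc_json_url(url, **params):
    """Return `url` with fo=json (plus any extra query parameters) added."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update({key: str(value) for key, value in params.items()})
    query["fo"] = "json"  # ask loc.gov for JSON instead of HTML
    return urlunsplit(parts._replace(query=urlencode(query)))


def iiif_variant(image_url, size="full", rotation=0):
    """Rewrite the size and rotation segments of an IIIF Image API URL."""
    # Split off the last four path segments: region, size, rotation, filename.
    head, region, _size, _rotation, filename = image_url.rsplit("/", 4)
    return "/".join([head, region, size, str(rotation), filename])


# Build a search URL; `q` is the search term and `c` the results per page.
search_url = loc_json_url("https://www.loc.gov/search/", q="sheet music", c=25)
print(search_url)

# EXAMPLE-ID is a placeholder, not a real image identifier.
original = "https://tile.loc.gov/image-services/iiif/EXAMPLE-ID/full/pct:25/0/default.jpg"
print(iiif_variant(original, size="pct:50", rotation=90))
```

Passing the first URL to any HTTP client should return the search results as JSON. The second function only rewrites the string, so it can be checked without network access.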
Digital Scholarship at the Library of Congress: A Research Guide This research guide provides information about ways to access digital materials at the Library of Congress. The instructions in this guide can be used to access digital materials for purposes including, but not limited to, scholarly research, classroom instruction, creative practice, family history, civic engagement, connection to community, exploration of hobbies or passions, and more.
Datasets at the Library of Congress: A Research Guide This guide provides information about the collection of datasets at the Library of Congress, suggests tools for researchers, considers how datasets can be used for research, and provides guidance for locating datasets that may be sources for data science and machine learning projects.
Selected Datasets collection Datasets are increasingly a key digital resource used in a wide range of fields. The Library of Congress selects, preserves, and provides enduring access to datasets with the goal of cultivating a broad collection that encompasses all the areas covered by Library of Congress Collection Policy Statements.
The Signal Blog Initially created in 2011 to share updates on digital preservation efforts, the Signal blog has traced the evolution of digital practices at the Library over the course of a decade.
Story Maps at the Library of Congress Story Maps and web maps produced at the Library of Congress utilize geospatial technology to create curated entry points into our digital collections. We invite you to explore the incredible stories of the Library's collections through immersive narratives, multimedia, and interactive maps. Maps often include downloads of their underlying data sets.

Data across the Library

In addition to data packages, there are many datasets that you can explore across the Library. Many of these are part of the Library’s Selected Datasets collection.

An image of a tablet, computer, and mobile device with the Congress.gov website loaded on screen. The text is displayed below: Congress.gov United States Legislative Information

Congress.gov Bill Status Bulk Data Congressional bills, bill status, and bill summaries data are available from the U.S. Government Publishing Office (GPO)'s bulk data repository and govinfo API. Bill Status data includes all data from the existing Bill Summaries data set. Bill Status data references and complements the Congressional Bills data set.

Metadata format	Data format
.xml	.xml

Resources

Bill Status XML Bulk Data user guide (Guide)

Washington Times front cover detailing the use of new print press machines.

Chronicling America Bulk Data Chronicling America, which provides access to information about over 12 million (and growing) digitized historic newspaper pages from almost every U.S. state and territory, also provides bulk access to the underlying data sets. They are available in image, metadata, and OCR text batches.

Metadata format	Data files
.xml	Image files (.jp2, .pdf) and OCR text (.txt, .xml)

A handwritten and scripted entry in a diary

By the People Transcription Datasets By the People invites the public to transcribe, review, and tag digitized pages from the Library's collections. All transcriptions are made and reviewed by volunteers before they are returned to loc.gov, the Library's website. Data from retired completed campaigns are also made available as datasets for bulk download.

Metadata format	Data files
.csv	Full text transcripts (.csv)

Resources

Diving into Branch Rickey: Using a dataset of crowdsourced transcriptions as a tool for open research (Blog post, 2021-06-10)

A narrow corridor lined with racks of computer servers

Dot Gov Datasets The Dot Gov Datasets are the result of exploratory work conducted by the Library's Web Archiving Team to make the Web Archives more widely accessible and usable. These five datasets each contain information related to 1,000 or more files of related media types (i.e., image, PowerPoint, PDF, audio, and tabular data formats) selected from .gov domains in the Library's Web Archives.

Metadata	Metadata format	Data Files
7,000 records	.csv	7,000 files in multiple formats

Resources

The Magnificent Seven: Looking Back on a Year of Exploring the Web Archives Datasets (Blog post, 2020-02-18)
In the Library's Web Archives: 1,000 U.S. Government PowerPoint Slide Decks (Blog post, 2019-11-21)
In the Library's Web Archives: Dig If You Will the Pictures (Blog post, 2019-10-30)
In the Library's Web Archives: Totally Tabular Data (Blog post, 2019-05-29)
In the Library's Web Archives: US Government Audio on Shuffle (Blog post, 2019-05-09)
In the Library's Web Archives: Sorting through a Set of US Government PDFs (Blog post, 2019-03-06)
The Library of Congress Web Archives: Dipping a Toe in a Lake of Data (Blog post, 2019-01-09)

A screenshot that contains five thumbnail images of websites and a title that reads: About this collection

Giphy: collected datasets Collection includes a data set created on May 5, 2018, from crawls of the Library of Congress's Web Cultures Web Archive. This GIPHY dataset includes data for 14,787 total GIFs, of which 10,972 are unique.

Metadata	Metadata format
14,787 records	.csv
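The gap between total and unique records described above can be checked against the metadata CSV with a short script. This is a minimal sketch under stated assumptions: the column names `gif_id` and `title`, the sample rows, and the function `count_total_and_unique` are all hypothetical, so the real file's header row should be inspected first to pick the column that identifies a GIF.

```python
import csv
import io

# Hypothetical sample rows; the real CSV's column names may differ.
SAMPLE = """gif_id,title
abc123,Dancing cat
def456,Slow clap
abc123,Dancing cat
"""


def count_total_and_unique(csv_file, key):
    """Return (total rows, number of distinct values in column `key`)."""
    total = 0
    seen = set()
    for row in csv.DictReader(csv_file):
        total += 1
        seen.add(row[key])
    return total, len(seen)


total, unique = count_total_and_unique(io.StringIO(SAMPLE), "gif_id")
print(total, unique)  # prints "3 2" for the sample rows above
```

To run the same count on a downloaded file, an open file handle (for example `open(path, newline="", encoding="utf-8")`) can be passed in place of the `io.StringIO` object.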

Resources

Exploring the GIPHY.com Metadata (Jupyter Notebook, 2022-06-17)
Data Mining Memes in the Digital Culture Web Archive (Blog post, 2018-10-11)

Meme Generator: collected datasets Collection includes a data set created on May 5, 2018, from crawls of the Library of Congress's Web Cultures Web Archive. This dataset, memes-5-17, includes 86,310 total meme images, which represent 57,652 unique memes.

Metadata	Metadata format
86,310 records	.csv

Resources

Exploring the Memegenerator Metadata (Blog post, 2022-10-26)
Data Mining Memes in the Digital Culture Web Archive (Blog post, 2018-10-11)

A six panel comic that features computer-illustrated dinosaurs entitled: Compressed Self-Realization Comics

Dinosaur comics This dataset was generated from content harvested from the Library of Congress's web archive of qwantz.com (Dinosaur Comics!): https://www.loc.gov/item/lcwaN0009953/. It includes minimal metadata about 3,325 image objects from the Dinosaur Comics! web archive as well as the files themselves. This dataset was created as part of exploratory work done by the Library of Congress's Web Archiving Team.

Metadata	Metadata format	Data files
3,325 records	.csv

Resources

Let's Talk Comics: Comics as Data (Blog post, 2020-03-31)

The MARC Distribution Services Dataset The MARC Distribution Services Dataset is an export of MDSConnect, an openly available set of nearly 25 million MARC records that is split into 9 subsets: 1) serials, 2) maps, 3) music, 4) classification, 5) subjects, 6) books all, 7) computer files, 8) name authorities, and 9) visual materials. The records are available in two file formats: UTF8 and XML.

Data files
MARC records (.xml, .txt)
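As an illustration of working with the XML records, the sketch below pulls the title (field 245, subfield a) from each record using only the Python standard library. It assumes the XML files follow the MARCXML (MARC 21 "slim") schema and its namespace. The sample record and the function name `extract_titles` are invented for the example and do not come from the dataset.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARCXML (MARC 21 slim) schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">example0001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title /</subfield>
      <subfield code="c">by an example author.</subfield>
    </datafield>
  </record>
</collection>"""


def extract_titles(xml_text):
    """Return the 245$a title string of every record in a MARCXML collection."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iterfind("marc:record", MARC_NS):
        subfield = record.find(
            "marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
        )
        if subfield is not None and subfield.text:
            titles.append(subfield.text.strip())
    return titles


print(extract_titles(SAMPLE))  # prints ['An example title /']
```

This version loads the whole document into memory, which is fine for a sample. For the full subsets, a streaming parser such as `xml.etree.ElementTree.iterparse` is likely a better fit.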

Resources

Hack-to-Learn at the Library of Congress (Blog post, 2017-06-20)
Sample MARC dataset
README for MARC dataset
Library of Congress Lists (Blog post, 2017-05-22)