Explore the many ways the Library of Congress provides machine-readable
access to its digital collections.
Library data guides and learning resources
Data Exploration Jupyter Notebooks
The data-exploration repository includes Jupyter notebooks and
example scripts using openly available Library of Congress Digital
Collections or records.
Exploring the GIPHY.com Metadata
This resource demonstrates an intermediate approach to exploring the
GIPHY.com data set produced by the Library of Congress.
Digital Scholarship at the Library of Congress: A Research
Guide
This research guide provides information about ways to access
digital materials at the Library of Congress. The instructions in
this guide can be used to access digital materials for purposes
including, but not limited to, scholarly research, classroom
instruction, creative practice, family history, civic engagement,
connection to community, exploration of hobbies or passions, and
more.
Datasets at the Library of Congress: A Research Guide
This guide provides information about the collection of datasets at
the Library of Congress, suggests tools for researchers, considers
how datasets can be used for research, and provides guidance for
locating datasets that may be sources for data science and machine
learning projects.
Selected Datasets collection
Datasets are increasingly a key digital resource used in a wide
range of fields. The Library of Congress selects, preserves, and
provides enduring access to datasets with the goal of cultivating a
broad collection that encompasses all the areas covered by Library
of Congress Collection Policy Statements.
The Signal Blog
Initially created in 2011 to share information about digital preservation
efforts, the Signal blog has traced the evolution of digital
practices at the Library over the course of a decade.
Story Maps at the Library of Congress
Story Maps and web maps produced at the Library of Congress utilize
geospatial technology to create curated entry points into our
digital collections. We invite you to explore the incredible stories
of the Library's collections through immersive narratives,
multimedia, and interactive maps. Maps often include downloads of
their underlying data sets.
Data across the Library
In addition to data packages, there are many datasets that you can
explore across the Library. Many of these are part of the Library’s
Selected Datasets collection.
Congress.gov Bill Status Bulk Data
Congressional bills, bill status, and bill summaries data
are available from the U.S. Government Publishing Office
(GPO)'s bulk data repository and govinfo API. Bill Status
data includes all data from the existing Bill Summaries data
set. Bill Status data references and complements the
Congressional Bills data set.
Chronicling America Bulk Data
Chronicling America, which provides access to
information about over 12 million (and growing) digitized
historic newspaper pages from almost every U.S. state and
territory, also provides bulk access to the underlying data
sets. They are available in image, metadata, and OCR text
batches.
Metadata format: .xml
Data files: image files (.jp2, .pdf) and OCR text (.txt, .xml)
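As a rough illustration of how the file types listed above might be sorted once a batch has been downloaded, the sketch below groups file names by extension. The file names and the `group_by_kind` helper are hypothetical and are not taken from Chronicling America itself. Because .xml appears under both metadata and OCR text, the sketch keeps it in a separate bucket rather than guessing.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extension groups taken from the listing above. ".xml" is ambiguous
# (it may be metadata or OCR text), so it gets its own bucket.
KIND_BY_SUFFIX = {
    ".jp2": "image",
    ".pdf": "image",
    ".txt": "ocr_text",
    ".xml": "metadata_or_ocr_xml",
}


def group_by_kind(paths):
    """Group file paths into coarse kinds based on their extension."""
    groups = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        groups[KIND_BY_SUFFIX.get(suffix, "other")].append(path)
    return dict(groups)


# Hypothetical file names, for illustration only.
sample_paths = [
    "batch_example/0001.jp2",
    "batch_example/0001.pdf",
    "batch_example/0001.txt",
    "batch_example/0001.xml",
    "batch_example/README",
]

groups = group_by_kind(sample_paths)
for kind in sorted(groups):
    print(kind, len(groups[kind]))
```

Telling metadata XML apart from OCR XML would need more than the extension, for example a look at the root element of each file.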
By the People Transcription Datasets
By the People invites the public to transcribe, review, and
tag digitized pages from the Library's collections. All
transcriptions are made and reviewed by volunteers before
they are returned to loc.gov, the Library's website. Data
from completed, retired campaigns are also made available as
datasets for bulk download.
Dot Gov Datasets
The Dot Gov Datasets are the result of exploratory work
conducted by the Library's Web Archiving Team to make the
Web Archives more widely accessible and usable. These five
datasets each contain information related to 1,000 or more
files of related media types selected from .gov domains in
the Library's Web Archives (i.e., image, PowerPoint, PDF,
audio, and tabular data formats).
Giphy: collected datasets
Collection includes a data set created on May 5, 2018, from
crawls of the Library of Congress's Web Cultures Web
Archive. This GIPHY dataset includes data for 14,787 total
GIFs, of which 10,972 are unique.
Meme Generator: collected datasets
Collection includes a data set created on May 5, 2018, from
crawls of the Library of Congress's Web Cultures Web
Archive. This dataset, memes-5-17, includes data for 57,652
memes. The Meme Generator dataset includes 86,310 total
meme images, which represent 57,652 unique memes.
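The difference between total images and unique memes can be checked with a simple count over the records. The sketch below is only an illustration: the record layout and the `meme_id` key are invented for the example and are not the dataset's actual column names.

```python
from collections import Counter

# Hypothetical records: one row per harvested image. The "meme_id" key is
# an assumption for this example, not a documented field of the dataset.
records = [
    {"image_url": "a.jpg", "meme_id": "m1"},
    {"image_url": "b.jpg", "meme_id": "m1"},
    {"image_url": "c.jpg", "meme_id": "m2"},
    {"image_url": "d.jpg", "meme_id": "m3"},
    {"image_url": "e.jpg", "meme_id": "m3"},
]


def summarize(rows, key):
    """Return total rows, distinct key values, and the surplus between them."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "duplicates": total - unique}


summary = summarize(records, "meme_id")
print(summary)
```

Applied to the real data with the appropriate identifying column, the same count should reproduce the total and unique figures quoted above.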
Dinosaur comics
This dataset was generated from content harvested from the
Library of Congress's web archive of qwantz.com (Dinosaur
Comics!): https://www.loc.gov/item/lcwaN0009953/. It includes
minimal metadata about 3,325 image objects from the Dinosaur
Comics! web archive as well as the files themselves. This
dataset was created as part of exploratory work done by the
Library of Congress's Web Archiving Team.
The MARC Distribution Services Dataset
The MARC Distribution Services Dataset is an export of
MDSConnect, an openly available set of nearly 25 million
MARC records that is split into 9 subsets: 1) serials, 2)
maps, 3) music, 4) classification, 5) subjects, 6) books
all, 7) computer files, 8) name authorities, and 9) visual
materials. The records are available in two file formats:
UTF8 and XML.
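For readers working with the XML files, the following is a minimal sketch of reading record identifiers and titles. It assumes the files follow the standard MARCXML layout (the `http://www.loc.gov/MARC21/slim` namespace, with `controlfield` and `datafield` elements), which this page does not confirm. The sample record is invented for illustration and `read_titles` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace; assumed (not confirmed here) to match the export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">example-0001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An invented title :</subfield>
      <subfield code="b">for illustration only.</subfield>
    </datafield>
  </record>
</collection>"""


def read_titles(xml_text):
    """Return (control number, title) pairs from a MARCXML collection."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind("marc:record", MARC_NS):
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        # Field 245 holds the title statement; subfields a and b carry
        # the title proper and the remainder of the title.
        parts = [
            sub.text or ""
            for sub in record.iterfind(
                "marc:datafield[@tag='245']/marc:subfield", MARC_NS
            )
            if sub.get("code") in ("a", "b")
        ]
        control_number = control.text if control is not None else None
        results.append((control_number, " ".join(parts)))
    return results


titles = read_titles(SAMPLE)
print(titles)
```

For files as large as the full export, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading a whole file into memory.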