
United States Elections, Web Archives Data Package

The data package consists of 396,117 CDX index files from the United States Elections Web Archive, which includes campaign websites and related web content documenting presidential, congressional, and gubernatorial elections that were archived weekly during general election seasons. The data package currently covers the years 2000 – 2016.

[Figure: Example of a CDX file (screenshot of a sample CDX file opened in a text editor)]

About this dataset

This data package is maintained by the Library of Congress's Web Archiving Program. It updates a time-limited dataset initially released in 2022, which is described in more detail in its accompanying blog post. Some of the data in this collection (metadata.csv) is publicly available via the loc.gov API. The CDX files are available only through this data package.


Metadata: 396,117 index files, 1 descriptive metadata file
Metadata formats: .cdx.gz, .csv
Data files: Web archived documents are not included within the data package, but the CDX files provide pointers for download.

Data package documentation

This data package includes comprehensive documentation of the source data or collection provenance, the contents of the data package, and how the data package was created. The full documentation is linked here:

View the documentation

Dataset at a glance

How to access and use this data package

There are two main options for accessing and using this data package: (1) downloading files directly from this page and (2) using Python for more advanced tasks.

Direct downloads

The following list outlines the contents of this data package. Many of the individual files are linked directly on this page, so you can download and use them immediately. Zipped files are available for bulk download of all or part of the data package.

Sample the data
  • sample-data.zip (65.7 MB) - 100 randomly selected items from the 396,117 CDX index files have been provided as sample data.
  • sample-data/manifest.html - A simple page for downloading individual CDX files, listing each CDX file's file id, item id, MD5 hash (base64), file size, and URL
  • sample-data/manifest.json (34.9 KB) - A JSON file listing each CDX file's file id, item id, MD5 hash (base64), file size, and URL
Download the documentation
  • README.html - An overview of the source data or collection provenance, the contents of the data package, and how the data package was created.
  • README.md (37.9 KB) - README as a Markdown text file
  • README.pdf (63.3 KB) - README as a PDF file
Download the descriptive metadata
  • metadata.csv (8.2 MB) - A CSV file of general election candidate websites in the United States Elections Web Archive collection 2000 - 2016, pulled from publicly-accessible metadata on loc.gov
  • metadata.json (12.6 MB) - A JSON file containing the metadata for all 2,610 items
  • metadata.jsonl (12.6 MB) - A JSON lines version of the JSON data, with one record per line, useful for processing large files
  • README.html#metadatacsv - Field descriptions for metadata.csv
  • README.html#cdx-files - Field descriptions for the CDX index files
Download the CDX index files
  • manifest.txt (127.2 MB) - A text file listing each CDX filename, MD5 hash (base64), file size, and URL
  • manifest.json (132.9 MB) - A JSON file listing each CDX filename, MD5 hash (base64), file size, and URL. For bulk downloads, refer to the following Using Python section.
Browse CDX index files by year
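
The one-record-per-line layout of metadata.jsonl noted in the list above lends itself to streaming. The following is a minimal sketch, not the Library's own tooling: it assumes metadata.jsonl has already been downloaded, and the function names iter_jsonl and count_field_names are hypothetical. It makes no assumption about the field names inside each record; it only reports which top-level keys occur.

```python
import json
from collections import Counter
from typing import IO, Iterator


def iter_jsonl(handle: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def count_field_names(handle: IO[str]) -> Counter:
    """Count how often each top-level key appears, one record at a time."""
    counts: Counter = Counter()
    for record in iter_jsonl(handle):
        counts.update(record.keys())
    return counts


# Example usage (assumes metadata.jsonl is in the working directory):
#
#     with open("metadata.jsonl", encoding="utf-8") as f:
#         for name, n in count_field_names(f).most_common():
#             print(f"{name}: {n}")
```

Because records are parsed one line at a time, memory use stays roughly constant regardless of file size, which is the main reason to prefer the .jsonl file over metadata.json for large-scale processing.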

Using Python

While direct downloads are more convenient for most activities, users familiar with Python can perform more advanced and complex tasks programmatically.

For your convenience, we have developed several Jupyter notebooks to help you get started.

View the Python notebook for this data package
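
Independently of the notebook, reading a single downloaded .cdx.gz file might look like the sketch below. The column names are an assumption based on the conventional 11-field, space-delimited CDX layout; the CDX field descriptions in the README are the authoritative reference for this data package. The function names and the example path are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, Sequence

# ASSUMPTION: conventional 11-column CDX layout. Check the README's
# "cdx-files" section before relying on these names.
DEFAULT_FIELDS = (
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "file_size", "offset", "filename",
)


def parse_cdx_lines(
    lines: Iterable[str], fields: Sequence[str] = DEFAULT_FIELDS
) -> Iterator[dict]:
    """Yield one dict per well-formed CDX line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(" CDX"):
            continue  # skip blank lines and the optional " CDX ..." header
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip lines that do not match the assumed layout
        yield dict(zip(fields, values))


def read_cdx_gz(path: str) -> Iterator[dict]:
    """Stream records from a gzip-compressed CDX file without unpacking it."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_cdx_lines(handle)


# Example usage (hypothetical local path):
#
#     for row in read_cdx_gz("output/2004/example.cdx.gz"):
#         print(row["timestamp"], row["original"])
```

Lines that do not split into the expected number of columns are skipped silently here for brevity; a stricter script might log or count them instead.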

Bulk downloads using Python

For bulk downloads, refer to this Python script for downloading files in bulk. Sample commands for this data package:

Download all CDX files from 2004

python bulk_download.py --package "https://data.labs.loc.gov/us-elections/by-year/2004/" --out "output/2004/"

Download all CDX files from 2016

python bulk_download.py --package "https://data.labs.loc.gov/us-elections/by-year/2016/" --out "output/2016/"
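
After a bulk download, each file can be checked against the MD5 hash (base64) and file size listed in the manifest. The sketch below is one way to do that, not part of bulk_download.py. Because the key names inside manifest.json are not described on this page, the expected hash and size are passed in as plain arguments; the function names are hypothetical.

```python
import base64
import hashlib
import os


def md5_base64(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_manifest(path: str, expected_md5_b64: str, expected_size: int) -> bool:
    """Compare a local file with the hash and size recorded for it.

    The size check runs first because it is cheap and catches
    truncated downloads without hashing the file.
    """
    if os.path.getsize(path) != int(expected_size):
        return False
    return md5_base64(path) == expected_md5_b64
```

Note that the manifest stores the hash in base64, not the hexadecimal form printed by tools such as md5sum, so the digest bytes are base64-encoded before comparison.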

Dataset details

Source collection

United States Elections Web Archive

The United States Elections Web Archive includes campaign websites documenting presidential, congressional, and gubernatorial elections that were archived weekly during general election seasons from 2000 to the present. For elections before 2020, the archived sites often include web-harvested social media content, captured with varying success rates, to provide a fuller representation of how candidates presented themselves to the electorate via the Internet. In the early years of the collection, websites of political parties, government, advocacy groups, bloggers, and other individuals and groups producing content relevant to the election were also included. These sites have generally been moved into the Public Policy Topics Web Archive or into the general web archives. However, because these sites were originally collected with candidate websites, they remain in the CDX index files in this data package. This data package reflects the scope of the United States Elections Web Archive at the time of its collection each general election year. Because this scope has evolved, the scope of the CDX index files also varies over time.

Rights statement

Rights for this data package - The README, CDX files, and metadata.csv contained within this data package have no known copyright restrictions and are free to use and reuse.

Rights for the source United States Elections Web Archive collection - See the full Rights & Access statement for the United States Elections Web Archive at https://www.loc.gov/collections/united-states-elections-web-archive/about-this-collection/rights-and-access/.

Date created 2024-11-15
Creators & contributors
Creator:
Web Archives Program, Digital Content Management Section, Library of Congress
Cite this dataset
Chicago citation style:
Library Of Congress. United States Elections, Web Archives Data Package. [Washington, D.C.: Library of Congress, 2024] Software, E-Resource. https://data.labs.loc.gov/us-elections/.
APA citation style:
Library Of Congress. (2024) United States Elections, Web Archives Data Package. [Washington, D.C.: Library of Congress] [Software, E-Resource] Retrieved from the Library of Congress, https://data.labs.loc.gov/us-elections/.
MLA citation style:
Library Of Congress. United States Elections, Web Archives Data Package. [Washington, D.C.: Library of Congress, 2024] Software, E-Resource. Retrieved from the Library of Congress, <data.labs.loc.gov/us-elections/>.
Curatorial questions: Please direct curatorial questions to the Web Archiving Program at [email protected].
Access questions: For questions and technical issues about download and access, please submit a ticket on GitHub or email the LC Labs Team at [email protected].