
Selected Dot Gov Media Types, Web Archives Data Package

The Dot Gov Datasets are the result of exploratory work conducted by the Library's Web Archiving Program to make the Web Archives more widely accessible and usable. This data package consists of seven datasets, one for each of the following media types: audio, CSV, image, PDF, PowerPoint, TSV, and XLS. Each dataset contains information related to 1,000 or more files of its media type, selected from .gov domains in the Library's Web Archives.

Image: A collage of many small, colorful, low-resolution graphics in the style of 1990s-era web graphics.
Caption: A sampling of the many small images (i.e., infrastructural images) that make up the majority of the content in the Dot Gov Image dataset.

About this dataset

In 2019, the Library's Web Archiving Program released seven web archive file datasets. Each dataset consists of 1,000 files, along with associated metadata extracted by Apache Tika and other tools. To select the files, indexes of the web archives were used to derive a random list of 1,000 files of a specific media type hosted on .gov domains. The seven media types included are audio, CSV, image, PDF, PowerPoint, TSV, and XLS files.
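The selection step described above can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not the Web Archiving Program's actual procedure: the record fields `host` and `mime` and the function name `sample_by_media_type` are hypothetical, and the real indexes may be structured differently.

```python
import random


def sample_by_media_type(index_records, mime_prefixes, k=1000, seed=None):
    """Pick up to k random index records on .gov hosts with a matching media type.

    `index_records` is an iterable of dicts with hypothetical keys
    "host" and "mime"; these names are assumptions for illustration.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    prefixes = tuple(mime_prefixes)
    candidates = [
        record
        for record in index_records
        if record["host"].endswith(".gov") and record["mime"].startswith(prefixes)
    ]
    # Sample without replacement; never ask for more items than exist.
    return rng.sample(candidates, min(k, len(candidates)))
```

Sampling without replacement guarantees that no file appears twice in the resulting list, which matches the idea of a random list of distinct files.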

View source collection

Metadata	Metadata format	Data Files
7,000 records	.csv	7,000 files in multiple formats

Data package documentation

This data package includes comprehensive documentation of the source data and collection provenance, the contents of the data package, and how the data package was created. The full documentation is linked here:

View the documentation

How to access and use this data package

There are two main options for accessing and using this data package: (1) downloading files directly from this page and (2) using Python for more advanced usage.

Direct downloads

The following list outlines the contents of this data package. Many of the individual files inside the data package are linked directly on this page, so you can download and use them immediately. Zipped files are available for bulk download of all or part of the data package.

Sample the data	sample-data.zip (1.1 GB) - 700 randomly selected items: 100 items from each of the 1000-item datasets by media type (audio, CSV, image, PDF, PowerPoint, TSV, and XLS). The archive also includes metadata.csv, metadata.json, and manifest.json.
	sample-data/metadata.json (477.5 KB) - A JSON file containing the metadata for the sample items
	sample-data/metadata.csv (348.4 KB) - A CSV transformation of the sample JSON metadata
	sample-data/manifest.html - A simple page for downloading individual files; it lists each file's file id, item id, MD5 hash (base64), file size, and URL
	sample-data/manifest.json (17.6 KB) - A JSON file listing each file's file id, item id, MD5 hash (base64), file size, and URL
Download the documentation	README (HTML, Markdown, PDF) - An overview of the source media files and collection provenance, the contents of the data package, and how the data package was created. Media-type-specific documentation is listed below.
	audio_data/README.txt (12.5 KB) - An overview of the source audio and collection provenance, the contents of the audio set, and how the audio set was created.
	csv_data/README.txt (7.8 KB) - An overview of the source CSV files and collection provenance, the contents of the CSV set, and how the CSV set was created.
	image_data/README.txt (6.3 KB) - An overview of the source images and collection provenance, the contents of the image set, and how the image set was created.
	pdf_data/README.txt (10.3 KB) - An overview of the source PDF files and collection provenance, the contents of the PDF set, and how the PDF set was created.
	powerpoint_data/README.txt (9.1 KB) - An overview of the source PowerPoint files and collection provenance, the contents of the PowerPoint set, and how the PowerPoint set was created.
	tsv_data/README.txt (8.0 KB) - An overview of the source TSV files and collection provenance, the contents of the TSV set, and how the TSV set was created.
	xls_data/README.txt (8.7 KB) - An overview of the source Excel files and collection provenance, the contents of the Excel set, and how the Excel set was created.
Download the metadata by media type	audio_data/metadata.csv (499.3 KB) - A .csv file containing the metadata for all 1000 audio files
	csv_data/metadata.csv (413.2 KB) - A .csv file containing the metadata for all 1000 CSV files
	image_data/metadata.csv (840.7 KB) - A .csv file containing the metadata for all 1000 image files
	pdf_data/metadata.csv (517.0 KB) - A .csv file containing the metadata for all 1000 PDF files
	powerpoint_data/metadata.csv (1.0 MB) - A .csv file containing the metadata for all 1000 PowerPoint files
	tsv_data/metadata.csv (434.3 KB) - A .csv file containing the metadata for all 1000 TSV files
	xls_data/metadata.csv (505.8 KB) - A .csv file containing the metadata for all 1000 Excel files
Download data package by media type	audio_data.zip (4.6 GB) - All 1000 audio files with accompanying metadata and manifest files, zipped
	csv_data.zip (178.9 MB) - All 1000 CSV files with accompanying metadata and manifest files, zipped
	image_data.zip (128.6 MB) - All 1000 image files with accompanying metadata and manifest files, zipped
	pdf_data.zip (674.0 MB) - All 1000 PDF files with accompanying metadata and manifest files, zipped
	powerpoint_data.zip (3.2 GB) - All 1000 PowerPoint files with accompanying metadata and manifest files, zipped
	tsv_data.zip (5.9 MB) - All 1000 TSV files with accompanying metadata and manifest files, zipped
	xls_data.zip (87.0 MB) - All 1000 Excel files with accompanying metadata and manifest files, zipped
Browse data package by media type	audio_data/manifest.html - A simple page for downloading individual audio files; it lists each audio file's file id, item id, MD5 hash (base64), file size, and URL
	csv_data/manifest.html - A simple page for downloading individual CSV files; it lists each CSV file's file id, item id, MD5 hash (base64), file size, and URL
	image_data/manifest.html - A simple page for downloading individual image files; it lists each image's file id, item id, MD5 hash (base64), file size, and URL
	pdf_data/manifest.html - A simple page for downloading individual PDF files; it lists each PDF's file id, item id, MD5 hash (base64), file size, and URL
	powerpoint_data/manifest.html - A simple page for downloading individual PowerPoint files; it lists each PowerPoint file's file id, item id, MD5 hash (base64), file size, and URL
	tsv_data/manifest.html - A simple page for downloading individual TSV files; it lists each TSV file's file id, item id, MD5 hash (base64), file size, and URL
	xls_data/manifest.html - A simple page for downloading individual Excel files; it lists each Excel file's file id, item id, MD5 hash (base64), file size, and URL
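
Because the manifests list an MD5 hash in base64 for each file, a downloaded file can be checked against its manifest entry. The following is a minimal Python sketch of that check. The function names `md5_base64` and `matches_manifest` are hypothetical, and the sketch assumes the manifest value is the standard base64 encoding of the raw 16-byte MD5 digest, which should be confirmed against the package documentation.

```python
import base64
import hashlib


def md5_base64(path, chunk_size=1 << 20):
    """Return the base64-encoded MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so multi-gigabyte files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_manifest(path, expected_md5_b64):
    """Compare a local file against the hash copied from a manifest entry."""
    return md5_base64(path) == expected_md5_b64
```

A call such as `matches_manifest("some_file.pdf", hash_from_manifest)` returns `True` when the download is intact under the assumption above; both argument values here are placeholders.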

Using Python

While direct downloads are more convenient for most activities, users who are familiar with Python can perform more advanced and complex tasks programmatically.

For your convenience, we developed a number of Jupyter Notebooks to help you get started.

View the Python notebook for this data package
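
As a starting point independent of the notebook, the sketch below reads one of the metadata.csv files after it has been downloaded locally and reports its columns and row count. It is a minimal illustration that uses only the Python standard library. The function name `summarize_metadata` is hypothetical, the UTF-8 encoding is an assumption, and no column names are assumed because they are not listed on this page; inspect the returned column names before choosing one to tally.

```python
import csv
from collections import Counter


def summarize_metadata(csv_path, column=None):
    """Return (column names, row count, value counts) for a metadata CSV.

    `column` is optional; if given and present in the file, the values in
    that column are tallied. UTF-8 encoding is assumed, not confirmed.
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        rows = list(reader)
        fieldnames = list(reader.fieldnames or [])
    counts = Counter()
    if column is not None and column in fieldnames:
        counts.update(row[column] for row in rows)
    return fieldnames, len(rows), counts
```

For example, `summarize_metadata("pdf_data/metadata.csv")` returns the header names and the number of records, assuming the file has been saved at that relative path.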

Dataset details

Source collection	Library of Congress Web Archive
	The Library of Congress Web Archive manages, preserves, and provides access to archived web content selected by subject experts from across the Library, so that it will be available for researchers today and in the future. Websites are ephemeral and often considered at-risk born-digital content. New websites form constantly, URLs change, content changes, and websites sometimes disappear entirely. Websites document current events, organizations, public reactions, government information, and cultural and scholarly information on a wide variety of topics. Materials that used to appear in print are increasingly published online.
Rights statement	This dataset was derived from content in the Library's web archives. The Library follows a notification and permission process when acquiring content for the web archives and when allowing researcher access to the archived content, as described on the web archiving program page. Files were extracted from a variety of archived United States government websites collected in a number of event and thematic archives. See the general Rights & Access statement for a sample collection, which applies to all of the content in this dataset.
Date created	2024-06-01
Date updated	2024-06-14
Creators & contributors	Creator: Web Archiving Program Contributors: LC Labs
Cite this dataset	Chicago citation style: Library Of Congress. Selected Dot Gov Media Types, Web Archives Data Package. [Washington, D.C.: Library of Congress, 2024] Software, E-Resource. https://data.labs.loc.gov/dot-gov/.
	APA citation style: Library Of Congress. (2024) Selected Dot Gov Media Types, Web Archives Data Package. [Washington, D.C.: Library of Congress] [Software, E-Resource] Retrieved from the Library of Congress, https://data.labs.loc.gov/dot-gov/.
	MLA citation style: Library Of Congress. Selected Dot Gov Media Types, Web Archives Data Package. [Washington, D.C.: Library of Congress, 2024] Software, E-Resource. Retrieved from the Library of Congress, <data.labs.loc.gov/dot-gov/>.
Curatorial questions	For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Web Archiving Program at https://www.loc.gov/programs/web-archiving/about-this-program/contact-us/.
Access questions	For questions and technical issues about download and access, please submit a ticket on GitHub or email the LC Labs Team at [email protected].