Selected Dot Gov Media Types, Web Archives Data Package
The Dot Gov Datasets are the result of exploratory work conducted by the
Library's Web Archiving Program to make the Web Archives more widely
accessible and usable. This data package consists of seven datasets, each
containing information related to 1,000 or more files of related media types
(audio, CSV, image, PDF, PowerPoint, TSV, and XLS) selected from .gov domains
in the Library's Web Archives.
In 2019, the Library's Web Archiving Program released seven web archive file
datasets. Each dataset consists of 1,000 files of a specific media type
hosted on .gov domains. The files were chosen at random from a list derived
from indexes of the web archives, and each dataset includes associated
metadata extracted by Apache Tika and other tools. The seven media types
included are audio, CSV, image, PDF, PowerPoint, TSV, and XLS files.
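The selection step described above can be pictured roughly as follows. This
is only an illustrative sketch, not the Library's actual pipeline: the record
layout (dictionaries with hypothetical "url" and "mime" keys) and the
function name are assumptions made for the example.

```python
import random
from urllib.parse import urlsplit


def sample_gov_records(records, media_type, k=1000, seed=None):
    """Randomly pick up to k records of one media type hosted on a .gov domain.

    `records` is assumed to be a list of dicts with hypothetical "url" and
    "mime" keys; the real index format may differ.
    """
    def is_gov(url):
        # urlsplit lowercases the hostname, so the check is case-insensitive.
        host = urlsplit(url).hostname or ""
        return host.endswith(".gov")

    candidates = [
        record for record in records
        if record["mime"] == media_type and is_gov(record["url"])
    ]
    # A seeded generator makes the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))
```

Passing a fixed seed returns the same selection on every run, which helps
when a sample needs to be reproduced later.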
Included in this data package is comprehensive documentation of source data
or collection provenance, the contents of the data package, and how the data
package was created. Here are some particular sections of interest as well
as a link to the full documentation:
There are two main options for accessing and using this data package: (1)
directly downloading files from this page and (2) using Python for more
advanced usage.
Direct downloads
The following list outlines the contents of this data package. Many of the
individual files inside the data package are linked directly on this page,
so you can download and use them immediately. Zipped files are available for
bulk download of all or part of the data package.
Sample the data
sample-data.zip
(1.1 GB)
- 700 randomly selected items: 100 items selected from each of
the 1000-item datasets by media type (audio, CSV, image, PDF,
PowerPoint, TSV, and XLS). The archive also includes
metadata.csv, metadata.json, and manifest.json.
sample-data/manifest.html
- For downloading individual files, this is a simple page
that lists each file's file id, item id, MD5 hash (base64),
file size, and URL
sample-data/manifest.json
(17.6 KB)
- A JSON file listing each file's file id, item id, MD5
hash (base64), file size, and URL
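Because the manifests list an MD5 hash in base64 form, a downloaded file can
be checked against its manifest entry. The following is a minimal sketch
using only the Python standard library; the function names are illustrative,
and it assumes the expected value has been copied from a manifest by hand.

```python
import base64
import hashlib


def md5_base64(path, chunk_size=1 << 20):
    """Return the base64-encoded MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_manifest(path, expected_md5_base64):
    """Compare a local file with an MD5 (base64) value taken from a manifest."""
    return md5_base64(path) == expected_md5_base64.strip()
```

A result of False suggests an incomplete or corrupted download, so fetching
the file again is a reasonable first step.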
Download the documentation
README (HTML, Markdown, PDF)
- An overview of the source media files and collection
provenance, the contents of the data package, and how the
data package was created. Media-type-specific documentation
is also listed below.
audio_data/README.txt
(12.5 KB)
- An overview of the source audio and collection provenance,
the contents of the audio set, and how the audio set was
created.
csv_data/README.txt
(7.8 KB)
- An overview of the source CSV files and collection
provenance, the contents of the CSV set, and how the CSV set
was created.
image_data/README.txt
(6.3 KB)
- An overview of the source images and collection
provenance, the contents of the image set, and how the image
set was created.
pdf_data/README.txt
(10.3 KB)
- An overview of the source PDF files and collection
provenance, the contents of the PDF set, and how the PDF set
was created.
powerpoint_data/README.txt
(9.1 KB)
- An overview of the source PowerPoint files and collection
provenance, the contents of the PowerPoint set, and how the
PowerPoint set was created.
tsv_data/README.txt
(8.0 KB)
- An overview of the source TSV files and collection
provenance, the contents of the TSV set, and how the TSV set
was created.
xls_data/README.txt
(8.7 KB)
- An overview of the source Excel files and collection
provenance, the contents of the Excel set, and how the Excel
set was created.
tsv_data/metadata.csv
(434.3 KB)
- A .csv file containing the metadata for all 1000 TSV files
xls_data/metadata.csv
(505.8 KB)
- A .csv file containing the metadata for all 1000 Excel
files
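Once a metadata.csv file has been downloaded, a quick way to see what it
contains is to read its header row and count its records. This sketch makes
no assumption about column names; it does assume the file is UTF-8 encoded,
which this page does not state.

```python
import csv


def summarize_metadata(path):
    """Return (column_names, row_count) for a metadata CSV file."""
    # UTF-8 is an assumption; change `encoding` if the file fails to decode.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, row_count
```

For example, `summarize_metadata("tsv_data/metadata.csv")` would return the
column names found in that file together with its number of data rows.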
Download data package by media type
audio_data.zip
(4.6 GB)
- All 1000 audio files with accompanying metadata and
manifest files, zipped
csv_data.zip
(178.9 MB)
- All 1000 CSV files with accompanying metadata and manifest
files, zipped
image_data.zip
(128.6 MB)
- All 1000 image files with accompanying metadata and
manifest files, zipped
pdf_data.zip
(674.0 MB)
- All 1000 PDF files with accompanying metadata and manifest
files, zipped
powerpoint_data.zip
(3.2 GB)
- All 1000 PowerPoint files with accompanying metadata and
manifest files, zipped
tsv_data.zip
(5.9 MB)
- All 1000 TSV files with accompanying metadata and manifest
files, zipped
xls_data.zip
(87.0 MB)
- All 1000 Excel files with accompanying metadata and
manifest files, zipped
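Some of these archives are several gigabytes, so it can help to inspect a
downloaded zip before extracting it. The sketch below counts archive members
by file extension using the standard zipfile module; the function name is
illustrative, and the internal layout of each zip is not assumed.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_by_extension(zip_path):
    """Count the files in a zip archive by lowercase extension, without extracting."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # Skip directory entries; only files are counted.
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(none)"] += 1
    return counts
```

Calling `count_members_by_extension("tsv_data.zip")` on a downloaded copy
returns a Counter that maps each extension to its number of files.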
Browse data package by media type
audio_data/manifest.html
- For downloading individual audio files, this is a simple
page that lists each audio's file id, item id, MD5 hash
(base64), file size, and URL
csv_data/manifest.html
- For downloading individual CSV files, this is a simple
page that lists each CSV's file id, item id, MD5 hash
(base64), file size, and URL
image_data/manifest.html
- For downloading individual image files, this is a simple
page that lists each image's file id, item id, MD5 hash
(base64), file size, and URL
pdf_data/manifest.html
- For downloading individual PDF files, this is a simple
page that lists each PDF's file id, item id, MD5 hash
(base64), file size, and URL
powerpoint_data/manifest.html
- For downloading individual PowerPoint files, this is a
simple page that lists each PowerPoint's file id, item id,
MD5 hash (base64), file size, and URL
tsv_data/manifest.html
- For downloading individual TSV files, this is a simple
page that lists each TSV's file id, item id, MD5 hash
(base64), file size, and URL
xls_data/manifest.html
- For downloading individual Excel files, this is a simple
page that lists each Excel file's file id, item id, MD5 hash
(base64), file size, and URL
Using Python
While direct downloads are more convenient for most activities, users
familiar with Python can perform more advanced and complex tasks
programmatically.
For your convenience, we developed a number of Jupyter Notebooks to help get
you started.
The Library of Congress Web Archive manages, preserves, and
provides access to archived web content selected by subject
experts from across the Library, so that it will be available
for researchers today and in the future. Websites are ephemeral
and often considered at-risk born-digital content. New websites
form constantly, URLs change, content changes, and websites
sometimes disappear entirely. Websites document current events,
organizations, public reactions, government information, and
cultural and scholarly information on a wide variety of topics.
Materials that used to appear in print are increasingly
published online.
Rights statement
This dataset was derived from content in the Library's web archives.
The Library follows a notification and permission process in the
acquisition of content for the web archives, and to allow researcher
access to the archived content, as described on the web archiving program
page. Files were extracted from a variety of archived United States
government websites collected in a number of event and thematic
archives. See a general Rights & Access statement for a sample collection
which applies to all of the content in this dataset.
Library of Congress.
Selected Dot Gov Media Types, Web Archives Data Package.
[Washington, D.C.: Library of Congress, 2024] Software,
E-Resource. https://data.labs.loc.gov/dot-gov/.
APA citation style:
Library of Congress. (2024)
Selected Dot Gov Media Types, Web Archives Data Package.
[Washington, D.C.: Library of Congress] [Software,
E-Resource] Retrieved from the Library of Congress,
https://data.labs.loc.gov/dot-gov/.
MLA citation style:
Library of Congress.
Selected Dot Gov Media Types, Web Archives Data Package.
[Washington, D.C.: Library of Congress, 2024] Software,
E-Resource. Retrieved from the Library of Congress,
<data.labs.loc.gov/dot-gov/>.