Selected Dot Gov Media Types, Web Archives Data Package
The Dot Gov Datasets are the result of exploratory work conducted by the Library's Web Archiving Program to make the Web Archives more widely accessible and usable. This data package consists of seven datasets, each containing information related to 1,000 or more files of a single media type (audio, CSV, image, PDF, PowerPoint, TSV, or XLS) selected from .gov domains in the Library's Web Archives.
In 2019, the Library's Web Archiving Program released seven web archive file datasets. Each dataset consists of 1,000 files of one media type. Indexes of the web archives were used to derive a random list of 1,000 files that matched the media type and were hosted on .gov domains, and each dataset includes associated metadata extracted by Apache Tika and other tools. The seven media types included are audio, CSV, image, PDF, PowerPoint, TSV, and XLS files.
Metadata | Metadata format | Data Files |
---|---|---|
7,000 records | .csv | 7,000 files in multiple formats |
Included in this data package is comprehensive documentation of source data or collection provenance, the contents of the data package, and how the data package was created. Here are some particular sections of interest as well as a link to the full documentation:
There are two main options for accessing and using this data package: (1) downloading files directly from this page and (2) using Python for more advanced tasks.
The following list outlines the contents of this data package. Many of the individual files inside the data package are linked directly on this page, so you can download them and use them immediately. Zipped files are available for bulk download of all or part of the data package.
Sample the data | |
---|---|
Download the documentation | |
Download the metadata by media type | |
Download data package by media type | |
Browse data package by media type | |
While direct downloads are more convenient for most activities, users who are familiar with Python can perform more advanced and complex tasks programmatically.
For your convenience we developed a number of Jupyter Notebooks to help get you started.
View the Python notebook for this data package
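As a starting point for working with the metadata programmatically, the sketch below tallies records by the value of one column using only the Python standard library. It is a minimal illustration, not the method used in the Library's notebooks: the column names (`filename`, `mimetype`) and the sample rows are hypothetical placeholders, so check the header of the metadata CSV you download and substitute the real column name.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text: str, column: str) -> Counter:
    """Count how many metadata records share each value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Placeholder rows for illustration only; these are NOT real records
# from the data package, and the column names are assumptions.
SAMPLE = """filename,mimetype
example-1.pdf,application/pdf
example-2.pdf,application/pdf
example-3.csv,text/csv
"""

counts = count_by_column(SAMPLE, "mimetype")
for value, n in counts.most_common():
    print(f"{value}: {n}")
```

To run this against a downloaded metadata file, read the file into a string (for example with `pathlib.Path(path).read_text(encoding="utf-8")`) and pass it in place of `SAMPLE`.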
For bulk downloads, refer to this Python script for downloading files in bulk. Sample commands for this data package:
Download all audio data files
python bulk_download.py --package "https://data.labs.loc.gov/dot-gov/audio_data/" --out "output/dot-gov-audio/"
Download all PDF data files
python bulk_download.py --package "https://data.labs.loc.gov/dot-gov/pdf_data/" --out "output/dot-gov-pdf/"
Download all Powerpoint data files
python bulk_download.py --package "https://data.labs.loc.gov/dot-gov/powerpoint_data/" --out "output/dot-gov-powerpoint/"
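The sample commands above share one pattern, so they can be generated in a loop. The sketch below builds the same `bulk_download.py` invocations for the three media types shown on this page and, by default, only prints them. It assumes `bulk_download.py` is in the current working directory; the helper names `build_command` and `download_all` are illustrative and not part of the Library's script. Whether the other four media types follow the same URL pattern is not confirmed here, so verify each package URL before adding it to the list.

```python
import subprocess
import sys

BASE_URL = "https://data.labs.loc.gov/dot-gov"
# Only the media types whose commands appear on this page.
MEDIA_TYPES = ["audio", "pdf", "powerpoint"]


def build_command(media_type: str, script: str = "bulk_download.py") -> list[str]:
    """Build the argument list for one bulk download invocation."""
    return [
        sys.executable,  # the Python interpreter running this sketch
        script,
        "--package", f"{BASE_URL}/{media_type}_data/",
        "--out", f"output/dot-gov-{media_type}/",
    ]


def download_all(media_types=MEDIA_TYPES, dry_run: bool = True) -> None:
    """Print each command, or run it when dry_run is False."""
    for media_type in media_types:
        cmd = build_command(media_type)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(cmd, check=True)


download_all(dry_run=True)
```

Passing the arguments as a list avoids shell quoting problems with the URLs and output paths. Call `download_all(dry_run=False)` once the printed commands look right.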
Online resources and explorations built using this data.
Source collection | Library of Congress Web Archive: The Library of Congress Web Archive manages, preserves, and provides access to archived web content selected by subject experts from across the Library, so that it will be available for researchers today and in the future. Websites are ephemeral and often considered at-risk born-digital content. New websites form constantly, URLs change, content changes, and websites sometimes disappear entirely. Websites document current events, organizations, public reactions, government information, and cultural and scholarly information on a wide variety of topics. Materials that used to appear in print are increasingly published online. |
---|---|
Rights statement | This dataset was derived from content in the Library's web archives. The Library follows a notification and permission process in the acquisition of content for the web archives, and to allow researcher access to the archived content, as described on the web archiving program page. Files were extracted from a variety of archived United States government websites collected in a number of event and thematic archives. See a general Rights & Access statement for a sample collection which applies to all of the content in this dataset. |
Date created | 2024-06-01 |
Date updated | 2024-06-14 |
Creators & contributors | |
Cite this dataset | |
Curatorial questions | For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Web Archiving Program at https://www.loc.gov/programs/web-archiving/about-this-program/contact-us/. |
Access questions | For questions and technical issues about download and access, please submit a ticket on GitHub or email the LC Labs Team at [email protected]. |