Selected Dot Gov Media Types, Web Archives Data Package README

Version information

Version 1.0 | Last updated 2024-06-11 - 1.0 (2024-06-11) First version

About the source data or collection

A. Historical background of source material

This data includes a sample of tabular, PDF, audio, image, and PowerPoint documents that were found on United States government websites from 1996 to 2017. More specifically, it includes files linked from (and embedded in) .gov websites and archived in the Library of Congress Web Archive.

The .gov top-level domain is managed by the U.S. federal government and may be used by any level of U.S. government entities, including federal, state, and local. As of 2024, registration of .gov domains is coordinated through the General Services Administration (GSA) and inventoried at https://get.gov/about/data/.

Original format

The full web archive data is stored in WARC files. For more information, see WARC, Web ARChive file format.

The web archives are indexed in CDX files, which form the basis of this data package. For more information about these files, see CDX Internet Archive Index File. For more information about the nine-field 2006 version of the CDX specification used at the time that the dataset was created, see https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.

B. Acquisition and Access

The Library of Congress has an expansive web archive, with content collected as early as 1996. Websites are generally selected for inclusion across a broad range of subjects, themes, and events identified by Library staff. Collecting areas include, for example, select U.S. government sites from the Legislative, Judicial, and Executive branch agencies; select foreign government sites; campaign websites and political parties documenting U.S. and select foreign elections; non-profit organizations; journalism and news; creative sites such as those documenting comics, music, authors, and art; legal sites; and international organizations. Areas of collecting focus have evolved over time, but U.S. governmental websites have been a particular focus throughout the program's history. More information about the selection process can be found at https://www.loc.gov/programs/web-archiving/about-this-program/frequently-asked-questions/.

The web archive is built using automated tools, and the technical configuration of these tools also has an important impact on the scope of the collection. When a website is selected for inclusion, it is archived using a web crawler. The crawler starts with a "seed URL" – for instance, a homepage – and the crawler follows the links it finds, preserving content as it goes. These links include html, CSS, JavaScript, images, PDFs, audio and video files, and more. Scoping instructions are added to allow or restrict the crawler's ability to collect linked files hosted on third party sites or on other subdomains from the same organization. The resulting web archive contains a broad diversity of file formats, from a range of web domains not necessarily limited to those originally selected for inclusion. More information about the crawling process can be found at https://www.loc.gov/programs/web-archiving/about-this-program/frequently-asked-questions/.

Most of the collection is described and searchable at https://www.loc.gov/web-archives/. Individual web documents can be retrieved by appending the web document's original URL to the end of "https://webarchive.loc.gov/all/*/" (e.g., https://webarchive.loc.gov/all/*/http://adcvd.cbp.dhs.gov/adcvdweb/ad_cvd_msgs/8461.pdf). The web document's original URL can be found in the "original" column of the dataset CSV metadata files.

Library of Congress reading room and contact

The files in these datasets do not have any access restrictions and are made fully available online. Some content in the broader web archive is restricted to onsite access only, if the site owners do not grant permission to display offsite, and this content can be viewed in from any research center at the Library of Congress.

Questions about the web archiving program can be submitted at https://www.loc.gov/programs/web-archiving/about-this-program/contact-us/.

Metadata type

Metadata collected by the web crawler is saved into WARC files. For more information about information contained in the full archive WARC files, see WARC, Web ARChive file format.

The web archives are indexed in CDX files, which form the basis of this data package. For more information about the metadata in these files, see CDX Internet Archive Index File. For more information about the nine-field 2006 version of the CDX specification used at the time that the dataset was created, see https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.

This data package also includes metadata extracted from the content files themselves. These metadata fields are specific to each file format specification.

Scale of description

The source CDX index contains one line per record, and each record describes a capture (or attempted capture) of a web document file.

Rights information

This dataset was derived from content in the Library's web archives. The Library follows a notification and permission process in the acquisition of content for the web archives, and to allow researcher access to the archived content, as described on the web archiving program page, https://www.loc.gov/programs/web-archiving/about-this-program/. Files were extracted from a variety of archived United States government websites collected in a number of event and thematic archives. See a general Rights & Access statement for a sample collection which applies to all of the content in this dataset: https://www.loc.gov/collections/legislative-branch-web-archive/about-this-collection/rights-and-access/.

II. About the exploratory data package

What's included?

This dataset is a set of seven random samples taken from the full corpus of .gov websites collected between 1996 to 2017, available in the Library of Congress Web Archive. One thousand (1,000) documents were included for each of seven media types: PDFs, Audio, CSV, TSV, Excel, Images, and PowerPoint. Three of those media types, CSV, TSV, and Excel, are packaged together here in the "Tabular Dataset".

The samples were generated from the web archives' CDX index files . At the time of the sampling in 2018, the CDX index was 6 TB, and the full archive WARC files were nearly 1.5 PB.

Each set includes 1,000 unique files and minimal metadata about them, including links to their locations within the Library's web archive. The datasets are as follows:

audio_data/ - 1,000 .gov Audio files (4.6 GB zip file + metadata files)
csv_data/ - 1,000 .gov CSV files (179 MB zip file + metadata files)
image_data/ - 1,000 .gov Image files (128.16 MB zip file + metadata files)
pdf_data/ - 1,000 .gov PDF files (673.5 MB zip file + metadata files)
powerpoint_data/ - 1,000 .gov PowerPoint (3.2 GB zip file + metadata files)
tsv_data/ - 1,000 .gov Tab-separated (TSV) files (5.8 MB zip file + metadata files)
xls_data/ - 1,000 .gov Excel files (86.8 MB zip file + metadata files)

Each of the above media type set contains:

README.html, README.md, README.pdf - technical overview of how all assessment datasets were created and the context in which they were created
[media_type]_data.zip - compressed bag containing (1) the randomly selected files from the archive of a particular media type inside a "data" directory and (2) BagIt manifest files following the specification at http://webarchive.loc.gov/all/20160830141859/https://tools.ietf.org/html/draft-kunze-bagit-08#section-2.
manifest.html - web page with filename, item ID, MD5 hash, and direct download link for each content file from the zip package's "data" directory".
[media_type]_metadata.csv - a CSV containing metadata about each randomly selected content file. Metadata is derived from the CDX line entry for each file and from embedded metadata extracted from the files themselves. The fields and their contents are described below in the "Dataset field descriptions" section and in each media type's README file.

Additionally, there is a sample taken from the dataset as follows:

sample-data.zip (TODO) - 700 randomly selected items, where 100 items were selected from each of the 1000-item datasets by media type (PDF, image, csv, tsv, xls, audio, Powerpoint). Included with this are a metadata.csv, metadata.json, and manifest.json.
sample-data/metadata.json (477.5 KB) - A JSON file containing the metadata for the 100 sample items
sample-data/metadata.csv (348.4 KB) - A CSV transformation of the sample JSON metadata
sample-data/manifest.html - For downloading individual images, this is a simple page that lists each image's file id, item id, MD5 hash (base64), file size, and URL
sample-data/manifest.json (17.6 KB) - A JSON file listing each image file id, their item id, MD5 hash (base64), file size, and URL

Composition

The data package contains a total of 7,000 content files.

Metadata for these items is found in 7 CSV metadata files, one for each media type.

Potential risks to people, communities, and organizations and strategies for risk mitigation

The files in this sample data package were publicly shared on U.S. federal, state, and local government websites. The data package was created by randomly sampling from the web archive, and the contents of the sampled files have not been individually reviewed by Library of Congress.

Computational readiness and possible uses

This data includes a sample of the types of documents attached to U.S. government websites, and may be especially helpful for initial testing of analytical methods that could be scaled in the future to a larger corpus of files extracted from .gov or other websites.

The included CSV metadata files can serve as an entry point to analyzing filenames and embedded information about files, and the files themselves can be text mined, parsed as structured data, analyzed by computer vision tools, and audio analysis tools.

III. How was it created?

Compilation Methods

Overview

Datasets originally created 11/6/2018.

Before sampling, the CDX index was filtered so that: each web documented was only counted once (in order to not over-sample from documents that were visited more frequently by the crawler or found at more than one URL), web documents were excluded if they had never been successfully captured (e.g., if the crawler encountered 404 errors), and web documents were only included if they were collected while archiving a .gov website. Formats were identified using their mimetype values, which are provided by the source websites.

Details

The bulk of these datasets were created using CDX Line Extraction, which first filtered and sorted the CDX lines based on the following fields from the CDX line entries:

digest: a unique cryptographic hash of the web object's payload at the time of the crawl, which provides a distinct fingerprint for that object; it is a Base32 encoded SHA-1 hash.
mimetype: two-part designation (type/subtype) that describes the nature and format of the web object, as reported by the server at the time of the crawl.
- CSV: mimetype contains the string "csv"
- TSV: mimetype contains the string "tab-separated"
- Excel: mimetype contains the string "excel"
- PDF: mimetype contains the string "pdf"
- Audio: mimetype contains the string "audio"
- Image: mimetype contains the string "image"
- PowerPoint: mimetype contains the string "powerpoint"
status code: represents the HTTP response code from the server at the time of the crawl, e.g. 200, 404, etc.
original URL: the URL that was captured during the web harvesting process.

The CDX Line Extraction process involved multiple phases. First, a MapReduce job filtered out lines from the 6TB corpus that were:

of the mime type requested
had a status code of 200
and whose top level domain from the original URL was ".gov"

The query results wrote matching lines to new CDX files, which were stored in an S3 bucket to be used in the next step. In this step, another MapReduce job pulled out all the digests from the filtered CDX files created by the first job and wrote them to a list that was also stored in an S3 bucket. Then, a Python script was used to randomly select one thousand digests from the digest list that was created by the second MapReduce job. Finally, a third MapReduce job took this subsection of one thousand digests as an input and extracted the CDX Line(s) each digest was referenced in, and wrote them to a CDX file that was stored in an S3 bucket. The CDX file from the final MapReduce Job was downloaded, converted to a CSV using a Python script, then used as the basis for any additional metadata extracted from the files or other computational methods of exploration.

For more information about the nine-field 2006 version of the CDX specification used at the time that the dataset was created, see https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.

Preprocessing steps

Additional information not extracted from the CDX index was also created for use in tracking file integrity and basic metadata. This included the use of the Apache Tika tool to extract metadata from each of the files (see https://tika.apache.org/), which was then recorded in the accompanying CSV. All information derived through Tika is noted in each media type's README file. Response headers information stored with the web archives (content length) furnished the data for the file_size field. If the header did not include the content length, the file size (in bytes) was obtained after the PDF was retrieved. For PDFs, a derivative image of the first page from each PDF was created, then run through Tika to ascertain the page width, height, and surface area. Those derivative images are not included in this dataset.

IV. Dataset field descriptions

This section lists and describes each of the fields included in lcwa_gov_[media_type]_metadata.csv that are common to all of the datasets. Unique fields for each media type are specified in the README.txt accompanying that particular dataset.

Field	Datatype	Definition	Metadata Source
urlkey	string	the url of the captured web object, without the protocol (http://) or the leading www. This information is extracted from the CDX index file.	CDX index file
timestamp	integer	timestamp in the form YYYYMMDDhhmmss. The time represents the point at which the web object was captured, as recorded in the CDX index file.	CDX index file
original	string	the url of the captured web object, including the protocol (http://) and the leading www, if applicable, extracted from the CDX index file.	CDX index file
mimetype	string	the mimetype as recorded in the CDX.	CDX index file
statuscode	integer	the HTTP response code received from the server at the time of capture, e.g., 200, 404. In this case, only codes that matched "200" were selected.* * Not included for the PowerPoint dataset.	CDX index file
digest	string	a unique, cryptographic hash function of the web object's payload at the time of the crawl. This provides a distinct fingerprint for the object; it is a Base32 encoded SHA-1 hash, derived from the CDX index file.	CDX index file
file_size	integer	the size of the web object, in bytes. If the response headers information stored with the web archives contained a content length value, that was used. Otherwise, Apache Tika was used to calculate file size.* * This field appears as content_length in the image dataset.	CDX index file where possible, otherwise during dataset generation
sha256	string	a unique cryptographic hash of the downloaded web object, computed using the SHA256 function from the SHA-2 algorithm. It serves as a checksum for the downloaded web object and was created during the bagit process.	dataset generation
sha512	string	a unique cryptographic hash of the downloaded web object, computed using the SHA512 function from the SHA-2 algorithm. It serves as a checksum for the downloaded web object and was created during the bagit process.	dataset generation

V. Rights Statement

VI. Creator and contributor information

Creator: Chase Dooley

Contributors: Aly DesRochers, Abbie Grotke, Charlie Hosale, Grace Bicho, Jesse Johnston, Kate Murray, Lauren Baker, Pedro Gonzalez-Fernandez, Rachel Trent, Trevor Owens

VII. Feedback

Please contact us with feedback or questions at https://www.loc.gov/programs/web-archiving/about-this-program/contact-us/!