Version 1.0 | Last updated 2024-06-11
This data includes a sample of tabular, PDF, audio, image, and PowerPoint documents that were found on United States government websites from 1996 to 2017. More specifically, it includes files linked from (and embedded in) .gov websites and archived in the Library of Congress Web Archive.
The .gov top-level domain is managed by the U.S. federal government and may be used by any level of U.S. government entities, including federal, state, and local. As of 2024, registration of .gov domains is coordinated through the General Services Administration (GSA) and inventoried at https://get.gov/about/data/.
The full web archive data is stored in WARC files. For more information, see WARC, Web ARChive file format.
The web archives are indexed in CDX files, which form the basis of this data package. For more information about these files, see CDX Internet Archive Index File. For more information about the nine-field 2006 version of the CDX specification used at the time that the dataset was created, see https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.
The Library of Congress has an expansive web archive, with content collected as early as 1996. Websites are generally selected for inclusion across a broad range of subjects, themes, and events identified by Library staff. Collecting areas include, for example, select U.S. government sites from the Legislative, Judicial, and Executive branch agencies; select foreign government sites; campaign websites and political parties documenting U.S. and select foreign elections; non-profit organizations; journalism and news; creative sites such as those documenting comics, music, authors, and art; legal sites; and international organizations. Areas of collecting focus have evolved over time, but U.S. governmental websites have been a particular focus throughout the program's history. More information about the selection process can be found at https://www.loc.gov/programs/web-archiving/about-this-program/frequently-asked-questions/.
The web archive is built using automated tools, and the technical configuration of these tools also has an important impact on the scope of the collection. When a website is selected for inclusion, it is archived using a web crawler. The crawler starts with a "seed URL" – for instance, a homepage – and the crawler follows the links it finds, preserving content as it goes. These links include html, CSS, JavaScript, images, PDFs, audio and video files, and more. Scoping instructions are added to allow or restrict the crawler's ability to collect linked files hosted on third party sites or on other subdomains from the same organization. The resulting web archive contains a broad diversity of file formats, from a range of web domains not necessarily limited to those originally selected for inclusion. More information about the crawling process can be found at https://www.loc.gov/programs/web-archiving/about-this-program/frequently-asked-questions/.
Most of the collection is described and searchable at https://www.loc.gov/web-archives/. Individual web documents can be retrieved by appending the web document's original URL to the end of "https://webarchive.loc.gov/all/*/" (e.g., https://webarchive.loc.gov/all/*/http://adcvd.cbp.dhs.gov/adcvdweb/ad_cvd_msgs/8461.pdf). The web document's original URL can be found in the "original" column of the dataset CSV metadata files.
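The retrieval pattern described above can be sketched in a few lines of Python (the prefix and example URL are taken from the text; the function name is illustrative):

```python
# Build a Library of Congress Web Archive lookup URL from a document's
# original URL, following the pattern described above.
REPLAY_PREFIX = "https://webarchive.loc.gov/all/*/"

def replay_url(original_url: str) -> str:
    """Return the web archive lookup URL for a document's original URL."""
    return REPLAY_PREFIX + original_url

print(replay_url("http://adcvd.cbp.dhs.gov/adcvdweb/ad_cvd_msgs/8461.pdf"))
# → https://webarchive.loc.gov/all/*/http://adcvd.cbp.dhs.gov/adcvdweb/ad_cvd_msgs/8461.pdf
```

The values in the "original" column of any of the dataset CSV metadata files can be passed to this function directly.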
The files in these datasets have no access restrictions and are made fully available online. Some content in the broader web archive is restricted to onsite access when site owners have not granted permission for offsite display; that content can be viewed from any research center at the Library of Congress.
Questions about the web archiving program can be submitted at https://www.loc.gov/programs/web-archiving/about-this-program/contact-us/.
Metadata collected by the web crawler is saved into WARC files. For more information about information contained in the full archive WARC files, see WARC, Web ARChive file format.
The web archives are indexed in CDX files, which form the basis of this data package. For more information about the metadata in these files, see CDX Internet Archive Index File. For more information about the nine-field 2006 version of the CDX specification used at the time that the dataset was created, see https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.
This data package also includes metadata extracted from the content files themselves. These metadata fields are specific to each file format specification.
The source CDX index contains one line per record, and each record describes a capture (or attempted capture) of a web document file.
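As a sketch of that record structure, a space-delimited CDX line can be split into named fields. This is a hedged simplification: the first six names below mirror the dataset's CSV columns (urlkey, timestamp, original, mimetype, statuscode, digest), but the exact field set and order of a given CDX file is declared by its " CDX ..." header line, so real code should read that header rather than assume this layout. The example line and digest value are illustrative.

```python
# Assumed field order, mirroring the dataset's CSV columns; a real CDX
# file declares its own field order in a " CDX ..." header line.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest"]

def parse_cdx_line(line: str) -> dict:
    """Split one CDX line into a dict of the first six assumed fields."""
    parts = line.strip().split(" ")
    return dict(zip(FIELDS, parts))  # any extra trailing fields are ignored

# Illustrative line (the urlkey is the SURT-style canonicalized URL):
record = parse_cdx_line(
    "gov,dhs,cbp,adcvd)/adcvdweb/ad_cvd_msgs/8461.pdf 20170101000000 "
    "http://adcvd.cbp.dhs.gov/adcvdweb/ad_cvd_msgs/8461.pdf application/pdf "
    "200 EXAMPLEDIGEST"
)
```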
This dataset was derived from content in the Library's web archives. The Library follows a notification and permission process in the acquisition of content for the web archives, and to allow researcher access to the archived content, as described on the web archiving program page, https://www.loc.gov/programs/web-archiving/about-this-program/. Files were extracted from a variety of archived United States government websites collected in a number of event and thematic archives. See a general Rights & Access statement for a sample collection which applies to all of the content in this dataset: https://www.loc.gov/collections/legislative-branch-web-archive/about-this-collection/rights-and-access/.
This dataset is a set of seven random samples taken from the full corpus of .gov websites collected between 1996 and 2017, available in the Library of Congress Web Archive. One thousand (1,000) documents were included for each of seven media types: PDFs, Audio, CSV, TSV, Excel, Images, and PowerPoint. Three of those media types (CSV, TSV, and Excel) are packaged together here in the "Tabular Dataset".
The samples were generated from the web archives' CDX index files. At the time of the sampling in 2018, the CDX index was 6 TB, and the full archive WARC files totaled nearly 1.5 PB.
Each set includes 1,000 unique files and minimal metadata about them, including links to their locations within the Library's web archive. The datasets are as follows:
Each of the above media type sets contains:
Additionally, there is a sample taken from the dataset as follows:
The data package contains a total of 7,000 content files.
Metadata for these items is found in 7 CSV metadata files, one for each media type.
The files in this sample data package were publicly shared on U.S. federal, state, and local government websites. The data package was created by randomly sampling from the web archive, and the contents of the sampled files have not been individually reviewed by the Library of Congress.
This data includes a sample of the types of documents attached to U.S. government websites, and may be especially helpful for initial testing of analytical methods that could be scaled in the future to a larger corpus of files extracted from .gov or other websites.
The included CSV metadata files can serve as an entry point for analyzing filenames and embedded information about files, and the content files themselves can be text mined, parsed as structured data, or analyzed with computer vision and audio analysis tools.
Overview
Datasets originally created 11/6/2018.
Before sampling, the CDX index was filtered so that: each web document was counted only once (to avoid over-sampling documents that were visited more frequently by the crawler or found at more than one URL); web documents were excluded if they had never been successfully captured (e.g., if the crawler encountered 404 errors); and web documents were included only if they were collected while archiving a .gov website. Formats were identified using their mimetype values, which are provided by the source websites.
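A minimal sketch of those three filtering rules, assuming CDX records already parsed into dicts whose keys follow the dataset's CSV columns. Note one simplification: "collected while archiving a .gov website" is a property of the crawl's seed, which is not recorded in the record itself, so this sketch substitutes a hostname check on the original URL.

```python
from urllib.parse import urlparse

def keep(record: dict, seen_digests: set) -> bool:
    """Apply the three pre-sampling filters described above (sketch)."""
    # 1. Count each web document only once, keyed on its payload digest.
    if record["digest"] in seen_digests:
        return False
    # 2. Exclude captures that were never successful (e.g., 404 errors).
    if record["statuscode"] != "200":
        return False
    # 3. Keep only .gov material (simplified to a hostname check; the
    #    actual filter was scoped by the archived seed website).
    host = urlparse(record["original"]).hostname or ""
    if not (host == "gov" or host.endswith(".gov")):
        return False
    seen_digests.add(record["digest"])
    return True
```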
Details
The bulk of these datasets were created using CDX Line Extraction, which first filtered and sorted the CDX lines based on the following fields from the CDX line entries:
The CDX Line Extraction process involved multiple phases. First, a MapReduce job filtered the 6 TB corpus, matching only lines that were:
The matching lines were written to new CDX files, which were stored in an S3 bucket for the next step. A second MapReduce job then extracted all the digests from those filtered CDX files and wrote them to a list, also stored in an S3 bucket. Next, a Python script randomly selected one thousand digests from that list. Finally, a third MapReduce job took these one thousand digests as input, extracted the CDX line(s) in which each digest was referenced, and wrote them to a CDX file stored in an S3 bucket. The CDX file from the final MapReduce job was downloaded, converted to a CSV using a Python script, and then used as the basis for any additional metadata extracted from the files or other computational methods of exploration.
For more information about the nine-field 2006 version of the CDX specification used at the time that the dataset was created, see https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.
Additional information not extracted from the CDX index was also created for use in tracking file integrity and basic metadata. This included the use of the Apache Tika tool to extract metadata from each of the files (see https://tika.apache.org/), which was then recorded in the accompanying CSV. All information derived through Tika is noted in each media type's README file. Response header information stored with the web archives (the content length) furnished the data for the file_size field. If the header did not include the content length, the file size (in bytes) was obtained after the file was retrieved. For PDFs, a derivative image of the first page of each PDF was created, then run through Tika to ascertain the page width, height, and surface area. Those derivative images are not included in this dataset.
This section lists and describes each of the fields included in lcwa_gov_[media_type]_metadata.csv that are common to all of the datasets. Unique fields for each media type are specified in the README.txt accompanying that particular dataset.
| Field | Datatype | Definition | Metadata Source |
|---|---|---|---|
| urlkey | string | the URL of the captured web object, without the protocol (http://) or the leading www, extracted from the CDX index file. | CDX index file |
| timestamp | integer | timestamp in the form YYYYMMDDhhmmss, representing the point at which the web object was captured, as recorded in the CDX index file. | CDX index file |
| original | string | the URL of the captured web object, including the protocol (http://) and the leading www, if applicable, extracted from the CDX index file. | CDX index file |
| mimetype | string | the mimetype as recorded in the CDX index file. | CDX index file |
| statuscode | integer | the HTTP response code received from the server at the time of capture, e.g., 200, 404. In this case, only codes that matched "200" were selected. (Not included for the PowerPoint dataset.) | CDX index file |
| digest | string | a cryptographic hash of the web object's payload at the time of the crawl, providing a distinct fingerprint for the object. It is a Base32-encoded SHA-1 hash, derived from the CDX index file. | CDX index file |
| file_size | integer | the size of the web object, in bytes. If the response headers stored with the web archives contained a content length value, that was used; otherwise, Apache Tika was used to calculate the file size. (This field appears as content_length in the image dataset.) | CDX index file where possible, otherwise during dataset generation |
| sha256 | string | a cryptographic hash of the downloaded web object, computed using the SHA-256 function from the SHA-2 family. It serves as a checksum for the downloaded web object and was created during the BagIt process. | dataset generation |
| sha512 | string | a cryptographic hash of the downloaded web object, computed using the SHA-512 function from the SHA-2 family. It serves as a checksum for the downloaded web object and was created during the BagIt process. | dataset generation |
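The digest and checksum fields above can be recomputed from a downloaded file's bytes to verify its integrity. A short sketch of the mechanics (the function names are illustrative; whether a recomputed digest matches depends on the capture's payload bytes):

```python
import base64
import hashlib

def loc_digest(payload: bytes) -> str:
    """Base32-encoded SHA-1 of the payload, as stored in the digest field."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

def checksums(payload: bytes) -> dict:
    """SHA-256 and SHA-512 hex digests, as in the sha256/sha512 fields."""
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "sha512": hashlib.sha512(payload).hexdigest(),
    }
```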
Creator: Chase Dooley
Contributors: Aly DesRochers, Abbie Grotke, Charlie Hosale, Grace Bicho, Jesse Johnston, Kate Murray, Lauren Baker, Pedro Gonzalez-Fernandez, Rachel Trent, Trevor Owens
Please contact us with feedback or questions at https://www.loc.gov/programs/web-archiving/about-this-program/contact-us/!