Computing Cultural Heritage in the Cloud Derivative Datasets

Welcome to data.labs.loc.gov, an experimental sandbox for sharing data packages. Right now, it features only data packages compiled as part of LC Labs' Mellon Foundation-funded Computing Cultural Heritage in the Cloud (CCHC) initiative. These data packages will be used as part of an invitation-only event in October 2022 to investigate access to and engagement with large public domain datasets using cloud services.

LC Labs continues to seek feedback from users on the information presented in this space; please get in touch with us below with your comments and questions.

Data Packages

  • woman operating hand drill on a dive bomber

    Free to Use and Reuse (Metadata) Data Package

    About: This dataset contains metadata records and images for 2,688 curated selections featured in the Library of Congress' Free to Use and Reuse Sets as well as links to images for the full digital objects represented in the sets. This dataset includes only those items which are accessible via the Library of Congress' API.

    Source collection: Free to Use and Reuse is a collection of themed sets curated by Library staff. Themes are intentionally varied, to illustrate the depth and breadth of the Library’s collections. Themes have included skyscrapers, natural disasters, birds, shoes, games and more. These sets are just a small sample of the Library's digital collections that are free to use and reuse. The Library believes that this content is either in the public domain, has no known copyright, or has been cleared by the copyright owner for public use. The digital collections comprise millions of items including books, newspapers, manuscripts, prints and photos, maps, musical scores, films, sound recordings and more. A new set is usually added every month.

    For more information about the source material, please contact the Prints and Photographs Division.

    Link to data: To access the dataset and documentation, please visit this page.

    Link to image on the left: https://www.loc.gov/resource/fsac.1a35371/ .

  • 1911 recording labeled Columbia Records

    National Jukebox (Metadata) Data Package

    About: This dataset contains metadata records for 5,882 audio recordings in the National Jukebox collection. The records range in date from 1900 to 1922.

    Source collection: Recordings in the National Jukebox come from the Recorded Sound Section of the Library of Congress, the University of California–Santa Barbara, and a private collection, though the recordings selected for this dataset all come from the Recorded Sound Section of the Library of Congress. (This dataset does not include the roughly 8,000 recordings from the other two repositories from the selected time period.) All recordings included in the Jukebox were issued on record labels now owned by Sony Music Entertainment, which granted the Library of Congress a license to make the recordings available online. All of the recordings digitized so far under this license, including those in this dataset, were made by the Victor Talking Machine Company. These recordings were originally made on wax discs. In cases where multiple copies of the same recording were available, the disc in the best condition was selected for digitization.

    For more information about the source material, please contact the Recorded Sound Section.

    Link to data: To access the dataset and documentation, please visit this page.

    Link to image on the left: https://www.loc.gov/item/jukebox-645498/ .

  • portion of a cover from a 1938-1953 Belle Glade through Pahokee, Florida, telephone directory

    Digitized Telephone Directories, 1891-1988 (Metadata) Data Package

    About: This dataset contains metadata records for a subset of 3,513 reels of US telephone directories, digitized from microfilm, from the Digitized Telephone Directories collection on loc.gov. These records are included in both CSV and JSON formats.

    Source collection: The Library of Congress makes available to the public an extensive collection of past and present city, telephone, and reverse telephone (criss-cross) directories for the United States and many foreign countries. These directories are available in a variety of formats and locations. The collection spans most of the 20th century, and includes directories from Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, the District of Columbia, Florida, Georgia, Hawaii, Iowa, Maryland, Pennsylvania, and the city of Chicago. All the directories and their metadata records are in English. There is not a one-to-one correspondence between metadata records and directories, as some microfilm reels contained multiple directories when they were digitized. There is a mix of white pages and yellow pages. Some directories may be missing pages due to damage. More information can be found at: https://www.loc.gov/collections/united-states-telephone-directory-collection/

    For more information about the source material, please contact the Local History and Genealogy Reference Services staff.

    Link to data: To access the dataset and documentation, please visit this page.

    Link to image on the left: https://www.loc.gov/resource/usteledirec.usteledirec06767/?sp=41&r=-0.377,0.313,1.722,0.682,0 .

  • 20th century telephone in front of historic telephone book

    Directory Holdings (Metadata) Data Package

    About: The Directory Holdings Data Package consists of metadata describing the Library of Congress' inventoried holdings of United States telephone directories, city directories, and criss-cross directories.

    Source collection: The metadata in this dataset is derived from the inventory tables listed on the Library's Directories by Address: Inventories of Library Collections and United States City and Telephone Directories guides. For more information about the source material, please contact the Local History and Genealogy Section.

    Link to data: The data is presented in two ways: by directory type and by state/region.

    Link to image on the left: https://www.loc.gov/item/2017809964/ .

  • historic stereograph image of woman peering through stereoview

    Stereograph Card Images Data Package

    About: The Stereograph Card dataset consists of 39,526 stereograph card images from the 1850s through 1924, a subset of what was available online in the collection on loc.gov in August 2022.

    Source collection: The image files in this dataset are derived from the Stereograph Cards collection on loc.gov. For more information about the source material, please contact the Prints & Photographs Division.

    Link to data: To access the dataset and documentation, please visit the Stereograph Cards Data Package cover page .

    Link to image on the left: https://www.loc.gov/item/2003674057/ .

  • historic map of 19th century Austria-Hungary

    Spezialkarte der österreichisch-ungarischen Monarchie ("Austro-Hungarian map set") Data Package

    About: This experimental dataset contains 4,998 images in TIFF format representing non-georeferenced map sheets, along with corresponding GeoTIFF-formatted images that are georeferenced and have had the map collars (non-map portions of the image at the edge of the sheet) removed.

    Source collection: The historical maps contained in this dataset were initially prepared and issued by the Austro-Hungarian Monarchy's Militärgeographisches Institut beginning around 1875. After the dissolution of the Austro-Hungarian Empire in 1918, parts of the set were continued by successive governments. For more information about the source material, please contact the Geography and Map Division.

    Link to data: To access the dataset and documentation, please visit the Austro-Hungarian Map Data Package cover page.

    Link to image on the left: https://www.loc.gov/item/2018588019/ .

  • illustration of man working a printing press

    Selected Digitized Books Data Package

    About: This dataset comprises 166,218 .txt and JSON files containing full text from 90,414 books in the Selected Digitized Books collection on loc.gov. The text was created as part of digitization workflows using Optical Character Recognition (OCR) technologies.

    Source collection: The Selected Digitized Books collection is a growing collection of selected books and other materials from the Library of Congress General Collections that have been made openly available. Most of the materials in this collection were published in the United States prior to the 1930s and are in English. The collection features thousands of works of fiction, including books intended for children, young adults, and other audiences. There are also some materials in foreign languages that were published in other countries.

    Link to data: To access the dataset and documentation, please visit the Digitized Books Data Package cover page.

    Link to image on the left: https://www.loc.gov/resource/gdcmassbookdig.littleadventures00alls/?sp=25 .
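As the first entry above notes, items in these packages are accessible via the Library of Congress' API. The following is a minimal sketch, not an official client, of turning one of the loc.gov page URLs listed above into a JSON request URL. It assumes the `fo=json` query parameter described in the loc.gov API documentation; the function name `to_json_url` is hypothetical, and the sketch only builds the URL without fetching anything.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_json_url(page_url: str) -> str:
    """Return page_url with fo=json added to its query string.

    Assumption: loc.gov serves a JSON view of a page when the query
    string contains fo=json. This helper is illustrative only.
    """
    parts = urlsplit(page_url)
    # Keep existing parameters (for example sp=25) and drop any earlier fo value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "fo"
    ]
    query.append(("fo", "json"))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


if __name__ == "__main__":
    # Two of the image links listed in the entries above.
    for url in (
        "https://www.loc.gov/item/2003674057/",
        "https://www.loc.gov/resource/gdcmassbookdig.littleadventures00alls/?sp=25",
    ):
        print(to_json_url(url))
```

The resulting URL can be passed to any HTTP client. Existing parameters such as the page selector `sp` are preserved, so the request still points at the same page of the resource.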

Questions and Feedback

For curatorial questions about the source collection or technical questions about the dataset formats and composition, please contact the appropriate content specialist via the Library's Ask a Librarian service at https://ask.loc.gov/ .

For questions about download and access, documentation, or how a dataset was created, or to share more about your experience using a data package, please email the LC Labs team at [email protected].