
National Jukebox Data Package

This dataset contains metadata records and audio files for 5,882 audio recordings in the National Jukebox collection. The recordings date from 1900 to 1922.

A photograph of a vinyl record with a blue label bearing a gold musical note and white text that reads: Columbia Records
Hans und Liese

About this dataset

This dataset was created as part of an LC Labs experiment in collaboration with AVP to explore methods for creating a "general purpose" dataset, a format that is easily repeatable across collections that use the LOC API.

View source collection Browse collection items

Metadata: 5,882 records
Metadata formats: .csv, .json
Data files: 5,882 .mp3 audio files

Data package documentation

This data package includes comprehensive documentation of the source data and collection provenance, the contents of the data package, and how the data package was created. Here are some particular sections of interest as well as a link to the full documentation:

View the documentation

Dataset at a glance

How to access and use this data package

There are two main options for accessing and using this data package: (1) downloading files directly from this page and (2) using Python for more advanced tasks.

Direct downloads

The following list outlines the contents of this data package. Many of the individual files in the data package are linked directly on this page, so you can download and use them immediately. Zipped files are available for bulk download of all or part of the data package.

Sample the data
  • sample-data.zip (262.6 MB) - 100 randomly selected items from the 5,882 item set and their corresponding audio files have been provided as sample data. Included with this are a metadata.csv, metadata.json, and manifest.json.
  • sample-data/metadata.json (245.2 KB) - A JSON file containing the metadata for the 100 sample items
  • sample-data/metadata.csv (156.6 KB) - A CSV transformation of the sample JSON metadata
  • sample-data/manifest.html - For downloading individual audio files, this is a simple page that lists each audio file's file id, item id, MD5 hash (base64), file size, and URL
  • sample-data/manifest.json (17.4 KB) - A JSON file listing each audio file's file id, item id, MD5 hash (base64), file size, and URL
Download the documentation
  • README.html - An overview of the source data or collection provenance, the contents of the data package, and how the data package was created.
  • README.md (36.8 KB) - README as a Markdown text file
  • README.pdf (35.4 KB) - README as a PDF file
Download the metadata
Download the audio files
  • Due to the large number of files, they are available to download in batches
  • manifest.html - For downloading individual audio files, this is a simple page that lists each audio file's file id, item id, MD5 hash (base64), file size, and URL. For bulk downloads, refer to the following Using Python section.
  • manifest.txt (866.3 KB) - A text file listing each audio file's file id, item id, MD5 hash (base64), file size, and URL
  • manifest.json (946.7 KB) - A JSON file listing each audio file's file id, item id, MD5 hash (base64), file size, and URL
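
Because the manifests list an MD5 hash (base64) and a file size for each audio file, a downloaded file can be checked against its manifest entry. The following is a minimal sketch of such a check using only the Python standard library. The function names are hypothetical and not part of the data package, and how the expected values are read out of a manifest is left to the caller, since the exact manifest field names are not shown on this page.

```python
import base64
import hashlib
from pathlib import Path


def md5_base64(path, chunk_size=1 << 20):
    """Return the base64-encoded MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large audio files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def verify_file(path, expected_md5_b64, expected_size=None):
    """Return True if the file matches the expected hash (and size, if given)."""
    path = Path(path)
    # A size mismatch is cheap to detect, so check it before hashing.
    if expected_size is not None and path.stat().st_size != int(expected_size):
        return False
    return md5_base64(path) == expected_md5_b64
```

A file that fails this check was probably truncated or corrupted in transit and is worth downloading again.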

Using Python

While direct downloads are more convenient for most activities, users familiar with Python can perform more advanced and complex tasks programmatically.

For your convenience, we developed a number of Jupyter Notebooks to help you get started.

View the Python notebook for this data package
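
As one illustration of the kind of task that is easier to do programmatically, the sketch below tallies the rows of a metadata CSV by the values in one column, using only the standard library. It is not taken from the notebooks. The function name, the column name `year`, and the sample rows are made up for the demonstration; check the header row of metadata.csv for the real column names.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value in `column`.

    `csv_file` is any open text file object containing CSV with a header row.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Demonstration on made-up rows; the column names here are assumptions.
sample = io.StringIO("id,year\na,1910\nb,1910\nc,1915\n")
counts = count_by_column(sample, "year")
for value, total in counts.most_common():
    print(value, total)
```

To run this against the real data, pass a file opened with `open("metadata.csv", newline="", encoding="utf-8")` in place of the in-memory sample.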

Bulk downloads using Python

For bulk downloads, refer to this Python script for downloading files in bulk. Sample commands for this data package:

Download all audio files

python bulk_download.py --package "https://data.labs.loc.gov/jukebox/" --out "output/jukebox/"
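
For readers who want to see the general idea behind a manifest-driven download, here is a minimal standard-library sketch. It is not the implementation of bulk_download.py and does not reproduce its options. It assumes the manifest has already been loaded as a list of dictionaries and that each entry holds its URL under a key named `url`; both the structure and the key name are assumptions to verify against the actual manifest.json.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_from_manifest(entries, out_dir, url_key="url"):
    """Download each entry's URL into `out_dir`, skipping files already present.

    `entries` is assumed to be a list of dicts; `url_key` is a hypothetical
    key name and may differ in the real manifest.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for entry in entries:
        url = entry[url_key]
        # Use the last path segment of the URL as the local file name.
        target = out_dir / Path(urlparse(url).path).name
        if not target.exists():
            with urllib.request.urlopen(url) as response, open(target, "wb") as handle:
                shutil.copyfileobj(response, handle)
        saved.append(target)
    return saved
```

Skipping files that already exist lets an interrupted run be resumed by running the same call again, although a partially written file would need to be deleted first.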

Dataset details

Source collection

National Jukebox collection

Recordings in the National Jukebox come from the Recorded Sound Section of the Library of Congress, the University of California Santa Barbara, and a private collection, though the recordings selected for this dataset all come from the Recorded Sound Section of the Library of Congress. (This dataset does not include the roughly 8,000 recordings from the other two repositories from the selected time period.) All recordings included in the Jukebox were issued on record labels now owned by Sony Music Entertainment, which granted the Library of Congress a license to make the recordings available online. All of the recordings digitized so far under this license, including those in this dataset, were made by the Victor Talking Machine Company. These recordings were originally made on wax discs. Where multiple copies of the same recording were available, the disc in the best condition was selected for digitization.

Rights statement All recordings published before January 1, 1923 entered the public domain on January 1, 2022 under the Music Modernization Act of 2018. Based on the dates in the item metadata, all recordings included in this dataset are assumed to be in the public domain.
Date created 2023-05-05
Date updated 2024-03-28
Creators & contributors
Creator:
AVP
Contributors:
LC Labs
Recorded Sound Section
Cite this dataset
Chicago citation style:
Library Of Congress. National Jukebox Data Package. [Washington, D.C.: Library of Congress, 2023] Software, E-Resource. https://data.labs.loc.gov/jukebox/.
APA citation style:
Library Of Congress. (2023) National Jukebox Data Package. [Washington, D.C.: Library of Congress] [Software, E-Resource] Retrieved from the Library of Congress, https://data.labs.loc.gov/jukebox/.
MLA citation style:
Library Of Congress. National Jukebox Data Package. [Washington, D.C.: Library of Congress, 2023] Software, E-Resource. Retrieved from the Library of Congress, <data.labs.loc.gov/jukebox/>.
Curatorial questions For curatorial questions about the content of the collection or technical questions about the dataset formats and composition, please contact the Recorded Sound Section via the Library's Ask a Librarian service at https://ask.loc.gov/recorded-sound.
Access questions For questions and technical issues about download and access, please submit a ticket on GitHub or email the LC Labs Team at [email protected].