National Jukebox Data Package README

Version information

Version 1.1 | Last updated 2024-04-17

1.1 (2024-04-17) LC Labs consolidated coversheet and README content with minor formatting updates
1.0 (2023-05-05) First version

CONTENT ADVISORY

Please note that terminology in historical materials and in Library descriptions does not always match the language preferred by members of the communities depicted, and may include negative stereotypes or words that offend.

The selected files are presented as part of the record of the past. They are historical documents which reflect the language, attitudes, perspectives, and beliefs of different times. The Library of Congress does not endorse the views expressed in these recordings.

About the source data or collection

Brief description & background of collection

The Library of Congress presents the National Jukebox, which makes historical sound recordings available to the public free of charge. Recordings in the National Jukebox come from the Recorded Sound Section of the Library of Congress, the University of California–Santa Barbara, and a private collection, though the recordings selected for this dataset all come from the Recorded Sound Section of the Library of Congress. (This dataset does not include the roughly 8000 recordings from the other two repositories from the selected time period.) All recordings included in the Jukebox were issued on record labels now owned by Sony Music Entertainment, which granted the Library of Congress a license to make the recordings available online. All of the recordings included in this dataset were made by the Victor Talking Machine Company. These recordings were originally made on shellac discs. In cases where multiple copies of the same recording were available, the disc in the best condition was selected for digitization.

At launch, the Jukebox included more than 10,000 recordings made by the Victor Talking Machine Company between 1901 and 1925. Since the launch, Jukebox content has increased regularly, with additional Victor recordings and acoustically recorded titles made by other Sony-owned U.S. labels, including Columbia and Harmony.

Collection objects include digitized audio files and, where available, digital images of the record label.

More information can be found at:
National Jukebox collection

Original format

Mostly 78rpm shellac discs

Library of Congress reading room

Recorded Sound Reference Center, https://www.loc.gov/rr/record/

Contact

For questions or more information about this material, please contact Recorded Sound Research Center staff through the Ask a Librarian service.

Metadata type

Source metadata for this dataset was drawn from the LOC API, which sources the data for this collection from a custom database.

Scale of description

The recordings are described at the item (recording) level. Title, performers, composers, and recording information (such as date, location, matrix number, take number, and catalog number) are usually known. For most recordings, genres have been assigned by catalogers.

Rights information

All recordings published before January 1, 1923 entered the public domain on January 1, 2022 under the Music Modernization Act of 2018. Based on the dates in the item metadata, all recordings included in this dataset are assumed to be in the public domain.

Digitization information

NAVCC staff began digitization work on the recordings in March 2010. NAVCC audio engineers select the optimal stylus for each disc. Preservation master audio files are created in 24-bit/96 kHz Broadcast Wave format. The monaural recording transfer is made in two channels to capture both walls of the record groove before the file is summed to monaural on the digital audio workstation. The disc labels are scanned, and master 400 dpi TIFF files, 728x728 JPEG files, and 60x60 thumbnail images are generated for presentation on loc.gov. 320 kbps and 128 kbps MP3s are created from the WAV masters and delivered for streaming on loc.gov. Currently, the files available for download from loc.gov are limited to the 44.1 kHz MP3s. For most of the label images, four separate JPEGs at varying resolutions are currently available for download.

About this exploratory data package

This dataset contains metadata records and audio files for 5,882 audio recordings from the National Jukebox collection. The metadata records are included in CSV, JSON, and JSONL formats. The recordings date from 1900 to 1922. Most recordings are of popular music, but classical music, opera, jazz, musical theater, and other music genres are included, as are some monologue/dialogue recitations, speeches, and other spoken word recordings. Most recordings are in English, with some representation from Italian, French, German, Spanish, and other languages.

Recordings in the National Jukebox come from the Recorded Sound Section of the Library of Congress, the University of California–Santa Barbara, and a private collection, though the recordings selected for this dataset all come from the Recorded Sound Section of the Library of Congress. (This dataset does not include the roughly 8,000 recordings from the other two repositories from the selected time period.) All recordings included in the Jukebox were issued on record labels now owned by Sony Music Entertainment, which granted the Library of Congress a license to make the recordings available online. All of the recordings included in this dataset were made by the Victor Talking Machine Company. These recordings were originally made on shellac discs. In cases where multiple copies of the same recording were available, the disc in the best condition was selected for digitization.

This dataset was created as part of an LC Labs experiment in collaboration with AVP to understand the benefits, risks, quality benchmarks, workflows, compilation methods, transformations, and documentation practices required to assemble datasets for public use in the cloud.

The goal of creating this dataset in particular is to explore methods for creating a "general purpose" dataset, a format that is easily repeatable across collections that use the LOC API. Through the creation of this dataset for the experiment, LC Labs and AVP aimed to learn:
- how to identify metadata fields that should always be present in the dataset so users can understand each object being described and its context.
- which metadata fields or other API response information can be omitted for comprehensibility.
- which metadata fields can be standardized and which are easily repeatable across any collection.
- what documentation is necessary and/or would enhance understanding and use.

The target audiences of this dataset are users who are already skilled in data work and who want straightforward, consistent data, or who want to combine datasets from across the Library.

The dataset is organized by digital object, with each row (CSV) or JSON object representing a single digital object from the collection.

What's included?

The data package contains:

README: An overview of the source data or collection provenance, the contents of the data package, and how the data package was created. Available as .md, .html, and .pdf.
metadata.json: a JSON file containing the metadata for all 5,882 audio recordings
metadata.jsonl: a JSON lines version of the JSON data, with one record per line, useful for processing large files
metadata.csv: a CSV transformation of the original JSON metadata
audio/: .mp3 audio files associated with each of the 5,882 audio recordings.
summary/: directory of summary data (.csv) and visualizations (.jpg) of the fields included in the dataset and number of records populated for each, date histograms, and location distribution of dataset records
sample data: 100 randomly selected items from the 5,882-item set and their corresponding full text extracts have been provided as sample data. Included with this are a metadata.csv, metadata.json, and metadata.jsonl. (709 KB)
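As a minimal sketch of working with the JSON Lines file, the following reads records one line at a time so that the whole file never has to be held in memory. The `Id` and `Title` field names come from the field table in this README; the two sample records are invented for illustration, and the file path mentioned in the comment is an assumption about where the package was unpacked.

```python
import io
import json


def iter_records(lines):
    """Yield one metadata record (a dict) per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for metadata.jsonl. With the real package,
# pass an open file instead, e.g. open("metadata.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"Id": "example-1", "Title": "Example title one"}\n'
    '{"Id": "example-2", "Title": "Example title two"}\n'
)
titles = [record["Title"] for record in iter_records(sample)]
```

Because `iter_records` is a generator, it can be combined with filters or counters without loading all 5,882 records at once.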

Computational readiness and possible uses

The data in this dataset have been selected, structured, standardized, and enriched to make the dataset more easily comprehensible and computable through a range of methods and in a variety of environments. The standardization of contributor strings into structured names and roles could enable network analysis. Enrichment of recording locations with coordinates and structured addresses could support plotting records in mapping interfaces. Standardization of dates could enable sequencing in timelines.
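One way the structured contributor names might feed a network analysis is sketched below: it counts how often two names appear on the same recording. It assumes `Contributors` is a list of objects with `Name` and `Role` keys in the JSON records, as the field table suggests; the records themselves are invented placeholders.

```python
from collections import Counter
from itertools import combinations


def contributor_edges(records):
    """Count how often each pair of contributor names shares a recording."""
    edges = Counter()
    for record in records:
        # De-duplicate and sort names so each pair has one canonical order.
        names = sorted(
            {c["Name"] for c in record.get("Contributors", []) if c.get("Name")}
        )
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges


# Invented records using the Contributors.Name / Contributors.Role structure.
records = [
    {"Contributors": [{"Name": "Performer A", "Role": "Vocalist"},
                      {"Name": "Composer B", "Role": "Composer"}]},
    {"Contributors": [{"Name": "Performer A", "Role": "Vocalist"},
                      {"Name": "Composer B", "Role": "Composer"},
                      {"Name": "Lyricist C", "Role": "Lyricist"}]},
]
edges = contributor_edges(records)
```

The resulting pair counts can be handed to a graph library as weighted edges; which library to use is left open here.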

How was it created?

This dataset was created through a four-stage process including data extraction, mapping and standardization to a specified schema, enrichment of certain fields with additional data, and packaging for access and use.

Extraction

This dataset was created using the LOC JSON/YAML API and comprises a scoped portion of the National Jukebox collection, not every item in the collection. Subject matter experts were consulted in the creation of a JSON API query (https://www.loc.gov/collections/national-jukebox/?fa=partof_repository:recorded+sound+section,+library+of+congress|location:united+states&fo=json&at=results&dates=1900/1922) to produce rights-free audio recordings from 1900 to 1922 from the Recorded Sound Section of the Library of Congress, as a subset of what was available online in the collection on loc.gov in August 2022. This query returned 5,882 results.
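The query above can be assembled programmatically, which makes the individual filters easier to see. The sketch below only builds the URL and does not send a request. The `fa`, `fo`, `at`, and `dates` values are copied from the query in this section; the `sp` (page number) and `c` (results per page) parameters are assumptions that do not appear in this README and should be checked against the API documentation before use.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.loc.gov/collections/national-jukebox/"


def build_query_url(page=1, per_page=100):
    """Build the URL for one page of the extraction query."""
    params = {
        # Facet filters: holding repository and location.
        "fa": "partof_repository:recorded sound section, library of congress"
              "|location:united states",
        "fo": "json",          # response format
        "at": "results",       # return only the results portion
        "dates": "1900/1922",  # date range filter
        "sp": page,            # assumed paging parameter
        "c": per_page,         # assumed page-size parameter
    }
    # Keep ":", ",", "|" and "/" unescaped so the URL matches the form above.
    return BASE_URL + "?" + urlencode(params, safe=":,|/")


url = build_query_url(page=2)
```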

Standardization

The API fields in the response returned from the query were reviewed and selected for mapping to a schema designed for anticipated possible uses of the dataset. This schema comprises fields from the General Purpose schema plus additional format or collection-specific fields (see "Dataset field descriptions" below). In the mapping from the API response to the dataset schema, some data values were standardized for consistency across the dataset and interoperability with other datasets.

Standardizations include the following methods and are listed in detail in the "Dataset field descriptions" section.
- Capitalization (method): Data value has been capitalized using title or sentence case.
- Fill with (string): Data value has been filled with a static string. If "empty" is present, only empty values have been filled. Otherwise, all values for that field, including mapped values, have been filled or overwritten.
- Lookup (table): Data value has been looked up and replaced by a value in the specified lookup table.
- Select, sum (field): A specified field in an array of objects has been selected and summed. (This is currently only in use for totalling the number of files from the API resource field.)
- Contributor (delimiter): Name strings that include roles (ex. "Wright, Orville, 1871-1948, photographer") are split on a specified delimiter and structured as an object, with Name and Role fields.
- Split on (delimiter): An array of strings has been created from a string value by splitting on a specified delimiter.
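Two of these methods, Contributor and Split on, can be sketched as follows. This is an illustration under stated assumptions rather than the code used to build the dataset: it assumes the contributor string is split on the first occurrence of the delimiter only (so that a role such as "Vocalist -- Contralto" stays intact), that a missing role becomes an empty string, and that the input string shown is representative of the source data.

```python
import re


def split_contributor(value, delimiter="--"):
    """Split a name string into Name and Role on the first delimiter."""
    name, sep, role = value.partition(delimiter)
    return {"Name": name.strip(), "Role": role.strip() if sep else ""}


def split_on(value, delimiters=("/", ";")):
    """Split a string into a list of strings on any of the delimiters."""
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


# Hypothetical input strings, shaped like the examples in this README.
contributor = split_contributor("Egener, Minnie -- Vocalist -- Contralto")
languages = split_on("English; Italian")
```

Splitting on the first delimiter only is a design choice made for this sketch; splitting on every occurrence would break compound roles apart.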

Enrichment

After standardization, some data fields were enriched to bring additional value for potential use cases. In this dataset, recording locations (string values from the API item.recording_location field) were queried in OpenStreetMap and enriched with structured location data, geocoordinates, and URLs to the structured data record in OpenStreetMap. Because of the free text format of these data values in the original metadata (ex. 'Camden, New Jersey. Church Bldg'), some enriched results may be inaccurate. Users should take care to review the quality of the results as it relates to their use case before computing on this data.
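The shape of an enriched location can be illustrated by mapping a geocoder result onto the Location fields listed in the field table. The sketch below assumes a Nominatim-style result (Nominatim is a geocoder for OpenStreetMap data; this README does not say which service was used) with `display_name`, `lat`, `lon`, `osm_type`, `osm_id`, and `address` keys. The rule for `Short_name` and the sample values, including the placeholder `osm_id`, are assumptions made for illustration and are not real lookup results.

```python
def to_location(result):
    """Map a Nominatim-style search result to a Location-like object."""
    address = {
        key.capitalize(): value
        for key, value in result.get("address", {}).items()
    }
    return {
        "Full_name": result["display_name"],
        # Assumption: the short name is the first part of the display name.
        "Short_name": result["display_name"].split(",")[0].strip(),
        "Coordinates": [float(result["lat"]), float(result["lon"])],
        "Osm_url": "https://www.openstreetmap.org/"
                   f"{result['osm_type']}/{result['osm_id']}",
        "Address": address,
    }


# Hand-written stand-in for a geocoder response; values are illustrative.
result = {
    "display_name": "Camden, Camden County, New Jersey, United States",
    "lat": "39.94",
    "lon": "-75.12",
    "osm_type": "relation",
    "osm_id": 1,  # placeholder identifier
    "address": {"city": "Camden", "state": "New Jersey",
                "country": "United States"},
}
location = to_location(result)
```

A free-text value such as 'Camden, New Jersey. Church Bldg' may match the wrong place or none at all, which is why spot-checking enriched records against `Location_text` is worthwhile.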

Packaging

After enrichment, the dataset was output in JSON, JSONL (JSON lines), and CSV formats. CSV files were flattened from JSON using the following rules:
- Arrays were flattened to strings, with array items delimited by the pipe character.
- Contributors and Creators fields were flattened using the following rules:
  - Contributors: Contributors.Name and Contributors.Role are concatenated with ', ' (ex. 'Egener, Minnie, Vocalist -- Contralto') in the field Contributors. Multiple contributor or creator strings are then joined into a string, delimited by the pipe character.
  - Contributors_names: Contributors.Name for all objects in a list are joined into a string and delimited with the pipe character (similar to flattening lists, above) and added to a column called Contributors_names.
  - Contributors_[role]: Additional columns are added for names by roles, appending the role to 'Contributors' or 'Creators', example: {'Name': 'Egener, Minnie', 'Role': 'Vocalist -- Contralto'} becomes 'Egener, Minnie' under the column named Contributors_vocalist -- contralto.
- Enriched locations (Location) are included in their original JSON form in the Location column. The Location.Full_name is included in a Location_full_name column with multiple locations delimited by the pipe character. Coordinates from Location.Coordinates are listed as lat,long pairs in the Coordinates column in the same order as in Location_full_name.
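The array and contributor rules above can be sketched roughly as follows. This is an approximation and not the packaging code itself. It names the concatenated name-and-role column `Contributors_text`, following the field table later in this README, and it leaves out the Location handling for brevity. The sample record is invented apart from the Egener example quoted above.

```python
def flatten_record(record):
    """Flatten one JSON record into a CSV-style row (a flat dict)."""
    row = {}
    for field, value in record.items():
        if field == "Contributors":
            continue  # handled separately below
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            row[field] = "|".join(value)  # arrays become pipe-delimited
        else:
            row[field] = value
    contributors = record.get("Contributors", [])
    # Name and role joined with ", ", then contributors joined with "|".
    row["Contributors_text"] = "|".join(
        f"{c['Name']}, {c['Role']}" for c in contributors
    )
    row["Contributors_names"] = "|".join(c["Name"] for c in contributors)
    # One extra column per role, named after the lowercased role.
    by_role = {}
    for c in contributors:
        column = "Contributors_" + c["Role"].lower()
        by_role.setdefault(column, []).append(c["Name"])
    for column, names in by_role.items():
        row[column] = "|".join(names)
    return row


record = {
    "Title": "Example title",
    "Genre": ["Opera", "Classical music"],
    "Contributors": [
        {"Name": "Egener, Minnie", "Role": "Vocalist -- Contralto"},
        {"Name": "Composer B", "Role": "Composer"},
    ],
}
row = flatten_record(record)
```

Because names themselves contain commas, the pipe-delimited `Contributors_names` column is usually easier to split reliably than the concatenated text column.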

Dataset field descriptions

The data fields that follow were compiled from a "General Purpose" schema designed for the Data Transformation Services experiment and supplemented with additional fields specific to this collection and/or anticipated uses of the data. Values have been sourced from API fields or templated with static values where necessary. These mappings are indicated in the "Data source" column in the table below. Some values have been standardized for consistency across the dataset or interoperability with other datasets using similar data structures, standards, or controlled vocabularies. Types and descriptions of standardizations are listed above in the "How was it created?" section and indicated in the "Standardization" column of the table below. Enrichments are also described in the "How was it created?" section above and are indicated below.

The data fields that follow are directly translated from the metadata.json file. The JSON file is nested in nature, and that nested structure is not strictly carried over into the CSV. When JSON fields have been flattened or otherwise altered to fit a CSV field, the transformation is described below.

Each of the fields described below appears for an object or row in the dataset. Please note that not all elements appear for each result. The number and percentage of results populated for each field are indicated in the table below as well as in a summary.csv file in this package.
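The populated percentages could be recomputed from the JSON records along the following lines. This is a sketch and not the method used to produce summary.csv; treating missing keys, null values, empty strings, empty lists, and empty objects as "not populated" is an assumption, and the records are invented.

```python
def percent_populated(records, fields):
    """Return the percentage of records with a non-empty value per field."""
    total = len(records)
    empty_values = (None, "", [], {})
    result = {}
    for field in fields:
        populated = sum(1 for r in records if r.get(field) not in empty_values)
        result[field] = round(100 * populated / total, 2) if total else 0.0
    return result


# Invented records: every record has a Title, three of four have a Language.
records = [
    {"Title": "A", "Language": ["English"]},
    {"Title": "B", "Language": ["Italian"]},
    {"Title": "C", "Language": []},
    {"Title": "D", "Language": ["English", "French"]},
]
stats = percent_populated(records, ["Title", "Language", "Notes"])
```

Note that a boolean `False` (for example in `Digitized`) counts as populated under this rule, since it is a real value rather than a missing one.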

Field	Datatype	Definition	Requirement	Repeatability	Data Source (from API unless otherwise noted)	Standardization	Percent Populated
Audio_type	Text	The type of musical or non-musical content.	Optional	N	item.audio_type		100%
Contributors	Object	(JSON only) All names associated with the creation of the resource, including creators.	Optional	Y	item.contributors	Contributor ("--")	100%
Contributors.Name	Text	The name of the contributor.	Optional	N	item.contributors	Contributor ("--")
Contributors.Role	Text	The role of the contributor.	Optional	N	item.contributors	Contributor ("--")
Contributors_names	Text	(CSV only) Contributor names. Multiple names are delimited with the pipe character.	Optional	Y	item.contributors	Contributor ("--")
Contributors_text	Text	(CSV only) Contributor name concatenated with role (if available). Multiple names are delimited with the pipe character.	Optional	Y	item.contributors	Contributor ("--")
Contributors_[role]	Text	(CSV only) Contributor names organized into columns by their role. Examples: "Contributors_photographer" or "Contributors_composer"	Optional	Y	item.contributors	Contributor ("--")
Coordinates	Text	(CSV only) The coordinate pair (lat, long) of a location matched through the OpenStreetMap location enrichment. Multiple coordinate sets are delimited with the pipe character. Corresponds to the location string in the same position in the Location_full_name column.	Optional	Y	Enrichment: Location
Date	Date (EDTF)	A structured representation of the date created.	Optional	Y	date		100%
Date_text	Text	A textual representation of the date.	Optional	Y	item.recording_date		100%
Description	Text	A description or summary of the contents of the resource.	Optional	Y	description		100%
Digitized	Boolean	Whether or not the resource described is digitized.	Optional	N	item.summary digitized		100%
Genre	Text	A genre for the resource.	Optional	Y	item.genre		99.93%
Id	Text	A unique identifier for the resource.	Required	N	id		100%
IIIF_manifest	URL	A IIIF manifest for the digital object, if available	Optional				100%
Language	Text	The language(s) of the content of the resource.	Optional	Y	item.language	Split on ("/", ";")	66.95%
Last_updated_in_api	Timestamp	The date and time the metadata was last refreshed in the API. This may or may not reflect a change in the data.	Optional	N	timestamp
Location	Object	Structured representation of a location, including parent administrative divisions, where applicable, and geocoordinates	Optional	Y	location		99.98%
Location_full_name	Text	(CSV only) The full location string of a location matched through the OpenStreetMap location enrichment. Multiple locations are delimited with the pipe character. Corresponds to the coordinate pair in the same position in the Coordinates column.	Optional	Y	Enrichment: Location
Location_temp	Text	Temporary field for storing extracted location data for enrichment stage. This is removed during the enrichment stage.	Optional	Y			N/A
Location_text	Text	Textual representation of a location, copied directly from the source record	Optional	Y	item.recording_location		100%
Location.Address	Text	The administrative divisions that make up the full address of the location.	Required	N	Enrichment: Location
Location.Address.[place_type]	Text	An administrative division of the location, ex. Country, State, or City. [place_type] is replaced by the address type from the address block in the OpenStreetMap data. All parent administrative divisions as well as the type of location are included.	Required	N	Enrichment: Location
Location.Coordinates	List	The coordinates (lat, long) of the location.	Required	N	Enrichment: Location
Location.Full_name	Text	The full display name of the location, taken from OpenStreetMap.	Required	N	Enrichment: Location
Location.Osm_url	URL	The OpenStreetMap URL for the location.	Required	N	Enrichment: Location
Location.Short_name	Text	The name of the lowest level administrative division of the location.	Required	N	Enrichment: Location
Media_size	Text	The size of the original media.	Optional	N	item.media_size		99.88%
Mime_type	Text	The MIME type(s) of the files composing the digital object.	Optional	Y	mime_type		100%
Notes	Text	Additional information about the content, context, or physical description of the resource.	Optional	Y			0%
Number_of_files	Integer	Number of files composing the digital object.	Optional	N	resources	Select, sum ("files")	100%
Online_format	Text	The format of the online version of the resource.	Optional	Y	online_format	Capitalization (sentence)	100%
Original_format	Text	The format the resource was digitized from.	Optional	Y	original_format	Capitalization (sentence)	100%
Part_of	Text	Groups the resource is a part of, such as source collection or repository.	Optional	Y	partof	Capitalization (title)	100%
Preview_url	URL	A url for a preview image or thumbnail for the digital object.	Optional	Y	image_url		90.41%
Recording_catalog_number	Text	The catalog number of the recording.	Optional	N	item.recording_catalog_number		100%
Recording_label	Text	The label that published the recording.	Optional	N	item.recording_label		100%
Recording_matrix_number	Text	The matrix number of the recording.	Optional	N	item.recording_matrix_number		100%
Recording_take_id	Text	The take ID of the recording.	Optional	N	item.recording_take_id		100%
Recording_take_number	Text	The take number of the recording.	Optional	N	item.recording_take_number		100%
Repository	Text	The repository that holds the physical or digital resource.	Optional	Y	partof_repository	Capitalization (title)	100%
Rights	Text	Rights or access information associated with the resource.	Optional	Y	item.rights_advisory		100%
Shelf_id	Text	An identifier for finding the original physical resource.	Optional	Y	shelf_id		100%
Source_collection	Text	The collection the resource belongs to.	Optional	Y		Fill with ("National Jukebox", empty)	100%
Subjects	Text	Subjects or keywords associated with the resource.	Optional	Y	subject	Capitalization (sentence)	100%
Title	Text	The primary title or description of the resource.	Required	N	title		100%
Type_of_resource	Text	A term that specifies the characteristics and general type of content of the resource, such as "Still image" or "Text." Based on MODS 3.7 enumerated list of values for typeOfResource: https://www.loc.gov/standards/mods/userguide/typeofresource.html	Required	Y	type	Lookup (typeofresource)	100%
Url	URL	The Digital Collections URL for the resource.	Optional	N	url		100%
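As an example of using the CSV-only location columns described in the table above, the sketch below pairs each entry in `Location_full_name` with the coordinate pair in the same position in `Coordinates`, which is the form a mapping tool would typically need. The inline CSV is invented; the exact formatting of the lat,long pairs in the real file (for example, whether a space follows the comma) is an assumption, although `float` tolerates surrounding whitespace.

```python
import csv
import io


def locations_with_coordinates(row):
    """Pair each Location_full_name entry with its (lat, long) values."""
    names = [n for n in row.get("Location_full_name", "").split("|") if n]
    pairs = [p for p in row.get("Coordinates", "").split("|") if p]
    result = []
    for name, pair in zip(names, pairs):
        lat, lon = (float(x) for x in pair.split(","))
        result.append((name, lat, lon))
    return result


# Invented CSV content; with the real package, open metadata.csv instead.
sample = io.StringIO(
    "Id,Location_full_name,Coordinates\n"
    'example-1,"Place A, Region A|Place B, Region B","10.5,-20.25|30.0,40.75"\n'
)
rows = list(csv.DictReader(sample))
located = locations_with_coordinates(rows[0])
```

Using `csv.DictReader` rather than splitting lines by hand matters here, because both columns contain commas inside quoted values.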

Rights Statement

Creator and contributor information

Creator: AVP

Contributors: LC Labs

Contact information

Please contact [email protected] with any questions or suggestions!