Free to Use and Reuse Data Package README

Version information

Version 1.2 | Last updated 2024-04-17

1.2 (2024-04-17) LC Labs consolidated coversheet and README content with minor formatting updates
1.1 (2023-07-26) README was updated with a note of caution regarding the geocoded location data and explanatory notes on several data fields.
1.0 (2023-05-05) First version

CONTENT ADVISORY

Please note that terminology in historical materials and in Library descriptions does not always match the language preferred by members of the communities depicted, and may include negative stereotypes or words that offend.

About the source data or collection

Brief description & background of collection

Free to Use and Reuse is a collection of themed sets curated by Library staff. Themes are intentionally varied, to illustrate the depth and breadth of the Library’s collections. Themes have included skyscrapers, natural disasters, birds, shoes, games and more. These sets are just a small sample of the Library's digital collections that are free to use and reuse. The Library believes that this content is either in the public domain, has no known copyright, or has been cleared by the copyright owner for public use. The digital collections comprise millions of items including books, newspapers, manuscripts, prints and photos, maps, musical scores, films, sound recordings and more. A new set is usually added every month.

Original format

Wide range of formats

Library of Congress reading room

For more information please contact the Ask a Librarian Service at https://ask.loc.gov/.

Contact

For more information please contact the Ask a Librarian Service at https://ask.loc.gov/.

Metadata type

The data available through the API for this dataset is sourced and transformed from many different types of records, including, but not limited to MARC records, MODS records, bespoke database records, and more.

Scale of description

Because items are drawn from many collections, the level of description may vary.

Rights information

The Free to Use and Reuse Sets are curated selections from the Library's digital collections that are either in the public domain, have no known copyright, or have been cleared by the copyright owner for public use. For more information, see https://www.loc.gov/free-to-use/.

About this exploratory data package

This dataset contains metadata records and images for 2,610 curated selections featured in the Library of Congress' Free to Use and Reuse Sets as well as links to images for the full digital objects represented in the sets. This dataset includes only those items which are accessible via the Library of Congress' API.

Because of the wide range of materials selected for Free to Use, we are unable to enumerate what languages, time periods, or genres are contained in the dataset.

The dataset was created as part of an LC Labs experiment in collaboration with AVP to understand the benefits, risks, quality benchmarks, workflows, compilation methods, transformations, and documentation practices required to assemble datasets for public use in the cloud.

The goal of this dataset in particular is to explore ways of combining items from multiple collections, across media types, with different forms of existing metadata into a single, cohesive dataset. Although the Free to Use and Reuse Sets were originally created on specific themes, the comprehensive dataset will allow users not only to rearrange and reimagine the contents according to themes and priorities of their own, but also to consider what organizational schemes can tell us about the priorities of the people and institutions that created them. Through the creation of this dataset for the experiment, LC Labs and AVP aimed to learn:

how we can harmonize data from different collections into a general purpose dataset.
what fields will usually need special attention for standardization.
what fields are more likely to have greater variability in semantics or syntax.
what can we assume will be lost when combining data from multiple collections.
what potential assumptions might be unintentionally supported by combining data from multiple collections.

The dataset is organized by digital object, with each row (CSV) or JSON object representing a single digital object from the collection.
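Because each record stands on its own, the JSON Lines file in this package (metadata.jsonl) can be read one record at a time. The following is a minimal sketch of that approach, not part of the package tooling. The helper name iter_records and the two sample records are invented for illustration; in practice the stream would come from opening metadata.jsonl.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("metadata.jsonl", encoding="utf-8"). These two records
# are invented for illustration and are not taken from the dataset.
sample = io.StringIO(
    '{"Id": "example-1", "Title": "First example"}\n'
    '{"Id": "example-2", "Title": "Second example"}\n'
)
records = list(iter_records(sample))
print(len(records), records[0]["Title"])
```

Reading line by line keeps memory use flat, which is the main reason to prefer the JSONL file over metadata.json for larger jobs.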

What's included?

The data package contains:

full data package, zipped
README: An overview of the source data or collection provenance, the contents of the data package, and how the data package was created. Available as .md, .html, and .pdf.
metadata.json : a JSON file containing the metadata for all 2,610 items
metadata.jsonl : a JSON lines version of the JSON data, with one record per line, useful for processing large files
metadata.csv: a CSV transformation of the original JSON metadata (9.4 MB)
manifest.txt: a text file listing each file asset id, MD5 hash (base64), file size, and location of the files in the data set
data/: directory of the image file assets
summary/: directory of summary data (.csv) and visualizations (.jpg) of the fields included in the dataset and number of records populated for each, date histograms, and location distribution of dataset records
sample-data/: 100 randomly selected items from the 2,610 item set and their corresponding full text extracts have been provided as sample data. Included with this are a metadata.csv, metadata.json, metadata.jsonl, and manifest.txt.
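The manifest records an MD5 hash for each file in base64 rather than the more common hexadecimal form. The sketch below shows one way to compute a comparable value for a downloaded file. The function names are hypothetical, and the column layout of manifest.txt is not assumed here: the expected hash is simply passed in as a string.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Return the MD5 digest of `data`, base64-encoded."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def file_md5_base64(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the base64 MD5 of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_manifest(data: bytes, expected_b64: str) -> bool:
    """Compare computed and expected hashes (expected value copied from the manifest)."""
    return md5_base64(data) == expected_b64.strip()


# The MD5 of empty input is a well-known value, used here as a self-check.
print(md5_base64(b""))
```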

Computational readiness and possible uses

The data in this dataset have been selected, structured, standardized, and enriched to make the dataset more easily comprehensible and computable through a range of methods and in a variety of environments. The standardization of contributor strings into structured names and roles could enable network analysis. Enrichment of locations with coordinates and structured addresses could support plotting records in mapping interfaces. Standardization of dates could enable sequencing in timelines.
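As one illustration of the network analysis mentioned above, the sketch below counts how often two contributor names appear on the same record. It assumes the JSON form of the data, where Contributors is a list of objects with a Name field. The function name and the sample records are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_edges(records):
    """Count pairs of contributor names that appear together on a record."""
    edges = Counter()
    for record in records:
        contributors = record.get("Contributors") or []
        # De-duplicate and sort so each pair has one canonical ordering.
        names = sorted({c.get("Name") for c in contributors if c.get("Name")})
        edges.update(combinations(names, 2))
    return edges


# Invented records, shaped like the Contributors field described in this README.
sample_records = [
    {"Contributors": [{"Name": "Person A", "Role": "Photographer"},
                      {"Name": "Person B", "Role": "Publisher"}]},
    {"Contributors": [{"Name": "Person B"}, {"Name": "Person A"},
                      {"Name": "Person C"}]},
    {"Contributors": []},
]
edges = co_occurrence_edges(sample_records)
print(edges.most_common(1))
```

The resulting pair counts can be loaded into a graph library as weighted edges. Name variants for the same person are not reconciled here and would need separate cleanup.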

How was it created?

This dataset was created through a four-stage process including data extraction, mapping and standardization to a specified schema, enrichment of certain fields with additional data, and packaging for access and use.

Extraction

This dataset was created using the LOC JSON/YAML API and comprises all digitized selections from the Free to Use and Reuse Sets as of {date} that are available through the API. The following query was used to walk through the list of item sets: https://www.loc.gov/free-to-use/?fo=json. Then, metadata for each item was retrieved using a query in the following format, where item['link'] is a field returned by the initial API call for each item in a given set: https://loc.gov{item['link']}?fo=json. This process returned 2,610 results.
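The two query formats above can be sketched as follows. This is an illustration only, not the script used to build the dataset: the helper names are hypothetical, the sample link value is invented, and the structure of the JSON returned by the sets query is not assumed.

```python
import json
from urllib.request import urlopen

# Query used to walk through the list of item sets (quoted from the text above).
SETS_URL = "https://www.loc.gov/free-to-use/?fo=json"


def item_json_url(link: str) -> str:
    """Build the per-item query from an item's 'link' value."""
    return f"https://loc.gov{link}?fo=json"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and parse the response body as JSON (makes a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# "/item/example-id/" is a made-up link used only to show the URL format.
print(item_json_url("/item/example-id/"))
```

Anyone repeating the extraction should pace their requests and expect the result count to differ from 2,610 as sets are added.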

Standardization

The API fields in the response returned from the query were reviewed and selected for mapping to a schema designed for anticipated possible uses of the dataset. This schema comprises fields from the General Purpose schema plus additional format or collection-specific fields (see "Dataset field descriptions" below). In the mapping from the API response to the dataset schema, some data values were standardized for consistency across the dataset and interoperability with other datasets.

Standardizations include the following methods and are listed in detail in the "Dataset field descriptions" section.

Capitalization (method): Data value has been capitalized using title or sentence case
Fill with (string): Data value has been filled with a static string. If "empty" is present, only empty values have been filled. Otherwise, all values for that field, including mapped values, have been filled or overwritten.
Lookup (table): Data value has been looked up and replaced by a value in the specified lookup table.
Select, sum (field): A specified field in an array of objects has been selected and summed. (This is currently only in use for totalling the number of files from the API resource field).
Select, array (field): A specified field in an array of objects has been selected and constructed as an array.
Contributor (delimiter): Name strings that include roles (ex. "Wright, Orville, 1871-1948, photographer") are split on a specified delimiter and structured as an object, with Name and Role fields.
Build an object (source data): A new object (or array of objects) has been created by mapping or templating values from the source data field.
Split on (delimiter): An array of strings has been created from a string value by splitting on a specified delimiter.
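Several of the methods listed above can be sketched in a few lines each. The functions below are hypothetical illustrations, not the code used to build the dataset. In particular, split_contributor assumes that the role is whatever follows the last delimiter, which the source does not confirm, and the sample inputs other than the Wright example are invented.

```python
def split_contributor(value: str, delimiter: str = ", ") -> dict:
    """Contributor (delimiter): split a name string into Name and Role.

    Assumption: the role is the text after the LAST delimiter.
    """
    name, sep, role = value.rpartition(delimiter)
    if not sep:  # delimiter not found, so there is no role to split off
        return {"Name": value, "Role": None}
    return {"Name": name, "Role": role}


def split_on(value: str, delimiter: str) -> list:
    """Split on (delimiter): build a list of trimmed, non-empty strings."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


def select_sum(objects: list, field: str) -> int:
    """Select, sum (field): total one numeric field across a list of objects."""
    return sum(obj.get(field, 0) for obj in objects)


def select_array(objects: list, field: str) -> list:
    """Select, array (field): collect one field from each object into a list."""
    return [obj[field] for obj in objects if field in obj]


print(split_contributor("Wright, Orville, 1871-1948, photographer"))
print(select_sum([{"files": 3}, {"files": 2}], "files"))
```

Splitting on the last delimiter mislabels strings that carry no role: "Wright, Orville, 1871-1948" would yield "1871-1948" as the role. A real pipeline needs a check for that case.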

Enrichment

After standardization, some data fields were enriched to bring additional value for potential use cases. In this dataset, locations (string values from the API location, item.place, and item.location fields) were queried in OpenStreetMap and enriched with structured location data, geocoordinates, and URLs to the structured data record in OpenStreetMap. Because of the free text format of these data values in the original metadata (ex. 'Camden, New Jersey. Church Bldg'), some enriched results may be inaccurate. Users should take care to review the quality of the results as it relates to their use case before computing on this data.

Packaging

After enrichment, the dataset was output in JSON, JSONL (JSON lines), and CSV formats. CSV files were flattened from JSON using the following rules:

- Arrays were flattened to strings, with array items delimited by the | character.
- Contributors and Creators fields were flattened using the following rules:
  - Contributors: Contributors.Name and Contributors.Role are concatenated with ', ' (ex. 'Egener, Minnie, Vocalist -- Contralto') in the field Contributors. Multiple contributor or creator strings are then joined into a string, delimited by the | character.
  - Contributors_names: Contributors.Name for all objects in a list are joined into a string and delimited with | (similar to flattening lists, above) and added to a column called Contributors_names.
  - Contributors_[role]: Additional columns are added for names by roles, appending the role to 'Contributors' or 'Creators'. Example: {'Name': 'Egener, Minnie', 'Role': 'Vocalist -- Contralto'} becomes 'Egener, Minnie' under the column named Contributors_vocalist -- contralto.
- Other_record_field: Type and Id for each object are concatenated with ': ', ex. 'MODS record; https://lccn.loc.gov/2001696430/mods'. Objects are joined into a string, delimited by |.
- Enriched locations (Location) are included in their original JSON form in the Location column. The Location.Full_name is included in a Location_full_name column with multiple locations delimited by the | character. Coordinates from Location.Coordinates are listed as lat,long pairs in the Coordinates column in the same order as in Location_full_name.
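The contributor flattening rules can be sketched as below. This is an illustration under stated assumptions, not the packaging code itself: the function names are hypothetical, the concatenated column is called Contributors_text to match the field table later in this README, and role-based column names are lowercased because the example above shows 'Contributors_vocalist -- contralto'.

```python
def flatten_array(values: list) -> str:
    """Flatten a list to a single string, delimited by the | character."""
    return "|".join(str(value) for value in values)


def flatten_contributors(contributors: list, prefix: str = "Contributors") -> dict:
    """Flatten a list of {"Name", "Role"} objects into CSV-style columns."""
    texts, names, by_role = [], [], {}
    for contributor in contributors:
        name = contributor.get("Name") or ""
        role = contributor.get("Role")
        # Name and Role are concatenated with ', ' when a role is present.
        texts.append(f"{name}, {role}" if role else name)
        names.append(name)
        if role:
            # Assumption: the role is lowercased in the column name.
            by_role.setdefault(f"{prefix}_{role.lower()}", []).append(name)
    row = {
        f"{prefix}_text": flatten_array(texts),
        f"{prefix}_names": flatten_array(names),
    }
    for column, role_names in by_role.items():
        row[column] = flatten_array(role_names)
    return row


example_row = flatten_contributors(
    [{"Name": "Egener, Minnie", "Role": "Vocalist -- Contralto"}]
)
print(example_row)
```

Because names themselves contain commas, the pipe is the only safe delimiter to split on when reading these columns back.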

Dataset field descriptions

The data fields that follow were compiled from a "General Purpose" schema designed for the Data Transformation Services experiment and supplemented with additional fields specific to this collection and/or anticipated uses of the data. Values have been sourced from API fields or templated with static values where necessary. These mappings are indicated in the "Data source" column in the table below. Some values have been standardized for consistency across the dataset or interoperability with other datasets using similar data structures, standards, or controlled vocabularies. Types and descriptions of standardizations are listed above in the "How was it created" section and indicated in the "Standardization" column of the table below. Enrichments are also described in the "How was it created" section and are indicated below.

The data fields that follow are directly translated from the metadata.json file. The JSON file is nested in nature, and that nested structure is not strictly carried over into the CSV. When JSON fields have been flattened or otherwise altered to fit a CSV field, the transformation is described below.

Each of the fields described below appears for an object or row in the dataset. Please note that not all elements appear for each result. The number and percentage of results populated for each field are indicated in the table below as well as in a summary.csv file in this package.

Field	Datatype	Definition	Requirement	Repeatability	Data Source (from API unless otherwise noted)	Standardizations	Percent Populated
Call_number	Text	An identifier for finding the original physical resource.	Optional	Y	call_number		97.14%
Contributors	List of objects	(JSON only) All names associated with the creation of the resource, including creators.	Optional	Y	item.contributors creator	Contributor	98.81%
Contributors.Name	Text	The name of the contributor.	Optional	N	item.contributors creator
Contributors.Role	Text	The role of the contributor.	Optional	N	item.contributors creator
Contributors_names	Text	(CSV only) Contributor names. Multiple names are delimited with the pipe character.	Optional	Y	item.contributors creator
Contributors_text	Text	(CSV only) Contributor name concatenated with role (if available). Multiple names are delimited with the pipe character.	Optional	Y	item.contributors creator
Contributors_[role]	Text	(CSV only) Contributor names organized into columns by their role. Examples: "Contributors_photographer" or "Contributors_composer"	Optional	Y	item.contributors creator
Coordinates	Text	(CSV only) The coordinate pair (lat, long) of a location matched through the OpenStreetMap location enrichment. Multiple coordinate sets are delimited with the pipe character. Corresponds to the location string in the same position in the Location_full_name column.	Optional	Y	Enrichment: Location
Creators	List of objects	(JSON only) Names of individuals or organizations primarily responsible for the creation of the resource.	Optional	Y	item.creators	Build an object ("Name":value.title, "Role":value.role)	66.18%
Creators.Name	Text	The name of the creator.	Optional	N	item.creators
Creators.Role	Text	The role of the creator.	Optional	N	item.creators
Date	Date (EDTF)	A structured representation of the date created.	Optional	Y	date		99.7%
Date_text	Text	A textual representation of the date.	Optional	Y	created_published_date		84.19%
Description	Text	A description or summary of the contents of the resource.	Optional	Y	item.contents item.summary summary		30.69%
Digitized	Boolean	Whether or not the resource described is digitized.	Optional	N	digitized		100%
Genre	Text	A genre for the resource.	Optional	Y	genre		69.53%
Id	Text	A unique identifier for the resource.	Required	N	id		100%
IIIF_manifest	URL	A IIIF manifest for the digital object, if available	Optional		iiif_manifest		100%
Index	Integer	The index location within the digital object of the image or file featured in Free to Use.	Required	N	index		100%
Language	Text	The language(s) of the content of the resource.	Optional	Y	language	Capitalization (sentence)	99.29%
Last_updated_in_api	Timestamp	The date and time the metadata was last refreshed in the API. This may or may not reflect a change in the data.	Optional	N	timestamp
Lccn	Text	A Library of Congress Control Number for the resource, if available.	Optional	Y	library_of_congress_control_number		88.24%
Location	Object	Structured representation of a location, including parent administrative divisions, where applicable, and geocoordinates	Optional	Y	Enrichment: Location		66.93%
Location_full_name	Text	(CSV only) The full location string of a location matched through the OpenStreetMap location enrichment. Multiple locations are delimited with the pipe character. Corresponds to the coordinate pair in the same position in the Coordinates column.	Optional	Y	Enrichment: Location
Location_temp	Text	Temporary field for storing extracted location data for enrichment stage. This is removed during the enrichment stage.	Optional	Y	location		N/A
Location_text	Text	Textual representation of a location, copied directly from the source record	Optional	Y	item.location item.place location	Capitalization (title)	70.46%
Location.Address	Text	The administrative divisions that make up the full address of the location.	Required	N	Enrichment: Location
Location.Address.[place_type]	Text	An administrative division of the location, ex. Country, State, or City. [place_type] is replaced by the address type from the address block in the OpenStreetMap data. All parent administrative divisions as well as the type of location are included.	Required	N	Enrichment: Location
Location.Coordinates	List	The coordinates (lat, long) of the location.	Required	N
Location.Full_name	Text	The full display name of the location, taken from OpenStreetMap.	Required	N
Location.Osm_url	URL	The OpenStreetMap URL for the location.	Required	N
Location.Short_name	Text	The name of the lowest level administrative division of the location.	Required	N
Medium	Text	The medium of the resource.	Optional	Y	medium		95.16%
Mime_type	Text	The MIME type(s) of the files composing the digital object.	Optional	Y	mime_type		100%
Notes	Text	Additional information about the content, context, or physical description of the resource.	Optional	Y	notes		97.4%
Number_of_files	Integer	Number of files composing the full digital object that is featured in Free to Use.	Optional	N	resources	Select, sum ("files")	98.77%
Online_format	Text	The format of the online version of the resource.	Optional	Y	online_format		100%
Original_format	Text	The format the resource was digitized from.	Optional	Y	original_format		100%
Other_record_formats	Object	Other descriptive record formats for the resource, such as MARC or MODS.	Optional	Y	item.marc other_formats	Build an object ("Type":"MARC Record", "Id":value) Build an object ("Type":value.label, "Id":value.link)	90.44%
Other_record_formats.Id	URL	A URL pointing to the other record format.	Required		item.marc other_formats
Other_record_formats.Type	Text	The type of record format available at the URL provided.	Required		item.marc other_formats
Other_title	Text	An alternative title for the resource, such as a translated title.	Optional	Y	item.alternate_title item.other_title item.translated_title		8.11%
Part_of	Text	Groups the resource is a part of, such as source collection or repository.	Optional	Y	partof	Select, array (title)	100%
Preview_url	URL	A url for a preview image or thumbnail for the digital object.	Optional	Y	image_url		100%
Repository	Text	The repository that holds the physical or digital resource.	Optional	Y	repository		94.83%
Rights	Text	Rights or access information associated with the resource.	Optional	Y	item.rights_advisory item.rights rights		97.73%
Set	Text	A curated set that the resource is a part of.	Optional	Y	ftu_set set	Capitalization (title)	100%
Shelf_id	Text	An identifier for finding the original physical resource.	Optional	Y	shelf_id		100%
Source_collection	Text	The collection the resource belongs to.	Optional	Y	item.source_collection source_collection		56.99%
Subject_headings	Text	LCSH subject headings	Optional	Y	subject_headings		90.33%
Subjects	Text	Subjects or keywords associated with the resource.	Optional	Y	subject		97.06%
Title	Text	The primary title or description of the resource.	Required	N	title		100%
Type_of_resource	Text	A term that specifies the characteristics and general type of content of the resource, such as "Still image" or "Text." Based on MODS 3.7 enumerated list of values for typeOfResource: https://www.loc.gov/standards/mods/userguide/typeofresource.html	Required	Y	item.format	Lookup (type_of_resource)	92.71%
Url	URL	The Digital Collections URL for the resource.	Optional	N	url		100%
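When working from the CSV, the Coordinates column and the Location_full_name column (as it is named in the Packaging section) can be paired back up by position. The sketch below assumes each coordinate entry is a plain "lat,long" pair with no brackets, which should be checked against the actual file. The function name and the sample values are invented for illustration.

```python
def parse_locations(full_names: str, coordinates: str) -> list:
    """Pair pipe-delimited location names with their lat,long coordinates."""
    names = [name.strip() for name in full_names.split("|")] if full_names else []
    pairs = []
    if coordinates:
        for chunk in coordinates.split("|"):
            # Assumption: each entry looks like "lat,long" (two numbers).
            lat_text, lon_text = chunk.split(",")
            pairs.append((float(lat_text), float(lon_text)))
    # zip() stops at the shorter list, so unmatched entries are dropped.
    return list(zip(names, pairs))


# Invented values shaped like the two CSV columns described in this README.
parsed = parse_locations(
    "Place A, Region, Country|Place B, Region, Country",
    "10.5,-20.25|30.0, 40.75",
)
print(parsed)
```

Given the caution about geocoding accuracy, it is worth spot-checking a sample of these pairs against Location_text before mapping them.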

Rights Statement

Creator and contributor information

Creator: AVP

Contributors: LC Labs

Contact information

Please contact [email protected] with any questions or suggestions!