Digitized Telephone Directories, 1891-1988 Data Package README

Version information

Version 1.1 | Last updated 2024-04-17

1.1 (2024-04-17) LC Labs consolidated coversheet and README content with minor formatting updates
1.0 (2023-05-05) First version

CONTENT ADVISORY

Please note that terminology in historical materials and in Library descriptions does not always match the language preferred by members of the communities depicted, and may include negative stereotypes or words that offend.

About the source data or collection

Brief description & background of collection

The Library of Congress makes available to the public an extensive collection of past and present city, telephone, and reverse telephone (criss-cross) directories for the United States and many foreign countries. These directories are available in a variety of formats and locations.

The collection spans most of the 20th century, and includes directories from Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, the District of Columbia, Florida, Georgia, Hawaii, Iowa, Maryland, Pennsylvania, and the city of Chicago. All the directories and their metadata records are in English. There is not a one-to-one correspondence between metadata records and directories, as some microfilm reels contained multiple directories when they were digitized. There is a mix of white pages and yellow pages. Some directories may be missing pages due to damage.

More information can be found at: https://www.loc.gov/collections/united-states-telephone-directory-collection/

Original format

Microfilmed copies of print telephone directory books

Library of Congress reading room

Local History & Genealogy Reference Services in the Main Reading Room https://www.loc.gov/rr/genealogy/

Contact

For questions or more information about this material, please contact Local History & Genealogy Reference Services staff through the Ask a Librarian service.

Metadata type

Source metadata for this dataset was drawn from the LOC API, which sources the data for this collection from a custom database.

Scale of description

The directories are described at the microfilm level. A microfilm reel may contain one or more telephone directories. The title, date, and location are usually known.

Rights information

All white pages are in the public domain, as are any pre-1964 yellow pages that were not registered and renewed for copyright. For more information, see https://www.loc.gov/collections/united-states-telephone-directory-collection/about-this-collection/rights-and-access/

Digitization information

1.7 million images scanned from microfilm from the Main Reading Room

About this exploratory data package

This dataset contains metadata records for a subset of 3,513 reels of US telephone directories, digitized from microfilm, from the Digitized Telephone Directories collection on loc.gov. These records are included in both CSV and JSON formats.

The dataset spans most of the 20th century, and includes directories from Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, the District of Columbia, Florida, Georgia, Hawaii, Iowa, Maryland, Pennsylvania, and the city of Chicago. All the directories and their metadata records are in English. There is not a one-to-one correspondence between metadata records and directories, as some microfilm reels contained multiple directories when they were digitized. There is a mix of white pages and yellow pages. Some directories may be missing pages due to damage.

This dataset was created as part of an LC Labs experiment in collaboration with AVP to understand the benefits, risks, quality benchmarks, workflows, compilation methods, transformations, and documentation practices required to assemble datasets for public use in the cloud.

The goal of this dataset in particular is to explore ways of enriching existing metadata with information extracted from the images of collections items, in this case localities and dates parsed from the text of OCRed images. Through the creation of this dataset for the experiment, LC Labs and AVP aimed to learn:
- how to enrich datasets with additional data that will support special purposes, such as network analysis, combination with other datasets, or other functionality.
- how to measure and ensure high levels of accuracy when automating introduction of additional data into a dataset.
- what documentation is needed for these enrichments.

The dataset is organized by digital object, with each row (CSV) or JSON object representing a single digital object from the collection. Each digital object represents one or more physical telephone directories.

What's included?

The data package contains:

Full data package which includes metadata files and 486 .txt files, zipped
README: An overview of the source data or collection provenance, the contents of the data package, and how the data package was created. Available as .md, .html, and .pdf.
metadata.json : a JSON file containing the metadata for all 3511 digital objects
metadata.jsonl : a JSON lines version of the JSON data, with one record per line, useful for processing large files
metadata.csv: a CSV transformation of the original JSON metadata
data/: directory of 486 full text OCR .txt files
summary/: directory of summary data (.csv) and visualizations (.jpg) of the fields included in the dataset and number of records populated for each, date histograms, and location distribution of dataset records (156 KB)
sample-data/: 100 randomly selected items from the 3511-item set, provided as sample data. Includes metadata.csv, metadata.json, and metadata.jsonl files (2.2 MB).
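
As a starting point for working with the JSON Lines file, the following minimal sketch reads one record per line. The two sample records are made up for illustration and use only the Id and Digitized fields described later in this README; they are not taken from the dataset.

```python
import io
import json


def read_jsonl(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for lines of metadata.jsonl.
sample = io.StringIO(
    '{"Id": "example-1", "Digitized": true}\n'
    '{"Id": "example-2", "Digitized": false}\n'
)
records = list(read_jsonl(sample))
digitized_ids = [r["Id"] for r in records if r.get("Digitized")]
print(digitized_ids)
```

To read the packaged file instead, pass an open file handle, for example open("metadata.jsonl", encoding="utf-8"), to read_jsonl in place of the sample stream.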

Computational readiness and possible uses

The data in this dataset have been selected, structured, standardized, and enriched to make the dataset more easily comprehensible and computable through a range of methods and in a variety of environments. Enrichment of locations with coordinates and structured addresses could support plotting records in mapping interfaces. Enrichment and standardization of dates could enable sequencing in timelines.

How was it created?

This dataset was created through a four-stage process including data extraction, mapping and standardization to a specified schema, enrichment of certain fields with additional data, and packaging for access and use.

Extraction

This dataset was created using the LOC JSON/YAML API and comprises all digitized and non-digitized digital object records, retrieved through the following API query: https://www.loc.gov/collections/united-states-telephone-directory-collection/?fo=json . This process returned 3511 results.
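
A retrieval along these lines might be sketched as follows. This is not the script used to build the dataset. It assumes that the API pages its responses, that the `c` (results per page) and `sp` (page number) query parameters apply, and that each response carries a `results` list and a `pagination.next` link; check these against the API documentation before relying on them. The `fetch_json` callable is hypothetical and stands in for whatever HTTP client is used.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.loc.gov/collections/united-states-telephone-directory-collection/"


def build_query_url(page=1, per_page=100):
    """Build a collection query URL; the `c` and `sp` values are illustrative."""
    params = {"fo": "json", "c": per_page, "sp": page}
    return BASE_URL + "?" + urlencode(params)


def collect_results(fetch_json, max_pages=1000):
    """Gather `results` from successive pages until a page has no `next` link.

    `fetch_json` is a hypothetical callable that takes a URL and returns the
    decoded JSON response as a dict, so no particular HTTP client is assumed.
    """
    results = []
    url = build_query_url()
    for _ in range(max_pages):
        page = fetch_json(url)
        results.extend(page.get("results", []))
        url = page.get("pagination", {}).get("next")
        if not url:
            break
    return results
```

Passing the fetcher in as an argument keeps the paging logic testable without network access.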

Standardization

The API fields in the response returned from the query were reviewed and selected for mapping to a schema designed for anticipated possible uses of the dataset. This schema consists of fields from the General Purpose schema plus additional format- or collection-specific fields (see "Dataset field descriptions" below). In the mapping from the API response to the dataset schema, some data values were standardized for consistency across the dataset and interoperability with other datasets.

Standardizations include the following methods and are listed in detail in the "Dataset field descriptions" section.
- Capitalization (method): Data value has been capitalized using title or sentence case.
- Fill with (string): Data value has been filled with a static string. If "empty" is present, only empty values have been filled. Otherwise, all values for that field, including mapped values, have been filled or overwritten.
- Lookup (table): Data value has been looked up and replaced by a value in the specified lookup table.
- Select, sum (field): A specified field in an array of objects has been selected and summed. (This is currently only in use for totalling the number of files from the API resource field).
- Split on (delimiter): An array of strings has been created from a string value by splitting on a specified delimiter.
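
The standardization methods above can be illustrated with a few small functions. This is a sketch of the behavior as described in the list, not the code used to produce the dataset; the function names and the simplified sentence-case rule (which lowercases everything after the first character) are assumptions.

```python
def capitalize(value, method="sentence"):
    """Capitalization (method): apply title or sentence case to a string."""
    if method == "title":
        return value.title()
    # Simplified sentence case: proper nouns are not preserved.
    return value[:1].upper() + value[1:].lower()


def fill_with(value, string, only_empty=False):
    """Fill with (string): overwrite every value, or only empty ones."""
    if only_empty and value not in (None, "", []):
        return value
    return string


def lookup(value, table):
    """Lookup (table): replace a value using a lookup table.

    Values missing from the table pass through unchanged (an assumption).
    """
    return table.get(value, value)


def select_sum(objects, field):
    """Select, sum (field): sum one field across an array of objects."""
    return sum(obj.get(field, 0) for obj in objects)


def split_on(value, delimiter):
    """Split on (delimiter): turn a delimited string into an array of strings."""
    return [part.strip() for part in value.split(delimiter)]
```

For example, fill_with("", "English", only_empty=True) returns "English", while select_sum([{"files": 3}, {"files": 5}], "files") returns 8.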

Enrichment

After standardization, some data fields were enriched to bring additional value for potential use cases. Because of the way that the directories were inventoried or cataloged, not all locations covered by the directories on a given reel are represented in the metadata retrieved via the API. Most reels contain informational slides, either near the start of the reel or interspersed throughout, giving the full list of locations covered by the directories on the reel. In order to more fully represent the geographic coverage of each item in this dataset, we used the smaller-than-average size of these informational images to identify which images might contain additional location information. We then used the optical character recognition (OCR) engine Tesseract to extract the text from those images, then used regular expressions to identify which images actually contain location lists and parse the lists into individual locations.
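
The two steps described above, picking out unusually small images and parsing location lists from their OCR text, might look roughly like the sketch below. The size ratio and the regular expression are hypothetical; the actual cutoff and patterns used for the dataset are not documented here. The OCR step itself (Tesseract) is omitted, so `parse_locations` takes already-extracted text.

```python
import re
from statistics import mean


def candidate_slides(image_sizes, ratio=0.5):
    """Return names of images whose byte size is below `ratio` times the reel mean.

    `ratio` is a hypothetical threshold; the cutoff actually used is not stated.
    """
    if not image_sizes:
        return []
    cutoff = mean(image_sizes.values()) * ratio
    return [name for name, size in image_sizes.items() if size < cutoff]


# Hypothetical pattern: a capitalized place name of one or more words
# that fills an entire OCR line.
PLACE_LINE = re.compile(r"^[A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*)*$")


def parse_locations(ocr_text):
    """Keep OCR lines that look like place names, one location per line."""
    lines = (line.strip() for line in ocr_text.splitlines())
    return [line for line in lines if PLACE_LINE.match(line)]
```

A simple pattern like this will both miss real locations and accept non-locations, which is one reason the enriched data may contain inaccuracies.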

All locations included in the original metadata, as well as those identified using the OCR process above, were concatenated into city/town/village, state strings, queried in OpenStreetMap, and enriched with structured location data, geocoordinates, and URLs to the structured data record in OpenStreetMap. Results were filtered to include only place types with one of the following values: "hamlet", "town", "city", "village", "county", "state", "province", "locality", "country", "suburb", "borough". If there was more than one result, the script chose the first result to encode in the output data. This approach may have produced inaccuracies in the enriched data due to idiosyncrasies in recorded locations, misspellings in the original metadata, or errors in the OCR process.
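
The filter-then-take-first rule can be sketched as below. The allowed place types are those listed above; the `type` key holding the place type in each result is an assumption, since this README does not name the field, and the sample results are made up.

```python
ALLOWED_PLACE_TYPES = {
    "hamlet", "town", "city", "village", "county", "state",
    "province", "locality", "country", "suburb", "borough",
}


def pick_location(results, type_key="type"):
    """Return the first result whose place type is allowed, or None.

    `type_key` is an assumption about where the place type lives in each
    result; the source does not name the field.
    """
    for result in results:
        if result.get(type_key) in ALLOWED_PLACE_TYPES:
            return result
    return None


# Made-up geocoder results for illustration only.
results = [
    {"display_name": "Example Street", "type": "residential"},
    {"display_name": "Example Town", "type": "town"},
    {"display_name": "Example County", "type": "county"},
]
print(pick_location(results)["display_name"])
```

Because the first acceptable match wins, a query string shared by several places resolves to whichever one the geocoder ranks highest.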

Packaging

After enrichment, the dataset was output in JSON, JSONL (JSON lines), and CSV formats. CSV files were flattened from JSON using the following rules:
- Arrays were flattened to strings, with array items delimited by the pipe character.
- Enriched locations (Location) are included in their original JSON form in the Location column. The Location.Full_name is included in a Location_full_name column with multiple locations delimited by the pipe character. Coordinates from Location.Coordinates are listed as lat,long pairs in the Coordinates column in the same order as in Location_full_name. State_region and County columns were added for easier grouping and filtering of the data.
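
The flattening rules above can be sketched as follows. This is an illustration under assumptions rather than the packaging script: it treats Location as a list of objects that each carry Full_name and Coordinates ([lat, long]), and it leaves out the State_region and County columns because their derivation is not described here.

```python
import json


def flatten_record(record):
    """Flatten one JSON record into a CSV-ready dict."""
    row = {}
    for key, value in record.items():
        if key == "Location":
            continue  # handled separately below
        if isinstance(value, list):
            # Arrays become pipe-delimited strings.
            row[key] = "|".join(str(item) for item in value)
        else:
            row[key] = value
    locations = record.get("Location", [])
    # Location keeps its original JSON form.
    row["Location"] = json.dumps(locations)
    row["Location_full_name"] = "|".join(loc["Full_name"] for loc in locations)
    # Coordinates are lat,long pairs in the same order as the names.
    row["Coordinates"] = "|".join(
        ",".join(str(c) for c in loc["Coordinates"]) for loc in locations
    )
    return row
```

Keeping names and coordinates in the same order is what lets a CSV user match the two columns by position.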

Dataset field descriptions

The data fields that follow were compiled from a "General Purpose" schema designed for the Data Transformation Services experiment and supplemented with additional fields specific to this collection and/or anticipated uses of the data. Values have been sourced from API fields or templated with static values where necessary. These mappings are indicated in the "Data source" column in the table below. Some values have been standardized for consistency across the dataset or interoperability with other datasets using similar data structures, standards, or controlled vocabularies. Types and descriptions of standardizations are listed above in the "How was it created" section and indicated in the "Standardization" column of the table below. Enrichments are also described in the "How was it created" section and are indicated below.

The data fields that follow are directly translated from the metadata.json file. The JSON file is nested in nature, and that nested structure is not strictly carried over into the CSV. When JSON fields have been flattened or otherwise altered to fit a CSV field, the transformation is described below.

Each of the fields described below appears for an object or row in the dataset. Please note that not all elements appear for each result. The number and percentage of results populated for each field are indicated in the table below as well as in a summary.csv file in this package.

Field	Datatype	Definition	Requirement	Repeatability	Data Source (from API unless otherwise noted)	Standardizations	Percent Populated
Call_number	Text	An identifier for finding the original physical resource.	Optional	Y	item.call_number		100%
Coordinates	Text	(CSV only) The coordinate pair (lat, long) of a location matched through the OpenStreetMap location enrichment. Multiple coordinate sets are delimited with the pipe character. Corresponds to the location string in the same position in the Location_full_name column.	Optional	Y	Enrichment: Location
County	Text	(CSV only) County(s) represented in the telephone directory	Recommended	Y	Enrichment: Location
Date	Date (EDTF)	A structured representation of the date created.	Recommended	Y	item.date		99.15%
Date_text	Text	A textual representation of the date.	Optional	Y	item.date		99.15%
Digitized	Boolean	Whether or not the resource described is digitized.	Recommended	N	digitized		100%
Genre	Text	A genre for the resource.	Recommended	Y	item.genre		100%
Id	Text	A unique identifier for the resource.	Mandatory	N	id		100%
IIIF_manifest	URL	A IIIF manifest for the digital object, if available	Recommended		iiif_manifest		100%
Language	Text	The language(s) of the content of the resource.	Optional	Y		Fill with ("English", empty)	100%
Last_updated_in_api	Timestamp	The date and time the metadata was last refreshed in the API. This may or may not reflect a change in the data.	Recommended	N	timestamp		99.97%
Location	Object	Structured representation of a location, including parent administrative divisions, where applicable, and geocoordinates	Optional	Y	Enrichment: Location		93.62%
Location_full_name	Text	(CSV only) The full location string of a location matched through the OpenStreetMap location enrichment. Multiple locations are delimited with the pipe character. Corresponds to the coordinate pair in the same position in the Coordinates column.	Optional	Y	Enrichment: Location
Location_temp	Text	Temporary field for storing extracted location data for enrichment stage. This is removed during the enrichment stage.	Optional	Y	location		N/A
Location_text	Text	Textual representation of a location, copied directly from the source record	Recommended	Y	item.location		100%
Location.Address	Text	The administrative divisions that make up the full address of the location.	Mandatory	N	Enrichment: Location
Location.Address.[place_type]	Text	An administrative division of the location, ex. Country, State, or City. [place_type] is replaced by the address type from the address block in the OpenStreetMap data. All parent administrative divisions as well as the type of location are included.	Mandatory	N	Enrichment: Location
Location.Coordinates	List	The coordinates (lat, long) of the location.	Mandatory	N	Enrichment: Location
Location.Full_name	Text	The full display name of the location, taken from Open Street Map.	Mandatory	N	Enrichment: Location
Location.OSM_url	URL	The Open Street Map URL for the location.	Mandatory	N	Enrichment: Location
Location.Short_name	Text	The name of the lowest level administrative division of the location.	Mandatory	N	Enrichment: Location
Mime_type	Text	The MIME type(s) of the files composing the digital object.	Recommended, if digitized	Y	mime_type		99.91%
Number_of_files	Integer	Number of files composing the digital object.	Recommended, if digitized	N	resources	Select, sum ("files")	100%
Online_format	Text	The format of the online version of the resource.	Recommended, if digitized	Y	online_format	Capitalization (sentence)	99.91%
Original_format	Text	The format the resource was digitized from.	Recommended, if digitized	Y	item.medium original_format	Capitalization (sentence)	100%
Part_of	Text	Groups the resource is a part of, such as source collection or repository.	Optional	Y	partof	Capitalization (title)	100%
Preview_url	URL	A url for a preview image or thumbnail for the digital object.	Recommended, if digitized	Y	image_url		99.91%
Repository	Text	The repository that holds the physical or digital resource.	Recommended	N	item.repository		100%
Rights	Text	Rights or access information associated with the resource.	Recommended	Y	item.rights		100%
Shelf_id	Text	An identifier for finding the original physical resource.	Optional	Y	shelf_id		100%
Source_collection	Text	The collection the resource belongs to.	Recommended	Y	item.source_collection		100%
State_Region	Text	(CSV only) State(s) or region(s) represented in the telephone directory. Multiple values are delimited with pipe character.	Recommended	Y	Enrichment: Location
Title	Text	The primary title or description of the resource.	Mandatory	N	item.title		100%
Type_of_resource	Text	A term that specifies the characteristics and general type of content of the resource, such as "Still image" or "Text." Based on MODS 3.7 enumerated list of values for typeOfResource: https://www.loc.gov/standards/mods/userguide/typeofresource.html	Mandatory	Y	item.format	Lookup (typeofresource)	100%
Url	URL	The Digital Collections URL for the resource.	Optional	N	url		100%
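
Since the Coordinates and location-name columns correspond by position, a CSV user can pair them up for mapping. The sketch below assumes the name column is called Location_full_name, as in the "Packaging" section, and uses a made-up row rather than real data.

```python
import csv
import io


def located_points(row):
    """Pair each location name with its (lat, long) values by position."""
    names = row["Location_full_name"].split("|") if row["Location_full_name"] else []
    coords = row["Coordinates"].split("|") if row["Coordinates"] else []
    points = []
    for name, pair in zip(names, coords):
        lat, lon = (float(part) for part in pair.split(","))
        points.append((name, lat, lon))
    return points


# A made-up CSV row standing in for metadata.csv.
sample_csv = io.StringIO(
    "Id,Location_full_name,Coordinates\n"
    'example-1,"Place A, State A|Place B, State A","1.5,-2.5|3.0,-4.0"\n'
)
for row in csv.DictReader(sample_csv):
    print(row["Id"], located_points(row))
```

Splitting on the pipe character first matters because the location names themselves contain commas.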

Rights Statement

Creator and contributor information

Creator: AVP

Contributors: LC Labs

Contact information

Please contact [email protected] with any questions or suggestions!