Library of Congress General Collections Assessment Data README

Version information

Version 1.0 | Last updated 2023-11-28

1.0 (2023-11-28) First version

About this dataset

The General Collections Assessment is an ongoing program to assess the Library’s approximately 22 million books, bound serials, and other materials classified under the General Collections. Assessments will be completed in segments divided by subject area, based on the Library’s Collections Policy Statements (CPS). Goals for the program include:

- allowing the Library to assess its effectiveness in meeting its collecting mandate;
- providing action steps to address any issues identified through the assessments; and
- building a data-gathering process to support ongoing and future assessment.

As part of this project, the Library is making the datasets publicly available here. A total of 45 segment assessments are planned.

The goal of these datasets in particular is to make the underlying bibliographic data, which serves as the primary data source for the collection assessments, available for exploration. The Library is especially interested in getting feedback from users on the usefulness of sharing this source data publicly.

About the source data or collection

Brief description & background of source material

The data are limited to records of books and serials in the Library of Congress Voyager database (LC’s integrated library system) and are split into segments based on subject. For this project, the Library’s CPSs, specifically the LC classification outlined in each CPS, are used to define segment assessments by subject. Even when a CPS covers only part of a class or subclass, the entire class or subclass may still be included in the data if it is not covered in a different CPS.

The file sizes and record counts for the data files will vary due to this organization. For example, the Children’s Literature CPS covers parts of subclasses PZ and AP, and the matching assessment includes bibliographic data for 331,146 collection items. The Philosophy CPS covers subclasses B, BC, BD, BH, and part of BJ (the assessment includes the full subclass BJ) and the matching assessment includes bibliographic data for 241,677 collection items.

Library of Congress reading room

Not applicable.

Contact

For questions or more information about this material, please contact the Collection Development Office at [email protected].

What's included?

The data package contains:

README: An overview of the source data or collection provenance, the contents of the data package, and how the data package was created. Available as .md, .html, and .pdf.
chi.csv: bibliographic data for 331,146 collection items used in the Children's Literature assessment.
localhistory_us.csv: bibliographic data for 321,869 collection items used in the Local History assessment.
philosophy.csv: bibliographic data for 241,677 collection items used in the Philosophy assessment.

Note: the “General Collections” that are the focus of this assessment include books and serials that are in Western languages and are not assigned to Library custodial divisions (such as the Asian Division, the African and Middle Eastern Division, or the Rare Book and Special Collections Division). However, the bibliographic data files include all books and serials in an effort to provide data on all such materials in a given subject area. The General Collections records (designated as Holdings Location = “GenColl”) have undergone Place of Publication standardization as described under “How was it created?”, but for the most part, the non-General Collections records in the dataset are presented “as is.”
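Because the files mix General Collections and other records, a first step for many users may be to separate the two. The following is a minimal sketch using only the Python standard library. It assumes the CSV header names match the field names listed in this README; the sample rows, and the location value "OTHER", are made up for illustration.

```python
import csv
import io

# Tiny stand-in for one of the CSV files. The header names are assumed to
# match the field names in this README; the rows themselves are invented.
SAMPLE = """LCCN,Title,Holdings Location
00000001,Example title A,GenColl
00000002,Example title B,OTHER
00000003,Example title C,GenColl
"""

def general_collections_rows(csv_file):
    """Yield only the rows whose Holdings Location is "GenColl"."""
    for row in csv.DictReader(csv_file):
        if row.get("Holdings Location", "").strip() == "GenColl":
            yield row

gencoll = list(general_collections_rows(io.StringIO(SAMPLE)))
print(len(gencoll), "of 3 sample rows are General Collections records")
```

To run this against a real file, the `io.StringIO(SAMPLE)` argument could be swapped for something like `open("philosophy.csv", newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.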

Computational readiness and possible uses

The data in this dataset is in a structured, machine-readable tabular format. Users may find the data useful for various purposes, such as comparing against their own libraries’ collections, analyzing Library of Congress collections themselves, or making use of information on place of publication and publisher, which have undergone some standardization. (The normalization of city names and publishers is done to simplify understanding and reporting; it is not intended as a cataloging standard.)
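As one illustration of the kind of exploration this format allows, the sketch below tallies records by the value of a single field. It assumes the CSV header names match the field names in this README; the sample rows are invented, and `tally` is a hypothetical helper, not part of the data package.

```python
import csv
import io
from collections import Counter

# Invented sample rows; header names are assumed to match this README.
SAMPLE = """Title,Language,Region
Example A,English,Northern America
Example B,French,Western Europe
Example C,English,Northern Europe
"""

def tally(csv_file, field):
    """Count how many rows carry each distinct value of `field`."""
    return Counter(row[field] for row in csv.DictReader(csv_file))

language_counts = tally(io.StringIO(SAMPLE), "Language")
print(language_counts.most_common())
```

The same helper works for Region, Country, Subclass, or any other column, which makes it a convenient starting point for comparing a segment against another library's holdings.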

How was it created?

Each segment assessment analyzes the bibliographic data of books and serials in the subject area covered by a Collections Policy Statement.

To access data for each assessment, SQL queries are written to extract the necessary data from the Voyager database (LC’s integrated library system). For each assessment, the queries begin by identifying holding records and call numbers for the subclass(es) covered by the assessment’s subject area. The holding records are then linked to the relevant bibliographic data using a unique identifier. As more bibliographic data is added (title, author, format, language, publisher information, etc.), a unique dataset of collection materials in that subject area is created for the analysis.
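The extraction steps described above can be sketched roughly as follows. This is not the Library's actual query and does not reflect the real Voyager schema: the table names (`holdings`, `bibs`), the column names, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for the integrated library system.

```python
import sqlite3

# Hypothetical two-table schema standing in for the ILS; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holdings (holding_id INTEGER PRIMARY KEY,
                           bib_id INTEGER, call_number TEXT);
    CREATE TABLE bibs (bib_id INTEGER PRIMARY KEY, title TEXT, language TEXT);
    INSERT INTO holdings VALUES
        (1, 10, 'BJ1012 .E9 2001'),
        (2, 11, 'PZ7 .E9 1999'),
        (3, 12, 'BD161 .E9 1985');
    INSERT INTO bibs VALUES
        (10, 'Example ethics title', 'English'),
        (11, 'Example juvenile title', 'English'),
        (12, 'Example epistemology title', 'German');
""")

def rows_for_subclasses(conn, subclasses):
    """Find holdings in the given subclasses, then attach bibliographic data."""
    # Step 1: match call numbers that start with the subclass letters followed
    # by a digit, so that 'B' does not also match 'BJ'.
    clauses = " OR ".join("h.call_number GLOB ?" for _ in subclasses)
    params = [f"{s}[0-9]*" for s in subclasses]
    # Step 2: link each holding to its bibliographic record by identifier.
    sql = f"""
        SELECT h.call_number, b.title, b.language
        FROM holdings AS h
        JOIN bibs AS b ON b.bib_id = h.bib_id
        WHERE {clauses}
        ORDER BY h.call_number
    """
    return conn.execute(sql, params).fetchall()

philosophy_rows = rows_for_subclasses(conn, ["B", "BC", "BD", "BH", "BJ"])
for row in philosophy_rows:
    print(row)
```

Matching on the subclass letters plus a following digit is one simple way to keep a single-letter subclass from sweeping in its two-letter neighbors; the real queries may handle this differently.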

Certain data fields – Place of Publication, Country, Language, and Publisher – are of particular interest in the datasets, as these data help the Library assess the breadth and diversity of its General Collections materials.

After the bibliographic data is compiled for the specific segment of assessment, certain fields are cleaned and standardized before the analysis begins:

Data fields, including Place of Publication and Publisher, are normalized to the extent possible. However, due to time limitations and the variability of city names, publisher mergers, and cataloging standards over decades (and sometimes centuries), normalization is not 100% complete. The normalization of city names and publishers is done to simplify understanding and reporting; it is not intended as a cataloging standard.
In cases where more than one city of publication is listed for a title, only the first city name is kept/normalized.
Countries of publication are divided into regions based on the United Nations Standard Country or Area Codes for Statistical Use.
The main normalization process for publishers depends on publisher codes embedded in ISBNs, and is only completed for English language materials due to the availability of reliable ISBN/publisher data.
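Two of the cleaning rules above (keeping only the first city, and grouping countries into regions) can be sketched as below. This is an illustration under stated assumptions, not the Library's procedure: the separators used to split multiple cities, the variant-spelling table, and the function names are hypothetical, and the region table is only a three-entry excerpt keyed by assumed country names.

```python
import re

# Hypothetical variant spellings mapped to a single reporting form.
CITY_VARIANTS = {
    "new york, n.y.": "New York",
    "nueva york": "New York",
    "londres": "London",
}

# Small excerpt of UN M49 sub-region names, keyed by assumed country names.
SUB_REGION = {
    "France": "Western Europe",
    "Japan": "Eastern Asia",
    "United States": "Northern America",
}

def first_city(place):
    """Keep only the first city when several are listed.

    Assumption: cities are separated by ';' or the word 'and'.
    """
    first = re.split(r"\s*;\s*|\s+and\s+", place.strip(), maxsplit=1)[0]
    return first.strip(" :,[]")

def normalize_place(place):
    """Reduce to the first city, then apply the variant-spelling table."""
    city = first_city(place)
    return CITY_VARIANTS.get(city.lower(), city)

def region_for(country):
    """Look up the sub-region name; None when the country is not in the table."""
    return SUB_REGION.get(country)

print(normalize_place("New York, N.Y. ; London"))
print(normalize_place("Londres"))
print(region_for("Japan"))
```

A lookup table keeps the normalization transparent and easy to extend, which fits the stated aim of simplifying reporting rather than establishing a cataloging standard.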

Dataset field descriptions

The data fields that follow were compiled from the Library's bibliographic data.

Field	Datatype	Definition	Metadata Source
LCCN	Text	Library of Congress Control Number	Bibliographic record
Title	Text	Title of work	Bibliographic record
Author	Text	Author of work	Bibliographic record
Publisher	Text	Publisher of work; publishers of English-language works with ISBNs undergo a standardization process	Derived from bibliographic record
Language	Text	Name of language of publication; corresponds to MARC Code List for Languages names	Bibliographic record
Begin Publication Date	String	Year a book was published or year a serial began publication	Bibliographic record
Format	Text	Format of item (book or serial)	Bibliographic record
Country	Text	Country of publication name	Generated based on Place Code field
Display Call Number	String	Call number as displayed in Library of Congress catalog	Bibliographic record
Holdings Location Display Name	Text	Holdings location as displayed in Library of Congress catalog	Bibliographic record
Holdings Location	String	Holdings location abbreviation	Bibliographic record
Begin Publication Date (Decade)	Integer	Decade of publication; populated only for books	Generated based on Begin Publication Date field
Region	Text	Global region of publication	United Nations Standard Country or Area Codes for Statistical Use, Sub-Region Name field
Subclass	Text	Library of Congress Classification subclass	Generated based on Display Call Number field
US NonUS	Text	Indication whether place of publication is within the United States (50 states and D.C.)	Generated based on Place Code field
Isbn	String	International Standard Book Number	Bibliographic record
Issn	String	International Standard Serial Number	Bibliographic record
Language Code	String	Code for language of publication; corresponds to MARC Code List for Languages codes	Bibliographic record
Place Code	String	Code for state or country of publication; corresponds to MARC Code List for Countries codes	Bibliographic record
Place of Publication	Text	City or other locality of publication; city or town is used when available, otherwise narrowest subnational locality available (such as county) is used	Derived from bibliographic record
State or Country	Text	Name of state or country of publication; corresponds to MARC Code List for Countries names	Bibliographic record
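Two of the generated fields in the table, Begin Publication Date (Decade) and Subclass, could be recomputed from their source fields along the following lines. This is a sketch, not the Library's derivation: it assumes a decade is represented by its first year (for example, 1990), that the Format value for books is "book" in some capitalization, and that the subclass is the run of capital letters at the start of the call number. The function names and sample values are hypothetical.

```python
import re

def decade(begin_year, item_format):
    """Return the decade's first year for books, else None.

    Non-book formats and values that are not a four-digit year give None.
    """
    if item_format.strip().lower() != "book":
        return None
    year = begin_year.strip()
    if not re.fullmatch(r"\d{4}", year):
        return None
    return int(year) // 10 * 10

def subclass(display_call_number):
    """Return the leading capital letters of a call number, or None."""
    match = re.match(r"[A-Z]+", display_call_number.strip())
    return match.group(0) if match else None

print(decade("1994", "Book"))
print(subclass("BJ1012 .E9 2001"))
```

Returning `None` for unparseable input, rather than raising an error, mirrors the table's note that the decade field is populated only for some records.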

Rights Statement

These data are free to reuse.

Creator and contributor information

Creator: Collection Development Office, Library of Congress, based on existing bibliographic records.

Feedback

Please send your feedback on using these datasets to [email protected], or get in touch with any questions or suggestions!