Library of Congress General Collections Assessment Data README

Version information

Version 1.0 | Last updated 2023-11-28

About this dataset

The General Collections Assessment is an ongoing program to assess the Library’s approximately 22 million books, bound serials and other materials classified under the General Collections. Assessments will be completed in segments divided by subject area, based on the Library’s Collections Policy Statements (CPSs). Goals for the program include:

- allowing the Library to assess its effectiveness in meeting its collecting mandate;
- providing action steps to address any issues identified through the assessments; and
- building a data gathering process to support ongoing and future assessment.

As part of this project, the Library is making the datasets publicly available here. A total of 45 segment assessments are planned.

The goal of these datasets in particular is to make available for exploration the underlying bibliographic data used as the primary data source for the collection assessments. The Library is especially interested in getting feedback from users on the usefulness of sharing this source data publicly.

About the source data or collection

Brief description & background of source material

The data are limited to records of books and serials in the Library of Congress Voyager database (LC’s integrated library system) and are split into segments based on subject. For this project, the Library’s CPSs, specifically the LC classification outlined in each CPS, are used to define segment assessments by subject. Even when a CPS covers only part of a class or subclass, the entire class or subclass may still be included in the data if it is not covered in a different CPS.

The file sizes and record counts for the data files will vary due to this organization. For example, the Children’s Literature CPS covers parts of subclasses PZ and AP, and the matching assessment includes bibliographic data for 331,146 collection items. The Philosophy CPS covers subclasses B, BC, BD, BH, and part of BJ (the assessment includes the full subclass BJ) and the matching assessment includes bibliographic data for 241,677 collection items.

Contact

For questions or more information about this material, please contact the Collection Development Office at [email protected].

What's included?

The data package contains:

Note: the “General Collections” that are the focus of this assessment include books and serials that are not assigned to Library custodial divisions (such as Asian Division, African and Middle Eastern Division, Rare Book and Special Collections Division, etc.) and are in Western languages. However, the bibliographic data files include all books and serials in an effort to provide data on all such materials in a given subject area. The General Collections records (designated as Holdings Location = “GenColl”) have undergone Place of Publication standardization as described in section III, but for the most part, the non-General Collections records in the dataset are presented “as is.”
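As a minimal sketch of separating the General Collections records from the rest, the following assumes a segment file has been exported as CSV with a header row that uses the field names listed under Dataset field descriptions. The file format, the sample rows, and the location code "OtherLoc" are assumptions made for illustration; only the "GenColl" value comes from the note above.

```python
import csv
import io

# Invented sample rows. A real segment file is assumed (not confirmed) to be
# CSV with these column headers.
SAMPLE = """LCCN,Title,Holdings Location
00000001,Example title one,GenColl
00000002,Example title two,OtherLoc
00000003,Example title three,GenColl
"""


def gencoll_rows(handle):
    """Yield rows whose Holdings Location, after trimming spaces, is "GenColl"."""
    for row in csv.DictReader(handle):
        if row["Holdings Location"].strip() == "GenColl":
            yield row


general = list(gencoll_rows(io.StringIO(SAMPLE)))
print(len(general), "General Collections records")
```

To run this against a downloaded file, the `io.StringIO(SAMPLE)` argument would be replaced with an open file handle, for example `open(path, newline="", encoding="utf-8")`.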

Computational readiness and possible uses

The data in this dataset is in a structured, machine-readable tabular format. Users may find the data useful for various purposes, such as comparing against their own libraries’ collections, analyzing Library of Congress collections themselves, or making use of information on place of publication and publisher, which have undergone some standardization. (The normalization of city names and publishers is done to simplify understanding and reporting; it is not intended as a cataloging standard.)
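One simple use of the standardized fields is a tally of records by country or publisher. The sketch below assumes CSV input with a header row; the sample rows and publisher names are invented, and `count_by` is a hypothetical helper, not part of any Library tooling.

```python
import csv
import io
from collections import Counter

# Invented sample rows using field names from the field descriptions.
SAMPLE = """Title,Country,Publisher
Example A,France,Hypothetical Press
Example B,United States,Hypothetical Press
Example C,France,Another Imprint
"""


def count_by(handle, field):
    """Count records per distinct value of one column."""
    return Counter(row[field] for row in csv.DictReader(handle))


by_country = count_by(io.StringIO(SAMPLE), "Country")
for country, n in by_country.most_common():
    print(country, n)
```

The same helper can be pointed at "Publisher", "Language", or "Region" to get comparable counts.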

How was it created?

Each segment assessment analyzes the bibliographic data of books and serials in the subject area covered by a Collections Policy Statement.

To access data for each assessment, SQL queries are written to extract the necessary data from the Voyager database (LC’s integrated library system). For each assessment, the queries begin by identifying holding records and call numbers for the subclass(es) covered by the assessment’s subject area. The holding records are then linked to the relevant bibliographic data using a unique identifier. As more bibliographic data is added (title, author, format, language, publisher information, etc.), a unique dataset of collection materials in that subject area is created for the analysis.
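The two steps described above (select holdings by subclass, then link them to bibliographic data) can be illustrated with a small in-memory SQLite database. This is only a sketch: the table names `holdings` and `bibs`, their columns, and the sample rows are hypothetical and do not reflect the actual Voyager schema or the Library's queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE holdings (holding_id INTEGER, bib_id INTEGER, call_number TEXT);
CREATE TABLE bibs (bib_id INTEGER, title TEXT, language TEXT);
INSERT INTO holdings VALUES
    (1, 10, 'BJ1012 .E93 2001'),
    (2, 11, 'QA76 .E93 1999'),
    (3, 12, 'B72 .E93 1985');
INSERT INTO bibs VALUES
    (10, 'Example ethics title', 'English'),
    (11, 'Example computing title', 'English'),
    (12, 'Example philosophy title', 'German');
""")

# Step 1: keep holdings whose call number starts with subclass B or BJ
#         (subclass letters followed directly by a digit).
# Step 2: link each holding to its bibliographic record via the shared bib_id.
rows = conn.execute("""
    SELECT h.call_number, b.title, b.language
    FROM holdings AS h
    JOIN bibs AS b ON b.bib_id = h.bib_id
    WHERE h.call_number GLOB 'B[0-9]*'
       OR h.call_number GLOB 'BJ[0-9]*'
    ORDER BY h.holding_id
""").fetchall()

for call_number, title, language in rows:
    print(call_number, "|", title, "|", language)
```

Requiring a digit after the subclass letters keeps a pattern for subclass B from also matching BC, BD, or BJ, so each subclass in an assessment can be listed explicitly.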

Certain data fields – Place of Publication, Country, Language, and Publisher – are of particular interest in the datasets, as these data help the Library assess the breadth and diversity of its General Collections materials.

After the bibliographic data is compiled for the specific segment of assessment, certain fields are cleaned and standardized before the analysis begins:

Dataset field descriptions

The data fields that follow were compiled from the Library's bibliographic data.

| Field | Datatype | Definition | Metadata Source |
| --- | --- | --- | --- |
| LCCN | Text | Library of Congress Control Number | Bibliographic record |
| Title | Text | Title of work | Bibliographic record |
| Author | Text | Author of work | Bibliographic record |
| Publisher | Text | Publisher of work; publishers of English-language works with ISBNs undergo a standardization process | Derived from bibliographic record |
| Language | Text | Name of language of publication; corresponds to MARC Code List for Languages names | Bibliographic record |
| Begin Publication Date | String | Year a book was published or year a serial began publication | Bibliographic record |
| Format | Text | Format of item (book or serial) | Bibliographic record |
| Country | Text | Name of country of publication | Generated based on Place Code field |
| Display Call Number | String | Call number as displayed in Library of Congress catalog | Bibliographic record |
| Holdings Location Display Name | Text | Holdings location as displayed in Library of Congress catalog | Bibliographic record |
| Holdings Location | String | Holdings location abbreviation | Bibliographic record |
| Begin Publication Date (Decade) | Integer | Decade of publication; populated only for books | Generated based on Begin Publication Date field |
| Region | Text | Global region of publication | United Nations Standard Country or Area Codes for Statistical Use, Sub-Region Name field |
| Subclass | Text | Library of Congress Classification subclass | Generated based on Display Call Number field |
| US NonUS | Text | Indication whether place of publication is within the United States (50 states and D.C.) | Generated based on Place Code field |
| Isbn | String | International Standard Book Number | Bibliographic record |
| Issn | String | International Standard Serial Number | Bibliographic record |
| Language Code | String | Code for language of publication; corresponds to MARC Code List for Languages codes | Bibliographic record |
| Place Code | String | Code for state or country of publication; corresponds to MARC Code List for Countries codes | Bibliographic record |
| Place of Publication | Text | City or other locality of publication; city or town is used when available, otherwise narrowest subnational locality available (such as county) is used | Derived from bibliographic record |
| State or Country | Text | Name of state or country of publication; corresponds to MARC Code List for Countries names | Bibliographic record |
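Two of the generated fields, Begin Publication Date (Decade) and Subclass, can be approximated from their source fields. The sketch below is an assumption about how such values might be derived, not the Library's actual procedure: `decade_of` and `subclass_of` are hypothetical helpers, the decade is assumed to be stored as its first year (for example 1990), and the call numbers shown are invented.

```python
import re


def decade_of(begin_date):
    """Return the decade's first year (e.g. 1990) for a four-digit year string.

    Returns None for anything else, such as a partly unknown date.
    """
    year = begin_date.strip()
    if re.fullmatch(r"[0-9]{4}", year):
        return int(year) // 10 * 10
    return None


def subclass_of(call_number):
    """Return the leading one to three capital letters of a call number."""
    match = re.match(r"[A-Z]{1,3}", call_number.strip())
    return match.group(0) if match else None


print(decade_of("1998"), subclass_of("PZ7 .E93 1998"))
print(decade_of("19uu"), subclass_of("B72 .E93"))
```

The field descriptions state that the decade is populated only for books; this sketch does not check the Format field, so a caller would need to apply that condition separately.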

Rights Statement

These data are free to reuse.

Creator and contributor information

Creator: Collection Development Office, Library of Congress, based on existing bibliographic records.

Feedback

Please send your feedback on using these datasets to [email protected], or get in touch with any questions or suggestions!