Version 1.0 | Last updated 2023-11-28
The General Collections Assessment is an ongoing program to assess the Library’s approximately 22 million books, bound serials, and other materials classified under the General Collections. Assessments will be completed in segments divided by subject area, based on the Library’s Collections Policy Statements (CPS). Goals for the program include:

- allowing the Library to assess its effectiveness in meeting its collecting mandate;
- providing action steps to address any issues identified through the assessments; and
- building a data-gathering process to support ongoing and future assessment.
As part of this project, the Library is making the datasets publicly available here. A total of 45 segment assessments are planned.
The goal of these datasets in particular is to make available for exploration the underlying bibliographic data used as the primary data source for the collection assessments. The Library is especially interested in getting feedback from users on the usefulness of sharing this source data publicly.
The data are limited to records of books and serials in the Library of Congress Voyager database (LC’s integrated library system) and are split into segments based on subject. For this project, the Library’s CPSs, specifically the LC classification outlined in each CPS, are used to define segment assessments by subject. Even when a CPS covers only part of a class or subclass, the entire class or subclass may still be included in the data if it is not covered in a different CPS.
The file sizes and record counts for the data files will vary due to this organization. For example, the Children’s Literature CPS covers parts of subclasses PZ and AP, and the matching assessment includes bibliographic data for 331,146 collection items. The Philosophy CPS covers subclasses B, BC, BD, BH, and part of BJ (the assessment includes the full subclass BJ) and the matching assessment includes bibliographic data for 241,677 collection items.
For questions or more information about this material, please contact the Collection Development Office at [email protected].
The data package contains:
Note: the “General Collections” that are the focus of this assessment include books and serials that are not assigned to Library custodial divisions (such as Asian Division, African and Middle Eastern Division, Rare Book and Special Collections Division, etc.) and are in Western languages. However, the bibliographic data files include all books and serials in an effort to provide data on all such materials in a given subject area. The General Collections records (designated as Holdings Location = “GenColl”) have undergone Place of Publication standardization as described in section III, but for the most part, the non-General Collections records in the dataset are presented “as is.”
The data in this dataset is in a structured, machine-readable tabular format. Users may find the data useful for various purposes, such as comparing against their own libraries’ collections, analyzing Library of Congress collections themselves, or making use of information on place of publication and publisher, which have undergone some standardization. (The normalization of city names and publishers is done to simplify understanding and reporting; it is not intended as a cataloging standard.)
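As a starting point for exploring a segment file, the sketch below reads tabular data and keeps only the General Collections records. It assumes the file is CSV and that the column headers match the field names in the data dictionary (for example `Holdings Location`). Neither assumption is confirmed here, and the sample rows, including the `OTHER` location value, are invented placeholders.

```python
import csv
import io

# Invented sample standing in for a segment file. The CSV layout and the
# header names are assumptions based on the data dictionary field names.
SAMPLE = """Title,Language,Holdings Location
Example Title A,English,GenColl
Example Title B,French,GenColl
Example Title C,Japanese,OTHER
"""


def general_collections_rows(handle):
    """Yield only the rows whose Holdings Location is 'GenColl'."""
    for row in csv.DictReader(handle):
        if row["Holdings Location"] == "GenColl":
            yield row


gencoll = list(general_collections_rows(io.StringIO(SAMPLE)))
print(len(gencoll), "General Collections records")
```

To run this against a real file, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to a downloaded segment file.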
Each segment assessment analyzes the bibliographic data of books and serials in the subject area covered by a Collections Policy Statement.
To access data for each assessment, SQL queries are written to extract the necessary data from the Voyager database (LC’s integrated library system). For each assessment, the queries begin by identifying holding records and call numbers for the subclass(es) covered by the assessment’s subject area. The holding records are then linked to the relevant bibliographic data using a unique identifier. As more bibliographic data is added (title, author, format, language, publisher information, etc.), a unique dataset of collection materials in that subject area is created for the analysis.
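The linking step described above can be illustrated with a small, self-contained sketch. It is not the Library's actual query and does not reflect the Voyager schema: the table names (`holdings`, `bib`), the column names, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for the integrated library system.

```python
import sqlite3

# In-memory stand-in database; tables, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE holdings (bib_id INTEGER, call_number TEXT, location TEXT);
CREATE TABLE bib (bib_id INTEGER PRIMARY KEY, title TEXT, language TEXT);
INSERT INTO holdings VALUES
    (1, 'B105 .E9 2001', 'GenColl'),
    (2, 'BJ1012 .S5 1999', 'GenColl'),
    (3, 'PZ7 .A1 2010', 'GenColl');
INSERT INTO bib VALUES
    (1, 'Example A', 'English'),
    (2, 'Example B', 'German'),
    (3, 'Example C', 'English');
""")


def segment_rows(conn, subclasses):
    """Select holdings in the given subclasses and join them to bib data."""
    if not subclasses:
        return []
    # A subclass matches when its letters are immediately followed by a digit,
    # so 'B' matches 'B105' but not 'BJ1012'.
    where = " OR ".join("h.call_number GLOB ?" for _ in subclasses)
    patterns = [f"{s}[0-9]*" for s in subclasses]
    sql = f"""
        SELECT h.call_number, b.title, b.language
        FROM holdings AS h
        JOIN bib AS b ON b.bib_id = h.bib_id
        WHERE {where}
        ORDER BY h.call_number
    """
    return conn.execute(sql, patterns).fetchall()


rows = segment_rows(conn, ["B", "BJ"])
for row in rows:
    print(row)
```

Holdings are selected first and bibliographic fields are joined on afterwards, mirroring the order of the steps in the paragraph above. The shared `bib_id` column plays the role of the unique identifier.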
Certain data fields – Place of Publication, Country, Language, and Publisher – are of particular interest in the datasets, as these data help the Library assess the breadth and diversity of its General Collections materials.
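One simple way to look at breadth across these fields is to tally the share of records per value. The sketch below is only an illustration of that idea, not the Library's assessment method. The sample records are invented, and the dictionary keys assume that column names match the field names in the data dictionary.

```python
from collections import Counter

# Invented sample records; keys assume the data dictionary field names.
records = [
    {"Language": "English", "Country": "United States"},
    {"Language": "English", "Country": "United States"},
    {"Language": "French", "Country": "France"},
    {"Language": "Spanish", "Country": "Mexico"},
]


def share_by_field(records, field):
    """Return each value's share of all records, most common first."""
    counts = Counter(r.get(field) or "(blank)" for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.most_common()}


language_shares = share_by_field(records, "Language")
for value, share in language_shares.items():
    print(f"{value}: {share:.0%}")
```

Blank values are grouped under `(blank)` rather than dropped, so that missing data remains visible in the totals.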
After the bibliographic data is compiled for the specific segment of assessment, certain fields are cleaned and standardized before the analysis begins:
The data fields that follow were compiled from the Library's bibliographic data.
| Field | Datatype | Definition | Metadata Source |
|---|---|---|---|
| LCCN | Text | Library of Congress Control Number | Bibliographic record |
| Title | Text | Title of work | Bibliographic record |
| Author | Text | Author of work | Bibliographic record |
| Publisher | Text | Publisher of work; publishers of English-language works with ISBNs undergo a standardization process | Derived from bibliographic record |
| Language | Text | Name of language of publication; corresponds to MARC Code List for Languages names | Bibliographic record |
| Begin Publication Date | String | Year a book was published or year a serial began publication | Bibliographic record |
| Format | Text | Format of item (book or serial) | Bibliographic record |
| Country | Text | Country of publication name | Generated based on Place Code field |
| Display Call Number | String | Call number as displayed in Library of Congress catalog | Bibliographic record |
| Holdings Location Display Name | Text | Holdings location as displayed in Library of Congress catalog | Bibliographic record |
| Holdings Location | String | Holdings location abbreviation | Bibliographic record |
| Begin Publication Date (Decade) | Integer | Decade of publication; populated only for books | Generated based on Begin Publication Date field |
| Region | Text | Global region of publication | United Nations Standard Country or Area Codes for Statistical Use, Sub-Region Name field |
| Subclass | Text | Library of Congress Classification subclass | Generated based on Display Call Number field |
| US NonUS | Text | Indication whether place of publication is within the United States (50 states and D.C.) | Generated based on Place Code field |
| Isbn | String | International Standard Book Number | Bibliographic record |
| Issn | String | International Standard Serial Number | Bibliographic record |
| Language Code | String | Code for language of publication; corresponds to MARC Code List for Languages codes | Bibliographic record |
| Place Code | String | Code for state or country of publication; corresponds to MARC Code List for Countries codes | Bibliographic record |
| Place of Publication | Text | City or other locality of publication; city or town is used when available, otherwise narrowest subnational locality available (such as county) is used | Derived from bibliographic record |
| State or Country | Text | Name of state or country of publication; corresponds to MARC Code List for Countries names | Bibliographic record |
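
Two of the generated fields in the table, Begin Publication Date (Decade) and Subclass, can be approximated from their source fields. The sketch below shows one plausible derivation and is not the Library's actual logic. It assumes that the date is a four-digit year and that the subclass is the run of one to three letters at the start of the call number. The function names and sample values are hypothetical.

```python
import re


def decade(begin_publication_date):
    """Return the decade (1980 for '1987'), or None if not a 4-digit year."""
    text = str(begin_publication_date).strip()
    if not re.fullmatch(r"[0-9]{4}", text):
        return None
    return int(text) // 10 * 10


def subclass(display_call_number):
    """Return the leading letters of a call number, or None if absent."""
    match = re.match(r"[A-Z]{1,3}", display_call_number.strip().upper())
    return match.group(0) if match else None


print(decade("1987"), subclass("PZ7 .R79 1998"))
```

The table notes that the decade is populated only for books. A fuller version would check the Format field before calling `decade`; that check is left out here because the exact Format values are not listed.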
These data are free to reuse.
Creator: Collection Development Office, Library of Congress, based on existing bibliographic records.
Please send your feedback on using these datasets to [email protected], or get in touch with any questions or suggestions!