Missing the MARC: Utilization of MARC Fields in the Search Process
I enjoyed this article and wanted to provide a “summary” lens that presents it in a more concise format, as well as explaining its acronyms & assumptions:
- MARC — MAchine Readable Cataloging format (full info from LoC here)
- USU — Utah State University
- CMS — USU Cataloging and Metadata Services unit
- EAD — Encoded Archival Description (full info from LoC here)
- BIBFRAME — Bibliographic Framework Initiative (full info from LoC here)
- OPAC — online public access catalog (Wikipedia’s got your back here)
- RLG — Research Libraries Group (Wikipedia’s got your back here)
The assessment sought to understand the correlation between user search terms, the placement of MARC records in search results lists, and the performance of individual MARC fields. The overall research questions were:
- What is the frequency & placement of MARC records in search results?
- Where are search terms located in MARC records?
These may initially seem rather dry topics but, having read the paper through to the end, I became convinced of its relevance beyond the readership of Cataloging & Classification Quarterly; the concept of Search touches us all, day to day, whether we’re looking for products (e.g. via Amazon), factual information (e.g. via Wikipedia), or social connections (e.g. a LinkedIn or Twitter search in which you specify whether the individual you’re seeking belongs to some or other organisation). Having said that, folks with “Librarian” in their title made up 60% of the research team (see acronyms list for ref):
- Head of CMS
- Metadata Librarian
- Special Collections Cataloging Librarian
- Archival Cataloging Librarian
- EAD specialist
Literature review
To demonstrate the recent emphasis on BIBFRAME & discoverability (via the semantic web & linked open data), the paper includes a helpful literature review. Some of the references are MARC-specific studies, including assessments of user search behaviors in similar library catalogs.
User-search behavior and library catalogs
- analysis of VuFind and Primo by Niu, Zhang, and Chen [1]
- library search behavior via a single search box by Lown, Sierra, and Boyer [2]
- dissertation by Fredrick Lugya [3]
Cataloging practices — influences on circulation
- correlations between access points & circulation by Gunnar Knutson [4]
- value of subject headings in bibliographic records by Gross and Taylor [5]
- effect of adding tables of contents, summary, and abstract notes by Gross, Taylor, & Joudrey [6]
- impact of enriched content in MARC fields on the usage of books by Cherie Madarash-Hill and J.B. Hill [7]
- content-enriched bibliographic records and their effect on usage by Tosaka and Weng [8]
- correlation between OPAC searches and circulation of materials by Laura N. Kirkland [9]
MARC records and discoverability
- report produced in 2010 by the RLG Partnership MARC Tag Usage Working Group and OCLC Research [10]
Methodology and process
USU serves 27,000 students across 9 campuses, with students predominantly studying:
- Communicative Disorders & Deaf Education,
- Economics,
- Psychology,
- Mechanical Engineering,
- Biology,
- Elementary Education,
- Human Movement Science,
- Computer Science.
USU Libraries uses Encore, a product of Innovative Interfaces, Inc. (III) as its discovery layer for library resources. Encore pulls together records from the library’s catalog, Sierra, as well as journal articles from subscription databases into a single search interface to simplify the research process for users. Sierra houses the library’s roughly 2.5 million MARC-based records, including physical and electronic resource material. Fifty-one databases, many of them EBSCO databases, feed just over 3.6 million non-MARC records into Encore.
In order to determine how MARC records interacted with the user search process, the research team examined the logs of URLs generated by Encore. They used Google Analytics and Octoparse (a web scraping tool) plus Airtable, which is probably my favourite piece of technology at the moment.
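If you’re curious what that URL-mining step might look like in code, here’s a minimal sketch of my own (the team used Google Analytics exports and Octoparse rather than a hand-rolled script). The CSV column name and the Encore URL pattern are assumptions for illustration only:

```python
# Hypothetical sketch: pull user search terms out of an exported list of
# Encore URLs. The "page_path" column and the "C__S<term>" URL pattern are
# assumptions, not the paper's actual setup.
import csv
import re
from collections import Counter
from urllib.parse import unquote

# Assumed Encore-style search path with the query embedded after "C__S";
# adjust the pattern to whatever the real logs contain.
SEARCH_TERM = re.compile(r"/iii/encore/search/C__S([^_/]+)")

def extract_terms(analytics_csv: str) -> Counter:
    """Tally the search terms found in the page-path column of a GA export."""
    terms = Counter()
    with open(analytics_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            match = SEARCH_TERM.search(row.get("page_path", ""))
            if match:
                terms[unquote(match.group(1)).strip().lower()] += 1
    return terms

if __name__ == "__main__":
    for term, count in extract_terms("encore_urls.csv").most_common(10):
        print(f"{count:5d}  {term}")
```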
At Tigmus I used Airtable to blend automated machine processes (e.g. formulae, scraping, rollups, and templateable format conversion) with human/manual activity. Amongst its chief functionalities is the ability to generate highly readable API documentation for your base within seconds of creating or updating it. Indeed, the paper describes how, within Airtable, the research team linked batches of item titles and hyperlinks to their corresponding dynamic URLs and invited student technicians to provide quality control by spot-checking the web-scraped search result lists as well as all null outcomes (e.g. where URLs returned no results at all).
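For a flavour of that Airtable step, here’s a hypothetical sketch that pushes batches of scraped rows into a base via Airtable’s standard records REST API, flagging null outcomes for manual review. The base ID, table name, and field names are invented; the paper doesn’t describe the team’s actual schema:

```python
# Hypothetical Airtable upload: batch item titles, dynamic URLs, and a
# "needs QC" flag into a base for student technicians to spot-check.
# Base ID, table name, and field names are made up for illustration.
import os
import requests

AIRTABLE_URL = "https://api.airtable.com/v0/appXXXXXXXXXXXXXX/Search%20Results"
HEADERS = {
    "Authorization": f"Bearer {os.environ['AIRTABLE_TOKEN']}",
    "Content-Type": "application/json",
}

def upload_batch(rows: list[dict]) -> None:
    """Create records in chunks of 10 (Airtable's per-request limit)."""
    for i in range(0, len(rows), 10):
        payload = {
            "records": [
                {
                    "fields": {
                        "Item title": r["title"],
                        "Dynamic URL": r["url"],
                        # Flag empty result lists so they get manual review.
                        "Needs QC": not r["results"],
                    }
                }
                for r in rows[i : i + 10]
            ]
        }
        resp = requests.post(AIRTABLE_URL, headers=HEADERS, json=payload, timeout=30)
        resp.raise_for_status()
```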
Another wow factor in this paper is the way the team handled, and benefitted from, the COVID-19 pandemic. Working remotely, the CMS unit and student technicians successfully coded all 13,312 MARC records and provided quality control on each other’s work. With unit members and student technicians working part-time on code & data, the process took ~3 months to complete.
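If you’re wondering what “coding” a MARC record against a search term could look like in practice, here’s a rough sketch using the pymarc library. The file name and the substring-matching rule are assumptions; the team’s actual coding scheme was certainly richer than this:

```python
# Rough sketch: for one search term, list the MARC fields in each record
# whose text contains it. The file name and simple substring matching are
# assumptions for illustration only.
from pymarc import MARCReader

def fields_matching(term: str, marc_path: str = "results_sample.mrc"):
    """Yield (record title, tags of variable fields containing the term)."""
    needle = term.lower()
    with open(marc_path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:  # skip records pymarc could not parse
                continue
            title_fields = record.get_fields("245")
            title = title_fields[0].value() if title_fields else "(no 245)"
            tags = [
                field.tag
                for field in record.get_fields()
                if not field.is_control_field() and needle in field.value().lower()
            ]
            yield title, tags

if __name__ == "__main__":
    for title, tags in fields_matching("deaf education"):
        print(title, tags)
```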
Other products leveraged, with which readers of this blog will be familiar, included Google Scholar and Microsoft Academic. Nice to know these are delivering value to teams in institutions as well as to hobbyists & consumers!
Analysis and results
Research Questions
Research Question #1: What is the frequency and placement of MARC records in search results lists?
Demoed as Table 1. Count of catalog vs. database records displayed to users.
1.2: Is there a difference between locally created records and vendor-supplied records in the frequency of listing in search results?
Demoed as Table 2. Records displayed in search results and records accessed, sorted by record creator.
1.3: How are MARC records ranked in the search results list?
Demoed as Table 3. MARC record position number in results list, ordered by frequency.
1.4: Where do MARC records for known items rank in the search results list?
Demoed as Table 4. Available whole object known items by search results placement.
Research Question #2: Where are search terms located in MARC records?
2.1: What fields are used most in retrieving records?
Demoed as Table 5. Prevalence of user search terms, in full or in part, in the top 20 MARC fields.
2.2: For records accessed by the patron, is there a difference in where search terms are located?
Demoed as Table 6. Prevalence of search terms, in full or in part, in the top 20 MARC fields in records viewed by patrons.
2.3: For locally created records and vendor-supplied records, is there a difference in where search terms are located?
Demoed as Table 7. Frequency and percentage of fields used in record retrieval, listed by CMS and vendor-supplied MARC records.
2.4: What fields are not present in the records?
Demoed as Table 8. Fields not present in the MARC record.
2.5: Which fields would make the greatest impact if not included in the record?
Demoed as Table 9. Frequency of records where search terms matched only one MARC field.
…and Table 10. Frequency of records where search terms matched only one MARC field in records viewed by patrons (a rough tallying sketch follows below).
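To make the Table 5 / Table 9 style tallies a little more concrete, here’s a rough aggregation sketch. It assumes you already have, for each retrieved record, the list of MARC tags that contained the search term (e.g. as produced by the pymarc sketch earlier); the data below are made up:

```python
# Hypothetical aggregation: turn per-record lists of matched MARC tags into
# a field-prevalence tally (Table 5 style) and a count of records where only
# one field matched (Table 9 style).
from collections import Counter

def summarise(matches: list[list[str]]) -> tuple[Counter, Counter]:
    """matches: one list of matched MARC tags per retrieved record."""
    field_prevalence = Counter()   # how often each field held the search term
    single_field_hits = Counter()  # records where exactly one field matched
    for tags in matches:
        unique_tags = set(tags)
        field_prevalence.update(unique_tags)
        if len(unique_tags) == 1:
            single_field_hits[next(iter(unique_tags))] += 1
    return field_prevalence, single_field_hits

# Made-up example: three records and the tags that matched in each.
prevalence, singles = summarise([["245", "650"], ["245"], ["520", "245"]])
print(prevalence.most_common())  # 245 matched in 3 records; 650 and 520 in 1 each
print(singles)                   # Counter({'245': 1})
```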
Discussion
The CMS unit at USU uses the insights from this research to structure local practices & procedures (I’d call this “implementing data governance”) based on the habits & needs demonstrated during the investigation.
Conclusion
Data-driven decision-making is crucial here and everywhere. The benefit of having the unit work together with student volunteers to code & assess how search terms interact with MARC metadata is the opportunity to reflect, collectively and concretely, on how user searches connected (or did not connect) with records produced or provided by the unit. From a “non-invasive data governance” perspective this is a great example of involving stakeholders early on, and perhaps you could even say it’s been done covertly, as the students won’t have realised they’re contributing to a vast, empowering governance initiative that will presumably evolve and potentially transform the institution. What a great team — love the way they structured the article.