
Indexing a 'Green' Institutional Repository using Dublin Core

6/5/2015

I was recently employed as an indexer cataloguing a “green” library repository[1] at the International Development Research Centre (IDRC) in Ottawa, Canada. My role, along with others on the team, was to provide keywords for 8,000 research documents, including records of research activities, using the Dublin Core metadata schema. Before this assignment I wasn’t familiar with Dublin Core, but at the close of the project I appreciate its scope as a vehicle for indexing and tagging collections on the web. The project was a fascinating look into Dublin Core’s capacity to comprehensively index and tag the IDRC research collection, while making web searching easier and more efficient for both researchers and non-academics.

Indexers created main keywords for repository items, first by reading documents for subject classifications and type of research, and then by adding several terms describing important research results along with other types of content. I read or reviewed 3,200 research papers, theses, project evaluations, reports and summaries, as well as peer-reviewed journal articles and other items related to research programmes in the developing world that are funded by the IDRC. It has been an amazing opportunity to read a body of mostly hidden literature, commissioned over the past 45 years.

The original documents exist in paper format, and during the span of the project all have been digitized. Indexers worked from both archival hard copies and digitized material. The digitization and indexing of the International Development Library repository allows forty-five years of development research results to be made available to a global public. At the same time, it provides accountability to programme donors and funders, allowing them to see some effects of their contributions.

The Dublin Core metadata schema was customized to the needs of the International Development Library, providing bibliographical information about the research as well as subject heading fields for keyword descriptors. Keywords added to the Dublin Core records become searchable subject headings in the collection catalogue via the IDRC library database and search engine, and are also exposed to internet web crawlers through DSpace [2]. The indexer's keywords are then refined by checking them against the OECD Macrothesaurus and other controlled vocabularies for particular subject areas.
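To make the idea of keywords as subject fields concrete, here is a minimal sketch in Python of a Dublin Core style record. It is an illustration only, not the IDRC's actual customized schema: the `record` wrapper element and the `build_record` helper are my own assumptions, while `dc:title` and `dc:subject` are standard Dublin Core elements. The title and keywords are taken from an example discussed further on in this post.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 basic Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(title, subjects):
    """Build a simple record with one dc:title and one dc:subject per keyword.

    The <record> wrapper is a hypothetical container chosen for this sketch.
    """
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    for term in subjects:
        # Each keyword becomes its own repeatable dc:subject element.
        ET.SubElement(record, f"{{{DC_NS}}}subject").text = term
    return record


record = build_record(
    "Livelihood development activities in Barangay Balingasay, Bolinao, Pangasinan",
    ["PHILIPPINES--BOLINAO", "COMMUNITY PARTICIPATION", "MARINE RESOURCES"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping each keyword in its own subject element, rather than in one delimited string, is what lets a catalogue treat every term as a separate searchable heading.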

I learned about controlled vocabularies on the job. I had never needed them for back-of-the-book indexes, which do not depend on the web for usability. A controlled vocabulary particularizes and standardizes search terms so that researchers and web crawlers can find the research more efficiently via the internet. At the IDRC, much development research focuses on agriculture, climate change adaptation, global health, and poverty alleviation. Keyword vocabulary is typically drawn from the OECD Macrothesaurus, MeSH (for health), CAB (for agriculture), UNESCO and UNBIS, as well as others.

I discovered that problems could arise when authors are not given specific criteria for generating keywords for their own articles and reports. The situation is similar to having the author of a book create his or her own index. And while a book index is much longer, and covers more subject areas, the snapshot provided by a brief index of subject terms for a research paper can look muddled or quickly thrown together. Content is lost. For me, the best-case scenario happens when keywords add up to tell a little story about what is in the paper. For example: PHILIPPINES--BOLINAO, COMMUNITY PARTICIPATION, MARINE RESOURCES, COASTAL ZONE MANAGEMENT, LIVELIHOODS, FISHING, ECOTOURISM, ENVIRONMENTAL EDUCATION, FEASIBILITY STUDIES. These keyword terms provide a snapshot for the title: Livelihood development activities in Barangay Balingasay, Bolinao, Pangasinan.

The subject catalogue for the entire collection can become distorted or inflated with author-generated keywords. For example, there may be multiple terms appropriate to fish farming: fish farms, aquaculture, pond culture, aquaculture techniques, fishery management, etc. But if authors suggest multiple keywords from their own vocabulary, such as fish farming, fish farms, fish farm, pond farming, fisheries, fishery management, etc., while the (CAB) macrothesaurus term is ‘pond aquaculture’, the collection catalogue can become muddled with over-generalized or inappropriate search terms. Internet web crawler efficiency is also reduced.
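The fish farming problem can be sketched as a simple lookup from author variants to a preferred term. This is a rough illustration in Python, not how the IDRC catalogue works: the `VARIANT_TO_PREFERRED` table and the `normalize_keywords` helper are hypothetical, and which variants should map to ‘pond aquaculture’ is my own assumption. Broader terms such as ‘fishery management’ are deliberately left unmapped so that an indexer can review them.

```python
# Hypothetical mapping from author variants to a preferred thesaurus term.
VARIANT_TO_PREFERRED = {
    "fish farming": "pond aquaculture",
    "fish farms": "pond aquaculture",
    "fish farm": "pond aquaculture",
    "pond farming": "pond aquaculture",
}


def normalize_keywords(author_keywords, mapping):
    """Return (controlled_terms, unmatched_keywords).

    Keywords are matched case-insensitively with extra whitespace collapsed.
    Duplicates of the same preferred term are dropped; anything not found in
    the mapping is returned separately for human review.
    """
    controlled = []
    unmatched = []
    for keyword in author_keywords:
        key = " ".join(keyword.lower().split())
        preferred = mapping.get(key)
        if preferred is None:
            unmatched.append(keyword)
        elif preferred not in controlled:
            controlled.append(preferred)
    return controlled, unmatched


controlled, unmatched = normalize_keywords(
    ["Fish farming", "fish  farms", "fishery management"],
    VARIANT_TO_PREFERRED,
)
print(controlled)
print(unmatched)
```

Three author keywords collapse here into one catalogue heading plus one term flagged for review, which is the kind of tidying a controlled vocabulary is meant to provide.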

Now that professional indexing of the collection has ended (at least for this part of the project), current authors of research may forget to include in their keywords what seems obvious to them, such as the country of origin of the research, or terminology that has become part of their own thinking. As terminology changes over time, these terms can become more or less familiar. They also become more or less searchable. Standardized vocabulary can eliminate a great deal of confusion.

I came across this omission in more recent papers that I reviewed towards the end of the project, where authors were often asked for keywords without being given criteria or thesauri to refer to. As author keywords will soon become the only means of indexing the collection, over time this will create problems. If researchers commissioned by the IDRC are not familiar with pre-existing terms and categories, the searchability of the repository will eventually become diluted. In the above example of author-generated keywords, the outsider searching the subject catalogue for the particular country or region that has been omitted from the keywords will of course be unable to know what hasn’t been included. It’s self-evident that you can’t find what isn’t there. But it’s these invisible gaps in information that are the proper concern of professional indexing, and not of the authors of research papers.

It is understandable that author keywords are prioritized by the IDRC. After all, the authors are their own subject experts. And doing a PDF search is certainly the cheapest way to go with new material. Yet providing authors with benchmark criteria, such as the OECD Macrothesaurus or other appropriate thesauri, would help to keep the integrity of the subject catalogue for the collection as a whole, while retaining its efficiency for internet searchability.

A controlled vocabulary functions like the index in the back of a book. It creates a customized set of related concepts, increasing and improving searchability; in this instance, of a whole database.



[1] Kennison, R., Shreeves, S. L., Harnad, S. (2013). Point & Counterpoint: The Purpose of Institutional Repositories: Green OA or Beyond? Journal of Librarianship and Scholarly Communication 1(4):eP1105. http://dx.doi.org/10.7710/2162-3309.1105 See also https://www.youtube.com/watch?v=T2oCp6psqlE

[2] See ‘Google Scholar and DSpace’ http://atmire.com/website/?q=content/google-scholar-and-dspace
