A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments (1812.08538v1)

Published 20 Dec 2018 in cs.DL

Abstract: National exercises for the evaluation of research activity by universities are becoming regular practice in ever more countries. These exercises have mainly been conducted through the application of peer-review methods. Bibliometrics has not been able to offer a valid large-scale alternative because of almost overwhelming difficulties in identifying the true author of each publication. We will address this problem by presenting a heuristic approach to author name disambiguation in bibliometric datasets for large-scale research assessments. The application proposed concerns the Italian university system, consisting of 80 universities and a research staff of over 60,000 scientists. The key advantage of the proposed approach is the ease of implementation. The algorithms are of practical application and have considerably better scalability and expandability properties than state-of-the-art unsupervised approaches. Moreover, the performance in terms of precision and recall, which can be further improved, seems thoroughly adequate for the typical needs of large-scale bibliometric research assessments.

Authors (3)

Ciriaco Andrea D'Angelo (133 papers)
Cristiano Giuffrida (8 papers)
Giovanni Abramo (133 papers)

Citations (166)


Summary

The paper introduces a heuristic method integrating structured external data with bibliometric records to address author name ambiguity in large-scale research assessments.
The proposed methodology, comprising database integration, mapping, and filtering, achieved high precision (~96.4%) and recall (~94.3%) on the Italian National Citation Report dataset.
This scalable approach offers a practical solution for accurate individual researcher-level evaluation, supporting national assessments and fostering granular analysis of research productivity.

Heuristic Approach to Author Name Disambiguation in Large-Scale Research Assessments

The paper introduces a heuristic approach to author name disambiguation in bibliometric databases. It targets a major obstacle in large-scale research assessments: accurately identifying the true author of each publication. The difficulty stems mainly from homonyms and from variations in how author names are recorded. To address it, the authors integrate bibliometric records with a structured external data source that records the affiliations and research areas of Italian academic scientists.

Methodology Overview

The authors divide their method into three phases: database integration, mapping generation, and filtering.

Database Integration: Bibliometric records are enriched with information from the external database. The external database supplies a reference list of author identities, which helps offset noise in the bibliometric data.
Mapping Generation: This phase uses aggressive matching to build a deliberately over-inclusive (superset) set of mappings between bibliometric authors and candidate identities from the external database. The mappings are based on name similarity and organizational affiliation.
Filtering: This phase applies several data-driven heuristics in sequence to remove the false positives that homonyms introduce, with the aim of raising precision while preserving recall. The filters are address matching, a subject category compatibility check, shared sector elimination, and a maximum correspondence evaluation, each of which refines the set of author-identity pairs.
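The mapping and filtering phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record schemas (Identity, Byline), the "surname plus first initial" matching key, the sector codes, and the compatibility table are all assumptions made for illustration. Only two of the four filters are shown, and affiliation is checked in the address filter rather than during mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """One record in the external staff database (hypothetical schema)."""
    person_id: str
    surname: str
    first_name: str
    university: str
    sector: str  # research area code (hypothetical values below)


@dataclass(frozen=True)
class Byline:
    """One author occurrence on a publication (hypothetical schema)."""
    pub_id: str
    author_name: str         # e.g. "Rossi, M."
    universities: frozenset  # universities recognized in the address list
    subject_category: str


def name_key(surname: str, first_name: str) -> str:
    """Reduce a name to 'surname initial' so that name variants collide."""
    return f"{surname.strip().lower()} {first_name.strip()[:1].lower()}"


def byline_key(author_name: str) -> str:
    """Build the same key from a 'Surname, Given' byline string."""
    surname, _, given = author_name.partition(",")
    return name_key(surname, given)


def generate_mappings(bylines, identities):
    """Aggressive matching: pair each byline with every identity sharing its key."""
    index = {}
    for ident in identities:
        index.setdefault(name_key(ident.surname, ident.first_name), []).append(ident)
    return [(b, i) for b in bylines for i in index.get(byline_key(b.author_name), [])]


def address_filter(pairs):
    """Keep pairs whose identity's university appears in the publication addresses."""
    return [(b, i) for b, i in pairs if i.university in b.universities]


def category_filter(pairs, compatible):
    """Keep pairs whose sector is compatible with the publication's subject category."""
    return [(b, i) for b, i in pairs
            if b.subject_category in compatible.get(i.sector, set())]


# Toy data: two homonyms ("Rossi, M.") at different universities.
identities = [
    Identity("p1", "Rossi", "Mario", "Univ A", "PHYS"),
    Identity("p2", "Rossi", "Maria", "Univ B", "MED"),
]
bylines = [Byline("pub1", "Rossi, M.", frozenset({"Univ A"}), "Physics, Applied")]
compatible = {"PHYS": {"Physics, Applied"}, "MED": {"Oncology"}}

candidates = generate_mappings(bylines, identities)  # superset: both homonyms
kept = category_filter(address_filter(candidates), compatible)  # filters in sequence
```

In this toy case the mapping step pairs the byline with both homonyms, and the sequential filters leave only the identity whose university and sector agree with the publication. Running the cheaper address check first is a design choice of the sketch, not something the summary specifies.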

Experimental Results

The algorithm was validated on the Italian National Citation Report dataset and cross-verified against the University of Milan's Institutional Research Archives. The results show precision of about 96.4% and recall of about 94.3%, meaning few false positives and few false negatives. The abstract describes this performance as adequate for the typical needs of large-scale bibliometric research assessments.
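For reference, precision and recall over author-identity pairs can be computed as in the sketch below. The pair sets are toy data invented for illustration and do not reproduce the paper's figures; the function name precision_recall is likewise hypothetical.

```python
def precision_recall(predicted: set, truth: set) -> tuple:
    """Return (precision, recall) for predicted vs. verified (pub_id, person_id) pairs."""
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Toy example: one wrong attribution (pub4) and one missed attribution (pub5).
predicted = {("pub1", "p1"), ("pub2", "p1"), ("pub3", "p2"), ("pub4", "p3")}
truth = {("pub1", "p1"), ("pub2", "p1"), ("pub3", "p2"), ("pub5", "p3")}

precision, recall = precision_recall(predicted, truth)
```

Precision falls with each false positive (a publication attributed to the wrong person), while recall falls with each false negative (a true attribution the algorithm missed).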

Implications and Future Directions

This approach offers a scalable and expandable solution for author name disambiguation in large-scale bibliometric applications. Because it is easy to implement, it can be applied readily in practice, notably in support of national research evaluations.

While the algorithm presents significant advantages over state-of-the-art unsupervised methodologies, further enhancement in precision and recall can be achieved through continuous refinement of filtering strategies and external data integration techniques.

In conclusion, building bibliometric databases at the level of individual researchers enables a more granular and precise evaluation of research productivity across disciplines. The authors advocate similar bibliometric frameworks in other countries to facilitate international comparisons and enrich research assessment methodologies. Future research may explore more robust data integration mechanisms or alternative data sources to further improve disambiguation performance.