- The paper introduces a heuristic method integrating structured external data with bibliometric records to address author name ambiguity in large-scale research assessments.
- The proposed methodology, comprising database integration, mapping, and filtering, achieved high precision (~96.4%) and recall (~94.3%) on the Italian National Citation Report dataset.
- This scalable approach offers a practical solution for accurate individual researcher-level evaluation, supporting national assessments and fostering granular analysis of research productivity.
Heuristic Approach to Author Name Disambiguation in Large-Scale Research Assessments
The paper introduces a heuristic approach to author name disambiguation in bibliometric databases. It targets a significant obstacle in large-scale research assessments: the difficulty of identifying the true author of each publication. The problem arises mainly from homonyms and from variations in how author names are recorded. To tackle it, the authors propose integrating bibliometric databases with a structured external data source that records the affiliations and research areas of Italian academic scientists.
Methodology Overview
The authors delineate their method into three distinct phases: database integration, mapping generation, and filtering.
- Database Integration: Bibliometric records are enriched with the external database, which supplies a reference list of author identities. This reference list complements the bibliometric records and helps mitigate noise in the bibliometric data.
- Mapping Generation: This phase builds a deliberately over-inclusive (superset) mapping between bibliometric authors and candidate identities from the external database, using aggressive matching on name similarity and organizational affiliation.
- Filtering: This phase applies several data-driven heuristics to remove the false positives produced by homonyms. The filters are applied sequentially, with the aim of raising precision without compromising recall. They include address matching, a subject category compatibility check, shared sector elimination, and a maximum correspondence evaluation, each of which refines the set of author-identity pairs.
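The mapping and filtering phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record types, the surname-plus-initial matching rule, and the `compatible` lookup table are all assumptions made for illustration. Only two of the filters (address matching and subject category compatibility) are shown, because the summary does not specify how shared sector elimination or maximum correspondence are defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Hypothetical row of the external database."""
    ident: str
    surname: str
    first_name: str
    university: str
    sector: str  # research area code (assumed field)


@dataclass(frozen=True)
class Authorship:
    """Hypothetical author entry on one bibliometric record."""
    pub_id: str
    author_name: str          # assumed format: "Surname, Initial"
    addresses: frozenset      # universities listed on the publication
    subject_category: str


def name_key(surname, given):
    # Aggressive matching key: lowercase surname plus first initial.
    return (surname.strip().lower(), given.strip()[:1].lower())


def generate_mapping(authorships, identities):
    """Build the over-inclusive set of (authorship, identity) pairs."""
    by_key = {}
    for ident in identities:
        by_key.setdefault(name_key(ident.surname, ident.first_name), []).append(ident)
    pairs = []
    for a in authorships:
        surname, _, given = a.author_name.partition(",")
        for ident in by_key.get(name_key(surname, given), []):
            pairs.append((a, ident))
    return pairs


def address_filter(pairs):
    # Keep a pair only if the identity's university appears on the publication.
    return [(a, i) for a, i in pairs if i.university in a.addresses]


def category_filter(pairs, compatible):
    # Keep a pair only if the publication's category fits the identity's sector.
    return [(a, i) for a, i in pairs
            if a.subject_category in compatible.get(i.sector, set())]


def disambiguate(authorships, identities, compatible):
    """Apply the filters sequentially to the superset mapping."""
    pairs = generate_mapping(authorships, identities)
    pairs = address_filter(pairs)
    return category_filter(pairs, compatible)


# Toy data: two homonyms ("Rossi, M") at different universities.
identities = [
    Identity("id1", "Rossi", "Mario", "Milan", "MED"),
    Identity("id2", "Rossi", "Marco", "Rome", "FIS"),
]
authorships = [Authorship("p1", "Rossi, M", frozenset({"Milan"}), "Oncology")]
compatible = {"MED": {"Oncology"}, "FIS": {"Physics"}}

result = disambiguate(authorships, identities, compatible)
```

In this toy case the mapping phase proposes both homonyms as candidates, and the filters keep only the identity whose university and research sector agree with the publication.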
Experimental Results
The algorithm was validated on the Italian National Citation Report dataset and cross-checked against the University of Milan's Institutional Research Archives. The results show precision of about 96.4% and recall of about 94.3%, with few false positives and false negatives, a level reported as out of reach for traditional methods.
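For readers unfamiliar with the two metrics, the following sketch shows how precision and recall could be computed by comparing predicted publication-author pairs against a verified reference set. The function name and the toy pairs are illustrative assumptions, not data from the paper.

```python
def precision_recall(predicted, gold):
    """Compare predicted (publication, identity) pairs with a verified set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all verified pairs
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example (made-up pairs, not results from the paper).
predicted = {("p1", "a"), ("p2", "a"), ("p3", "b")}
gold = {("p1", "a"), ("p2", "a"), ("p4", "c"), ("p5", "c")}

precision, recall = precision_recall(predicted, gold)
```

Here a false positive is a predicted pair absent from the verified set, and a false negative is a verified pair the algorithm failed to produce.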
Implications and Future Directions
This approach offers a scalable and extensible solution for author name disambiguation in large-scale bibliometric applications. Because it is easy to implement, it can be applied immediately in a variety of settings, notably in support of national research evaluations.
While the algorithm presents significant advantages over state-of-the-art unsupervised methodologies, further enhancement in precision and recall can be achieved through continuous refinement of filtering strategies and external data integration techniques.
In conclusion, building bibliometric databases at the level of individual researchers enables a more granular and precise evaluation of research productivity across disciplines. The authors advocate for similar bibliometric frameworks worldwide to facilitate international comparisons and enrich research assessment methodologies. Future research may explore more robust data integration mechanisms or alternative data sources to further improve disambiguation performance.