
BoostER: Leveraging Large Language Models for Enhancing Entity Resolution (2403.06434v1)

Published 11 Mar 2024 in cs.DB

Abstract: Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. This importance is underscored by the presence of numerous duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. LLMs such as GPT-4 have demonstrated advanced linguistic capabilities, which suggests a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process, revealing advantages in both ease of deployment and low cost. Our approach optimally selects a set of matching questions and poses them to LLMs for verification, then refines the distribution of entity resolution results using the LLMs' responses. This offers promising prospects for achieving high-quality entity resolution results in real-world applications, especially for individuals or small companies, without the need for extensive model training or significant financial investment.
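
The abstract describes a loop of selecting matching questions, asking an LLM to verify them, and refining the distribution over resolution results. The following is a minimal Python sketch of that general idea, not the BoostER implementation. It rests on several assumptions that do not come from the paper: each candidate pair carries an independent match probability, questions are chosen by highest binary entropy (a simple stand-in for the paper's optimal selection), the LLM is a hypothetical callable `ask_llm`, and its answers are treated as correct with a fixed, assumed `accuracy`.

```python
import math
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # a candidate pair of record identifiers


def binary_entropy(p: float) -> float:
    """Uncertainty (in bits) of a match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_questions(probs: Dict[Pair, float], budget: int) -> List[Pair]:
    """Pick the `budget` most uncertain pairs (assumed selection heuristic)."""
    ranked = sorted(probs, key=lambda pair: binary_entropy(probs[pair]), reverse=True)
    return ranked[:budget]


def bayes_update(prior: float, says_match: bool, accuracy: float) -> float:
    """Posterior match probability after one LLM answer.

    `accuracy` is the assumed probability that the LLM answers correctly.
    """
    like_match = accuracy if says_match else 1 - accuracy
    like_nonmatch = 1 - accuracy if says_match else accuracy
    evidence = like_match * prior + like_nonmatch * (1 - prior)
    if evidence == 0:
        return prior  # answer is impossible under the model; keep the prior
    return like_match * prior / evidence


def refine(
    probs: Dict[Pair, float],
    ask_llm: Callable[[Pair], bool],
    budget: int,
    accuracy: float = 0.9,
) -> Dict[Pair, float]:
    """Ask the LLM about the selected pairs and return refined probabilities."""
    refined = dict(probs)
    for pair in select_questions(probs, budget):
        refined[pair] = bayes_update(refined[pair], ask_llm(pair), accuracy)
    return refined
```

In practice `ask_llm` would wrap a prompt such as "Do these two records refer to the same entity?" and parse a yes/no answer; here it is left as a plain callable so the sketch stays independent of any particular LLM API.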
