BoostER: Leveraging Large Language Models for Enhancing Entity Resolution (2403.06434v1)

Published 11 Mar 2024 in cs.DB

Abstract: Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. Its importance is underscored by the numerous duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. Large language models (LLMs) like GPT-4 have demonstrated advanced linguistic capabilities, which may offer a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process, revealing advantages in both easy deployment and low cost. Our approach optimally selects a set of matching questions and poses them to LLMs for verification, then refines the distribution of entity resolution results using the LLMs' responses. This offers promising prospects for achieving high-quality entity resolution in real-world applications, especially for individuals or small companies, without the need for extensive model training or significant financial investment.
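
The abstract describes a loop of selecting matching questions, asking an LLM to verify them, and refining a distribution over entity resolution results. The following is a minimal sketch of one way such a loop could work, not the paper's actual algorithm. It assumes the distribution is a dictionary mapping candidate partitions of the records to probabilities, that each question asks whether two records match, that the question is chosen greedily by lowest expected entropy, and that the LLM answers correctly with a fixed probability. All names (`best_question`, `update`, `accuracy`) and the example data are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution over partitions."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def same_entity(partition, a, b):
    """True if records a and b share a block in this partition."""
    return any(a in block and b in block for block in partition)

def update(dist, pair, answer, accuracy=0.9):
    """Bayesian refinement of the distribution given a yes/no answer.

    `accuracy` is an assumed probability that the answer is correct.
    """
    a, b = pair
    post = {}
    for part, p in dist.items():
        agrees = same_entity(part, a, b) == answer
        post[part] = p * (accuracy if agrees else 1 - accuracy)
    total = sum(post.values())
    return {part: p / total for part, p in post.items()}

def expected_entropy(dist, pair, accuracy=0.9):
    """Expected entropy of the distribution after asking about `pair`."""
    a, b = pair
    p_match = sum(p for part, p in dist.items() if same_entity(part, a, b))
    p_yes = accuracy * p_match + (1 - accuracy) * (1 - p_match)
    result = 0.0
    for answer, p_ans in ((True, p_yes), (False, 1 - p_yes)):
        if p_ans > 0:
            result += p_ans * entropy(update(dist, pair, answer, accuracy))
    return result

def best_question(dist, pairs, accuracy=0.9):
    """Greedy choice: the pair whose answer is expected to leave the least uncertainty."""
    return min(pairs, key=lambda pair: expected_entropy(dist, pair, accuracy))

def partition(*blocks):
    """Hashable partition built from lists of record ids."""
    return frozenset(frozenset(block) for block in blocks)

# Hypothetical prior over three candidate resolutions of records r1, r2, r3.
prior = {
    partition(["r1", "r2"], ["r3"]): 0.5,
    partition(["r1"], ["r2"], ["r3"]): 0.3,
    partition(["r1", "r2", "r3"]): 0.2,
}
pairs = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]

question = best_question(prior, pairs)
# In a real system the answer would come from an LLM; here it is hard-coded.
posterior = update(prior, question, answer=True)
print(question, round(entropy(prior), 3), round(entropy(posterior), 3))
```

In this sketch a fixed question budget would be spent by repeating the select, ask, and update steps; selecting a whole set of questions at once, as the abstract describes, would require a joint rather than greedy criterion.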
