Efficiently Estimating Mutual Information Between Attributes Across Tables (2403.15553v1)

Published 22 Mar 2024 in cs.DB

Abstract: Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap or containment. Because these systems return a very large number of candidate tables, many irrelevant joins must be performed, which can be computationally expensive or even infeasible in practice. We address this limitation by proposing the use of efficient mutual information (MI) estimation for finding relevant joinable tables. We introduce a new sketching method that enables efficient evaluation of relationship discovery queries by estimating MI without materializing the joins, returning a smaller set of tables that are more likely to be relevant. We also demonstrate the effectiveness of our approach at approximating MI in extensive experiments using synthetic and real-world datasets.
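
For orientation, the quantity the abstract refers to can be illustrated with a minimal sketch of the naive baseline: materialize the join, then compute a plug-in (maximum-likelihood) MI estimate over the joined pairs of discrete values. This is not the paper's sketching method, which is designed to avoid materializing the join. The function names `plugin_mutual_information` and `join_then_estimate`, and the list-of-tuples table layout, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def plugin_mutual_information(pairs):
    """Plug-in MI estimate (in nats) for an iterable of discrete (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of each (x, y) combination
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def join_then_estimate(left, right):
    """Inner-join two tables on their key, then estimate MI on the result.

    left:  list of (key, x) rows (hypothetical layout)
    right: list of (key, y) rows (hypothetical layout)
    """
    index = {}
    for key, y in right:
        index.setdefault(key, []).append(y)
    # Materialize the join: this is the costly step a sketch would avoid.
    joined = [(x, y) for key, x in left for y in index.get(key, [])]
    return plugin_mutual_information(joined)


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (3, "a"), (4, "b")]
    right = [(1, "u"), (2, "v"), (3, "u"), (4, "v")]
    print(join_then_estimate(left, right))
```

In this toy input the external column is fully determined by the input column, so the estimate equals the entropy of a fair binary variable, ln 2. Running this baseline against every candidate table requires one full join per candidate, which is the cost the abstract describes as expensive or infeasible.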
