Efficiently Estimating Mutual Information Between Attributes Across Tables (2403.15553v1)
Abstract: Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, efficiently discovering relevant external tables to join with a given input table is challenging. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap or containment. Because these systems return a large number of candidate tables, many irrelevant joins must be performed, which can be computationally expensive or even infeasible in practice. We address this limitation by proposing the use of efficient mutual information (MI) estimation for finding relevant joinable tables. We introduce a new sketching method that enables efficient evaluation of relationship discovery queries by estimating MI without materializing the joins, returning a smaller set of tables that are more likely to be relevant. We also demonstrate the effectiveness of our approach at approximating MI in extensive experiments using synthetic and real-world datasets.
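To make the idea of estimating MI without materializing a join more concrete, here is a minimal sketch in Python. It is not the sketching method introduced in the paper. It assumes a simple coordinated sample: each table keeps only the rows whose join key hashes below a shared threshold, so both tables retain the same keys, and a plug-in MI estimate is computed on the small joined sample. All function names (`key_hash`, `build_sketch`, `join_sketches`, `plugin_mi`) and the toy tables are hypothetical, and the sketch assumes unique join keys and discrete attribute values.

```python
import hashlib
import math
from collections import Counter


def key_hash(key):
    """Map a join key to a pseudo-random value in [0, 1) that is the same for every table."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(rows, threshold):
    """Keep the (key, value) pairs whose key hash falls below the threshold.

    Assumption for this illustration: each key appears at most once per table.
    """
    return {k: v for k, v in rows if key_hash(k) < threshold}


def join_sketches(sketch_x, sketch_y):
    """Pair up the values of keys that survive in both sketches."""
    return [(sketch_x[k], sketch_y[k]) for k in sketch_x.keys() & sketch_y.keys()]


def plugin_mi(pairs):
    """Plug-in (maximum likelihood) MI estimate, in nats, for discrete value pairs."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Toy tables (hypothetical data): keys 5000..9999 appear in both tables,
# and the attribute in table_b is a deterministic function of the one in table_a.
table_a = [(i, i % 4) for i in range(10000)]
table_b = [(i, (i % 4) // 2) for i in range(5000, 15000)]

sketch_a = build_sketch(table_a, threshold=0.1)
sketch_b = build_sketch(table_b, threshold=0.1)
sample_pairs = join_sketches(sketch_a, sketch_b)
estimate = plugin_mi(sample_pairs)
print(len(sample_pairs), estimate)
```

Because both sketches apply the same hash to the join key, a key kept in one table is also kept in the other, so the sample join stays non-empty even at low sampling rates. Independent row sampling of each table would lose most matching pairs. The plug-in estimator is used here only for brevity; it is biased on small samples.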