Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments (2312.06153v2)

Published 11 Dec 2023 in cs.LG, cs.AI, and cs.HC

Abstract: This paper introduces a no-code, machine-readable documentation framework for open datasets, with a focus on responsible AI (RAI) considerations. The framework aims to improve the comprehensibility and usability of open datasets, making them easier to discover and use, clarifying their content and context, and supporting evaluation of dataset quality and accuracy. It is designed to streamline dataset evaluation, helping researchers, data scientists, and other open data users quickly identify datasets that meet their needs and comply with organizational policies or regulations. The paper also discusses the implementation of the framework and provides recommendations to maximize its potential. The framework is expected to enhance the quality and reliability of data used in research and decision-making, fostering the development of more responsible and trustworthy AI systems.
