Want to read any of the papers below but can’t access the online publication? Just drop me a line and I’ll send you a preprint 🙂
2023
Conference
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
DataDoc Analyzer: A Tool for Analyzing the Documentation of Scientific Datasets
In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), Association for Computing Machinery, Birmingham, United Kingdom, 2023, ISBN: 9798400701245. Open Access.
Tags: Datasets, explainability, Fairness, large language models, Machine learning, reverse engineering
Abstract: Recent public regulatory initiatives and relevant voices in the ML community have identified the need to document datasets along several dimensions to ensure the fairness and trustworthiness of machine learning systems. At the same time, data-sharing practices in the scientific field have evolved quickly in recent years, with more and more research works publishing technical documentation together with the data for replicability purposes. However, this documentation is written in natural language, and its structure, content focus, and composition vary, making it challenging to analyze. We present DataDoc Analyzer, a tool for analyzing the documentation of scientific datasets by extracting the details of the main dimensions required to assess fairness and potential biases. We believe our tool could help improve the quality of scientific datasets, aid dataset curators during the documentation process, and serve as a helpful instrument for empirical studies on the overall quality of the datasets used in the ML field. The tool implements an ML pipeline that uses Large Language Models at its core for information retrieval. DataDoc is open source, and a public demo is published online.
Journal Article
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
A domain-specific language for describing machine learning datasets
In: Journal of Computer Languages, vol. 76, 101209, 2023, ISSN: 2590-1184. Open Access.
Tags: Datasets, Domain-specific languages, Fairness, Machine learning, MDE
Abstract: Datasets are essential for training and evaluating machine learning (ML) models. However, they are also at the root of many undesirable model behaviors, such as biased predictions. To address this issue, the machine learning community is proposing a data-centric cultural shift, where data issues are given the attention they deserve and more standard practices for gathering and describing datasets are discussed and established. So far, these proposals are mostly high-level guidelines described in natural language and, as such, they are difficult to formalize and apply to particular datasets. Inspired by these proposals, we define a new domain-specific language (DSL) to precisely describe machine learning datasets in terms of their structure, provenance, and social concerns. We believe this DSL will help any ML initiative leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The DSL is implemented as a Visual Studio Code plugin and has been published under an open-source license.