2023
Conference: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot: DataDoc Analyzer: A Tool for Analyzing the Documentation of Scientific Datasets. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), pp. 5046-5050, Association for Computing Machinery, Birmingham, United Kingdom, 2023, ISBN: 9798400701245.

@conference{Giner-Miguelez:CIKM:2023,
title = {DataDoc Analyzer: A Tool for Analyzing the Documentation of Scientific Datasets},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
doi = {10.1145/3583780.3614737},
isbn = {9798400701245},
year = {2023},
date = {2023-10-01},
booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
pages = {5046\textendash5050},
publisher = {Association for Computing Machinery},
address = {Birmingham, United Kingdom},
series = {CIKM '23},
abstract = {Recent public regulatory initiatives and relevant voices in the ML community have identified the need to document datasets according to several dimensions to ensure the fairness and trustworthiness of machine learning systems. In this sense, the data-sharing practices in the scientific field have been quickly evolving in the last years, with more and more research works publishing technical documentation together with the data for replicability purposes. However, this documentation is written in natural language, and its structure, content focus, and composition vary, making them challenging to analyze. We present DataDoc Analyzer, a tool for analyzing the documentation of scientific datasets by extracting the details of the main dimensions required to analyze the fairness and potential biases. We believe that our tool could help improve the quality of scientific datasets, aid dataset curators during its documentation process, and be a helpful tool for empirical studies on the overall quality of the datasets used in the ML field. The tool implements an ML pipeline that uses Large Language Models at its core for information retrieval. DataDoc is open-source, and a public demo is published online.},
keywords = {Datasets, explainability, Fairness, large language models, Machine learning, reverse engineering},
pubstate = {published},
tppubtype = {conference}
}
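The abstract above describes an ML pipeline with Large Language Models at its core for information retrieval over free-text dataset documentation. As a minimal sketch of that general pattern (not the tool's actual pipeline), the Python fragment below queries an LLM once per documentation dimension; the dimension list, the prompt wording, and the complete callable are all illustrative assumptions.

# Minimal sketch of an LLM-based extraction pass over dataset
# documentation, in the spirit of DataDoc Analyzer. NOT the tool's
# actual pipeline: dimensions, prompt, and the `complete` callable
# (any text-completion backend) are illustrative assumptions.
from typing import Callable

DIMENSIONS = [
    "data collection process",
    "annotator demographics",
    "social concerns and potential biases",
]

def extract_dimensions(doc_text: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Ask the LLM about each dimension and collect its answers."""
    results: dict[str, str] = {}
    for dim in DIMENSIONS:
        prompt = (
            f"From the dataset documentation below, summarize what it says "
            f"about the {dim}. Answer 'not reported' if it says nothing.\n\n"
            f"{doc_text}"
        )
        results[dim] = complete(prompt).strip()
    return results

Because any text-completion backend can be passed in as complete, the sketch stays independent of a particular LLM provider.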
Journal Article: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot: DescribeML: A dataset description tool for machine learning. In: Science of Computer Programming, vol. 231, pp. 103030, 2023, ISSN: 0167-6423.

@article{Giner-Miguelez:SCICO:2024,
title = {DescribeML: A dataset description tool for machine learning},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
doi = {10.1016/j.scico.2023.103030},
issn = {0167-6423},
year = {2023},
date = {2023-09-12},
urldate = {2024-01-01},
journal = {Science of Computer Programming},
volume = {231},
pages = {103030},
publisher = {Elsevier BV},
abstract = {Datasets are essential for training and evaluating machine learning models. However, they are also the root cause of many undesirable model behaviors, such as biased predictions. To address this issue, the machine learning community is proposing as a best practice the adoption of common guidelines for describing datasets. However, these guidelines are based on natural language descriptions of the dataset, hampering the automatic computation and analysis of such descriptions. To overcome this situation, we present DescribeML, a language engineering tool to precisely describe machine learning datasets in terms of their composition, provenance, and social concerns in a structured format. The tool is implemented as a Visual Studio Code extension.},
keywords = {Datasets, Domain-Specific Languages (DSLs), Fairness, Machine Learning (ML), Model-Driven Engineering (MDE), Software},
pubstate = {published},
tppubtype = {article}
}
Journal Article: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot: A domain-specific language for describing machine learning datasets. In: Journal of Computer Languages, vol. 76, pp. 101209, 2023, ISSN: 2590-1184.

@article{Giner-Miguelez:COLA:2023,
title = {A domain-specific language for describing machine learning datasets},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
doi = {10.1016/j.cola.2023.101209},
issn = {2590-1184},
year = {2023},
date = {2023-08-01},
urldate = {2023-01-01},
journal = {Journal of Computer Languages},
volume = {76},
pages = {101209},
abstract = {Datasets are essential for training and evaluating machine learning (ML) models. However, they are also at the root of many undesirable model behaviors, such as biased predictions. To address this issue, the machine learning community is proposing a data-centric cultural shift, where data issues are given the attention they deserve and more standard practices for gathering and describing datasets are discussed and established. So far, these proposals are mostly high-level guidelines described in natural language and, as such, they are difficult to formalize and apply to particular datasets. In this sense, and inspired by these proposals, we define a new domain-specific language (DSL) to precisely describe machine learning datasets in terms of their structure, provenance, and social concerns. We believe this DSL will facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The DSL is implemented as a Visual Studio Code plugin, and it has been published under an open-source license.},
keywords = {Datasets, Domain-specific languages, Fairness, Machine learning, MDE},
pubstate = {published},
tppubtype = {article}
}
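Both DescribeML and the DSL presented in this paper capture a dataset's composition, provenance, and social concerns in a structured, machine-analyzable format. The fragment below is only a rough Python analogue of that idea, not the DSL itself (which is a textual language shipped as a Visual Studio Code plugin); every field name here is an assumption made for illustration.

# Rough Python analogue of the structured dataset description the DSL
# papers above describe. The real artifact is a textual DSL; all field
# names are illustrative assumptions, not the DSL's grammar.
from dataclasses import dataclass, field

@dataclass
class Provenance:
    gathering_process: str              # e.g. "web scraping", "field survey"
    annotation_process: str             # how and by whom labels were produced

@dataclass
class SocialConcerns:
    sensitive_attributes: list[str] = field(default_factory=list)
    known_biases: list[str] = field(default_factory=list)

@dataclass
class DatasetDescription:
    name: str
    composition: dict[str, str]         # attribute name -> type/description
    provenance: Provenance
    social_concerns: SocialConcerns

Encoding these dimensions as typed structure rather than free text is what makes the automatic computation and comparison mentioned in the abstracts possible.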
2022
Conference: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot: DescribeML: A Tool for Describing Machine Learning Datasets. In: Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings (MODELS '22), pp. 22-26, Association for Computing Machinery, Montreal, Quebec, Canada, 2022, ISBN: 9781450394673.

@conference{Giner-Miguelez:MODELS:2022,
title = {DescribeML: A Tool for Describing Machine Learning Datasets},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
doi = {10.1145/3550356.3559087},
isbn = {9781450394673},
year = {2022},
date = {2022-11-09},
urldate = {2022-01-01},
booktitle = {Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings},
pages = {22\textendash26},
publisher = {Association for Computing Machinery},
address = {Montreal, Quebec, Canada},
series = {MODELS '22},
abstract = {Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift, where data issues are given the attention they deserve, for instance, proposing standard descriptions for datasets. In this sense, and inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and published as open-source.},
keywords = {Datasets, DescribeML, Domain-Specific Languages (DSLs), Fairness, Model-Driven Engineering (MDE)},
pubstate = {published},
tppubtype = {conference}
}
Conference: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot: Enabling Content Management Systems as an Information Source in Model-Driven Projects. In: Research Challenges in Information Science (RCIS 2022), Lecture Notes in Business Information Processing, pp. 513-528, Springer International Publishing, Cham, 2022, ISBN: 978-3-031-05760-1.

@conference{Giner-Miguelez:RCIS:2022,
title = {Enabling Content Management Systems as an Information Source in Model-Driven Projects},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
editor = {Renata Guizzardi and Jolita Ralyt\'{e} and Xavier Franch},
doi = {10.1007/978-3-031-05760-1_30},
isbn = {978-3-031-05760-1},
year = {2022},
date = {2022-05-11},
urldate = {2022-05-11},
booktitle = {Research Challenges in Information Science. RCIS 2022.},
pages = {513--528},
publisher = {Springer International Publishing},
address = {Cham},
series = {Lecture Notes in Business Information Processing},
abstract = {Content Management Systems (CMSs) are the most popular tool when it comes to creating and publishing content across the web. Recently, CMSs have evolved, becoming headless. Content served by a headless CMS aims to be consumed by other applications and services through REST APIs rather than by human users through a web browser. This evolution has enabled CMSs to become a notorious source of content to be used in a variety of contexts beyond pure web navigation. As such, CMSs have become an important component of many information systems. Unfortunately, we still lack the tools to properly discover and manage the information stored in a CMS, often highly customized to the needs of a specific domain. Currently, this is mostly a time-consuming and error-prone manual process.},
keywords = {Datasets, Domain-Specific Languages (DSLs), Machine Learning (ML), MLOPs},
pubstate = {published},
tppubtype = {conference}
}
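The paper targets headless CMSs whose content is consumed through REST APIs. The sketch below shows, under assumed endpoint paths and response shapes, how a client might pull entries from such an API and crudely discover the fields a content type uses; it is not the model-driven discovery approach the paper itself proposes.

# Minimal sketch of consuming a headless CMS via REST. The endpoint
# path and JSON envelope are hypothetical (real CMSs such as Strapi or
# Contentful differ); this is illustration, not the paper's method.
import requests

def list_entries(base_url: str, content_type: str) -> list[dict]:
    """Fetch all entries of one content type from an assumed endpoint."""
    resp = requests.get(f"{base_url}/api/{content_type}", timeout=10)
    resp.raise_for_status()
    return resp.json()["items"]         # assumed response envelope

def discover_fields(entries: list[dict]) -> set[str]:
    """Union of the field names the entries actually use."""
    fields: set[str] = set()
    for entry in entries:
        fields.update(entry.keys())
    return fields

Inspecting the union of field names across entries is exactly the kind of manual, error-prone schema discovery the abstract says current practice relies on.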
Conference: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot: Un lenguaje para definir datasets para machine learning. In: Actas de las XXVI Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2022), SISTEDES, 2022.

@conference{Giner-Miguelez:JISBD:2022,
title = {Un lenguaje para definir datasets para machine learning},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
editor = {A. Go\~{n}i Sarriguren},
url = {http://hdl.handle.net/11705/JISBD/2022/4368},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Actas de las XXVI Jornadas de Ingenier\'{i}a del Software y Bases de Datos (JISBD 2022)},
publisher = {SISTEDES},
abstract = {Recientes estudios han reportado efectos indeseados y nocivos en modelos de machine learning (ML), en gran parte causados por problemas o limitaciones en los datasets usados para entrenarlos. Esta situaci\'{o}n ha despertado el inter\'{e}s dentro de la comunidad de ML para mejorar los procesos de creaci\'{o}n y compartici\'{o}n de datasets. Sin embargo, hasta la fecha, las propuestas para estandarizar la descripci\'{o}n y formalizaci\'{o}n de los mismos se basan en gu\'{i}as generales en texto natural y que, como tales, presentan limitaciones (precisi\'{o}n, ambig\"{u}edad, etc.) y son dif\'{i}ciles de aplicar de una forma (semi)automatizada. En este trabajo proponemos un lenguaje espec\'{i}fico de dominio para describir datasets basado en las propuestas mencionadas. Este lenguaje contribuye a estandarizar los procesos de descripci\'{o}n de los datasets, y pretende ser la base para aplicaciones de formalizaci\'{o}n, b\'{u}squeda y comparaci\'{o}n de estos. Finalmente, presentamos la implementaci\'{o}n de este lenguaje en forma de plug-in para Visual Studio Code.},
keywords = {Datasets, Domain-Specific Languages (DSLs), Machine Learning (ML), MLOPs},
pubstate = {published},
tppubtype = {conference}
}
Recent studies have reported undesired and harmful effects in machine learning (ML) models, largely caused by problems or limitations in the datasets used to train them. This situation has sparked interest within the ML community in improving the processes for creating and sharing datasets. However, to date, the proposals for standardizing their description and formalization are based on general guidelines in natural language and, as such, have limitations (precision, ambiguity, etc.) and are difficult to apply in a (semi-)automated way. In this work, we propose a domain-specific language for describing datasets based on the aforementioned proposals. This language contributes to standardizing dataset description processes and aims to serve as the basis for applications that formalize, search, and compare them. Finally, we present the implementation of this language as a plug-in for Visual Studio Code.