2023
Journal Article: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot. DescribeML: A dataset description tool for machine learning. In: Science of Computer Programming, vol. 231, pp. 103030, 2023, ISSN: 0167-6423. Tags: Datasets, Domain-Specific Languages (DSLs), Fairness, Machine Learning (ML), Model-Driven Engineering (MDE), Software
@article{Giner-Miguelez:SCICO:2024,
title = {DescribeML: A dataset description tool for machine learning},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
doi = {10.1016/j.scico.2023.103030},
issn = {0167-6423},
year = {2023},
date = {2023-09-12},
urldate = {2024-01-01},
journal = {Science of Computer Programming},
volume = {231},
pages = {103030},
publisher = {Elsevier BV},
abstract = {Datasets are essential for training and evaluating machine learning models. However, they are also the root cause of many undesirable model behaviors, such as biased predictions. To address this issue, the machine learning community is proposing as a best practice the adoption of common guidelines for describing datasets. However, these guidelines are based on natural language descriptions of the dataset, hampering the automatic computation and analysis of such descriptions. To overcome this situation, we present DescribeML, a language engineering tool to precisely describe machine learning datasets in terms of their composition, provenance, and social concerns in a structured format. The tool is implemented as a Visual Studio Code extension.},
keywords = {Datasets, Domain-Specific Languages (DSLs), Fairness, Machine Learning (ML), Model-Driven Engineering (MDE), Software},
pubstate = {published},
tppubtype = {article}
}
Full Text Available | Open Access
2022
Conference: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot. Enabling Content Management Systems as an Information Source in Model-Driven Projects. Research Challenges in Information Science. RCIS 2022. Lecture Notes in Business Information Processing, Springer International Publishing, Cham, 2022, ISBN: 978-3-031-05760-1. Tags: Datasets, Domain-Specific Languages (DSLs), Machine Learning (ML), MLOPs
@conference{Giner-Miguelez:RCIS:2022,
title = {Enabling Content Management Systems as an Information Source in Model-Driven Projects},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
editor = {Renata Guizzardi and Jolita Ralyt\'{e} and Xavier Franch},
doi = {10.1007/978-3-031-05760-1_30},
isbn = {978-3-031-05760-1},
year = {2022},
date = {2022-05-11},
urldate = {2022-05-11},
booktitle = {Research Challenges in Information Science. RCIS 2022.},
pages = {513--528},
publisher = {Springer International Publishing},
address = {Cham},
series = {Lecture Notes in Business Information Processing},
abstract = {Content Management Systems (CMSs) are the most popular tools for creating and publishing content across the web. Recently, CMSs have evolved, becoming headless. Content served by a headless CMS is meant to be consumed by other applications and services through REST APIs rather than by human users through a web browser. This evolution has enabled CMSs to become a notable source of content for use in a variety of contexts beyond pure web navigation. As such, CMSs have become an important component of many information systems. Unfortunately, we still lack the tools to properly discover and manage the information stored in a CMS, which is often highly customized to the needs of a specific domain. Currently, this is mostly a time-consuming and error-prone manual process.},
keywords = {Datasets, Domain-Specific Languages (DSLs), Machine Learning (ML), MLOPs},
pubstate = {published},
tppubtype = {conference}
}
Conference: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot. Un lenguaje para definir datasets para machine learning. Actas de las XXVI Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2022), SISTEDES, 2022. Tags: Datasets, Domain-Specific Languages (DSLs), Machine Learning (ML), MLOPs
@conference{Giner-Miguelez:JISBD:2022,
title = {Un lenguaje para definir datasets para machine learning},
author = {Joan Giner-Miguelez and Abel G\'{o}mez and Jordi Cabot},
editor = {A. Go\~{n}i Sarriguren},
url = {http://hdl.handle.net/11705/JISBD/2022/4368},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Actas de las XXVI Jornadas de Ingenier\'{i}a del Software y Bases de Datos (JISBD 2022)},
publisher = {SISTEDES},
abstract = {Recent studies have reported undesired and harmful effects in machine learning (ML) models, largely caused by problems or limitations in the datasets used to train them. This situation has sparked interest within the ML community in improving the processes for creating and sharing datasets. However, to date, proposals for standardizing the description and formalization of datasets are based on general guidelines written in natural language, which as such have limitations (precision, ambiguity, etc.) and are difficult to apply in a (semi)automated way. In this work we propose a domain-specific language for describing datasets, based on the aforementioned proposals. This language helps standardize dataset description processes and is intended to serve as the basis for applications that formalize, search, and compare datasets. Finally, we present the implementation of this language as a plug-in for Visual Studio Code.},
keywords = {Datasets, Domain-Specific Languages (DSLs), Machine Learning (ML), MLOPs},
pubstate = {published},
tppubtype = {conference}
}
Full Text Available | Open Access | Spanish