LabChain: Enabling reproducible and modular scientific experiments in Python

UDC.coleccion: Investigación
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.endPage: 10
UDC.grupoInv: Information Retrieval Lab (IRlab)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.issue: 102543
UDC.journalTitle: SoftwareX
UDC.startPage: 1
UDC.volume: 33
dc.contributor.author: Couto Pintos, Manuel
dc.contributor.author: Parapar, Javier
dc.contributor.author: Losada, David E.
dc.date.accessioned: 2026-02-16T10:55:32Z
dc.date.available: 2026-02-16T10:55:32Z
dc.date.issued: 2026-02
dc.description: Data availability: The LabChain framework is publicly available at https://github.com/manucouto1/LabChain. The reference implementation for this article is v1.2.1. The mental health detection case study uses publicly available datasets from the eRisk shared tasks: Depression (2017, 2018, 2022), Anorexia (2018, 2019), Self-harm (2020, 2021), and Gambling (2022, 2023). These datasets can be requested from the eRisk organizers at https://erisk.irlab.org/. The complete implementation of the case study, including all pipeline configurations and preprocessing code, is available at https://github.com/manucouto1/Temporal-Word-Embeddings-for-Early-Detection.... No new data were generated or analyzed in support of this research.
dc.description.abstract: [Abstract]: Python’s flexibility accelerates research prototyping but frequently results in unmaintainable code and duplicated computational effort. The absence of software engineering practices in academic development leads to fragile experiments where even minor modifications require rerunning expensive computations from scratch. LabChain addresses this through a pipeline-and-filter architecture with hash-based caching that automatically identifies and reuses intermediate results. When evaluating multiple classifiers on the same embeddings, the framework computes embeddings once, regardless of how many classifiers are tested. This automatic reuse extends across research teams: if another researcher applies different models to the same preprocessed data, LabChain detects existing results and eliminates redundant computation. Beyond efficiency, the framework’s modular structure reduces technical debt that obscures experimental logic. Pipelines serialize to JSON for reproducibility and distributed execution across computational clusters. A mental health detection case study demonstrates dual impact: computational savings exceeding 12 hours per task with reduced CO2 emissions, alongside substantial scientific improvements, with performance gains up to 192.3% in some tasks. These improvements emerged from clearer experimental organization that exposed a critical preprocessing bug hidden in the original monolithic implementation. LabChain proves that software engineering discipline amplifies scientific discovery.
dc.description.sponsorship: MC and DEL thank the financial support provided by MICIU/AEI/10.13039/501100011033 (PID2022-137061OB-C22, supported by ERDF) and Xunta de Galicia-Consellería de Cultura, Educación, Formación Profesional e Universidades (ED431G 2023/04, ED431C 2022/19, supported by ERDF). JP has received support from project PID2022-137061OB-C21 (MCIU/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación). He also thanks the financial support provided by the Consellería de Educación, Universidade e Formación Profesional, Spain (grant number ED481A-2024–079 and GRC ED431C 2025/49); and the European Regional Development Fund, which supports the CITIC Research Center.
dc.description.sponsorship: Xunta de Galicia; ED481A-2024–079
dc.description.sponsorship: Xunta de Galicia; ED431C 2025/49
dc.description.sponsorship: Xunta de Galicia; ED431G 2023/04
dc.description.sponsorship: Xunta de Galicia; ED431C 2022/19
dc.description.uri: https://github.com/manucouto1/LabChain
dc.description.uri: https://erisk.irlab.org/
dc.description.uri: https://github.com/manucouto1/Temporal-Word-Embeddings-for-Early-Detection-of-Psychological-Disorders-on-Social-Media
dc.identifier.citation: Couto, M., Parapar, J., & Losada, D. E. (2026). LabChain: Enabling reproducible and modular scientific experiments in Python. SoftwareX, 33(102543). https://doi.org/10.1016/j.softx.2026.102543
dc.identifier.doi: 10.1016/j.softx.2026.102543
dc.identifier.issn: 2352-7110
dc.identifier.uri: https://hdl.handle.net/2183/47431
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C22/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD: BUSQUEDA Y DETECCION DE DESINFORMACION
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.uri: https://doi.org/10.1016/j.softx.2026.102543
dc.rights: Attribution-NonCommercial 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.subject: Scientific workflows
dc.subject: Pipeline architecture
dc.subject: Hash-based caching
dc.subject: Reproducible research
dc.subject: Software engineering practices
dc.title: LabChain: Enabling reproducible and modular scientific experiments in Python
dc.type: journal article
dc.type.hasVersion: VoR
dspace.entity.type: Publication
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication.latestForDiscovery: fef1a9cb-e346-4e53-9811-192e144f09d0

Files

Original bundle

Name: Parapar_Javier_2026_LabChain.pdf
Size: 2.52 MB
Format: Adobe Portable Document Format