LabChain: Enabling reproducible and modular scientific experiments in Python

Couto Pintos, Manuel; Parapar, Javier; Losada, David E.

UDC.coleccion	Investigación
UDC.departamento	Ciencias da Computación e Tecnoloxías da Información
UDC.endPage	10
UDC.grupoInv	Information Retrieval Lab (IRlab)
UDC.institutoCentro	CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.issue	102543
UDC.journalTitle	SoftwareX
UDC.startPage	1
UDC.volume	33
dc.contributor.author	Couto Pintos, Manuel
dc.contributor.author	Parapar, Javier
dc.contributor.author	Losada, David E.
dc.date.accessioned	2026-02-16T10:55:32Z
dc.date.available	2026-02-16T10:55:32Z
dc.date.issued	2026-02
dc.description	Data availability: The LabChain framework is publicly available at https://github.com/manucouto1/LabChain. The reference implementation for this article is v1.2.1. The mental health detection case study uses publicly available datasets from the eRisk shared tasks: Depression (2017, 2018, 2022), Anorexia (2018, 2019), Self-harm (2020, 2021), and Gambling (2022, 2023). These datasets can be requested from the eRisk organizers at https://erisk.irlab.org/. The complete implementation of the case study, including all pipeline configurations and preprocessing code, is available at https://github.com/manucouto1/Temporal-Word-Embeddings-for-Early-Detection.... No new data were generated or analyzed in support of this research.
dc.description.abstract	[Abstract]: Python’s flexibility accelerates research prototyping but frequently results in unmaintainable code and duplicated computational effort. The absence of software engineering practices in academic development leads to fragile experiments where even minor modifications require rerunning expensive computations from scratch. LabChain addresses this through a pipeline-and-filter architecture with hash-based caching that automatically identifies and reuses intermediate results. When evaluating multiple classifiers on the same embeddings, the framework computes embeddings once—regardless of how many classifiers are tested. This automatic reuse extends across research teams: if another researcher applies different models to the same preprocessed data, LabChain detects existing results and eliminates redundant computation. Beyond efficiency, the framework’s modular structure reduces technical debt that obscures experimental logic. Pipelines serialize to JSON for reproducibility and distributed execution across computational clusters. A mental health detection case study demonstrates dual impact: computational savings exceeding 12 hours per task with reduced CO2 emissions, alongside substantial scientific improvements—performance gains up to 192.3% in some tasks. These improvements emerged from clearer experimental organization that exposed a critical preprocessing bug hidden in the original monolithic implementation. LabChain proves that software engineering discipline amplifies scientific discovery.
dc.description.sponsorship	MC and DEL thank the financial support provided by MICIU/AEI/10.13039/501100011033 (PID2022-137061OB-C22, supported by ERDF) and Xunta de Galicia-Consellería de Cultura, Educación, Formación Profesional e Universidades (ED431G 2023/04, ED431C 2022/19, supported by ERDF). JP has received support from project PID2022-137061OB-C21 (MCIU/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación). He also thanks the financial support provided by the Consellería de Educación, Universidade e Formación Profesional, Spain (grant number ED481A-2024–079 and GRC ED431C 2025/49); and the European Regional Development Fund, which supports the CITIC Research Center.
dc.description.sponsorship	Xunta de Galicia; ED481A-2024–079
dc.description.sponsorship	Xunta de Galicia; ED431C 2025/49
dc.description.sponsorship	Xunta de Galicia; ED431G 2023/04
dc.description.sponsorship	Xunta de Galicia; ED431C 2022/19
dc.description.uri	https://github.com/manucouto1/LabChain
dc.description.uri	https://erisk.irlab.org/
dc.description.uri	https://github.com/manucouto1/Temporal-Word-Embeddings-for-Early-Detection-of-Psychological-Disorders-on-Social-Media
dc.identifier.citation	Couto, M., Parapar, J., & Losada, D. E. (2026). LabChain: Enabling reproducible and modular scientific experiments in Python. SoftwareX, 33(102543). https://doi.org/10.1016/j.softx.2026.102543
dc.identifier.doi	10.1016/j.softx.2026.102543
dc.identifier.issn	2352-7110
dc.identifier.uri	https://hdl.handle.net/2183/47431
dc.language.iso	eng
dc.publisher	Elsevier
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C22/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD: BUSQUEDA Y DETECCION DE DESINFORMACION
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.uri	https://doi.org/10.1016/j.softx.2026.102543
dc.rights	Attribution-NonCommercial 4.0 International	en
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/
dc.subject	Scientific workflows
dc.subject	Pipeline architecture
dc.subject	Hash-based caching
dc.subject	Reproducible research
dc.subject	Software engineering practices
dc.title	LabChain: Enabling reproducible and modular scientific experiments in Python
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication.latestForDiscovery	fef1a9cb-e346-4e53-9811-192e144f09d0

Files

Original bundle

Name: Parapar_Javier_2026_LabChain.pdf
Size: 2.52 MB
Format: Adobe Portable Document Format

Collections

Investigación (FIC)