
dc.contributor.author: Garralda-Barrio, Mariano
dc.contributor.author: Eiras-Franco, Carlos
dc.contributor.author: Bolón-Canedo, Verónica
dc.date.accessioned: 2024-04-22T09:25:00Z
dc.date.available: 2024-04-22T09:25:00Z
dc.date.issued: 2024-07
dc.identifier.citation: M. Garralda-Barrio, C. Eiras-Franco, and V. Bolón-Canedo, "A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning", Journal of Parallel and Distributed Computing, Vol. 189, 104881, Jul. 2024, doi: 10.1016/j.jpdc.2024.104881 [es_ES]
dc.identifier.uri: http://hdl.handle.net/2183/36284
dc.description.abstract: [Abstract]: Comprehensive workload characterization plays a pivotal role in comprehending Spark applications, as it enables the analysis of diverse aspects and behaviors. This understanding is indispensable for devising downstream tuning objectives, such as performance improvement. To address this pivotal issue, our work introduces a novel and scalable framework for generic Spark workload characterization, complemented by consistent geometric measurements. The presented approach aims to build robust workload descriptors by profiling only quantitative metrics at the application task-level, in a non-intrusive manner. We expand our framework for downstream workload pattern recognition by incorporating unsupervised machine learning techniques: clustering algorithms and feature selection. These techniques significantly improve the process of grouping similar workloads without relying on predefined labels. We effectively recognize 24 representative Spark workloads from diverse domains, including SQL, machine learning, web search, graph, and micro-benchmarks, available in HiBench. Our framework achieves a high accuracy F-Measure score of up to 90.9% and a Normalized Mutual Information of up to 94.5% in similar workload pattern recognition. These scores significantly outperform the results obtained in a comparative analysis with an established workload characterization approach in the literature. [es_ES]
dc.description.sponsorship: This work was supported by CITIC, as Research Center accredited by Galician University System, which is funded by “Consellería de Cultura, Educación e Universidade from Xunta de Galicia”, supported in an 80% through ERDF Funds, ERDF Operational Programme Galicia 2014-2020, and the remaining 20% by “Secretaría Xeral de Universidades” (Grant ED431G 2019/01). It was also partially funded by Xunta de Galicia/FEDER-UE under Grant ED431C 2022/44; Ministerio de Ciencia e Innovación MCIN/AEI/10.13039/501100011033 under Grant PID2019-109238 GB-C22. [es_ES]
dc.description.sponsorship: Xunta de Galicia; ED431G 2019/01 [es_ES]
dc.description.sponsorship: Xunta de Galicia; ED431C 2022/44 [es_ES]
dc.language.iso: eng [es_ES]
dc.publisher: Elsevier [es_ES]
dc.relation: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-109238GB-C22/ES/APRENDIZAJE AUTOMATICO ESCALABLE Y EXPLICABLE [es_ES]
dc.relation.uri: https://doi.org/10.1016/j.jpdc.2024.104881 [es_ES]
dc.rights: Attribution-NonCommercial 3.0 Spain [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/es/ [*]
dc.subject: Apache Spark [es_ES]
dc.subject: Big data [es_ES]
dc.subject: Machine learning [es_ES]
dc.subject: Pattern recognition [es_ES]
dc.subject: Workload characterization [es_ES]
dc.title: A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning [es_ES]
dc.type: info:eu-repo/semantics/article [es_ES]
dc.rights.access: info:eu-repo/semantics/openAccess [es_ES]
UDC.journalTitle: Journal of Parallel and Distributed Computing [es_ES]
UDC.volume: 189 [es_ES]
UDC.issue: 104881 [es_ES]
dc.identifier.doi: 10.1016/j.jpdc.2024.104881

