A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs

Enes, Jonatan; Expósito, Roberto R.; Fuentes Rodríguez, Jose; López Cacheiro, Javier; Touriño, Juan

dc.contributor.author	Enes, Jonatan
dc.contributor.author	Expósito, Roberto R.
dc.contributor.author	Fuentes Rodríguez, Jose
dc.contributor.author	López Cacheiro, Javier
dc.contributor.author	Touriño, Juan
dc.date.accessioned	2023-03-27T15:25:31Z
dc.date.available	2023-03-27T15:25:31Z
dc.date.issued	2023-05
dc.identifier.citation	Enes, J., Expósito, R. R., Fuentes, J., Cacheiro, J. L., & Touriño, J. (2023). A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs. Information Fusion, 93, 1-20. 10.1016/j.inffus.2022.12.017	es_ES
dc.identifier.issn	1566-2535
dc.identifier.uri	http://hdl.handle.net/2183/32787
dc.description.abstract	[Abstract]: Time series are key across industrial and research areas for their ability to model behaviour across time, making them ideal for a wide range of use cases such as event monitoring, trend prediction or anomaly detection. This is even more so given the increasing monitoring capabilities in many areas and the massive data generation that follows. It is also interesting to consider the potential of time series for Machine Learning processing, often fused with Big Data, to search for useful information and solve real-world problems. Time series can be studied individually, representing a single entity or variable to be analysed, or in a grouped fashion, to study and represent a more complex entity or scenario. In the latter case we are dealing with multivariate time series, which usually call for different approaches. In this paper, we present a pipeline architecture to process and cluster multiple groups of multivariate time series. To implement this, we apply a multi-process solution composed of a feature-based extraction stage, followed by a dimension reduction stage and, finally, several clustering algorithms. The pipeline is also highly configurable in terms of the techniques used at each stage, allowing a search over several combinations for the most promising results. The pipeline has been experimentally applied to batches of HPC jobs from different users of a supercomputer, with the multivariate time series coming from the monitoring of several node resource metrics. The results show how it is possible to apply this multi-process information fusion to create different meaningful clusters from the batches, using only the time series, without any labelling information, thus being an unsupervised scenario. Optionally, the pipeline also supports an outlier detection stage to find and separate jobs that are radically different from the others in a dataset. These outliers can be removed for a better clustering, and later reviewed looking for anomalies, or if numerous, fed back to the pipeline to identify possible groupings. The results also include some outliers found in the experiments, as well as scenarios where they are clustered, or ignored and not removed at all. In addition, by leveraging Big Data technologies like Spark, the pipeline is proven to be scalable by working with up to hundreds of jobs and thousands of time series.	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431G 2019/01	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431C 2021/30	es_ES
dc.description.sponsorship	This research was funded by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00/AEI/10.13039/501100011033), and by Xunta de Galicia, Spain and FEDER funds of the European Union (Centro de Investigación de Galicia accreditation 2019–2022, ref. ED431G 2019/01; Consolidation Program of Competitive Reference Groups, ref. ED431C 2021/30). Funding for open access charge: Universidade da Coruña/CISUG.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Elsevier B.V.	es_ES
dc.relation	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-104184RB-I00/ES/DESAFIOS ACTUALES EN HPC: ARQUITECTURAS, SOFTWARE Y APLICACIONES	es_ES
dc.relation.uri	https://doi.org/10.1016/j.inffus.2022.12.017	es_ES
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Spain	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject	Unsupervised clustering	es_ES
dc.subject	Feature extraction	es_ES
dc.subject	Multivariate time series	es_ES
dc.subject	Anomaly detection	es_ES
dc.subject	HPC jobs	es_ES
dc.title	A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.journalTitle	Information Fusion	es_ES
UDC.volume	93	es_ES
UDC.issue	May	es_ES
UDC.startPage	1	es_ES
UDC.endPage	20	es_ES

Files in this item

Name: license_rdf
Size: 1.203Kb
Format: application/rdf+xml

Name: Enes_Exposito_fuentes_Cacheiro ...
Size: 5.654Mb
Format: PDF

This item appears in the following collection(s)

GI-GAC - Artigos [181]
