A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs

Enes, J., Expósito, R. R., Fuentes, J., Cacheiro, J. L., & Touriño, J. (2023). A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs. Information Fusion, 93, 1-20. 10.1016/j.inffus.2022.12.017

Abstract

[Abstract]: Time series are key across industrial and research areas for their ability to model behaviour across time, making them ideal for a wide range of use cases such as event monitoring, trend prediction or anomaly detection. This is even more so due to the increasing monitoring capabilities in many areas, with the subsequent massive data generation. But it is also interesting to consider the potential of time series for Machine Learning processing, often fused with Big Data, to search for useful information and solve real-world problems. However, time series can be studied individually, representing a single entity or variable to be analysed, or in a grouped fashion, to study and represent a more complex entity or scenario. In this latter case we are dealing with multivariate time series, which usually imply different approaches when dealt with. In this paper, we present a pipeline architecture to process and cluster multiple groups of multivariate time series. To implement this, we apply a multi-process solution composed by a feature-based extraction stage, followed by a dimension reduction, and finally, several clustering algorithms. The pipeline is also highly configurable in terms of the stage techniques to be used, allowing to perform a search with several combinations for the most promising results. The pipeline has been experimentally applied to batches of HPC jobs from different users of a supercomputer, with the multivariate time series coming from the monitoring of several node resource metrics. The results show how it is possible to apply this multi-process information fusion to create different meaningful clusters from the batches, using only the time series, without any labelling information, thus being an unsupervised scenario. Optionally, the pipeline also supports an outlier detection stage to find and separate jobs that are radically different when compared to others on a dataset. These outliers can be removed for a better clustering, and later reviewed looking for anomalies, or if numerous, fed back to the pipeline to identify possible groupings. The results also include some outliers found in the experiments, as well as scenarios where they are clustered, or ignored and not removed at all. In addition, by leveraging Big Data technologies like Spark, the pipeline is proven to be scalable by working with up to hundreds of jobs and thousands of time series.

Keywords

Unsupervised clustering
Feature extraction
Multivariate time series
Anomaly detection
HPC jobs

Editor version

https://doi.org/10.1016/j.inffus.2022.12.017

Rights

Atribución-NoComercial-SinDerivadas 3.0 España

ISSN

1566-2535