Parallelization of a graph-based multi-view machine learning algorithm

Mosteiro García, Jesús

Parallelization of a graph-based multi-view machine learning algorithm

UDC.coleccion	Traballos académicos	es_ES
UDC.tipotrab	TFG	es_ES
UDC.titulacion	Grao en Enxeñaría Informática	es_ES
dc.contributor.advisor	González-Domínguez, Jorge
dc.contributor.advisor	Touriño, Juan
dc.contributor.author	Mosteiro García, Jesús
dc.contributor.other	Universidade da Coruña. Facultade de Informática	es_ES
dc.date.accessioned	2023-11-03T18:29:13Z
dc.date.embargoEndDate	2024-05-03	es_ES
dc.date.embargoLift	2024-05-03
dc.date.issued	2023
dc.description.abstract	[Abstract]: In recent years, a perfect storm has unfolded in the field of machine learning. Increases in computational power, a surge of interest for analyzing big volumes of data, and novel developments in data processing techniques have all contributed to massive leaps both in research and practical applications. Moreover, the ever-increasing diversity in the nature of the data studied has boosted an emerging paradigm: multi-view learning. Multi-view algorithms are characterized by their superior capability to harvest knowledge from data obtained from multiple sources or “views”. This paradigm has shown impressive results in various contexts. This thesis focuses on the efforts to apply multi-view learning to unsupervised clustering problems, which can offer great insight into underlying patterns in data. In 2019, H. Wang, Y. Yang, and B. Liu presented a novel solution to the problem: Graph-Based Multi-View Clustering. It combined methods used in graph-based and spectral clustering, and presented a new data-fusion approach that outperformed state-of-the-art techniques. Despite its impressive results, the computational requirements to execute this algorithm over big data are prohibitive. The provided reference implementation works under the MATLAB interpreter and lacks scalability due to its high memory usage and cubic time complexity. This work aims to address those shortcomings, presenting an implementation built from the ground up in C language, and designed specifically to scale in high performance computing cluster environments. For this purpose, both shared-memory and distributed-memory parallelism techniques will be employed, via OpenMP and MPI, respectively. This highly optimized implementation reduces not only memory usage, but also the algorithm’s complexity to quadratic, which theoretical analysis suggests to be optimal. Standard, highly performant linear algebra libraries were used to accelerate execution. Furthermore, both the lowered memory requirements and the possibility to run the code in distributed-memory systems allow this method to be used for analysis of much bigger datasets, that would have been impossible to process before. Extensive experimental tests were performed over a suitable high performance computing cluster, using a total of 288 CPU cores and 2.25TiB of memory. The results are considered satisfactory. During the jump from MATLAB to C, performance increased by factors of up to 60 when testing with the small datasets that could be processed sequentially. On top of that, parallelization achieved additional speedups. During testing, speeds up to 60 times faster than the sequential C counterpart were observed, with the possibility of further scaling depending on the resources allocated.	es_ES
dc.description.abstract	[Resumo]: Nos últimos anos, unha tormenta perfecta tomou por sorpresa ao campo da aprendizaxe automática. Aumentos da potencia computacional, grande afluencia na interese por analizar grandes conxuntos de datos, e innovacións no procesado de datos contribuíron a grandes saltos tanto na investigación coma no uso práctico. Ademais, o constante incremento na diversidade dos datos estudados levou a un novo paradigma: a aprendizaxe multi-vista. Os algoritmos multi-vista caracterizanse pola súa capacidade de explotar o coñecemento dos datos estruturados dependendo da súa orixe ou “vista”. Este paradigma mostra resultados remarcables en contextos varios. Este traballo de fin de grao se centra nos esforzos de aplicar aprendizaxe multi-vista a problemas de agrupamento non supervisado, que ofrecen información interesante sobre patróns ocultos nos datos. En 2019, H. Wang, Y. Yang, e B. Liu presentaron unha solución innovadora ao problema: agrupamento multi-vista baseado en grafos. Este combina métodos empregados en agrupamento por grafos e espectral, e presenta un novo mecanismo de fusión de datos que supera os resultados obtidos no estado da arte. Pese a estes resultados, os requisitos computacionais para executar este algoritmo sobre grandes conxuntos de datos son prohibitivos. A implementación de referencia proporcionada funciona baixo o intérprete de MATLAB, e carece de escalabilidade debido ao seu elevado uso de memoria e complexidade computacional cúbica. Este traballo busca aliviar estes problemas, presentando unha implementación construída dende cero en linguaxe C, e deseñada especificamente para o seu uso en sistemas de computación de altas prestacións. Para isto, farase uso tanto de técnicas de paralelismo en memoria compartida coma distribuída, mediante OpenMP e MPI, respectivamente. Esta implementación, altamente optimizada, reduce o uso de memoria e a complexidade computacional do algoritmo. Esta será cuadrática, o cal é óptimo segundo a análise teórica. Empregáronse librerías estándar de álxebra lineal de alto rendemento para acelerar a execución. Adicionalmente, tanto os relaxados requisitos de memoria coma a posibilidade de executar o código en sistemas de memoria distribuída permiten que este método analice conxuntos de datos moito maiores, que non eran posibles de tratar con anterioridade. Leváronse a cabo numerosas probas de rendemento sobre un moderno clúster de computación de altas prestacións, usando un total de 288 núcleos de CPU e 2.25TiB de memoria. Os resultados considéranse satisfactorios. O salto dende MATLAB a C é responsable de aceleracións con tempos ata 60 veces menores en probas cos datos que os métodos secuenciais poden procesar. Aínda por riba, o proceso de paralelización obtivo aceleracións adicionais. Durante as probas, observáronse velocidades ata 60 veces máis rápidas cas da implementación en C secuencial, coa posibilidade de escalar máis aló dependendo dos recursos dispoñibles.	es_ES
dc.description.traballos	Traballo fin de grao (UDC.FIC). Enxeñaría Informática. Curso 2022/2023	es_ES
dc.identifier.uri	http://hdl.handle.net/2183/34030
dc.language.iso	eng	es_ES
dc.rights	Atribución 3.0 España	es_ES
dc.rights.accessRights	open access	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/
dc.subject	Multi-view clustering	es_ES
dc.subject	Graph-based clustering	es_ES
dc.subject	Machine learning	es_ES
dc.subject	High performance computing	es_ES
dc.subject	Parallel computing	es_ES
dc.subject	MPI	es_ES
dc.subject	OpenMP	es_ES
dc.subject	Big Data	es_ES
dc.subject	Agrupamento multi-vista	es_ES
dc.subject	Agrupamento baseado en grafos	es_ES
dc.subject	Aprendizaxe máquina	es_ES
dc.subject	Computación de altas prestacións	es_ES
dc.subject	Computación paralela	es_ES
dc.title	Parallelization of a graph-based multi-view machine learning algorithm	es_ES
dc.type	bachelor thesis
dspace.entity.type	Publication
relation.isAdvisorOfPublication	84d13059-7f4b-4cb5-ac65-0e07a77271f0
relation.isAdvisorOfPublication	86e306a5-99a1-4c43-8faa-720f0a9f0a34
relation.isAdvisorOfPublication.latestForDiscovery	84d13059-7f4b-4cb5-ac65-0e07a77271f0

Files

Original bundle

Now showing 1 - 1 of 1

Name:: MosteiroGarcia_Jesus_TFG_2023.pdf
Size:: 3.46 MB
Format:: Adobe Portable Document Format
Description:

Download

Collections

Traballos académicos (FIC)