Soft-fault recovery in MPI applications

Fernández Rey, David

dc.contributor.advisor	Martín Santamaría, María José
dc.contributor.advisor	González Gómez, Patricia
dc.contributor.author	Fernández Rey, David
dc.contributor.other	Enxeñaría informática, Grao en	es_ES
dc.date.accessioned	2020-11-12T17:00:50Z
dc.date.available	2020-11-12T17:00:50Z
dc.date.issued	2020-09
dc.identifier.uri	http://hdl.handle.net/2183/26687
dc.description.abstract	[Abstract] Current high-performance computing (HPC) systems comprise thousands of CPU cores, and this number is expected to grow into the millions in the near future. With so many processors, the mean time between failures (MTBF) can become so small that most scientific applications will not have time to complete their execution before a failure occurs. It is therefore critical to develop fault tolerance and resilience mechanisms to guarantee the completion and integrity of massively parallel applications. The Computer Architecture Group (GAC) of the University of A Coruña proposed a solution, CPPC (Controller/comPiler for Portable Checkpointing), that transparently converts generic MPI applications into fault-tolerant applications using a checkpoint-restart scheme. Nuria Losada extended CPPC into CPPC-resilience to produce resilient MPI applications, that is, applications capable of detecting and reacting to failures without aborting, so that survivor processes do not have to be restarted. This was accomplished by means of a message-logging protocol and User Level Failure Mitigation (ULFM), a fault tolerance interface proposed as an addition to the MPI standard. However, this system cannot handle soft errors efficiently: it kills and respawns the failed processes entirely even though this is unnecessary, because these errors are transient in nature. The goal of this project is to extend and adapt CPPC-resilience to handle soft errors more efficiently, without respawning the failed processes. The proposal has been evaluated using three MPI applications with different characteristics, achieving a reduction in recovery times after a soft error ranging from 2 to 44 percent, depending on the total number of processes involved.	es_ES
dc.description.abstract	[Resumo] Os sistemas actuais de computación de altas prestacións (HPC) están formados por miles de núcleos de procesadores, e espérase que este número aumente ata os millóns nun futuro cercano. Cun número tan elevado de procesadores, o tempo medio entre fallos (MTBF) pode chegar a reducirse tanto que a maioría de computacións científicas non terían tempo de completar a súa execución antes de que ocorrese un fallo. Polo tanto, é crítico o desenvolvemento de sistemas tolerantes e resilientes a fallos, para garantizar a finalización e integridade das aplicacións masivamente paralelas. O Grupo de Arquitectura de Computadores (GAC) da UDC propuxo unha solución (Controller/comPiler for Portable Checkpointing - CPPC) para convertir de xeito transparente aplicacións MPI xenéricas en aplicacións tolerantes a fallos, basándose nun esquema de checkpointing e reinicio. CPPC foi posteriormente extendido por Nuria Losada en CPPC-resilience coa finalidade de crear aplicacións MPI resilientes, é dicir, aquelas que son capaces de detectar e reaccionar a fallos sen abortar a aplicación, de xeito que os procesos superviventes non necesitan ser reiniciados. Isto logrouse mediante un protocolo de logging de mensaxes e o uso dunha interfaz de tolerancia a fallos, ULFM (User Level Failure Mitigation), proposta para adición ao estándar MPI. Sen embargo, este sistema non xestiona os errores soft de maneira eficiente, xa que mata e reinicia os procesos fallados por completo cando non é necesario, xa que este tipo de errores teñen natureza transitoria. A meta deste TFG é extender e adaptar CPPC-resilience para poder manexar os errores soft eficientemente, sen ter que reiniciar os procesos fallados. Esta proposta foi evaluada utilizando 3 aplicacións MPI con diferentes características, conseguindo unha redución nos tempos de recuperación tras un erro soft de entre un 2 e un 44 por cento, dependendo do número total de procesos involucrados.	es_ES
dc.language.iso	eng	es_ES
dc.rights	Atribución-NoComercial 3.0 España	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/es/	*
dc.subject	High-performance computing	es_ES
dc.subject	MPI	es_ES
dc.subject	ULFM	es_ES
dc.subject	Fault tolerance	es_ES
dc.subject	Parallelism	es_ES
dc.subject	CPPC	es_ES
dc.subject	Resilience	es_ES
dc.subject	Soft errors	es_ES
dc.subject	Computación de altas prestacións	es_ES
dc.subject	Paralelismo
dc.subject	Resiliencia
dc.subject	Erros soft
dc.subject	Tolerancia a los fallos
dc.title	Soft-fault recovery in MPI applications	es_ES
dc.type	info:eu-repo/semantics/bachelorThesis	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES

Files in this item

Name: license_rdf
Size: 1.346Kb
Format: application/rdf+xml

Name: D.Fernández_Rey_2020_Soft-faul ...
Size: 829.7Kb
Format: PDF

This item appears in the following collection(s)

Enxeñaría informática, Grao en [447]
