Failure Avoidance in MPI Applications Using an Application-Level Approach
Use this link to cite
http://hdl.handle.net/2183/20947
Collections
- GI-GAC - Articles [193]
Metadata
Title
Failure Avoidance in MPI Applications Using an Application-Level Approach
Date
2014
Bibliographic citation
Iván Cores, Gabriel Rodríguez, Patricia González, María J. Martín; Failure Avoidance in MPI Applications Using an Application-Level Approach, The Computer Journal, Volume 57, Issue 1, 1 January 2014, Pages 100–114, https://doi.org/10.1093/comjnl/bxs158
Abstract
Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures. For this reason, applications must tolerate hardware failures so that a machine failure does not wipe out all the computation done so far. Checkpointing and rollback recovery is one of the most popular techniques for providing fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. Recent advances in the prediction of hardware failures have led to the development of proactive process migration approaches, in which tasks are migrated preventively when node failures are anticipated, avoiding a restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate Message Passing Interface (MPI) processes when it is notified of impending failures, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, by means of a light, asynchronous protocol designed to achieve a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library; and portability, as it is not tied to a particular architecture, operating system or MPI implementation.
Keywords
Failure avoidance
Proactive migration
Checkpointing
Message passing
ISSN
0010-4620
1460-2067