Failure Avoidance in MPI Applications Using an Application-Level Approach
Use this link to cite: http://hdl.handle.net/2183/20947
Iván Cores, Gabriel Rodríguez, Patricia González, María J. Martín; Failure Avoidance in MPI Applications Using an Application-Level Approach, The Computer Journal, Volume 57, Issue 1, 1 January 2014, Pages 100–114, https://doi.org/10.1093/comjnl/bxs158
[Abstract] Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures. For this reason, applications must tolerate hardware failures so that a machine failure does not wipe out all the computation already done. Checkpointing and rollback recovery is one of the most popular techniques for providing fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. New advances in the prediction of hardware failures have led to the development of proactive process migration approaches, where tasks are migrated preventively when node failures are anticipated, avoiding the restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate message passing interface (MPI) processes when impending failures are notified, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, by means of a light and asynchronous protocol to achieve a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library; and portability, as it is not locked into a particular architecture, operating system or MPI implementation.