Resilient MPI applications using an application-level checkpointing framework and ULFM
Use this link to cite
http://hdl.handle.net/2183/20890
Collections
- GI-GAC - Artigos [192]
Metadata
Title
Resilient MPI applications using an application-level checkpointing framework and ULFM
Date
2017-01
Citation
Losada, N., Cores, I., Martín, M.J. et al. J Supercomput (2017) 73: 100. https://doi.org/10.1007/s11227-016-1629-7
Abstract
Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need new fault tolerance techniques to ensure that they run to completion. The Fault Tolerance Working Group, within the MPI Forum, has presented the User Level Failure Mitigation (ULFM) proposal, which provides new functionalities for implementing resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.
Keywords
Resilience
Checkpointing
Fault tolerance
MPI
Description
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7
Editor version
ISSN
0920-8542
1573-0484