Use this link to cite:
http://hdl.handle.net/2183/20890 Resilient MPI applications using an application-level checkpointing framework and ULFM
Loading...
Identifiers
Publication date
Advisors
Other responsabilities
Journal Title
Bibliographic citation
Losada, N., Cores, I., Martín, M.J. et al. J Supercomput (2017) 73: 100. https://doi.org/10.1007/s11227-016-1629-7
Type of academic work
Academic degree
Abstract
[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. Besides, a multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.
Description
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7






