Buscar
Mostrando ítems 1-4 de 4
Resilient MPI applications using an application-level checkpointing framework and ULFM
(Springer New York LLC, 2017-01)
[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. ...
Local Rollback for Resilient Mpi Applications With Application-Level Checkpointing and Message Logging
(Elsevier BV * North-Holland, 2019-02)
[Abstract]
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many ...
Fault tolerance of MPI applications in exascale systems: The ULFM solution
(Elsevier BV * North-Holland, 2020-05)
[Abstract]
The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running ...
Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications
(Springer New York LLC, 2017-01)
[Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with ...