Showing items 1-10 of 12
Resilient MPI applications using an application-level checkpointing framework and ULFM
(Springer New York LLC, 2017-01)
[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. ...
Analysis of Performance-impacting Factors on Checkpointing Frameworks: The CPPC Case Study
(Oxford University Press, 2011-11-01)
[Abstract] This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a tool for the checkpointing of parallel message-passing applications. Its performance and the factors that impact ...
Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes
(Springer Japan KK, 2013)
[Abstract] The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures ...
Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications
(Springer New York LLC, 2017-01)
[Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with ...
Reducing the overhead of an MPI application-level migration approach
(Elsevier BV * North-Holland, 2016)
[Abstract] Process migration provides many benefits for parallel environments including dynamic load balance, data access locality, or fault tolerance. This work proposes a solution that reduces the memory and I/O overhead ...
Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging
(Elsevier BV * North-Holland, 2019-02)
[Abstract] The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many ...
Fault tolerance of MPI applications in exascale systems: The ULFM solution
(Elsevier BV * North-Holland, 2020-05)
[Abstract] The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running ...
In-memory application-level checkpoint-based migration for MPI programs
(Springer New York LLC, 2014)
[Abstract] Process migration provides many benefits for parallel environments including dynamic load balancing, data access locality or fault tolerance. This paper describes an in-memory application-level checkpoint-based ...
CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications
(John Wiley & Sons Ltd., 2010-11-19)
[Abstract] With the evolution of high‐performance computing toward heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the ...
Fast search of third-order epistatic interactions on CPU and GPU clusters
(Sage Publications Ltd., 2019-05-27)
[Abstract] Genome-Wide Association Studies (GWASs), analyses that try to find a link between a given phenotype (such as a disease) and genetic markers, have been growing in popularity in the recent years. Relations between ...