Browsing by Author "Losada, Nuria"
A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications
Losada, Nuria; Fraguela, Basilio B.; González, Patricia; Martín, María J. (Academic Press, 2017-06)[Abstract] Heterogeneous systems have increased their popularity in recent years due to the high performance and reduced energy consumption capabilities provided by using devices such as GPUs or Xeon Phi accelerators. This ...
Application-level Fault Tolerance and Resilience in HPC Applications
Losada, Nuria (2018)[Abstract] The computational needs of the various branches of science have grown enormously in recent years, which has led to a large increase in the performance delivered by supercomputers. Increasingly being built are ...
Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications
Losada, Nuria; Martín, María J.; González, Patricia (Springer New York LLC, 2017-01)[Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with ...
Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications
Losada, Nuria; Martín, María J.; Rodríguez, Gabriel; González, Patricia (Technische Universitaet Graz * Institut fuer Informationssysteme und Computer Medien, Graz University of Technology, Institute for Information Systems and Computer Media, 2014-09)[Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an ...
Fault tolerance of MPI applications in exascale systems: The ULFM solution
Losada, Nuria; González, Patricia; Martín, María J.; Bosilca, George; Bouteiller, Aurelien; Teranishi, Keita (Elsevier BV * North-Holland, 2020-05)[Abstract] The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running ...
Local Rollback for Resilient MPI Applications With Application-Level Checkpointing and Message Logging
Losada, Nuria; Bosilca, George; Bouteiller, Aurelien; González, Patricia; Martín, María J. (Elsevier BV * North-Holland, 2019-02)[Abstract] The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many ...
Resilient MPI applications using an application-level checkpointing framework and ULFM
Losada, Nuria; Cores González, Iván; Martín, María J.; González, Patricia (Springer New York LLC, 2017-01)[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. ...