    • A Heuristic Approach for the Automatic Insertion of Checkpoints in Message-Passing Codes

      Rodríguez, Gabriel; Martín, María J.; González, Patricia; Touriño, Juan (Graz University of Technology, Institute for Information Systems and Computer Media, 2009-08)
      [Abstract] Checkpointing tools are typically implemented at one of two abstraction levels: the system level or the application level. The latter has become a more popular alternative due to its flexibility and ...
    • A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications 

      Losada, Nuria; Fraguela, Basilio B.; González, Patricia; Martín, María J. (Academic Press, 2017-06)
      [Abstract] Heterogeneous systems have increased their popularity in recent years due to the high performance and reduced energy consumption capabilities provided by using devices such as GPUs or Xeon Phi accelerators. This ...
    • Analysis of Performance-impacting Factors on Checkpointing Frameworks: The CPPC Case Study 

      Rodríguez, Gabriel; Martín, María J.; Touriño, Juan; González, Patricia (Oxford University Press, 2011-11-01)
      [Abstract] This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a tool for the checkpointing of parallel message-passing applications. Its performance and the factors that impact ...
    • Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications 

      Losada, Nuria; Martín, María J.; González, Patricia (Springer New York LLC, 2017-01)
      [Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with ...
    • Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience 

      Rodríguez, Gabriel; Martín, María J.; González, Patricia; Touriño, Juan; Doallo, Ramón (Springer New York LLC, 2013)
      [Abstract] With the evolution of high-performance computing, parallel applications have developed an increasing necessity for fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools ...
    • Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications 

      Losada, Nuria; Martín, María J.; Rodríguez, Gabriel; González, Patricia (Graz University of Technology, Institute for Information Systems and Computer Media, 2014-09)
      [Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an ...
    • Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes 

      Cores González, Iván; Rodríguez, Gabriel; Martín, María J.; González, Patricia; Osorio, Roberto (Springer Japan KK, 2013)
      [Abstract] The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures ...
    • Resilient MPI applications using an application-level checkpointing framework and ULFM 

      Losada, Nuria; Cores González, Iván; Martín, María J.; González, Patricia (Springer New York LLC, 2017-01)
      [Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. ...