    • Failure Avoidance in MPI Applications Using an Application-Level Approach

      Cores González, Iván; Rodríguez, Gabriel; González, Patricia; Martín, María J. (Oxford University Press, 2014)
      [Abstract] Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean-time-between-failures. For this reason, hardware failures must be tolerated by the ...
    • Fault-tolerance and malleability in parallel message-passing applications 

      Cores González, Iván (2015)
      [Abstract] This thesis explores fault-tolerance and malleability solutions based on checkpoint and restart techniques for message-passing applications. In the field of fault tolerance, this thesis contributes ...
    • Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes 

      Cores González, Iván; Rodríguez, Gabriel; Martín, María J.; González, Patricia; Osorio, Roberto (Springer Japan KK, 2013)
      [Abstract] The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures ...
    • In-memory application-level checkpoint-based migration for MPI programs 

      Cores González, Iván; Rodríguez, Gabriel; Martín, María J.; González, Patricia (Springer New York LLC, 2014)
      [Abstract] Process migration provides many benefits for parallel environments including dynamic load balancing, data access locality or fault tolerance. This paper describes an in-memory application-level checkpoint-based ...
    • Reducing the overhead of an MPI application-level migration approach 

      Cores González, Iván; Rodríguez, Mónica; González, Patricia; Martín, María J. (Elsevier BV * North-Holland, 2016)
      [Abstract] Process migration provides many benefits for parallel environments including dynamic load balance, data access locality, or fault tolerance. This work proposes a solution that reduces the memory and I/O overhead ...
    • Resilient MPI applications using an application-level checkpointing framework and ULFM 

      Losada, Nuria; Cores González, Iván; Martín, María J.; González, Patricia (Springer New York LLC, 2017-01)
      [Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. ...