A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications
Use this link to cite
http://hdl.handle.net/2183/20889
Unless otherwise indicated, the item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International (CC-BY-NC-ND 4.0)
Collections
- GI-GAC - Articles [193]
Metadata
Title
A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications
Date
2017-06
Bibliographic citation
Losada, N., Fraguela, B. B., González, P., & Martín, M. J. (2017). A portable and adaptable fault tolerance solution for heterogeneous applications. Journal of Parallel and Distributed Computing, 104, 146-158.
Abstract
[Abstract] Heterogeneous systems have increased their popularity in recent years due to the high performance and reduced energy consumption capabilities provided by using devices such as GPUs or Xeon Phi accelerators. This paper proposes a checkpoint-based fault tolerance solution for heterogeneous applications, allowing them to survive fail-stop failures in the host CPU or in any of the accelerators used. Besides, applications can be restarted changing the host CPU and/or the accelerator device architecture, and adapting the computation to the number of devices available during recovery. The proposed solution is built combining CPPC (ComPiler for Portable Checkpointing), an application-level checkpointing tool, and HPL (Heterogeneous Programming Library), a library that facilitates the development of OpenCL-based applications. Experimental results show the low overhead introduced by the proposal and prove its portability and adaptability benefits.
Keywords
Checkpointing
Fault tolerance
Heterogeneous systems
OpenCL
Portability
Publisher's version
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International (CC-BY-NC-ND 4.0) ©2017. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/
The formal publication is available at https://doi.org/10.1016/j.jpdc.2017.01.020
ISSN
0743-7315
1096-0848