Resilient MPI applications using an application-level checkpointing framework and ULFM

UDC.coleccion: Investigación [es_ES]
UDC.departamento: Enxeñaría de Computadores [es_ES]
UDC.endPage: 113 [es_ES]
UDC.grupoInv: Grupo de Arquitectura de Computadores (GAC) [es_ES]
UDC.issue: 1 [es_ES]
UDC.journalTitle: Journal of Supercomputing [es_ES]
UDC.startPage: 100 [es_ES]
UDC.volume: 73 [es_ES]
dc.contributor.author: Losada, Nuria
dc.contributor.author: Cores González, Iván
dc.contributor.author: Martín, María J.
dc.contributor.author: González, Patricia
dc.date.accessioned: 2018-07-10T14:29:26Z
dc.date.available: 2018-07-10T14:29:26Z
dc.date.issued: 2017-01
dc.description: This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7 [es_ES]
dc.description.abstract: [Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. Besides, a multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes. [es_ES]
dc.description.sponsorship: Ministerio de Economía y Competitividad; TIN2013-42148-P [es_ES]
dc.description.sponsorship: Ministerio de Economía y Competitividad; TIN2014-53522-REDT [es_ES]
dc.description.sponsorship: Ministerio de Economía y Competitividad; BES-2014-068066 [es_ES]
dc.description.sponsorship: Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/055 [es_ES]
dc.identifier.citation: Losada, N., Cores, I., Martín, M.J. et al. J Supercomput (2017) 73: 100. https://doi.org/10.1007/s11227-016-1629-7 [es_ES]
dc.identifier.doi: 10.1007/s11227-016-1629-7
dc.identifier.issn: 0920-8542
dc.identifier.issn: 1573-0484
dc.identifier.uri: http://hdl.handle.net/2183/20890
dc.language.iso: eng [es_ES]
dc.publisher: Springer New York LLC [es_ES]
dc.relation.uri: https://doi.org/10.1007/s11227-016-1629-7 [es_ES]
dc.rights.accessRights: open access [es_ES]
dc.subject: Resilience [es_ES]
dc.subject: Checkpointing [es_ES]
dc.subject: Fault tolerance [es_ES]
dc.subject: MPI [es_ES]
dc.title: Resilient MPI applications using an application-level checkpointing framework and ULFM [es_ES]
dc.type: journal article [es_ES]
dspace.entity.type: Publication
relation.isAuthorOfPublication: 992b3436-2d71-403c-8922-e22060554a96
relation.isAuthorOfPublication: 040e0007-80e8-4213-b049-be346ac2b018
relation.isAuthorOfPublication: 049797cb-6695-43ea-8f32-efc754fbfda6
relation.isAuthorOfPublication: 0ed2a744-9046-4c62-8300-a17ef95bea86
relation.isAuthorOfPublication.latestForDiscovery: 992b3436-2d71-403c-8922-e22060554a96

Files

Original bundle

Name: Nuria_Losada_Resilient_MPI_Applications_using_an_Application-level_Checkpointing_Framework_and_ULFM_2017.pdf
Size: 846.29 KB
Format: Adobe Portable Document Format
Description: