Transient fault tolerance for ccNUMA architecture

  • Xingjun Zhang*
  • Endong Wang
  • Feilong Tang
  • Meishun Yang
  • Hengyi Wei
  • Xiaoshe Dong
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Transient faults are a critical concern for the reliability of microprocessor systems. Software fault tolerance is more flexible and less costly than hardware fault tolerance. Moreover, as architectural trends point toward multi-core designs, there is substantial interest in adapting parallel and redundant hardware resources for transient fault tolerance. This paper proposes a process-level fault tolerance technique, a software-centric approach that efficiently schedules and synchronizes redundant processes across the redundant processors of a ccNUMA system. This improves the execution efficiency of the redundant processes and reduces time and space overhead. The paper focuses on methods for detecting and handling errors in the redundant processes. A real prototype, designed to be transparent to the application, has been implemented. Test results show that the system detects, in a timely manner, soft errors in the CPU and memory that cause exceptions in the redundant processes, while ensuring that application services remain uninterrupted and incur only a short delay.
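
As a rough illustration of process-level redundancy in general, the sketch below runs the same workload in two OS processes and accepts the result only when both replicas exit cleanly and produce identical output. It is not the paper's prototype: the abstract does not describe the detection or recovery mechanism, so output comparison, retry by re-execution, and the names run_pair and run_redundant are assumptions made for illustration. NUMA-aware placement and synchronization of the replicas are omitted.

```python
import subprocess
import sys


def run_pair(code, timeout=10.0):
    """Start two replicas of the same workload concurrently, each in its own process.

    Returns a list of two (exit_code, stdout) tuples; exit_code is None on timeout.
    """
    procs = [
        subprocess.Popen([sys.executable, "-c", code],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(2)
    ]
    results = []
    for proc in procs:
        try:
            out, _ = proc.communicate(timeout=timeout)
            results.append((proc.returncode, out))
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()  # reap the killed replica
            results.append((None, ""))
    return results


def run_redundant(code, max_retries=2):
    """Return the agreed output, re-executing the pair when a fault is suspected.

    Assumption: a crash, timeout, or output mismatch is treated as a detected fault.
    """
    for _ in range(max_retries + 1):
        first, second = run_pair(code)
        if first[0] == 0 and first == second:
            return first[1]
    raise RuntimeError("replicas did not agree within the retry budget")
```

For example, run_redundant("print(6 * 7)") starts two interpreter processes and returns their common output. A production system would recover from a checkpoint or from the surviving replica instead of re-running from scratch, which is what keeps the service delay short.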

Original language: English
Title of host publication: Proceedings - 6th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2012
Pages: 197-202
Number of pages: 6
State: Published - 2012
Externally published: Yes
Event: 6th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2012 - Palermo, Italy
Duration: 4 Jul 2012 to 6 Jul 2012

Publication series

Name: Proceedings - 6th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2012

Conference

Conference: 6th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2012
Country/Territory: Italy
City: Palermo
Period: 4/07/12 to 6/07/12

Keywords

  • Transient fault
  • ccNUMA
  • dual-process
