Resilience 2018 - 11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids
Topics/Call fo Papers
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (such as due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, is essential to ensure the success of the extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Other CFPs
- International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms
- Workshop on Parallel and Distributed Computing for Life Sciences: Algorithms, Methodologies and Tools
- WORKSHOP ON ADVANCES IN HIGH-PERFORMANCE BIOINFORMATICS, SYSTEMS BIOLOGY (MED-HPC 2018)
- 2018 International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms
- 2018 International Workshop on Future Perspective of Decentralised Applications
Last modified: 2018-04-16 11:21:24