FTXS 2015 - International Workshop on Fault Tolerance for HPC at eXtreme Scale
Topics/Call for Papers
Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:
Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint restart, that are redundant in space, time or information
Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
Studies of hardware/software tradeoffs in error detection, failure prediction, error preemption, and recovery
Advances in monitoring, analysis, and control of highly complex systems
Highly scalable fault-tolerant programming models
Metrics and standards for measuring, improving, and enforcing the need for and effectiveness of fault tolerance
Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress
Frameworks and APIs for fault tolerance and fault management
Last modified: 2014-12-30 14:50:32