
Resilience 2012 - Workshop on Resiliency in High Performance Computing

Date: 2012-08-27

Deadline: 2012-06-04

Venue: Rhodes, Greece

Website: http://europar2012.cti.gr

Topics/Call for Papers

5th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids in conjunction with the
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012), Rhodes Island, Greece, August 27th - August 31st, 2012
Overview:
Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes. For example, high performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, whereas commercial cloud systems designed to support software as a service (SaaS) do not. However, in order to support HPC, all must at least utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.
Recent trends in HPC systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of these HPC systems increases from today's tera- and peta-scale to next-generation multi-peta-scale capability and beyond, their number of computational, networking, and storage components will grow from the ten to one hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to associated overheads, while proactive resiliency technologies, such as migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.
Important websites:
- Euro-Par 2012 at http://europar2012.cti.gr/

Last modified: 2012-05-06 14:05:23