Resilience 2014 - Seventh Workshop on Resiliency in High Performance Computing with Clouds, Grids, and Clusters
Topics/Call for Papers
Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes. For example, high-performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, whereas commercial cloud systems designed to support software as a service (SaaS) do not. However, in order to support HPC, all must at least utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.
Recent trends in high-performance computing (HPC) systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases from today’s multi-petascale to next-generation exascale capability and beyond, their number of computational, networking, and storage components will grow from the 10,000 to 100,000 compute nodes of today’s systems to several hundreds of thousands of compute nodes in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to reliability, availability, and serviceability (RAS).
The expected total component count of these HPC systems calls into question many of today’s HPC RAS assumptions. Although the mean time to failure (MTTF) for each individual component, e.g., processor, memory module, and network interface, may be above typical consumer product standards, the probability of failure for the overall system grows with the number of interdependent components and their combined probabilities of failure. Thus, the enormous number of individual components results in a much lower system mean time to failure (SMTTF), causing more frequent system-wide interruptions than displayed by current HPC systems. This effect is not limited to hardware components, but also extends to software components, e.g., operating system, system software, and applications. Although software components, unlike hardware components, do not become less reliable with age, they do contain other sources of failures, such as design and implementation errors. Furthermore, the health of software components also involves resource utilization, such as processor, memory, and network usage.
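The scaling argument above can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that all components are identical, fail independently, have exponentially distributed lifetimes, and that any single component failure interrupts the whole system. None of these assumptions come from the workshop text, and the function names and the numbers used (a 1,000,000-hour component MTTF and 100,000 components) are hypothetical.

```python
import math


def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    """SMTTF of n identical, independent components in series.

    Assumption: exponential lifetimes, so the component failure rates
    (1 / MTTF) simply add up and SMTTF = MTTF / n.
    """
    if n_components < 1 or component_mttf_hours <= 0:
        raise ValueError("need a positive MTTF and at least one component")
    return component_mttf_hours / n_components


def prob_failure_within(hours: float, component_mttf_hours: float,
                        n_components: int) -> float:
    """Probability that at least one component fails within `hours`."""
    system_rate = n_components / component_mttf_hours  # failures per hour
    return 1.0 - math.exp(-system_rate * hours)


if __name__ == "__main__":
    # Hypothetical figures chosen only to show the effect of scale.
    mttf = 1_000_000.0   # hours per component (roughly 114 years)
    nodes = 100_000
    print(f"SMTTF: {system_mttf_hours(mttf, nodes):.1f} hours")
    print(f"P(failure within 24 h): {prob_failure_within(24, mttf, nodes):.3f}")
```

Under these assumptions, a component that fails on average once in about 114 years still yields a system that is interrupted roughly every 10 hours, with about a 91% chance of at least one failure in any 24-hour window.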
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and may not exceed 12 pages, including figures, tables, and references, using Springer's Lecture Notes in Computer Science (LNCS) format at . Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2014: http://xcr.cenit.latech.edu/resilience2014
- Resilience 2014 Submissions: https://www.easychair.org/conferences/?conf=europa...
- Euro-Par 2014 website: https://europar2014.dcc.fc.up.pt/
Topics of interest include, but are not limited to:
- Hardware for fault detection and resiliency
- System-level resiliency for HPC, Grid, Cluster, and Cloud
- Algorithm-based resiliency: generic, fundamental advances (not Hadoop)
- Statistical methods to improve system resiliency
- Experiments with fault tolerance mechanisms
- Resource management for system resiliency and availability
- Resilient systems based on hardware probes
- Monitoring mechanisms to support fault prediction and fault mitigation
- Application-level fault tolerance
- Fault prediction and failure modeling
Important Dates:
- Workshop papers due: May 30, 2014
- Workshop author notification: July 4, 2014
- Workshop early registration: July 25, 2014
- Workshop camera-ready papers due: October 3, 2014
General Co-Chairs:
- Stephen L. Scott
Stonecipher/Boeing Distinguished Professor of Computing
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl-AT-ornl.gov
- Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box-AT-latech.edu
Program Co-Chairs:
- Patrick G. Bridges
University of New Mexico, USA
bridges-AT-cs.unm.edu
- Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc-AT-ornl.gov
Program Committee:
- Ferrol Aderholdt, Tennessee Tech University
- Vassil Alexandrov, Barcelona Supercomputing Center
- Wesley Bland, Argonne National Laboratory
- Greg Bronevetsky, Lawrence Livermore National Laboratory
- Franck Cappello, INRIA and University of Illinois at Urbana-Champaign
- Zizhong Chen, University of California at Riverside
- Nathan DeBardeleben, Los Alamos National Laboratory
- Kurt Ferreira, Sandia National Laboratories
- Cecile Germain, Université Paris-Sud
- Larry Kaplan, Cray Inc.
- Dieter Kranzlmüller, Ludwig-Maximilians University of Munich
- Sriram Krishnamoorthy, Pacific Northwest National Laboratory
- Scott Levy, University of New Mexico
- Celso Mendes, University of Illinois at Urbana-Champaign
- Kathryn Mohror, Lawrence Livermore National Laboratory
- Christine Morin, INRIA Rennes
- Mihaela Paun, Louisiana Tech University
- Alexander Reinefeld, Zuse Institute Berlin
- Rolf Riesen, Intel Corporation
Last modified: 2014-05-05 23:07:00