|
Title: Cray OS Road Map
Abstract: This paper will discuss Cray's operating system road map. This includes the compute node OS, the service node OS, the network stack, file systems, and administrative tools. Coming changes will be previewed, and themes of future releases will be discussed. |
Author(s):
Carroll, Charlie, Presenter Cray Inc.
|
Suggested Technical Category:
System Operations
|
|
Title: A Pedagogical Approach to User Assistance
Abstract: This presentation will focus on a pedagogical approach to providing user assistance. By making user education the central theme in training, outreach, and user assistance activities, a set of competencies can be developed that encompasses the knowledge required for productive use of leadership-class computing resources such as the Cray XT5 Jaguar system. |
Author(s):
Whitten, Jr., Robert, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Consulting
|
|
Title: A Scalable Boundary Adjusting High-Resolution Technique for Turbulent Flows
Abstract: To accurately resolve turbulent flow structures, high-fidelity simulations require the use of millions of grid points. The Compact Accurately Boundary Adjusting High-Resolution Technique (CABARET) is capable of producing accurate results with at least 10 times more efficiency than conventional schemes. CABARET is based on a local second-order finite difference scheme which lends itself extremely well to large scale distributed systems. For Reynolds numbers of 10^4 the method gives rapid convergence without requiring additional preconditioning for Mach numbers as low as 0.05. In this paper we shall discuss the implementation and performance of the CABARET method on the HECToR XT4/6 system. We shall describe the development and optimization of an irregular parallel decomposition for the hexahedral numerical grid structure. Scalability of the code will be discussed in relation to i) the effectiveness of the load balancing for grids generated from the partitioning method ii) compiler performance and iii) efficient use of MPI and memory utilisation. |
Author(s):
Karabasov, Sergey University of Cambridge
Ridley, Phil, Presenter Numerical Algorithms Group
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Optimisation of the I/O for Distributed Data Molecular Dynamics Applications
Abstract: With the increase in size of HPC facilities, it is not only the parallel performance of applications that is preventing greater exploitation; in many cases the I/O is the bottleneck. This is especially the case for distributed data algorithms. In this paper we will discuss how the I/O in the distributed data molecular dynamics application DL_POLY_3 has been optimised. In particular we shall show that extensive data redistribution, specifically to allow best use of the I/O subsystem, can result in a code that scales to many more processors, despite the large increase in communications required. |
Author(s):
Smith, William Numerical Algorithms Group
Todorov, Ilian Numerical Algorithms Group
Bush, Ian, Presenter Numerical Algorithms Group
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Automatic Library Tracking Database
Abstract: The National Institute for Computational Sciences and the National Center for Computational Sciences (both located at Oak Ridge National Laboratory) have been working on an automatic library tracking database whose purpose is to track which libraries are used on their Cray XT5 supercomputers. The database stores the libraries that are used at link time and records which executable is run during a batch job. With this data, many operationally important questions can be answered, such as which libraries are most frequently used and who is using deprecated libraries or applications. The infrastructure design and reporting mechanisms will be presented with production data to this point. |
Author(s):
Jones, Nicholas National Institute for Computational Sciences (NICS)
Fahey, Mark, Presenter National Institute for Computational Sciences (NICS)
Hadri, Bilel National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Libraries
|
|
Title: DMAPP—An API for One-sided Program Models on Baker Systems
Abstract: Baker Systems and follow-on systems will deliver a network with advanced remote memory access capabilities. A new API (DMAPP) has been developed to expose these capabilities to one-sided program models. This paper presents the DMAPP API as well as some preliminary performance data. |
Author(s):
ten Bruggencate, Monika, Presenter Cray Inc.
|
Suggested Technical Category:
Programming Environment
|
|
Title: Using Quality of Service for Scheduling on Cray XT Systems
Abstract: The University of Tennessee's National Institute for Computational Sciences (NICS) operates two Cray XT systems for the U.S. National Science Foundation (NSF): Kraken, an 88-cabinet XT5 system, and Athena, a 48-cabinet XT4 system. Access to Kraken is allocated through the NSF's Teragrid allocations process, while Athena is currently being dedicated to individual projects on a quarterly basis; as a result, the two systems have somewhat different scheduling goals. However, user projects on both systems have sometimes required the use of quality of service (QoS) levels for scheduling of certain sets of jobs. We will present case studies of three situations where QoS levels were used to fulfill specific requirements: two on Kraken in fully allocated production service, and one on Athena while dedicated to an individual project. These case studies will include lessons learned about impact on other users and unintended side effects. |
Author(s):
Baer, Troy, Presenter National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Operations
|
|
Title: Use of the Cray XT5 Architecture to Push the Limits of WRF Beyond One Billion Gridpoints
Abstract: The Arctic Region Supercomputing Center (ARSC) Weather Research and Forecasting (WRF) model benchmark suite continues to push software and available hardware limits by successfully running a 1km resolution case study composed of more than one billion grid points. Simulations of this caliber are important for providing detailed weather forecasts over the rugged Alaska terrain and are intended for benchmarking on systems with tens of thousands of cores. In pursuing these large scale simulations, we have encountered numerical, software and hardware limitations that have required us to use various parallel I/O schemes and to explore different PBS "aprun" options. In this paper we will discuss issues encountered while gradually expanding the problem sizes in which WRF can operate and our solutions in running high resolution and/or large-scale WRF simulations on the Cray XT5 architecture. |
Author(s):
Nudson, Oralee, Presenter Arctic Region Supercomputing Center (ARSC)
Morton, Dr. Don Arctic Region Supercomputing Center (ARSC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: MRNet: A Scalable Infrastructure for Development of Parallel Tools and Applications
Abstract: MRNet is a customizable, high-throughput communication software system for parallel tools and applications. It reduces the cost of these tools' activities by incorporating a tree-based overlay network (TBON) of processes between the tool's front-end and back-ends. MRNet was recently ported and released for Cray XT systems. In this talk we describe the main features that make MRNet well-suited as a general facility for building scalable parallel tools. We present our experiences with MRNet and examples of its use. |
Author(s):
Miller, Barton, Presenter University of Wisconsin
Roth, Philip Oak Ridge National Laboratory (ORNL)
DeRose, Luiz Cray Inc.
|
Suggested Technical Category:
Programming Environment
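The tree-based overlay network (TBON) idea in the MRNet abstract above can be illustrated with a minimal Python sketch. This is not MRNet's API; `tree_reduce`, `fanout`, and `combine` are hypothetical names chosen for illustration. It assumes each internal process combines the results of at most `fanout` children, so the front-end never handles more than `fanout` inputs directly.

```python
def tree_reduce(values, fanout, combine):
    """Reduce back-end values level by level through a tree of the given fan-out.

    Returns the combined result and the number of tree levels traversed.
    Illustrative only: real overlay networks run each node as a separate process.
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one back-end value")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    depth = 0
    while len(level) > 1:
        # Each internal node combines the outputs of up to `fanout` children.
        level = [combine(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        depth += 1
    return level[0], depth


# 1000 back-ends with a fan-out of 10 need three levels of aggregation.
total, depth = tree_reduce(range(1, 1001), fanout=10, combine=sum)
```

With a flat (non-tree) layout the front-end would receive all 1000 values itself; the tree trades a few extra hops for a bounded load per process.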
|
|
Title: Hierarchy Aware Blocking and Nonblocking Collective Communications-The Effects of Shared Memory and Torus Topologies in the Cray XT5 environment
Abstract: MPI collective operations tend to play a large role in limiting the scalability of high-performance scientific simulation codes. As such, developing methods for improving the scalability of these operations is critical to improving the scalability of such applications. Using infrastructure recently developed in the context of the FASTOS program, we will study the performance of blocking collective operations, as well as those of the recently added MPI nonblocking collective operations, taking into account both shared memory and network topologies. |
Author(s):
Graham, Richard, Presenter Oak Ridge National Laboratory (ORNL)
Ladd, Joshua Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Networking
|
|
Title: 2DECOMP&FFT—A Highly Scalable 2D Decomposition Library and FFT Interface
Abstract: As part of a HECToR distributed CSE support project, a general-purpose 2D decomposition (also known as 'pencil' or 'drawer' decomposition) communication library has been developed. This Fortran library provides a powerful and flexible framework to build applications based on 3D Cartesian data structures and spatially implicit numerical schemes (such as compact finite difference or spectral methods). The library also supports shared-memory architectures, which are becoming increasingly popular. A user-friendly FFT interface has been built on top of the communication library to perform distributed multi-dimensional FFTs. Both the decomposition library and the FFT interface scale well to tens of thousands of cores on Cray XT systems. The library has been applied to Incompact3D, a CFD application performing large-scale Direct Numerical Simulations of turbulence, enabling exciting scientific studies to be conducted. |
Author(s):
Li, Ning, Presenter Numerical Algorithms Group
Laizet, Sylvain Imperial College London
|
Suggested Technical Category:
Libraries
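As a rough illustration of the 2D ('pencil') decomposition named in the abstract above, the following Python sketch computes the local array shape of an x-oriented pencil. It is not the library's Fortran interface; the function names and the convention that leading chunks absorb the remainder are assumptions made for this example.

```python
def split(n, parts):
    """Sizes of `parts` nearly equal chunks of n grid points.

    Assumption for this sketch: the first (n % parts) chunks get one extra point.
    """
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def x_pencil_shape(nx, ny, nz, p_row, p_col, row, col):
    """Local shape of the x-pencil owned by process (row, col).

    The x dimension stays whole, y is split over p_row and z over p_col.
    """
    return (nx, split(ny, p_row)[row], split(nz, p_col)[col])


# A 64^3 grid on a 4 x 8 process grid: each x-pencil is 64 x 16 x 8.
shape = x_pencil_shape(64, 64, 64, p_row=4, p_col=8, row=0, col=0)
```

Because one dimension is always complete on every process, a 1D FFT along that dimension needs no communication; a transposition to y- or z-pencils is only required between the 1D stages.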
|
|
Title: Mixed Mode Computation in CASINO
Abstract: CASINO is a quantum Monte Carlo code that solves many-particle Schroedinger equations with the help of configurations of random walkers. This method is suitable for parallel computation because it has a very good computation/communication ratio. The standard parallel algorithm increases the computation speed by distributing the configurations equally among the available processors. For a computation with P processing elements, the computation time for Nc configurations is proportional to Nc*tc/P, where tc is the average time taken for one configuration step. On petascale computers one can have more processing elements than configurations; moreover, for models with more than 1000 electrons tc increases significantly. We present a mixed mode implementation of CASINO that takes advantage of architectures with large numbers of multicore processors to improve computation speed by using multiple OpenMP threads for the computation of each configuration step. |
Author(s):
Anton, Lucian, Presenter Numerical Algorithms Group
Alfe, Dario University College London
|
Suggested Technical Category:
User Code Optimization
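The timing argument in the CASINO abstract above (time proportional to Nc*tc/P) can be made concrete with a small cost model. This is a sketch under stated assumptions, not the authors' model: configurations are indivisible in the pure MPI case, `thread_eff` is a hypothetical parallel efficiency for the OpenMP threads, and communication cost is ignored.

```python
import math


def pure_mpi_time(n_configs, t_step, n_procs):
    """Time per step when each process handles whole configurations.

    Processes beyond n_configs have no work, so the time cannot drop below t_step.
    """
    return math.ceil(n_configs / n_procs) * t_step


def mixed_mode_time(n_configs, t_step, n_procs, n_threads, thread_eff=1.0):
    """Time per step when n_threads cores cooperate on each configuration step.

    thread_eff is an assumed threading efficiency in (0, 1].
    """
    if n_procs % n_threads != 0:
        raise ValueError("n_procs must be a multiple of n_threads")
    n_ranks = n_procs // n_threads
    per_config = t_step / (n_threads * thread_eff)
    return math.ceil(n_configs / n_ranks) * per_config


# 1000 configurations, 2 s per step, 4000 cores:
# pure MPI leaves 3000 cores idle, 4 threads per rank puts them to work.
t_mpi = pure_mpi_time(1000, 2.0, 4000)
t_mixed = mixed_mode_time(1000, 2.0, 4000, n_threads=4)
```

While P does not exceed Nc the two models agree with Nc*tc/P; the mixed mode only pays off once there are more processing elements than configurations or tc grows large.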
|
|
Title: Regression Testing on Petaflop Computational Resources
Abstract: As the complexity of supercomputers increases, it is becoming more difficult to measure how system performance changes over time. Routine system checks performed after scheduled maintenance or emergency downtime give administrators an instantaneous glimpse of system performance; however, rigorous testing, such as that performed for machine acceptance, provides more in-depth information on system performance. Both routine and rigorous testing are necessary to fully characterize system performance, and a mechanism to store and compare previous results is needed to determine the change in system performance over time. A regression testing framework has been developed at the National Institute for Computational Sciences (NICS) which provides a mechanism to measure the change in system performance over time. These performance results can also be correlated to system events such as downtimes, system upgrades, or any other documented system change. We will describe the design and implementation of the regression testing framework, including the development of test suites, interfaces to the batch system, and the extraction of performance data. The import of extracted data into a relational database for long-term storage, report generation, and real-time analysis will also be discussed. |
Author(s):
McCarty, Mike, Presenter National Institute for Computational Sciences (NICS)
Baer, Troy National Institute for Computational Sciences (NICS)
Crosby, Lonnie National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
System Operations
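The idea of storing test results in a relational database and comparing them over time, as described in the regression testing abstract above, might look like the following sketch. The schema, table name, and function names are hypothetical and are not taken from the NICS framework; SQLite from the Python standard library stands in for whatever database is actually used.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a results database with a single hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(test TEXT, run_date TEXT, seconds REAL)"
    )
    return conn


def record(conn, test, run_date, seconds):
    """Store one benchmark timing; run_date is an ISO date string."""
    conn.execute("INSERT INTO results VALUES (?, ?, ?)", (test, run_date, seconds))
    conn.commit()


def relative_change(conn, test):
    """Fractional change of the latest run versus the one before it, or None."""
    rows = conn.execute(
        "SELECT seconds FROM results WHERE test = ? "
        "ORDER BY run_date DESC LIMIT 2",
        (test,),
    ).fetchall()
    if len(rows) < 2:
        return None
    latest, previous = rows[0][0], rows[1][0]
    return (latest - previous) / previous


conn = open_db()
record(conn, "io_bench", "2010-01-10", 100.0)
record(conn, "io_bench", "2010-02-14", 110.0)
change = relative_change(conn, "io_bench")  # positive means the test got slower
```

Joining such a table against a log of downtimes or upgrades would be one way to correlate performance changes with system events.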
|
|
Title: Combining OpenMP and MPI within GLOMAP Mode to Take Advantage of Multiple Core Processors: An Example of Legacy Software Keeping Pace with Hardware Developments
Abstract: The MPI version of GLOMAP MODE is being used in production runs for research into atmospheric science. The memory requirement prohibits use of high resolution scenarios, so 32 MPI tasks is the usual decomposition. One way to attempt higher resolution simulations is to under-populate the nodes, making more memory available per MPI task. Although this is wasteful of resources, it does provide a shorter time per existing simulation. The NAG Ltd DCSE service has examined the code and introduced OpenMP so that the otherwise "idle" cores can contribute to the MPI task. This improves the performance so that the additional cost of a simulation is reduced. |
Author(s):
Richardson, Mark, Presenter Numerical Algorithms Group
Mann, Graham University of Leeds
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Improving the Performance of CP2K on the Cray XT
Abstract: CP2K is a freely available and increasingly popular Density Functional Theory code for the simulation of a wide range of systems. It is heavily used on many Cray XT systems, including 'HECToR' in the UK and 'Monte Rosa' in Switzerland. We describe performance optimisations made to the code in several key areas, including 3D Fourier Transforms, and present the implementation of a load balancing scheme for multi-grids. These result in performance gains of around 30% on 256 cores (for a generally representative benchmark) and up to 300% on 1024 cores (for non-homogeneous systems). Early results from the implementation of hybrid MPI/OpenMP parallelism in the code are also presented. |
Author(s):
Bethune, Iain, Presenter EPCC (EPCC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: The NEMO Ocean Modelling Code: A Case Study
Abstract: We present a case study of a popular ocean modelling code, NEMO, on the Cray XT4 HECToR system. HECToR is the UK's high-end computing resource for academic users. Two different versions of NEMO have been investigated. The performance and scaling of the code have been evaluated and optimised by investigating the choice of grid dimensions, by examining the use of land versus ocean grid cells and also by checking for memory bandwidth problems. The code was profiled and the time spent carrying out file input/output was identified as a potential bottleneck. We present a solution to this problem which gives a significant saving in terms of runtime and disk space usage. |
Author(s):
Reid, Fiona, Presenter EPCC (EPCC)
|
Suggested Technical Category:
Joint Session
Joint Session, Tutorial or Other
Technical Category suggested:
3rd Party Applications/User Code Optimization
|
|
Title: Configuring and Optimising the Weather Research and Forecast Model On the Cray XT
Abstract: The Weather Research and Forecast (WRF) Model is a well-established and widely used application. Designed and written to be highly scalable, the code has a large number of configuration options at both compile- and run-time. We report the results of an investigation into the effect of these options on the performance of WRF on a Cray XT4 with a typical scientific use-case. Covering areas such as MPI/OpenMP comparison, cache usage and I/O performance, we discuss the implications for both regular WRF users and the authors of other application codes. |
Author(s):
Porter, Andrew, Presenter STFC Daresbury Laboratory
Ashworth, Mike HPCX Consortium (HPCX)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: A Hybrid MPI/OpenMP Code Employing a High-Order Compact Scheme for the Simulation of Hypersonic Aerodynamics
Abstract: High-order compact schemes are excellent candidates for Direct Numerical Simulation and Large Eddy Simulation of flow fields. We have devised a high-order compact scheme suitable for the simulation of hypersonic flows, to exploit both shared and distributed memory paradigms. Our hybrid application, employing both MPI and OpenMP standards, has been tested on HECToR. |
Author(s):
Fico, Vincenzo, Presenter STFC Daresbury Laboratory
Emerson, David HPCX Consortium (HPCX)
Reese, Jason University of Strathclyde
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: High Performance Computing Driven Software Development for Next-Generation Modeling of the World’s Oceans
Abstract: The Imperial College Ocean Model (ICOM) is an open-source next generation ocean model built upon finite element methods and anisotropic unstructured adaptive meshing. Since 2009, a project has been funded by EPSRC to optimize ICOM for the UK national HPC service, HECToR. Extensive use of profiling tools such as CrayPAT and Vampir has been made in order to understand performance issues of the code on the Cray XT4. Of particular interest is the scalability of the sparse linear solvers and the algebraic multigrid preconditioners required to solve the system of equations. The scalability of model I/O has been examined, and we have implemented a parallel I/O strategy in the code for the Lustre filesystem. |
Author(s):
Guo, Xiaohu, Presenter STFC Daresbury Laboratory
Kramer, Stephan Imperial College London
Ashworth, Mike HPCX Consortium (HPCX)
Gorman, Gerard Imperial College London
Piggott, Matthew Imperial College London
Sunderland, Andrew HPCX Consortium (HPCX)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: XT System Reliability: Metrics, Trends, and Actions
Abstract: In 2009, the XT product family saw a significant improvement in overall reliability as measured by Cray’s support organization. This paper will discuss the reliability trends that have been observed and the main reasons for the improvements. We will also discuss the tools used to collect the field data, the metrics generated by Cray to evaluate XT product reliability and the actions taken as a result of this analysis. |
Author(s):
Johnson, Steve, Presenter Cray Inc.
|
Suggested Technical Category:
System Operations
|
|
Title: Multi-Core Aware Performance Optimization of Halo Exchanges in Ocean Simulations
Abstract: The advent of multi-core brings new opportunities for performance optimization in MPI codes. For example, the cost of performing a halo exchange in a finite-difference simulation can be reduced by choosing a partition into sub-domains that takes full advantage of the faster shared-memory mechanisms available for MPI communication between tasks on the same node. We have implemented these ideas in the Proudman Oceanographic Laboratory Coastal-Ocean Modelling System, and find that multi-core aware optimizations can offer significant performance benefit, especially on hex-core systems. |
Author(s):
Pickles, Stephen, Presenter STFC Daresbury Laboratory
|
Suggested Technical Category:
User Code Optimization
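The partitioning point in the halo exchange abstract above can be illustrated by counting how many neighbour exchanges cross a node boundary for different placements of MPI tasks on nodes. This Python sketch is an assumption-laden toy, not the POLCOMS implementation: it assumes a non-periodic 2D grid of tasks, nearest-neighbour (4-point) halos, and nodes that each host a rectangular `bx` by `by` block of tasks.

```python
def off_node_links(px, py, bx, by):
    """Count neighbouring task pairs that sit on different nodes.

    px, py: task grid dimensions; bx, by: block of tasks placed on each node.
    Pairs on the same node can use faster shared-memory communication.
    """
    def node_of(i, j):
        return (i // bx, j // by)

    count = 0
    for i in range(px):
        for j in range(py):
            # Neighbour in the +x direction.
            if i + 1 < px and node_of(i, j) != node_of(i + 1, j):
                count += 1
            # Neighbour in the +y direction.
            if j + 1 < py and node_of(i, j) != node_of(i, j + 1):
                count += 1
    return count


# 144 tasks on hypothetical 12-core nodes: strips versus compact blocks.
strips = off_node_links(12, 12, bx=12, by=1)
blocks = off_node_links(12, 12, bx=4, by=3)
```

In this toy case the compact 4 by 3 placement leaves fewer than half as many exchanges crossing the network as the 12 by 1 strips.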
|
|
Title: Scaling Applications on Cray XT Systems
Abstract: In this tutorial we will present tools and techniques for application performance tuning on the Cray XT system, with focus on multi-core processors. Attendees will learn about the Cray XT architecture and its programming environment. They will have an initial understanding of potential causes of application performance bottlenecks, and how to identify some of these bottlenecks using the Cray Performance tools. In addition, attendees will learn advanced techniques to deal with scaling problems and how to access the on-line documentation for user help. Attendees will also have some exposure to the Cray debugging support tools, which provide innovative techniques to debug applications at scale. |
Author(s):
DeRose, Luiz, Presenter Cray Inc.
Levesque, John, Presenter Cray Inc.
Moench, Bob, Presenter Cray Inc.
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
This is a proposal for a morning tutorial.
|
|
Title: The Cray Programming Environment: Current Status and Future Directions
Abstract: The Cray Programming Environment has been designed to address issues of scale and complexity of high end HPC systems. Its main goal is to hide the complexity of the system, such that applications can achieve the highest possible performance from the hardware. In this talk I will present the recent activities and future directions of the Cray Programming Environment, which consists of state of the art compiler, tools, and libraries, supporting a wide range of programming models. |
Author(s):
DeRose, Luiz, Presenter Cray Inc.
|
Suggested Technical Category:
Programming Environment
|
|
Title: Jaguar-The World's Most Powerful Computer System
Abstract: At the SC'09 conference in November 2009, Jaguar was crowned as the world's fastest computer by the web site www.Top500.org. In this paper, we will describe Jaguar, present results from a number of benchmarks and applications, and talk about future computing in the Oak Ridge Leadership Computing Facility. |
Author(s):
Bland, Arthur, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Joint Session
Joint Session, Tutorial or Other
Technical Category suggested:
Major Systems
|
|
Title: General Purpose Timing Library (GPTL): A Tool for Characterizing Performance of Parallel and Serial Applications
Abstract: GPTL is an open source profiling library that reports a variety of performance statistics. Target codes may be parallel via threads and/or MPI. The code regions to be profiled can be hand-specified by the user, or GPTL can define them automatically at function-level granularity if the target application is built with an appropriate compiler flag. Output is presented in a hierarchical fashion that preserves parent-child relationships of the profiled regions. If the PAPI library is available, GPTL utilizes it to gather hardware performance counter data. GPTL built with PAPI support is installed on the Jaguar machine at ORNL. |
Author(s):
Rosinski, James, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tools
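The hierarchical, parent-child timing output described in the GPTL abstract above can be mimicked in a few lines of Python. This is a conceptual sketch only and does not use or reproduce GPTL's interface; `RegionTimer` and its methods are hypothetical names, and regions here are hand-specified rather than inserted by a compiler.

```python
import time
from contextlib import contextmanager


class RegionTimer:
    """Accumulates wall-clock time per region, keyed by the nesting path."""

    def __init__(self):
        self.totals = {}   # maps a tuple such as ("main", "solve") to seconds
        self._stack = []   # names of the regions currently open

    @contextmanager
    def region(self, name):
        self._stack.append(name)
        path = tuple(self._stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[path] = self.totals.get(path, 0.0) + elapsed
            self._stack.pop()

    def report(self):
        """Return one line per region, indented by nesting depth."""
        lines = []
        for path in sorted(self.totals):
            indent = "  " * (len(path) - 1)
            lines.append(f"{indent}{path[-1]}: {self.totals[path]:.6f} s")
        return lines


timer = RegionTimer()
with timer.region("main"):
    with timer.region("solve"):
        sum(i * i for i in range(10000))
print("\n".join(timer.report()))
```

Keying the totals by the full path, rather than by region name alone, is what keeps the same routine distinguishable when it is called from different parents.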
|
|
Title: Performance Analysis of Pure MPI Versus MPI+OpenMP for Jacobi Iteration and a 3D FFT on the Cray XT5
Abstract: Today many high performance computers are collections of shared memory compute nodes with each compute node having one or more multi-core processors. When writing parallel programs for these machines, one can use pure MPI or various hybrid approaches using MPI and OpenMP. Since OpenMP threads are lighter weight than MPI processes, one would expect that hybrid approaches will achieve better performance and scalability than pure MPI. In practice this is not always the case. This paper investigates the performance and scalability of pure MPI versus hybrid MPI+OpenMP for Jacobi iteration and a 3D FFT on the Cray XT5. |
Author(s):
Weiss, Olga Iowa State University
Luecke, Glenn, Presenter Iowa State University
|
Suggested Technical Category:
Programming Environment
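For readers unfamiliar with the Jacobi iteration used as a benchmark in the abstract above, here is a minimal serial Python sketch for the 1D Laplace equation with fixed end values. It is not the authors' benchmark code; the function name and the 1D setting are simplifications chosen for illustration.

```python
def jacobi_1d(u, iterations):
    """Apply Jacobi sweeps to a 1D grid; the two end values are held fixed.

    Each interior point is replaced by the average of its neighbours,
    always reading from the previous sweep's values.
    """
    u = list(u)
    for _ in range(iterations):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u


# Boundary values 0 and 1 relax towards a straight line between them.
solution = jacobi_1d([0.0, 0.0, 0.0, 0.0, 1.0], iterations=200)
```

Because every update reads only the previous sweep, the loop over `i` is independent across points: it is the part that would be divided among OpenMP threads within a node, or among MPI processes with an exchange of boundary values after each sweep.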
|
|
Title: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms
Abstract: Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper, we will examine the effect of the choice of programming model upon performance and overall memory usage. We will study how to make efficient use of the memory system and explore the advantages and disadvantages of MPI, OpenMP, and UPC on the Cray XT5 multicore platforms for several synthetic and application benchmarks. |
Author(s):
Shan, Hongzhang National Energy Research Scientific Computing Center (NERSC)
Shalf, John National Energy Research Scientific Computing Center (NERSC)
Wright, Nick National Energy Research Scientific Computing Center (NERSC)
Jin, Haoqiang NAS Systems Division (NAS)
Koniges, Alice, Presenter National Energy Research Scientific Computing Center (NERSC)
Min, Seung-Jai Lawrence Berkeley National Lab
|
Suggested Technical Category:
Programming Environment
|
|
Title: MPI Queue Characteristics of Large-scale Applications
Abstract: Applications running at scale have varying communication characteristics. By employing the PERUSE introspection interface of Open MPI, this paper evaluates several large-scale simulations running production-level input data-sets on the Jaguar installation at ORNL. The maximum number of queued messages, the average duration of unexpected receives, and late sender and receiver information are presented as a function of job size. |
Author(s):
Keller, Rainer, Presenter Oak Ridge National Laboratory (ORNL)
Graham, Richard L. Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tools
|
|
Title: Tools, Tips and Tricks for Managing Cray XT Systems
Abstract: Managing large complex systems requires processes beyond what is taught in vendor training classes. Many sites must manage multiple systems from different vendors. This paper covers a collection of techniques to enhance the usability, reliability and security of Cray XT systems. A broad range of activities, from complex tasks like security, integrity and environmental checks of the Cray Linux Environment, to relatively simple things like making 'rpm -qa' available to users will be discussed. Some techniques will be XT specific, such as monitoring L0/L1 environment, but others will be generic, such as security tools adapted from other systems and re-spun as necessary for the XT Cray Linux Environment. |
Author(s):
Carlson, Kurt, Presenter Arctic Region Supercomputing Center (ARSC)
|
Suggested Technical Category:
System Operations
|
|
Title: Collecting Application-Level Job Completion Statistics
Abstract: Job failures are common on large high performance computing systems, but logging, analyzing, and understanding the low-level error messages can be difficult on Cray XT systems. This paper describes a set of tools to log and analyze applications in real-time as they run on the system. By obtaining more information about typical error scenarios, system administrators can work to resolve the underlying issues and educate users. |
Author(s):
Ezell, Matthew, Presenter National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
System Operations
|
|
Title: ALPS, Topology, and Performance
Abstract: Application performance can be improved or reduced depending on the compactness of the set of nodes on which an application is placed (as demonstrated convincingly by PSC at a recent CUG). This paper describes the approach to placements that ALPS now uses based on the underlying node topology, the reasons for this approach, and the variations that sites can use to optimize for their specific machine and workload. |
Author(s):
Albing, Carl, Presenter Cray Inc.
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Dynamic Shared Libraries and Virtual Cluster Environment
Abstract: Cray is expanding system functionality to support Dynamic Shared Libraries (DSL) on compute nodes, and the ability to run a wide range of packaged ISV applications on compute nodes. Built upon Data Virtualization Service (DVS), a more standard Linux runtime environment is distributed across the system by the DSL capability via DVS Server nodes to the Compute Node clients. The CLE Virtual Cluster Environment (VCE) adds a further layer of functionality, by supporting natively installed and executed ISV applications. This three component solution allows customers to meet a wide range of runtime environment demands with limited impact and complexity while increasing productivity. |
Author(s):
Schildt, Jason, Presenter Cray Inc.
|
Suggested Technical Category:
Tools
|
|
Title: Resiliency Features in the Next Generation Cray Gemini Network
Abstract: As system sizes scale to ever increasing numbers of nodes and network links, network failures become an increasingly important problem to address. With its next generation high speed network (code named Gemini), Cray will introduce a number of new resiliency features in this area. This paper discusses these features, including network link failover, and compares them with other, more familiar network technologies such as Ethernet and InfiniBand. |
Author(s):
Godfrey, Forest, Presenter Cray Inc.
|
Suggested Technical Category:
Architecture
|
|
Title: Scalable Performance Analysis of Large-scale Parallel Applications on Cray XT Systems with Scalasca
Abstract: The open-source Scalasca toolset [www.scalasca.org] supports integrated runtime summarization and automated trace analysis on a diverse range of HPC computer systems. An HPC-Europa2 visit to EPCC in 2009 resulted in significantly enhanced support for Cray XT systems, particularly the auxiliary programming environments and hybrid OpenMP/MPI. Combined with its previously demonstrated extreme scalability and portable performance analyses comparison capabilities, Scalasca has been used to analyse and tune numerous key applications (and benchmarks) on Cray XT and other PRACE prototype systems, from which experience with a representative selection is reviewed. |
Author(s):
Wylie, Brian, Presenter Juelich Supercomputing Centre
|
Suggested Technical Category:
Tools
|
|
Title: Using I/O Servers to Improve Performance on Cray XT Technology
Abstract: Amdahl's Law proposes that parallel codes are combinations of parallel and serial tasks. In many cases these tasks are inherently parallel and can be decomposed and performed asynchronously. Each task operates on a dedicated subset of processors, with highly scalable tasks operating on very large numbers of processors and less scalable tasks (like I/O) operating on a smaller number. By moving to this Multiple Instruction Multiple Data paradigm, codes can achieve greater parallel efficiency and scale further. This paper specifically addresses the implementation and experiences of adapting several codes important to HECToR to offload writing output data onto a set of dedicated server processors. |
Author(s):
Edwards, Thomas, Presenter Cray Inc.
Roy, Kevin Cray Inc.
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Petascale Debugging
Abstract: The need for debugging at scale is well known—yet machine sizes have raced ahead of the levels reachable by debuggers for many years. This paper outlines major development of Allinea's DDT debugging tool to introduce production-grade petascale debugging on the Oak Ridge Jaguar XT5 system. The resulting scalable architecture is raising the bar of usability and performance in a debugger by multiple orders of magnitude—and has already achieved record 225,000 core debugging at ORNL. |
Author(s):
Lecomber, David, Presenter Allinea Software
January, Chris Allinea Software
O'Connor, Mark Allinea Software
|
Suggested Technical Category:
Tools
|
|
Title: PRACE Application Enabling Work at EPCC
Abstract: The Partnership for Advanced Computing in Europe (PRACE) created the prerequisites for a pan-European HPC service, consisting of several tier-0 centres. PRACE's aim has now moved to the implementation of this service. The now completed work looked into all aspects of the pan-European service, including the contractual and organisational issues, the system management, application enabling and future computer technologies. This talk discusses the work done by EPCC on the application codes HELIUM (from Queen's University Belfast, UK) and NAMD (from University of Illinois at Urbana Champaign, US) with a particular focus on the work carried out for the PRACE prototype Louhi, a Cray XT5 at CSC in Finland. We will also include a performance comparison with non-Cray systems available to PRACE. |
Author(s):
Guo, Xu, Presenter EPCC (EPCC)
Hein, Joachim EPCC (EPCC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Imperative Recovery for Lustre
Abstract: Recovery times for Lustre failover are mainly a function of the overriding bulk data timeout because clients must time out to a server twice before initiating contact with its backup. As a result, failover completion times exceeding ten minutes are common. During failover and recovery, all I/O operations stall and the long duration can lead to job timeouts, poor system utilization, and increased administrator load. To improve overall failover times we are implementing Imperative Recovery, the framework by which Lustre can initiate and finish failover without waiting for long timeouts. Imperative Recovery directs clients to switch server connections based on automatic processing of node health data. With these changes and Version Based Recovery, it is possible to begin recovery quickly, reducing overall failover times to a few minutes. This paper discusses Imperative Recovery from a system perspective and characterizes the speedup achieved. |
Author(s):
Spitz, Cory, Presenter Cray Inc.
Henke, Nic Cray Inc.
Horn, Chris Cray Inc.
|
Suggested Technical Category:
Mass Storage
|
|
Title: Towards a European Training Network in Computational Science
Abstract: The implementation phase of The Partnership for Advanced Computing in Europe (PRACE) project will develop and maintain a European training network in the field of computational science. Its key ingredients are solid contacts between the partner organisations and European research centres, as well as establishing new links to universities. In this talk, I will review the completed training-related activities of the preparatory phase of PRACE as well as plans for the implementation phase. |
Author(s):
Manninen, Pekka, Presenter CSC - Scientific Computing Ltd. (CSC)
Turunen, Ari CSC - Scientific Computing Ltd. (CSC)
|
Suggested Technical Category:
Training
|
|
Title: XGC1: Performance on the 8-core and 12-core Cray XT5 Systems at ORNL
Abstract: The XGC1 code is used to model multiscale tokamak plasma turbulence dynamics in realistic edge geometry. In June 2009, XGC1 demonstrated nearly linear weak and strong scaling out to 150,000 cores on a Cray XT5 with 8-core nodes when solving problems of relevance to running experiments on the ITER tokamak. Here we compare performance, and discuss further performance optimizations, when running XGC1 on an XT5 with 12-core nodes on up to 224,000 cores. |
Author(s):
Worley, Patrick, Presenter Oak Ridge National Laboratory (ORNL)
Adams, Mark Columbia University
D'Azevedo, Eduardo Oak Ridge National Laboratory (ORNL)
Chang, C-S New York University
Ku, Seung-Hoe New York University
McCurdy, Collin Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: RAVEN: RAS Data Analysis Through Visually Enhanced Navigation
Abstract: Supercomputer RAS data contain various signatures regarding system status, and thus are routinely examined to detect and diagnose faults. However, due to the voluminous logs generated during faulty situations, a comprehensive investigation that requires comparisons of different types of RAS logs over both spatial and temporal dimensions is often beyond the capacity of human operators, which leaves a cursory look as the only feasible option. As an effort to better embrace informative but huge supercomputer RAS data in a fault diagnosis/detection process, we present a GUI tool called RAVEN that visually overlays various types of RAS logs on a physical system map where correlations between different fault types can be easily observed in terms of their quantities and locations at a given time. RAVEN also provides an intuitive fault navigation mechanism that helps examine logs by clustering them by their common locations, types, or user applications. By tracing down notable fault patterns reflected on the map and their clustered logs, and superimposing user application data, RAVEN, which has been adopted at the National Institute for Computational Sciences (NICS) at the University of Tennessee, identified root causes of several system failures logged in Kraken XT5. |
Author(s):
Park, Byung-Hoon Oak Ridge National Laboratory (ORNL)
Heo, Junseong National Institute for Computational Sciences (NICS)
Kora, Guruprasad Oak Ridge National Laboratory (ORNL)
Geist, Al, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Environmental Monitoring
|
|
Title: Application Acceleration on Current and Future Cray Platforms
Abstract: Application codes in a variety of areas are being updated for performance on the latest architectures. We describe current bottlenecks and performance improvement areas for applications including plasma physics, chemistry related to carbon capture and sequestration, and material science. |
Author(s):
Koniges, Alice, Presenter National Energy Research Scientific Computing Center (NERSC)
Kim, Jihan National Energy Research Scientific Computing Center (NERSC)
Preissl, Robert National Energy Research Scientific Computing Center (NERSC)
Eder, David, Presenter Lawrence Livermore National Laboratory
Shalf, John National Energy Research Scientific Computing Center (NERSC)
Velimir, Mlaker Lawrence Livermore National Laboratory
Fisher, Aaron Lawrence Livermore National Laboratory
Masters, Nathan Lawrence Livermore National Laboratory
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Automatic Iterative Optimization of Parallel Applications
Abstract: Manual software optimization is effective, but also time-consuming, and can thus benefit from complementary automatic optimization schemes. This paper describes a novel cross-platform framework that is able to optimize parallel applications by tuning three sets of parameters: compiler options, environment variables and internal program parameters. The optimization is carried out using a genetic algorithm, where the trial simulations may be run in parallel allowing the optimization algorithm to scale. The performance of this framework is assessed on a Cray XT5 by optimizing both real world applications, as well as well-known synthetic benchmarks such as the High-Performance Linpack (HPL) benchmark. The results show that our optimization framework increases the performance of the test cases significantly. |
Author(s):
von Alfthan, Sebastian, Presenter CSC - Scientific Computing Ltd. (CSC)
Lehto, Olli-Pekka CSC - Scientific Computing Ltd. (CSC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Improving the Performance of COSMO-CLM
Abstract: The COSMO-Model, originally developed by Deutscher Wetterdienst, is a non-hydrostatic regional atmospheric model which can be used for numerical weather prediction and climate simulations and is now in use by a number of weather services for operational forecasting (e.g. MeteoSwiss). One current software engineering goal is to improve its scaling characteristics on multicore architectures by making it a hybrid MPI-OpenMP code. We will present hybridization strategies for different components of the model, show some first performance results, and discuss the impact on further development of the model. |
Author(s):
Cordery, Mathew, Presenter CSCS-Swiss National Supercomputing Centre (CSCS) CSCS
Sawyer, Will CSCS-Swiss National Supercomputing Centre (CSCS) CSCS
Schaettler, Ulrich CSCS-Swiss National Supercomputing Centre (CSCS) Deutscher Wetterdienst
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Overview and Performance Evaluation of Cray LibSci Products
Abstract: This talk serves as both an introduction to the Cray scientific library suite and as a tutorial on obtaining advanced performance with applications that utilize scientific libraries. The talk will include a thorough and frank performance evaluation of all scientific library products on Cray XT systems, including dense kernels on single core and multiple cores, dense linear solvers and eigensolvers in serial and parallel, serial and distributed Fourier Transforms and Sparse kernels within sparse iterative solvers. The emphasis will be on usage and how to increase performance by using different algorithms or libraries, better configurations, or advanced controls of the scientific libraries. |
Author(s):
Tate, Adrian, Presenter Cray Inc.
|
Suggested Technical Category:
Libraries
Joint Session, Tutorial or Other
Technical Category suggested:
it makes sense to join this talk with any other internal PE software talks such as compilers, tools, PE overview etc.
|
|
Title: Evaluation of Productivity and Performance Characteristics of CCE CAF and UPC Compilers
Abstract: The Co-Array Fortran (CAF) and Unified Parallel C (UPC) functional compilers available with the Cray Compiler Environment (CCE) on the Cray XT5 platform offer an integrated framework for code development and execution for the Partitioned Global Address Space (PGAS) programming paradigm together with the message-passing MPI and shared-memory OpenMP programming models. Using micro-benchmarks, conformance test cases and micro-kernels of representative scientific calculations, we attempt to evaluate the following characteristics of the CCE PGAS compilers: (1) usability of the framework for code development and execution; (2) completeness and integrity of code generation; (3) efficiency of the generated code, particularly usage of the communication layer (GASNet on SeaStar2); and (4) tools availability for performance measurement and diagnostics. Our initial results show that the current version of the compiler provides a highly productive environment for CAF or UPC code development on our target Cray XT5 platform. At the same time, however, we observe that the code transformation and generation processes are unable to aggregate remote memory accesses for simple access patterns, causing significant slowdown. We will compare and contrast code generation with two multi-platform PGAS compilers: the Berkeley UPC environment that uses the Intrepid UPC compiler and the g95 CAF compiler extensions. In the full paper version, we would also include comparative results using the Rice CAF 2.0 compiler, if it becomes available in due time. |
Author(s):
Alam, Sadaf, Presenter CSCS-Swiss National Supercomputing Centre (CSCS)
Cordery, Matthew CSCS-Swiss National Supercomputing Centre (CSCS)
Sawyer, William CSCS-Swiss National Supercomputing Centre (CSCS)
Stitt, Tim CSCS-Swiss National Supercomputing Centre (CSCS)
Stringfellow, Neil CSCS-Swiss National Supercomputing Centre (CSCS)
|
Suggested Technical Category:
Compilers
|
|
Title: An Alliance for Computing at the Extreme Scale
Abstract: Los Alamos and Sandia National Laboratories have formed a new high performance computing center, the Alliance for Computing at the Extreme Scale (ACES). The two labs will jointly architect, develop, procure and operate capability systems for DOE’s Advanced Simulation and Computing Program. This presentation will discuss (1) a petascale production capability system, Cielo, that will be deployed in late 2010, (2) a technology roadmap for exascale computing and (3) a new partnership with Cray on advanced interconnect technologies. |
Author(s):
Dosanjh, Sudip, Presenter Sandia National Laboratories (SNLA)
Morrison, John, Presenter Los Alamos National Laboratory
Ang, James Sandia National Laboratories (SNLA)
Koch, Ken Los Alamos National Laboratory
|
Suggested Technical Category:
Architecture
|
|
Title: File System Monitoring as a Window Into User I/O Requirements
Abstract: The effective management of HPC I/O resources requires an understanding of user requirements, so the National Energy Research Scientific Computing Center (NERSC) annually surveys its project leads for their anticipated needs. With the advent of detailed monitoring on the Lustre parallel file system of the Franklin Cray XT, it becomes possible to compare actual experience with the expectations presented in the surveys. A correlation of the Lustre Monitoring Tool (LMT) data with job log statistics reveals I/O behavior on a per-project basis. This feedback for both the users and the center enhances NERSC's ability to manage and provision Franklin's I/O subsystem as well as to plan for future I/O requirements. |
Author(s):
Uselton, Andrew, Presenter National Energy Research Scientific Computing Center (NERSC)
Antypas, Katie National Energy Research Scientific Computing Center (NERSC)
Ushizima, Daniela Lawrence Berkeley National Lab
Sukharev, Jefferey University of California, Davis
|
Suggested Technical Category:
Mass Storage
|
|
Title: Correlating Log Messages for System Diagnostics
Abstract: In large-scale computing systems the sheer volume of logs generated has challenged the interpretation of log messages for debugging and monitoring purposes. For a non-trivial event, the Jaguar XT5 at the Oak Ridge Leadership Computing Facility, with more than eighteen thousand compute nodes, would generate a few hundred thousand log entries in less than a minute. Determining the root cause of such events requires analyzing and understanding these log messages. Most often, these log messages are best understood when they are interpreted collectively rather than being read as individual messages. In this paper, we present our approach to interpreting log messages by identifying commonalities and grouping them into clusters. Given a set of log messages within a time interval, we parse and group the messages based on source, target, and/or error type, and correlate the messages with hardware and application information. We monitor the XT5’s console, netwatch and sys logs and show how such grouping of log messages helps in detecting system events. By intelligent grouping and correlation of events from multiple sources we are able to provide system administrators with meaningful information in a concise format for root cause analysis. |
Author(s):
Gunasekaran, Raghul Oak Ridge National Laboratory (ORNL)
Park, Byung Oak Ridge National Laboratory (ORNL)
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Geist, Al Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Operations
|
|
Title: Improving the Productivity of Scalable Application Development with TotalView
Abstract: Scientists and engineers who set out to solve grand computing challenges need TotalView at their side. The TotalView debugger provides a powerful and scalable tool for analyzing, diagnosing, debugging and troubleshooting a wide variety of different problems that might come up in the process of such achievements. These teams, and teams of scientists pursuing a wide range of computationally complex problems on Cray XT systems are frequently diverse and geographically distributed. These groups work collaboratively on complex applications in a computational environment that they access through a batch resource management system. This talk will explore the productivity challenges faced by scientists and engineers in this environment -- highlighting both long standing (but perhaps unfamiliar) and recently introduced capabilities that TotalView users on Cray can take advantage of to boost their productivity. The list of capabilities will include the CLI, subset attach, Remote Display Client, TVScript, MemoryScape's reporting, and ReplayEngine. |
Author(s):
Gottbrath, Chris, Presenter TotalView Technologies
|
Suggested Technical Category:
Programming Environment
|
|
Title: Lessons Learned in Deploying the World's Largest Scale Lustre File System
Abstract: The Spider parallel file system at Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) is the world’s largest scale Lustre file system. It has nearly 27,000 file system clients, 10 PB of capacity, and over 240 GB/s of demonstrated I/O bandwidth. In full-scale production for over 6 months, Spider provides a high performance parallel I/O environment to a diverse portfolio of computational resources. These range from the high end, multi-Petaflop Jaguar XT5, the mid-range, 260 Teraflop Jaguar XT4, to the low end, with numerous systems supporting development, visualization, and data analytics. Throughout this period we have had a number of critical design points reinforced while learning a number of lessons on designing, deploying, managing, and using a system of this scale. This paper details our operational experience with the Spider file system, focusing on observed reliability (including MTTI and MTTF), manageability, and system performance under a diverse workload. |
Author(s):
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Dillow, David Oak Ridge National Laboratory (ORNL)
Hill, Jason Oak Ridge National Laboratory (ORNL)
Leverman, Dustin Oak Ridge National Laboratory (ORNL)
Maxwell, Don Oak Ridge National Laboratory (ORNL)
Miller, Ross Oak Ridge National Laboratory (ORNL)
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Simmons, James Oak Ridge National Laboratory (ORNL)
Wang, Feiyi Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Mass Storage
|
|
Title: What is a 200,000 CPUs Petaflop Computer Good For (a Theoretical Chemist Perspective)?
Abstract: We describe the efforts undertaken to efficiently parallelize the computational chemistry code NWChem on the Cray XT hardware using the Global Arrays/ARMCI middleware. We show how we can now use 200K+ processors to address complex scientific problems. |
Author(s):
Apra, Edoardo, Presenter Oak Ridge National Laboratory (ORNL)
Tipparaju, Vinod Oak Ridge National Laboratory (ORNL)
Olson, Ryan Cray Inc.
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Reducing Application Runtime Variability on Jaguar XT5
Abstract: Operating system (OS) noise is defined as interference generated by the OS that prevents the compute core from performing “useful” work. Compute node kernel daemons, network interfaces, and other OS related services are major sources of such interference. This interference on individual compute cores can vary in duration and frequency and can cause de-synchronization (jitter) in collective communication tasks and thus results in variable (degraded) overall parallel application performance. This behavior is more observable in large-scale applications using certain types of collective communication primitives, such as MPI_Allreduce. This paper presents our efforts towards reducing the overall effect of OS noise on our large-scale parallel applications. Our tests were performed on the quad-core Jaguar, the Cray XT5 at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF). At the time of these tests, Jaguar was a 1.4 PFLOPS supercomputer with 144,000 compute cores and 8 cores per node. The technique we used was to aggregate and merge all OS noise sources onto a single compute core for each node. The scientific application was then run on the remaining seven cores in each node. Our results show that we were able to improve the MPI_Allreduce performance by two orders of magnitude and to boost the Parallel Ocean Program (POP) performance over 30% using this technique. |
Author(s):
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Wang, Feiyi Oak Ridge National Laboratory (ORNL)
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Dillow, Dave Oak Ridge National Laboratory (ORNL)
Miller, Ross Oak Ridge National Laboratory (ORNL)
Maxwell, Don Oak Ridge National Laboratory (ORNL)
Becklehimer, Jeff Cray Inc.
Larkin, Jeff Cray Inc.
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Franklin Job Completion Analysis
Abstract: The NERSC Cray XT4 machine Franklin has been in production for 3000+ users since October 2007, with about 1800 jobs run each day. There has been an on-going effort to better understand how well these jobs run, whether failed jobs are due to application errors or system issues, and to further reduce system related job failures. In this paper, we will talk about the progress we made in tracking job completion status, in identifying job failure root causes, and in expediting resolution of job failures, such as hung jobs, that are caused by system issues. In addition, we will present some Cray software design enhancements we requested to help us track application progress and identify errors. |
Author(s):
He, Yun (Helen), Presenter National Energy Research Scientific Computing Center (NERSC)
Lin, Hwa-Chun Wendy, Presenter National Energy Research Scientific Computing Center (NERSC)
Yang, Woo-Sun National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Consulting
Joint Session, Tutorial or Other
Technical Category suggested:
note: The more appropriate category for this paper would be: "User Support" (which is missing) under "Systems Support" category. Thanks!
|
|
Title: An Overview of the Chapel Programming Language and Implementation
Abstract: Chapel is a new parallel programming language under development at Cray Inc. as part of the DARPA High Productivity Computing Systems (HPCS) program. Chapel has been designed to improve the productivity of parallel programmers working on large-scale supercomputers as well as small-scale, multicore computers and workstations. It aims to vastly improve programmability over current parallel programming models while supporting performance and portability at least as good as today's technologies. In this tutorial, we will present an introduction to Chapel, from context and motivation to a detailed description of Chapel via many example computations. This tutorial will focus on writing Chapel programs for both multi-core and distributed-memory computers. We will explore the optimizations added to the Chapel implementation this past year that helped with the most recent Chapel HPCC entry. |
Author(s):
Deitz, Steve, Presenter Cray Inc.
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
Language / Programming Environment / Compiler Tutorial
|
|
Title: Interactions Between Application Communication and I/O Traffic on the Cray XT High Speed Network
Abstract: The massive size of modern leadership computing resources often leads to the discovery of application performance bottlenecks not seen at smaller scales. Many of these performance bottlenecks originate within individual applications; however, recent application testing on a Cray XT5 indicates that an application's I/O pattern can negatively impact the communication performance of another application via interaction over the shared high speed network (HSN). This study seeks to identify and to quantify such interactions on the HSN of Kraken, the Cray XT5 operated by the National Institute for Computational Sciences (NICS). |
Author(s):
Brook, R. Glenn, Presenter National Institute for Computational Sciences (NICS)
Crosby, Lonnie D. National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Networking
|
|
Title: Five Powerful Chapel Idioms
Abstract: The Chapel parallel programming language, under development at Cray Inc., has the potential to deliver high performance to more programmers with less effort than current practices provide. This is especially the case with the many-core architectures that are already becoming more and more prevalent. This paper presents five reasons why: 1. Chapel supports easy-to-use asynchronous and synchronous remote tasks, 2. Chapel supports local and remote transactions, 3. Chapel supports simple data-parallel abstractions when applicable, 4. Chapel supports user-defined data distributions, and 5. Chapel supports arbitrarily nested parallelism. |
Author(s):
Deitz, Steve, Presenter Cray Inc.
Chamberlain, Brad Cray Inc.
Choi, Sung-Eun Cray Inc.
Iten, David Cray Inc.
Prokowich, Lee Cray Inc.
|
Suggested Technical Category:
Programming Environment
|
|
Title: Thermodynamics of Magnetic Systems from First Principles: WL-LSMS
Abstract: We describe a method to combine classical thermodynamic Monte Carlo calculations (the Wang-Landau method) with a first principles electronic structure calculation, specifically our locally self-consistent multiple scattering (LSMS) code. The combined code shows superb scaling behavior on massively parallel computers and is able to calculate the transition temperature of Fe without external parameters. The code was the recipient of the 2009 Gordon Bell Prize for peak performance. |
Author(s):
Eisenbach, Markus, Presenter Oak Ridge National Laboratory (ORNL)
Nicholson, Donald Oak Ridge National Laboratory (ORNL)
Brown, Gregory Florida State University
Zhou, Chengang J P Morgan Chase & Co
Larkin, Jeff Cray Inc.
Schulthess, Thomas CSCS-Swiss National Supercomputing Centre (CSCS)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Parallelism in System Tools
Abstract: The Cray XT, when employed in conjunction with the Lustre filesystem, provides the ability to generate huge amounts of data in the form of many files. This is accommodated by satisfying the requests of multiple Lustre clients in parallel. In contrast, a single service node (Lustre client) cannot provide timely management for such datasets. Consequently, as the dataset enters the 10+ TB range and/or hundreds of thousands of files, using traditional UNIX tools like cp, tar, or "find . -exec ... ;" to manage these datasets substantially impacts user productivity. For example, it would take about 12 hours to copy a 10 TB dataset from the service node via cp if dedicated resources were employed. In general, it is not practical to schedule dedicated resources for a data copy and, as a result, a typical duty factor of 4X is incurred. This means that, in practice, it would take 48 hours to perform a serial copy of a 10 TB dataset. Over the next three to four years, datasets are likely to grow by a factor of 4X. At that point, the simple copy of a dataset may be expected to take over a week and represent a significant impediment to the investigation of science. In this paper, we introduce the Lustre User Toolkit for Cray XT, developed at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF), and demonstrate that, by optimizing and parallelizing system tools, a performance increase of an order of magnitude or more can be achieved, thereby reducing or eliminating the bottleneck. The conclusion is self-evident: parallelism in system tools is vital to managing large datasets. |
Author(s):
Matney, Sr., Kenneth, Presenter Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Systems Support - Tools
|
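The copy-time figures quoted in the "Parallelism in System Tools" abstract above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 10^6 MB), a constant sustained copy rate, and copy time that scales linearly with dataset size. The function and constant names are hypothetical and do not come from the paper.

```python
# Back-of-the-envelope check of the serial copy times quoted in the abstract.
# Assumptions (not from the paper): decimal units, constant copy rate,
# and copy time that grows linearly with dataset size.

DATASET_TB = 10        # dataset size quoted in the abstract
DEDICATED_HOURS = 12   # serial cp time with dedicated resources
DUTY_FACTOR = 4        # slowdown on shared (non-dedicated) resources
GROWTH_FACTOR = 4      # projected dataset growth over three to four years


def implied_rate_mb_per_s(dataset_tb, hours):
    """Sustained rate (MB/s) implied by copying dataset_tb in the given hours."""
    return dataset_tb * 1e6 / (hours * 3600)


def shared_copy_hours(dedicated_hours, duty_factor, growth=1):
    """Wall-clock hours for a serial copy on shared resources."""
    return dedicated_hours * duty_factor * growth


rate = implied_rate_mb_per_s(DATASET_TB, DEDICATED_HOURS)
today = shared_copy_hours(DEDICATED_HOURS, DUTY_FACTOR)
future = shared_copy_hours(DEDICATED_HOURS, DUTY_FACTOR, GROWTH_FACTOR)

print(f"implied serial rate: {rate:.0f} MB/s")
print(f"shared copy today:   {today} hours")
print(f"shared copy at 4X:   {future} hours ({future / 24:.0f} days)")
```

Under these assumptions the 48-hour figure follows directly from the 4X duty factor, and a further 4X growth in dataset size gives 192 hours, which is consistent with the abstract's "over a week".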
|
Title: Analyzing Multicore Characteristics for a Suite of Applications on an XT5 System
Abstract: In this paper, we will explore the performance of applications important to Sandia on an XT5 system with dual-socket nodes built from six-core AMD Istanbul processors. We will explore scaling as a function of the number of cores used on each node and determine the effective core utilization as core count increases. We will then analyze these results using profiling to better understand resource contention within and between nodes. |
Author(s):
Vaughan, Courtenay, Presenter Sandia National Laboratories (SNLA)
Doerfler, Douglas Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: External Services on the Cray XT5 System Hopper
Abstract: Cray External Service offerings, such as login nodes, data mover nodes, and file systems that are external to the main XT system, provide an opportunity to make Cray XT High Performance Computing resources more robust and accessible to end users. This paper will discuss our experiences using external services on Hopper, a Cray XT5 system at the National Energy Research Scientific Computing (NERSC) Center. It will describe the motivation for externalizing services, early design decisions, security issues, implementation challenges and production feedback from NERSC users. |
Author(s):
Antypas, Katie National Energy Research Scientific Computing Center (NERSC)
Butler, Tina National Energy Research Scientific Computing Center (NERSC)
Carter, Jonathan, Presenter National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Architecture
|
|
Title: The Evolution of a Petascale Application: Work on CHIMERA
Abstract: CHIMERA is a multi-dimensional radiation hydrodynamics code designed to study core-collapse supernovae. We will review several recent enhancements to CHIMERA designed to better exploit features of the Cray XT architecture, as well as some forward-looking work to take advantage of the next generation of Cray supercomputers. |
Author(s):
Messer, Bronson, Presenter Oak Ridge National Laboratory (ORNL)
Bruenn, Stephen Florida Atlantic University
Hix, Raph Oak Ridge National Laboratory (ORNL)
Mezzacappa, Anthony Oak Ridge National Laboratory (ORNL)
Blondin, John North Carolina State University
|
Suggested Technical Category:
User Code Optimization
|
|
Title: The Graph 500
Abstract: New large-scale informatics applications require radically different architectures from those optimized for 3D physics. The 3D physics community is represented in the Top 500 list by LINPACK, a single, simple, dense algebra benchmark. Informally, the Cray XMT performs significantly better than other known architectures on large-scale graph problems, which are a core informatics application kernel. The Graph 500 list, to be introduced at Supercomputing 2010, will formalize a single, unified graph benchmark for the informatics community to rally around and to precipitate innovation in the informatics space. This paper will discuss the need for this kind of benchmark, the benchmark itself, an initial set of results on a small subset of platforms (including XMT), and why those platforms are fundamentally different from other classes of supercomputer. |
Author(s):
Murphy, Richard, Presenter Sandia National Laboratories (SNLA)
Ang, Jim Sandia National Laboratories (SNLA)
Henrickson, Bruce Sandia National Laboratories (SNLA)
Rodrigues, Arun Sandia National Laboratories (SNLA)
Barrett, Brian Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Architecture
|
|
Title: Performance Monitoring Tools for Large Scale Systems
Abstract: Operating computing systems, file systems, and associated networks at unprecedented scale offers unique challenges for fault monitoring, performance monitoring and problem diagnosis. Conventional system monitoring tools are insufficient to process the increasingly large and diverse volume of performance and status log data produced by the world’s largest systems. In addition to the large data volume, the wide variety of systems employed by the largest computing facilities presents diverse information from multiple sources, further complicating analysis efforts. At leadership scale, new tool development is required to acquire, condense, correlate, and present status and performance data to systems staff for timely evaluation. This paper details a set of system monitoring tools developed by the authors and utilized by systems staff at Oak Ridge National Laboratory’s Leadership Computing Facility, including the Cray XT5 Jaguar. These tools include utilities to correlate I/O performance and event data with specific systems, resources, and jobs. Where possible, existing utilities are incorporated to reduce development effort and increase community participation. Future work may include additional integration among tools and implementation of fault-prediction tools. |
Author(s):
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Dillow, David Oak Ridge National Laboratory (ORNL)
Hill, Jason Oak Ridge National Laboratory (ORNL)
Miller, Ross Oak Ridge National Laboratory (ORNL)
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Maxwell, Don Oak Ridge National Laboratory (ORNL)
Wang, Feiyi Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Environmental Monitoring
|
|
Title: Multi-core Programming Paradigms and MPI Message Rates—A Growing Concern?
Abstract: The continued growth in per-node core count in high performance computing platforms has led the community to investigate alternatives to an MPI-everywhere programming environment. A hybrid programming environment, in which MPI is used for coarse-grained, inter-node parallelism and a threaded environment (pthreads, OpenMP, etc.) is used for fine-grained, intra-node parallelism, presents an appealing target for future applications. At the same time, memory and network bandwidth both continue to grow at a significantly slower pace than processor performance. This trend, combined with increased parallelism due to larger machine sizes, will drive applications away from the bandwidth-limited BSP model to one with a higher number of smaller messages, which avoids unnecessary memory-to-memory copies inside a single node. The increase in small message transfers requires a higher message rate from a single node. Current network designs rely on a number of tasks on a single node injecting messages into the network in order to achieve optimal message rates. This paper quantifies the impact of local process count on node-level message rate for Cray XT5 hardware. The results are an important metric in designing both MPI implementations and applications for the hybrid programming future. |
Author(s):
Hemmert, Scott, Presenter Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Networking
|
|
Title: Cray Debugging Support Tools for Petascale Applications
Abstract: As HPC systems have gotten ever larger, the amount of information associated with debugging a failing parallel application has grown beyond what the beleaguered applications developer has the time, resources, and wherewithal to analyze. With the release of the Cray Debugging Support package, Cray introduces several innovative methods of attacking this vexing problem. FTD (Fast Track Debugging) achieves debugging at fully optimized speeds. STAT (Stack Trace Analysis Tool) facilitates the evaluation and study of hung applications. ATP (Abnormal Termination Processing) captures a STAT-like view of applications that have taken a fatal trap. And Guard, the Cray comparative debugger, delivers an automated search for the location of program errors by comparing a working version of an application against a failing version. This paper describes and explores each of the above technologies. |
Author(s):
Moench, Bob, Presenter Cray Inc.
|
Suggested Technical Category:
Programming Environment
|
|
Title: Running Hadoop on a Cray XT System
Abstract: Hadoop is an open source implementation of the MapReduce programming model popularized by Google. Hadoop has been heavily adopted in the Web 2.0 community and is now making inroads in the scientific and research communities. The flexibility of the MapReduce programming model combined with the power of the Cray XT can impact the size and nature of scientific explorations. In this paper we will explain the motivations for using Hadoop and describe the steps to deploy the framework on a Cray XT. We will examine some of the configuration options and their impact on performance. We will compare the performance of several applications running in Hadoop on the Cray system with the performance on standard Hadoop deployments on clusters and Cloud systems. We will conclude with some assessment on the feasibility and efficacy of running Hadoop on HPC systems and future work. |
Author(s):
Canon, Shane, Presenter National Energy Research Scientific Computing Center (NERSC)
Ramakrishnan, Lavanya National Energy Research Scientific Computing Center (NERSC)
Jackson, Keith Lawrence Berkeley National Lab
Shalf, John National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Validating File System Permissions on Multi-OS Systems
Abstract: The Cray XT series of HPC computers presents the system security officer and system administrator with a range of operating systems (Linux, CNL, CVN), job launch (shell/exec, ALPS, yod) and file systems (UFS, NFS, Lustre, LibSysIO, DVS). Available open-source packages do not span this range of requirements. As the system integrator, Cray provides the fundamentals for validating that file system permissions are correctly enforced. However, due to Sandia's security requirements, we were forced to develop a software tool for checking POSIX permission handling across multiple combinations of OS's and file systems. This paper presents the architecture and design of a novel Lisp-based POSIX file system validation tool that uses multi-methods and object-oriented programming to validate tester-specified combinations of access patterns. |
Author(s):
Ballance, Robert, Presenter Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
System Operations
|
|
Title: FutureGrid: Design and Implementation of a National Grid Test-Bed
Abstract: Indiana University is leading the creation of a grid test-bed for the National Science Foundation with nine partner institutions. FutureGrid is a high performance grid test-bed that will allow scientists to work collaboratively to develop and test novel approaches to parallel, grid, and cloud computing. |
Author(s):
Hancock, David, Presenter Indiana University
von Laszewski, Gregor Indiana University
|
Suggested Technical Category:
Joint Session
Joint Session, Tutorial or Other
Technical Category suggested:
This is a high level overview of our NSF Track 2 award with technical details in networking, operations, architecture, and systems software. As a new CUG member you can place this in the most appropriate track.
|
|
Title: Overview of the Current and Future Cray CX Product Family
Abstract: Please join us for a detailed product briefing of an exciting new product from Cray. This new product will be a significant enhancement to the Cray portfolio, and will expand the range of capabilities and programming models available to our customers and prospects. |
Author(s):
Miller, Ian, Presenter Cray Inc.
|
Suggested Technical Category:
Architecture
|
|
Title: Virtual Paleontology: Gait Reconstruction of Extinct Vertebrates Using High Performance Computing
Abstract: CARP is a code for studying cardiac arrhythmias by discretizing an MRI scan and simulating the electric potential in the heart tissue. The aim is to use simulations to try out possible surgical interventions before using them on a patient. In this paper we report on work carried out to improve both the absolute and scaling performance on HECToR, the Cray XT at Edinburgh. We study the effects of different decomposition techniques, including a communication-hierarchy aware decomposition that minimizes inter-node communication at the expense of intra-node messages. These and further output optimizations have improved the scaling from 512 to 8192 cores on HECToR, allowing simulation of a single heartbeat in under 5 minutes. |
Author(s):
Sellers, Bill, Presenter Manchester University
Mitchell, Lawrence, Presenter
|
Suggested Technical Category:
User Code Optimization
Joint Session, Tutorial or Other
Technical Category suggested:
Invited Talk, 0900 to 0945 on Weds 26th May
|
|
Title: Advanced Job Scheduling Features for Cray Systems with Platform LSF
Abstract: On large Cray systems where all simulation jobs run through workload management, visibility of the system and jobs is critical for users and administrators to troubleshoot problems. Features in Platform LSF such as scheduling performance, resource reservation and job level data display help simulation users and system administrators easily overcome this challenge. Benchmark data will show how Platform LSF outperforms other workload schedulers. We will also discuss additional technologies from Platform including Platform MPI and its integration with Platform LSF. |
Author(s):
Lu, William, Presenter Platform Computing
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: LSI Storage Best Practices for Deploying and Maintaining Large Scale Parallel File Systems Environments
Abstract: This BoF session will focus on the challenges and rewards of deploying and implementing a parallel file system to improve cluster performance. The discussion will focus on the impact that deployment and ongoing support have on system performance, system availability and pain for support personnel. Discussion will highlight experiences with different file systems, different types of platform approaches, and balancing vendor support vs. in-house support. We will discuss best practices and ask for audience participation to refine those best practices and help users understand how to leverage parallel file systems successfully. |
Author(s):
Didier, Gava, Presenter LSI
Merrill, LaNet LSI
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
BoF
|
|
Title: High Performance Computing with Clouds: Past, Present, and Future
Abstract: Cloud Computing has its roots in technologies spanning the past 30 years: in network operating systems, distributed systems, metacomputing, clusters, and Grids. The promise of Clouds (seamless, on-demand access to computing power and information resources) is delivering substantial benefits in real production settings for datacenter applications. Although private Clouds are having success in HPC, key characteristics that make public Clouds a compelling solution for the datacenter are orders-of-magnitude larger in HPC. Whether this gap can be crossed in the coming years is an open question; Clouds for HPC are in their infancy. |
Author(s):
Nitzberg, Bill, Presenter Altair Engineering, Inc.
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Plenary/keynote style talk.
|
|
Title: HPC at NCAR: Past, Present and Future
Abstract: The history of high-performance computing at NCAR is reviewed from Control Data Corporation’s 3600 through the current IBM p575 cluster, but with special recognition of NCAR’s relationship with Seymour Cray and Cray Research, Inc. The recent acquisition of a Cray XT5m is discussed, along with the rationale for that acquisition. NCAR’s plans for the new NCAR-Wyoming Supercomputing Center in Cheyenne, Wyoming, and the current status of that construction project, are also described. |
Author(s):
Engel, Tom, Presenter National Center for Atmospheric Research (NCAR)
|
Suggested Technical Category:
Other Cray Systems
|
|
Title: CUDA Fortran 2003
Abstract: In the past year, The Portland Group has brought to market a low-level, explicit, Fortran GPU programming language, a higher-level, implicit, directive-based GPU programming model and implementation, and object-oriented features from the Fortran 2003 standard. Together, these provide a rich environment for programming today's and tomorrow's many-core systems. In this paper we will present some of the latest features available in the PGI Fortran compiler from these three areas, and explain how they can be combined to access the performance of CPUs and GPUs while shielding application developers from many of the messy details. |
Author(s):
Leback, Brent, Presenter The Portland Group
Wolfe, Michael The Portland Group
Miles, Douglas The Portland Group
|
Suggested Technical Category:
Compilers
|
|
Title: A Comparison of Shared Memory Parallel Programming Models
Abstract: The dominant parallel programming models for shared memory computers, Pthreads and OpenMP, are both "thread-centric" in that they are based on explicit management of tasks and enforce data dependencies through task management. By comparison, the Cray XMT programming model is data-centric where the primary concern of the programmer is managing data dependencies, allowing threads to progress in a data flow fashion. The XMT implements this programming model by associating tag bits with each word of memory, affording efficient fine grained synchronization of data independent of the number of processors or how tasks are scheduled. When task management is implicit and synchronization is abundant, efficient, and easy to use, programmers have viable alternatives to traditional thread-centric algorithms. In this paper we compare the amount of available parallelism in a variety of different algorithms and data structures when synchronization does not need to be rationed, as well as identify opportunities for platform and performance portability of the data-centric programming model on multi-core processors. |
Author(s):
Mogill, Jace, Presenter Pacific Northwest National Laboratory
Haglin, David Pacific Northwest National Laboratory
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Cray OS Road Map
Abstract: This paper will discuss Cray's operating system road map. This includes the compute node OS, the service node OS, the network stack, file systems, and administrative tools. Coming changes will be previewed, and themes of future releases will be discussed.
|
Author(s):
Carroll, Charlie, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Architecture
|
|
Title: Data Systems Modernization (DSM) Project: Development, Deployment, and Direction
Abstract: The Data Systems Modernization (DSM) project was undertaken to consolidate and update the current information systems of the Oak Ridge Leadership Computing Facility (OLCF). The project combined the Resource Allocation and Tracking System (RATS), the New Account Creation System (NACS) and open-source process management and business intelligence software to streamline the data processing systems of the OLCF. This paper will discuss the development, deployment and future directions of this ongoing project. |
Author(s):
Whitten, Robert, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Account Administration
|
|
Title: Cray's Security update process (or Why can't we have it right now?!)
Abstract: "Security vulnerability" is a frightening and annoying fact of life in the computer world. There is a lot of confusion regarding Cray's process for monitoring and evaluating vulnerabilities as well as providing updates to our customers. This paper will describe the current process of periodic security updates and the fast-paced exciting "security scramble" we all know and don't love so much. |
Author(s):
Palm, Wendy, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
XTreme BOF - Better coordination of our security efforts to more effectively support our sites.
|
|
Title: A uGNI-Based MPICH2 Nemesis Network Module for Cray XE Computer Systems
Abstract: Recent versions of MPICH2 have featured Nemesis - a scalable, high-performance, multi-network communication subsystem. Nemesis provides a framework for developing Network Modules (Netmods) for interfacing the Nemesis subsystem to various high speed network protocols. Cray has developed a User-Level Generic Network Interface (uGNI) for interfacing MPI implementations to the internal high speed network of Cray XE and follow-on compute systems. This paper describes the design of a uGNI Netmod for the MPICH2 Nemesis subsystem. Performance data on the Cray XE will be presented. Planned future enhancements to the uGNI MPICH2 Netmod will also be discussed. |
Author(s):
Pritchard, Howard, Presenter Cray Inc. (CRAY)
Gorodetsky, Igor Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Grand-scale WRF Testing on the Cray XT5 and XE6
Abstract: The Arctic Region Supercomputing Center (ARSC) continues to push the Weather Research and Forecasting (WRF) model in ambitious directions. With the help of Cray, Inc. and WRF developers, a model size of more than one billion grid points was tested on a real-world weather scenario over the North Pacific and Arctic, providing 1 kilometer horizontal resolution over the entire region. With research and operations groups increasingly interested in horizontal resolutions of 100 meters or less, and fine-scale vertical resolutions near the surface, it becomes imperative to begin testing the full WRF environment (pre-processing, model execution and post-processing) on domains consisting of billions of grid points, executed on tens to hundreds of thousands of cores. In this year's paper and presentation, we extend our work of previous years by attempting to run real-world weather simulations on domains of billions of grid points, employing various optimization schemes suggested by Cray and WRF developers. The platforms used for this work include the Cray XT5 (kraken) and the Cray XE6 (chugach). |
Author(s):
Morton, Don, Presenter Arctic Region Supercomputing Center (ARSC)
Nudson, Oralee Arctic Region Supercomputing Center (ARSC)
Bahls, Don Arctic Region Supercomputing Center (ARSC)
Johnsen, Peter Cray Inc. (CRAY)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Introduction to Programming for GPU Accelerators
Abstract: At Nvidia’s GTC ‘10, Cray announced future support for Nvidia GPUs in XE6 systems. At SC10, GPUs were one of the hottest topics of the conference. This tutorial will teach the basics of GPU computing in preparation for GPU-accelerated Cray XE6 systems. I will establish a baseline knowledge of GPU architecture. This will be followed by a discussion of the currently available options for GPU programming, including CUDA C, CUDA Fortran, OpenCL, and compiler directives. The tutorial will also include a demonstration of GPU performance analysis and basic optimization techniques. |
Author(s):
Larkin, Jeff, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
As discussed with Dave Hancock, this outline is roughly 4 hours in length, but could be pruned down to a 2 hour tutorial if necessary.
Outline (Total Time: 4 hours with 1 break):
1) Introduction to GPUs (45 minutes)
a) GPU architecture basics
b) Differences between GPUs and CPUs
c) When do GPUs make sense?
2) GPU Programming Models (90 minutes)
a) What makes GPU programming different?
b) CUDA (C and Fortran)
c) OpenCL
d) PGI Accelerator Directives
e) Proposed OpenMP Accelerator Extensions
3) Performance Analysis (30 minutes)
a) CUDA Profiler
4) GPU Optimization Basics (60 minutes)
a) Optimizing data transfers
b) GPU memory optimizations
c) Register occupancy
d) Asynchronous execution
|
|
Title: Optimizing Nuclear Physics Codes on the XT5
Abstract: Scientists studying the structure and behavior of the atomic nucleus require immense high-performance computing resources to gain scientific insights. Several nuclear physics codes are capable of scaling to more than 100,000 cores on Oak Ridge National Laboratory's petaflop Cray XT5 system, Jaguar. In this paper, we present our work on optimizing codes in the nuclear physics domain. |
Author(s):
Hartman-Baker, Rebecca, Presenter Oak Ridge National Laboratory (ORNL)
Nam, Hai Ah Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Metrics and Best Practices for Host-based Access Control to Ensure System Integrity and Availability
Abstract: Open access in academic research computing exposes servers to many kinds of brute-force attacks and vulnerability exploits. The system administrator has the delicate task of ensuring system integrity through proper access controls and security patches while also enabling service availability and ease of use. This paper will present an analysis of aggregated log metrics for access history and service up-time, processed with tools such as Nagios and Splunk, in conjunction with a set of cases of vulnerabilities, intrusions and faults. The paper will also compare and suggest improved best practices to be shared between sites.
|
Author(s):
Kaila, Urpo, Presenter Scientific Computing Ltd. (CSC)
Passerini, Marco, Presenter Scientific Computing Ltd. (CSC)
Virtanen, Joni, Presenter Scientific Computing Ltd. (CSC)
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Security
|
|
Title: Performance of Density Functional Theory codes on Cray XE6
Abstract: Around one third of the computing cycles at NERSC are consumed by materials science and chemistry users in each allocation year, and ~75% of them run various Density Functional Theory (DFT) codes, the majority of which are pure MPI codes. In this paper, we select a few representative codes and discuss their performance on the Cray XE6, especially the performance impact of the multicore architecture in comparison with the Cray XT4. We also explore how OpenMP and/or the multi-threaded BLAS library help to address the reduced per-core memory on the Cray XE6 and to improve the parallel performance of the codes. |
Author(s):
Zhao, Zhengji, Presenter National Energy Research Scientific Computing Center (NERSC)
Wright, Nicholas National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Automation-Assisted Debugging on the Cray with TotalView
Abstract: A little bit of automation can go a long way towards streamlining and simplifying the process of debugging scientific applications. This talk will demonstrate using a new TotalView feature, C++View, to transform complex data structures and automatically perform validity checks within them. C++View is an element of TotalView's extensive scripting framework, which also includes a type transformation facility, a fully programmable TCL-based CLI, a C and Fortran expression evaluation system, and the scripting tools MemScript and TVScript. |
Author(s):
Gottbrath, Chris, Presenter Rogue Wave Software
|
Suggested Technical Category:
Tools
|
|
Title: Shared Libraries on a Capability Class Computer
Abstract: The popularity of dynamically linked executables continues to grow within the scientific computing community. The system software implementation of shared libraries is non-trivial and has significant implications for application scalability. This presentation will first provide some background on the Linux implementation of shared libraries, which was not designed for distributed HPC platforms. This introductory information will be used to identify the scalability issues for massively parallel systems such as the Cray XT/XE product lines. Lastly, the presentation will describe the considerations and lessons learned in file system placement of the shared libraries on Cielo, a Cray XE6 system with over 100,000 cores. Scaling results and comparisons will be included. |
Author(s):
Kelly, Suzanne, Presenter Sandia National Laboratories (SNLA)
Klundt, Ruth Sandia National Laboratories (SNLA)
Laros, James Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: The Cray Programming Environment: Current Status and Future Directions
Abstract: The Cray Programming Environment has been designed to address issues of scale and complexity of high end HPC systems. Its main goal is to hide the complexity of the system, such that applications can achieve the highest possible performance from the hardware. In this talk I will present the recent activities and future directions of the Cray Programming Environment, including an overview of the Cray PE plans to support heterogeneous CPU/GPU systems. |
Author(s):
DeRose, Luiz, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Overview of the Cray XMT2
Abstract: The Cray XMT2 system will be the newest offering in Cray's line of scalable multithreaded computers. The Cray XMT2 is based on the latest Cray Threadstorm processor. This paper will describe the Cray XMT2 with particular emphasis on the new architectural features provided by this processor. |
Author(s):
Kopser, Andrew, Presenter Cray Inc. (CRAY)
Vollrath, Dennis Cray Inc. (CRAY)
|
Suggested Technical Category:
Architecture
|
|
Title: Acceleration of porous media simulations on the Cray XE6 platform
Abstract: Simulating carbon sequestration and reacting groundwater flow is important because each of these problems involves processes that cannot be sufficiently simulated in the laboratory. Investigation of these issues requires resolution of spatial scales on the order of meters within domains that are on the order of tens of kilometers, which necessitates the use of adaptive mesh refinement. Also, stiff chemical reactions and large acoustic wave speeds limit the size of the time step, but simulations must be able to predict results for 10-15 years in the future, making this a challenging multi-scale physics problem. This paper examines the memory requirements of the porous media code used for these simulations and discusses improving performance through the use of hybrid (OpenMP+MPI) programming on the Cray XE6 platform. |
Author(s):
Wright, Nicholas National Energy Research Scientific Computing Center (NERSC)
Pau, George Lawrence Berkeley National Laboratory
Lijewski, Michael Lawrence Berkeley National Laboratory
Fagnan, Kirsten, Presenter National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Cheetah: A Scalable Hierarchical Collective Operation Framework
Abstract: The performance and scalability of collective operations play a key role in the performance and scalability of many scientific applications. Within the Open MPI code base we have developed a general purpose hierarchical collective operations framework called Cheetah, and applied it at large scale on the Oak Ridge Leadership Computing Facility's Jaguar platform, obtaining better performance and scalability than the native MPI implementation in measurements taken at process counts of up to about 49K. This talk discusses Cheetah's design and implementation, as well as the results of large-scale benchmarks. |
Author(s):
Graham, Richard, Presenter Oak Ridge National Laboratory (ORNL)
Shamis, Pavel Oak Ridge National Laboratory (ORNL)
Ladd, Joshua Oak Ridge National Laboratory (ORNL)
Gorentla Venkata, Manjunath Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Libraries
|
|
Title: A Programming Environment for Heterogeneous Multi-Core Computer Systems
Abstract: As part of the OLCF-3 project, the Oak Ridge National Laboratory is working with several vendors and is engaged in research to develop a Programming Environment for mixed CPU/Accelerator based ultra-scale computer systems. The environment provides a toolset to port or develop codes for CPU/Accelerator systems, reducing the development time needed to improve the performance and portability of the codes while minimizing sources of errors. Our toolset consists of compilers for high-level Accelerator directives, libraries, performance tools and a debugger with synergistic interfaces among them. In this paper we show how these tools work together and how they support the different stages of the program development/porting cycle. Our paper will describe how we successfully used the tools to port DOE codes to a CPU/Accelerator system. |
Author(s):
Graham, Richard Oak Ridge National Laboratory (ORNL)
Shamis, Pavel Oak Ridge National Laboratory (ORNL)
Hernandez, Oscar, Presenter Oak Ridge National Laboratory (ORNL)
Kartsaklis, Christos Oak Ridge National Laboratory (ORNL)
Mintz, Tiffany Oak Ridge National Laboratory (ORNL)
Hsu, Chung Hsing Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tools
|
|
Title: Evolution of the Cray Performance Measurement and Analysis Tools
Abstract: The goal of the Cray Performance Measurement and Analysis Tools is to help the user identify important and meaningful information from potentially massive data sets by providing hints around problem areas instead of just reporting raw data. Analysis of data that addresses multiple dimensions of scalability including millions of lines of code, lots of processes or threads and long running applications is needed. The Cray toolset supports these dimensions by collecting information at process and thread levels, and providing features such as load imbalance analysis, derived metrics based on hardware events, and optimal MPI rank placement strategies. This paper focuses on recent additions to the performance tools to enhance the analysis experience and support new architectures such as hybrid X86 and GPU systems. Work presented includes support for applications using PGAS programming models, loop work estimates that help identify parallel or accelerator loop candidates, and statistics around accelerated loops.
|
Author(s):
Poxon, Heidi, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tools
|
|
Title: A study of scalability performance for hybrid mode computation and asynchronous MPI transpose operation in DSTAR3D
Abstract: A necessary condition for good scalability of parallel computation at large core counts is to minimise data communication and overlap it with computation whenever possible. Following these ideas, we study the parallel performance of the code DSTAR3D, which simulates reactive turbulent flows using direct numerical simulation. The studied algorithm uses a two-dimensional domain decomposition with OpenMP threads inside each local domain for numerically intensive kernels and asynchronous MPI for the required transpose operation on subdomains. This new algorithm allows DSTAR3D to use around 10000 cores with excellent scalability, a significant improvement over the hundreds of cores used by the initial algorithm based on one-dimensional decomposition. |
Author(s):
Anton, Lucian, Presenter NAG Ltd.
Li, Ning NAG Ltd.
Luo, Kai Southampton University
|
Suggested Technical Category:
User Code Optimization
|
|
Title: The design of an auto-tuning I/O framework on Cray XT5 system
Abstract: The Cray XT5 is equipped with Lustre, a parallel file system. Effective use of I/O is essential for an application to scale up. We have developed a mathematical model based on queuing theory and built an experimental I/O auto-tuning infrastructure for the XT5 system. |
Author(s):
You, Haihang, Presenter National Institute for Computational Sciences (NICS)
Liu, Qing National Institute for Computational Sciences (NICS)
Li, Zhiqiang University of Tennessee
|
Suggested Technical Category:
Tools
|
|
Title: Application-Driven Acceptance of Cielo, an XE6 Petascale Capability Platform
Abstract: Cielo is one of the first instantiations of Cray's new XE6 architecture and will provide capability computing for the NNSA's Advanced Simulation and Computing (ASC) Campaign. A primary acceptance criterion for the initial phase of Cielo was to demonstrate a six times (6x) performance improvement for a suite of ASC codes relative to its predecessor, the ASC Purple platform. This paper describes the 6x performance acceptance criterion and discusses the applications and the results. Performance results at up to tens of thousands of cores are presented, with analysis relating them to the architectural characteristics of the XE6 that enabled the platform to exceed the acceptance criterion. |
Author(s):
Doerfler, Douglas, Presenter Sandia National Laboratories (SNLA)
Rajan, Mahesh Sandia National Laboratories (SNLA)
Nuss, Cindy Cray Inc. (CRAY)
Wright, Cornell Los Alamos National Laboratory (LANL)
Spelce, Thomas Lawrence Livermore National Laboratory
|
Suggested Technical Category:
Architecture
|
|
Title: The NERSC-Cray Center of Excellence: Performance Optimization for the Multicore Era.
Abstract: We compare performance of several NERSC benchmarks on three Cray platforms, Franklin (XT4), Jaguar (XT5) and Hopper (XE6). We also report our work on evaluating the hybrid MPI-OpenMP programming model for several of these benchmarks. By using detailed timing breakdowns, we measure the contributions to the total runtime of the applications from computation, communication, and from runtime overhead, such as that due to OpenMP regions; and we discuss their effect on the performance differences observed. Finally, we report preliminary results of our PGAS (UPC and CAF) performance measurements. |
Author(s):
Wright, Nicholas, Presenter National Energy Research Scientific Computing Center (NERSC)
Shan, Hongzhang National Energy Research Scientific Computing Center (NERSC)
Blagoievic, Filip National Energy Research Scientific Computing Center (NERSC)
Wasserman, Harvey National Energy Research Scientific Computing Center (NERSC)
Drummond, Tony National Energy Research Scientific Computing Center (NERSC)
Shalf, John National Energy Research Scientific Computing Center (NERSC)
Fuerlinger, Karl UC Berkeley
Yelick, Katherine National Energy Research Scientific Computing Center (NERSC)
Ethier, Stephane Princeton Plasma Physics Lab
Wagner, Marcus Cray Inc. (CRAY)
Wichmann, Nathan Cray Inc. (CRAY)
Anderson, Sarah Cray Inc. (CRAY)
Aamodt, Mike Cray Inc. (CRAY)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Providing Runtime Clock Synchronization With Minimal Node-To-Node Time Deviation on XT4s and XT5s
Abstract: We present a new high-precision clock synchronization algorithm designed for large XT4 and XT5 leadership-class machines. The algorithm, which is designed to support OS noise reduction through co-scheduling, is suitable for use cases requiring low overhead and minimal time deviation between nodes. Unlike most high-precision algorithms, which reach their precision in a post-mortem analysis after the application has completed, the new ORNL-developed algorithm rapidly provides precise results during runtime. Prior to our work, the leading high-precision clock synchronization algorithms that made results available during runtime relied on probabilistic schemes, which are not guaranteed to produce an answer. |
Author(s):
Jones, Terry, Presenter Oak Ridge National Laboratory (ORNL)
Koenig, Gregory Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Parallel Finite Element Earthquake Rupture Simulations on Quad- and Hex-core Cray XT Systems
Abstract: In this paper, we illustrate an element-based partitioning scheme for explicit finite element methods. Based on this partitioning scheme, we discuss how to use hybrid MPI/OpenMP efficiently to parallelize a sequential finite element earthquake rupture simulation code, in order not only to achieve multiple levels of parallelism in the code but also to reduce the communication overhead of MPI within a multicore node by taking advantage of the shared address space, high on-chip inter-core bandwidth and low inter-core latency. We evaluate the hybrid MPI/OpenMP finite element earthquake rupture simulations on quad- and hex-core Cray XT4/5 systems from Oak Ridge National Laboratory using the Southern California Earthquake Center (SCEC) benchmark TPV 210, which tests convergence of the TPV 10 problem with increasingly higher spatial resolutions (smaller element sizes). The benchmark solves dynamic rupture propagation along a 60° dipping normal fault (30 km x 15 km) and wave propagation in a homogeneous three-dimensional half space, with the initial stress on the fault increasing linearly with depth. Our experimental results indicate that the parallel finite element earthquake rupture simulation produces accurate results and has good scalability on these Cray XT systems. |
Author(s):
Wu, Xingfu, Presenter Oak Ridge National Laboratory (ORNL) Texas A&M University
Duan, Benchun Texas A&M University
Taylor, Valerie Texas A&M University
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: A Pragmatic Approach to Improving the Large-scale Parallel I/O Performance of Scientific Applications.
Abstract: I/O performance in scientific applications is an often neglected area of concern during performance optimizations. However, various scientific applications have been identified which benefit from I/O improvements due to the volume of data or number of compute processes utilized. This work will detail the I/O patterns and data layouts of real scientific applications, discuss their impacts, and demonstrate pragmatic approaches to improve I/O performance. |
Author(s):
Crosby, Lonnie, Presenter National Institute for Computational Sciences (NICS)
Brook, Glenn National Institute for Computational Sciences (NICS)
Sekachev, Mikhail, Presenter National Institute for Computational Sciences (NICS)
Wong, Kwai National Institute for Computational Sciences (NICS)
Rekepalli, Bhanu National Institute for Computational Sciences (NICS)
Vose, Aaron National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: XE System Reliability and Resiliency: Observations and impact to operations
Abstract: In 2010, Cray Inc introduced the XE6 product with new technology in the software, interconnect, blades and cabinets. This paper will discuss the reliability trends that have been observed since the introduction of the product and our observations of the effectiveness of the network resiliency improvements incorporated into the system. We will also discuss how these improvements are impacting reliability and availability metrics and Cray support activities. |
Author(s):
Johnson, Steven Cray Inc. (CRAY)
|
Suggested Technical Category:
Operations
|
|
Title: Benchmark Performance of Different Compilers on a Cray XE6
Abstract: There are four different supported compilers on NERSC's recently acquired XE6, and our users often request guidance from us in determining which compiler is best for a particular application. In this paper, we will describe the comparative performance of different compilers on several MPI and Hybrid MPI/OpenMP benchmarks with different characteristics. For each compiler and benchmark, we will establish the best set of optimization arguments to the compiler. |
Author(s):
Stewart, Michael National Energy Research Scientific Computing Center (NERSC)
He, Yun (Helen), Presenter National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Compilers
|
|
Title: Performance of the time-dependent close-coupling approach to electron-impact ionization on the Cray XE6
Abstract: We report on time-dependent close-coupling calculations of the electron-impact ionization of small atoms and molecules. Such calculations are required to accurately predict the angular distributions and energy sharings of the two outgoing electrons after ionization by an incident electron. Our calculations treat the long-range electron-electron interaction without approximation, resulting in computationally intensive problems. We have performed calculations for electron-impact ionization of helium and molecular hydrogen on the NERSC facilities Franklin (Cray XT4) and Hopper II (Cray XE6). We report on the scaling properties of our codes on these platforms, and discuss some performance issues we have encountered. This work is supported in part by grants from the US Department of Energy and the US National Science Foundation. The Los Alamos National Laboratory is operated by Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under Contract No. DE-AC52-06NA25396. |
Author(s):
Colgan, James Los Alamos National Laboratory (LANL)
Pindzola, Michael Auburn University
Antypas, Katie, Presenter National Energy Research Scientific Computing Center (NERSC)
Yang, Woo-Sun National Energy Research Scientific Computing Center (NERSC)
He, Helen National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: A Deep Dive on New Features of the Cray Programming Environment
Abstract: This tutorial is intended for users who are interested in learning in more depth about some of the more recent features in the Programming Environment for the Cray XT and Cray XE systems. The tutorial will cover, with examples, features in the Cray Compiling Environment (CCE), the Cray Performance Measurement and Analysis Tools (CPMAT), the Cray Debugging Supporting Tools (CDST), and the Cray Scientific and Math Libraries (CSML). |
Author(s):
DeRose, Luiz, Presenter Cray Inc. (CRAY)
Poxon, Heidi, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
This is a suggestion for a programming environment tutorial
|
|
Title: Transitioning applications from the Franklin XT4 system with 4 cores per node to the Hopper XE6 system with 24 cores per node.
Abstract: As NERSC users move from the Franklin XT4 system with 4 cores per node to the Hopper XE6 system with 24 cores per node, they have had to adapt to a lower amount of memory per core and on-node I/O performance which does not scale up linearly with the number of cores per node. This paper will discuss the practical implications of running on a system with 24 cores per node, exploring advanced aprun and memory affinity options for typical NERSC applications as well as strategies to improve I/O performance out of a node. |
Author(s):
Antypas, Katie, Presenter National Energy Research Scientific Computing Center (NERSC)
He, Helen National Energy Research Scientific Computing Center (NERSC)
Wasserman, Harvey National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Training
|
|
Title: Debugging at Petascale and Beyond
Abstract: Debugging at scale is now a reality - with Allinea DDT 3.0 achieving whole-machine interactive debugging of Petaflop Cray systems at high speed in production usage. This paper explores the need and opportunities for debugging at the target application scale, and shows how Allinea's fast and scalable debugging architecture enables new possibilities that are simplifying the task of debugging at a time when system architectures are becoming more complex. |
Author(s):
January, Chris Allinea Software
Lecomber, David, Presenter Allinea Software
O'Connor, Mark Allinea Software
|
Suggested Technical Category:
Tools
|
|
Title: Cray Scientific Libraries: Overview, Performance Evaluation and Advanced Usage.
Abstract: The Cray Scientific Libraries enable highly efficient usage of Cray systems with minimal programmer effort. The standard and near-standard scientific libraries for dense linear algebra, sparse linear algebra and FFTs are all provided and tuned extensively for AMD processors, the Cray network, or both. A comprehensive set of custom tools is also provided that allows simpler usage, higher degrees of control or better performance than the standardized libraries. Some parts of LibSci use auto-tuning and adaptation, the tailoring of numerical kernels to the calling problem at run-time for increased performance. This talk will describe the technical innovations required to implement the optimizations and features in LibSci, provide a performance evaluation of the currently released libraries and detail upcoming features and performance optimizations. |
Author(s):
Tate, Adrian, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Libraries
|
|
Title: Titan: ORNL’s New System for Scientific Computing
Abstract: ORNL is planning to install a 10-20 petaflops computer system over the next 18 months that will be the next generation system for scientific computing for the U.S. Department of Energy. While there will be many similarities to the existing Jaguar system, there will also be architectural differences. In this paper, we discuss the accelerator based architecture of Titan and the reasons for our decision to go in this direction. We also discuss our choice of the file systems to support the system. |
Author(s):
Bland, Arthur, Presenter Oak Ridge National Laboratory (ORNL)
Rogers, James Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Architecture
|
|
Title: The Programmability and Performance of Cray's Programming Environment for Accelerators
Abstract: Although GPU accelerators are rapidly being adopted within the HPC community due to their high floating-point performance, the related programming languages, such as CUDA and OpenCL, require significant application code modifications to achieve and sustain a high fraction of peak performance. Cray attempts to improve programmability and portability with the prototype Programming Environment (PE) for Accelerators, which allows efficient usage of hybrid GPU/CPU systems using a directive-based programming model and accelerated scientific libraries. Directive-based programming has the potential to considerably improve programmability and portability by abstracting away the complex underlying hardware characteristics and providing an interface to express concurrency and parallelism to the compiler. Likewise, tuned scientific libraries that are designed to make efficient use of the accelerators through standardized interfaces also increase user productivity. In this manuscript, we evaluate the programmability and performance potential of a prototype version of Cray's PE for Accelerators, discuss user productivity while porting codes containing OpenMP and PGI accelerator directives and LibSci interfaces, gauge the effort required for tuning and optimization, and review the advantages and shortcomings for acceleration of production-level applications on heterogeneous scalable systems. |
Author(s):
Poznanovic, Jeffrey, Presenter CSCS - Swiss National Supercomputing Centre (CSCS)
Alam, Sadaf CSCS - Swiss National Supercomputing Centre (CSCS)
Fourestey, Gilles CSCS - Swiss National Supercomputing Centre (CSCS)
Tate, Adrian Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Determining the health of Lustre filesystems at scale
Abstract: Monitoring the components of a Lustre file system is crucial to meeting mission requirements as the scale and complexity of the installation grow. Determining the health and performance of the file system becomes non-trivial, and the complexity increases faster than the size of the installation. This paper discusses the ongoing work at the Oak Ridge Leadership Computing Facility to monitor the health of its center-wide Lustre file systems. |
Author(s):
Dillow, David Oak Ridge National Laboratory (ORNL)
Hill, Jason Oak Ridge National Laboratory (ORNL)
Leverman, Dustin Oak Ridge National Laboratory (ORNL)
Koch, Scott Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
System Operations
|
|
Title: I/O Congestion Avoidance via Routing and Object Placement
Abstract: As storage systems get larger to meet the demands of petascale systems, careful planning must be applied to avoid congestion points and extract the maximum performance. In addition, the large size of the data sets generated by such systems makes it desirable for all compute resources in a center to have common access to this data without needing to copy it to each machine. This paper describes a method of placing I/O close to the storage nodes to minimize contention on Cray's SeaStar2+ network, and extends it to a routed Lustre configuration to gain the same benefits when running against a center-wide file system. Our experiments show performance improvements for both direct attached and routed file systems. |
Author(s):
Dillow, David, Presenter Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Zhang, Zhe IBM T.J. Watson Research Center
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Discovering the Petascale User Experience in Scheduling Diverse Scientific Applications
Abstract: Newly emerging petascale computational resources are popular for both capacity and capability computing. However, these varied job classes have widely different resource requirements that make scheduling challenging. Beyond machine utilization, the scheduling of computational resources should provide reasonable throughput for all classes of jobs. This work will examine the user impact of scheduling various job classes on a petascale, shared-computing resource with a diverse workload, including scheduling policies and user behavior. |
Author(s):
Baer, Troy National Institute for Computational Sciences (NICS)
Brook, R. Glenn National Institute for Computational Sciences (NICS)
Crosby, Lonnie D., Presenter National Institute for Computational Sciences (NICS)
Ezell, Matt National Institute for Computational Sciences (NICS)
Samuel, Tabitha National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
System Operations
|
|
Title: Petascale Capability Computing
Abstract: The LANL/Sandia Alliance for Computing at the Extreme Scale (ACES) is partnering with Cray to deploy a Petascale capability system, Cielo, for the Department of Energy's Advanced Simulation and Computing (ASC) program. Many targeted national security applications are extremely large and will require a significant fraction of the cores on Cielo to execute. Cielo is one of three Cray Petascale systems, has 6,704 nodes and uses a 3-D Torus topology and Cray's Gemini interconnect. Each node is composed of two 8-core AMD Magny-Cours processors with 16 GB of memory each, resulting in a total of 107,264 cores and 160 TB of memory. It is the first large Cray system that employs a Panasas filesystem. An upgrade in early 2011 will add 0.34 Petaflops to Cielo. This presentation describes Cielo's architecture, the user environment and preliminary performance results. We discuss some of our system integration efforts and challenges, as well as our experience with initial use of the system for simulations. An ASC R&D partnership with Cray to design an interconnect for the 2014 timeframe is also described. |
Author(s):
Doerfler, Doug Sandia National Laboratories (SNLA)
Dosanjh, Sudip Sandia National Laboratories (SNLA)
Morrison, John Los Alamos National Laboratory (LANL)
Vigil, Manuel Los Alamos National Laboratory (LANL)
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Cray Petascale Systems
|
|
Title: Overview of Node Health Checker
Abstract: As the size of Cray systems increases, the mean time to failure of an individual compute node decreases. As a result, Cray's Node Health Checker (NHC) is playing an ever increasing role in ensuring job completion and system administrator sanity. NHC, marketed as NodeKARE, is a system management tool that runs a series of built-in and system administrator defined health tests on compute nodes. NHC automatically sequesters unhealthy nodes, dumps them for future debugging, and returns them to the pool of available nodes with a reboot. This increases the availability of nodes while simultaneously decreasing administrator intervention. Sequestering unhealthy nodes prevents them from causing job failure and saves tens of thousands of hours of compute time. This paper provides an overview of NHC.
|
Author(s):
Sollom, Jason, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
System Operations
|
|
Title: Producing weather forecasts on time in Denmark using PBS Professional
Abstract: Running a mix of research jobs together with jobs that must run on time according to a predefined schedule adds an extra layer of complexity to job scheduling. More traditional approaches for ensuring resource availability include suspend/resume or checkpoint/restart, which are either not available on XT systems or have practical limitations. This presentation outlines the use of advance reservation scheduling on an XT system doing Numerical Weather Prediction (NWP) at the Danish Meteorological Institute (DMI). A basic and simple strategy for exploiting the advance reservation scheduling feature of PBS Professional to ensure resource availability at predefined timeslots is presented, along with experience gained over the past years. |
Author(s):
Lorenzen, Thomas, Presenter Danish Meteorological Institute (DMI)
Olason, Thor Danish Meteorological Institute (DMI)
Iversen, Frithjov Cray Inc. (CRAY)
Palazzi, Paolo Cray Inc. (CRAY)
|
Suggested Technical Category:
System Operations
|
|
Title: Building an Electronic Knowledge Base to Aid in Support of Jaguar
Abstract: Supporting the world's largest XT5 can be a daunting task. By designing, developing and implementing an online knowledge base, the Oak Ridge Leadership Computing Facility (OLCF) has provided a valuable support resource to its users. This paper will describe the process and results of this development effort. |
Author(s):
Whitten, Robert, Presenter Oak Ridge National Laboratory (ORNL)
Barker, Ashley, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Documentation
|
|
Title: Large-scale performance analysis of PFLOTRAN with Scalasca
Abstract: The PFLOTRAN code for reactive multiphase flow and transport has featured prominently in US Department of Energy SciDAC and INCITE programs, where its execution performance with up to 128k processor cores on Cray XT and IBM BG/P systems has been analyzed. Although the complexities of PFLOTRAN executions employing PETSc, LAPACK, BLAS, HDF5 and MPI libraries at large scale were challenging, the open-source Scalasca [www.scalasca.org] toolset was able to provide valuable insight into a variety of performance-limiting aspects. |
Author(s):
Wylie, Brian, Presenter Juelich Supercomputing Centre
|
Suggested Technical Category:
Tools
|
|
Title: Deploying SLURM on XT, XE, and Future Cray Systems
Abstract: We describe porting the open-source SLURM resource manager to the Cray BASIL/ALPS interface and report on experiences of using it on our main 20-cabinet Cray XT5 production platform, as well as several development systems of a heterogeneous multi-cluster environment. Since some of these systems are GPU-based, we also discuss issues in extending the existing interface to future systems (Cray XE6 with GPU accelerators), in order to take advantage of SLURM's cutting-edge GPU support. |
Author(s):
Renker, Gerrit, Presenter CSCS - Swiss National Supercomputing Centre (CSCS)
Stringfellow, Neil CSCS - Swiss National Supercomputing Centre (CSCS)
Jette, Morris CSCS - Swiss National Supercomputing Centre (CSCS)
Auble, Danny CSCS - Swiss National Supercomputing Centre (CSCS)
Alam, Sadaf CSCS - Swiss National Supercomputing Centre (CSCS)
|
Suggested Technical Category:
System Operations
|
|
Title: Application Characteristics and Performance on a Cray XE6
Abstract: In this paper, we will explore the performance of two applications on a Cray XE6 and their performance improvement over previous machines, including the XT5 and the XT6. These two applications show different scaling effects as we go from machine to machine, and we will examine how the applications differ in order to explain these effects. We will use profiling and other tools to better understand resource contention within and between nodes and how it changes as the processors and network evolve from machine to machine. |
Author(s):
Vaughan, Courtenay, Presenter Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: The NCRC Grid Scheduling Environment
Abstract: In support of the NCRC, a joint computing center between NOAA and ORNL, a grid-based scheduling infrastructure was designed to allow geographically separate computing resources to be used as production resources in climate and weather research workflows. These workflows require job coordination between the two centers in order to provide a complete workflow of data staging, computation, post-analysis and archival. This paper details the design, implementation and initial production phase of the infrastructure and lessons learned from the process. |
Author(s):
Indiviglio, Frank, Presenter National Climate-Computing Research Center (NCRC)
Maxwell, Don, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
System Operations
|
|
Title: Future proofing WL-LSMS: Preparing for first principles thermodynamics calculations on accelerator and multicore architectures
Abstract: The WL-LSMS code has a very good track record for scaling on massively parallel architectures and achieves a performance of approx. 1.8 PF on the current Jaguar system at ORNL. Yet the code architecture assumes a distributed memory with a single thread of execution per MPI rank, which is not a good fit for multicore nodes and the emerging accelerator based architectures. This talk will present the ongoing work to restructure the WL-LSMS code to take advantage of these new architectures and continue to work efficiently during the next decade. |
Author(s):
Eisenbach, Markus, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Case Studies From the OLCF Center for Application Acceleration Readiness: The Importance of Realizing Hierarchical Parallelism in the Hybrid Multicore Era
Abstract: We will present several case studies - including kernel identification and work performed to date - for our suite of Early Science applications targeted for Titan. This work has, we believe, made clear a productive path forward for application developers to maximize effective use of hybrid architectures, where the realization of hierarchical parallelism - from distributed-memory to SMP-like to vector-like - is the essential ingredient. |
Author(s):
Messer, Bronson, Presenter Oak Ridge National Laboratory (ORNL)
Kendall, Ricky Oak Ridge National Laboratory (ORNL)
Graham, Richard Oak Ridge National Laboratory (ORNL)
Hernandez, Oscar Oak Ridge National Laboratory (ORNL)
Levesque, John Cray Inc. (CRAY)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: User application monitoring through assessment of abnormal behavior recorded in RAS logs
Abstract: Abnormal status of an application is typically detected by "hard" evidence, e.g., out of memory, segmentation fault. However, such information only provides clues for the notification of abnormal termination of the application; lost are any implications as to the application's termination with respect to the particular context of the platform. Restated, the generic exception the application reports is devoid of the overall system context that is captured elsewhere in the system, e.g., RAS logs. In this paper we present an "activity entropy" based application monitoring framework that extracts both facts (events) and context with regard to applications from RAS logs, and maps them into entropy scores that represent degrees of "unusualness" for applications. The paper describes our results from applying the framework to the Cray "Jaguar" system at Oak Ridge National Laboratory, and discusses how it identified applications running abnormally and implications based on the type of abnormality.
|
Author(s):
Park, Byung H., Presenter Oak Ridge National Laboratory (ORNL)
Gunasekaran, Raghul Oak Ridge National Laboratory (ORNL)
Naughton, Thomas Oak Ridge National Laboratory (ORNL)
Dillow, David Oak Ridge National Laboratory (ORNL)
Geist, Al Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Environmental Monitoring
|
|
Title: DVS, GPFS and External Lustre at NERSC - How It's Working on Hopper
Abstract: Providing flexible, reliable and high performance access to user data is an ongoing problem for HPC centers. This paper will discuss how NERSC has partitioned its storage resources between local and global filesystems, and the configuration, functionality, operation and performance of these several different parallel filesystems in use on NERSC's Cray XE6, Hopper. |
Author(s):
Butler, Tina, Presenter National Energy Research Scientific Computing Center (NERSC)
Lee, Rei National Energy Research Scientific Computing Center (NERSC)
Butler, Gregory National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Mass Storage
|
|
Title: Scalability of Paraview's Coprocessing Capability
Abstract: For exceedingly large high performance computing runs, writing all data to disk is unmanageably slow. It becomes necessary to have analysis and visualization communicate with the simulation in memory instead of through disk, retaining access to all available data for analysis. The open source visualization and analysis tool, Paraview, has recently added a coprocessing API allowing it to be linked into simulation codes. We will demonstrate scalability of Paraview coprocessing on up to 64000 cores on the new NNSA platform, Cielo. |
Author(s):
Fabian, Nathan, Presenter Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Libraries
|
|
Title: A Performance Comparison Framework for Numerical Libraries on Cray XT5 System
Abstract: Kraken, a Cray XT5 operated by the National Institute for Computational Sciences (NICS), enables scientific discoveries by researchers nationwide by providing leading-edge computational resources. Numerical libraries are frequently used by applications, which can be linked against different vendor libraries such as Libsci, ACML and MKL. Choosing the most efficient library for a given application is essential for achieving good performance. The performance comparison framework designed at NICS will help researchers determine the fastest library choices for their application. |
Author(s):
Hadri, Bilel, Presenter National Institute for Computational Sciences (NICS)
You, Haihang National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Libraries
|
|
Title: An Overview of the Chapel Programming Language
Abstract: Chapel is a new parallel programming language being developed under the DARPA High Productivity Computing Systems program (HPCS). In this tutorial, we will present an introduction to Chapel, from context and motivation to a detailed description of Chapel via example computations.
|
Author(s):
Choi, Sung-Eun Cray Inc. (CRAY)
Chamberlain, Bradford, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
Tutorial Session
|
|
Title: Application Performance Evaluation Studies of Multi-Core Nodes and the Gemini Network
Abstract: The UK National HPC Service (HECToR) has recently been upgraded from a Cray XT4 with quad-core nodes to a Cray XT6 with 24-core nodes and then to a Cray XE6 with the Gemini network. We examine the performance implications for a range of applications, including detailed performance profiling studies to examine the effect of the increase in the number of cores and the change in architecture from XT series Seastar to XE6 Gemini. |
Author(s):
Ashworth, Mike, Presenter EPCC (EPCC)
Guo, Xiaohu EPCC (EPCC)
Pickles, Stephen EPCC (EPCC)
Plummer, Martin EPCC (EPCC)
Porter, Andrew EPCC (EPCC)
Sunderland, Andrew EPCC (EPCC)
Todorov, Ilian EPCC (EPCC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Authoring User-Defined Domain Maps in Chapel
Abstract: One of the most promising features of Cray's parallel Chapel programming language is its support for 'user-defined domain maps', which permit advanced users to specify their own implementation for a parallel, distributed array that supports high-level global array operations. In choosing to write a domain map, the user has control over high-level decisions like how data and iterations are divided among the target nodes of the machine as well as finer-grained decisions like the memory layout used to store the array's indices and values. In this paper, we give an overview of Chapel's user-defined domain map strategy and provide a summary of the developer's interface used to specify them. |
Author(s):
Chamberlain, Bradford Cray Inc. (CRAY)
Choi, Sung-Eun Cray Inc. (CRAY)
Iten, David Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Cray's Lustre model and road-map
Abstract: Since 2003, Cray, our customers and the wider HPC community have developed Lustre as a key technology component for our success. In order to ensure that Lustre will continue to grow and develop, Cray has played a founding role, with other leaders in the HPC community, in launching OpenSFS and has joined the two other open Lustre consortia, HPCFS and EOFSCS. Cray plans to incorporate new Lustre features, produced through the efforts of these consortia and their member companies, into its products. This paper will lay out the support model and new software release details for Cray's use of Lustre in CLE and esFS in 2011, 2012, and beyond. |
Author(s):
Spitz, Cory, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Mass Storage
|
|
Title: Prospects for truly asynchronous communication with pure MPI and hybrid MPI/OpenMP on current supercomputing platforms
Abstract: We investigate the ability of MPI implementations to perform truly asynchronous communication with nonblocking point-to-point calls on current highly parallel systems, including the Cray XT and XE series. For cases where no automatic overlap of communication with computation is available, we demonstrate several different ways of establishing explicitly asynchronous communication by variants of functional decomposition using OpenMP threads or tasks, implement these methods in application codes, and show the resulting performance benefits. The impact of node topology and the possible use of simultaneous multithreading (SMT) are studied in detail. |
Author(s):
Hager, Georg, Presenter Erlangen Regional Computing Center
Keller, Rainer High Performance Computing Center Stuttgart (HLRS)
Zeiser, Thomas Erlangen Regional Computing Center
Habich, Johannes Erlangen Regional Computing Center
Schoenemeyer, Thomas CSCS - Swiss National Supercomputing Centre (CSCS)
Wellein, Gerhard Erlangen Regional Computing Center
|
Suggested Technical Category:
User Code Optimization
|
|
Title: The Hopper System: How the largest XE6 in the world went from requirements to reality
Abstract: This paper will discuss the entire process of acquiring and deploying Hopper from the first vendor market surveys to providing 3.8 million hours of production cycles per day for NERSC users. Installing the latest system at NERSC has been both a logistical and technical adventure. Balancing compute requirements with power, cooling, and space limitations drove the initial choice and configuration of the XE6, and a number of first-of-a-kind features implemented in collaboration with Cray have resulted in a high performance, usable, and reliable system. |
Author(s):
Carter, Jonathan, Presenter National Energy Research Scientific Computing Center (NERSC)
Butler, Tina National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Facilities and Site Prep
|
|
Title: Performance characterization and implications for magnetic fusion co-design applications
Abstract: Co-design in high performance computing is a process that tightly couples applications and computer hardware architecture aimed at developing designs for the exascale level. In order to effectively perform the co-design process, it is important to study the current performance of the targeted applications. In this paper we present detailed benchmarking results for a set of magnetic fusion applications with a wide variety of underlying mathematical models including a particle code, and grid-based codes requiring both implicit and explicit numerical solvers. The analysis focuses on profiling these codes in terms of critical performance characteristics, which include such metrics as scalability, memory/network bandwidth limitations, communication versus computation and I/O versus computation. We compare and describe the available tools for performing this sort of study. The magnetic fusion codes represent a suite of applications that were selected as part of the co-design effort. Results are given for Cray XT4 and XE6 platforms.
KEYWORDS: HPC, exascale computing, applications, benchmarking, profiling, performance characterization |
Author(s):
Narayanan, Praveen, Presenter National Energy Research Scientific Computing Center (NERSC)
Koniges, Alice, Presenter National Energy Research Scientific Computing Center (NERSC)
Oliker, Leonid National Energy Research Scientific Computing Center (NERSC)
Preissl, Robert National Energy Research Scientific Computing Center (NERSC)
Williams, Samuel National Climate-Computing Research Center (NCRC)
Wright, Nicholas National Energy Research Scientific Computing Center (NERSC)
Ethier, Stephane Princeton Plasma Physics Laboratory
Wang, Weixing Princeton Plasma Physics Laboratory
Umansky, Maxim Lawrence Livermore National Laboratory
Xu, Xueqiao Lawrence Livermore National Laboratory
Candy, Jeff General Atomics
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Topology, Bandwidth and Performance: A New Approach in Linear Orderings for Application Placement in a 3D Torus
Abstract: Application performance was improved in Cray XT supercomputers using node ordering with a simple, one-dimensional allocation strategy, but the improvements depended on the sizes and shapes of systems and applications. A new “thicker” ordering was therefore developed to improve the bisection bandwidth of placed applications, both large and small. The Cray XE line provides yet another challenge: speeds differ depending on the axis travelled, which results in different bisection bandwidth characteristics. To meet this challenge, an enhanced ordering was developed which varies with system size, benefiting a wide range of applications, all without user input. This paper summarizes the approach to placement that the Cray Application Level Placement Scheduler (ALPS) now offers based on the underlying node topology, the reasons for this approach, and the variations that sites can choose to optimize for their specific machines. |
Author(s):
Albing, Carl, Presenter Cray Inc. (CRAY)
Troullier, Norm Cray Inc. (CRAY)
Whalen, Stephen Cray Inc. (CRAY)
Olson, Ryan Cray Inc. (CRAY)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Tips and tricks for diagnosing Lustre problems on Cray systems
Abstract: As a distributed parallel file system, Lustre is prone to many failure modes. The manner in which it breaks can make diagnosis and serviceability difficult. Cray deploys Lustre file systems at extreme scales, which compounds the difficulties. This paper discusses tips and tricks for diagnosing and correcting Lustre problems for both CLE and esFS installations. It will cover common failure scenarios including node crashes, deadlocks, hardware faults, communication failures, scaling problems, performance issues, and routing problems. Lustre issues specific to Cray Gemini networks are addressed as well. |
Author(s):
Spitz, Cory, Presenter Cray Inc. (CRAY)
Koehler, Ann, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Mass Storage
|
|
Title: The Chapel Tasking Layer Over Qthreads
Abstract: The Chapel compiler provides an abstraction of compute tasks that allows for the use of external libraries to provide the task management functionality. This paper describes the experiences and insights gained in porting Chapel to use the qthread lightweight threading library for task management. |
Author(s):
Wheeler, Kyle, Presenter Sandia National Laboratories (SNLA)
Chamberlain, Brad Cray Inc. (CRAY)
Murphy, Richard Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Compilers
|
|
Title: Targeting AVX-enabled processors using PGI Compilers and Tools
Abstract: AMD and Intel will release new microprocessors in 2011 based on the extended AVX architecture. In this paper we will show examples of compiler code generation and new library and tools capabilities which support these new processors. Performance data comparing the new platform vs. previous generations will also be included. |
Author(s):
Leback, Brent, Presenter The Portland Group
|
Suggested Technical Category:
Compilers
|
|
Title: Porting solvers to a multi-processor, multi-core, multi-GPU platform using the PGI Accelerator Model and PGI CUDA Fortran
Abstract: In this paper we present work and results from a port of a standard solver library to a heterogeneous computing platform, using a minimally-intrusive set of accelerator directives and language extensions. Results comparing alternative coding solutions are given. |
Author(s):
Leback, Brent, Presenter The Portland Group
|
Suggested Technical Category:
Compilers
|
|
Title: Real-Time System Log Monitoring/Analytics Framework
Abstract: Analyzing system logs provides useful insights for identifying system/application anomalies and helps in better usage of system resources. Nevertheless, it is simply not practical to scan through the raw log messages on a regular basis for large-scale systems. First, the sheer volume of unstructured log messages affects readability; second, correlating the log messages to system events is a daunting task. These factors limit the use of large-scale system logs primarily to generating alerts on known system events and to post-mortem diagnosis for identifying previously unknown system events that impacted the system's performance. In this paper, we describe a log monitoring framework that enables prompt analysis of system events in real time. Our web-based framework provides a summarized view of console, netwatch, consumer, and apsched logs in real time. The logs are parsed and processed to generate views by application, message type, individual compute node or group of compute nodes, and section of the compute platform. Also, from past application runs we build a statistical profile of user/application characteristics with respect to known system events, recoverable/non-recoverable error messages and resources utilized. The web-based tool is developed for Jaguar XT5 at the Oak Ridge Leadership Computing Facility. |
Author(s):
Gunasekaran, Raghul Oak Ridge National Laboratory (ORNL)
Park, Byung Oak Ridge National Laboratory (ORNL)
Dillow, David Oak Ridge National Laboratory (ORNL)
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
Geist, Al Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
System Operations
|
|
Title: Deployment and Implementation of a Workflow-Oriented Data Storage and Transfer Framework
Abstract: Many emerging compute environments increasingly handle complex task workflows utilizing geographically separate resources. These resources may include data sources, compute platforms, pre- or post-processing platforms, and archive locations. The National Climate-Computing Research Center's Gaea deployment includes multiple Lustre filesystems and workflow-integrated data management subsystems. This paper details Gaea's storage system architecture, deployment, implementation and early experiences in support of its workflow software. |
Author(s):
Fuller, Douglas, Presenter National Climate-Computing Research Center (NCRC)
|
Suggested Technical Category:
Mass Storage
|
|
Title: High Performance Network Intrusion Detection in the HPC environment
Abstract: Today's high-performance, high-bandwidth systems require innovative approaches in order to provide effective Network Intrusion Detection. The National Energy Research Scientific Computing Center utilizes the advanced Intrusion Detection System, Bro, to detect and/or deflect network attacks which may threaten the open computing environment. We describe several novel approaches to security issues, such as intrusion detection on encrypted streams and clustering of IDS nodes to accommodate high-bandwidth network requirements, as well as future challenges in this area. |
Author(s):
Mellander, Jim, Presenter National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Operations
|
|
Title: Memphis on an XT5
Abstract: Memphis is a tool that makes use of Instruction Based Sampling (IBS)
hardware counters, available in recent AMD processors, to help
pinpoint the sources of memory system performance problems. This
presentation will describe our experiences porting Memphis to a test
XT5 system at ORNL, including modifications required by Compute Node
Linux to the kernel module that interfaces with IBS, and low impact
modifications to the batch queue that enable the module's use at
runtime. The presentation will also include results from running
production applications, such as CAM/HOMME, through Memphis on the
XT5, highlighting performance problems pointed out by the tool.
|
Author(s):
McCurdy, Collin, Presenter Oak Ridge National Laboratory (ORNL)
Vetter, Jeffrey Oak Ridge National Laboratory (ORNL)
Worley, Patrick Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tools
|
|
Title: Cosmic Microwave Background Data Analysis at the Peta-Scale and Beyond
Abstract: The analysis of Cosmic Microwave Background (CMB) data is an ongoing high performance computing challenge. For more than a decade now the size of CMB data sets has tracked Moore's Law, and we expect this to continue for at least the next 15 years. In this talk we will review the work done to date to follow this scaling, and discuss the steps we are taking to continue to do so to the peta-scale and beyond. |
Author(s):
Borrill, Julian, Presenter National Energy Research Scientific Computing Center (NERSC)
Cantalupo, Christopher National Energy Research Scientific Computing Center (NERSC)
Kisner, Theodore National Energy Research Scientific Computing Center (NERSC)
Stompor, Radek National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Software's One-sided Challenge
Abstract: All Cray machines of recent times have scaled really well for MPI
applications. The XE is no exception. However, introducing the PGAS
programming model with direct user access to the network interface
hardware required many changes and unprecedented cooperation between a lot of different functional development groups.
This paper describes how a team was able to deal with a number of
challenging design and implementation issues, some over a very short
period of time, in developing, testing, and releasing the software to
support the rollout of the new system.
|
Author(s):
Kaplan, Larry, Presenter Cray Inc. (CRAY)
Froese, Edwin Cray Inc. (CRAY)
Godfrey, Forest Cray Inc. (CRAY)
Gorodetsky, Igor Cray Inc. (CRAY)
Johns, Chris Cray Inc. (CRAY)
Kelly, Matt Cray Inc. (CRAY)
Shields, Brent Cray Inc. (CRAY)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Multi-TB file IO for Semantic Databases on the XMT
Abstract: Cray has developed an I/O library for the XMT that allows applications to store and restore multi-terabyte files rapidly and accurately. The newest version of this library features dynamically scaled transfers which maximize the speed of large transfers and minimize the setup time of small ones. A new daemon facilitates these features without any user intervention. Based on the elements of this design, Cray expects to be offering new approaches to hybrid computational environments using the XMT.
|
Author(s):
Waters, Rolland, Presenter Cray Inc. (CRAY)
Lund, Eric Cray Inc. (CRAY)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Cielo Full-System Simulations of Multi-beam Laser-Plasma Interaction in NIF Experiments
Abstract: pF3D simulates laser-plasma interactions in experiments at the National Ignition Facility (NIF), the home of the world's most powerful laser. Simulations of recent NIF experiments require 100 billion zones and have been run on Cielo, a Cray XE6 with a Panasas parallel file system located at Los Alamos National Laboratory. This paper compares the techniques used to obtain good I/O and message passing scaling in ~100,000 processor runs on Cielo and on BlueGene/P systems with Lustre and GPFS parallel file systems. |
Author(s):
Langer, Steven, Presenter Lawrence Livermore National Laboratory
Still, Bert Lawrence Livermore National Laboratory
Hinkel, Denise Lawrence Livermore National Laboratory
Williams, Ed Lawrence Livermore National Laboratory
Gamblin, Todd Lawrence Livermore National Laboratory
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Update on Lustre
Abstract: Whamcloud is a startup focused on high-end filesystems: Lustre on Linux for HPC. The talk will introduce the company, its offerings and discuss the roadmap for continuing development in a vendor neutral manner for the open source technology. |
Author(s):
Gorda, Brent, Presenter Whamcloud
|
Suggested Technical Category:
Mass Storage
|
|
Title: DVS and DSL Implementation at NCAR
Abstract: The presentation will detail NCAR's experience with setting up DVS (Data Virtualization Service) and DSL (Dynamic Shared Libraries) on Lynx, NCAR's Cray XT5m system.
Driven by local users' needs at NCAR, we implemented GPFS shared file
systems from external servers to the XT5m system using DVS. By setting up four service nodes as DVS servers in stripe-parallel mode, we were able to mount several GPFS file systems and an external Lustre file system, making all of them available on compute nodes as well as on service nodes.
We also implemented DSL support on our Cray XT5m. Lynx allows
dynamically linked applications with shared libraries to run on the
compute nodes. This also allows us to define and use our own DSOs
(Dynamic Shared Objects).
|
Author(s):
Elahi, Irfan, Presenter National Center for Atmospheric Research (NCAR)
Heo, Junseong National Center for Atmospheric Research (NCAR)
|
Suggested Technical Category:
Architecture
|
|
Title: The Dynamics of Turbulent Transport
Abstract: One of the great open problems in physics is a fundamental understanding
of turbulence and turbulent transport. The issues involved in this have
underpinned a number of “Grand Challenge” scale problems from climate to
plasmas. One of the difficulties in studying turbulent transport is the
extreme multi-scale nature of the problem to be solved. In the past few
years a great deal of progress has been made on understanding turbulent
transport in the presence of flows using computational methods. This
new understanding includes classes of non-diffusive transport which have
implications for all areas that model turbulent transport as a diffusive
process. To continue to improve our ability to address the multi-scale
nature of the problem, the Parareal technique (parallelization in time)
has been applied to turbulence models. The implication of these
advances on transport modeling and large scale turbulence simulations
will be discussed, as will the application of the Parareal technique to
turbulence. This work has been performed with the support of DOE-OFES
grants and computational support from ARSC. |
Author(s):
Newman, David Arctic Region Supercomputing Center (ARSC) University of Alaska Fairbanks
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Keynote.
|
|
Title: Porting a Particle Transport code to GPGPU using hybrid MPI / OpenMP / CUDA programming models
Abstract: This paper will present the work, undertaken in collaboration with EPCC, required to port AWE’s benchmark code, Chimaera, to an NVIDIA Fermi GPU using double-precision arithmetic and hybrid MPI / OpenMP / CUDA programming. It will describe the algorithmic changes and the performance results obtained from both the original evaluation project and the final ported application code using CUDA and OpenMP. |
Author(s):
Pringle, Gavin EPCC (EPCC)
Barrett, Dave AWE PLC (AWE)
Bell, Ron EPCC (EPCC)
Hepwood, Claire, Presenter AWE PLC (AWE)
Johnston, Chris, Presenter EPCC (EPCC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Greenland ice sheet flow computations: scaling-up to high spatial
resolution and fast time-scale boundary processes
Abstract: Scientists need physics-based models which connect warming in the
polar regions to the behavior of ice sheets---especially their sea
level contribution---but understanding of ice flow dynamics remains
limited. Models which connect hard-to-observe local processes to
global flow consequences are not as mature as for other climate
components. Fortunately, ice sheet modeling is growing up. Modelers
are converging on effective tools to make high resolution
multi-physics simulations on supercomputers actually useful. I'll
address these challenges for a Greenland ice sheet model: Does 1 km
ice-sheet-wide resolution resolve fast flow in 5 km wide fjords? Are
long-distance stress transmissions in floating or well-lubricated ice,
and the fast time-scale processes at its boundary, modeled well enough
to capture observed ice sheet changes? (Can we solve such huge linear
systems at every timestep?) What is the physical limit on the speed
of flowing ice?
|
Author(s):
Bueler, Ed, Presenter Arctic Region Supercomputing Center (ARSC) University of Alaska Fairbanks, Mathematical Sciences Dept.
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Invited Paper
|
|
Title: PBS Plug-ins: a Run-time Environment for Agility and Innovation
Abstract: Nitzberg's definition of HPC is "computing that requires pushing the limits of today's technology just beyond where it works well". HPC combines leading-edge hardware, operating systems, networks, enterprise utilities, applications, and more. The key to making HPC work is agility. PBS plug-ins allow you to quickly and easily integrate, extend, and customize PBS Professional to meet the unique requirements and constantly-evolving demands in your enterprise. Plus, by providing a well-defined, modular platform, standardizing on Python "hooks", and embedding it everywhere, our run-time environment enables sharing and reusing innovative plug-ins throughout the PBS community. This presentation provides a technical look at PBS plug-ins, including real examples (yes, code) to get you started. |
Author(s):
Nitzberg, Bill, Presenter Altair
|
Suggested Technical Category:
System Operations
|
|
Title: Information Environment in JAIST
Abstract: The Center for Information Science supports a world-class research and educational environment for its users by providing a high-speed, advanced information environment. A high-speed, high-availability network provides the foundation for the high performance file servers, massively parallel computers, and various servers that have enabled JAIST, since its foundation, to continuously provide users with a convenient information environment. I would like to introduce the information environment at JAIST and my research on bio-fluid mechanics. |
Author(s):
Matsuzawa, Teruo, Presenter Japan Advanced Institute of Science and Technology (JAIST)
|
Suggested Technical Category:
System Operations
|
|
Title: Cray goes Bright for HPC Services
Abstract: Bright Cluster Manager provides complete, end-to-end cluster management in one integrated solution: deployment, provisioning, monitoring, and management. Its intuitive GUI provides complete system visibility and ease of use for multiple clusters simultaneously; its powerful cluster shell enables automated tasks and intervention. Bright scales from desk-side to TOP500 installations. Cray Custom Engineering has pioneered the use of Bright Cluster Manager® for external HPC systems: large-scale Lustre file systems, login servers, data movers, pre- and post-processing servers. Cray has also leveraged Bright to create additional services. This presentation is an overview of Bright Cluster Manager and its capabilities. |
Author(s):
van Leeuwen, Matthijs, Presenter Bright Computing
|
Suggested Technical Category:
System Operations
|
|
Title: Using Platform LSF Workload Scheduler with CLE
Abstract: The Platform LSF 8.0 Workload Scheduler is now available on the Cray Linux Environment (CLE). In addition to offering advanced scheduling capabilities, Platform LSF supports Cluster Compatibility Mode (CCM). In this session, we will present an overview of the integration of Platform LSF with the Cray Linux Environment and give several examples to illustrate the scheduling capabilities of LSF on Cray XT/XE and XTm/XEm systems. |
Author(s):
Bozzo-Rey, Mehdi, Presenter Platform Computing (Gold Sponsor)
|
Suggested Technical Category:
System Operations
|
|
Title: Scheduling a 100,000 Core Supercomputer for Maximum Utilization and Capability
Abstract: In late 2009, the National Institute for Computational Sciences placed in production the world's fastest academic supercomputer (third overall), a Cray XT5 named Kraken, with almost 100,000 compute cores and a peak speed in excess of one Petaflop. Delivering over 50% of the total cycles available to the National Science Foundation users via the TeraGrid, Kraken has two missions that have historically proven difficult to reconcile: providing the maximum number of total cycles to the community, while enabling full machine runs for “hero” users. Historically, this has been attempted by allowing schedulers to choose the correct time for the beginning of large jobs, with a concomitant reduction in utilization. At NICS, we used the results of a previous theoretical investigation to adopt a different approach, where the “clearing out” of the system is forced on a weekly basis, followed by consecutive full machine runs. As our previous simulation results suggested, this led to a significant improvement in utilization, to over 90%. The difference in utilization between the traditional and adopted scheduling policies was the equivalent of a 300+ Teraflop supercomputer, or several million dollars of compute time per year. |
Author(s):
Kovatch, Patricia, Presenter National Institute for Computational Sciences (NICS)
Andrews, Phil National Institute for Computational Sciences (NICS)
Hazlewood, Victor National Institute for Computational Sciences (NICS)
Baer, Troy National Institute for Computational Sciences (NICS)
Ezell, Matt National Institute for Computational Sciences (NICS)
Braby, Ryan National Institute for Computational Sciences (NICS)
Brook, Glenn National Institute for Computational Sciences (NICS)
Whitt, Justin National Institute for Computational Sciences (NICS)
Samuel, Tabitha National Institute for Computational Sciences (NICS)
Crosby, Lonnie National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Operations
|
|
Title: Storage Adventures At Petascale & Beyond
Abstract: The first Petaflop system was put into production in 2008. By 2011, several systems in the tens of Petaflops will be deployed. The race to Exascale has begun and the face of storage will change dramatically as we push new boundaries of I/O and archive scalability. Very large compute systems are creating unprecedented challenges for the storage systems that support them. This talk will explore experiences and concepts which DDN has recently gained through deploying many of the world's fastest HPC file systems, including:
* Maximizing cluster performance over the aggregate of the production lifespan, not just during benchmarking
* Exploiting NAND-based storage devices for intelligent storage tiering, buffering and automatic I/O path acceleration
* Data protection technologies which eliminate silent data corruption across scalable storage pools |
Author(s):
Miller, Keith, Presenter DataDirect Networks
|
Suggested Technical Category:
Architecture
|
|
Title: Scheduling Multi-Petaflop Systems and New Technologies
Abstract: New multi-petaflop HPC systems present greater scheduling challenges to their effective utilization. New technologies, such as GPGPUs, likewise present scheduling difficulties. The Moab scheduler has recent enhancements that address scheduling larger quantities of nodes, cores, and new resource types. This presentation introduces the latest Moab enhancements that make effective resource scheduling easier and more efficient. |
Author(s):
Brown, Gary, Presenter
|
Suggested Technical Category:
System Operations
|