|
Title: Cray OS Road Map
Abstract: This paper will discuss Cray's operating system road map. This includes the compute node OS, the service node OS, the network stack, file systems, and administrative tools. Coming changes will be previewed, and themes of future releases will be discussed. |
Author(s):
Carroll, Charlie, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
System Operations
|
|
Title: A Pedagogical Approach to User Assistance
Abstract: This presentation will focus on a pedagogical approach to providing user assistance. By making user education the central theme in training, outreach, and user assistance activities, a set of competencies can be developed that encompasses the knowledge required for productive use of leadership-class computing resources such as the Cray XT5 Jaguar system. |
Author(s):
Whitten Jr., Robert, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Consulting
|
|
Title: A Scalable Boundary Adjusting High-Resolution Technique For Turbulent Flows
Abstract: To accurately resolve turbulent flow structures, high-fidelity simulations require the use of millions of grid points. The Compact Accurately Boundary Adjusting high-Resolution Technique (CABARET) is capable of producing accurate results with at least 10 times more efficiency than conventional schemes. CABARET is based on a local second-order finite difference scheme which lends itself extremely well to large scale distributed systems. For Reynolds numbers of 10^4 the method gives rapid convergence without requiring additional preconditioning for Mach numbers as low as 0.05. In this paper we shall discuss the implementation and performance of the CABARET method on the HECToR XT4/6 system. We shall describe the development and optimization of an irregular parallel decomposition for the hexahedral numerical grid structure. Scalability of the code will be discussed in relation to i) the effectiveness of the load balancing for grids generated from the partitioning method, ii) compiler performance, and iii) efficient use of MPI and memory utilisation. |
Author(s):
Karabasov, Sergey University of Cambridge, Department of Engineering, Division of Turbomachinery
Ridley, Phil, Presenter HPCX Consortium (HPCX)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Optimisation of the I/O for Distributed Data Molecular Dynamics Applications
Abstract: With the increase in size of HPC facilities it is not only the parallel performance of applications that is preventing greater exploitation; in many cases it is the I/O which is the bottleneck. This is especially the case for distributed data algorithms. In this paper we will discuss how the I/O in the distributed data molecular dynamics application DL_POLY_3 has been optimised. In particular we shall show that extensive data redistribution specifically to allow best use of the I/O subsystem can result in a code that scales to many more processors, despite the large increase in communications required. |
Author(s):
Smith, William HPCX Consortium (HPCX) Bill.Smith@stfc.ac.uk
Todorov, Ilian HPCX Consortium (HPCX) Ilian.Todorov@stfc.ac.uk
Bush, Ian, Presenter HECToR
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Automatic Library Tracking Database
Abstract: The National Institute for Computational Sciences and the National Center for Computational Sciences (both located at Oak Ridge National Laboratory) have been working on an automatic library tracking database whose purpose is to track which libraries are used on their Cray XT5 Supercomputers. The database stores the libraries that are used at link time and it records which executable is run during a batch job. With this data, many operationally important questions can be answered like which libraries are most frequently used and who is using deprecated libraries or applications. The infrastructure design and reporting mechanisms will be presented with production data to this point. |
Author(s):
Jones, Nicholas National Institute for Computational Sciences (NICS)
Fahey, Mark, Presenter National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Libraries
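The link-time and run-time records described in the abstract above can be pictured with a small relational sketch. This is not the NICS/NCCS implementation: the table names (`link_records`, `job_records`), columns, and helper functions are hypothetical and are chosen only to show how the two kinds of record could be joined to answer "which libraries are used most" and "who is using a given library".

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE link_records (   -- one row per library seen at link time
            executable TEXT NOT NULL,
            library    TEXT NOT NULL
        );
        CREATE TABLE job_records (    -- one row per executable launch in a batch job
            job_id     TEXT NOT NULL,
            username   TEXT NOT NULL,
            executable TEXT NOT NULL
        );
    """)
    return db

def most_used_libraries(db):
    """Rank libraries by the number of job launches whose executable linked them."""
    return db.execute("""
        SELECT l.library, COUNT(*) AS launches
        FROM job_records AS j
        JOIN link_records AS l ON l.executable = j.executable
        GROUP BY l.library
        ORDER BY launches DESC, l.library ASC
    """).fetchall()

def users_of(db, library):
    """List the users who ran any executable linked against `library`."""
    rows = db.execute("""
        SELECT DISTINCT j.username
        FROM job_records AS j
        JOIN link_records AS l ON l.executable = j.executable
        WHERE l.library = ?
        ORDER BY j.username
    """, (library,)).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    db = build_demo_db()
    # Made-up sample rows, for illustration only.
    db.executemany("INSERT INTO link_records VALUES (?, ?)",
                   [("a.out", "libfoo"), ("a.out", "libbar"), ("b.out", "libbar")])
    db.executemany("INSERT INTO job_records VALUES (?, ?, ?)",
                   [("1", "alice", "a.out"), ("2", "bob", "b.out"), ("3", "alice", "b.out")])
    print(most_used_libraries(db))
    print(users_of(db, "libfoo"))
```

Keeping link-time and run-time data in separate tables means a library is counted once per job launch rather than once per build, which is the more useful measure when deciding whether a deprecated library can be retired.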
|
|
Title: DMAPP—An API for One-sided Program Models on Baker Systems
Abstract: Baker Systems and follow-on systems will deliver a network with advanced remote memory access capabilities. A new API (DMAPP) has been developed to expose these capabilities to one-sided program models. This paper presents the DMAPP API as well as some preliminary performance data. |
Author(s):
ten Bruggencate, Monika, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Using Quality of Service for Scheduling on Cray XT Systems
Abstract: The University of Tennessee's National Institute for Computational Sciences (NICS) operates two Cray XT systems for the U.S. National Science Foundation (NSF): Kraken, an 88-cabinet XT5 system, and Athena, a 48-cabinet XT4 system. Access to Kraken is allocated through the NSF's Teragrid allocations process, while Athena is currently being dedicated to individual projects on a quarterly basis; as a result, the two systems have somewhat different scheduling goals. However, user projects on both systems have sometimes required the use of quality of service (QoS) levels for scheduling of certain sets of jobs. We will present case studies of three situations where QoS levels were used to fulfill specific requirements: two on Kraken in fully allocated production service, and one on Athena while dedicated to an individual project. These case studies will include lessons learned about impact on other users and unintended side effects. |
Author(s):
Baer, Troy, Presenter National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Operations
|
|
Title: Use of the Cray XT5 Architecture to Push the Limits of WRF Beyond One Billion Gridpoints
Abstract: The Arctic Region Supercomputing Center (ARSC) Weather Research and Forecasting (WRF) model benchmark suite continues to push software and available hardware limits by successfully running a 1km resolution case study composed of more than one billion grid points. Simulations of this caliber are important for providing detailed weather forecasts over the rugged Alaska terrain and are intended for benchmarking on systems with tens of thousands of cores. In pursuing these large scale simulations, we have incurred numerical, software and hardware limitations that have required us to use various parallel I/O schemes and to explore different PBS "aprun" options. In this paper we will discuss issues encountered while gradually expanding the problem sizes in which WRF can operate and our solutions in running high resolution and/or large-scale WRF simulations on the Cray XT5 architecture. |
Author(s):
Nudson, Oralee, Presenter Arctic Region Supercomputing Center (ARSC)
Morton, Dr. Don Arctic Region Supercomputing Center (ARSC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: MRNet: A Scalable Infrastructure for development of parallel tools and applications
Abstract: MRNet is a customizable, high-throughput communication software system for parallel tools and applications. It reduces the cost of these tools' activities by incorporating a tree-based overlay network (TBON) of processes between the tool's front-end and back-ends. MRNet was recently ported and released for Cray XT systems. In this talk we describe the main features that make MRNet well-suited as a general facility for building scalable parallel tools. We present our experiences with MRNet and examples of its use. |
Author(s):
Miller, Barton, Presenter University of Wisconsin
Roth, Philip Oak Ridge National Laboratory (ORNL)
DeRose, Luiz Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
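The tree-based overlay network idea in the abstract above can be illustrated with a single-process sketch. It does not use or imitate the MRNet API; `tree_reduce`, `fanout`, and `combine` are hypothetical names. The point is only that when each internal process combines the outputs of a bounded number of children, the front-end receives one value after a number of levels that grows logarithmically with the number of back-ends.

```python
def tree_reduce(values, fanout, combine):
    """Reduce back-end values level by level; return (result, number of levels)."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not values:
        raise ValueError("need at least one back-end value")
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each internal process combines the outputs of up to `fanout` children.
        next_level = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            acc = group[0]
            for v in group[1:]:
                acc = combine(acc, v)
            next_level.append(acc)
        level = next_level
        levels += 1
    return level[0], levels

if __name__ == "__main__":
    # 1000 simulated back-ends, each internal process has at most 10 children.
    print(tree_reduce(list(range(1000)), 10, max))
```

With a fan-out of 10, 1000 back-end values reach the front-end in three levels, and no process ever handles more than 10 inputs at once.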
|
|
Title: Hierarchy Aware Blocking and Nonblocking Collective Communications-The Effects of Shared Memory and Torus Topologies in the Cray XT5 environment
Abstract: MPI Collective operations tend to play a large role in limiting the scalability of high-performance scientific simulation codes. As such, developing methods for improving the scalability of these operations is critical to improving the scalability of such applications. Using infrastructure recently developed in the context of the FASTOS program we will study the performance of blocking collective operations, as well as those of the recently added MPI nonblocking collective operations taking into account both shared memory and network topologies. |
Author(s):
Graham, Richard, Presenter Oak Ridge National Laboratory (ORNL)
Ladd, Joshua Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Networking
|
|
Title: 2DECOMP&FFT—A Highly Scalable 2D Decomposition Library and FFT Interface
Abstract: As part of a HECToR distributed CSE support project, a general-purpose 2D decomposition (also known as 'pencil' or 'drawer' decomposition) communication library has been developed. This Fortran library provides a powerful and flexible framework to build applications based on 3D Cartesian data structures and spatially implicit numerical schemes (such as the compact finite difference method or the spectral method). The library also supports shared-memory architectures, which are becoming increasingly popular. A user-friendly FFT interface has been built on top of the communication library to perform distributed multi-dimensional FFTs. Both the decomposition library and the FFT interface scale well to tens of thousands of cores on Cray XT systems. The library has been applied to Incompact3D, a CFD application performing large-scale Direct Numerical Simulations of turbulence, enabling exciting scientific studies to be conducted. |
Author(s):
Li, Ning, Presenter HECToR
Laizet, Sylvain Imperial College London
|
Suggested Technical Category:
Libraries
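The 2D ('pencil') decomposition named in the abstract above can be sketched in a few lines. This is not the 2DECOMP&FFT interface (the library itself is Fortran); the function names are hypothetical, and the choice to split y over the process rows and z over the process columns is an assumption made only for illustration.

```python
def split(n, parts):
    """Split n points into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]

def x_pencil_shape(nx, ny, nz, p_row, p_col, row, col):
    """Local array shape for rank (row, col) when pencils are aligned with x.

    x stays whole on every rank; y is split over p_row and z over p_col
    (an assumed convention for this sketch).
    """
    return (nx, split(ny, p_row)[row], split(nz, p_col)[col])

if __name__ == "__main__":
    # A 64^3 grid on a 4 x 8 process grid: every rank holds a full line in x.
    print(x_pencil_shape(64, 64, 64, 4, 8, 0, 0))
```

Because one dimension is always held whole on each rank, operations along that dimension (a 1D FFT or an implicit finite-difference solve) need no communication; the data are then transposed so that another dimension becomes local.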
|
|
Title: Mixed Mode computation in CASINO
Abstract: CASINO is a quantum Monte Carlo code that solves the many-particle Schroedinger equation with the help of configurations of random walkers. This method is suitable for parallel computation because it has a very good computation/communication ratio. The standard parallel algorithm increases the computation speed by distributing the configurations equally among the available processors. For a computation with P processing elements the computation time for Nc configurations is proportional to Nc*tc/P, where tc is the average time taken for one configuration step. On petascale computers one can have more processing elements than configurations and, besides that, for models with more than 1000 electrons tc increases significantly. We present a mixed mode implementation of CASINO that takes advantage of the architectures with large numbers of multicore processors to improve computation speed by using multiple OpenMP threads for the computation of each configuration step. |
Author(s):
Anton, Lucian, Presenter HPCX Consortium (HPCX)
Alfe, Dario University College London, London, UK
|
Suggested Technical Category:
User Code Optimization
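The scaling argument in the abstract above (time proportional to Nc*tc/P, with no further gain once P exceeds Nc) can be made concrete with a toy cost model. This is not taken from CASINO: the function, its parameters, the use of a ceiling for the busiest rank, and the linear thread-speedup model controlled by `thread_efficiency` are all assumptions made for illustration.

```python
import math

def step_time(n_configs, t_config, n_cores, threads_per_rank=1, thread_efficiency=1.0):
    """Estimated wall time for one step over all configurations.

    Each MPI rank owns whole configurations; the threads of a rank share the
    work of one configuration step. All parameters are illustrative.
    """
    ranks = n_cores // threads_per_rank
    if ranks < 1:
        raise ValueError("need at least one rank")
    # The busiest rank determines the wall time.
    configs_per_rank = math.ceil(n_configs / ranks)
    # Assumed model: each extra thread contributes `thread_efficiency` of a core.
    thread_speedup = 1 + (threads_per_rank - 1) * thread_efficiency
    return configs_per_rank * t_config / thread_speedup

if __name__ == "__main__":
    # 1000 configurations on 4000 cores: pure MPI leaves 3000 cores idle.
    print(step_time(1000, 1.0, 4000))
    # Four threads per rank put the otherwise idle cores to work.
    print(step_time(1000, 1.0, 4000, threads_per_rank=4))
```

When P divides Nc and one thread per rank is used, the model reduces to Nc*tc/P as stated in the abstract. Real thread efficiency will be below 1, so the second figure is an upper bound on the benefit.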
|
|
Title: Regression Testing on Petaflop Computational Resources
Abstract: As the complexity of supercomputers increases, it is becoming more difficult to measure how system performance changes over time. Routine system checks performed after scheduled maintenance or emergency downtime give administrators an instantaneous glimpse of system performance; however, rigorous testing, such as that performed for machine acceptance, provides more in-depth information on system performance. Both routine and rigorous testing are necessary to fully characterize system performance, and a mechanism to store and compare previous results is needed to determine the change in system performance over time. A regression testing framework has been developed at the National Institute for Computational Sciences (NICS) which provides a mechanism to measure the change in system performance over time. These performance results can also be correlated to system events such as downtimes, system upgrades, or any other documented system change. We will describe the design and implementation of the regression testing framework, including the development of test suites, interfaces to the batch system, and the extraction of performance data. The import of extracted data into a relational database for long-term storage, report generation, and real-time analysis will also be discussed. |
Author(s):
McCarty, Mike, Presenter National Institute for Computational Sciences (NICS)
Baer, Troy National Institute for Computational Sciences (NICS)
Crosby, Lonnie National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
System Operations
|
|
Title: Combining Open MP and MPI within GLOMAP Mode to Take Advantage of Multiple Core Processors: An Example of Legacy Software Keeping Pace with Hardware Developments
Abstract: The MPI version of GLOMAP MODE is being used in production runs for research into atmospheric science. The memory requirement prohibits use of high resolution scenarios so 32 MPI tasks is the usual decomposition. One way to attempt higher resolution simulations is to under-populate the nodes, making more memory available per MPI task. Although this is wasteful of resource, it does provide a shorter time per existing simulation. The NAG Ltd DCSE service has examined the code and introduced Open MP so that the otherwise "idle" cores can contribute to the MPI task. This improves the performance so that the additional cost of a simulation is reduced. |
Author(s):
Richardson, Mark, Presenter HPCX Consortium (HPCX) Numerical Algorithms Group
Mann, Graham HPCX Consortium (HPCX) University of Leeds
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Improving the Performance of CP2K on the Cray XT
Abstract: CP2K is a freely available and increasingly popular Density Functional Theory code for the simulation of a wide range of systems. It is heavily used on many Cray XT systems, including 'HECToR' in the UK and 'Monte Rosa' in Switzerland. We describe performance optimisations made to the code in several key areas, including 3D Fourier Transforms, and present the implementation of a load balancing scheme for multi-grids. These result in performance gains of around 30% on 256 cores (for a generally representative benchmark) and up to 300% on 1024 cores (for non-homogeneous systems). Early results from the implementation of hybrid MPI/OpenMP parallelism in the code are also presented. |
Author(s):
Bethune, Iain, Presenter Edinburgh Parallel Computing Centre (EPCC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: The NEMO Ocean Modelling Code: A Case Study
Abstract: We present a case study of a popular ocean modelling code, NEMO, on the Cray XT4 HECToR system. HECToR is the UK's high-end computing resource for academic users. Two different versions of NEMO have been investigated. The performance and scaling of the code has been evaluated and optimised by investigating the choice of grid dimensions, by examining the use of land versus ocean grid cells and also by checking for memory bandwidth problems. The code was profiled and the time spent carrying out file input/output was identified to be a potential bottleneck. We present a solution to this problem which gives a significant saving in terms of runtime and disk space usage. |
Author(s):
Reid, Fiona, Presenter Edinburgh Parallel Computing Centre (EPCC)
|
Suggested Technical Category:
Joint Session
Joint Session, Tutorial or Other
Technical Category suggested:
3rd Party Applications/User Code Optimization
|
|
Title: Optimising and Configuring the Weather Research and Forecast Model On the Cray XT
Abstract: The Weather Research and Forecast (WRF) Model is a well-established and widely-used application. Designed and written to be highly scalable, the code has a large number of configuration options at both compile- and run-time. We report the results of an investigation into the effect of these options on the performance of WRF on a Cray XT4 with a typical scientific use-case. Covering areas such as MPI/OpenMP comparison, cache usage and I/O performance, we discuss the implications for both regular WRF users and the authors of other application codes. |
Author(s):
Porter, Andrew, Presenter HPCX Consortium (HPCX)
Ashworth, Mike HPCX Consortium (HPCX)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: A Hybrid MPI/Openmp Code Employing a High-Order Compact Scheme for the Simulation of Hypersonic Aerodynamics
Abstract: High order compact schemes are excellent candidates for Direct Numerical Simulation and Large Eddy Simulation of flow fields. We have devised a high order compact scheme suitable for the simulation of hypersonic flows, to exploit both shared and distributed memory paradigms. Our hybrid application, employing both MPI and OpenMP standards, has been tested on HECToR. |
Author(s):
Fico, Vincenzo, Presenter HPCX Consortium (HPCX)
Emerson, David HPCX Consortium (HPCX)
Reese, Jason University of Strathclyde
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: High Performance Computing Driven Software Development for Next-Generation Modeling of the World’s Oceans
Abstract: The Imperial College Ocean Model (ICOM) is an open-source next generation ocean model built upon finite element methods and anisotropic unstructured adaptive meshing. Since 2009, a project has been funded by EPSRC to optimize ICOM for the UK national HPC service, HECToR. Extensive use of profiling tools such as CrayPAT and Vampir has been made in order to understand performance issues of the code on the Cray XT4. Of particular interest is the scalability of the sparse linear solvers and the algebraic multigrid preconditioners required to solve the system of equations. Scalability of model I/O has been examined and we have implemented a parallel I/O strategy in the code for the Lustre filesystem. |
Author(s):
Guo, Xiaohu, Presenter HPCX Consortium (HPCX)
Kramer, Stephan Department of Earth Science and Engineering, Imperial College London
Ashworth, Mike HPCX Consortium (HPCX)
Gorman, Gerard Department of Earth Science and Engineering, Imperial College London
Piggott, Matthew Department of Earth Science and Engineering, Imperial College London
Sunderland, Andrew HPCX Consortium (HPCX)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: XT System Reliability: Metrics, Trends, and Actions
Abstract: In 2009, the XT product family saw a significant improvement in overall reliability as measured by Cray’s support organization. This paper will discuss the reliability trends that have been observed and the main reasons for the improvements. We will also discuss the tools used to collect the field data, the metrics generated by Cray to evaluate XT product reliability and the actions taken as a result of this analysis. |
Author(s):
Johnson, Steve Cray Inc. (CRAY)
|
Suggested Technical Category:
System Operations
|
|
Title: Multi-Core Aware Performance Optimization of Halo Exchanges in Ocean Simulations
Abstract: The advent of multi-core brings new opportunities for performance optimization in MPI codes. For example, the cost of performing a halo exchange in a finite-difference simulation can be reduced by choosing a partition into sub-domains that takes full advantage of the faster shared-memory mechanisms available for MPI communication between tasks on the same node. We have implemented these ideas in the Proudman Oceanographic Laboratory Coastal-Ocean Modelling System, and find that multi-core aware optimizations can offer significant performance benefit, especially on hex-core systems. |
Author(s):
Pickles, Stephen, Presenter HPCX Consortium (HPCX)
|
Suggested Technical Category:
User Code Optimization
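The effect described in the abstract above, where the placement of sub-domains on nodes changes how many halo exchanges cross the network, can be counted directly for a toy case. This is not the scheme implemented in the Proudman Oceanographic Laboratory code. The sketch assumes a non-periodic 2D process grid with nearest-neighbour exchanges only, treats every exchange as equally costly, and uses hypothetical function names.

```python
def off_node_pairs(px, py, node_of):
    """Count nearest-neighbour rank pairs whose halo exchange crosses nodes."""
    count = 0
    for y in range(py):
        for x in range(px):
            if x + 1 < px and node_of(x, y) != node_of(x + 1, y):
                count += 1
            if y + 1 < py and node_of(x, y) != node_of(x, y + 1):
                count += 1
    return count

def row_major_nodes(px, cores_per_node):
    """Fill each node with consecutive ranks in row-major order."""
    return lambda x, y: (y * px + x) // cores_per_node

def tiled_nodes(px, tile_x, tile_y):
    """Give each node a tile_x-by-tile_y block of the process grid.

    Assumes px is a multiple of tile_x.
    """
    tiles_per_row = px // tile_x
    return lambda x, y: (y // tile_y) * tiles_per_row + (x // tile_x)

if __name__ == "__main__":
    # 12 x 12 ranks on nodes with 12 cores each.
    print(off_node_pairs(12, 12, row_major_nodes(12, 12)))
    print(off_node_pairs(12, 12, tiled_nodes(12, 4, 3)))
```

For this 12 x 12 example, the row-major placement sends 132 of the neighbour exchanges off-node, while 4 x 3 tiles send 60, because a compact tile has a shorter perimeter than a long strip of the same area.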
|
|
Title: Scaling Applications on Cray XT systems
Abstract: In this tutorial we will present tools and techniques for application performance tuning on the Cray XT system, with focus on multi-core processors. Attendees will learn about the Cray XT architecture and its programming environment. They will have an initial understanding of potential causes of application performance bottlenecks, and how to identify some of these bottlenecks using the Cray Performance tools. In addition, attendees will learn advanced techniques to deal with scaling problems and how to access the on-line documentation for user help. Attendees will also have some exposure to the Cray debugging support tools, which provide innovative techniques to debug applications at scale. |
Author(s):
DeRose, Luiz, Presenter Cray Inc. (CRAY)
Levesque, John, Presenter Cray Inc. (CRAY)
Moench, Bob, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
This is a proposal for a morning tutorial.
|
|
Title: The Cray Programming Environment: Current Status and Future Directions
Abstract: The Cray Programming Environment has been designed to address issues of scale and complexity of high end HPC systems. Its main goal is to hide the complexity of the system, such that applications can achieve the highest possible performance from the hardware. In this talk I will present the recent activities and future directions of the Cray Programming Environment, which consists of state-of-the-art compilers, tools, and libraries, supporting a wide range of programming models. |
Author(s):
DeRose, Luiz, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Jaguar - The World's Most Powerful Computer System
Abstract: At the SC'09 conference in November 2009, Jaguar was crowned as the world's fastest computer by the web site www.Top500.org. In this paper, we will describe Jaguar, present results from a number of benchmarks and applications, and talk about future computing in the Oak Ridge Leadership Computing Facility. |
Author(s):
Bland, Arthur, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Joint Session
Joint Session, Tutorial or Other
Technical Category suggested:
Major Systems
|
|
Title: General Purpose Timing Library (GPTL): A Tool for Characterizing Performance of Parallel and Serial Applications
Abstract: GPTL is an open source profiling library that reports a variety of performance statistics. Target codes may be parallel via threads and/or MPI. The code regions to be profiled can be hand-specified by the user, or GPTL can define them automatically at function-level granularity if the target application is built with an appropriate compiler flag. Output is presented in a hierarchical fashion that preserves parent-child relationships of the profiled regions. If the PAPI library is available, GPTL utilizes it to gather hardware performance counter data. GPTL built with PAPI support is installed on the jaguar machine at ORNL. |
Author(s):
Rosinski, James, Presenter Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tools
|
|
Title: Performance Analysis of Pure MPI Versus MPI+OpenMP for Jacobi Iteration and a 3D FFT on the Cray XT5
Abstract: Today many high performance computers are collections of shared memory compute nodes with each compute node having one or more multi-core processors. When writing parallel programs for these machines, one can use pure MPI or various hybrid approaches using MPI and OpenMP. Since OpenMP threads are lighter weight than MPI processes, one would expect that hybrid approaches will achieve better performance and scalability than pure MPI. In practice this is not always the case. This paper investigates the performance and scalability of pure MPI versus hybrid MPI+OpenMP for Jacobi iteration and a 3D FFT on the Cray XT5. |
Author(s):
Weiss, Olga Iowa State University
Luecke, Glenn, Presenter Iowa State University
|
Suggested Technical Category:
Programming Environment
|
|
Title: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms
Abstract: Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper, we will examine the effect of the choice of programming model upon performance and overall memory usage. We will study how to make efficient use of the memory system and explore the advantages and disadvantages of MPI, OpenMP, and UPC on the Cray XT5 multicore platforms for several synthetic and application benchmarks. |
Author(s):
Shan, Hongzhang, Presenter National Energy Research Scientific Computing Center (NERSC)
Shalf, John National Energy Research Scientific Computing Center (NERSC)
Wright, Nick National Energy Research Scientific Computing Center (NERSC)
Jin, Haoqiang NAS Systems Division (NAS)
|
Suggested Technical Category:
Programming Environment
|
|
Title: MPI Queue Characteristics of Large-scale Applications.
Abstract: Applications running at scale have varying communication characteristics. By employing the PERUSE introspection interface of Open MPI, this paper evaluates several large-scale simulations running production-level input data-sets on the jaguar installation at ORNL. The maximum number of queued messages, the average duration of unexpected receives, and late sender and receiver information are presented as functions of job size. |
Author(s):
Keller, Rainer, Presenter Oak Ridge National Laboratory (ORNL)
Graham, Richard L. Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Tools
|
|
Title: Tools, Tips and Tricks for Managing Cray XT Systems
Abstract: Managing large complex systems requires processes beyond what is taught in vendor training classes. Many sites must manage multiple systems from different vendors. This paper covers a collection of techniques to enhance the usability, reliability and security of Cray XT systems. A broad range of activities, from complex tasks like security, integrity and environmental checks of the Cray Linux Environment, to relatively simple things like making 'rpm -qa' available to users will be discussed. Some techniques will be XT specific, such as monitoring L0/L1 environment, but others will be generic, such as security tools adapted from other systems and re-spun as necessary for the XT Cray Linux Environment. |
Author(s):
Carlson, Kurt, Presenter Arctic Region Supercomputing Center (ARSC)
|
Suggested Technical Category:
System Operations
|
|
Title: Collecting Application-Level Job Completion Statistics
Abstract: Job failures are common on large high performance computing systems, but logging, analyzing, and understanding the low-level error messages can be difficult on Cray XT systems. This paper describes a set of tools to log and analyze applications in real-time as they run on the system. By obtaining more information about typical error scenarios, system administrators can work to resolve the underlying issues and educate users. |
Author(s):
Ezell, Matthew, Presenter National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
System Operations
|
|
Title: ALPS, Topology, and Performance
Abstract: Application performance can be improved or reduced depending on the compactness of the set of nodes on which an application is placed (as demonstrated convincingly by PSC at a recent CUG). This paper describes the approach to placements that ALPS now uses based on the underlying node topology, the reasons for this approach, and the variations that sites can use to optimize for their specific machine and workload. |
Author(s):
Albing, Carl, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Dynamic Shared Libraries and Virtual Cluster Environment
Abstract: Cray is expanding system functionality to support Dynamic Shared Libraries (DSL) on compute nodes, and the ability to run a wide range of packaged ISV applications on compute nodes. Built upon Data Virtualization Service (DVS), a more standard Linux runtime environment is distributed across the system by the DSL capability via DVS Server nodes to the Compute Node clients. The CLE Virtual Cluster Environment (VCE) adds a further layer of functionality, by supporting natively installed and executed ISV applications. This three component solution allows customers to meet a wide range of runtime environment demands with limited impact and complexity while increasing productivity. |
Author(s):
Schildt, Jason, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tools
|
|
Title: Resiliency Features in the Next Generation Cray Gemini Network
Abstract: As system sizes scale to ever increasing numbers of nodes and network links, network failures become an increasingly important problem to address. With its next generation high speed network (code named Gemini), Cray will introduce a number of new resiliency features in this area. These features, including network link failover, are discussed in this paper as well as a comparison to other, more familiar, network technologies such as Ethernet and Infiniband. |
Author(s):
Godfrey, Forest, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Architecture
|
|
Title: Scalasca Performance Analyses of PRACE Petascale Prototype Systems and Applications
Abstract: The open-source Scalasca toolset [www.scalasca.org] supports scalable performance analysis of large-scale parallel applications using MPI and OpenMP on a range of HPC computer systems. During an HPC-Europa2 visit to EPCC during 2009, Scalasca support for Cray XT systems was improved for the PGI programming environment and extended to also encompass the GNU, PathScale, Intel and Cray compilers. Application benchmark analyses with Scalasca on HECToR [.ac.uk], Louhi [.csc.fi], Rosa [.cscs.ch] and Kraken [nics.tennessee.edu] are compared with other PRACE petascale prototype systems. |
Author(s):
Wylie, Brian, Presenter Juelich Supercomputing Centre
|
Suggested Technical Category:
Tools
|
|
Title: Using I/O Servers to Improve Performance on Cray XT Technology
Abstract: Amdahl's Law proposes that parallel codes are combinations of parallel and serial tasks. In many cases these tasks are inherently parallel and can be decomposed and performed asynchronously. Each task operates on a dedicated subset of processors, with highly scalable tasks operating on very large numbers of processors and less scalable tasks (like IO) operating on a smaller number. By moving to this Multiple Instruction Multiple Data paradigm codes can achieve greater parallel efficiency and scale further. This paper specifically addresses the implementation and experiences of adapting several codes important to HECToR to offload writing output data onto a set of dedicated server processors. |
Author(s):
Edwards, Thomas, Presenter Cray Inc. (CRAY)
Roy, Kevin Cray Inc. (CRAY)
|
Suggested Technical Category:
User Code Optimization
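For reference, the law cited in the abstract above bounds the speedup on N processors by 1 / (s + (1 - s) / N), where s is the fraction of the run time that is serial. The short sketch below evaluates that bound. The serial fraction used in the example is an arbitrary illustrative value, not a measurement from the paper.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelised."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

if __name__ == "__main__":
    # Illustrative only: a 5% serial part (for example, output written by
    # one process) caps the speedup below 1 / 0.05 = 20 on any core count.
    for n in (16, 256, 4096):
        print(n, round(amdahl_speedup(0.05, n), 2))
```

Offloading a poorly scaling task such as output to a few dedicated processes is one way of taking it off the critical path, which in this model corresponds to reducing s for the compute processes.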
|
|
Title: Petascale Debugging
Abstract: The need for debugging at scale is well known - yet machine sizes have raced ahead of the levels reachable by debuggers for many years. This paper outlines major development of Allinea's DDT debugging tool to introduce production-grade petascale debugging on the Oak Ridge Jaguar XT5 system. The resulting scalable architecture is raising the bar of usability and performance in a debugger by multiple orders of magnitude - and has already achieved record 225,000-core debugging at ORNL. |
Author(s):
Lecomber, David, Presenter Allinea Software
January, Chris Allinea Software
O'Connor, Mark Allinea Software
|
Suggested Technical Category:
Tools
|
|
Title: PRACE Application Enabling Work at EPCC
Abstract: The Partnership for Advanced Computing in Europe (PRACE) created the prerequisites for a pan-European HPC service, consisting of several tier-0 centres. PRACE's aim has now moved to the implementation of this service. The now completed work looked into all aspects of the pan-European service, including the contractual and organisational issues, the system management, application enabling and future computer technologies. This talk discusses the work done by EPCC on the application codes HELIUM (from Queen's University Belfast, UK) and NAMD (from University of Illinois at Urbana Champaign, US) with a particular focus on the work carried out for the PRACE prototype Louhi, which is a Cray XT5 at CSC in Finland. We will also include a performance comparison with non-Cray systems available to PRACE. |
Author(s):
Guo, Xu, Presenter Edinburgh Parallel Computing Centre (EPCC)
Hein, Joachim Edinburgh Parallel Computing Centre (EPCC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Imperative Recovery for Lustre
Abstract: Recovery times for Lustre failover are mainly a function of the overriding bulk data timeout because clients must timeout to a server twice before initiating contact with its backup. As a result, failover completion times exceeding ten minutes are common. During failover and recovery, all IO operations stall and the long duration can lead to job timeouts, poor system utilization, and increased administrator load. To improve overall failover times we are implementing Imperative Recovery, the framework by which Lustre can initiate and finish failover without waiting for long timeouts. Imperative Recovery directs clients to switch server connections based on automatic processing of node health data. With these changes and Version Based Recovery, it is possible to begin recovery very fast, reducing overall failover times to a few minutes. This paper discusses Imperative Recovery from a system perspective and characterizes the speedup achieved. |
Author(s):
Spitz, Cory, Presenter Cray Inc. (CRAY)
Henke, Nic Cray Inc. (CRAY)
Horn, Chris Cray Inc. (CRAY)
|
Suggested Technical Category:
Mass Storage
|
|
Title: Towards European Training Network in Computational Science
Abstract: The implementation phase of The Partnership for Advanced Computing in Europe (PRACE) project will develop and maintain a European training network in the field of computational science. Its key ingredients are solid contacts between the partner organisations and European research centres, as well as establishing new links to universities. In this talk, I will review the completed training-related activities of the preparatory phase of PRACE as well as plans for the implementation phase. |
Author(s):
Manninen, Pekka, Presenter CSC - Scientific Computing Ltd. (CSC)
Turunen, Ari CSC - Scientific Computing Ltd. (CSC)
|
Suggested Technical Category:
Training
|
|
Title: XGC1: Performance on the 8-core and 12-core Cray XT5 systems at ORNL
Abstract: The XGC1 code is used to model multiscale tokamak plasma turbulence dynamics in realistic edge geometry. In June 2009, XGC1 demonstrated nearly linear weak and strong scaling out to 150,000 cores on a Cray XT5 with 8-core nodes when solving problems of relevance to running experiments on the ITER tokamak. Here we compare performance, and discuss further performance optimizations, when running XGC1 on an XT5 with 12-core nodes on up to 224,000 cores. |
Author(s):
Worley, Patrick, Presenter Oak Ridge National Laboratory (ORNL)
Adams, Mark Columbia University
D'Azevedo, Eduardo Oak Ridge National Laboratory (ORNL)
Chang, C-S New York University
Ku, Seung-Hoe New York University
McCurdy, Collin Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: RAVEN: RAS data Analysis through Visually Enhanced Navigation
Abstract: Supercomputer RAS data contain various signatures of system status and are therefore routinely examined to detect and diagnose faults. However, because of the sheer volume of logs generated during faulty situations, a comprehensive investigation that compares different types of RAS logs over both spatial and temporal dimensions is often beyond the capacity of human operators, leaving a cursory look as the only feasible option. To make better use of informative but huge supercomputer RAS data in fault detection and diagnosis, we present a GUI tool called RAVEN that visually overlays various types of RAS logs on a physical system map, where correlations between different fault types can be easily observed in terms of their quantities and locations at a given time. RAVEN also provides an intuitive fault navigation mechanism that helps examine logs by clustering them by their common locations, types, or user applications. By tracing notable fault patterns reflected on the map and their clustered logs, and by superimposing user application data, RAVEN, which has been adopted at the National Institute for Computational Sciences (NICS) at the University of Tennessee, has identified root causes of several system failures logged on the Kraken XT5. |
Author(s):
Park, Byung-Hoon Oak Ridge National Laboratory (ORNL)
Heo, Junseong National Institute for Computational Sciences (NICS)
Kora, Guruprasad Oak Ridge National Laboratory (ORNL)
Geist, Al Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Environmental Monitoring
|
|
Title: Application Acceleration on Current and Future Cray Platforms
Abstract: Application codes in a variety of areas are being updated for performance on the latest architectures. We describe current bottlenecks and performance improvement areas for applications including plasma physics, chemistry related to carbon capture and sequestration, and material science. |
Author(s):
Koniges, Alice, Presenter National Energy Research Scientific Computing Center (NERSC)
Kim, Jihan National Energy Research Scientific Computing Center (NERSC)
Preissl, Robert National Energy Research Scientific Computing Center (NERSC)
Fagnan, Kirsten National Energy Research Scientific Computing Center (NERSC)
Shalf, John National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Automatic Iterative Optimization of Parallel Applications
Abstract: Manual software optimization is effective, but also time-consuming, and can thus benefit from complementary automatic optimization schemes. This paper describes a novel cross-platform framework that is able to optimize parallel applications by tuning three sets of parameters: compiler options, environment variables and internal program parameters. The optimization is carried out using a genetic algorithm, where the trial simulations may be run in parallel allowing the optimization algorithm to scale. The performance of this framework is assessed on a Cray XT5 by optimizing both real world applications, as well as well-known synthetic benchmarks such as the High-Performance Linpack (HPL) benchmark. The results show that our optimization framework increases the performance of the test cases significantly. |
Author(s):
von Alfthan, Sebastian, Presenter CSC - Scientific Computing Ltd. (CSC)
Lehto, Olli-Pekka CSC - Scientific Computing Ltd. (CSC)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Improving the Performance of COSMO-CLM
Abstract: The COSMO-Model, originally developed by Deutscher Wetterdienst, is a non-hydrostatic regional atmospheric model which can be used for numerical weather prediction and climate simulations and is now in use by a number of weather services for operational forecasting (e.g. MeteoSwiss). One current software engineering goal is to improve its scaling characteristics on multicore architectures by making it a hybrid MPI-OpenMP code. We will present hybridization strategies for different components of the model, show some first performance results, and discuss the impact on further development of the model. |
Author(s):
Cordery, Mathew, Presenter CSCS-Swiss National Supercomputing Centre (CSCS)
Sawyer, Will CSCS-Swiss National Supercomputing Centre (CSCS)
Schaettler, Ulrich CSCS-Swiss National Supercomputing Centre (CSCS) Deutscher Wetterdienst
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Overview and Performance Evaluation of Cray LibSci Products
Abstract: This talk serves as both an introduction to the Cray scientific library suite and as a tutorial on obtaining advanced performance with applications that utilize scientific libraries. The talk will include a thorough and frank performance evaluation of all scientific library products on Cray XT systems, including dense kernels on single core and multiple cores, dense linear solvers and eigensolvers in serial and parallel, serial and distributed Fourier Transforms and Sparse kernels within sparse iterative solvers. The emphasis will be on usage and how to increase performance by using different algorithms or libraries, better configurations, or advanced controls of the scientific libraries. |
Author(s):
Tate, Adrian, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Libraries
Joint Session, Tutorial or Other
Technical Category suggested:
it makes sense to join this talk with any other internal PE software talks such as compilers, tools, PE overview etc.
|
|
Title: Evaluation of Productivity and Performance Characteristics of CCE CAF and UPC Compilers
Abstract: The Co-Array Fortran (CAF) and Unified Parallel C (UPC) functional compilers available with the Cray Compiler Environment (CCE) on the Cray XT5 platform offer an integrated framework for code development and execution for the Partitioned Global Address Space (PGAS) programming paradigm together with message-passing MPI and shared-memory OpenMP programming models. Using micro-benchmarks, conformance test cases and micro-kernels of representative scientific calculations, we attempt to evaluate the following characteristics of the CCE PGAS compilers: (1) usability of the framework for code development and execution; (2) completeness and integrity of code generation; (3) efficiency of the generated code, particularly usage of the communication layer (GASNet on SeaStar2); and (4) tools availability for performance measurement and diagnostics. Our initial results show that the current version of the compiler provides a highly productive code development environment for CAF or UPC code development on our target Cray XT5 platform. At the same time, however, we observe that the code transformation and generation processes are unable to aggregate remote memory access for simple access patterns, causing significant slowdown. We will compare and contrast code generation with two multi-platform PGAS compilers: Berkeley UPC environment that uses the Intrepid UPC compiler and the g95 CAF compiler extensions. In the full paper version, we would also include comparative results using the Rice CAF 2.0 compiler, if it becomes available in due time. |
Author(s):
Alam, Sadaf, Presenter CSCS-Swiss National Supercomputing Centre (CSCS)
Cordery, Matthew CSCS-Swiss National Supercomputing Centre (CSCS)
Sawyer, William CSCS-Swiss National Supercomputing Centre (CSCS)
Stitt, Tim CSCS-Swiss National Supercomputing Centre (CSCS)
Stringfellow, Neil CSCS-Swiss National Supercomputing Centre (CSCS)
|
Suggested Technical Category:
Compilers
|
|
Title: An Alliance for Computing at the Extreme Scale
Abstract: Los Alamos and Sandia National Laboratories have formed a new high performance computing center, the Alliance for Computing at the Extreme Scale (ACES). The two labs will jointly architect, develop, procure and operate capability systems for DOE’s Advanced Simulation and Computing Program. This presentation will discuss (1) a petascale production capability system, Cielo, that will be deployed in late 2010, (2) a technology roadmap for exascale computing and (3) a new partnership with Cray on advanced interconnect technologies. |
Author(s):
Dosanjh, Sudip, Presenter Sandia National Laboratories (SNLA)
Morrison, John, Presenter Los Alamos National Laboratory
Ang, James Sandia National Laboratories (SNLA)
Koch, Ken Los Alamos National Laboratory
|
Suggested Technical Category:
Architecture
|
|
Title: File System Monitoring as a Window Into User I/O Requirements
Abstract: The effective management of HPC I/O resources requires an understanding of user requirements, so the National Energy Research Scientific Computing Center (NERSC) annually surveys its project leads for their anticipated needs. With the advent of detailed monitoring on the Lustre parallel file system of the Franklin Cray XT it becomes possible to compare actual experience with the expectations presented in the surveys. A correlation of the Lustre Monitoring Tool (LMT) data with job log statistics reveals I/O behavior on a per-project basis. This feedback for both the users and the center enhances NERSC's ability to manage and provision Franklin's I/O subsystem as well as to plan for future I/O requirements. |
Author(s):
Uselton, Andrew, Presenter National Energy Research Scientific Computing Center (NERSC)
Antypas, Katie National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Mass Storage
|
|
Title: Correlating Log Messages for System Diagnostics
Abstract: In large-scale computing systems the sheer volume of logs generated has challenged the interpretation of log messages for debugging and monitoring purposes. For a non-trivial event, the Jaguar XT5 at the Oak Ridge Leadership Computing Facility with more than eighteen thousand compute nodes would generate a few hundred thousand log entries in less than a minute. Determining the root cause of such events requires analyzing and understanding these log messages. Most often, these log messages are best understood when they are interpreted collectively rather than being read as individual messages. In this paper, we present our approach to interpreting log messages by identifying commonalities and grouping them into clusters. Given a set of log messages within a time interval, we parse and group the messages based on source, target, and/or error type, and correlate the messages with hardware and application information. We monitor the XT5’s console, netwatch and syslog and show how such grouping of log messages helps in detecting system events. By intelligent grouping and correlation of events from multiple sources we are able to provide system administrators with meaningful information in a concise format for root cause analysis. |
Author(s):
Gunasekaran, Raghul Oak Ridge National Laboratory (ORNL)
Park, Byung Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
Geist, Al Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Operations
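
The grouping step described in the abstract above (messages within a time interval, grouped by source, target, and/or error type) might be sketched roughly as below. This is only an illustrative sketch, not the authors' tool: the `LogEntry` record, its field names, the `cluster_entries` helper, and the sample values are all hypothetical, and real console, netwatch, or syslog lines would first need parsing into such records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Sequence, Tuple


class LogEntry(NamedTuple):
    """Hypothetical parsed log record; field names are assumptions."""
    timestamp: float   # seconds since some reference time
    source: str        # component that reported the message
    target: str        # component the message refers to
    error_type: str    # coarse classification of the message


def cluster_entries(
    entries: Iterable[LogEntry],
    start: float,
    end: float,
    keys: Sequence[str] = ("source", "error_type"),
) -> Dict[Tuple[str, ...], List[LogEntry]]:
    """Group entries with start <= timestamp < end by the chosen fields."""
    clusters: Dict[Tuple[str, ...], List[LogEntry]] = defaultdict(list)
    for entry in entries:
        if start <= entry.timestamp < end:
            group_key = tuple(getattr(entry, name) for name in keys)
            clusters[group_key].append(entry)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        LogEntry(0.5, "nid00012", "ost3", "lustre_timeout"),
        LogEntry(1.0, "nid00012", "ost4", "lustre_timeout"),
        LogEntry(2.0, "nid00040", "ost3", "link_inactive"),
        LogEntry(75.0, "nid00012", "ost3", "lustre_timeout"),
    ]
    for key, group in cluster_entries(sample, 0.0, 60.0).items():
        print(key, len(group))
```

Passing a different `keys` tuple, for example `("target",)`, regroups the same window by the component the messages refer to, which is one way a burst of entries can be condensed into a handful of clusters.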
|
|
Title: Improving the Productivity of Scalable Application Development with TotalView
Abstract: Scientists and engineers who set out to solve grand computing challenges need TotalView at their side. The TotalView debugger provides a powerful and scalable tool for analyzing, diagnosing, debugging and troubleshooting a wide variety of different problems that might come up in the process of such achievements. These teams, and teams of scientists pursuing a wide range of computationally complex problems on Cray XT systems, are frequently diverse and geographically distributed. These groups work collaboratively on complex applications in a computational environment that they access through a batch resource management system. This talk will explore the productivity challenges faced by scientists and engineers in this environment -- highlighting both long standing (but perhaps unfamiliar) and recently introduced capabilities that TotalView users on Cray can take advantage of to boost their productivity. The list of capabilities will include the CLI, subset attach, Remote Display Client, TVScript, MemoryScape's reporting, and ReplayEngine. |
Author(s):
Gottbrath, Chris, Presenter TotalView Technologies
|
Suggested Technical Category:
Programming Environment
|
|
Title: Lessons Learned in Deploying the World's Largest Scale Lustre File System
Abstract: The Spider parallel file system at Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) is the world’s largest scale Lustre file system. It has nearly 27,000 file system clients, 10 PB of capacity, and over 240 GB/s of demonstrated I/O bandwidth. In full-scale production for over 6 months, Spider provides a high performance parallel I/O environment to a diverse portfolio of computational resources. These range from the high end, multi-Petaflop Jaguar XT5, the mid-range, 260 Teraflop Jaguar XT4, to the low end, with numerous systems supporting development, visualization, and data analytics. Throughout this period we have had a number of critical design points reinforced while learning a number of lessons on designing, deploying, managing, and using a system of this scale. This paper details our operational experience with the Spider file system, focusing on observed reliability (including MTTI and MTTF), manageability, and system performance under a diverse workload. |
Author(s):
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Dillow, David Oak Ridge National Laboratory (ORNL)
Hill, Jason Oak Ridge National Laboratory (ORNL)
Leverman, Dustin Oak Ridge National Laboratory (ORNL)
Maxwell, Don Oak Ridge National Laboratory (ORNL)
Miller, Ross Oak Ridge National Laboratory (ORNL)
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Simmons, James Oak Ridge National Laboratory (ORNL)
Wang, Feiyi Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Mass Storage
|
|
Title: What is a 200,000 CPUs Petaflop Computer Good For (a Theoretical Chemist Perspective)?
Abstract: We describe the efforts undertaken to efficiently parallelize the computational chemistry code NWChem on the Cray XT hardware using the Global Arrays/ARMCI middleware. We show how we can now use 200K+ processors to address complex scientific problems. |
Author(s):
Apra, Edoardo, Presenter Oak Ridge National Laboratory (ORNL)
Tipparaju, Vinod Oak Ridge National Laboratory (ORNL)
Olson, Ryan Cray Inc. (CRAY)
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Reducing Application Runtime Variability on Jaguar XT5
Abstract: Operating system (OS) noise is defined as interference generated by the OS that prevents the compute core from performing “useful” work. Compute node kernel daemons, network interfaces, and other OS related services are major sources of such interference. This interference on individual compute cores can vary in duration and frequency and can cause de-synchronization (jitter) in collective communication tasks and thus result in variable (degraded) overall parallel application performance. This behavior is more observable in large-scale applications using certain types of collective communication primitives, such as MPI_Allreduce. This paper presents our efforts towards reducing the overall effect of OS noise on our large-scale parallel applications. Our tests were performed on the quad-core Jaguar, the Cray XT5 at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF). At the time of these tests, Jaguar was a 1.4 PFLOPS supercomputer with 144,000 compute cores and 8 cores per node. The technique we used was to aggregate and merge all OS noise sources onto a single compute core for each node. The scientific application was then run on the remaining seven cores in each node. Our results show that we were able to improve the MPI_Allreduce performance by two orders of magnitude and to boost the Parallel Ocean Program (POP) performance over 30% using this technique. |
Author(s):
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Wang, Feiyi Oak Ridge National Laboratory (ORNL)
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Dillow, Dave Oak Ridge National Laboratory (ORNL)
Miller, Ross Oak Ridge National Laboratory (ORNL)
Maxwell, Don Oak Ridge National Laboratory (ORNL)
Becklehimer, Jeff Cray Inc. (CRAY)
Larkin, Jeff Cray Inc. (CRAY)
|
Suggested Technical Category:
Tuning and OS Optimization
|
|
Title: Franklin Job Completion Analysis
Abstract: The NERSC Cray XT4 machine Franklin has been in production for 3000+ users since October 2007, with about 1800 jobs run each day. There has been an on-going effort to better understand how well these jobs run, whether failed jobs are due to application errors or system issues, and to further reduce system related job failures. In this paper, we will talk about the progress we made in tracking job completion status, in identifying job failure root cause, and in expediting resolution of job failures, such as hung jobs, that are caused by system issues. In addition, we will present some Cray software design enhancements we requested to help us track application progress and identify errors. |
Author(s):
He, Yun (Helen), Presenter National Energy Research Scientific Computing Center (NERSC)
Lin, Hwa-Chun Wendy National Energy Research Scientific Computing Center (NERSC)
Yang, Woo-Sun National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Consulting
Joint Session, Tutorial or Other
Technical Category suggested:
note: The more appropriate category for this paper would be: "User Support" (which is missing) under "Systems Support" category. Thanks!
|
|
Title: An Overview of the Chapel Programming Language and Implementation
Abstract: Chapel is a new parallel programming language under development at Cray Inc. as part of the DARPA High Productivity Computing Systems (HPCS) program. Chapel has been designed to improve the productivity of parallel programmers working on large-scale supercomputers as well as small-scale, multicore computers and workstations. It aims to vastly improve programmability over current parallel programming models while supporting performance and portability at least as good as today's technologies. In this tutorial, we will present an introduction to Chapel, from context and motivation to a detailed description of Chapel via many example computations. This tutorial will focus on writing Chapel programs for both multi-core and distributed-memory computers. We will explore the optimizations added to the Chapel implementation this past year that helped with the most recent Chapel HPCC entry. |
Author(s):
Deitz, Steve, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Tutorial
Joint Session, Tutorial or Other
Technical Category suggested:
Language / Programming Environment / Compiler Tutorial
|
|
Title: Interactions Between Application Communication and I/O Traffic on the Cray XT High Speed Network
Abstract: The massive size of modern leadership computing resources often leads to the discovery of application performance bottlenecks not seen at smaller scales. Many of these performance bottlenecks originate within individual applications; however, recent application testing on a Cray XT5 indicates that an application's I/O pattern can negatively impact the communication performance of another application via interaction over the shared high speed network (HSN). This study seeks to identify and to quantify such interactions on the HSN of Kraken, the Cray XT5 operated by the National Institute for Computational Sciences (NICS). |
Author(s):
Brook, R. Glenn, Presenter National Institute for Computational Sciences (NICS)
Crosby, Lonnie D. National Institute for Computational Sciences (NICS)
|
Suggested Technical Category:
Networking
|
|
Title: Five Powerful Chapel Idioms
Abstract: The Chapel parallel programming language, under development at Cray Inc., has the potential to deliver high performance to more programmers with less effort than current practices provide. This is especially the case with the many-core architectures that are already becoming more and more prevalent. This paper presents five reasons why: 1. Chapel supports easy-to-use asynchronous and synchronous remote tasks, 2. Chapel supports local and remote transactions, 3. Chapel supports simple data-parallel abstractions when applicable, 4. Chapel supports user-defined data distributions, and 5. Chapel supports arbitrarily nested parallelism. |
Author(s):
Deitz, Steve, Presenter Cray Inc. (CRAY)
Chamberlain, Brad Cray Inc. (CRAY)
Choi, Sung-Eun Cray Inc. (CRAY)
Iten, David Cray Inc. (CRAY)
Prokowich, Lee Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Thermodynamics of Magnetic Systems from First Principles: WL-LSMS
Abstract: We describe a method to combine classical thermodynamic Monte Carlo calculations (the Wang-Landau method) with a first principles electronic structure calculation, specifically our locally self-consistent multiple scattering (LSMS) code. The combined code shows superb scaling behavior on massively parallel computers and is able to calculate the transition temperature of Fe without external parameters. The code was the recipient of the 2009 Gordon Bell Prize for peak performance. |
Author(s):
Eisenbach, Markus, Presenter Oak Ridge National Laboratory (ORNL)
Nicholson, Donald Oak Ridge National Laboratory (ORNL)
Brown, Gregory Florida State University
Zhou, Chengang J P Morgan Chase & Co
Larkin, Jeff Cray Inc. (CRAY)
Schulthess, Thomas CSCS-Swiss National Supercomputing Centre (CSCS)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: Parallelism in System Tools
Abstract: The Cray XT, when employed in conjunction with the Lustre filesystem, provides the ability to generate huge amounts of data in the form of many files. This is accommodated by satisfying the requests of multiple Lustre clients in parallel. In contrast, a single service node (Lustre client) cannot provide timely management for such datasets. Consequently, as the dataset enters the 10+ TB range and/or hundreds of thousands of files, using traditional UNIX tools like cp, tar, or “find . –exec ... ;” to manage these datasets has a substantial impact on user productivity. For example, it would take about 12 hours to copy a 10 TB dataset from the service node via cp if dedicated resources were employed. In general, it is not practical to schedule dedicated resources for a data copy and, as a result, a typical duty factor of 4X is incurred. This means that, in practice, it would take 48 hours to perform a serial copy of a 10 TB dataset. Over the next three to four years, datasets are likely to grow by a factor of 4X. At that point, the simple copy of a dataset may be expected to take over a week and represent a significant impediment to the investigation of science. In this paper, we introduce the Lustre User Toolkit for Cray XT, developed at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF) and demonstrate that, by optimizing and parallelizing system tools, an order of magnitude performance increase or more can be achieved, thereby reducing or eliminating the bottleneck. The conclusion is self-evident: parallelism in system tools is vital to managing large datasets. |
Author(s):
Matney, Sr., Kenneth, Presenter Oak Ridge National Laboratory (ORNL)
Shipman, Galen Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Other
Joint Session, Tutorial or Other
Technical Category suggested:
Systems Support - Tools
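
The copy-time estimate in the abstract above can be reproduced with a few lines of arithmetic. The sketch below assumes copy time scales linearly with dataset size, which the abstract implies but does not state; the function and parameter names are hypothetical, and only the 12-hour, 10 TB, and 4X figures come from the abstract.

```python
def serial_copy_hours(
    dataset_tb: float,
    dedicated_hours_per_10tb: float = 12.0,  # figure quoted in the abstract
    duty_factor: float = 4.0,                # shared-resource slowdown quoted in the abstract
) -> float:
    """Estimated wall-clock hours for a serial copy, assuming linear scaling."""
    if dataset_tb < 0:
        raise ValueError("dataset_tb must be non-negative")
    return dataset_tb / 10.0 * dedicated_hours_per_10tb * duty_factor


if __name__ == "__main__":
    today = serial_copy_hours(10.0)        # 10 TB dataset on shared resources
    future = serial_copy_hours(10.0 * 4)   # same dataset after 4X growth
    print(f"10 TB today: {today:.0f} hours")
    print(f"40 TB later: {future:.0f} hours ({future / 24:.0f} days)")
```

With these inputs the estimate is 48 hours for 10 TB and 192 hours (8 days) for 40 TB, consistent with the abstract's "over a week".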
|
|
Title: Analyzing Multicore Characteristics for a Suite of Applications on an XT5 System
Abstract: In this paper, we will explore the performance of applications important to Sandia on an XT5 system with dual socket AMD 6 core Istanbul nodes. We will explore scaling as a function of the number of cores used on each node and determine the effective core utilization as core count increases. We will then analyze these results using profiling to better understand resource contention within and between nodes. |
Author(s):
Vaughan, Courtenay, Presenter Sandia National Laboratories (SNLA)
Doerfler, Douglas Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
3rd Party Applications
|
|
Title: External Services on the Cray XT5 System Hopper
Abstract: Cray External Service offerings such as login nodes, data mover nodes, and file systems which are external to the main XT system, provide an opportunity to make Cray XT High Performance Computing resources more robust and accessible to end users. This paper will discuss our experiences using external services on Hopper, a Cray XT5 system at the National Energy Research Scientific Computing (NERSC) Center. It will describe the motivation for externalizing services, early design decisions, security issues, implementation challenges and production feedback from NERSC users. |
Author(s):
Antypas, Katie National Energy Research Scientific Computing Center (NERSC)
Butler, Tina National Energy Research Scientific Computing Center (NERSC)
Carter, Jonathan, Presenter National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Architecture
|
|
Title: The Evolution of a Petascale Application: Work on CHIMERA.
Abstract: CHIMERA is a multi-dimensional radiation hydrodynamics code designed to study core-collapse supernovae. We will review several recent enhancements to CHIMERA designed to better exploit features of the Cray XT architecture, as well as some forward-looking work to take advantage of the next generation of Cray supercomputers. |
Author(s):
Messer, Bronson, Presenter Oak Ridge National Laboratory (ORNL)
Bruenn, Stephen Florida Atlantic University
Hix, Raph Oak Ridge National Laboratory (ORNL)
Mezzacappa, Anthony Oak Ridge National Laboratory (ORNL)
Blondin, John North Carolina State University
|
Suggested Technical Category:
User Code Optimization
|
|
Title: The Graph 500
Abstract: New large-scale informatics applications require radically different architectures from those optimizing for 3D Physics. The 3D physics community is represented in the Top 500 list by LINPACK as a single, simple, dense algebra benchmark. Informally, the Cray XMT performs significantly better than other known architectures on large-scale graph problems, which are a core informatics application kernel. The Graph 500 list, to be introduced at Supercomputing 2010, will formalize a single, unified graph benchmark for the informatics community to rally around and to precipitate innovation in the informatics space. This paper will discuss the need for this kind of benchmark, the benchmark itself, an initial set of results on a small subset of platforms (including XMT), and why those platforms are fundamentally different from other classes of supercomputer. |
Author(s):
Murphy, Richard, Presenter Sandia National Laboratories (SNLA)
Ang, Jim Sandia National Laboratories (SNLA)
Hendrickson, Bruce Sandia National Laboratories (SNLA)
Rodrigues, Arun Sandia National Laboratories (SNLA)
Barrett, Brian Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Architecture
|
|
Title: Performance Monitoring Tools for Large Scale Systems
Abstract: Operating computing systems, file systems, and associated networks at unprecedented scale offers unique challenges for fault monitoring, performance monitoring and problem diagnosis. Conventional system monitoring tools are insufficient to process the increasingly large and diverse volume of performance and status log data produced by the world’s largest systems. In addition to the large data volume, the wide variety of systems employed by the largest computing facilities presents diverse information from multiple sources, further complicating analysis efforts. At leadership scale, new tool development is required to acquire, condense, correlate, and present status and performance data to systems staff for timely evaluation. This paper details a set of system monitoring tools developed by the authors and utilized by systems staff at Oak Ridge National Laboratory’s Leadership Computing Facility, including the Cray XT5 Jaguar. These tools include utilities to correlate I/O performance and event data with specific systems, resources, and jobs. Where possible, existing utilities are incorporated to reduce development effort and increase community participation. Future work may include additional integration among tools and implementation of fault-prediction tools. |
Author(s):
Shipman, Galen, Presenter Oak Ridge National Laboratory (ORNL)
Dillow, David Oak Ridge National Laboratory (ORNL)
Hill, Jason Oak Ridge National Laboratory (ORNL)
Miller, Ross Oak Ridge National Laboratory (ORNL)
Oral, Sarp Oak Ridge National Laboratory (ORNL)
Maxwell, Don Oak Ridge National Laboratory (ORNL)
Wang, Feiyi Oak Ridge National Laboratory (ORNL)
|
Suggested Technical Category:
Environmental Monitoring
|
|
Title: Multi-core Programming Paradigms and MPI Message Rates - A Growing Concern?
Abstract: The continued growth in per-node core count in high performance computing platforms has led the community to investigate alternatives to an MPI-everywhere programming environment. A hybrid programming environment, in which MPI is used for coarse-grained, inter-node parallelism and a threaded environment (pthreads, OpenMP, etc.) is used for fine-grained, intra-node parallelism, presents an appealing target for future applications. At the same time, memory and network bandwidth both continue to grow at a significantly slower pace than processor performance. This trend, combined with increased parallelism due to larger machine sizes, will drive applications away from the bandwidth-limited BSP model to one with a higher number of smaller messages, which avoids unnecessary memory-to-memory copies inside a single node. The increase in small message transfers requires a higher message rate from a single node. Current network designs rely on a number of tasks on a single node injecting messages into the network in order to achieve optimal message rates. This paper quantifies the impact of local process count on node-level message rate for Cray XT5 hardware. The results are an important metric in designing both MPI implementations and applications for the hybrid programming future. |
Author(s):
Hemmert, Scott Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
Networking
|
|
Title: Cray Debugging Support Tools for Petascale Applications
Abstract: As HPC systems have gotten ever larger, the amount of information associated with debugging a failing parallel application has grown beyond what the beleaguered applications developer has the time, resources, and wherewithal to analyze. With the release of the Cray Debugging Support package, Cray introduces several innovative methods of attacking this vexing problem. FTD (Fast Track Debugging) achieves debugging at fully optimized speeds. STAT (Stack Trace Analysis Tool) facilitates the evaluation and study of hung applications. ATP (Abnormal Termination Processing) captures a STAT-like view of applications that have taken a fatal trap. And Guard, the Cray comparative debugger, delivers an automated search for the location of program errors by comparing a working version of an application against a failing version. This paper describes and explores each of the above technologies. |
Author(s):
Moench, Bob, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Running Hadoop on a Cray XT System
Abstract: Hadoop is an open source implementation of the MapReduce programming model popularized by Google. Hadoop has been heavily adopted in the Web 2.0 community and is now making inroads in the scientific and research communities. The flexibility of the MapReduce programming model combined with the power of the Cray XT can impact the size and nature of scientific explorations. In this paper we will explain the motivations for using Hadoop and describe the steps to deploy the framework on a Cray XT. We will examine some of the configuration options and their impact on performance. We will compare the performance of several applications running in Hadoop on the Cray system with the performance on standard Hadoop deployments on clusters and Cloud systems. We will conclude with an assessment of the feasibility and efficacy of running Hadoop on HPC systems and a discussion of future work. |
Author(s):
Canon, Shane, Presenter National Energy Research Scientific Computing Center (NERSC)
Ramakrishnan, Lavanya National Energy Research Scientific Computing Center (NERSC)
Jackson, Keith Lawrence Berkeley National Lab
Shalf, John National Energy Research Scientific Computing Center (NERSC)
|
Suggested Technical Category:
Programming Environment
|
|
Title: Validating File System Permissions on Multi-OS Systems
Abstract: The Cray XT series of HPC computers presents the system security officer and system administrator with a range of operating systems (Linux, CNL, CVN), job launch mechanisms (shell/exec, ALPS, yod) and file systems (UFS, NFS, Lustre, LibSysIO, DVS). Available open-source packages do not span this range of requirements. As the system integrator, Cray provides the fundamentals for validating that file system permissions are correctly enforced. However, due to Sandia's security requirements, we were forced to develop a software tool for checking POSIX permission handling across multiple combinations of OSes and file systems. This paper presents the architecture and design of a novel Lisp-based POSIX file system validation tool that uses multi-methods and object-oriented programming to validate tester-specified combinations of access patterns. |
Author(s):
Ballance, Robert Sandia National Laboratories (SNLA)
|
Suggested Technical Category:
System Operations
|
|
Title: Diagnosis and Remediation of Performance Anomalies on the Cray XMT
Abstract: The primary advantage shared memory parallel computers have over distributed memory systems is a simplified programming model in which data does not need to be replicated or distributed. In practice, however, there are limitations to the amount of concurrency a program can exploit on shared memory systems because there is no concurrency in the atomic operations or mutual exclusion locks these systems use to modify shared data. We examine a number of common parallel programming idioms and discuss their practical limitations when executed on the Cray XMT. Using the XMT's compiler analysis (canal) and runtime event (traceview) tools, we are able to understand the performance anomaly of "hotspotting," in which performance degrades non-linearly due to serialization on shared data, and the way that serialization interacts with the runtime's scheduling of software threads onto hardware streams. The tools can also be used to understand other symptoms of performance problems such as stream starvation, which may be far removed from their root cause. We offer techniques for identifying these situations, and remedies for reducing or eliminating them. |
Author(s):
Mogill, Jace, Presenter Pacific Northwest National Laboratory
|
Suggested Technical Category:
User Code Optimization
|
|
Title: Overview of the Current and Future Cray CX Product Family
Abstract: Please join us for a detailed product briefing of an exciting new product from Cray. This new product will be a significant enhancement to the Cray portfolio, and will expand the range of capabilities and programming models available to our customers and prospects. |
Author(s):
Miller, Ian, Presenter Cray Inc. (CRAY)
|
Suggested Technical Category:
Architecture
|