Final Program

CUG - A Forum for HPC Users

 

Cray User Group Origin2000 Workshop
October 11-13, 1998
Denver, Colorado

All meetings will be held in Colorado E

Sunday
7:00-8:00 AM Breakfast in Colorado F (provided by CUG)
8:00-10:00 Tutorial: Origin2000 Optimization, Charles Grassl, SGI
10:00-10:30 Break in Colorado E (provided by CUG)
10:30-10:45 Welcome, Sally Haerer, CUG President, Gary Jensen, NCSA, and Mick Dungworth, SGI
10:45-11:15 Ocean Nested Grid Models and Moderately Parallel Environment, Germana Peggion, University of Southern Mississippi
11:15-12:00 Optimization and Parallelization of a Vector Code (C90) for ORIGIN 2000 Performance: What we accomplished in 7 days, Punyam Satya-narayna, Raytheon Systems Company at ARL MSRC, Aberdeen, Maryland, Phil Mucci, Computer Science Department, University of Tennessee, Knoxville, Ravi Avancha, Mechanical Engineering Dept., Iowa State University, Ames, Iowa
12:00-1:00 Lunch in Colorado F (provided by CUG)
1:00-2:00 Loop Level Parallelism Using Moderate Sized Parallel Processors: Performance Issues, (in three parts) Daniel M. Pressel, Computer Scientist, U.S. Army Research Laboratory
2:00-2:30 Software Roadmap, Kathy Nottingham, SGI
2:30-3:00 Managing Origin Resources using Job Performance Monitor, Michael Shapiro, NCSA
3:00-3:30 Break in Colorado E (provided by CUG)
3:30-4:00 NCAR Experiences with the Origin 2000-128 CPU 250 MHz, Bill Anderson, Barb Bateman, Mary Ann Ciuffini, Steve Gombosi, NCAR
4:00-5:00 Open Session-Interactive version of the Top5 Origin2000 Issues posted on the Origin Repository at Boston University, Moderator: Larry Smarr, NCSA
7:00-9:00 PM Dinner in Colorado F (provided by CUG)
Monday
7:00-8:00 AM Breakfast in Colorado F (provided by CUG)
8:00-10:00 Tutorial: System Administration for Large Origin2000's, Betsy Zeller, SGI
10:00-10:30 Break in Colorado E (provided by CUG)
10:30-11:15 Hardware Roadmap, Rick Bahr, VP Engineering, SGI
11:15-12:00 Scalable Operating Environments, Kent Koeninger, SGI
12:00-1:00 Lunch in Colorado F (provided by CUG)
1:00-1:45 Shared Memory Multi-Level Parallelism for CFD, OVERFLOW-MLP: A Case Study, James R. Taft, Sierra Software, Inc., NASA Ames Research Center
1:45-2:30 Results from a 3D Rayleigh-Taylor Instability Simulation on the Mountain Blue Supercomputer, R. P. Weaver, Los Alamos National Laboratory, P. Fay, Intel Corporation/Sandia National Laboratory, M.L. Gittings, Los Alamos National Laboratory and the Science Applications International Corporation, M.L. Clover, Los Alamos National Laboratory, A. Martinez, Los Alamos National Laboratory, D. Model, Los Alamos National Laboratory
2:30-3:00 Storage System Design for Origin-Class Parallel Calculations in Electromagnetics and Oceanography, Matthew T. O'Keefe, University of Minnesota
3:00-3:30 Break in Colorado E (provided by CUG)
3:30-4:00 An Evaluation of Barrier Synchronization on the Origin 2000, Rick Kufrin, NCSA
4:00-4:30 A Parallel Neural Network Training Code for Control of Dynamical Systems, Javier Vitela, Universidad Nacional Autónoma de México
4:30-5:00 Supercomputing Solutions with ANSYS on the Cray T90 and Origin2000, Gene Poole, SGI and John Vandeventer, Boeing
6:00-7:00 PM Reception in Colorado F (provided by SGI)
7:00-8:00 PM Dinner in Colorado F (provided by CUG)
Tuesday
7:00-8:00 AM Breakfast in Colorado F (provided by CUG)
8:00-10:00 Tutorial: Parallel Programming on Origin2000 using MPI, OpenMP, and SHMEM, Karl Feind and Ramesh Menon, SGI
10:00-10:30 Break in Colorado E (provided by CUG)
10:30-11:00 Origin2000 Service Update, Dave Walls, SGI
11:00-12:00 Systems Experience with the Origin2000's at Los Alamos, Daryl Grunau, Amos Lovato, Velda Volz, Susan Coghlan, Dean Prichard, Joe Kleczha and Curt Canada, LANL
12:00-12:30 Workshop Wrap-up, Sally Haerer and Gary Jensen
12:30 End

Abstracts

Optimization and Parallelization of a Vector Code (C90) for ORIGIN 2000 Performance: What we accomplished in 7 days
Punyam Satya-narayna, Raytheon Systems Company at ARL MSRC, Aberdeen, Maryland,
Phil Mucci, Computer Science Department, University of Tennessee, Knoxville
Ravi Avancha, Mechanical Engineering Dept, Iowa State University, Ames, Iowa

High Performance Computing (HPC) centers receive requests for assistance in parallelization and optimization of legacy FORTRAN codes (often vectorized codes). The migration process from vector machines to NUMA machines is often time consuming, and invariably the question arises: how long does it take, and what can we accomplish in a few weeks' time? We took a serial vector code optimized for a Cray C90 and decided first to go through the exercise of Single Processor Optimization and Tuning (SPOT) and then embark on Multiprocessor Tuning (MUT). Each stage involved several steps. SPOT meant (1) starting from existing tuned code, (2) getting the right answer, (3) finding out where to tune, (4) letting the compiler do the work, and (5) tuning for cache performance. MUT involved (1) parallelization of the code, (2) bottleneck identification, (3) fixing of false sharing, (4) tuning for data placement, and (5) performance analysis at every step. The last step also involved learning the relevant tools.

All the above steps for HPC have been well advertised in technical publications, but the question remains: what can be done in a short period to run a decent parallel code on the ORIGINs? Our code performs direct and large eddy simulations of turbulent flows (around critical components of a gas turbine engine) with heat transfer, using a compressible formulation of the Navier-Stokes equations. We will report on how many of the above 10 steps we accomplished in 7 days.

 

Loop Level Parallelism Using Moderate Sized Parallel Processors: Performance Issues (in three parts)
Daniel M. Pressel, Computer Scientist, U.S. Army Research Laboratory

1) Experiences Using Loop-Level Parallelism to Port an Implicit CFD Code to the 128 Processor Origin 2000.

The code was originally written as an out-of-core solver for Cray vector machines. When modified to run as an in-core solver and placed on an SGI R8000 Power Challenge, it was expected to run at about 1/3 of the speed that it ran on one processor of a C90. In reality, it slowed down by a factor of 40! Extensive serial optimization brought performance up to the expected level of about 1/3 of the C90 speed.

It was then necessary to parallelize the code. It was clear that the code did not easily lend itself to message passing. One could have solved this by using Domain Decomposition, but that would have changed the convergence properties of the algorithm (or alternatively required substantial modifications to the algorithm). Instead, we were able to parallelize the code with no changes to the algorithm or its convergence properties by using loop-level parallelism.

On a 128 processor Origin 2000, the resulting code has demonstrated speedups relative to the vector code running on one processor of a Cray C90 of up to a factor of 27. Relative to running the job on one processor of an O2K, we saw a speedup of up to a factor of 70.
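
As a rough reading of the two speedup figures above, the following sketch derives the parallel efficiency and the implied single-processor speed ratio. It assumes that both "up to" figures refer to the same run, which the abstract does not state; the variable names are illustrative only.

```python
# Figures quoted in the abstract above.
PROCESSORS = 128
SPEEDUP_VS_C90_CPU = 27.0   # 128-processor O2K vs. one C90 processor
SPEEDUP_VS_O2K_CPU = 70.0   # 128-processor O2K vs. one O2K processor

# Parallel efficiency: speedup divided by the number of processors used.
parallel_efficiency = SPEEDUP_VS_O2K_CPU / PROCESSORS

# Implied speed of one O2K processor relative to one C90 processor
# (assumption: both speedups were measured on the same problem).
o2k_cpu_relative_to_c90 = SPEEDUP_VS_C90_CPU / SPEEDUP_VS_O2K_CPU

print(f"parallel efficiency on 128 processors: {parallel_efficiency:.2f}")
print(f"one O2K processor vs. one C90 processor: {o2k_cpu_relative_to_c90:.2f}")
```

Under that assumption, the efficiency is about 55%, and one O2K processor runs this code at roughly 0.39 of the speed of a C90 processor.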

2) In Support of Using Moderate Sized Parallel Processors

Traditionally, efforts to parallelize programs have centered on using large numbers of processors. There is an important reason for this. The processors that these machines were based on were so weak that, in order to achieve "Super Computer" levels of performance, one needed to use massive numbers of processors. This had an unfortunate consequence. Many of the most efficient algorithms in use on serial/vector computers did not support high levels of parallelism. Therefore, many efforts to use MPPs were based on algorithms that either had their computational efficiency degraded or were less than optimal to begin with.

Since the processors used in current machines are significantly faster than those used in the earlier machines, it is now possible to revisit these algorithms. In many cases it should be possible to parallelize these algorithms on moderate sized machines. In some cases this will require using loop-level parallelism, while in other cases traditional message passing will work just fine. However, this also has some consequences when it comes to procurement policies. In particular, it means that the systems will have to be configured with more memory per processor (actually, probably more of everything per processor).

It is our belief that in one case that we have worked on, the results when using a 128 processor Origin 2000 were roughly equivalent to using at least 500 processors on a traditional MPP (e.g. Cray T3E or IBM SP) using traditional approaches to parallelization.

3) Performance and Optimization Issues for the Origin 2000

There are a number of issues related to getting good performance out of RISC-based SMPs such as the Origin 2000. The most obvious of these have to do with writing efficient serial code, and then creating high quality parallel code. However, that is only half of the battle.

It is also important to consider issues at the system level. Some examples of this are:

  • Why paging is such a bad idea.
  • The way the performance can degrade when the load factor exceeds the number of processors.
  • Various issues that go into determining the optimal number of processors to assign to each job.
  • Issues relating to hardware configuration (matching the hardware to the needs of the user community).
  • Issues relating to software configuration (e.g., some hints regarding systune parameters and the setting of some environment variables).

Ocean Nested Grid Models and Moderately Parallel Environment
Germana Peggion, University of Southern Mississippi

This study addresses the feasibility of a two-way nesting algorithm in a moderately parallel environment. The Princeton Ocean Model (POM) is the model of choice for the development of a procedure in which modules, corresponding to 1) a coarse-grid resolution model for the Gulf of Mexico (GOM) and 2) a fine-grid resolution model for the Mississippi Bight (MB) are executed in parallel and communicate the interfacing variables to each other. The approach offers the benefit of modeling coastal environments, taking into account the mutual interactions between the shallow and deep waters, without the computational burden of configuring the basin domain at the high resolution required by coastal applications.

The coupled system is highly portable. C-preprocessor directives control whether the models are executed independently or in parallel, the choice of the communication algorithms, and the message-passing libraries. There are two options for controlling the communications between the coarse and fine grids, based on the PVM and MPI message-passing libraries. Currently, the simulations are executed on the Origin 2000 and Power Challenge platforms. A version with PVM message-passing software is available for the C90 and Cray YMP, but it has not yet been extensively tested.

The POM code is not yet optimized, and it would be nice to have a copy that is optimized for the Origin's 'natural' parallelization.

The communications between modules are not the most central point. I am planning to soon add new equations (codes) for each domain, such as bio-sediment models. I would probably achieve a 'mild' functional parallelization. For the grid size I am working with, it could be a good approach.

Storage System Design for Origin-Class Parallel Calculations in Electromagnetics and Oceanography
Matthew T. O'Keefe, Associate Professor, University of Minnesota

State-of-the-art parallel calculations using Origin-class hardware require fast, scalable storage systems. To this end, we have designed and developed a file system known as GFS that allows multiple SGI machines to share disks across a storage network (Fibre Channel). Like the Cray Shared File System (SFS) for UNICOS, this approach allows all machines to have high-bandwidth access to all storage devices on the network.

Scalable networked storage interfaces like Fibre Channel allow computer architects to design systems with many shared storage devices, increasing the performance and reliability of the design. The source code for GFS, distributed under the GNU General Public License, is free for anyone to use, modify, and redistribute, and can be found on the Web at "http://gfs.lcse.umn.edu".

We are exploiting the shared storage environment provided by IRIX and GFS to more efficiently execute and post-process data associated with our parallel calculations in electromagnetics (see "http://www.lcse.umn.edu/~hayes/") and in oceanography (see "http://www-mount.ee.umn.edu/~okeefe/micom/"). Our talk will conclude with a brief description of these calculations and performance results to date.

NCAR Experiences with the Origin 2000-128 CPU 250 MHz
Bill Anderson, Barb Bateman, Mary Ann Ciuffini, Steve Gombosi, NCAR

This paper will begin by presenting the National Center for Atmospheric Research's (NCAR) current hardware and software configuration of a 128 processor Origin 2000 and the types of jobs that are run on this system. Problems and shortcomings experienced when doing initial installs and upgrades and applying patches on this system will be discussed. System reliability since install will be mentioned. The paper will discuss how NCAR is managing resource allocation with the limited functioning tools that are provided by SGI. Security vulnerabilities that were encountered by NCAR on this system and the actions that NCAR took to protect the system will be presented. The paper will discuss disk I/O, network and swap performance as well as some aspects of application performance. Performance tools will be addressed. What is broken and what works with SGI's accounting will be revealed. Experiences with SGI technical support and suggestions for improvement will be brought up.

Supercomputing Solutions with ANSYS on the Cray T90 and Origin2000
Gene Poole, SGI/Cray and John Vandeventer, Boeing

This talk will describe some challenging FEM analysis projects that require extensive compute resources. The analysis projects involve large FEM models with multiple load step nonlinear contact element analyses. The projects typically require results with fast turnaround time and immediate interaction and feedback as part of an aggressive design process. The talk will feature an example of a recent project at The Boeing Company used in the design of a landing gear assembly. The analysis runs were completed on CRAY T90 and SGI Origin2000 computer systems at Boeing. Improvements in performance are described, including the use of a special CRAY sparse solver on the CRAY T90 system and parallel processing on both CRAY and SGI systems. Interactions with performance engineers at Silicon Graphics reduced the individual analysis times from days to overnight. The talk will describe multiple factors that impact the ability to do large-scale nonlinear analyses and the importance of interaction and contributions between software, hardware and design engineers.

An Evaluation of Barrier Synchronization on the Origin 2000
Rick Kufrin, NCSA

Processor synchronization is a fundamental task in parallel computing. Efficient barrier synchronization primitives can be critical in developing scalable applications, regardless of the target underlying architecture. We describe a series of experiments on the SGI/Cray Origin 2000 distributed shared memory supercomputer using several different barrier implementations, some of which utilize special-purpose instructions available on this machine to achieve improved performance. Results show that the scalability of an application can be significantly affected by the choice of programming model and barrier primitive, especially for applications that require frequent processor synchronization.
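
To make the notion of a barrier primitive concrete, here is a minimal sketch of a centralized counting barrier. It is an illustration only, not one of the implementations evaluated in the talk: it uses a lock and condition variable from Python's standard `threading` module rather than the special-purpose Origin 2000 instructions mentioned above, and the class and function names are hypothetical.

```python
import threading


class CountingBarrier:
    """Centralized barrier: the last thread to arrive releases the others."""

    def __init__(self, parties):
        self.parties = parties
        self.waiting = 0
        self.generation = 0  # incremented each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            arrival_generation = self.generation
            self.waiting += 1
            if self.waiting == self.parties:
                # Last arrival: reset the count and open the barrier.
                self.waiting = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Sleep until the barrier opens; the loop guards against
                # spurious wakeups and makes the barrier safely reusable.
                while self.generation == arrival_generation:
                    self.cond.wait()


def run_phases(num_threads=4, num_phases=3):
    """Each thread logs its phase, then synchronizes before the next phase."""
    barrier = CountingBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker(tid):
        for phase in range(num_phases):
            with log_lock:
                log.append((phase, tid))
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_phases()
phases_in_order = [phase for phase, _ in log]
print(phases_in_order == sorted(phases_in_order))  # no phase overlaps the next
```

Every arrival contends for the same lock and counter, which is the kind of cost that grows with processor count and that motivates comparing alternative barrier implementations.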

Shared Memory Multi-Level Parallelism for CFD, OVERFLOW-MLP: A Case Study
James R. Taft, Sierra Software, Inc., NASA Ames Research Center

High Performance Computing (HPC) platforms are continually evolving toward systems with larger and larger CPU counts. In recent years these systems almost universally utilize standard off-the-shelf microprocessors at the heart of their design. Virtually all hardware vendors have adopted this design approach as it dramatically reduces their costs for building large systems.

Unfortunately, systems built from commodity parts usually force researchers to embark on large code conversion efforts to take advantage of their potential performance. Historically, this has been a daunting, and often unsuccessful, task. For those who attempted it, the work often consumed many man-years. Codes used in heavy production environments were often deemed impossible to convert before the effort was even begun.

Several events have occurred within the past few years that are changing that attitude. First, the vendors are now building systems that share many of the hardware features of the Cray vector systems synonymous with production sites. In particular, the new designs are moving toward large-CPU-count, true shared-memory SMP architectures, albeit with non-uniform memory access (NUMA). Second, the new RISC instruction sets and smart vector-aware compilers are supporting reasonable single-CPU performance on classic Cray vector code once the data vectors reside in cache. These new attributes have opened up the possibility of approaching high performance and large-scale parallelism in an entirely new way that is more intuitive and simpler to implement.

Recent developments at the NASA Ames Research Center's NAS Division have demonstrated that the new generation of NUMA-based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector-oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems.

This high level of performance is achieved via what is generically termed Shared Memory Multi-Level Parallelism (MLP). This programming approach is an alternative to the message-passing paradigm of MPI. It offers parallelism both at the fine and coarse-grained level, with communication latencies that are approximately 100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts.

NAS has developed a particular MLP strategy for the production CFD code, OVERFLOW, that is simple to use and highly effective. The latest implementation of this technique is found in OVERFLOW-MLP. The initial effort to convert OVERFLOW to MLP required only a few man-weeks and a few hundred lines of code changes. OVERFLOW was chosen as the test bed for the MLP development effort because of its heavy use at NAS, and the fact that it is a large code composed of approximately 100,000 lines of FORTRAN. Its large size ensured that OVERFLOW represented one of the toughest tests for the resiliency of the MLP programming approach.

The MLP technique itself is simple. It draws on the natural coarse-grained parallelism found in the multi-zonal flow codes that are the state of the art today. Multi-zonal codes like OVERFLOW attempt to solve for the 3-D flowfield using a patchwork of many smaller 3-D zones quilted together to represent the total fluid volume to be examined. At the end of each time step, boundary condition data is exchanged between the smaller zones, but the remainder of the time is spent doing computations independent of other zones. MLP simply assigns independent processes to solve for the flow in the many 3-D zones in parallel, and uses a shared memory arena to pass boundary data to neighboring zones as needed. For OVERFLOW, this amounts to modifying the main program and 5 other routines for a total of a few hundred lines of change.
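
The zone-per-worker pattern described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not OVERFLOW-MLP: Python threads stand in for the independent processes, a plain list stands in for the shared memory arena, and a 1-D Jacobi smoothing step stands in for the flow solver. All function names are hypothetical.

```python
import threading


def step_serial(u):
    """Reference: one Jacobi smoothing step with both end values held fixed."""
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]


def run_zonal(u0, num_zones, steps):
    """Advance u0 by `steps` smoothing steps with one worker per zone.

    Assumes len(u0) >= num_zones so that no zone is empty.
    """
    n = len(u0)
    bounds = [(z * n // num_zones, (z + 1) * n // num_zones) for z in range(num_zones)]
    zones = [u0[lo:hi] for lo, hi in bounds]
    arena = [None] * num_zones  # stand-in for the shared memory arena
    barrier = threading.Barrier(num_zones)

    def worker(z):
        for _ in range(steps):
            # Publish this zone's edge values for its neighbors.
            arena[z] = (zones[z][0], zones[z][-1])
            barrier.wait()  # every zone's boundary data is now visible
            left = arena[z - 1][1] if z > 0 else None
            right = arena[z + 1][0] if z < num_zones - 1 else None
            old = zones[z]
            new = []
            for i, value in enumerate(old):
                lo_val = old[i - 1] if i > 0 else left
                hi_val = old[i + 1] if i < len(old) - 1 else right
                if lo_val is None or hi_val is None:
                    new.append(value)  # physical boundary: held fixed
                else:
                    new.append((lo_val + hi_val) / 2)
            zones[z] = new
            barrier.wait()  # all reads done before the arena is overwritten

    threads = [threading.Thread(target=worker, args=(z,)) for z in range(num_zones)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [value for zone in zones for value in zone]


u0 = [float(i * i) for i in range(16)]
expected = u0
for _ in range(5):
    expected = step_serial(expected)
result = run_zonal(u0, num_zones=4, steps=5)
print(result == expected)
```

The only coupling between workers is the small boundary record each one writes to the arena once per step; everything else is zone-local, which is why the changes to an existing zonal code can stay small.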

Doing the computation of zones in parallel is not new; the MPI version of OVERFLOW already attempts this process. The unique feature of the MLP approach is that it does so with no message passing and only a few hundred lines of code changes. The end result is that the code is simple to maintain, continues to execute well on C90 systems, and now executes well on parallel systems at very high sustained levels of performance.

The MLP method is general, and applicable to a large class of NASA CFD codes. The MLP methodology and techniques are described in detail below. The method is first discussed in general terms to provide an understanding of how it may be applied to many popular production CFD codes. This is followed by discussion of the application of the technique to the OVERFLOW code. Finally, a selected set of performance results is presented for large real problems on machines varying in size from 8 to 256 CPUs. Problems as large as 35 million points are considered.

The new techniques of Multi-Level Parallelism developed under the O2000 Optimization Effort have demonstrated dramatic cost and performance benefits for production CFD codes at NASA. The popular CFD code, OVERFLOW, has sustained 20 GFLOPS when solving the largest CFD problem ever attempted at NAS. If success continues, the newly developed MLP techniques will allow fast code conversions and a dramatic reduction in run times for the largest CFD problems, on platforms that are an order of magnitude lower in cost than traditional vector supercomputer resources.



Managing Origin Resources using Job Performance Monitor
Michael Shapiro, NCSA

NCSA has 768 SGI Origin2000 processors, currently split into 8 machines running IRIX 6.4 or IRIX 6.5: one interactive host with 32 CPUs and 4 GB of memory, and 7 batch hosts with 32, 64, or 128 CPUs and 12 GB to 64 GB of memory.

Interactive user needs are limited to short-duration runs and 8-wide parallelism. Batch needs range from small debug-type jobs to jobs that are 128 wide with 64 GB of memory. The majority of jobs use 16 or fewer processors and 2 GB or less of memory, with 50 to 200 hour run times.

This talk discusses how NCSA monitors and controls resource usage on these machines.



Results from a 3D Rayleigh-Taylor Instability Simulation on the Mountain Blue Supercomputer
R. P. Weaver (1), P. Fay (2), M.L. Gittings (1,3), M.L. Clover (1), A. Martinez (1), and D. Model (1)
(1) Los Alamos National Laboratory
(2) Intel Corporation/Sandia National Laboratory
(3) Science Applications International Corporation

Rayleigh-Taylor Instability (RTI) simulations have been run with the RAGE Continuous Adaptive Mesh Refinement (CAMR) Eulerian code on the Los Alamos National Laboratory's Blue Mountain supercomputer. The RAGE code is part of the CRESTONE ASCI project at Los Alamos. The main goal of this project is to investigate the use of CAMR techniques for application to 3D stockpile stewardship codes. These RTI simulations were run on both the Sandia National Laboratory ASCI Red machine (MPP) and the Los Alamos Blue Mountain machine (SMP) in order to compare each machine's efficiency at running this CAMR code. The Red simulation was run on as many as 1000 PEs, while the Blue simulation was run continuously on 5 boxes of 62 PEs. The Blue simulation took 360 hours (or ~15 days) to complete, and is equivalent to running continuously on a single Blue processor for 12.7 years! Results of both 2D and 3D multimode simulations will be presented, together with various forms of visualizations of the 3D simulation, including movies of the relevant isosurfaces and the entire volume.
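
The 12.7-year figure follows directly from the numbers in the abstract. The short calculation below reproduces it; the only assumption is the length of a year (365.25 days), and the equivalence ignores any parallel overhead, as the abstract's own comparison does.

```python
# Figures quoted in the abstract above.
BOXES = 5
PROCESSORS_PER_BOX = 62
WALL_CLOCK_HOURS = 360

HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length

processors = BOXES * PROCESSORS_PER_BOX
processor_hours = processors * WALL_CLOCK_HOURS
wall_clock_days = WALL_CLOCK_HOURS / 24
single_processor_years = processor_hours / HOURS_PER_YEAR

print(f"{processors} processors for {wall_clock_days:.0f} days")
print(f"{processor_hours} processor-hours, or {single_processor_years:.1f} years on one processor")
```

That is 310 processors for 15 days, or 111,600 processor-hours, which works out to about 12.7 years on a single processor.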


A Parallel Neural Network Training Code for Control of Dynamical Systems
Javier Vitela, Universidad Nacional Autónoma de México

In this paper we present a parallel neural network training code that makes use of MPI, a portable message-passing environment. The sequential algorithm is presented, after which a parallel training algorithm is discussed. A performance analysis is reported that compares the results of a theoretical performance model with actual measurements. The analysis is made for three different load assignment schemes: two static and one quasi-static. This analysis is important because optimal load balance cannot be achieved when the workload information is not available a priori. The speed-up results obtained are compared with those corresponding to the bin-packing load balance scheme with perfect load prediction, based on a priori knowledge of the computing effort.
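
To illustrate why a priori knowledge of the workload matters, the sketch below compares a static round-robin assignment with a greedy bin-packing assignment that knows every task's cost in advance. Both schemes and the task costs are assumptions chosen for illustration; the abstract does not say which static schemes or which bin-packing heuristic the paper uses.

```python
import heapq


def round_robin(costs, workers):
    """Static scheme: task i goes to worker i mod workers, ignoring cost."""
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return loads


def greedy_bin_packing(costs, workers):
    """Largest task first, each to the currently least-loaded worker."""
    heap = [(0.0, w) for w in range(workers)]
    heapq.heapify(heap)
    for cost in sorted(costs, reverse=True):
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, w))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


def speedup(costs, loads):
    """Ideal speedup: total serial work over the most heavily loaded worker."""
    return sum(costs) / max(loads)


# Hypothetical per-task costs (arbitrary time units).
costs = [8.0, 1.0, 1.0, 1.0, 7.0, 1.0, 1.0, 1.0, 6.0, 1.0, 1.0, 1.0]
static_loads = round_robin(costs, 4)
packed_loads = greedy_bin_packing(costs, 4)

print(f"round-robin speedup: {speedup(costs, static_loads):.2f}")
print(f"bin-packing speedup: {speedup(costs, packed_loads):.2f}")
```

With these made-up costs, round-robin happens to place all three expensive tasks on the same worker, while the cost-aware packing spreads them out. A scheme with perfect load prediction therefore serves as an upper reference against which assignments made without that information can be judged.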

Systems Experience with the Origin2000's at Los Alamos
Daryl Grunau, Amos Lovato, Velda Volz, Susan Coghlan, Dean Prichard, Joe Kleczha and Curt Canada, LANL

No abstract available.

Questions and information: contact Gary Jensen at (303) 530-0354, or guido@ncsa.uiuc.edu
Created: September 22, 1998; Revised October 5, 1998

© Cray User Group Inc. All rights reserved.  Page last modified: 27 Aug 01