Cray User Group
Origin2000 Workshop
October 11-13, 1998
Denver, Colorado
Final Program
All meetings will be held in Colorado E
| Sunday |
| 7:00-8:00 AM | Breakfast in Colorado F (provided by CUG) |
| 8:00-10:00 | Tutorial: Origin2000 Optimization, Charles Grassl, SGI |
| 10:00-10:30 | Break in Colorado E (provided by CUG) |
| 10:30-10:45 | Welcome, Sally Haerer, CUG President, Gary Jensen, NCSA, and Mick Dungworth, SGI |
| 10:45-11:15 | Ocean Nested Grid Models and Moderately Parallel Environment, Germana Peggion, University of Southern Mississippi |
| 11:15-12:00 | Optimization and Parallelization of a vector code (C90) for ORIGIN 2000 performance: What we accomplished in 7 days, Punyam Satya-narayna, Raytheon Systems Company at ARL MSRC, Aberdeen, Maryland, Phil Mucci, Computer Science Department, University of Tennessee, Knoxville, Ravi Avancha, Mechanical Engineering Dept., Iowa State University, Ames, Iowa |
| 12:00-1:00 | Lunch in Colorado F (provided by CUG) |
| 1:00-2:00 | Loop Level Parallelism Using Moderate Sized Parallel Processors: Performance Issues, (in three parts) Daniel M. Pressel, Computer Scientist, U.S. Army Research Laboratory |
| 2:00-2:30 | Software Roadmap, Kathy Nottingham, SGI |
| 2:30-3:00 | Managing Origin Resources using Job Performance Monitor, Michael Shapiro, NCSA |
| 3:00-3:30 | Break in Colorado E (provided by CUG) |
| 3:30-4:00 | NCAR Experiences with the Origin 2000-128 CPU 250 MHZ, Bill Anderson, Barb Bateman, Mary Ann Ciuffini, Steve Gombosi, NCAR |
| 4:00-5:00 | Open Session-Interactive version of the Top5 Origin2000 Issues posted on the Origin Repository at Boston University, Moderator: Larry Smarr, NCSA |
| 7:00-9:00 PM | Dinner in Colorado F (provided by CUG) |
| Monday |
| 7:00-8:00 AM | Breakfast in Colorado F (provided by CUG) |
| 8:00-10:00 | Tutorial: System Administration for Large Origin2000's, Betsy Zeller, SGI |
| 10:00-10:30 | Break in Colorado E (provided by CUG) |
| 10:30-11:15 | Hardware Roadmap, Rick Bahr, VP Engineering, SGI |
| 11:15-12:00 | Scalable Operating Environments, Kent Koeninger, SGI |
| 12:00-1:00 | Lunch in Colorado F (provided by CUG) |
| 1:00-1:45 | Shared Memory Multi-Level Parallelism for CFD, OVERFLOW-MLP: A Case Study, James R. Taft, Sierra Software, Inc., NASA AMES Research Center |
| 1:45-2:30 | Results from a 3D Rayleigh-Taylor Instability Simulation on the Mountain Blue Supercomputer, R. P. Weaver, Los Alamos National Laboratory, P. Fay, Intel Corporation/Sandia National Laboratory, M.L. Gittings, Los Alamos National Laboratory and the Science Applications International Corporation, M.L. Clover, Los Alamos National Laboratory, A. Martinez, Los Alamos National Laboratory, D. Model, Los Alamos National Laboratory |
| 2:30-3:00 | Storage System Design for Origin-Class Parallel Calculations in Electromagnetics and Oceanography, Matthew T. O'Keefe, University of Minnesota |
| 3:00-3:30 | Break in Colorado E (provided by CUG) |
| 3:30-4:00 | An Evaluation of Barrier Synchronization on the Origin 2000, Rick Kufrin, NCSA |
| 4:00-4:30 | A Parallel Neural Network Training Code for Control of Dynamical Systems, Javier Vitela, Universidad Nacional Autónoma de México |
| 4:30-5:00 | Supercomputing Solutions with ANSYS on the Cray T90 and Origin2000, Gene Poole, SGI and John Vandeventer, Boeing |
| 6:00-7:00 PM | Reception in Colorado F (provided by SGI) |
| 7:00-8:00 PM | Dinner in Colorado F (provided by CUG) |
| Tuesday |
| 7:00-8:00 AM | Breakfast in Colorado F (provided by CUG) |
| 8:00-10:00 | Tutorial: Parallel Programming on Origin2000 using MPI, OpenMP, and SHMEM, Karl Feind and Ramesh Menon, SGI |
| 10:00-10:30 | Break in Colorado E (provided by CUG) |
| 10:30-11:00 | Origin2000 Service Update, Dave Walls, SGI |
| 11:00-12:00 | Systems Experience with the Origin2000's at Los Alamos, Daryl Grunau, Amos Lovato, Velda Volz, Susan Coghlan, Dean Prichard, Joe Kleczha and Curt Canada, LANL |
| 12:00-12:30 | Workshop Wrap-up, Sally Haerer and Gary Jensen |
| 12:30 | End |
Abstracts
Optimization and Parallelization of a Vector Code (C90) for
ORIGIN 2000 Performance: What we accomplished in 7 days
Punyam Satya-narayna, Raytheon Systems Company at ARL MSRC, Aberdeen, Maryland,
Phil Mucci, Computer Science Department, University of Tennessee, Knoxville
Ravi Avancha, Mechanical Engineering Dept, Iowa State University, Ames, Iowa
High Performance Computing (HPC) centers receive requests for
assistance in parallelization and optimization of legacy FORTRAN codes (often vectorized
codes). The migration process from vector machines to NUMA machines is often
time-consuming, and invariably the question arises: how long does it take, and what can we
accomplish in a few weeks' time? We took a serial vector code optimized for a Cray C90 and
decided first to go through the exercise of Single Processor Optimization and Tuning
(SPOT) and then embark on Multiprocessor Tuning (MUT). This involved several steps. SPOT
meant (1) starting from existing tuned code, (2) getting the right answer, (3) finding out
where to tune, (4) letting the compiler do the work, and (5) tuning for cache performance.
In the second stage, MUT involved (1) parallelization of the code, (2) bottleneck
identification, (3) fixing of false sharing, (4) tuning for data placement, and (5)
performance analysis at every step. The last step also involved learning relevant tools.
All the above steps for HPC have been well advertised in technical publications, but the
question remains: what can be done in a short period to run a decent parallel code on
the ORIGINs? Our code performs direct and large eddy simulations of turbulent flows
(around critical components of a gas turbine engine) with heat transfer using a compressible
formulation of the Navier-Stokes equations. We will report on how many of the above 10
steps we accomplished in 7 days.
Loop Level Parallelism Using Moderate Sized Parallel Processors:
Performance Issues (in three parts)
Daniel M. Pressel, Computer Scientist, U.S. Army Research Laboratory
1) Experiences Using Loop-Level Parallelism to Port an Implicit CFD
Code to the 128 Processor Origin 2000.
The code was originally written as an out-of-core solver for Cray Vector machines. When
modified to run as an in-core solver and placed on an SGI R8000 Power Challenge, it was
expected to run at about 1/3 of the speed that it ran on one processor of a C90. In
reality, it slowed down by a factor of 40! Extensive serial optimization improved things
to the expected slowdown of a factor of 3.
It was then necessary to parallelize the code. It was clear that the code did not easily
lend itself to message passing. One could have solved this by using Domain
Decomposition, but that would have changed the convergence property of the algorithm (or
alternatively required substantial modifications to the algorithm). Instead we were able
to parallelize the code with no changes to the algorithm or its convergence properties by
using loop-level parallelism.
On a 128 processor Origin 2000, the resulting code has demonstrated speedups relative to
the vector code running on one processor of a Cray C90 of up to a factor of 27. Relative
to running the job on one processor of an O2K, we saw a speedup of up to a factor of 70.
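The two speedup figures quoted above can be related with a little arithmetic. The sketch below is illustrative only: it assumes, hypothetically, that both best-case figures come from the same 128-processor run, which the abstract does not state.

```python
# Figures quoted in the abstract above.
processors = 128
speedup_vs_c90 = 27.0      # best case, relative to one Cray C90 processor
speedup_vs_one_o2k = 70.0  # best case, relative to one Origin 2000 processor

# Parallel efficiency: fraction of ideal linear speedup actually achieved.
parallel_efficiency = speedup_vs_one_o2k / processors

# Implied speed of one Origin 2000 processor relative to one C90 processor.
# ASSUMPTION: both speedups refer to the same run; the abstract does not say so.
o2k_vs_c90_single = speedup_vs_c90 / speedup_vs_one_o2k

print(f"parallel efficiency on {processors} processors: {parallel_efficiency:.2f}")
print(f"one O2K processor relative to one C90 processor: {o2k_vs_c90_single:.2f}")
```

Under that assumption, the run achieves a bit more than half of ideal linear speedup, and a single Origin 2000 processor comes out at somewhat over a third of the speed of one C90 processor.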
2) In Support of Using Moderate Sized Parallel Processors
Traditionally, efforts to parallelize programs have centered on using large numbers of
processors. There is an important reason for this. The processors that these machines were
based on were so weak that in order to achieve "Super Computer" levels of
performance one needed to use massive numbers of processors. This had an
unfortunate consequence. Many of the most efficient algorithms in use on serial/vector
computers did not support high levels of parallelism. Therefore, many efforts to use MPPs
were based on algorithms that either had their computational efficiency degraded, or were
less than optimal to begin with.
Since the processors used in current machines are significantly faster than those used in
the earlier machines, it is now possible to revisit these algorithms. In many cases it
should be possible to parallelize these algorithms on moderate sized machines. In some
cases this will require using loop-level parallelism, while in other cases traditional
message passing will work just fine. However, this also has some consequences when it comes
to procurement policies. In particular, this means that the systems will have to be
configured with more memory per processor (actually, probably more of everything per
processor).
It is our belief that in one case that we have worked on, the results when using a 128
processor Origin 2000 were roughly equivalent to using at least 500 processors on a
traditional MPP (e.g. Cray T3E or IBM SP) using traditional approaches to parallelization.
3) Performance and Optimization Issues for the Origin 2000
There are a number of issues related to getting good performance out of RISC-based SMPs
such as the Origin 2000. The most obvious of these have to do with writing efficient
serial code, and then creating high quality parallel code. However, that is only half of
the battle.
It is also important to consider issues at the system level. Some examples of this are:
- Why paging is such a bad idea.
- The way the performance can degrade when the load factor exceeds the
number of processors.
- Various issues that go into determining the optimal number of
processors to assign to each job.
- Issues relating to hardware configuration (matching the hardware to
the needs of the user community).
- Issues relating to software configuration (e.g., some hints regarding
systune parameters and the setting of some environment variables).
Ocean Nested Grid Models and Moderately Parallel Environment
Germana Peggion, University of Southern Mississippi
This study addresses the feasibility of a two-way nesting algorithm
in a moderately parallel environment. The Princeton Ocean Model (POM) is the model of
choice for the development of a procedure in which modules, corresponding to 1) a
coarse-grid resolution model for the Gulf of Mexico (GOM) and 2) a fine-grid resolution
model for the Mississippi Bight (MB) are executed in parallel and communicate the
interfacing variables to each other. The approach offers the benefit of modeling coastal
environments, taking into account the mutual interactions between the shallow and deep
waters, without the computational burden of configuring the basin domain at the high
resolution required by coastal applications.
The coupled system is highly portable. C-preprocessor directives control whether the
models are executed independently or in parallel, the choice of the communication
algorithms, and the message-passing libraries. There are two options for controlling the
communications between the coarse and fine grids based on the PVM and MPI message-passing
libraries. Currently, the simulations are executed on the Origin 2000 and Power Challenge
platforms. A version with PVM message passing software is available for the C-90 and Cray
YMP, but has not yet been extensively tested.
The POM code is not yet optimized, and it would be useful to have a copy that is optimized
for the Origin's 'natural' parallelization.
The communications between modules are not the most central point. I am planning to add
new equations (codes) for each domain soon, such as bio-sediment models. I would probably
achieve a 'mild' functional parallelization. For the grid size I am working with, it could
be a good approach.
Storage System Design for Origin-Class Parallel
Calculations in Electromagnetics and Oceanography
Matthew T. O'Keefe, Associate Professor, University of Minnesota
State-of-the-art parallel calculations using Origin-class hardware
require fast, scalable storage systems. To this end, we have designed and developed a file
system known as GFS that allows multiple SGI machines to share disks across a storage
network (Fibre Channel). Like the Cray Shared File System (SFS) for UNICOS, this approach
allows all machines to have high-bandwidth access to all storage devices on the network.
Scalable networked storage interfaces like Fibre Channel allow computer architects to
design systems with many shared storage devices, increasing the performance and
reliability of the design. The source code for GFS, distributed under the GNU General Public
License, is free for anyone to use, modify, and re-distribute and can be found on the Web
at "http://gfs.lcse.umn.edu".
We are exploiting the shared storage environment provided by IRIX and GFS to more
efficiently execute and post-process data associated with our parallel calculations in
electromagnetics (see "http://www.lcse.umn.edu/~hayes/") and in oceanography
(see "http://www-mount.ee.umn.edu/~okeefe/micom/"). Our talk will conclude with
a brief description of these calculations and performance results to date.
NCAR Experiences with the Origin 2000-128 CPU 250 MHZ
Bill Anderson, Barb Bateman, Mary Ann Ciuffini, Steve Gombosi, NCAR
This paper will begin by presenting the National Center for
Atmospheric Research's (NCAR) current hardware and software configuration of a 128
processor Origin 2000 and the types of jobs that are run on this system. Problems and
shortcomings experienced when doing initial installs and upgrades and applying patches on
this system will be discussed. System reliability since install will be mentioned. The
paper will discuss how NCAR is managing resource allocation with the limited-function
tools that are provided by SGI. Security vulnerabilities that were encountered by NCAR on
this system and the actions that NCAR took to protect the system will be presented. The
paper will discuss disk I/O, network and swap performance as well as some aspects of
application performance. Performance tools will be addressed. What is broken and what
works with SGI's accounting will be revealed. Experiences with SGI technical support and
suggestions for improvement will be brought up.
Supercomputing Solutions with ANSYS on the Cray T90 and
Origin2000
Gene Poole, SGI/Cray and John Vandeventer, Boeing
This talk will describe some challenging FEM analysis projects that
require extensive compute resources. The analysis projects involve large FEM models with
multiple load step nonlinear contact element analyses. The projects typically require
results with fast turn around time and immediate interaction and feedback as part of an
aggressive design process. The talk will feature an example of a recent project at The
Boeing Company used in the design of a landing gear assembly. The analysis runs were
completed on CRAY T90 and SGI Origin2000 computer systems at Boeing. Improvements in
performance are described, including the use of a special CRAY sparse solver on the CRAY
T90 system and parallel processing on both CRAY and SGI systems. Interactions with
performance engineers at Silicon Graphics reduced the individual analysis times from days
to overnight. The talk will describe multiple factors that impact the ability to do
large-scale nonlinear analyses and the importance of interaction and contributions between
software, hardware and design engineers.
An Evaluation of Barrier Synchronization on the Origin 2000
Rick Kufrin, NCSA
Processor synchronization is a fundamental task in parallel
computing. Efficient barrier synchronization primitives can be critical in developing
scalable applications, regardless of the target underlying architecture. We describe a
series of experiments on the SGI/Cray Origin 2000 distributed shared memory supercomputer
using several different barrier implementations, some of which utilize special-purpose
instructions available on this machine to achieve improved performance. Results show that
the scalability of an application can be significantly affected by the choice of
programming model and barrier primitive, especially for applications that require frequent
processor synchronization.
Shared Memory Multi-Level Parallelism for CFD OVERFLOW-MLP:
A Case Study
James R. Taft, Sierra Software, Inc., NASA AMES Research Center
High Performance Computing (HPC) platforms are continually evolving
toward systems with larger and larger CPU counts. In recent years these systems almost
universally utilize standard off-the-shelf microprocessors at the heart of their design.
Virtually all hardware vendors have adopted this design approach as it dramatically
reduces their costs for building large systems.
Unfortunately, systems built from commodity parts usually force researchers to embark on
large code conversion efforts to take advantage of the potential performance.
Historically, this has been a daunting, and often unsuccessful, task. For those who
attempted it, the conversion often consumed many man-years of effort. Codes used in heavy
production environments were often deemed impossible to convert before the effort
was even begun.
Several events have occurred within the past few years that are changing that attitude.
First, the vendors are now building systems that share many of the hardware features of
the Cray vector systems synonymous with production sites. In particular, the new designs
are moving toward large CPU count true shared memory SMP architectures, albeit with
non-uniform memory access (NUMA). Second, the new RISC instruction sets and smart
vector-aware compilers are supporting reasonable single CPU performance on classic Cray vector
code once the data vectors reside in cache. These new attributes have opened up the
possibility of approaching high performance and large scale parallelism in an entirely new
way that is more intuitive and simpler to implement.
Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that
the new generation of NUMA-based Symmetric Multi-Processing systems (SMPs), such as the
Silicon Graphics Origin 2000, can successfully execute legacy vector-oriented CFD
production codes at sustained rates far exceeding processing rates possible on dedicated
16 CPU Cray C90 systems.
This high level of performance is achieved via what is generically termed Shared Memory
Multi-Level Parallelism (MLP). This programming approach is an alternative to the
message-passing paradigm of MPI. It offers parallelism both at the fine and coarse-grained
level, with communication latencies that are approximately 100 times lower than typical
MPI implementations on the same platform. Such latency reductions offer the promise of
performance scaling to very large CPU counts.
NAS has developed a particular MLP strategy for the production CFD code, OVERFLOW, that is
simple to use and highly effective. The latest implementation of this technique is found
in OVERFLOW-MLP. The initial effort to convert OVERFLOW to MLP required only a few
man-weeks and a few hundred lines of code changes. OVERFLOW was chosen as the test bed for the
MLP development effort because of its heavy use at NAS, and the fact that it is a large
code composed of approximately 100,000 lines of FORTRAN. Its large size ensured that
OVERFLOW represented one of the toughest tests for the resiliency of the MLP programming
approach.
The MLP technique itself is simple. It draws on the natural coarse-grained parallelism
found in the multi-zonal flow codes that are the state of the art today. Multi-zonal codes
like OVERFLOW attempt to solve for the 3-D flowfield using a patchwork of many smaller 3-D
zones quilted together to represent the total fluid volume to be examined. At the end of
each time step boundary condition data is exchanged between the smaller zones, but the
remainder of the time is spent doing computations independent of other zones. MLP simply
assigns independent processes to solve for the flow in the many 3-D zones in a parallel
fashion, and uses the technique of a shared memory arena to pass boundary data to
neighboring zones as needed. For OVERFLOW, this amounts to modifying the main program and
5 other routines for a total of a few hundred lines of change.
Doing the computation of zones in parallel is not new; the MPI version of OVERFLOW already
attempts this process. The unique feature of the MLP approach is that it does so with no
message passing and only a few hundred lines of code changes. The end result is that the
code is simple to maintain, continues to execute well on C90 systems, and now executes
well on parallel systems at very high sustained levels of performance.
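The zone-per-process pattern with a shared arena for boundary data can be sketched in a few lines. This is a toy illustration, not the OVERFLOW-MLP implementation: Python threads stand in for the independent processes, a plain list stands in for the shared memory arena, and the 1-D zones, the 3-point averaging "solver", and every name below are assumptions made for the example.

```python
import threading

NUM_ZONES = 3
CELLS_PER_ZONE = 4
NUM_STEPS = 5

def initial_zones():
    # Zone z starts at the constant value z (hypothetical initial condition).
    return [[float(z)] * CELLS_PER_ZONE for z in range(NUM_ZONES)]

def smooth(left_ghost, cells, right_ghost):
    # Toy stand-in for the per-zone solver: a 3-point average.
    padded = [left_ghost] + cells + [right_ghost]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(cells) + 1)]

def run_parallel():
    zones = initial_zones()
    # Shared "arena": one (left_edge, right_edge) slot per zone. Any worker
    # may read any slot, but each worker writes only its own slot.
    arena = [(zone[0], zone[-1]) for zone in zones]
    barrier = threading.Barrier(NUM_ZONES)

    def solve_zone(z):
        cells = zones[z]
        for _ in range(NUM_STEPS):
            # Pick up neighbouring boundary data from the arena.
            left = arena[z - 1][1] if z > 0 else cells[0]
            right = arena[z + 1][0] if z < NUM_ZONES - 1 else cells[-1]
            barrier.wait()  # all reads finish before any slot is overwritten
            # Independent computation on this zone only.
            cells[:] = smooth(left, cells, right)
            # Publish the new edge values for the neighbours.
            arena[z] = (cells[0], cells[-1])
            barrier.wait()  # all writes finish before the next round of reads

    workers = [threading.Thread(target=solve_zone, args=(z,))
               for z in range(NUM_ZONES)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [value for zone in zones for value in zone]

def run_serial():
    # Reference: the same smoothing applied to the whole domain at once.
    field = [value for zone in initial_zones() for value in zone]
    for _ in range(NUM_STEPS):
        field = smooth(field[0], field, field[-1])
    return field

parallel_result = run_parallel()
serial_result = run_serial()
print(parallel_result == serial_result)
```

The only coordination between zones is the small boundary exchange through the shared list; no messages are packed or sent, which is the property the abstract emphasizes.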
The MLP method is general, and applicable to a large class of NASA CFD codes. The MLP
methodology and techniques are described in detail below. The method is first discussed in
general terms to provide an understanding of how it may be applied to many popular
production CFD codes. This is followed by discussion of the application of the technique
to the OVERFLOW code. Finally, a selected set of performance results is presented for
large real problems on machines varying in size from 8 to 256 CPUs. Problems as large as
35 million points are considered.
The new techniques of Multi-Level Parallelism developed under the O2000 Optimization
Effort have demonstrated dramatic cost and performance benefits for production CFD codes
at NASA. The popular CFD code, OVERFLOW, has sustained 20 GFLOPs in performance when
solving the largest CFD problem ever attempted at NAS. If success continues, the newly
developed MLP techniques will allow fast code conversions and a dramatic reduction in run
times for the largest CFD problems, on platforms that are an order of
magnitude lower in cost than typical traditional vector supercomputer resources.
Managing Origin Resources using Job Performance Monitor
Michael Shapiro, NCSA
NCSA has 768 Origin 2000 processors, currently split into 8
machines running IRIX 6.4 or IRIX 6.5: one interactive host with 32 CPUs and 4G of memory,
and 7 batch hosts with 32, 64, or 128 CPUs and 12G to 64G of memory.
Interactive user needs are limited to short-duration runs and 8-wide parallelism. Batch
needs range from small debug-type jobs to jobs that are 128 wide with 64G of memory. The
majority of jobs use 16 or fewer processors and 2G or less of memory, with 50 to 200 hour run times.
This talk discusses how NCSA monitors and controls resource usage on these machines.
Results from a 3D Rayleigh-Taylor Instability Simulation on the Mountain Blue
Supercomputer
R. P. Weaver (1), P. Fay (2), M.L. Gittings (1,3), M.L. Clover (1),
A. Martinez (1), and D. Model (1)
(1) Los Alamos National Laboratory
(2) Intel Corporation/Sandia National Laboratory
(3) Science Applications International Corporation
Rayleigh-Taylor Instability (RTI) simulations have been run with the
RAGE Continuous Adaptive Mesh Refinement (CAMR) Eulerian code on the Los Alamos National
Laboratory's Blue Mountain supercomputer. The RAGE code is part of the CRESTONE ASCI
project at Los Alamos. The main goal of this project is to investigate the use of
CAMR techniques for application to 3D stockpile
stewardship codes. These RTI simulations were run on both the Sandia National Laboratory ASCI Red
machine (MPP) and the Los Alamos Blue Mountain machine (SMP) in order to compare each
machine's efficiency at running this CAMR code. The Red simulation was run on as many as
1000 pe, while the Blue simulation was run continuously on 5 boxes of 62 pe. The Blue
simulation took 360 hours (or ~15 days) to complete, and is equivalent to running
continuously on a single Blue processor for 12.7 years! Results of both 2D and 3D
multimode simulations will be presented, together with various forms of visualizations of
the 3D simulation including movies of the relevant isosurfaces and the entire volume.
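The single-processor equivalent quoted above can be checked from the other figures in the abstract. The sketch assumes perfect scaling (no parallel overhead) and a 365.25-day year; neither assumption is stated in the abstract.

```python
# Figures quoted in the abstract above.
boxes = 5
pe_per_box = 62
wall_clock_hours = 360

total_pe = boxes * pe_per_box                  # processors used on Blue
wall_clock_days = wall_clock_hours / 24        # duration of the run in days
processor_hours = total_pe * wall_clock_hours  # total work, assuming perfect scaling

HOURS_PER_YEAR = 365.25 * 24                   # ASSUMPTION: average year length
single_processor_years = processor_hours / HOURS_PER_YEAR

print(f"{total_pe} pe for {wall_clock_days:.0f} days "
      f"= {processor_hours} processor-hours "
      f"= {single_processor_years:.1f} single-processor years")
```

With these assumptions the numbers are mutually consistent: 310 pe for 15 days works out to about 12.7 years on one processor.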
A Parallel Neural Network Training Code for Control of Dynamical Systems
Javier Vitela, Universidad Nacional Autónoma de México
In this paper we present a parallel neural network training code
that makes use of MPI, a portable message-passing environment. The sequential algorithm is
presented, after which a parallel training algorithm is discussed. A performance analysis
is reported which compares results of a theoretical performance model with actual
measurements. The analysis is made for three different load assignment schemes: two static
and one quasi-static. This analysis is important because optimal load balance cannot
be achieved when the workload information is not available a priori. The speed-up
results obtained are compared with those corresponding to the bin-packing load balance
scheme with perfect load prediction, based on a priori knowledge of the computing effort.
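To make the comparison concrete, the sketch below contrasts a static block assignment, which ignores task costs, with a greedy bin-packing assignment that knows every cost in advance. The abstract does not say which bin-packing heuristic was used; the largest-cost-first rule here is one common choice, and the function names and cost values are hypothetical.

```python
import heapq

def block_assign(costs, num_workers):
    """Static scheme: contiguous blocks of tasks, chosen without looking at costs."""
    n = len(costs)
    chunk = -(-n // num_workers)  # ceiling division
    assignment = [list(range(w * chunk, min((w + 1) * chunk, n)))
                  for w in range(num_workers)]
    return [sum(costs[t] for t in tasks) for tasks in assignment]

def bin_packing_assign(costs, num_workers):
    """Greedy bin packing with known costs: largest task first, to the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    loads = [0] * num_workers
    for task in sorted(range(len(costs)), key=lambda t: costs[t], reverse=True):
        load, w = heapq.heappop(heap)
        loads[w] = load + costs[task]
        heapq.heappush(heap, (loads[w], w))
    return loads

def speedup(costs, loads):
    # Serial time divided by the time of the slowest worker.
    return sum(costs) / max(loads)

task_costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical per-task computing effort
block_loads = block_assign(task_costs, 2)
packed_loads = bin_packing_assign(task_costs, 2)

print("block loads:", block_loads, "speed-up:", round(speedup(task_costs, block_loads), 2))
print("packed loads:", packed_loads, "speed-up:", round(speedup(task_costs, packed_loads), 2))
```

The bin-packing result serves as the reference point: it shows the speed-up that would be reachable if the computing effort of each task were known before assignment.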
Systems Experience with the Origin2000's at Los Alamos
Daryl Grunau, Amos Lovato, Velda Volz, Susan Coghlan, Dean Prichard, Joe Kleczha
and Curt Canada, LANL
No abstract available.

Questions and information: contact Gary Jensen at (303) 530-0354, or
guido@ncsa.uiuc.edu
Created: September 22, 1998; Revised October 5, 1998