General Purpose Timing Library (GPTL): A Tool for Characterizing Performance of Parallel and Serial Applications

Presented at Cray User Group 2009, May 4-7, 2009

Jim Rosinski
Oak Ridge National Laboratory
865 241-3026
rosinskijm@ornl.gov
www.burningserver.net/rosinski


ABSTRACT:
GPTL is an open source profiling library that reports a variety of performance statistics. Target codes may be parallel via threads and/or MPI. The code regions to be profiled can be hand-specified by the user, or GPTL can define them automatically at function-level granularity if the target application is built with an appropriate compiler flag. Output is presented in a hierarchical fashion that preserves parent-child relationships of the profiled regions. If the PAPI library is available, GPTL utilizes it to gather hardware performance counter data. GPTL built with PAPI support is installed on the XT4 and XT5 machines at ORNL.

1 Introduction

The process of optimizing performance of multi-threaded, multi-tasked, computationally intensive applications can benefit greatly from simple, easy-to-use tools that highlight hot spots and performance bottlenecks. Originally developed as part of the Community Atmosphere Model (CAM) at NCAR [1], the General Purpose Timing Library (GPTL) at its most basic provides the ability to manually instrument Fortran, C, or C++ application codes by defining code regions with a sequence of "GPTLstart" and "GPTLstop" calls. Timing statistics for these regions are then reported at the end of the run. By default, statistics on memory usage and estimates of the overhead incurred by the library itself are also output. The set of regions can be nested, with the output presented in a hierarchical "call tree" form. When a given region has multiple parents, normally the parent which calls it the most times is listed in the call tree. Further information about the other parents is then presented after the call tree is complete.

In addition to this manual instrumentation, regions can also be defined automatically at the function level by using certain flags provided by the GNU, PGI, and Pathscale compilers. With the GNU and Pathscale compilers this is done with the flag "-finstrument-functions". The PGI equivalent is "-Minstrument:functions", available as of PGI release 8.0.2. Manually defined regions (i.e. added "GPTLstart" and "GPTLstop" calls) can seamlessly coexist with auto-instrumentation in the same code.

If the PAPI library [2] is available, GPTL can automatically access all available PAPI counters. The required PAPI function calls are made automatically by GPTL. A set of PAPI-based derived events is also available; examples include computational intensity and instructions per cycle.

Example output from auto-instrumenting the HPCC benchmark suite is shown here:

Stats for thread 0:
                               Called Recurse Wallclock      max      min    FP_OPS e6_/_sec       CI
  total                             1     -      64.021   64.021   64.021  3.50e+08     5.47 7.20e-02
    HPCC_Init                      11    10       0.157    0.157    0.000     95799     0.61 8.90e-02
*     HPL_pdinfo                  120   118       0.019    0.018    0.000     96996     4.99 8.56e-02
*       HPL_all_reduce              7     -       0.043    0.036    0.000       448     0.01 1.03e-02
*         HPL_broadcast            21     -       0.041    0.036    0.000       126     0.00 6.72e-03
        HPL_pdlamch                 2     -       0.004    0.004    0.000     94248    21.21 1.13e-01
*       HPL_fprintf               240   120       0.001    0.000    0.000      1200     0.93 6.67e-03
      HPCC_InputFileInit           41    40       0.001    0.001    0.000       194     0.27 8.45e-03
        ReadInts                    2     -       0.000    0.000    0.000        12     3.00 1.61e-02
    PTRANS                         21    20      22.667   22.667    0.000  4.19e+07     1.85 3.19e-02
      MaxMem                        5     4       0.000    0.000    0.000       796     2.70 1.79e-02
*     iceil_                      132     -       0.000    0.000    0.000       792     2.88 1.75e-02
*     ilcm_                        14     -       0.000    0.000    0.000        84     2.71 1.71e-02
      param_dump                   18    12       0.000    0.000    0.000        84     0.82 7.05e-03
      Cblacs_get                    5     -       0.000    0.000    0.000        30     1.43 1.67e-02
      Cblacs_gridmap               35    30       0.005    0.001    0.000       225     0.05 1.79e-03
*       Cblacs_pinfo                7     1       0.000    0.000    0.000        40     3.08 1.54e-02
*     Cblacs_gridinfo              60    50       0.000    0.000    0.000       260     2.28 2.10e-02
      Cigsum2d                      5     -       0.088    0.047    0.000       165     0.00 6.37e-03
      pdmatgen                     20     -      21.497    1.213    0.942  4.00e+07     1.86 3.08e-02
*       numroc_                    96     -       0.000    0.000    0.000       576     2.87 1.69e-02
*       setran_                    25     -       0.000    0.000    0.000       150     2.94 1.72e-02
*       pdrand                3.7e+06 2e+06      15.509    0.041    0.000  1.72e+07     1.11 2.24e-02
        xjumpm_                 57506 57326       0.219    0.030    0.000    230384     1.05 2.66e-02
        jumpit_                 60180 40120       0.214    0.021    0.000    280840     1.32 2.18e-02
      slboot_                       5     -       0.000    0.000    0.000        30     1.30 1.01e-02
      Cblacs_barrier               10     5       0.481    0.167    0.000        50     0.00 3.26e-03
      sltimer_                     10     -       0.000    0.000    0.000       614     3.05 1.90e-02
*       dwalltime00                15     -       0.000    0.000    0.000       150     2.54 2.57e-02
*       dcputime00                 15     -       0.000    0.000    0.000       373     3.06 1.91e-02
*         HPL_ptimer_cputime       17     -       0.000    0.000    0.000       170     2.66 2.29e-02
      pdtrans                      14     9       0.124    0.045    0.000    573505     4.61 1.36e-01
        Cblacs_dSendrecv           12     8       0.115    0.042    0.000        56     0.00 2.24e-03
      pdmatcmp                      5     -       0.448    0.295    0.003  1.29e+06     2.87 2.94e-01
*       HPL_daxpy                2596     -       0.008    0.000    0.000  1.34e+06   177.06 4.40e-01
*       HPL_idamax               2966     -       0.007    0.000    0.000    767291   104.75 4.15e-01
  ...
Function names on the left of the output are indented to indicate their parent and depth in the call tree. An asterisk next to an entry means it has more than one parent. The remaining columns show the number of invocations, the number of recursive invocations, wallclock timing statistics, and PAPI-based information. In this example, HPL_daxpy produced 1.34e6 floating point operations at a rate of 177.06 MFlops/sec, and had a computational intensity (floating point operations per memory reference) of 0.44.

GPTL is thread-safe. When regions are invoked by multiple threads, per-thread statistics are automatically generated. The results are presented both in the hierarchical "call tree" format mentioned above, and also summarized across threads on a per-region basis.

2 Basic usage

The following simple Fortran code illustrates basic usage of the GPTL library. It is a manually-instrumented code, with an OpenMP section.

program omptest
  implicit none
  include 'gptl.inc'                            ! Fortran GPTL include file

  integer :: ret, iter
  integer, parameter :: nompiter = 2            ! Number of OMP threads

  ret = gptlsetoption (gptlabort_on_error, 1)   ! Abort on GPTL error
  ret = gptlsetoption (gptloverhead, 0)         ! Turn off overhead estimate
  ret = gptlsetutr (gptlnanotime)               ! Set underlying timer
  ret = gptlinitialize ()                       ! Initialize GPTL
  ret = gptlstart ('total')                     ! Start a timer

!$OMP PARALLEL DO PRIVATE (iter)                ! Threaded loop
  do iter=1,nompiter
    ret = gptlstart ('A')                       ! Start a timer
    ret = gptlstart ('B')                       ! Start another timer
    ret = gptlstart ('C')
    call sleep (iter)                           ! Sleep for "iter" seconds
    ret = gptlstop ('C')                        ! Stop a timer
    ret = gptlstart ('CC')
    ret = gptlstop ('CC')
    ret = gptlstop ('A')
    ret = gptlstop ('B')
  end do

  ret = gptlstop ('total')
  ret = gptlpr (0)                              ! Print timer stats
  ret = gptlfinalize ()                         ! Clean up
end program omptest

All calls to gptlsetoption() and gptlsetutr() are optional; they change the default behavior of the library. An arbitrary number of these calls can occur prior to initializing GPTL. In this example, the first gptlsetoption() call instructs GPTL to abort whenever an error occurs; the default behavior is to return an error code. The second call turns off printing of the statistics which estimate the overhead incurred by the library itself. The call to gptlsetutr() selects the underlying routine used to gather wallclock timing statistics. The default is gettimeofday(). This routine is ubiquitous, but generally does not have very good granularity and can be expensive to call. On x86-based systems, the time stamp counter can be read directly via the rdtsc instruction, which has far better granularity and overhead characteristics than gettimeofday(). Passing gptlnanotime to gptlsetutr() instructs GPTL to use rdtsc to gather all wallclock timing statistics.
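
For reference, reading the time stamp counter on x86 amounts to a single instruction plus a 64-bit combine. The following is an illustrative sketch using GCC-style inline assembly, not GPTL's actual implementation, and the function name is hypothetical:

#include <stdint.h>

/* Read the x86 time stamp counter via rdtsc.  The result is in clock
   cycles and must be divided by the processor frequency to obtain
   seconds; GPTL's gptlnanotime option relies on this mechanism. */
static inline uint64_t read_tsc (void)
{
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}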

After all gptlsetoption() and gptlsetutr() calls, one call to gptlinitialize() is required. This initializes the library, in particular the thread-safe aspects. Currently, for OpenMP threaded codes the value of OMP_NUM_THREADS is used to initialize threaded parts of GPTL.

After gptlinitialize() has been called, an arbitrary sequence of gptlstart() and gptlstop() pairs can be coded by the user. In the above example, some of these occur within a threaded loop. Prior to program termination, a call to gptlpr() causes the library to print the current state of all regions that have been defined by these pairs of calls to gptlstart() and gptlstop(). In an MPI code, one would normally pass the MPI rank of the calling process to gptlpr(). The output file will be named timing.number, where "number" is the MPI rank. This provides a simple way to avoid namespace conflicts (a short C sketch of this usage follows the example output below). Here's the output from running the simple example code from above:

Stats for thread 0:
                Called Recurse Wallclock    max     min
  total              1     -       2.000   2.000   2.000
    A                1     -       1.000   1.000   1.000
      B              1     -       1.000   1.000   1.000
        C            1     -       1.000   1.000   1.000
        CC           1     -       0.000   0.000   0.000
Total calls           = 5
Total recursive calls = 0

Stats for thread 1:
                Called Recurse Wallclock    max     min
    A                1     -       2.000   2.000   2.000
      B              1     -       2.000   2.000   2.000
        C            1     -       2.000   2.000   2.000
        CC           1     -       0.000   0.000   0.000
Total calls           = 4
Total recursive calls = 0

Same stats sorted by timer for threaded regions:
Thd          Called Recurse Wallclock    max     min
000 A             1     -       1.000   1.000   1.000
001 A             1     -       2.000   2.000   2.000
SUM A             2     -       3.000   2.000   1.000
000 B             1     -       1.000   1.000   1.000
001 B             1     -       2.000   2.000   2.000
SUM B             2     -       3.000   2.000   1.000
000 C             1     -       1.000   1.000   1.000
001 C             1     -       2.000   2.000   2.000
SUM C             2     -       3.000   2.000   1.000
000 CC            1     -       0.000   0.000   0.000
001 CC            1     -       0.000   0.000   0.000
SUM CC            2     -       0.000   0.000   0.000
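
As noted above, in an MPI code one would normally pass the rank of the calling process to the print routine. A minimal sketch of that usage with GPTL's C bindings follows; the region name "total" is arbitrary:

#include <mpi.h>
#include <gptl.h>

/* Minimal sketch: one GPTL output file per MPI rank.
   Rank N writes its statistics to a file named timing.N. */
int main (int argc, char **argv)
{
  int ret, rank;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  ret = GPTLinitialize ();         /* initialize GPTL */
  ret = GPTLstart ("total");       /* time the whole run */
  /* ... application work ... */
  ret = GPTLstop ("total");

  ret = GPTLpr (rank);             /* output goes to timing.<rank> */
  ret = GPTLfinalize ();           /* clean up */

  MPI_Finalize ();
  return 0;
}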

2.1 Explanation of results

Region names are listed on the far left. A "region" is defined in the application by calling GPTLstart(), then GPTLstop() for the same input (character string) argument. Indenting of the names preserves parent-child relationships between the regions. In the example, we see that region "A" was contained in "total", "B" contained in "A", and regions "C" and "CC" both contained in "B".

Reading across the output from left to right, the next column is labelled "Called". This is the number of times the region was invoked. If any regions were called recursively, that information is printed next. In this case there were no recursive calls, so just a "-" is printed. Total wallclock time for each region is printed next, followed by the max and min values for any single invocation. In this simple example each region was called only once, so "Wallclock", "max", and "min" are all the same.

Since this is a threaded code run with OMP_NUM_THREADS=2, statistics for the second thread are also printed. It starts at "Stats for thread 1:" The output shows that thread 1 participated in the computations for regions "A", "B", "C", and "CC", but not "total". This is reflected in the code itself, since only the master thread was active when start and stop calls were made for region "total".

After the per-thread statistics section, the same information is repeated, sorted by region name if more than one thread was active. This section is delimited by the string "Same stats sorted by timer for threaded regions:". This region presentation order makes it easier to inspect for load balance across threads. The leftmost column is thread number, and the region names are not indented. A sum across threads for each region is also printed, and labeled "SUM".

3 PAPI interface and PAPI-based derived counters

If the PAPI library is installed (http://icl.cs.utk.edu/papi), GPTL also provides a convenient mechanism to access all available PAPI events. An event is enabled by calling gptlsetoption() with the appropriate counter. For example, to measure floating point operations, one would code the following:

ret = gptlsetoption (PAPI_FPOPS, 1)
The second argument "1" passed to gptlsetoption() in this example means to enable the option. Passing a zero would mean to disable the option.

The GPTL library handles details of invoking the PAPI library properly. This includes things such as initializing PAPI, calling the appropriate thread initialization routines, setting up an event list, and of course reading the counters.

In addition to PAPI preset and native events, GPTL defines derived events which are based on PAPI counters. The file gptl.h from the GPTL distribution contains a list of the available derived events. An example is computational intensity, defined as floating point operations per memory reference. This event is enabled just like any other PAPI event:

ret = gptlsetoption (GPTL_CI, 1)
Of course these events can only be enabled if the PAPI counters they require are available on the target architecture. On the XT4 and XT5 machines at ORNL, GPTL_CI is defined by PAPI_FP_OPS / PAPI_L1_DCA.
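
Putting this together, a minimal C setup that counts floating point operations and the computational intensity derived event might look like the sketch below. It assumes PAPI is installed and the required counters exist on the target machine; the region name "work" is arbitrary:

#include <gptl.h>
#include <papi.h>

int main ()
{
  int ret;

  ret = GPTLsetoption (PAPI_FP_OPS, 1);  /* count floating point operations */
  ret = GPTLsetoption (GPTL_CI, 1);      /* derived event: computational intensity */
  ret = GPTLinitialize ();               /* must follow all setoption calls */

  ret = GPTLstart ("work");
  /* ... computation to be measured ... */
  ret = GPTLstop ("work");

  ret = GPTLpr (0);                      /* write results to timing.0 */
  ret = GPTLfinalize ();
  return 0;
}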

4 Auto-profiling

Hand-instrumenting many regions of large application codes can be tedious. And inserting ifdefs around "start" and "stop" calls to enable or disable timing of regions can unduly complicate the application source code. An alternative is to use the auto-profiling hooks provided by the GNU, Pathscale, and PGI compilers. The compiler flag which enables auto-profiling with the GNU and Pathscale compilers is -finstrument-functions. The equivalent is attained with PGI compiler releases 8.0.2 and later by compiling the target application code with -Minstrument:functions.

Using these hooks, one can automatically generate GPTL start/stop pairs at function entry and exit points. In this way, only the main program needs to contain GPTL function calls. All other application routines will be automatically instrumented. In addition to providing function-level profiling, this is a nice way to generate a dynamic call tree for an entire application.

4.1 Converting addresses to names

When the appropriate auto-profiling compiler flag is set, the compiler generates calls to __cyg_profile_func_enter (void *this_fn, void *call_site) at function entry, and __cyg_profile_func_exit (void *this_fn, void *call_site) at function exit. GPTL uses these hooks to start and stop timers (and PAPI counters if enabled) for these regions. When the regions are printed, the names will be addresses rather than human-readable function names. A post-processing Perl script (hex2name.pl, used in the example below) is provided to do the translation.
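
Conceptually, the hooks simply forward each function's address to GPTL's address-based start/stop routines (the "_instr" routines referred to in Section 5). The sketch below illustrates the idea; it is not GPTL's exact source:

#include <gptl.h>

/* Conceptual sketch of the compiler-generated instrumentation hooks.
   GPTL supplies these functions in its library; here they just forward
   the function address to GPTL's address-based start/stop entry points. */
void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
  (void) call_site;                  /* caller address: unused here */
  (void) GPTLstart_instr (this_fn);  /* start the timer keyed by function address */
}

void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
  (void) call_site;
  (void) GPTLstop_instr (this_fn);   /* stop the timer for this function */
}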

The following example shows how to enable auto-profiling with a simple C code. It uses PAPI to count total instructions, and the derived event instructions per cycle is also enabled. Note that function B has multiple parents, and GPTL reports the multiple parent information in the output produced by the call to GPTLpr_file().

main.c:

#include <gptl.h>
#include <papi.h>

int main ()
{
  void do_work (void);
  int i, ret;

  ret = GPTLsetoption (GPTL_IPC, 1);      // Count instructions per cycle
  ret = GPTLsetoption (PAPI_TOT_INS, 1);  // Print total instructions
  ret = GPTLsetoption (GPTLoverhead, 0);  // Don't print overhead estimate

  ret = GPTLinitialize ();                // Initialize GPTL
  ret = GPTLstart ("main");               // Start a manual timer
  do_work ();                             // Do some work
  ret = GPTLstop ("main");                // Stop the manual timer
  ret = GPTLpr_file ("outfile");          // Write output to "outfile"
}
subs.c:
#include <unistd.h>

extern void A(void);
extern void AA(void);
extern void B(void);

void do_work ()
{
  A ();
  AA ();
  B ();
}

void A ()
{
  B ();
}

void AA ()
{
}

void B ()
{
  sleep (1);
}
Compile all but main.c with auto-instrumentation, then link and run. Useful auto-instrumentation of the main program is not possible, because the call to GPTLinitialize() must be made manually and needs to precede all calls to GPTLstart() and GPTLstop().
% gcc -c main.c
% gcc -finstrument-functions subs.c main.o -lgptl -lpapi
% ./a.out
Next, convert the auto-instrumented output to human-readable form:
% hex2name.pl a.out outfile > outfile.converted
Output file outfile.converted looks like this:
Stats for thread 0:
              Called Recurse Wallclock    max     min      IPC  TOT_INS e6_/_sec
  main             1     -       2.000   2.000   2.000 2.81e-01   18060     0.01
    do_work        1     -       2.000   2.000   2.000 2.61e-01   12547     0.01
      A            1     -       1.000   1.000   1.000 3.01e-01    4958     0.00
*       B          2     -       2.000   1.000   1.000 1.09e-01    2812     0.00
      AA           1     -       0.000   0.000   0.000 7.77e-01     488   244.00
Total calls           = 6
Total recursive calls = 0

Multiple parent info (if any) for thread 0:
Columns are count and name for the listed child
Rows are each parent, with their common child being the last entry, which is indented
Count next to each parent is the number of times it called the child
Count next to child is total number of times it was called by the listed parents

       1 A
       1 do_work
       2   B

4.1.1 Explanation of the above output

PAPI event "Total instructions executed" (PAPI_TOT_INS) and derived event "Instructions per cycle" (GPTL_IPC) were enabled. To compute instructions per cycle, GPTL made the PAPI library call to count total cycles (PAPI_TOT_CYC) in addition to the already-enabled event PAPI_TOT_INS. When GPTLpr_file() was called, it computed:
      GPTL_IPC = PAPI_TOT_INS / PAPI_TOT_CYC;

Note the asterisk in front of region "B". This indicates that region "B" had multiple parents. It is presented as a child of region "A" because that is the first region that invoked it. Information about other parents is presented after the main call tree. It shows that region "B" had two parents, "A", and "do_work". Each parent invoked "B" once, for a total of 2 calls.

5 Internals

The basic data structure internal to GPTL is a linked list. Contents of each entry in the linked list are defined below. Most of the entries should be self-explanatory. GPTL needs to keep track of such quantities as "current accumulated wallclock time", as well as "last wallclock timestamp", values of PAPI counters, number of calls to the timer, etc.

typedef struct TIMER {
  char name[MAX_CHARS+1];      /* timer name (user input) */
#ifdef HAVE_PAPI
  Papistats aux;               /* PAPI stats */
#endif
  Wallstats wall;              /* wallclock stats */
  Cpustats cpu;                /* cpu stats */
  unsigned long count;         /* number of start/stop calls */
  unsigned long nrecurse;      /* number of recursive start/stop calls */
  void *address;               /* address of timer: used only by _instr routines */
  struct TIMER *next;          /* next timer in linked list */
  struct TIMER **parent;       /* array of parents */
  struct TIMER **children;     /* array of children */
  int *parent_count;           /* array of call counts, one for each parent */
  unsigned int recurselvl;     /* recursion level */
  unsigned int nchildren;      /* number of children */
  unsigned int nparent;        /* number of parents */
  unsigned int norphan;        /* number of times this timer was an orphan */
  int num_desc;                /* number of descendants */
  bool onflg;                  /* timer currently on or off */
} Timer;

Every time a new timer is started (i.e. the first GPTLstart() call for a given region name), a new entry is added to the end of the list. This approach allows an arbitrary number of user-defined (or compiler-defined) timers. To determine whether a timer is "new", a simple hash function generates an index into an array of pointers to existing timers. This dramatically improves the performance of GPTL itself by avoiding a traversal of the linked list on every call to GPTLstart() or GPTLstop().

Thread-safety is attained by maintaining separate linked lists and hash tables for each thread. Separate per-thread information is then reported when timing information is written to the output file.

Hash collisions are handled by making each entry in the hash table an array. Entries which hash to the same value share a multi-element, dynamically allocated array of pointers to the timers with that hash value. This array is searched linearly until the correct entry is found, so GPTL can handle an arbitrary number of collisions, at the cost of some extra searching whenever a collision occurs.
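
A minimal sketch of this lookup scheme follows. The table size, hash function, and Hashentry type are simplified stand-ins rather than GPTL's actual internals; Timer refers to the structure shown above:

#include <string.h>

#define TABLE_SIZE 1024        /* hypothetical table size */

/* Each table slot holds a dynamically grown array of pointers to all
   timers whose names hash to that slot. */
typedef struct {
  int nument;                  /* number of timers hashing to this slot */
  Timer **entries;             /* pointers to those timers */
} Hashentry;

static Hashentry table[TABLE_SIZE];

/* Simple additive hash of the region name (illustrative only). */
static unsigned int gen_hash (const char *name)
{
  unsigned int sum = 0;
  const char *c;
  for (c = name; *c; ++c)
    sum += (unsigned char) *c;
  return sum % TABLE_SIZE;
}

/* Return the existing timer named "name", or NULL if it is new.
   Collisions are resolved by a linear search of the slot's array. */
static Timer *getentry (const char *name)
{
  int i;
  Hashentry *e = &table[gen_hash (name)];
  for (i = 0; i < e->nument; ++i)
    if (strcmp (name, e->entries[i]->name) == 0)
      return e->entries[i];
  return NULL;
}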

GPTL maintains an internally defined "call stack" to keep track of the current active call tree whenever a new region (i.e. "timer") is entered. This provides an efficient means to record the region's parent in the call tree. A pointer to that parent is defined (or updated) every time the region is entered. This allows GPTL to learn which regions have multiple parents, who the parents are, and how many times the region is invoked by each parent. When timer contents are printed, the parent information embedded in each timer allows a tree data structure to be defined, with pointers from parents to each of their children. This tree is then traversed from the root to all the leaves (children) in order to print the contents of each timer. Depth in the call tree is recorded by indenting the region at print time to depict who the parent is.
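
A rough illustration of this bookkeeping at region entry and exit is sketched below. The per-thread stack variables and the update_parent_info() helper are hypothetical names used only for this sketch; Timer is the structure defined above:

#define MAX_STACK 128                /* hypothetical maximum nesting depth */

static Timer *callstack[MAX_STACK];  /* active timers for this thread */
static int stackidx = -1;            /* index of the innermost active timer */

/* Hypothetical helper: record "parent" as a parent of "t".  In GPTL this is
   where t->parent[], parent->children[], and the matching parent_count
   entry would be created or updated. */
static void update_parent_info (Timer *t, Timer *parent)
{
  (void) t;
  (void) parent;
}

static void enter_region (Timer *t)
{
  if (stackidx >= 0)                              /* timer on top of the stack */
    update_parent_info (t, callstack[stackidx]);  /* is the parent of t */
  callstack[++stackidx] = t;                      /* t is now the innermost region */
}

static void exit_region (Timer *t)
{
  (void) t;
  --stackidx;                                     /* pop: the parent becomes innermost again */
}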

When there are multiple parents, the default behavior is to print the timer only once, with its parent listed as the one who calls it the most frequently. Optionally, the user can request that regions with multiple parents be listed for each parent. Recursive effects can cause this option to generate lots of output when regions with many parents also have many children. Other printing options allow regions to be listed only once according to who their first parent is, or who their last parent is.

6 Future Work

High on the list of desired library enhancements is an option to output region information in XML format instead of only ASCII text. This would allow one to leverage the power of XML readers to expand or contract nested regions with the click of a mouse button. Such an ability could be very powerful for large application codes, particularly when navigating the call tree generated with auto-profiling enabled.

Some existing performance analysis tools (e.g. FPMPI) already have the ability to trap MPI calls in order to gather statistics related to message traffic (e.g. sync time, average message size, transfer time). This would be a relatively simple and powerful addition to GPTL, where such statistics could be embedded within already-defined regions.
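
For instance, a wrapper built on the standard PMPI profiling interface could charge each MPI call to a GPTL region and accumulate message statistics at the same time. The sketch below (not part of GPTL today) illustrates the idea for MPI_Send:

#include <mpi.h>
#include <gptl.h>

/* Hypothetical sketch: intercept MPI_Send via the PMPI profiling interface
   and charge its time to a GPTL region.  Message-size and transfer-time
   statistics could be accumulated here as well. */
int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)
{
  int ret;

  (void) GPTLstart ("MPI_Send");
  ret = PMPI_Send (buf, count, datatype, dest, tag, comm);
  (void) GPTLstop ("MPI_Send");
  return ret;
}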

Another desirable enhancement to the auto-profiling capability is to require no modifications at all to the application source in order to gather timing statistics. Phil Mucci has done this with PAPIEX [3]. Applying his approach to GPTL should be straightforward.


References

[1] W.D. Collins, et al. Description of the NCAR Community Atmosphere Model. http://www.ccsm.ucar.edu/models/atm-cam/docs/description/node1.html
[2] Performance Application Programming Interface. http://icl.cs.utk.edu/papi/
[3] Phil Mucci. PAPIEX Home Page. http://icl.cs.utk.edu/~mucci/papiex/