Extrae is a dynamic instrumentation package to trace programs compiled and run with the shared memory model (like OpenMP and pthreads), the message passing (MPI) programming model, or both (different MPI processes using OpenMP or pthreads within each MPI process). Extrae generates trace files that can later be visualized with Paraver.
The package is distributed in compressed tar format (e.g., extrae.tar.gz). To unpack it, execute the following command from the desired target directory:
gunzip -c extrae.tar.gz | tar -xvf -
The unpacking process will create several directories inside the current directory (see table 1.1).
There are some files within Extrae that contain references to libraries given at configure time. Because of this, you need to adapt the installation to your system. Extrae provides an automatic mechanism that post-configures the package for this purpose. Once you have installed Extrae, just set the EXTRAE_HOME environment variable to the directory where you have untarred it and execute ${EXTRAE_HOME}/bin/extrae-post-installation-upgrade.sh. This script will guide you through some questions about the location of several libraries needed by Extrae. The script shows the current value for the library directories and gives the user the chance to change them. If a library was unused at configure time, its current value will be an empty string.
Several examples are included in the installation package. These examples are installed in ${EXTRAE_HOME}/share/example and cover different application types (serial/MPI/OpenMP/CUDA/etc.). We suggest users look at them to get an idea of how to instrument their applications.
Once the package has been unpacked, set the EXTRAE_HOME environment variable to the directory where the package was installed. Use the export or setenv command to set it, depending on the shell you use. If you use a sh-based shell (like sh, bash, ksh, zsh, ...), issue this command
export EXTRAE_HOME=dir
however, if you use a csh-based shell (like csh, tcsh), execute the following command
setenv EXTRAE_HOME dir
where dir refers to the directory where Extrae was installed. Henceforth, all references to the usage of environment variables will follow the sh format unless specified otherwise.
Extrae is offered in two different flavors: as a DynInst-based application, or as a stand-alone application. DynInst is a dynamic instrumentation library that allows the injection of code into a running application without the need to recompile the target application. If the DynInst instrumentation library is not installed, Extrae also offers different mechanisms to trace applications.
Extrae needs some environment variables to be setup on each session. Issuing the command
source ${EXTRAE_HOME}/etc/extrae.sh
on a sh-based shell, or
source ${EXTRAE_HOME}/etc/extrae.csh
on a csh-based shell will do the work. Then copy the default XML configuration file into the working directory by executing
cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .
If needed, set the application environment variables as usual (like OMP_NUM_THREADS, for example), and finally launch the application using the ${EXTRAE_HOME}/bin/extrae instrumenter like:
${EXTRAE_HOME}/bin/extrae -config extrae.xml <program>
where <program> is the application binary.
Extrae needs some environment variables to be setup on each session. Issuing the command
source ${EXTRAE_HOME}/etc/extrae.sh
on a sh-based shell, or
source ${EXTRAE_HOME}/etc/extrae.csh
on a csh-based shell will do the work. Then copy the default XML configuration file into the working directory by executing
cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .
and export the EXTRAE_CONFIG_FILE as
export EXTRAE_CONFIG_FILE=extrae.xml
If needed, set the application environment variables as usual (like OMP_NUM_THREADS, for example). Just before executing the target application, issue the following command:
export LD_PRELOAD=${EXTRAE_HOME}/lib/<lib>
where <lib> is one of those listed in Table 1.2.
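For instance, a complete session for an MPI application could look like the following sketch. The library name libmpitrace.so (the MPI instrumentation library used in the examples of chapter 8) and the mpirun launcher are assumptions here; pick the library from Table 1.2 that matches your programming model and use your usual launcher:

source ${EXTRAE_HOME}/etc/extrae.sh
cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .
export EXTRAE_CONFIG_FILE=extrae.xml
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
mpirun -np 4 ./mpi-app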
Once the intermediate trace files (*.mpit files) have been created, they have to be merged (using the mpi2prv command) in order to generate the final Paraver trace file. Execute the following command to proceed with the merge:
${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -o output.prv
The result of the merge process is a Paraver tracefile called output.prv. If the -o option is not given, the resulting tracefile is called EXTRAE_Paraver_Trace.prv.
Extrae is a dynamic instrumentation package to trace programs compiled and run with the shared memory model (like OpenMP and pthreads), the message passing (MPI) programming model, or both (different MPI processes using OpenMP or pthreads within each MPI process). Extrae generates trace files that can be visualized with Paraver.
Extrae is currently available for different platforms and operating systems: IBM PowerPC running Linux or AIX, and x86 and x86-64 running Linux. It has also been ported to OpenSolaris and FreeBSD.
The combined use of Extrae and Paraver offers an enormous analysis potential, both qualitative and quantitative. With these tools the actual performance bottlenecks of parallel applications can be identified. The microscopic view of the program behavior that the tools provide is very useful to optimize the parallel program performance.
This document provides the basic knowledge needed to use the Extrae tool. Chapter 3 explains how the package can be configured and installed. Chapter 8 explains how to monitor an application to obtain its trace file. At the end of this document there are appendices that include a Frequently Asked Questions section and a list of routines instrumented by the package.
Paraver is a flexible parallel program visualization and analysis tool based on an easy-to-use Motif GUI. Paraver was developed in response to the need to have a qualitative global perception of the application behavior by visual inspection, and then to be able to focus on a detailed quantitative analysis of the problems. Paraver provides a large amount of information useful for deciding where to invest the programming effort to optimize an application.
Expressive power, flexibility and the capability of efficiently handling large traces are key features addressed in the design of Paraver. The clear and modular structure of Paraver plays a significant role towards achieving these targets.
Some Paraver features are the support for:
One of the main features of Paraver is the flexibility to represent traces coming from different environments. Traces are composed of state records, events and communications, each with an associated timestamp. These three elements can be used to build traces that capture the behavior over time of very different kinds of systems. Paraver includes, either in its own distribution or as additional packages, the following instrumentation modules:
The Paraver distribution can be found at URL:
Paraver binaries are available for Linux/x86, Linux/x86-64, Linux/ia64 and Windows.
In the Documentation Tool section of the aforementioned URL you can find the Paraver Reference Manual and Paraver Tutorial in addition to the documentation for other instrumentation packages.
Extrae and Paraver tools e-mail support is tools@bsc.es.
There are many options that can be applied at configuration time for the instrumentation package. We point out here some of the relevant options, sorted alphabetically. To get the whole list run configure --help. Options can be enabled or disabled. To enable them use --enable-X or --with-X= (depending on which option is available); to disable them use --disable-X or --without-X.
To build the instrumentation package, just issue make after the configuration.
To install the instrumentation package in the directory chosen at the configure step (through the --prefix option), issue make install.
The Extrae package contains some consistency checks. The aim of such checks is to determine whether a functionality is operative in the target (installation) environment and/or check whether the development of Extrae has introduced any misbehavior. To run the checks, just issue make check after the installation. Please notice that the checks are meant to be run on the machine where the configure script was run, thus the results of the checks on machines whose back-end nodes differ from the front-end nodes (like BG/* systems) are not representative at all.
All commands given here are examples of configuring and installing the package; you may need to tune them (i.e., choose the appropriate directories for packages and so on). These examples assume that you are using a sh/bash shell; you must adapt them if you use other shells (like csh/tcsh).
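As a minimal sketch before the platform-specific examples (all paths here are hypothetical placeholders; the options themselves are described in the list above), a bare-bones installation could proceed as:

./configure --prefix=$HOME/apps/extrae \
            --with-mpi=/usr/local/mpi \
            --with-papi=/usr/local/papi \
            --with-unwind=/usr/local/libunwind
make
make install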
Before issuing the configure command, the following modules were loaded:
Configuration command:
./configure --with-papi=/opt/cray/papi/5.4.1.1 \
  --with-mpi=/opt/cray/mpt/7.2.2/gni/mpich2-gnu/48 \
  --with-unwind=/apps/daint/5.2.UP02/easybuild/software/libunwind/1.1-CrayGNU-5.2.40 \
  --with-cuda=/opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1 \
  --enable-sampling --without-dyninst --with-binary-type=64 \
  CC=gcc CXX=g++ MPICC=cc
Build and installation commands:
make
make install
Configuration command:
./configure --prefix=/homec/jzam11/jzam1128/aplic/extrae/2.2.0 \
  --with-papi=/homec/jzam11/jzam1128/aplic/papi/4.1.2.1 \
  --with-bfd=/bgsys/local/gcc/gnu-linux_4.3.2/powerpc-linux-gnu/powerpc-bgp-linux \
  --with-liberty=/bgsys/local/gcc/gnu-linux_4.3.2/powerpc-bgp-linux \
  --with-mpi=/bgsys/drivers/ppcfloor/comm --without-unwind --without-dyninst
Build and installation commands:
make
make install
To enable parsing the XML configuration file, libxml2 must be installed. As of the time of writing this user guide, we have only been able to install the static version of the library on a BG/Q machine, so take this into consideration if you install libxml2 on the system. Similarly, the binutils package (responsible for translating application addresses into source code locations) available on the system may not be properly installed, and we suggest installing binutils from source using the BG/Q cross-compiler. Regarding the cross-compilers, we have found that using the IBM XL compilers may require using the XL libraries when generating the final application binary with Extrae, so we suggest using the GNU cross-compilers (/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-*).
If you want to add libxml2 and binutils support into Extrae, your configuration command may resemble:
./configure --prefix=/homec/jzam11/jzam1128/aplic/juqueen/extrae/2.2.1 \
  --with-mpi=/bgsys/drivers/ppcfloor/comm/gcc --without-unwind --without-dyninst \
  --disable-openmp --disable-pthread \
  --with-libz=/bgsys/local/zlib/v1.2.5 \
  --with-papi=/usr/local/UNITE/packages/papi/5.0.1 \
  --with-xml-prefix=/homec/jzam11/jzam1128/aplic/juqueen/libxml2-gcc \
  --with-binutils=/homec/jzam11/jzam1128/aplic/juqueen/binutils-gcc \
  --enable-merge-in-trace
Otherwise, if you do not want to add support for the libxml2 library, your configuration may look like this:
./configure --prefix=/homec/jzam11/jzam1128/aplic/juqueen/extrae/2.2.1 \
  --with-mpi=/bgsys/drivers/ppcfloor/comm/gcc --without-unwind --without-dyninst \
  --disable-openmp --disable-pthread \
  --with-libz=/bgsys/local/zlib/v1.2.5 \
  --with-papi=/usr/local/UNITE/packages/papi/5.0.1 --disable-xml
In any situation, the build and installation commands are:
make
make install
Some extensions of Extrae (nanos, SMPss and OpenMP) do not work properly on AIX. In addition, if using IBM MPI (aka POE), make will complain when generating the parallel merge if the main compiler is not xlc/xlC. So, you can either change the compiler or disable the parallel merge at the compile step. Also, the ar command can complain if 64-bit binaries are generated. It is a good idea to run make with OBJECT_MODE=64 set to avoid this.
Configuration command:
CC=xlc CXX=xlC ./configure --prefix=PREFIX --disable-nanos --disable-smpss --disable-openmp --with-binary-type=32 --without-unwind --enable-pmapi --without-dyninst --with-mpi=/usr/lpp/ppe.poe
Build and installation commands:
make
make install
Configuration command:
./configure --prefix=PREFIX --disable-nanos --disable-smpss --disable-openmp --disable-parallel-merge --with-binary-type=64 --without-unwind --enable-pmapi --without-dyninst --with-mpi=/usr/lpp/ppe.poe
Build and installation commands:
OBJECT_MODE=64 make
make install
Configuration command:
./configure --prefix=PREFIX --with-mpi=/home/harald/aplic/mpich/1.2.7 --with-papi=/usr/local/papi --enable-openmp --without-dyninst --without-unwind
Build and installation commands:
make
make install
Configuration command:
./configure --prefix=PREFIX --with-mpi=/opt/osshpc/mpich-mx --with-papi=/gpfs/apps/PAPI/3.6.2-970mp --with-binary-type=32 --with-unwind=$HOME/aplic/unwind/1.0.1/32 --with-elf=/usr --with-dwarf=/usr --with-dyninst=$HOME/aplic/dyninst/7.0.1/32
Build and installation commands:
make
make install
Configuration command:
./configure --prefix=PREFIX --with-mpi=/opt/osshpc/mpich-mx --with-papi=/gpfs/apps/PAPI/3.6.2-970mp --with-binary-type=64 --with-unwind=$HOME/aplic/unwind/1.0.1/64 --with-elf=/usr --with-dwarf=/usr --with-dyninst=$HOME/aplic/dyninst/7.0.1/64
Build and installation commands:
make
make install
Configuration command:
./configure --prefix=PREFIX --with-mpi=/home/harald/aplic/openmpi/1.3.1 \
  --with-dyninst=/home/harald/dyninst/7.0.1 --with-dwarf=/usr \
  --with-elf=/usr --with-unwind=/home/harald/aplic/unwind/1.0.1 \
  --without-papi
Build and installation commands:
make
make install
Notice the --disable-xmltest option. As back-end programs cannot be run on the front-end, we skip running the XML test. This example also uses a local installation of libunwind.
Configuration command:
CC=cc CFLAGS='-O3 -g' LDFLAGS='-O3 -g' CXX=CC CXXFLAGS='-O3 -g' ./configure \
  --with-mpi=/opt/cray/mpt/4.0.0/xt/seastar/mpich2-gnu --with-binary-type=64 \
  --with-xml-prefix=/sw/xt5/libxml2/2.7.6/sles10.1_gnu4.1.2 --disable-xmltest \
  --with-bfd=/opt/cray/cce/7.1.5/cray-binutils --with-liberty=/opt/cray/cce/7.1.5/cray-binutils \
  --enable-sampling --enable-shared=no --prefix=PREFIX \
  --with-papi=/opt/xt-tools/papi/3.7.2/v23 --with-unwind=/ccs/home/user/lib --without-dyninst
Build and installation commands:
make
make install
The Intel MIC accelerators (also codenamed KnightsFerry - KNF and KnightsCorner - KNC), or Xeon Phi processors, are not binary compatible with the host (even if it is an Intel x86 or x86/64 chip), thus the Extrae package must be compiled specifically for the accelerator (twice if you want Extrae for the host as well). While the host configuration and installation has been shown before, in order to compile Extrae for the accelerator you must configure Extrae like:
./configure --with-mpi=/opt/intel/impi/4.1.0.024/mic --without-dyninst --without-papi \
  --without-unwind --disable-xml --disable-posix-clock --with-libz=/opt/extrae/zlib-mic \
  --host=x86_64-suse-linux-gnu --prefix=/home/Computational/harald/extrae-mic --enable-mic \
  CFLAGS="-O -mmic -I/usr/include" CC=icc CXX=icpc \
  MPICC=/opt/intel/impi/4.1.0.024/mic/bin/mpiicc
To compile it, just issue:
make
make install
If using the GNU toolchain to compile the library, we suggest using at least version 4.6.2 because of its enhanced support for this architecture.
Configuration command:
CC=/gpfs/APPS/BIN/GCC-4.6.2/bin/gcc-4.6.2 ./configure --prefix=/gpfs/CEPBATOOLS/extrae/2.2.0 \
  --with-unwind=/gpfs/CEPBATOOLS/libunwind/1.0.1-git \
  --with-papi=/gpfs/CEPBATOOLS/papi/4.2.0 --with-mpi=/usr \
  --enable-posix-clock --without-dyninst
Build and installation commands:
make
make install
Configuration command:
export MP_IMPL=anl2
./configure --prefix=PREFIX \
  --with-mpi=/gpfs/apps/MPICH2/mx/1.0.8p1..3/32 \
  --with-papi=/gpfs/apps/PAPI/3.6.2-970mp --with-binary-type=64 \
  --without-dyninst --without-unwind
Build and installation commands:
make
make install
Configuration command:
CC=xlc CXX=xlC ./configure --prefix=PREFIX --with-mpi=/opt/ibmhpc/ppe.poe --without-dyninst --without-unwind --without-papi
Build and installation commands:
make
make install
Configuration command:
./configure --prefix=PREFIX --with-mpi=/opt/ibmhpc/ppe.poe --without-dyninst --without-unwind --without-papi
Build and installation commands:
MP_COMPILER=gcc make
make install
Configuration command, enabling MPI, PAPI and online analysis over MRNet.
./configure --prefix=/zhome/academic/HLRS/xhp/xhpgl/tools/extrae/intel \
  --with-mpi=/opt/cray/mpt/7.1.2/gni/mpich2-intel/140 \
  --with-unwind=/zhome/academic/HLRS/xhp/xhpgl/tools/libunwind --without-dyninst \
  --with-papi=/opt/cray/papi/5.3.2.1 --enable-online \
  --with-mrnet=/zhome/academic/HLRS/xhp/xhpgl/tools/mrnet/4.1.0 \
  --with-spectral=/zhome/academic/HLRS/xhp/xhpgl/tools/spectral/3.1 \
  --with-synapse=/zhome/academic/HLRS/xhp/xhpgl/tools/synapse/2.0
Build and installation commands:
make
make install
With the following modules loaded
module swap PrgEnv-XXX/YYY PrgEnv-cray/5.2.40
module load cray-mpich
Configuration command, enabling MPI and PAPI, with the aforementioned modules loaded:
./configure --prefix=${PREFIX} --with-mpi=/opt/cray/mpt/7.1.1/gni/mpich2-cray/83 \
  --with-binary-type=64 --with-unwind=/home/markomg/lib --without-dyninst \
  --disable-xmltest --with-bfd=/opt/cray/cce/default/cray-binutils \
  --with-liberty=/opt/cray/cce/default/cray-binutils \
  --enable-sampling --enable-shared=no --with-papi=/opt/cray/papi/5.3.2.1
Build and installation commands:
make
make install
If you are interested in knowing how an Extrae package was configured, execute the following command after setting EXTRAE_HOME to the base location of the installation:
${EXTRAE_HOME}/etc/configured.sh
This command will show the configure command itself and the location of some dependencies of the instrumentation package.
Extrae is configured through a XML file that is set through the EXTRAE_CONFIG_FILE environment variable. The included examples provide several XML files to serve as a basis for the end user. For instance, the MPI examples provide four XML configuration files:
Please note that most of the nodes present in the XML file have an enabled attribute that allows turning parts of the instrumentation mechanism on and off. For example, <mpi enabled="yes"> means that MPI instrumentation is enabled and all the contained XML subnodes, if any, will be processed; whereas <mpi enabled="no"> means to skip gathering MPI information and not to process its XML subnodes.
Each section points out which environment variables can be used if the tracing package lacks XML support. See appendix B for the entire list.
Sometimes the XML tags are used for time selection (duration, for instance). In such tags, the following postfixes can be used: n or ns for nanoseconds, u or us for microseconds, m or ms for milliseconds, s for seconds, M for minutes, H for hours and D for days.
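For example, using two tags that appear later in this chapter, a hypothetical configuration could express half a millisecond and ten minutes as:

<threshold enabled="yes">500u</threshold>
<minimum-time enabled="yes">10M</minimum-time>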
The basic trace behavior is determined in the first part of the XML and contains all of the remaining options. It looks like:
<?xml version='1.0'?>

<trace enabled="yes"
 home="@sed_MYPREFIXDIR@"
 initial-mode="detail"
 type="paraver"
 xml-parser-id="@sed_XMLID@"
>
  < ... other XML nodes ... >
</trace>
The <?xml version='1.0'?> line is mandatory for all XML files. Do not touch it. The available tunable options are under the <trace> node:
See EXTRAE_ON, EXTRAE_HOME, EXTRAE_INITIAL_MODE and EXTRAE_TRACE_TYPE environment variables in appendix B.
The MPI configuration part is nested in the config file (see section 4.1) and its nodes are the following:
<mpi enabled="yes">
  <counters enabled="yes" />
</mpi>
The MPI instrumentation can gather performance information at the beginning and end of MPI calls. To activate this behavior, just set the enabled attribute of the nested <counters> node to yes.
See EXTRAE_DISABLE_MPI and EXTRAE_MPI_COUNTERS_ON environment variables in appendix B.
The pthread configuration part is nested in the config file (see section 4.1) and its nodes are the following:
<pthread enabled="yes">
  <locks enabled="no" />
  <counters enabled="yes" />
</pthread>
The tracing package allows gathering information about some pthread routines. In addition, the user can also enable gathering information about locks and gathering performance counters in all of these routines. This is achieved by setting the enabled attribute of the <locks> and <counters> nodes, respectively.
See EXTRAE_DISABLE_PTHREAD, EXTRAE_PTHREAD_LOCKS and EXTRAE_PTHREAD_COUNTERS_ON environment variables in appendix B.
The OpenMP configuration part is nested in the config file (see section 4.1) and its nodes are the following:
<openmp enabled="yes">
  <locks enabled="no" />
  <counters enabled="yes" />
</openmp>
The tracing package allows gathering information about some OpenMP runtimes and their outlined routines. In addition, the user can also enable gathering information about locks and gathering performance counters in all of these routines. This is achieved by setting the enabled attribute of the <locks> and <counters> nodes, respectively.
See EXTRAE_DISABLE_OMP, EXTRAE_OMP_LOCKS and EXTRAE_OMP_COUNTERS_ON environment variables in appendix B.
<callers enabled="yes">
  <mpi enabled="yes">1-3</mpi>
  <sampling enabled="no">1-5</sampling>
  <dynamic-memory enabled="no">1-5</dynamic-memory>
</callers>
Callers are the routine addresses present in the process stack at any given moment during the application run. Callers can be used to link the tracefile with the source code of the application.
The instrumentation library can collect a partial view of those addresses during the instrumentation. Such collected addresses are translated by the merging process if the corresponding parameter is given and the application has been compiled and linked with debug information.
There are three points where the instrumentation can gather this information:
The user can choose which addresses to save in the trace (starting from 1, which is the closest point to the MPI call or sampling point), specifying several stack levels by separating them with commas or using the hyphen symbol for ranges, as in the sketch below.
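For example, a hypothetical selection keeping the closest caller plus the third to fifth stack levels for MPI calls would read:

<callers enabled="yes">
  <mpi enabled="yes">1,3-5</mpi>
</callers>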
See EXTRAE_MPI_CALLER environment variable in appendix B.
<user-functions enabled="no"
  list="/home/bsc41/bsc41273/user-functions.dat"
  exclude-automatic-functions="no">
  <counters enabled="yes" />
</user-functions>
The file contains a list of functions to be instrumented by Extrae. There are different alternatives to instrument application functions, and some alternatives provide additional flexibility; as a result, the format of the list varies depending on the instrumentation mechanism used:
To discover the instrumentable loops and basic blocks of a certain function, execute the command $EXTRAE_HOME/bin/extrae -config extrae.xml -decodeBB, where extrae.xml is an Extrae configuration file whose list attribute points to the user functions you want information about.
# nm -a pi | grep pi_kernel
00000000004005ed T pi_kernel

and add the line 00000000004005ed # pi_kernel into the function list.
The exclude-automatic-functions attribute is used only by the DynInst instrumenter. By setting this attribute to yes the instrumenter will avoid automatically instrumenting the routines that either call OpenMP outlined routines (i.e. routines with OpenMP pragmas) or call CUDA kernels.
Finally, in order to gather performance counters in these functions, and also in those instrumented using the extrae_user_function API call, the counters node has to be enabled.
Warning! Note that you need to compile your application binary with debugging information (typically the -g compiler flag) in order to translate the captured addresses into valuable information such as: function name, file name and line number.
See EXTRAE_FUNCTIONS environment variable in appendix B.
The instrumentation library can be compiled with support for collecting performance metrics of different components available on the system. These components include:
Here is an example of the counters section in the XML configuration file:
<counters enabled="yes">
  <cpu enabled="yes" starting-set-distribution="1">
    <set enabled="yes" domain="all" changeat-time="5s">
      PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L1_DCM
      <sampling enabled="yes" period="100000000">PAPI_TOT_CYC</sampling>
    </set>
    <set enabled="yes" domain="user" changeat-globalops="5">
      PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_FP_INS
    </set>
  </cpu>
  <network enabled="yes" />
  <resource-usage enabled="yes" />
</counters>
See EXTRAE_COUNTERS, EXTRAE_NETWORK_COUNTERS and EXTRAE_RUSAGE environment variables in appendix B.
Processor performance counters are configured in the <cpu> nodes. The user can configure many sets in the <cpu> node using the <set> node, but just one set will be used at any given time in a specific task. The <cpu> node supports the starting-set-distribution attribute with the following accepted values:
Each set contains a list of performance counters to be gathered at different instrumentation points (see sections 4.2, 4.4 and 4.6). If the tracing library is compiled to support PAPI, performance counters must be given using their canonical names (like PAPI_TOT_CYC and PAPI_L1_DCM) or their PAPI codes in hexadecimal format (like 8000003b and 80000000, respectively). If the tracing library is compiled to support PMAPI, only one group identifier can be given per set, and it can be either the group name (like pm_basic and pm_hpmcount1) or the group number (like 6 and 22, respectively).
In the given example (which refers to PAPI support in the tracing library) two sets are defined. The first set will read PAPI_TOT_INS (total instructions), PAPI_TOT_CYC (total cycles) and PAPI_L1_DCM (1st level cache misses). The second set is configured to obtain PAPI_TOT_INS (total instructions), PAPI_TOT_CYC (total cycles) and PAPI_FP_INS (floating point instructions).
Additionally, if the underlying performance library supports sampling mechanisms, each set can be configured to gather information (see section 4.5) each time the specified counter reaches a specific value. The counter that is used for sampling must be present in the set. In the given example, the first set is enabled to gather sampling information every 100M cycles.
Furthermore, performance counters can be configured to report accounting on different basis depending on the domain attribute specified on each set. Available options are
In the given example, the first set is configured to count all the events that occurred, while the second one only counts those events that occurred while the application was running in user-space mode.
Finally, the instrumentation can change the active set manually or automatically. To change the active set manually see the Extrae_previous_hwc_set and Extrae_next_hwc_set API calls in section 5.1. To change the active set automatically two options are available: based on time and based on application code. The former mechanism requires adding the changeat-time attribute specifying the minimum time to hold the set. The latter requires adding the changeat-globalops attribute with a value; the tracing library will automatically change the active set when the application has executed as many MPI global operations as selected in that attribute. In any case, if either attribute is set to zero, the set will not be changed automatically.
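As an illustration of the manual mechanism, here is a minimal C sketch that rotates the active set every ten iterations. The routine name solver_iteration is hypothetical; Extrae_next_hwc_set is the API call referenced above, declared in the ${EXTRAE_HOME}/include/extrae.h header described in chapter 5:

#include "extrae.h"

void solver_iteration (int iter)
{
   /* Rotate to the next counter <set> every 10 iterations so that
      different sets cover different phases of the run */
   if (iter % 10 == 0)
      Extrae_next_hwc_set ();

   /* ... original computation ... */
}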
Network performance counters are only available on systems with Myrinet GM/MX networks and they are fixed depending on the firmware used. Other systems, like BG/* may provide some network performance counters, but they are accessed through the PAPI interface (see section 4.7 and PAPI documentation).
If <network> is enabled, the network performance counters appear at the end of the application run, giving a summary for the whole run.
Operating system accounting is obtained through the getrusage(2) system call when <resource-usage> is enabled. Like the network performance counters, these counters appear at the end of the application run, giving a summary for the whole run.
The instrumentation package can be instructed on what, where and how to produce the intermediate trace files. These are the available options:
<storage enabled="no">
  <trace-prefix enabled="yes">TRACE</trace-prefix>
  <size enabled="no">5</size>
  <temporal-directory enabled="yes">/scratch</temporal-directory>
  <final-directory enabled="yes">/gpfs/scratch/bsc41/bsc41273</final-directory>
</storage>
Such options refer to:
See EXTRAE_PROGRAM_NAME, EXTRAE_FILE_SIZE, EXTRAE_DIR, EXTRAE_FINAL_DIR and EXTRAE_GATHER_MPITS environment variables in appendix B.
Modify the buffer management entry to tune the tracing buffer behavior.
<buffer enabled="yes">
  <size enabled="yes">150000</size>
  <circular enabled="no" />
</buffer>
By default (even if the enabled attribute is "no"), the tracing buffer is set to 500k events. If <size> is enabled, the tracing buffer will be set to the number of events indicated by this node. If the circular option is enabled, the buffer will be created as a circular buffer and will be dumped only once, with the last events generated by the tracing package.
See EXTRAE_BUFFER_SIZE environment variable in appendix B.
<trace-control enabled="yes">
  <file enabled="no" frequency="5M">/gpfs/scratch/bsc41/bsc41273/control</file>
  <global-ops enabled="no">10</global-ops>
  <remote-control enabled="yes">
    <mrnet enabled="yes" target="150" analysis="spectral" start-after="30">
      <clustering max_tasks="26" max_points="8000"/>
      <spectral min_seen="1" max_periods="0" num_iters="3" signals="DurBurst,InMPI"/>
    </mrnet>
    <signal enabled="no" which="USR1"/>
  </remote-control>
</trace-control>
This section groups together a set of options to limit/reduce the final trace size. There are three mechanisms: based on file existence, on executed global operations, and on external remote-control procedures.
Regarding the file, the application starts with tracing disabled, and it is turned on when a control file is created. Use the frequency property to choose how often this check must be done. If not supplied, the check happens every 100 global operations on MPI_COMM_WORLD.
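For instance, with the (site-specific) path from the example above, tracing could be switched on in the middle of a run from another shell simply by creating the control file:

touch /gpfs/scratch/bsc41/bsc41273/control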
If the global-ops tag is enabled, the instrumentation package begins disabled and starts the tracing when the given number of global operations on MPI_COMM_WORLD has been executed.
The remote-control tag section allows configuring external mechanisms to automatically control the tracing. Currently there is only one option, which is built on top of MRNet and is based on clustering and spectral analysis to generate a small yet representative trace.
These are the options in the mrnet tag:
The clustering tag configures the clustering analysis parameters:
The spectral tag section configures the spectral analysis parameters:
A signal can be used to terminate the tracing when using the remote control. Available values can only be USR1 or USR2. Some MPI implementations handle one of those, so check first which one is available to you. Set in the signal tag the signal code you want to use.
See EXTRAE_CONTROL_FILE, EXTRAE_CONTROL_GLOPS, EXTRAE_CONTROL_TIME environment variables in appendix B.
<bursts enabled="no">
  <threshold enabled="yes">500u</threshold>
  <mpi-statistics enabled="yes" />
</bursts>
If the user enables this option, the instrumentation library will only emit information about computation bursts (i.e., it does not trace MPI calls, the OpenMP runtime, and so on) when the current mode (set through initial-mode in section 4.1) is bursts. The library will discard all computation bursts that last less than the selected threshold.
In addition to that, when the tracing library is running in burst mode, it computes some statistics of MPI activity. Such statistics can be dumped in the tracefile by enabling mpi-statistics.
See EXTRAE_INITIAL_MODE, EXTRAE_BURST_THRESHOLD and EXTRAE_MPI_STATISTICS environment variables in appendix B.
<others enabled="yes">
  <minimum-time enabled="no">10m</minimum-time>
</others>
This section contains other configuration details that do not fit in the previous sections. Right now there is only one option available, which tells the instrumentation package the minimum instrumentation time. To enable it, set enabled to "yes" and set the minimum time within the minimum-time tag.
<sampling enabled="no" type="default" period="50m" variability="10m"/>
This section configures the time-based sampling capabilities. Every sample contains processor performance counters (if enabled in section 4.7.1 and either PAPI or PMAPI are referred at configure time) and callstack information (if enabled in section 4.5 and proper dependencies are set at configure time).
This section contains two attributes besides enabled. These are
See EXTRAE_SAMPLING_PERIOD, EXTRAE_SAMPLING_VARIABILITY, EXTRAE_SAMPLING_CLOCKTYPE and EXTRAE_SAMPLING_CALLER environment variables in appendix B.
<cuda enabled="yes" />
This section indicates whether the CUDA calls should be instrumented or not. If enabled is set to yes, CUDA calls will be instrumented, otherwise they will not be instrumented.
<opencl enabled="yes" />
This section indicates whether the OpenCL calls should be instrumented or not. If enabled is set to yes, OpenCL calls will be instrumented, otherwise they will not be instrumented.
<input-output enabled="no" />
This section indicates whether I/O calls (read and write) are meant to be instrumented. If enabled is set to yes, the aforementioned calls will be instrumented, otherwise they will not be instrumented.
Note: This is an experimental feature, and needs to be enabled at configure time using the --enable-instrument-io option.
Warning! This option seems to interfere with the instrumentation of the GNU and Intel OpenMP runtimes, and the issues have not been solved yet.
<dynamic-memory enabled="no">
  <alloc enabled="yes" threshold="32768" />
  <free enabled="yes" />
</dynamic-memory>
This section indicates whether dynamic memory calls (malloc, free, realloc) are meant to be instrumented, and allows deciding separately whether allocation-related and free-related calls shall be instrumented. Additionally, the configuration can indicate that allocation calls should only be instrumented if the requested memory size surpasses a given threshold (32768 bytes, in the example).
Note: This is an experimental feature, and needs to be enabled at configure time using the --enable-instrument-dynamic-memory option.
Warning! This option seems to interfere with the instrumentation of the Intel OpenMP runtime, and the issues have not been solved yet.
<pebs-sampling enabled="yes">
  <loads enabled="yes" period="1000000" minimum-latency="10" />
  <stores enabled="no" period="1000000" />
</pebs-sampling>
This section tells Extrae to use the PEBS feature of recent Intel processors to sample memory references. These samples capture the linear address referenced, the component of the memory hierarchy that resolved the reference and the number of cycles needed to resolve it. In the example above, PEBS monitors one out of every million load instructions and only keeps those that require at least 10 cycles to be resolved.
Note: This is an experimental feature, and needs to be enabled at configure time using the --enable-pebs-sampling option.
<merge enabled="yes"
  synchronization="default"
  binary="mpi_ping"
  tree-fan-out="16"
  max-memory="512"
  joint-states="yes"
  keep-mpits="yes"
  sort-addresses="yes"
  overwrite="yes"
>
  mpi_ping.prv
</merge>
If this section is enabled and the instrumentation package is configured to support this, the merge process will be automatically invoked after the application run. The merge process will use all the resources devoted to running the application.
In the example given, the leaf of this node will be used as the tracefile name (mpi_ping.prv in this example). Current available options for the merge process are given as attribute of the <merge> node and they are:
On Linux systems, the tracing package can take advantage of certain system functionalities to guess the binary name, and from it the tracefile name. On such systems, you can use the following reduced XML section instead of the earlier one.
<merge enabled="yes"
  synchronization="default"
  tree-fan-out="16"
  max-memory="512"
  joint-states="yes"
  keep-mpits="yes"
  sort-addresses="yes"
  overwrite="yes"
/>
For further references, see chapter 6.
XML tags and attributes can refer to environment variables that are defined in the environment during the application run. If you want to refer to an environment variable within the XML file, just enclose the name of the variable using the dollar symbol ($), for example: $FOO$.
Note that the user has to put either a specific value or a reference to an environment variable, which means that expanding environment variables within surrounding text is not allowed as in a regular shell (i.e., the instrumentation package will not convert the following text: bar$FOO$bar).
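For example, a hypothetical storage section could take the final directory from the environment, provided the variable (here TRACE_DIR, an arbitrary name) is exported before the run:

<final-directory enabled="yes">$TRACE_DIR$</final-directory>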
There are two levels of API in the Extrae instrumentation package. The basic API refers to the basic functionality provided and includes emitting events, source code tracking, changing the instrumentation mode and so on. The extended API is an experimental addition that provides several of the basic API functionalities within single, more powerful calls using specific data structures.
The following routines are defined in ${EXTRAE_HOME}/include/extrae.h. These routines are intended to be called by C/C++ programs. The instrumentation package also provides bindings for Fortran applications. The Fortran API bindings have the same names as the C API but honor the Fortran compiler's function name mangling scheme. To use the API in Fortran applications you must use the module provided in $EXTRAE_HOME/include/extrae_module.f through the use language clause. This module provides the appropriate function and constant declarations for Extrae.
Some common uses of events are:
for (i = 1; i <= MAX_ITERS; i++)
{
  Extrae_event (1000, i);
  [original loop code]
}
Extrae_event (1000, 0);

The last added call to Extrae_event marks the end of the loop by setting the event value to 0, which facilitates the analysis with Paraver.
void routine1 (void)
{
  Extrae_event (6000019, 1);
  [routine 1 code]
  Extrae_event (6000019, 0);
}

void routine2 (void)
{
  Extrae_event (6000019, 2);
  [routine 2 code]
  Extrae_event (6000019, 0);
}
void routine1 (void)
{
  Extrae_user_function (1);
  [routine 1 code]
  Extrae_user_function (0);
}

void routine2 (void)
{
  Extrae_user_function (1);
  [routine 2 code]
  Extrae_user_function (0);
}
In order to gather performance counters during the execution of these calls, the user-functions tag in the XML configuration and its counters have to be both enabled.
Warning! Note that you need to compile your application binary with debugging information (typically the -g compiler flag) in order to translate the captured addresses into valuable information such as: function name, file name and line number.
NOTE: This API is in experimental stage and it is only available in C. Use it at your own risk!
The extended API makes use of two special structures located in ${PREFIX}/include/extrae_types.h: extrae_UserCommunication and extrae_CombinedEvents. The former is intended to encode an event that will be converted into a Paraver communication when its partner equivalent event has been found. The latter is used to generate events containing multiple kinds of information at the same time.
struct extrae_UserCommunication
{
  extrae_user_communication_types_t type;
  extrae_comm_tag_t tag;
  unsigned size; /* size_t? */
  extrae_comm_partner_t partner;
  extrae_comm_id_t id;
};
The structure extrae_UserCommunication contains the following fields:
struct extrae_CombinedEvents
{
  /* These are used as boolean values */
  int HardwareCounters;
  int Callers;
  int UserFunction;
  /* These are intended for N events */
  unsigned nEvents;
  extrae_type_t *Types;
  extrae_value_t *Values;
  /* These are intended for user communication records */
  unsigned nCommunications;
  extrae_user_communication_t *Communications;
};
The structure extrae_CombinedEvents contains the following fields:
The extended API contains the following routines:
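As an illustration, here is a minimal C sketch that emits one event together with hardware counters in a single record. It assumes the extended API provides the Extrae_init_CombinedEvents and Extrae_emit_CombinedEvents routines; check ${EXTRAE_HOME}/include/extrae.h for the exact prototypes available in your installation:

#include "extrae.h"

void mark_phase (extrae_value_t phase)
{
   struct extrae_CombinedEvents ev;
   extrae_type_t  types[1]  = { 1000 };   /* arbitrary event type */
   extrae_value_t values[1] = { phase };

   Extrae_init_CombinedEvents (&ev);      /* assumed initializer */
   ev.HardwareCounters = 1;               /* attach performance counters */
   ev.nEvents = 1;
   ev.Types   = types;
   ev.Values  = values;
   Extrae_emit_CombinedEvents (&ev);
}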
If Java is enabled at configure time, a basic instrumentation library for serial applications, based on JNI bindings to Extrae, will be installed. The current bindings are within the package es.bsc.cepbatools.extrae and the following bindings are provided:
Since Extrae does not have features to automatically discover the thread identifiers of the threads that run within the virtual machine, there are some calls that allow doing this manually. These calls are, however, intended for expert users and should be avoided whenever possible because their behavior may be heavily modified, or they may even be removed, in future releases.
Once the application has finished, and if the automatic merge process is not set up, the merge must be executed manually. Here we detail how to run the merge process manually.
The probes inserted into the instrumented binary are responsible for gathering performance metrics of each task/thread, and for each of them several files are created in the location the XML configuration file specifies (see section 4.8). Such files are:
In order to use Paraver, those intermediate files (i.e., .mpit files) must be merged and translated into the Paraver trace file format. The same applies if the user wants to use the Dimemas simulator. To proceed with either translation, all the intermediate trace files must be merged into a single trace file using one of the available mergers in the bin directory (see table 6.1).
The target trace type is defined in the XML configuration file used at the instrumentation step (see section 4.1), and it has to match the merger used (mpi2prv and mpimpi2prv for Paraver; mpi2dim and mpimpi2dim for Dimemas). However, it is possible to force the output format regardless of the selection in the XML file using the -paraver or -dimemas parameters, as in the sketch below.
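For example, a sketch forcing a Dimemas trace out of mpi2prv, regardless of the XML selection, would be:

${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -dimemas -o output.dim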
As stated before, there are two Paraver mergers: mpi2prv and mpimpi2prv. The former is for use in a single processor mode while the latter is meant to be used with multiple processors using MPI (and cannot be run using one MPI task).
The Paraver mergers receive a set of intermediate trace files and generate three files with the same name (set with the -o option) but different extensions. The Paraver trace itself (.prv file) contains timestamped records that represent the information gathered during the execution of the instrumented application. The merger also generates the Paraver Configuration File (.pcf file), which is responsible for translating the values contained in the Paraver trace into more human-readable labels. Finally, it also generates a file containing the distribution of the application across the cluster computation resources (.row file).
The following sections describe the available options for the Paraver mergers. Typically, options available for single processor mode are also available in the parallel version, unless specified.
These are the available options for the sequential Paraver merger:
These options are specific to the parallel version of the Paraver merger:
As stated before, there are two Dimemas mergers: mpi2dim and mpimpi2dim. The former is for use in a single processor mode while the latter is meant to be used with multiple processors using MPI.
In contrast with the Paraver mergers, the Dimemas mergers generate, from the given intermediate trace files, a single output file with the .dim extension that is suitable for the Dimemas simulator.
These are the available options for both Dimemas mergers:
There are some environment variables related to the merging phase:
This environment variable lets the user add custom information to the generated Paraver Configuration File (.pcf). Just set this variable to point to a file containing labels for the unknown (user) events.
The format for the file is:
EVENT_TYPE
0 [type1] [label1]
0 [type2] [label2]
...
0 [typeK] [labelK]
Where [typeN] is the event type and [labelN] is the description for events of type [typeN]. It is also possible to link both the type and the values of an event:
EVENT_TYPE
0 [type] [label]
VALUES
[value1] [label1]
[value2] [label2]
...
[valueN] [labelN]
With this information, Paraver can deal with both types and values when giving textual information to the end user. If Paraver does not find any information for an event type or value, it will show it in numerical form.
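For instance, reusing the event types from the examples in chapter 5, an illustrative labels file could read:

EVENT_TYPE
0 1000 Loop iteration

EVENT_TYPE
0 6000019 Routine identifier
VALUES
0 End
1 routine1
2 routine2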
Points to a directory where all intermediate temporary files will be stored. These files will be removed as soon as the application ends.
Points to a directory where all intermediate temporary files will be stored. These files will be removed as soon as the application ends.
Extrae On-line is a new module developed for the Extrae tracing toolkit, available from version 3.0, that incorporates intelligent monitoring, analysis and selection of the traced data. This tracing setup is tailored towards long executions that are producing large traces. Applying automatic analysis techniques based on clustering, signal processing and active monitoring, Extrae gains the ability to inspect and filter the data while it is being collected to minimize the amount of data emitted into the trace, while maximizing the amount of relevant information presented.
Extrae On-line has been developed on top of Synapse, a framework that facilitates the deployment of applications that follow the master/slave architecture based on the MRNet software overlay network. Thanks to its modular design, new types of automatic analyses can be added very easily as new plug-ins into the on-line tracing system, just by defining new Synapse protocols.
This document briefly describes the main features of the Extrae On-line module, and shows how it has to be configured and the different options available.
Extrae On-line currently supports three types of automatic analyses: fine-grain structure detection based on clustering techniques, periodicity detection based on signal processing techniques, and multi-experiment analysis based on active monitoring techniques. Extrae On-line has to be configured to apply one of these types of analyses, and then the analysis will be performed periodically as new data is being traced.
This mechanism aims at identifying the fine-grain structure of the computing regions of the program. Applying density-based clustering, this method is able to expose the main performance trends in the computations, and this information is useful to focus the analysis on the zones of real interest. To perform the cluster analysis, Extrae On-line relies on the ClusteringSuite tool.
At each phase of analysis, several outputs are produced:
Subsequent clustering results can be used to study the evolution of the application over time. In order to study how the clusters are evolving, the xtrack tool can be used.
This mechanism allows detecting iterative patterns over a wide region of time, and precisely delimiting where the iterations start. Once a period has been found, the iterations presenting fewer perturbations are selected to produce a representative trace, and the rest of the data is basically discarded. The result of applying this mechanism is a compact trace where only the representative iterations are traced in full detail, while for the rest of the execution we can optionally keep summarized information in the form of phase profiles or a "low resolution" trace.
Please note that applying this technique to a very short execution, or if no periodicity can be detected in the application, may result in an empty trace depending on the configuration options selected (see Section 7.3).
This mechanism employs active measurement techniques to simulate different execution scenarios within a single execution. Extrae On-line is able to add controlled interference into the program to simulate different computation loads, network bandwidths and memory congestion, and even to tune some configurations of the parallel runtime (currently the MPI Dynamic Load Balance (DLB) runtime is supported). The application behavior can then be studied under different circumstances, and tracking can be used to analyze the impact of these configurations on the program's performance. This technique aims at reducing the number of executions necessary to evaluate different parameters and characteristics of your program.
In order to activate the On-line tracing mode, the user has to enable the corresponding configuration section in the Extrae XML configuration file. This section is found under trace-control > remote-control > online. The default configuration is already ready to use:
<online enabled="yes" analysis="clustering" frequency="auto" topology="auto">
The available options for the <online> section are the following:
Depending on the analysis selected, the following specific options become available.
<clustering config="cl.I.IPC.xml"/>
<spectral max_periods="0" num_iters="3" min_seen="0" min_likeness="85">
  <spectral_advanced enabled="no" burst_threshold="80">
    <periodic_zone detail_level="profile"/>
    <non_periodic_zone detail_level="bursts" min_duration="3s"/>
  </spectral_advanced>
</spectral>
The basic configuration options for the spectral analysis are the following:
Also, some advanced settings are tunable in the <spectral_advanced> section:
<gremlins start="0" increment="2" roundtrip="no" loop="no"/>
We present here three different examples of generating a Paraver tracefile. The first example requires the package to be compiled with the DynInst libraries. The second example uses the LD_PRELOAD (or LDR_PRELOAD[64]) mechanism to interpose code in the application; this mechanism is available on the Linux and FreeBSD operating systems and only works when the application uses dynamic libraries. Finally, there is an example using the static library of the instrumentation package.
DynInst is a third-party instrumentation library developed at UW Madison which can instrument in-memory binaries. It adds the flexibility to instrument the application without modifying the source code. DynInst is ported to different systems (Linux, FreeBSD) and to different architectures (x86, x86/64, PPC32, PPC64), but the functionality is common to all of them.
run_dyninst.sh:

#!/bin/sh

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export LD_LIBRARY_PATH=${EXTRAE_HOME}/lib
source ${EXTRAE_HOME}/etc/extrae.sh

## Run the desired program
${EXTRAE_HOME}/bin/extrae -config extrae.xml $*
A similar script can be found in the share/example/SEQ directory of your tracing package. Just tune the EXTRAE_HOME environment variable and make the script executable (using chmod u+x). You can pass the XML configuration file through the EXTRAE_CONFIG_FILE environment variable instead, if you prefer. Line 5 is responsible for loading all the environment variables needed by the DynInst launcher (called extrae) that is invoked in line 8.
In fact, there are two examples provided in share/example/SEQ: one for static (or manual) instrumentation and another for the DynInst-based instrumentation. When using the DynInst instrumentation, the user may add new routines to instrument using the existing function-list file that is already pointed to by the extrae.xml configuration file. The way to specify the routines to instrument is to add one line with the name of each routine to be instrumented, as in the example below.
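For instance, a hypothetical function-list file instrumenting two routines would simply contain the following (pi_kernel comes from the nm example in chapter 4; compute_forces is an invented name):

pi_kernel
compute_forces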
Running OpenMP applications using DynInst is rather similar to serial codes. Just compile the application with the appropriate OpenMP flags and run as before. You can find an example in the share/example/OMP directory.
MPI applications can also be instrumented using the DynInst instrumenter. The instrumentation is done independently for each spawned MPI process, so in order to execute the DynInst-based instrumentation package on an MPI application, you must make sure that your MPI launcher supports running shell scripts. The following scripts show how to run the DynInst instrumenter from the MOAB/Slurm queue system. The first script just sets the environment for the job, whereas the second is responsible for instrumenting every spawned task.
slurm_trace.sh:

#!/bin/bash
# @ initialdir = .
# @ output = trace.out
# @ error = trace.err
# @ total_tasks = 4
# @ cpus_per_task = 1
# @ tasks_per_node = 4
# @ wall_clock_limit = 00:10:00
# @ tracing = 1

srun ./run.sh ./mpi_ping
The most important part of the previous script is line 11, which is responsible for spawning the MPI tasks (using the srun command). The spawn method is told to execute ./run.sh ./mpi_ping, which instruments the mpi_ping binary through the run.sh script. You must adapt this file to your queue system (if any) and to your MPI submission mechanism (i.e., change srun to mpirun, mpiexec, poe, etc.). Note that changing line 11 to read ./run.sh srun ./mpi_ping would result in instrumenting the srun application, not mpi_ping.
run.sh:

#!/bin/bash

export EXTRAE_HOME=@sub_PREFIXDIR@
source ${EXTRAE_HOME}/etc/extrae.sh

# Only show output for task 0, other tasks send their output to /dev/null
if test "${SLURM_PROCID}" == "0" ; then
  ${EXTRAE_HOME}/bin/extrae -config ../extrae.xml $@ > job.out 2> job.err
else
  ${EXTRAE_HOME}/bin/extrae -config ../extrae.xml $@ > /dev/null 2> /dev/null
fi
This is the script responsible for instrumenting a single MPI task. On line 4 we set up the instrumentation environment by executing the commands from extrae.sh. Then we execute the binary passed to the run.sh script on lines 8 and 10. Both lines execute the same command, except that line 8 sends the output to two files (one for standard output, another for standard error) while line 10 sends all the output to /dev/null.
Please note that this script is particularly adapted to the MOAB/Slurm queue systems. You may need to adapt it to other systems by using the appropriate environment variables. In particular, SLURM_PROCID identifies the MPI task id (i.e., the task rank) and may need to be changed to the proper environment variable (PMI_RANK on ParaStation/Torque/MOAB systems or MXMPI_ID on systems having Myrinet MX devices, for example).
The LD_PRELOAD (or LDR_PRELOAD[64] on AIX) interposition mechanism only works for binaries that are linked against shared libraries. The interposition is done by the runtime loader by substituting the original symbols with those provided by the instrumentation package. This mechanism is known to work on the Linux, FreeBSD and AIX operating systems; although it may be available on other operating systems (possibly under different names), those are untested.
We show how this mechanism works on Linux (or similar environments) in subsection 8.2.1 and on AIX in subsection 8.2.3.
The following script preloads the libmpitrace library to instrument the MPI calls of the application passed as an argument (tune EXTRAE_HOME according to your installation).
trace.sh:

#!/bin/sh

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_CONFIG_FILE=extrae.xml
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so

## Run the desired program
$*
The previous script can be found in the share/example/MPI/ld-preload directory of your tracing package. Copy the script to one of your directories, tune the EXTRAE_HOME environment variable and make the script executable (using chmod u+x). Also copy the XML configuration file extrae.xml from the share/example/MPI directory of the instrumentation package to the current directory. This file is used to configure the whole behavior of the instrumentation package (there is more information about the XML file in chapter 4). The last line in the script, $*, executes the arguments given to the script, so you can run the instrumentation by simply prefixing your execution command with the script.
Regarding the execution, if you run MPI applications from the command-line, you can issue the typical mpirun command as:
${MPI_HOME}/bin/mpirun -np N ./trace.sh mpi-app
where, ${MPI_HOME} is the directory for your MPI installation, N is the number of MPI tasks you want to run and mpi-app is the binary of the MPI application you want to run.
However, if you execute your MPI applications through a queue system, you may need to write a submission script. The following script is an example submission script for the MOAB/Slurm queuing system using the aforementioned trace.sh script for an execution of mpi-app on two processors.
slurm-trace.sh:

#! /bin/bash
#@ job_name = trace_run
#@ output = trace_run%j.out
#@ error = trace_run%j.out
#@ initialdir = .
#@ class = bsc_cs
#@ total_tasks = 2
#@ wall_clock_limit = 00:30:00

srun ./trace.sh mpi_app
If your system uses LoadLeveler your job script may look like:
ll.sh:

#! /bin/bash
#@ job_type = parallel
#@ output = trace_run.output
#@ error = trace_run.error
#@ blocking = unlimited
#@ total_tasks = 2
#@ class = debug
#@ wall_clock_limit = 00:10:00
#@ restart = no
#@ group = bsc41
#@ queue

export MLIST=/tmp/machine_list_${$}
/opt/ibmll/LoadL/full/bin/ll_get_machine_list > ${MLIST}
NP=`cat ${MLIST} | wc -l`

${MPI_HOME}/mpirun -np ${NP} -machinefile ${MLIST} ./trace.sh ./mpi-app

rm ${MLIST}
Besides the job specification given in lines 1-11, there are commands of particular interest. Lines 13-15 are used to determine which and how many nodes are involved in the computation. This information is given to the mpirun command to proceed with the execution. Once the execution has finished, the temporary file created on line 14 is removed on line 19.
There are two ways to instrument CUDA applications, depending on how the package was configured. If the package was configured with --enable-cuda, only the interposition on binaries linked against shared libraries is available. If the package was configured with --with-cupti, any kind of binary can be instrumented, because the instrumentation relies on the CUPTI library to instrument the CUDA calls. The example shown below is intended for the former case.
run.sh:

#!/bin/bash

export EXTRAE_HOME=/home/harald/extrae
export PAPI_HOME=/home/harald/aplic/papi/4.1.4

EXTRAE_CONFIG_FILE=extrae.xml LD_LIBRARY_PATH=${EXTRAE_HOME}/lib:${PAPI_HOME}/lib:${LD_LIBRARY_PATH} ./hello
${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -e ./hello
In this example, the hello application is compiled with the nvcc compiler and linked against the cudatrace library (-lcudatrace). The binary contains calls to Extrae_init and Extrae_fini and then executes a CUDA kernel. Line 6 refers to the execution of the application itself; the Extrae configuration file and the location of the shared libraries are set on this line. Line 7 invokes the merge process to generate the final tracefile.
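For reference, which of the two CUDA instrumentation modes is available depends on the flags given when the package was configured. The invocations below are only hypothetical sketches (the paths are placeholders; check ./configure --help of your Extrae version for the exact flag names):

# Interposition-based CUDA support (shared-library binaries only)
./configure --prefix=/opt/extrae --enable-cuda --with-cuda=/usr/local/cuda

# CUPTI-based CUDA support (any kind of binary)
./configure --prefix=/opt/extrae --with-cuda=/usr/local/cuda \
            --with-cupti=/usr/local/cuda/extras/CUPTI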
AIX typically ships with POE as MPI implementation and LoadLeveler as queue system. An example for a system with these software packages is given below. Please note that the example is intended for 64-bit applications; if using 32-bit applications, LDR_PRELOAD64 needs to be replaced with LDR_PRELOAD.
ll-aix64.sh:

#@ job_name = basic_test
#@ output = basic_stdout
#@ error = basic_stderr
#@ shell = /bin/bash
#@ job_type = parallel
#@ total_tasks = 8
#@ wall_clock_limit = 00:15:00
#@ queue

export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_CONFIG_FILE=extrae.xml
export LDR_PRELOAD64=${EXTRAE_HOME}/lib/libmpitrace.so

./mpi-app
Lines 1-8 contain a basic LoadLeveler job definition. Line 10 sets the Extrae package directory in the EXTRAE_HOME environment variable. Line 11 sets the XML configuration file that will be used to set up the tracing, and line 12 sets LDR_PRELOAD64, which is responsible for the instrumentation through the shared library libmpitrace.so. Finally, line 14 executes the application binary.
This is the basic instrumentation method, suited for those installations that support neither DynInst nor LD_PRELOAD, or that require adding manual calls to the Extrae API.
To get the instrumentation working on your code, you first have to link your application with the Extrae libraries. There are examples installed in your package distribution under the share/example directory. There you can find MPI, OpenMP, pthread and sequential examples, depending on the support selected at configure time.
Consider the example Makefile found in share/example/MPI/static:
Makefile:

MPI_HOME = /gpfs/apps/MPICH2/mx/1.0.7..2/64
EXTRAE_HOME = /home/bsc41/bsc41273/foreign-pkgs/extrae-11oct-mpich2/64
PAPI_HOME = /gpfs/apps/PAPI/3.6.2-970mp-patched/64
XML2_LDFLAGS = -L/usr/lib64
XML2_LIBS = -lxml2

F77 = $(MPI_HOME)/bin/mpif77
FFLAGS = -O2
FLIBS = $(EXTRAE_HOME)/lib/libmpitracef.a \
	-L$(PAPI_HOME)/lib -lpapi -lperfctr \
	$(XML2_LDFLAGS) $(XML2_LIBS)

all: mpi_ping

mpi_ping: mpi_ping.f
	$(F77) $(FFLAGS) mpi_ping.f $(FLIBS) -o mpi_ping

clean:
	rm -f mpi_ping *.o pingtmp? TRACE.*
Lines 2-5 define some Makefile variables that set up the location of the different packages needed by the instrumentation. In particular, EXTRAE_HOME sets where the Extrae package is located. In order to link your application with Extrae you have to add its libraries in the link stage (see lines 9-11 and 16). Besides libmpitracef.a we also add the PAPI library (-lpapi) and its dependency (-lperfctr, which you may or may not need), the libxml2 parsing library (-lxml2) and, finally, the bfd and liberty libraries (-lbfd and -liberty) if the instrumentation package was compiled to support merging after the trace (see chapter 3 for further information).
Executing an application with the statically linked version of the instrumentation package is very similar to the method shown in section 8.2. There is, however, a difference: LD_PRELOAD must not be set in trace.sh.
trace.sh:

#!/bin/sh
export EXTRAE_HOME=WRITE-HERE-THE-PACKAGE-LOCATION
export EXTRAE_CONFIG_FILE=extrae.xml
export LD_LIBRARY_PATH=${EXTRAE_HOME}/lib:\
/gpfs/apps/MPICH2/mx/1.0.7..2/64/lib:\
/gpfs/apps/PAPI/3.6.2-970mp-patched/64/lib

## Run the desired program
$*
See section 8.2 to learn how to run this script, either from the command line or through queue systems.
Regardless of the tracing method chosen, it is necessary to translate the intermediate tracefiles into a Paraver tracefile. The Paraver tracefile can be generated automatically (if the tracing package and the XML configuration file were set up accordingly; see chapters 3 and 4) or manually. When the automatic merging process is used, it employs all the resources allocated for the application to perform the merge once the application ends.
To manually generate the final Paraver tracefile, issue the following command:
${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -e mpi-app -o trace.prv
This command converts the intermediate files generated in the previous step into a single Paraver tracefile. The TRACE.mpits file is generated automatically by the instrumentation and contains references to all the intermediate files generated during the execution run. The -e parameter receives the application binary mpi-app in order to perform translations from addresses to source code; to use this feature, the binary must have been compiled with debugging information. Finally, the -o flag tells the merger how to name the resulting Paraver tracefile (trace.prv in this case).
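For large executions the sequential merge may take a long time. If your installation was built with the parallel merger (see chapter 3), an equivalent MPI-based merge can be launched instead. The following line is only a sketch, assuming the parallel merger binary is named mpimpi2prv and that N processors are available:

${MPI_HOME}/bin/mpirun -np N ${EXTRAE_HOME}/bin/mpimpi2prv \
  -f TRACE.mpits -e mpi-app -o trace.prv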
The following is a complete example of an Extrae XML configuration file:

<?xml version='1.0'?>

<trace enabled="yes"
 home="@sed_MYPREFIXDIR@"
 initial-mode="detail"
 type="paraver"
 xml-parser-id="@sed_XMLID@"
>
  <mpi enabled="yes">
    <counters enabled="yes" />
  </mpi>

  <pthread enabled="yes">
    <locks enabled="no" />
    <counters enabled="yes" />
  </pthread>

  <openmp enabled="yes">
    <locks enabled="no" />
    <counters enabled="yes" />
  </openmp>

  <callers enabled="yes">
    <mpi enabled="yes">1-3</mpi>
    <sampling enabled="no">1-5</sampling>
  </callers>

  <user-functions enabled="no" list="/home/bsc41/bsc41273/user-functions.dat" exclude-automatic-functions="no">
    <counters enabled="yes" />
  </user-functions>

  <counters enabled="yes">
    <cpu enabled="yes" starting-set-distribution="1">
      <set enabled="yes" domain="all" changeat-globalops="5">
        PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L1_DCM
        <sampling enabled="no" period="100000000">PAPI_TOT_CYC</sampling>
      </set>
      <set enabled="yes" domain="user" changeat-globalops="5">
        PAPI_TOT_INS,PAPI_FP_INS,PAPI_TOT_CYC
      </set>
    </cpu>
    <network enabled="yes" />
    <resource-usage enabled="yes" />
  </counters>

  <storage enabled="no">
    <trace-prefix enabled="yes">TRACE</trace-prefix>
    <size enabled="no">5</size>
    <temporal-directory enabled="yes">/scratch</temporal-directory>
    <final-directory enabled="yes">/gpfs/scratch/bsc41/bsc41273</final-directory>
  </storage>

  <buffer enabled="yes">
    <size enabled="yes">150000</size>
    <circular enabled="no" />
  </buffer>

  <trace-control enabled="yes">
    <file enabled="no" frequency="5M">/gpfs/scratch/bsc41/bsc41273/control</file>
    <global-ops enabled="no">10</global-ops>
    <remote-control enabled="yes">
      <mrnet enabled="yes" target="150" analysis="spectral" start-after="30">
        <clustering max_tasks="26" max_points="8000"/>
        <spectral min_seen="1" max_periods="0" num_iters="3" signals="DurBurst,InMPI"/>
      </mrnet>
      <signal enabled="no" which="USR1"/>
    </remote-control>
  </trace-control>

  <others enabled="yes">
    <minimum-time enabled="no">10m</minimum-time>
  </others>

  <bursts enabled="no">
    <threshold enabled="yes">500u</threshold>
    <mpi-statistics enabled="yes" />
  </bursts>

  <sampling enabled="no" type="default" period="50m" variability="10m"/>

  <opencl enabled="no" />
  <cuda enabled="no" />

  <merge enabled="yes"
    synchronization="default"
    binary="mpi_ping"
    tree-fan-out="16"
    max-memory="512"
    joint-states="yes"
    keep-mpits="yes"
    sort-addresses="yes"
    overwrite="yes"
  >
    mpi_ping.prv
  </merge>
</trace>
Although Extrae is configured through an XML file (pointed to by the EXTRAE_CONFIG_FILE environment variable), it also supports a minimal configuration via environment variables for those systems that do not have the library responsible for parsing the XML files (i.e., libxml2).
This appendix presents and describes the environment variables the Extrae package uses when EXTRAE_CONFIG_FILE is not set. Environment variables that refer to XML 'enabled' attributes (i.e., those that can be set to "yes" or "no") are considered enabled if their value is set to 1.
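As an illustration, a minimal session configured purely through environment variables could look like the sketch below. The variable names used here (EXTRAE_ON, EXTRAE_COUNTERS and EXTRAE_BUFFER_SIZE) are assumptions for illustration; check your version's documentation for the exact names supported:

#!/bin/sh
# Sketch: minimal configuration without an XML parser.
# Variable names are assumptions; verify them for your version.
export EXTRAE_ON=1                                 # enable tracing
export EXTRAE_COUNTERS=PAPI_TOT_INS,PAPI_TOT_CYC   # hardware counters to read
export EXTRAE_BUFFER_SIZE=150000                   # events kept in memory
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
${MPI_HOME}/bin/mpirun -np 2 ./mpi-app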
Extrae includes a battery of regression tests to evaluate whether recent versions of the instrumentation package keep their compatibility and that new changes have not introduced new faults. These tests are meant to be executed on the same machine that compiled Extrae; they are not intended to be executed through batch-queuing systems nor in cross-compilation environments. To invoke the tests, simply run the following command from the terminal:
make check
after the configuration and building process. It will automatically invoke all the tests one after another and will produce several summaries.
These tests are divided into different categories that stress different parts of Extrae. The categories currently tested include, but are not limited to:
These tests will change during the development of Extrae. If the reader has a suggestion for a particular test, please consider sending it to tools@bsc.es for consideration.
Extrae includes a set of tests to evaluate the overhead that different components impose on the application. These tests are installed in ${EXTRAE_HOME}/share/tests/overhead and can be run by executing the run_overhead_tests.sh script within this directory. Note that this script compiles and executes the generated binaries on the same system, so it will require some tuning to run on a system that uses a batch-queuing system and/or needs cross-compiling.
Currently, the following tests evaluate the time necessary to perform certain operations:
Figure D depicts the overhead of Extrae 3.3.0 on the following systems:
If Extrae fails while instrumenting an application or generating a tracefile, you may consider submitting a bug report to tools@bsc.es. Before submitting a bug report, consider looking at the Frequently Asked Questions in Appendix E, as they may contain valuable information to address the failure you observe.
In any case, if you find that Extrae fails and you want to submit a bug report, please collect as much information as possible to ease the bug-hunting process. The information required depends on whether the bug refers to a compilation or an execution issue.
The following list of items are valuable when reporting a compilation problem:
The following list of items are valuable when reporting an execution problem:
These are the instrumented MPI routines in the Extrae package:
The instrumentation of the Intel OpenMP runtime for versions 8.1 to 10.1 is only available using the Extrae package based on the DynInst library.
These are the instrumented routines of the Intel OpenMP runtime using DynInst:
The instrumentation of the Intel OpenMP runtime for versions 11.0 to 12.0 is available using the Extrae package based on LD_PRELOAD as well as on the DynInst mechanism. The instrumented routines include:
Extrae supports IBM OpenMP runtime 1.6.
These are the instrumented routines of the IBM OpenMP runtime:
Extrae supports GNU OpenMP runtime 4.2.
These are the instrumented routines of the GNU OpenMP runtime:
These are the instrumented routines of the pthread runtime:
These are the instrumented CUDA routines in the Extrae package:
The CUDA accelerators do not have memory for the tracing buffers, so the tracing buffer resides on the host side. Typically, the CUDA tracing buffer is flushed at cudaThreadSynchronize, cudaStreamSynchronize and cudaMemcpy calls, so it is possible that the tracing buffer for the device gets filled if no calls to these routines are executed.
These are the instrumented OpenCL routines in the Extrae package:
The OpenCL accelerators have small amounts of memory, so the tracing buffer resides on the host side. Typically, the accelerator tracing buffer is flushed at each clFinish call, so it is possible that the tracing buffer for the accelerator gets filled if no calls to this routine are executed. However, if the OpenCL command queue being operated is not tagged as Out-of-Order, flushes will also happen at clEnqueueReadBuffer, clEnqueueReadBufferRect and clEnqueueMapBuffer if their corresponding blocking parameter is set to true.