MVAPICH 1.0 User and Tuning Guide
MVAPICH Team
Last Revised: May 29, 2008
InfiniBand is emerging as a high-performance interconnect delivering low latency and high bandwidth. It is also getting widespread acceptance due to its open standard.
MVAPICH (pronounced “em-vah-pich”) is open-source MPI software that exploits the novel features and mechanisms of InfiniBand and other RDMA-enabled interconnects to deliver performance and scalability to MPI applications. This software is developed in the Network-Based Computing Laboratory (NBCL), headed by Prof. Dhabaleswar K. (DK) Panda.
Currently, there are two versions of this MPI: MVAPICH with MPI-1 semantics and MVAPICH2 with MPI-2 semantics. This open-source MPI software project started in 2001, and a first high-performance implementation was demonstrated at the Supercomputing ’02 conference. Since then, this software has been steadily gaining acceptance in the HPC and InfiniBand community. As of 05/29/2008, more than 690 organizations (National Labs, Universities, and Industry) in 41 countries have downloaded this software directly from OSU’s web site. In addition, many IBA vendors, server vendors, and systems integrators have been incorporating MVAPICH/MVAPICH2 into their software stacks and distributing it. Several InfiniBand systems using MVAPICH have obtained positions in the TOP 500 ranking. The current version of MVAPICH is also being made available with the OpenFabrics/Gen2 stack. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.
More details on the MVAPICH/MVAPICH2 software, the users list, sample performance numbers on a wide range of platforms and interconnects, a set of OSU benchmarks, related publications, and other InfiniBand-related projects (parallel file systems, storage, data centers) can be obtained from the following URL:
http://mvapich.cse.ohio-state.edu/
This document contains the information MVAPICH users need to download, install, test, use, and tune MVAPICH 1.0. As we get feedback from users and take care of bug fixes, we introduce new patches against our released distribution and continuously update this document. We therefore strongly recommend that you refer to our web page for updates.
This guide is designed to take the user through all the steps involved in configuring, installing, running and tuning MPI applications over InfiniBand using MVAPICH-1.0.
In Section 3 we describe all the features in MVAPICH 1.0. As you read through this section, please note our new features (highlighted as NEW). Some of these features are designed to optimize specific types of MPI applications and achieve greater scalability. Section 4 describes in detail the configuration and installation steps. This section enables the user to identify specific compilation flags which can be used to turn some of the features on or off. Usage instructions for MVAPICH are explained in Section 5. Apart from describing how to run simple MPI applications, this section also covers running MVAPICH with some of the advanced features. Section 6 describes the usage of the OSU Benchmarks. If you have any problems using MVAPICH, please check Section 7, where we list some of the common problems users face. In Section 8 we suggest some tuning techniques for multi-thousand node clusters using some of our new features. In Section 9, we list important run-time and compile-time parameters for the Gen2, VAPI, and QLogic devices, their default values and a short description of each parameter. Finally, Section 10 lists the parameters and tuning options for the OpenFabrics/Gen2-UD device.
MVAPICH (MPI-1 over InfiniBand) is an MPI-1 implementation based on MPICH and MVICH. MVAPICH 1.0 is available as a single integrated package (with the latest MPICH 1.2.7 and MVICH).
A complete set of features of MVAPICH 1.0 are:
This uDAPL support is generic and can work with other networks that provide a uDAPL interface. Please note that the stability and performance of MVAPICH with uDAPL depend on the stability and performance of the underlying uDAPL library being used.
The MVAPICH 1.0 package and the project also includes the following provisions:
The MVAPICH 1.0 source code package includes the latest MPICH 1.2.7 version and also the required MVICH files from LBNL. Thus, there is no need to download any other files except MVAPICH 1.0 source code.
You can go to the MVAPICH website to obtain the source code.
Untar the archive you have downloaded from the web page using the following command. You will have a directory named mvapich-1.0 after executing this command.
$ tar xzf mvapich-1.0.tar.gz
As we enhance and improve MVAPICH, we update the available source code in our public SVN repository. In order to obtain these updates, please install an SVN client on your machine. The latest MVAPICH sources may be obtained from the “trunk” of the SVN repository using the following command:
$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk
The “trunk” may contain newer features and bug fixes. However, it is likely to be lightly tested. If you are interested in obtaining stable and major bug fixes to any release version, you should update your sources from the “branch” of the SVN using the following command:
$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/branches/1.0
MVAPICH 1.0 provides support for eight different ADI devices: Gen2 Single-Rail (ch_gen2), Gen2 Multi-Rail (ch_gen2_multirail), Gen2/UD (ch_gen2_ud), Shared memory device (ch_smp), VAPI Single-Rail (vapi), VAPI Multi-Rail (vapi_multirail), uDAPL (udapl) and QLogic InfiniPath (psm). Additionally, you can configure MVAPICH over the standard TCP/IP interface and use it over IPoIB.
There are several options to build MVAPICH 1.0 based on the underlying InfiniBand libraries you want to utilize. In this section we describe in detail the steps you need to perform to correctly build MVAPICH on your choice of InfiniBand libraries, namely OpenFabrics/Gen2, OpenFabrics/Gen2-UD, Mellanox VAPI, uDAPL, Shared Memory or QLogic InfiniPath.
In the following subsection, we describe how to build and configure the Single-Rail device. In later subsections, we describe the building and configuration of the other devices: Multi-Rail with OpenFabrics/Gen2 (4.4.2), Gen2/UD (4.4.3), InfiniPath (4.4.4), VAPI-single-rail (4.4.5), VAPI-multi-rail (4.4.6), uDAPL (4.4.7), Shared memory (4.4.8) and TCP (4.4.9).
There are several methods to configure MVAPICH 1.0 with the single-rail device on OpenFabrics/Gen2.
After setting all the parameters, the script make.mvapich.gen2 configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 with multi-rail device on OpenFabrics Gen2.
After setting all the parameters, the script make.mvapich.gen2_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.
MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Please refer to Section 5.10 for more details.
There are several methods to configure MVAPICH 1.0 with the OpenFabrics/Gen2-UD device.
After setting all the parameters, the script make.mvapich.gen2_ud configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 with the QLogic InfiniPath (psm) device.
After setting all the parameters, the script make.mvapich.psm configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 on VAPI.
After setting all the parameters, the script make.mvapich.vapi configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 with multi-rail device.
After setting all the parameters, the script make.mvapich.vapi_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.
Before installing MVAPICH with uDAPL, please make sure you have the uDAPL library installed properly.
There are several methods to configure MVAPICH 1.0 with uDAPL.
After setting all the parameters, the script make.mvapich.udapl configures, builds and installs the entire package in the directory specified by the variable PREFIX.
In the mvapich-1.0 directory, we have provided a script make.mvapich.smp for building MVAPICH over shared memory intended for single SMP systems. The script make.mvapich.smp takes care of different platforms, compilers and architectures. By default, the compilation script uses gcc. In order to select your compiler, please set the variable CC in the script to use either Intel, PathScale or PGI compilers. The platform/architecture is detected automatically. The usage of the shared memory device can be found in 5.2.
In the mvapich-1.0 directory, we have provided a script make.mvapich.tcp for building MVAPICH over TCP/IP, intended for use over IPoIB (IP over InfiniBand). To select a compiler other than GCC, please set the CC variable in that script. Simply execute this script (e.g. ./make.mvapich.tcp) to complete your build.
This section discusses the usage methods for the various features provided by MVAPICH. If you face any problem while following these instructions, please refer to Section 7.
Use mpicc, mpif77, mpiCC, or mpif90 to compile applications. They can be found under mvapich-1.0/bin.
There are several options to run MPI applications. Please select one of the following options based on your need.
Prerequisites:
Examples of running programs using mpirun_rsh:
$ mpirun_rsh -np 4 n0 n1 n2 n3 ./cpi
The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. By default, ssh is used.
$ mpirun_rsh -rsh -np 4 n0 n1 n2 n3 ./cpi
The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. rsh is used regardless of whether ssh or rsh was selected when MVAPICH was compiled.
$ mpirun_rsh -np 4 -hostfile hosts ./cpi
The list of nodes is in the file hosts, one per line. MPI ranks are assigned in the order of the hosts listed in the hosts file or in the order they are passed to mpirun_rsh. For example, if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, rank 0 and rank 2, whereas n1 will have ranks 1 and 3. This rank distribution is known as “cyclic”. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as “block”.
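The rank assignment rule can be checked offline before launching a job. The following Python sketch is only an illustration of the rule stated above, not part of MVAPICH: the function names ranks_per_host, cyclic and block are hypothetical helpers introduced here.

```python
from collections import defaultdict


def ranks_per_host(hosts):
    """Map each host name to the MPI ranks it receives.

    Ranks are handed out in the order the hosts are listed, which is
    the rule described for mpirun_rsh host lists and host files.
    """
    placement = defaultdict(list)
    for rank, host in enumerate(hosts):
        placement[host].append(rank)
    return dict(placement)


def cyclic(nodes, procs_per_node):
    """Host list that repeats the node sequence: n0 n1 n0 n1 ..."""
    return list(nodes) * procs_per_node


def block(nodes, procs_per_node):
    """Host list that repeats each node in turn: n0 n0 n1 n1 ..."""
    return [node for node in nodes for _ in range(procs_per_node)]


if __name__ == "__main__":
    nodes = ["n0", "n1"]
    print("cyclic:", ranks_per_host(cyclic(nodes, 2)))
    print("block: ", ranks_per_host(block(nodes, 2)))
```

Writing the output of cyclic() or block() to a file, one host per line, gives a host file with the corresponding rank distribution.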
If you are using the shared memory device, then host names can be omitted:
$ mpirun_rsh -np 4 ./cpi
Many parameters of the MPI library can be configured at run time using environment variables. To pass an environment variable to the application, put the variable names and values just before the executable name, as in the following example:
$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./cpi
Note that the environment variables must be placed immediately before the executable.
Alternatively, you may set environment variables in your shell environment (e.g. in .bashrc). These will be picked up automatically when the application starts executing.
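For scripted job launches, the placement rule for NAME=value pairs can be encoded once in a small helper. The sketch below is a hedged illustration: the function names mpirun_rsh_command and format_command and their parameters are hypothetical, and only the -np and -hostfile options and the position of the NAME=value pairs are taken from the examples in this section.

```python
import shlex


def mpirun_rsh_command(nprocs, executable, hostfile=None, hosts=(),
                       env=None, exe_args=()):
    """Build an mpirun_rsh argument list.

    NAME=value pairs are placed immediately before the executable,
    after the process count and the host specification.
    """
    argv = ["mpirun_rsh", "-np", str(nprocs)]
    if hostfile is not None:
        argv += ["-hostfile", hostfile]
    else:
        argv += list(hosts)
    for name, value in (env or {}).items():
        argv.append(f"{name}={value}")
    argv.append(executable)
    argv += list(exe_args)
    return argv


def format_command(argv):
    """Render the argument list as a shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    cmd = mpirun_rsh_command(4, "./cpi", hostfile="hosts",
                             env={"ENV1": "value", "ENV2": "value"})
    print(format_command(cmd))
```

The resulting list can be passed to subprocess.run() unchanged, which avoids a second round of shell quoting.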
Please note that there are many different parameters which could be used to improve the performance of applications depending upon their requirements from the MPI library. For a discussion on how to identify which variables may be of interest to you, please take a look at Section 8.
Other options of mpirun_rsh can be obtained using
$ mpirun_rsh --help
SLURM is an open-source resource manager designed by Lawrence Livermore National Laboratory. The SLURM software package and its related documents can be downloaded from: http://www.llnl.gov/linux/slurm/
Once SLURM is installed and the daemons are started, applications compiled with MVAPICH can be launched by SLURM, e.g.
$ srun -n2 --mpi=mvapich ./a.out
The use of SLURM enables many good features such as explicit CPU and memory binding. For example, if you have two processes and want to bind the first process to CPU 0 and Memory 0, and the second process to CPU 4 and Memory 1, then it can be achieved by:
$ srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=mvapich ./a.out
For more information about SLURM and its features, please visit the SLURM website.
MVAPICH provides shared memory implementations of important collectives: MPI_Allreduce, MPI_Reduce, MPI_Barrier and MPI_Bcast. It also supports an enhanced MPI_Allgather. These collective operations are enabled by default. Shared memory collectives are supported over the Gen2, Gen2/UD, PSM and Shared Memory devices. The PSM device currently has shared memory implementations of MPI_Barrier and MPI_Bcast only.
These operations can be disabled all at once by setting VIADEV_USE_SHMEM_COLL to 0 or one at a time by using the following environment variables:
Please refer to section 9.8 for tuning the various environment variables.
MVAPICH provides shared library support. This feature allows you to build your application on top of the MPI shared library. If you choose this option, you will still be able to compile applications with static libraries. By default, however, when shared library support is enabled, your applications are built on top of the shared libraries automatically. The following commands provide some examples of how to build and run your application with shared library support.
$ mpicc -o cpi cpi.c
For example,
$ mpirun_rsh -np 2 n0 n1 LD_LIBRARY_PATH=$MVAPICH_BUILD/lib/shared ./cpi
Again, note that “LD_LIBRARY_PATH=path-to-shared-libraries” should be put immediately before the executable file.
$ mpicc -noshlib -o cpi cpi.c
MVAPICH contains optimized Lustre ADIO support for the OpenFabrics/Gen2 device. The Lustre directory should be mounted on all nodes on which MVAPICH processes will be running. Compile MVAPICH with ADIO support for Lustre as described in Section 4.4.1. If your Lustre mount is /mnt/datafs on nodes n0 and n1, on node n0, you can compile and run your program as follows:
$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 <path to perf>/perf -fname /mnt/datafs/testfile
If you have enabled support for multiple file systems, prepend the prefix “lustre:” to the name of the file. For example:
$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 ./perf -fname lustre:/mnt/datafs/testfile
MVAPICH provides TotalView support for the OpenFabrics/Gen2 (mpid/ch_gen2), OpenFabrics/Gen2-UD (mpid/ch_gen2_ud), Single-rail VAPI (mpid/vapi), InfiniPath (mpid/psm) and Shared-Memory (mpid/ch_smp) devices. You need to use mpirun_rsh when running TotalView. The following commands also provide an example of how to build and run your application with TotalView support. Note: running TotalView requires a correct setup in your environment. If you encounter any problem with your setup, please check with your system administrator for help.
$ mpirun_rsh -tv -np 2 n0 n1 LD_LIBRARY_PATH=$MVAPICH_BUILD/lib/shared:$MVAPICH_BUILD/lib prog
MVAPICH provides a multi-rail device with advanced scheduling policies for data transfer (Section 5.10). However, even with the single-rail configuration, multi-pathing (multiple ports, multiple adapters and multiple paths provided by the LMC mechanism) can be used on multi-core systems. With this support, processes executing on the same node can leverage these configurations by each binding to one of the available ports, adapters or paths. MVAPICH provides multiple choices to the user for leveraging this functionality, which are described in the upcoming examples. This functionality is currently available only in the single-rail gen2 device.
$ mpirun_rsh -np 4 n0 n0 n1 n1 VIADEV_USE_MULTIHCA=1 VIADEV_USE_MULTIPORT=1 ./cpi
The usage of multiple paths is disabled by default. Its usage can be controlled with the parameter VIADEV_USE_LMC (Section 9.2.6).
$ cat hosts
n0:mthca0:1
n0:mthca1:2
n1:mthca0:2
n1:mthca1:1
With this specification, process 0 would be bound to port 1 of adapter “mthca0”, process 1 to port 2 of adapter “mthca1”, and so on.
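A host file with adapter and port bindings can be sanity-checked before it is handed to mpirun_rsh. The Python sketch below parses entries of the form host:adapter:port and reports which process is bound where. The names HostEntry and parse_hostfile are hypothetical, and treating the adapter and port fields as optional is an assumption made for this illustration, not documented MVAPICH behavior.

```python
from typing import NamedTuple, Optional


class HostEntry(NamedTuple):
    rank: int            # MPI rank, assigned in file order
    host: str            # node name
    hca: Optional[str]   # adapter name, e.g. "mthca0" (None if absent)
    port: Optional[int]  # port number on that adapter (None if absent)


def parse_hostfile(text):
    """Parse host file text with one host[:adapter[:port]] entry per line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(":")
        host = parts[0]
        hca = parts[1] if len(parts) > 1 else None
        port = int(parts[2]) if len(parts) > 2 else None
        entries.append(HostEntry(len(entries), host, hca, port))
    return entries


if __name__ == "__main__":
    sample = "n0:mthca0:1\nn0:mthca1:2\nn1:mthca0:2\nn1:mthca1:1\n"
    for entry in parse_hostfile(sample):
        print(f"process {entry.rank}: {entry.host}, "
              f"adapter {entry.hca}, port {entry.port}")
```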
MVAPICH supports network fault recovery by using the InfiniBand Automatic Path Migration mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.
To enable this functionality, a run-time variable, VIADEV_USE_APM (section 9.2.8) can be enabled, as shown in the following example:
$ mpirun_rsh -np 2 VIADEV_USE_APM=1 ./cpi
MVAPICH also supports testing Automatic Path Migration in the subnet in the absence of network faults. This can be controlled by using a run-time variable VIADEV_USE_APM_TEST (section 9.2.9). This should be combined with VIADEV_USE_APM as follows:
$ mpirun_rsh -np 2 VIADEV_USE_APM=1 VIADEV_USE_APM_TEST=1 ./cpi
MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Run-time parameters are provided to control the policies, which are further divided into policies for small and large messages. These policies are available in the multi-rail devices for gen2 and VAPI.
Process-to-CPU mapping may affect application performance on multi-core systems, especially for memory-intensive applications. If the number of processes is smaller than the number of CPUs/cores, it is preferable to distribute the processes across different chips to avoid memory contention, because CPUs/cores on the same chip usually share the memory controller. MVAPICH provides flexible user-defined CPU mapping. To use it, first make sure CPU affinity is set (Section 9.6.5). Then use the run-time environment variable VIADEV_CPU_MAPPING to specify the CPU/core mapping. For example, on a system with two quad-core chips in which cores [0-3] are on one chip and cores [4-7] are on the other, and you need to run an application with 2 processes, the following mapping will give the best performance:
$ mpirun_rsh -np 2 n0 n0 VIADEV_CPU_MAPPING=0,4 ./a.out
In this case process 0 will be mapped to core 0 and process 1 will be mapped to core 4.
More information about VIADEV_CPU_MAPPING can be found in Section 9.6.6.
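The idea of spreading processes across chips can be expressed as a round-robin over the chips. The following Python sketch generates a mapping string in the comma-separated form used in the example above. The function name spread_mapping is hypothetical, the core numbering per chip must be supplied by the user, and the assumption that mappings for more than two processes use the same separator should be checked against Section 9.6.6.

```python
def spread_mapping(nprocs, chips):
    """Return a core list that places consecutive processes on different chips.

    chips is a list of core-id lists, one per chip, for example
    [[0, 1, 2, 3], [4, 5, 6, 7]]. Cores are taken round-robin across
    chips so that the first processes do not share a memory controller.
    """
    total = sum(len(cores) for cores in chips)
    if nprocs > total:
        raise ValueError("more processes than cores")
    order = []
    depth = 0  # index of the core to take from each chip in this pass
    while len(order) < nprocs:
        for cores in chips:
            if depth < len(cores) and len(order) < nprocs:
                order.append(cores[depth])
        depth += 1
    return ",".join(str(core) for core in order)


if __name__ == "__main__":
    chips = [[0, 1, 2, 3], [4, 5, 6, 7]]
    print("VIADEV_CPU_MAPPING=" + spread_mapping(2, chips))
```

For the two-chip layout described above and 2 processes, the helper yields the same value as the example, 0,4.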
If you have arrived at this point, you have successfully installed MVAPICH. Congratulations! In the mvapich-1.0/osu_benchmarks directory, we provide five basic performance tests: a one-way latency test, a uni-directional bandwidth test, a bi-directional bandwidth test, a multiple bandwidth/message rate test, and an MPI-level broadcast latency test. You can compile and run these tests on your machines to evaluate the basic performance of MVAPICH.
These benchmarks, as well as other benchmarks (such as those for one-sided operations in MPI-2), are available on our project web page. Sample performance numbers for these benchmarks on representative platforms and IBA gear are also included there. You are welcome to compare your performance numbers with ours. If you see any large discrepancy, please let the MVAPICH community know by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.
Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact the MVAPICH community by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.
MVAPICH can be used over multiple underlying InfiniBand libraries, namely OpenFabrics (Gen2), OpenFabrics (Gen2-UD), VAPI, uDAPL and QLogic InfiniPath. The troubleshooting steps may differ depending on the underlying library being used. However, some of the troubleshooting hints are common to all underlying libraries. Thus, in this section, we have divided the troubleshooting tips into four sections: General troubleshooting and Troubleshooting over any one of the three InfiniBand libraries.
Running the following command will provide you with the version of MVAPICH that is being used.
$ mpirun_rsh -v
fork() and system() are supported for the Gen2 and Gen2-UD devices as long as the kernel being used is Linux 2.6.16 or newer. Additionally, the version of OFED used should be 1.2 or higher. The environment variable IBV_FORK_SAFE=1 must also be set to enable fork support.
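The three conditions above can be verified with a short pre-flight check. The Python sketch below compares version strings against the stated minimums and looks for IBV_FORK_SAFE in a given environment. The function names version_tuple and fork_prerequisites are hypothetical, and the OFED version is assumed to be supplied by the caller, since how it is obtained depends on the installation.

```python
import os
import platform
import re


def version_tuple(text):
    """Turn '2.6.18-92.el5' into (2, 6, 18); digits after '-' are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text.split("-")[0]))


def fork_prerequisites(kernel_release, ofed_version, env):
    """Return a list of unmet conditions for fork()/system() support."""
    problems = []
    if version_tuple(kernel_release) < (2, 6, 16):
        problems.append("kernel older than 2.6.16")
    if version_tuple(ofed_version) < (1, 2):
        problems.append("OFED older than 1.2")
    if env.get("IBV_FORK_SAFE") != "1":
        problems.append("IBV_FORK_SAFE=1 is not set")
    return problems


if __name__ == "__main__":
    # The OFED version "1.2" is a placeholder; substitute the installed one.
    issues = fork_prerequisites(platform.release(), "1.2", os.environ)
    print("fork support prerequisites met" if not issues else issues)
```

Note that IBV_FORK_SAFE must be visible to the MPI processes themselves, so it is typically passed on the mpirun_rsh command line or set in the shell environment on all nodes.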
This is a common symptom of several setup issues related to job startup. Please make sure of the following things:
MVAPICH implements highly optimized RDMA collective algorithms for frequently used collectives such as MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Bcast and MPI_Allgather. The optimized implementations have been well tested and tuned. However, if you face any problems with these collectives in your application, please disable the optimized collectives. For example, to disable the optimized MPI_Allreduce, you can run:
$ mpirun_rsh -np 8 -hostfile hf VIADEV_USE_SHMEM_ALLREDUCE=0 ./a.out
The complete list of all such parameters is given in Section 9.8.
The gfortran compiler can be used for F77 and F90. In order to make this work, the following environment variables should be set prior to running the build script:
$ export F77=gfortran
$ export F90=gfortran
$ export F77_GETARGDECL=" "
If g77 and gfortran are used together for F77 and F90 respectively, it might be necessary to set the following environment variable in order to get around possible compatibility issues:
$ export F90FLAGS="-ff2c"
MPI programs built with gfortran might not appear to run correctly due to the default output buffering used by gfortran. If it seems there is an issue with program output, the GFORTRAN_UNBUFFERED_ALL variable can be set to “y” when using mpirun_rsh to fix the problem. Running the pi3f90 example program using this variable setting is shown below:
$ mpirun_rsh -np 2 n1 n2 GFORTRAN_UNBUFFERED_ALL=y ./pi3f90
If a debug session is terminated with an alarm message, mpirun_rsh may have timed out waiting for the job launch to complete. Use a larger MPIRUN_TIMEOUT (Section 9.1.2) to work around this problem.
If an application task terminates unexpectedly during job launch, mpirun_rsh may print the message:
mpispawn.c:303 Unexpected exit status
This usually indicates a problem with the application. Other error messages around this (if any) might point to the actual issue.
If mpirun_rsh fails with this error message, it was unable to locate a necessary utility. This can be fixed by ensuring that all MVAPICH executables are in the PATH on all nodes.
If the PATH cannot be set up as described, then invoke mpirun_rsh with an explicit path. For example:
/path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc
or
../../path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc
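Whether the required utilities are visible in the PATH can be checked with a few lines of Python on each node. This is a sketch under stated assumptions: the function name missing_executables is hypothetical, and the utility names passed to it are examples. mpirun_rsh comes from this guide, while mpispawn is inferred from the error message shown earlier and may not match the name of the installed binary.

```python
import shutil


def missing_executables(names, path=None):
    """Return the names that cannot be found as executables in PATH.

    path is an optional PATH-style string; when omitted, the PATH of
    the current environment is searched.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Example utility names; adjust to the executables in mvapich-1.0/bin.
    wanted = ["mpirun_rsh", "mpispawn"]
    missing = missing_executables(wanted)
    if missing:
        print("not found in PATH:", ", ".join(missing))
    else:
        print("all utilities found")
```

Because the check only inspects the local environment, it needs to be run on every node, for example through the same ssh or rsh setup that mpirun_rsh uses.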
If you are using ADIO support for Lustre, please make sure that:
– Lustre is set up correctly, and you are able to create, read from and write to files in the Lustre mounted directory.
– The Lustre directory is mounted on all nodes on which MVAPICH processes with ADIO support for Lustre are running.
– The path to the file is correctly specified.
– The permissions for the file or directory are correctly specified.
If you are using ADIO support for Lustre, recent Lustre releases require an additional mount option to obtain correct file locks. Please include the following option with your Lustre mount command: “-o localflock”. For example:
$ mount -o localflock -t lustre xxxx@o2ib:/datafs /mnt/datafs
MVAPICH uses CPU affinity to obtain better performance for single-threaded programs. For multi-threaded programs, such as those using the MPI+OpenMP model, CPU affinity may cause all the threads of a process to be scheduled on the same CPU. CPU affinity should be disabled in this case, e.g.
$ mpirun_rsh -np 2 n1 n2 VIADEV_USE_AFFINITY=0 ./a.out
More information about CPU affinity and CPU binding can be found in Sections 9.6.5 and 9.6.6.
Please replace the -Wl,-rpath option in the build scripts (e.g. make.mvapich.gen2) with -R when Sun Studio 12 compilers are used.
Several well-known MPICH related problems on different platforms and environments have already been identified by Argonne. They are available on the MPICH patch webpage.
In this section, we discuss the general error conditions for MVAPICH based on OpenFabrics Gen2.
This error is generated by MVAPICH when it cannot find any Gen2 InfiniBand devices. If you are experiencing this error, please make sure that your Gen2 installation is set up correctly. You can do so by doing the following: