MVAPICH2-X 1.9a User Guide
Last revised: October 8, 2012
Message Passing Interface (MPI) has been the most popular programming model for developing parallel scientific applications. Partitioned Global Address Space (PGAS) programming models are an attractive alternative for designing applications with irregular communication patterns. They improve programmability by providing a shared memory abstraction while exposing the locality control required for performance. It is widely believed that a hybrid programming model (MPI+X, where X is a PGAS model) is optimal for many scientific computing problems, especially for exascale computing.
MVAPICH2-X provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port the parts of large MPI applications that are suited to the PGAS programming model. This minimizes the development overhead that has been a huge deterrent to porting MPI applications to PGAS models. The unified runtime also delivers superior performance compared to using separate MPI and OpenSHMEM libraries, by optimizing the use of network and memory resources. MVAPICH2-X supports pure MPI programs, MPI+OpenMP programs, pure PGAS programs as well as hybrid MPI(+OpenMP) + PGAS programs.
MVAPICH2-X derives from the popular MVAPICH2 library and inherits many of its features for performance and scalability of MPI communication. It takes advantage of the RDMA features offered by the InfiniBand interconnect to support OpenSHMEM data transfer and atomic operations. It also provides a high-performance shared memory channel for multi-core InfiniBand clusters.
The current version, MVAPICH2-X 1.9a, supports only OpenSHMEM as the PGAS programming model. The MPI implementation of MVAPICH2-X is based on MVAPICH2. It supports all MPI-1 functionality and is compliant with the latest MPI 2.2 standard. The OpenSHMEM implementation is compliant with the OpenSHMEM v1.0 standard and is based on the OpenSHMEM Reference Implementation v1.0c. The current release supports the InfiniBand transport interface (inter-node) and a shared memory interface (intra-node). The overall architecture of MVAPICH2-X is shown in Figure 1.
This document contains necessary information for users to download, install, test, use, tune and troubleshoot MVAPICH2-X 1.9a. We continuously fix bugs and update this document as per user feedback. Therefore, we strongly encourage you to refer to our web page for updates.
MVAPICH2-X supports pure MPI programs, MPI+OpenMP programs, pure PGAS programs as well as hybrid MPI(+OpenMP) + PGAS programs. The current version, MVAPICH2-X 1.9a, supports OpenSHMEM as the PGAS model. High-level features of MVAPICH2-X are listed below.
Hybrid Program Features
Unified Runtime Features
The MVAPICH2-X package can be downloaded from here. Select the link for your distro. As an initial technology preview, we are providing RHEL6 and RHEL5 RPMs. These RPMs contain the MVAPICH2-X software built against OFED-1.5.4 on the corresponding distro.
Below are the steps to download and install MVAPICH2-X RPMs for RHEL6:
$ wget http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2
$ tar xzf mvapich2-x-1.9a.rhel6.tar.gz
$ cd mvapich2-x-1.9a.rhel6
Running the install.sh script will install the software in /opt/mvapich2-x. The /opt/mvapich2-x/gnu directory contains the software built using the GCC compiler distributed with RHEL6. The /opt/mvapich2-x/intel directory contains the software built using the Intel 12 compilers. The install.sh script runs the following command:
rpm -Uvh *.rpm --force --nodeps
This will upgrade any prior versions of MVAPICH2-X that may be present as well as ignore any dependency issues that may be present with the Intel rpms (some dependencies are available after sourcing the env scripts provided by the intel compiler).
Advanced users may skip the install.sh script and directly use the appropriate commands to install the desired rpms.
Please email us at email@example.com if your distro does not appear on the list or if you experience any trouble installing the package on your system.
MVAPICH2-X supports MPI applications, PGAS (OpenSHMEM) applications and hybrid (MPI+OpenSHMEM) applications. Please use mpicc for compiling MPI applications; alternatively, oshcc can also be used. OpenSHMEM applications can be compiled using oshcc. Hybrid applications should be compiled using oshcc. Both oshcc and mpicc can be found in the <MVAPICH2-X_INSTALL>/bin folder.
Below are examples to build MPI applications using mpicc:
$ mpicc -o test test.c
This command compiles the test.c program into the executable test using mpicc.
$ mpicc -fopenmp -o hybrid mpi_openmp_hybrid.c
This command compiles an MPI+OpenMP program mpi_openmp_hybrid.c into the executable hybrid using mpicc, when MVAPICH2-X is built with the GCC compiler. For Intel compilers, use -openmp instead of -fopenmp; for PGI compilers, use -mp instead of -fopenmp.
Below is an example to build an MPI, an OpenSHMEM, or a hybrid application using oshcc:
$ oshcc -o test test.c
This command compiles the test.c program into the executable test using oshcc.
For MPI+OpenMP hybrid programs, add compile flags -fopenmp, -openmp or -mp according to different compilers, as mentioned in mpicc usage examples.
This section provides instructions on how to run applications with MVAPICH2-X. Please note that on new multi-core architectures, process-to-core placement has an impact on performance. MVAPICH2-X inherits its process-to-core binding capabilities from MVAPICH2. Please refer to the MVAPICH2 User Guide for process mapping options on multi-core nodes.
The MVAPICH team recommends this mode of job start-up. mpirun_rsh provides fast and scalable job start-up, and it scales to multi-thousand-node clusters. It can be used to launch MPI, OpenSHMEM and hybrid applications.
Jobs can be launched using mpirun_rsh by specifying the target nodes as part of the command as shown below:
$ mpirun_rsh -np 4 n0 n0 n1 n1 ./test
This command launches test on nodes n0 and n1, two processes per node. By default ssh is used.
$ mpirun_rsh -rsh -np 4 n0 n0 n1 n1 ./test
This command launches test on nodes n0 and n1, two processes per node, using rsh instead of ssh. The target nodes can also be specified using a hostfile.
$ mpirun_rsh -np 4 -hostfile hosts ./test
The list of target nodes must be provided in the file hosts, one per line. MPI or OpenSHMEM ranks are assigned in the order the hosts are listed in the hosts file or in the order they are passed to mpirun_rsh. That is, if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, ranks 0 and 2, whereas n1 will have ranks 1 and 3. This rank distribution is known as “cyclic”. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as “block”.
The mpirun_rsh hostfile format allows users to specify a multiplier to reduce redundancy, as well as the HCA to be used for communication. The multiplier saves typing by letting you specify a blocked distribution of MPI ranks with one line per hostname. The HCA specification lets you force an MPI rank to use a particular HCA. The optional components are delimited by a ‘:’. Comments and empty lines are also allowed; comments start with ‘#’ and continue to the next newline. Below are a few examples of the hostfile format:
$ cat hosts
# sample hostfile for mpirun_rsh
host1 # rank 0 will be placed on host1
host2:2 # rank 1 and 2 will be placed on host 2
host3:hca1 # rank 3 will be on host3 and will use hca1
host4:4:hca2 # ranks 4 through 7 will be on host4 and use hca2
# if the number of processes specified for this job is greater than 8
# then the additional ranks will be assigned to the hosts in a cyclic
# fashion. For example, rank 8 will be on host1 and ranks 9 and 10
# will be on host2.
Many parameters of the MPI library can be configured at run time using environment variables. To pass an environment variable to the application, simply put the variable names and values just before the executable name, as in the following example:
$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./test
Note that the environment variables should be placed immediately before the executable. Alternatively, you may also place environment variables in your shell environment (e.g. .bashrc). These will be automatically picked up when the application starts executing.
MVAPICH2-X provides oshrun and can be used to launch applications as shown below.
$ oshrun -np 2 ./test
This command launches two processes of test on the localhost. A list of target nodes where the processes should be launched can be provided in a hostfile and can be used as shown below. The oshrun hostfile can be in one of the two formats outlined for mpirun_rsh earlier in this document.
$ oshrun -f hosts -np 2 ./test
MVAPICH2-X also distributes the Hydra process manager along with mpirun_rsh. Hydra can be used through either mpiexec or mpiexec.hydra. The following is an example of running a program using it:
$ mpiexec -f hosts -n 2 ./test
This process manager has many features. Please refer to the following web page for more details.
MVAPICH2-X supports hybrid programming models. Applications can be written using both MPI and PGAS constructs. Rather than using a separate runtime for each of the programming models, MVAPICH2-X supports hybrid programming using a unified runtime and thus provides better resource utilization and superior performance.
A simple example of a hybrid (MPI+OpenSHMEM) program is shown below. It uses both MPI and OpenSHMEM constructs to print the sum of the ranks of all processes.
1 #include <stdio.h>
2 #include <shmem.h>
3 #include <mpi.h>
4
5 static int sum = 0;
6 int main(int c, char *argv[])
7 {
8     int rank, size;
9
10    /* SHMEM init */
11    start_pes(0);
12
13    /* get rank and size */
14    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
15    MPI_Comm_size(MPI_COMM_WORLD, &size);
16
17    /* SHMEM barrier */
18    shmem_barrier_all();
19
20    /* fetch-and-add at root */
21    shmem_int_fadd(&sum, rank, 0);
22
23    /* MPI barrier */
24    MPI_Barrier(MPI_COMM_WORLD);
25
26    /* root broadcasts sum */
27    MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);
28
29    /* print sum */
30    fprintf(stderr, "(%d): Sum: %d\n", rank, sum);
31
32
33    return 0;
34 }
start_pes in line 10 initializes the runtime for MPI and OpenSHMEM communication. An explicit call to MPI_Init is not required. The program uses the MPI calls MPI_Comm_rank and MPI_Comm_size to get the process rank and size, respectively (lines 14-15). MVAPICH2-X assigns the same rank for the MPI and PGAS models. Thus, the OpenSHMEM constructs _my_pe and _num_pes can alternatively be used to get the rank and size, respectively. In line 17, every process does a barrier using the OpenSHMEM construct shmem_barrier_all.
After this, every process does a fetch-and-add of the rank to the variable sum in process 0. The sample program uses OpenSHMEM construct shmem_int_fadd (line 21) for this. Following the fetch-and-add, every process does a barrier using MPI_Barrier (line 24). Process 0 then broadcasts sum to all processes using MPI_Bcast (line 27). Finally, all processes print the variable sum. Explicit MPI_Finalize is not required.
The program outputs the following for a four-process run:
$ mpirun_rsh -np 4 -hostfile ./hostfile ./hybrid_mpi_shmem
(0): Sum: 6
(1): Sum: 6
(2): Sum: 6
(3): Sum: 6
The above sample hybrid program is available at <MVAPICH2-X_INSTALL>/<gnu|intel>/share.
We have extended the OSU Micro Benchmark (OMB) suite with tests to measure performance of OpenSHMEM operations. OSU Microbenchmarks (OMB-3.7) have OpenSHMEM data movement and atomic operation benchmarks. The complete benchmark suite is available along with MVAPICH2-X binary package, in the folder: <MVAPICH2-X_INSTALL>/libexec/osu-micro-benchmarks. A brief description for each of the newly added benchmarks is provided below.
Put Latency (osu_oshm_put):
This benchmark measures latency of a shmem_putmem operation for different data sizes. The user is required to select whether the communication buffers should be allocated in global memory or heap memory, through a parameter. The test requires exactly two PEs. PE 0 issues shmem_putmem to write data at PE 1 and then calls shmem_quiet. This is repeated for a fixed number of iterations, depending on the data size. The average latency per iteration is reported. A few warm-up iterations are run without timing to ignore any start-up overheads. Both PEs call shmem_barrier_all after the test for each message size.
Get Latency (osu_oshm_get):
This benchmark is similar to the one above except that PE 0 does a shmem_getmem operation to read data from PE 1 in each iteration. The average latency per iteration is reported.
Put Operation Rate (osu_oshm_put_mr):
This benchmark measures the aggregate uni-directional operation rate of OpenSHMEM Put between pairs of PEs, for different data sizes. The user should select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. This test requires the number of PEs to be even. The PEs are paired, with PE 0 pairing with PE n/2 and so on, where n is the total number of PEs. The first PE in each pair issues back-to-back shmem_putmem operations to its peer PE. The total time for the put operations is measured and the operation rate per second is reported. All PEs call shmem_barrier_all after the test for each message size.
Atomics Latency (osu_oshm_atomics):
This benchmark measures the performance of the atomic fetch-and-operate and atomic operate routines supported in OpenSHMEM for the integer datatype. The buffers can be selected to be in heap memory or global memory. The PEs are paired as in the Put Operation Rate benchmark, and the first PE in each pair issues back-to-back atomic operations of a given type to its peer PE. The average latency per atomic operation and the aggregate operation rate are reported. This is repeated for each of the fadd, finc, add, inc, cswap and swap routines.
MVAPICH2-X supports all the runtime parameters of MVAPICH2 (OFA-IB-CH3). A comprehensive list of all runtime parameters of MVAPICH2 1.9a can be found in its User Guide. Runtime parameters specific to MVAPICH2-X are listed below.
Enable/Disable shared memory scheme for intra-node communication.
Set OpenSHMEM Symmetric Heap Size
Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact us by sending an email to firstname.lastname@example.org.
The GNU oshfort gives a compilation error for Fortran OpenSHMEM applications. This is because of a typo in the oshfort script. Kindly replace mpicc inside this script with mpif90 to fix this issue. This will be fixed in the next release.
Currently, compiling Fortran applications with Intel oshfort results in linker errors. This is because of a limitation inherited from OpenSHMEM reference implementation stack. This will be corrected in future releases.