MVAPICH2 1.0.3 User Guide

MVAPICH Team
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
http://mvapich.cse.ohio-state.edu
Copyright ©2003-2008
Network-Based Computing Laboratory,
headed by Dr. D. K. Panda.
All rights reserved.

Last revised: July 2, 2008

Contents

1 Overview of the Open-Source MVAPICH Project
2 How to use this User Guide?
3 MVAPICH2 1.0 Features
4 Installation Instructions
 4.1 Download MVAPICH2 source code
 4.2 Prepare MVAPICH2 source code
 4.3 Downloading MVAPICH2 Source Code from Anonymous SVN
 4.4 Build MVAPICH2
  4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP
  4.4.2 Build MVAPICH2 with uDAPL
  4.4.3 Build MVAPICH2 with VAPI
  4.4.4 Build MVAPICH2 with TCP/IPoIB
5 Basic Usage Instructions
 5.1 Compile MPI Applications
 5.2 Setting MPD Environment
 5.3 Run MPI Applications Using mpiexec with OpenFabrics Gen2-IB or VAPI Device
  5.3.1 Run MPI Applications using Shared Library Support
  5.3.2 Run MPI Application using TotalView Debugger Support
 5.4 Run MPI Application using OpenFabrics Gen2-iWARP Device
 5.5 Run MPI Application using mpiexec with uDAPL Device
 5.6 Run MPI Application using mpiexec with TCP/IP
  5.6.1 The MPI Job Uses IPoIB
  5.6.2 Both MPD And the MPI Job Use IPoIB
 5.7 Run MPI Applications using SLURM
6 Advanced Usage Instructions
 6.1 Run MPI applications on Multi-Rail Configurations (for OpenFabrics Gen2-IB and Gen2-iWARP Devices)
 6.2 Run MPI application with Customized Optimizations (for OpenFabrics Gen2-IB and Gen2-iWARP Devices)
 6.3 Run MPI application with Checkpoint/Restart Support (for OpenFabrics Gen2-IB Device)
 6.4 Run MPI application with RDMA CM support (for OpenFabrics Gen2-IB and Gen2-iWARP Devices)
 6.5 Run MPI application with Shared Memory Collectives
 6.6 Run MPI Application with Hot-Spot and Congestion Avoidance (for OpenFabrics Gen2-IB Device)
 6.7 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics Gen2-IB Device)
7 Using OSU Benchmarks
8 FAQ and Troubleshooting with MVAPICH2
 8.1 General Questions and Troubleshooting
  8.1.1 Invalid Communicators Error
  8.1.2 Are fork() and system() supported?
  8.1.3 Cannot Build with the PathScale Compiler
  8.1.4 Cannot find mpd.conf
  8.1.5 sched_setaffinity: Bad address
  8.1.6 Multi-threaded programs seem to run sequentially
  8.1.7 Running MPI programs built with gfortran
 8.2 With Gen2 Interface
  8.2.1 Cannot Open HCA
  8.2.2 Checking state of IB Link
  8.2.3 Undefined reference to ibv_get_device_list
  8.2.4 Creation of CQ or QP failure
  8.2.5 Hang with the HSAM Functionality
  8.2.6 Failure with Automatic Path Migration
  8.2.7 Error opening file
  8.2.8 RDMA CM Address error
  8.2.9 RDMA CM Route error
 8.3 With Gen2-iWARP Interface
  8.3.1 Error opening file
  8.3.2 RDMA CM Address error
  8.3.3 RDMA CM Route error
 8.4 With VAPI Interface
  8.4.1 Cannot pass MPI_Init
  8.4.2 Cannot Open HCA
  8.4.3 Cannot include vapi.h
  8.4.4 VAPI_RETRY_EXEC_ERROR
  8.4.5 ld:multiple definitions of symbol _calloc error on MacOS
  8.4.6 No Fortran interface on the MacOS platform
 8.5 With UDAPL Interface
  8.5.1 Cannot Open IA
  8.5.2 DAT Insufficient Resource
  8.5.3 Cannot find libdat.so
  8.5.4 Cannot find mpd.conf
 8.6 The MPD mpiexec fails with “no msg recvd from mpd when expecting ack of request.”
 8.7 Checkpoint/Restart
9 Scalable features for Large Scale Clusters and Performance Tuning
 9.1 RDMA Based Point-to-Point tuning
 9.2 Shared Receive Queue (SRQ) Tuning
 9.3 Shared Memory Tuning
 9.4 On-demand Connection Management Tuning
10 MVAPICH2 Parameters
 10.1 MV2_CKPT_FILE
 10.2 MV2_CKPT_INTERVAL
 10.3 MV2_CKPT_MAX_SAVE_CKPTS
 10.4 MV2_CKPT_MPD_BASE_PORT
 10.5 MV2_CKPT_MPIEXEC_PORT
 10.6 MV2_CKPT_NO_SYNC
 10.7 MV2_CM_RECV_BUFFERS
 10.8 MV2_CM_SPIN_COUNT
 10.9 MV2_CM_TIMEOUT
 10.10 MV2_DAPL_PROVIDER
 10.11 MV2_DEFAULT_MTU
 10.12 MV2_ENABLE_AFFINITY
 10.13 MV2_GET_FALLBACK_THRESHOLD
 10.14 MV2_IBA_EAGER_THRESHOLD
 10.15 MV2_INITIAL_PREPOST_DEPTH
 10.16 MV2_MPD_RECVTIMEOUT_MULTIPLIER
 10.17 MV2_NDREG_ENTRIES
 10.18 MV2_NUM_HCAS
 10.19 MV2_NUM_PORTS
 10.20 MV2_NUM_QP_PER_PORT
 10.21 MV2_NUM_RDMA_BUFFER
 10.22 MV2_ON_DEMAND_THRESHOLD
 10.23 MV2_PREPOST_DEPTH
 10.24 MV2_PUT_FALLBACK_THRESHOLD
 10.25 MV2_RDMA_CM_ARP_TIMEOUT
 10.26 MV2_RNDV_PROTOCOL
 10.27 MV2_R3_THRESHOLD
 10.28 MV2_R3_NOCACHE_THRESHOLD
 10.29 MV2_SHMEM_COLL_MAX_MSG_SIZE
 10.30 MV2_SHMEM_COLL_NUM_COMM
 10.31 MV2_SRQ_LIMIT
 10.32 MV2_SRQ_SIZE
 10.33 MV2_USE_APM
 10.34 MV2_USE_APM_TEST
 10.35 MV2_USE_BLOCKING
 10.36 MV2_USE_COALESCE
 10.37 MV2_USE_HSAM
 10.38 MV2_USE_IWARP_MODE
 10.39 MV2_USE_LAZY_MEM_UNREGISTER
 10.40 LAZY_MEM_UNREGISTER
 10.41 MV2_USE_RDMA_CM
 10.42 MV2_USE_RDMA_FAST_PATH
 10.43 MV2_USE_RDMA_ONE_SIDED
 10.44 MV2_USE_RING_STARTUP
 10.45 MV2_USE_SHARED_MEM
 10.46 MV2_USE_SHMEM_ALLREDUCE
 10.47 MV2_USE_SHMEM_BARRIER
 10.48 MV2_USE_SHMEM_COLL
 10.49 MV2_USE_SHMEM_REDUCE
 10.50 MV2_USE_SRQ
 10.51 MV2_VBUF_POOL_SIZE
 10.52 MV2_VBUF_SECONDARY_POOL_SIZE
 10.53 MV2_VBUF_TOTAL_SIZE
 10.54 SMP_EAGERSIZE
 10.55 SMPI_LENGTH_QUEUE
 10.56 SMP_NUM_SEND_BUFFER
 10.57 SMP_SEND_BUF_SIZE

1 Overview of the Open-Source MVAPICH Project

InfiniBand and 10GbE/iWARP are emerging as high-performance interconnects delivering low latency and high bandwidth. They are also gaining widespread acceptance due to their open standards.

MVAPICH (pronounced as “em-vah-pich”) is open-source MPI software that exploits the novel features and mechanisms of InfiniBand, iWARP and other RDMA-enabled interconnects to deliver the best performance and scalability to MPI applications. This software is developed in the Network-Based Computing Laboratory (NBCL), headed by Prof. Dhabaleswar K. (DK) Panda.

Currently, there are two versions of this MPI: MVAPICH with MPI-1 semantics and MVAPICH2 with MPI-2 semantics. This open-source MPI software project started in 2001, and a first high-performance implementation was demonstrated at the Supercomputing ’02 conference. Since then, this software has been steadily gaining acceptance in the HPC and InfiniBand community. As of June 10th, 2008, more than 700 organizations (National Labs, Universities and Industry) worldwide have downloaded this software directly from OSU’s web site. In addition, many InfiniBand and iWARP vendors, server vendors, and systems integrators have been incorporating MVAPICH/MVAPICH2 into their software stacks and distributing it. Several InfiniBand systems using MVAPICH/MVAPICH2 have obtained positions in the TOP 500 ranking. MVAPICH and MVAPICH2 are also available with the Open Fabrics Enterprise Distribution (OFED) stack. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.

More details on the MVAPICH/MVAPICH2 software, users list, mailing lists, sample performance numbers on a wide range of platforms and interconnects, a set of OSU benchmarks, related publications, and other InfiniBand- and iWARP-related projects (parallel file systems, storage, data centers) can be obtained from the following URL:

http://mvapich.cse.ohio-state.edu

This document contains the information MVAPICH2 users need to download, install, test, use, tune and troubleshoot MVAPICH2 1.0.3. As we receive feedback from users and fix bugs, we release new tarballs and continuously update this document. We therefore strongly request that you refer to our web page for updates.

2 How to use this User Guide?

This guide is designed to take the user through all the steps involved in configuring, installing, running and tuning MPI applications over InfiniBand using MVAPICH2 1.0.3.

In Section 3 we describe all the features in MVAPICH2 1.0.3. As you read through this section, please note our new features (highlighted as NEW) in the 1.0 series. Some of these features are designed to optimize specific types of MPI applications and achieve greater scalability. Section 4 describes in detail the configuration and installation steps. This section enables the user to identify specific compilation flags which can be used to turn some of the features on or off. Basic usage of MVAPICH2 is explained in Section 5. Section 6 provides instructions for running MVAPICH2 with some of the advanced features. Section 7 describes the usage of the OSU Benchmarks. If you have any problems using MVAPICH2, please check Section 8, where we list some of the common problems people face. In Section 9 we suggest some tuning techniques for multi-thousand node clusters using some of our new features. Finally, in Section 10 we list all important run-time parameters, their default values and a short description of each parameter.

3 MVAPICH2 1.0 Features

MVAPICH2 (MPI-2 over InfiniBand) is an MPI-2 implementation based on the MPICH2 ADI3 layer. It also supports all MPI-1 functionality. MVAPICH2 1.0.3 is available as a single integrated package (with MPICH2 1.0.5p4).

The current release supports the following five underlying transport interfaces: OpenFabrics Gen2-IB, OpenFabrics Gen2-iWARP, uDAPL, VAPI, and TCP/IP.

MVAPICH2 1.0.3 delivers the same level of performance as MVAPICH 1.0.1, the latest release package of MVAPICH supporting the MPI-1 standard. In addition, MVAPICH2 1.0.3 provides support and optimizations for other MPI-2 features, multi-threading and fault tolerance (checkpoint/restart).

The complete set of features of MVAPICH2 1.0.3 is:

The MVAPICH2 1.0.3 package and the project also include the following provisions:

4 Installation Instructions

4.1 Download MVAPICH2 source code

The MVAPICH2 1.0.3 source code package includes the MPICH2 1.0.5p4 version. All the required files are present as a single tarball.

You can go to the MVAPICH2 website to obtain the source code.

4.2 Prepare MVAPICH2 source code

Untar the archive you have downloaded from the web page. For version 1.0.3, the following command works on most Unix machines:

$ tar xzf mvapich2-1.0.3.tar.gz

You will have a directory named mvapich2-1.0.3.

4.3 Downloading MVAPICH2 Source Code from Anonymous SVN

The MVAPICH2 source code is also available for download through a public SVN:

4.4 Build MVAPICH2

There are several options to build MVAPICH2 1.0.3 based on the underlying InfiniBand libraries you want to utilize. In this section we describe in detail the steps you need to perform to correctly build MVAPICH2 on your choice of libraries, namely OpenFabrics Gen2-IB, OpenFabrics Gen2-iWARP, Mellanox VAPI, uDAPL and TCP/IP.

4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP

There are several methods to configure MVAPICH2 and optimize it for a given platform.

After setting all the parameters, the script make.mvapich2.ofa configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.2 Build MVAPICH2 with uDAPL

There are several methods to configure MVAPICH2 and optimize it for a given platform.

After setting all the parameters, the script make.mvapich2.udapl configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.3 Build MVAPICH2 with VAPI

There are several methods to configure MVAPICH2 and optimize it for a given platform.

After setting all the parameters, the script make.mvapich2.vapi configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.4 Build MVAPICH2 with TCP/IPoIB

Go to the mvapich2-1.0.3 directory. We have provided a script, make.mvapich2.tcp, for building MVAPICH2 over TCP/IP, intended for use over IPoIB (IP over InfiniBand). By default, the compilation script uses gcc. In order to use a specific compiler, set the CC, CXX, F77, and F90 variables to the appropriate commands. MVAPICH2 currently supports the GCC, Intel, PathScale, PGI, and Sun Studio compiler suites for TCP/IPoIB. Simply execute this script (e.g. ./make.mvapich2.tcp) to complete your build.

5 Basic Usage Instructions

5.1 Compile MPI Applications

MVAPICH2 provides a variety of MPI compilers to support applications written in different programming languages. Please use mpicc, mpif77, mpiCC, or mpif90 to compile applications. The correct compiler should be selected depending upon the programming language of your MPI application.

These compilers are available in the MVAPICH2_HOME/bin directory. The MVAPICH2 installation directory can also be specified by modifying $PREFIX, in which case all the above compilers will be present in the $PREFIX/bin directory.

5.2 Setting MPD Environment

Prerequisites: ssh should be enabled between the front nodes and the computing nodes.

Please follow these steps to set up MPD:

This should list all the nodes specified in the hostfile, not necessarily in the order specified in the hostfile.

So far, we have described how to set up the environment, which is independent of the underlying device supported by MVAPICH2. In the next sections, we present details specific to different devices.

5.3 Run MPI Applications Using mpiexec with OpenFabrics Gen2-IB or VAPI Device

To start multiple processes, mpiexec can be used in the following fashion:

$ mpiexec -n 4 ./cpi

Four processes will be started on the compute nodes n0, n1, n2 and n3. mpiexec can also be run with several options. $ mpiexec --help lists all the possible options. A useful option is to specify a machinefile which holds the process mapping to machines. It can also be used to specify the number of processes to be run on each host. The machinefile option can be used with mpiexec as follows:

$ mpiexec -machinefile mf -n 4 ./cpi

where the machine file “mf” contains the process to machine mapping. For example, if you want to run all 4 processes on n0, then “mf” contains the following lines:

$ cat mf
n0
n0
n0
n0
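A machinefile like the one above can also be generated instead of typed by hand. The following is a minimal sketch, not part of MVAPICH2: the function name make_machinefile and its argument order are hypothetical, and it only assumes the one-hostname-per-process layout shown above.

```shell
#!/bin/sh
# make_machinefile is a hypothetical helper, not an MVAPICH2 command.
# Usage: make_machinefile COUNT HOST [HOST ...]
# Prints each HOST on COUNT consecutive lines, one line per process.
make_machinefile() {
    count=$1
    shift
    for host in "$@"; do
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done
}

# Print the contents of the "mf" file shown above: four processes on n0.
make_machinefile 4 n0
```

Redirecting the output to a file (for example, make_machinefile 4 n0 > mf) gives a file that can be passed to mpiexec with the -machinefile option.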

Environmental variables can be set with mpiexec as follows:

$ mpiexec -n 4 -env ENV1 value1 -env ENV2 value2 ./cpi

Note that the environmental variables should be put immediately before the executable file. The mpiexec command also propagates exported variables in its runtime environment to all processes by default. Exporting a variable before running mpiexec has the same effect as explicitly passing its value with the -env command line option. The command above could be done in the following manner when using a Bourne shell derivative:

$ export ENV1=value1

$ export ENV2=value2

$ mpiexec -n 4 ./cpi

5.3.1 Run MPI Applications using Shared Library Support

MVAPICH2 provides shared library support. This feature allows you to build your application on top of the MPI shared library. If you choose this option, you will still be able to compile applications with static libraries. By default, however, when shared library support is enabled your applications will be built on top of shared libraries automatically. The following commands provide some examples of how to build and run your application with shared library support.

5.3.2 Run MPI Application using TotalView Debugger Support

MVAPICH2 provides TotalView support. The following commands provide an example of how to build and run your application with TotalView support. Note: running TotalView requires correct setup in your environment. If you encounter any problem with your setup, please check with your system administrator for help.

5.4 Run MPI Application using OpenFabrics Gen2-iWARP Device

In MVAPICH2, Gen2-iWARP support is enabled with the run-time environment variable MV2_USE_IWARP_MODE.

In addition to this flag, all the systems to be used need the following one-time setup for enabling RDMA CM usage.

Programs can be executed as follows:

$ mpiexec -n 4 -env MV2_USE_IWARP_MODE 1 -env ENV1 value1 prog

The iWARP device also provides TotalView debugging and shared library support. Please refer to Sections 5.3.1 and 5.3.2 for shared library and TotalView support, respectively.

5.5 Run MPI Application using mpiexec with uDAPL Device

MVAPICH2 can be configured with the uDAPL device, as described in Section 4.4.2. To compile MPI applications, please refer to Section 5.1. In order to run MPI applications with uDAPL support, please specify the environment variable MV2_DAPL_PROVIDER. As an example,

$ mpiexec -n 4 -env MV2_DAPL_PROVIDER ib0 ./cpi

or:

$ export MV2_DAPL_PROVIDER=ib0

$ mpiexec -n 4 ./cpi

Please check the /etc/dat.conf file to find all the available uDAPL service providers. If no environment variable is provided at runtime, the default uDAPL provider is chosen. For OFED 1.2, please use OpenIB-cma as the uDAPL provider.

The uDAPL device also provides TotalView debugging and shared library support. Please refer to Sections 5.3.1 and 5.3.2 for shared library and TotalView support, respectively.

5.6 Run MPI Application using mpiexec with TCP/IP

If you would like to run an MPI job using IPoIB but your IB card is not the default interface for IP traffic, you have two options. For both options, assume that you have a cluster set up as follows:


#hostname  Eth Addr     IPoIB Addr
compute1   192.168.0.1  192.168.1.1
compute2   192.168.0.2  192.168.1.2
compute3   192.168.0.3  192.168.1.3
compute4   192.168.0.4  192.168.1.4

5.6.1 The MPI Job Uses IPoIB

In this scenario, you will start up mpd as usual. However, you will need to create a machine file for mpiexec that tells mpiexec to use a particular interface. Example:
$ cat - > $(MPD_HOSTFILE)
compute1
compute2
compute3
compute4

$ mpdboot -n 4 -f $(MPD_HOSTFILE)

$ cat - > $(MACHINE_FILE)
compute1 ifhn=192.168.1.1
compute2 ifhn=192.168.1.2
compute3 ifhn=192.168.1.3
compute4 ifhn=192.168.1.4

The ifhn portion tells mpiexec to use the interface associated with that IP address for each machine. You can now run your MPI application using IPoIB similar to the following.
$ mpiexec -n $(NUM_PROCESS) -f $(MACHINE_FILE) $(MPI_APPLICATION)
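The machine file entries for this scenario follow a regular pattern (hostname followed by ifhn= and the IPoIB address), so they can be derived from a host table. The sketch below is illustrative only: ifhn_machinefile is a hypothetical helper, and the three-column input layout (hostname, Ethernet address, IPoIB address) is an assumption modeled on the cluster table in Section 5.6.

```shell
#!/bin/sh
# ifhn_machinefile is a hypothetical helper, not an MVAPICH2 or MPD command.
# Reads lines of the form "hostname eth_addr ipoib_addr" on stdin, skips
# blank lines and lines starting with '#', and prints
# "hostname ifhn=ipoib_addr" for each remaining line.
ifhn_machinefile() {
    awk 'NF >= 3 && $1 !~ /^#/ { print $1 " ifhn=" $3 }'
}

# Example input: two hosts with example addresses.
ifhn_machinefile <<'EOF'
#hostname Eth Addr IPoIB Addr
compute1 192.168.0.1 192.168.1.1
compute2 192.168.0.2 192.168.1.2
EOF
```

The output can be redirected to the machine file that is passed to mpiexec.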

5.6.2 Both MPD And the MPI Job Use IPoIB

In this scenario you will start up mpd in a modified fashion. However, you will not need to create a machine file for mpiexec. Your hosts file for mpdboot must contain the IP addresses, or hostnames mapped to these addresses, of each machine’s IPoIB interface. The only exception is that you do not list the IP address or hostname of the local machine. This will be specified on the command line of the mpdboot command using the --ifhn option. Example:
$ cat - > $(MPD_HOSTFILE)
192.168.1.2
192.168.1.3
192.168.1.4

$ mpdboot -n 4 -f $(MPD_HOSTFILE) --ifhn=192.168.1.1

The --ifhn option tells mpdboot to use the interface corresponding to that IP address to create the mpd ring and run MPI jobs. You can now run your MPI application using IPoIB similar to the following.
$ mpiexec -n $(NUM_PROCESS) $(MPI_APPLICATION)

Note: For both options, you can replace the IPoIB addresses with aliases.

5.7 Run MPI Applications using SLURM

SLURM is an open-source resource manager designed by Lawrence Livermore National Laboratory. SLURM software package and its related documents can be downloaded from:
http://www.llnl.gov/linux/slurm/

Once SLURM is installed and the daemons are started, applications compiled with MVAPICH2 can be launched by SLURM, e.g.

$ srun -n2 --mpi=none ./a.out

The use of SLURM enables many good features such as explicit CPU and memory binding. For example, if you have two processes and want to bind the first process to CPU 0 and Memory 0, and the second process to CPU 4 and Memory 1, then it can be achieved by:

$ srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=none ./a.out

For more information about SLURM and its features, please visit the SLURM website.

6 Advanced Usage Instructions

In this section, we present the usage instructions for advanced features provided by MVAPICH2.

6.1 Run MPI applications on Multi-Rail Configurations (for OpenFabrics Gen2-IB and Gen2-iWARP Devices)

MVAPICH2 has integrated multi-rail support. Run-time variables are used to specify the control parameters of the multi-rail support: the number of adapters with MV2_NUM_HCAS (Section 10.18), the number of ports per adapter with MV2_NUM_PORTS (Section 10.19), and the number of queue pairs per port with MV2_NUM_QP_PER_PORT (Section 10.20). These variables default to 1 if you do not specify them. The following example runs with multi-rail support on two adapters, using one port per adapter and one queue pair per port:

$ mpiexec -n 2 -env MV2_NUM_HCAS 2 -env MV2_NUM_PORTS 1 -env MV2_NUM_QP_PER_PORT 1 prog

Note that you don’t need to put -env MV2_NUM_PORTS 1 -env MV2_NUM_QP_PER_PORT 1 since they default to 1, so you can type:

$ mpiexec -n 2 -env MV2_NUM_HCAS 2 prog

6.2 Run MPI application with Customized Optimizations (for OpenFabrics Gen2-IB and Gen2-iWARP Devices)

In MVAPICH2 1.0.3, run-time variables are used to switch various optimization schemes on and off. The following is a list of optimization schemes and the environment variables that control them. For a full list, please refer to Section 10:

6.3 Run MPI application with Checkpoint/Restart Support (for OpenFabrics Gen2-IB Device)

MVAPICH2 provides system-level checkpoint/restart functionality for the OpenFabrics Gen2-IB interface. Three methods are provided to invoke checkpointing: Manual, Automated and Application Initiated Synchronous Checkpointing. In order to utilize the checkpoint/restart functionality, there are a couple of steps that need to be followed.

Users are strongly encouraged to read the BLCR Administrator’s Guide and to test BLCR on the target platform before using the checkpointing feature of MVAPICH2.

Now, your system is set up to use the Checkpoint/Restart features of MVAPICH2. There are several MVAPICH2 parameters that control the configuration and usage of this feature.

In order to provide maximum flexibility to end users who wish to use the checkpoint/restart features of MVAPICH2, we’ve provided three different methods which can be used to take the checkpoints during the execution of the MPI application. These methods are described as follows:

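#include "mpi.h"
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Computation\n");
    sleep(5);                      /* stand-in for the first compute phase */
    MPI_Barrier(MPI_COMM_WORLD);   /* all processes reach the same point */
    MVAPICH2_Sync_Checkpoint();    /* application-initiated synchronous checkpoint */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Computation\n");
    sleep(5);                      /* stand-in for the second compute phase */
    MPI_Finalize();
    return 0;
}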

To restart a job from a checkpoint, users need to issue another BLCR command, cr_restart, with the checkpoint file name of the MPI job console as the parameter, usually context.<pid>. The checkpoint file name of the MPI job console can be specified when issuing the checkpoint; see cr_checkpoint --help for more information. Please note that the names of the checkpoint files of the MPI processes are assigned according to the environment variable MV2_CKPT_FILE, in the form $MV2_CKPT_FILE.<number of checkpoint>.<process rank>.
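To make the naming pattern for per-process checkpoint files concrete, the following sketch prints the file names one would expect for a single checkpoint. It is illustrative only: ckpt_file_name is a hypothetical helper, /tmp/ckpt is an assumed value for MV2_CKPT_FILE rather than a documented default, and the checkpoint number 1 and ranks 0 to 3 are example values.

```shell
#!/bin/sh
# ckpt_file_name is a hypothetical helper for illustration only.
# Usage: ckpt_file_name BASE CHECKPOINT_NUMBER RANK
# Prints BASE.CHECKPOINT_NUMBER.RANK, following the pattern
# $MV2_CKPT_FILE.<number of checkpoint>.<process rank>.
ckpt_file_name() {
    printf '%s.%s.%s\n' "$1" "$2" "$3"
}

# /tmp/ckpt is an assumed value, used only if MV2_CKPT_FILE is unset.
MV2_CKPT_FILE=${MV2_CKPT_FILE:-/tmp/ckpt}

# Expected file names for checkpoint number 1 of a 4-process job.
rank=0
while [ "$rank" -lt 4 ]; do
    ckpt_file_name "$MV2_CKPT_FILE" 1 "$rank"
    rank=$((rank + 1))
done
```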

Please refer to Section 8.7 for troubleshooting with Checkpoint/Restart.

6.4 Run MPI application with RDMA CM support (for OpenFabrics Gen2-IB and Gen2-iWARP Devices)

To use RDMA CM in MVAPICH2, the run-time variable MV2_USE_RDMA_CM must be set, as described in Section 10.41. (Note: In order to use RDMA CM support on the OFED 1.1 software stack, please add the flag -DOFED_VERSION_1_1 to the CFLAGS in the compilation script make.mvapich2.ofa.)

In addition to these flags, all the systems to be used need the following one-time setup for enabling RDMA CM usage.

Programs can be executed as follows:

$ mpiexec -n 2 -env MV2_USE_RDMA_CM 1 prog

6.5 Run MPI application with Shared Memory Collectives

In MVAPICH2, support for shared memory based collectives has been enabled for MPI applications running over the OpenFabrics Gen2-IB, Gen2-iWARP and uDAPL stacks. Currently, this support is available for the following collective operations:

Optionally, these features can be turned off at runtime by using the following parameters:

Please refer to Section 10 for further details.

6.6 Run MPI Application with Hot-Spot and Congestion Avoidance (for OpenFabrics Gen2-IB Device)

MVAPICH2 supports hot-spot and congestion avoidance using the InfiniBand multi-pathing mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.

To enable this functionality, set the run-time variable MV2_USE_HSAM (Section 10.37), as shown in the following example:

$ mpiexec -n 2 -env MV2_USE_HSAM 1 ./cpi

This functionality automatically defines the number of paths for hot-spot avoidance. Alternatively, the maximum number of paths to be used between a pair of processes can be defined with the run-time variable MV2_NUM_QP_PER_PORT (Section 10.20).

We expect this functionality to show benefits in the presence of at least partially non-overlapping paths in the network. OpenSM, the subnet manager distributed with OpenFabrics, supports the LMC mechanism, which can be used to create multiple paths:

$ opensm -l4

will start the subnet manager with the LMC value set to four, creating sixteen paths between every pair of nodes.
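The sixteen paths in this example follow from the LMC value as a power of two (2 raised to the LMC). The short sketch below shows the arithmetic; lmc_paths is a hypothetical helper name and is not part of OpenSM or MVAPICH2.

```shell
#!/bin/sh
# lmc_paths is a hypothetical helper: prints 2^LMC, the number of
# paths that a given LMC value yields.
lmc_paths() {
    echo $((1 << $1))
}

lmc_paths 4   # prints 16, matching the opensm -l4 example
```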

6.7 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics Gen2-IB Device)

MVAPICH2 supports network fault recovery by using the InfiniBand Automatic Path Migration mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.

To enable this functionality, set the run-time variable MV2_USE_APM (Section 10.33), as shown in the following example:

$ mpiexec -n 2 -env MV2_USE_APM 1 ./cpi

MVAPICH2 also supports testing Automatic Path Migration in the subnet in the absence of network faults. This can be controlled with the run-time variable MV2_USE_APM_TEST (Section 10.34). It should be combined with MV2_USE_APM as follows:

$ mpiexec -n 2 -env MV2_USE_APM 1 -env MV2_USE_APM_TEST 1 ./cpi

7 Using OSU Benchmarks

If you have arrived at this point, you have successfully installed MVAPICH2. Congratulations!! In the mvapich2-1.0.3/osu_benchmarks directory, we provide these basic performance tests:

These benchmarks are also available on our project’s web page. Sample performance numbers for these benchmarks on representative platforms with InfiniBand and iWARP adapters are also included on our project’s web page. You are welcome to compare your performance numbers with our numbers. If you see any big discrepancy, please let us know by sending an email to mvapich-discuss@cse.ohio-state.edu.

8 FAQ and Troubleshooting with MVAPICH2

Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact us by sending an email to mvapich-discuss@cse.ohio-state.edu.

MVAPICH2 can be used over five underlying transport interfaces, namely OpenFabrics (Gen2), OpenFabrics (Gen2-iWARP), VAPI, UDAPL and TCP/IP. Based on the underlying library being utilized, the troubleshooting steps may be different. However, some of the troubleshooting hints are common to all underlying libraries. Thus, in this section, we have divided the troubleshooting tips into general troubleshooting and troubleshooting specific to each transport interface.

8.1 General Questions and Troubleshooting

8.1.1 Invalid Communicators Error

This problem typically occurs when multiple installations of MVAPICH2 are present on the same set of nodes: the program is compiled against an mpi.h other than the one belonging to the installation used to execute it. It can be resolved by making sure that the mpi.h from the other installation is not included.

8.1.2 Are fork() and system() supported?

fork() and system() are supported for the OpenFabrics device as long as the kernel being used is Linux 2.6.16 or newer. Additionally, the version of OFED used should be 1.2 or higher. The environment variable IBV_FORK_SAFE=1 must also be set to enable fork support.

8.1.3 Cannot Build with the PathScale Compiler

There is a known bug with the PathScale compiler (before version 2.5) when building MVAPICH2. This problem will be solved in the next major release of the PathScale compiler. To work around this bug, use the “-LNO:simd=0” C compiler option. This can be set in the build script similarly to:

export CC="pathcc -LNO:simd=0"

Please note the use of double quotes. If you are building shared libraries and are using the PathScale compiler (version below 2.5), then you should add “-g” to your CFLAGS, in order to get around a compiler bug.

8.1.4 Cannot find mpd.conf

If you get this error, please set your .mpd.conf and .mpdpasswd files.

8.1.5 sched_setaffinity: Bad address

MVAPICH2 supports CPU affinity for multi-way SMP systems. This feature requires a kernel version of 2.6. The above error reports that CPU affinity cannot be enabled on the system. This feature can be disabled by setting -env MV2_ENABLE_AFFINITY 0.

8.1.6 Multi-threaded programs seem to run sequentially

MVAPICH2 uses CPU affinity to have better performance for single-threaded programs. For multi-threaded programs, however, it may schedule all the threads of a process to run on the same CPU. CPU affinity should be disabled in this case to solve the problem, i.e. set -env MV2_ENABLE_AFFINITY 0.

8.1.7 Running MPI programs built with gfortran

MPI programs built with gfortran might not appear to run correctly due to the default output buffering used by gfortran. If it seems there is an issue with program output, the GFORTRAN_UNBUFFERED_ALL variable can be set to “y” and exported into the environment before using the mpiexec command to launch the program, as done in the bash shell example below:

$ export GFORTRAN_UNBUFFERED_ALL=y

8.2 With Gen2 Interface

8.2.1 Cannot Open HCA

The above error reports that the InfiniBand adapter is not ready for communication. Make sure that the drivers are up. This can be done by executing the following command, which gives the path at which the drivers are set up.

% locate libibverbs

8.2.2 Checking state of IB Link

In order to check the status of the IB link, one of the following commands can be used:
% ibstatus
or
% ibv_devinfo

8.2.3 Undefined reference to ibv_get_device_list

Add the -DGEN2_OLD_DEVICE_LIST_VERB macro to CFLAGS and rebuild MVAPICH2-gen2. This error means that your Gen2 installation is old and needs to be updated.

8.2.4 Creation of CQ or QP failure

A possible reason could be inability to pin the memory required. Make sure the following parameters are set.

In /etc/security/limits.conf add the following line:

* soft memlock phys_mem_in_KB

After this, add the following line to /etc/init.d/sshd and restart sshd:

ulimit -l phys_mem_in_KB
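The placeholder phys_mem_in_KB stands for the amount of physical memory in kilobytes. One way to obtain a candidate value is sketched below. This assumes a Linux system where /proc/meminfo reports a "MemTotal:" line in kB; phys_mem_in_kb is a hypothetical helper name. The script only prints the two lines and does not modify any system file.

```shell
#!/bin/sh
# Assumes a Linux system where /proc/meminfo contains "MemTotal: <n> kB".
# phys_mem_in_kb is a hypothetical helper name.
phys_mem_in_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

kb=$(phys_mem_in_kb)
echo "* soft memlock $kb"   # candidate line for /etc/security/limits.conf
echo "ulimit -l $kb"        # candidate line for /etc/init.d/sshd
```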

8.2.5 Hang with the HSAM Functionality

HSAM functionality uses the multi-pathing mechanism with LMC functionality. However, some versions of the OpenFabrics drivers (including OpenFabrics Enterprise Distribution (OFED) 1.1), when used with the Up*/Down* routing engine, do not configure the routes correctly using the LMC mechanism. We strongly suggest upgrading to OFED 1.2, which supports the Up*/Down* routing engine and the LMC mechanism correctly.

8.2.6 Failure with Automatic Path Migration

MVAPICH2 provides network fault tolerance with Automatic Path Migration (APM). However, APM is supported only with OFED 1.2 onwards. With OFED 1.1 and prior versions of OpenFabrics drivers, APM functionality is not completely supported. Please refer to Sections 10.33 and 10.34.

8.2.7 Error opening file

If you configure MVAPICH2 with RDMA_CM and see this error, you need to verify that you have set up the local IP address to be used by RDMA_CM in the file /etc/mv2.conf. Further, you need to make sure that this file has the appropriate read permissions. Please follow Section 6.4 for more details on this.

8.2.8 RDMA CM Address error

If you get this error, please verify that the IP address specified in /etc/mv2.conf is the IP address of the device you plan to use RDMA_CM with.

8.2.9 RDMA CM Route error

If you see this error, you need to check whether the specified network is working.

8.3 With Gen2-iWARP Interface

8.3.1 Error opening file

If you configure MVAPICH2 with RDMA_CM and see this error, you need to verify that you have set up the local IP address to be used by RDMA_CM in the file /etc/mv2.conf. Further, you need to make sure that this file has the appropriate read permissions. Please follow Section 5.4 for more details on this.

8.3.2 RDMA CM Address error

If you get this error, please verify that the IP address specified in /etc/mv2.conf is the IP address of the device you plan to use RDMA_CM with.

8.3.3 RDMA CM Route error

If you see this error, you need to check whether the specified network is working.

8.4 With VAPI Interface

8.4.1 Cannot pass MPI_Init

If your MPI application cannot pass MPI_Init, please make sure of the following things:

8.4.2 Cannot Open HCA

The above error reports that the InfiniBand Adapter is not ready for communication. Make sure that the drivers are up. This can be done by executing

$ locate libvapi

which gives the path at which the drivers are set up.

8.4.3 Cannot include vapi.h

This error is gener