MVAPICH 1.0 User and Tuning Guide

MVAPICH Team
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
http://mvapich.cse.ohio-state.edu/
Copyright ©2002-2008
Network-Based Computing Laboratory,
headed by Dr. D. K. Panda.
All rights reserved.

Last Revised: May 29, 2008

Contents

1 Overview of the Open-Source MVAPICH Project
2 How to use this User Guide?
3 MVAPICH 1.0 Features
4 Installation Instructions
 4.1 Download MVAPICH source code
 4.2 Prepare MVAPICH source code
 4.3 Getting MVAPICH source updates
 4.4 Build MVAPICH
  4.4.1 Build MVAPICH with Single-Rail configuration on OpenFabrics Gen2
  4.4.2 Build MVAPICH with Multi-Rail Configuration on OpenFabrics Gen2
  4.4.3 Build MVAPICH with OpenFabrics/Gen2 over the Unreliable Datagram Transport (Gen2-UD)
  4.4.4 Build MVAPICH with QLogic InfiniPath
  4.4.5 Build MVAPICH with Single-Rail Configuration on VAPI
  4.4.6 Build MVAPICH with Multi-Rail Configuration on VAPI
  4.4.7 Build MVAPICH with Single-Rail Configuration on uDAPL
  4.4.8 Build MVAPICH with Shared Memory Device
  4.4.9 Build MVAPICH with TCP/IPoIB
5 Usage Instructions
 5.1 Compile MPI applications
 5.2 Run MPI applications using mpirun_rsh
 5.3 Run MPI applications using SLURM
 5.4 Run MPI applications with Scalable Collectives
 5.5 Run MPI applications using shared library support
 5.6 Run MPI applications using ADIO driver for Lustre
 5.7 Run MPI applications using TotalView Debugger support
 5.8 Run MPI applications with Multi-Pathing Support for Multi-Core Architectures
 5.9 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics Gen2-IB Device)
 5.10 Run MPI applications on Multi-Rail Configurations
 5.11 Run Memory Intensive Applications on Multi-core Systems
6 Using OSU Benchmarks
7 FAQ and Troubleshooting
 7.1 General Questions and Troubleshooting
  7.1.1 How can I check what version I am using?
  7.1.2 Are fork() and system() supported?
  7.1.3 My application cannot pass MPI_Init
  7.1.4 My application hangs/aborts in Collectives
  7.1.5 Building MVAPICH with g77/gfortran
  7.1.6 Running MPI programs built with gfortran
  7.1.7 Timeout During Debugging
  7.1.8 Unexpected exit status
  7.1.9 /usr/bin/env: mpispawn: No such file or directory
  7.1.10 **io No such file or directory*?
  7.1.11 My program segfaults with: File locking failed in ADIOI_Set_lock?
  7.1.12 MPI+OpenMP shows bad performance
  7.1.13 Fortran support is disabled with Sun Studio 12 compilers
  7.1.14 Other MPICH problems
 7.2 Troubleshooting with MVAPICH/OpenFabrics(Gen2)
  7.2.1 No IB Devices found
  7.2.2 Error getting HCA Context
  7.2.3 CQ or QP Creation failure
  7.2.4 No Active Port found
  7.2.5 Couldn’t modify SRQ limit
  7.2.6 Got completion with error code 12
  7.2.7 Hang with VIADEV_USE_LMC=1
  7.2.8 Failure with Automatic Path Migration
  7.2.9 Problems with mpirun_rsh
 7.3 Troubleshooting with MVAPICH/VAPI
  7.3.1 Cannot Open HCA
  7.3.2 Cannot include vapi.h
  7.3.3 Program aborts with VAPI_RETRY_EXEC_ERROR
  7.3.4 ld:multiple definitions of symbol _calloc error on MacOS
  7.3.5 No Fortran interface on the MacOS platform
 7.4 Troubleshooting with MVAPICH/uDAPL
  7.4.1 Cannot Open IA
  7.4.2 DAT Insufficient Resource
  7.4.3 Cannot find libdat.so
  7.4.4 Cannot compile with DAPL-v2.0
 7.5 Troubleshooting with MVAPICH/QLogic InfiniPath
  7.5.1 Low Bandwidth
  7.5.2 Cannot find -lpsm_infinipath
  7.5.3 Mandatory variables not set
  7.5.4 Can’t open /dev/ipath, Network Down
  7.5.5 No ports available on /dev/ipath
8 Tuning and Scalability Features for Large Clusters
 8.1 Job Launch Tuning
 8.2 Network Point-to-point Tuning
  8.2.1 Shared Receive Queue (SRQ) Tuning
  8.2.2 On-Demand Connection Tuning
  8.2.3 Adaptive RDMA Tuning
 8.3 Shared Memory Point-to-point Tuning
 8.4 Scalable Collectives Tuning
9 MVAPICH Parameters
 9.1 Job Launch Parameters
  9.1.1 MT_DEGREE
  9.1.2 MPIRUN_TIMEOUT
 9.2 InfiniBand HCA and Network Parameters
  9.2.1 VIADEV_DEVICE
  9.2.2 VIADEV_DEFAULT_PORT
  9.2.3 VIADEV_MAX_PORTS
  9.2.4 VIADEV_USE_MULTIHCA
  9.2.5 VIADEV_USE_MULTIPORT
  9.2.6 VIADEV_USE_LMC
  9.2.7 VIADEV_DEFAULT_MTU
  9.2.8 VIADEV_USE_APM
  9.2.9 VIADEV_USE_APM_TEST
 9.3 Memory Usage and Performance Control Parameters
  9.3.1 VIADEV_NUM_RDMA_BUFFER
  9.3.2 VIADEV_VBUF_TOTAL_SIZE
  9.3.3 VIADEV_RNDV_PROTOCOL
  9.3.4 VIADEV_RENDEZVOUS_THRESHOLD
  9.3.5 VIADEV_MAX_RDMA_SIZE
  9.3.6 VIADEV_R3_NOCACHE_THRESHOLD
  9.3.7 VIADEV_VBUF_POOL_SIZE
  9.3.8 VIADEV_VBUF_SECONDARY_POOL_SIZE
  9.3.9 VIADEV_USE_DREG_CACHE
  9.3.10 LAZY_MEM_UNREGISTER
  9.3.11 VIADEV_NDREG_ENTRIES
  9.3.12 VIADEV_DREG_CACHE_LIMIT
  9.3.13 VIADEV_VBUF_MAX
  9.3.14 VIADEV_ON_DEMAND_THRESHOLD
  9.3.15 VIADEV_MAX_INLINE_SIZE
  9.3.16 VIADEV_NO_INLINE_THRESHOLD
  9.3.17 VIADEV_USE_BLOCKING
  9.3.18 VIADEV_ADAPTIVE_RDMA_LIMIT
  9.3.19 VIADEV_ADAPTIVE_RDMA_THRESHOLD
  9.3.20 VIADEV_ADAPTIVE_ENABLE_LIMIT
  9.3.21 VIADEV_SQ_SIZE
 9.4 Send/Receive Control Parameters
  9.4.1 VIADEV_CREDIT_PRESERVE
  9.4.2 VIADEV_CREDIT_NOTIFY_THRESHOLD
  9.4.3 VIADEV_DYNAMIC_CREDIT_THRESHOLD
  9.4.4 VIADEV_INITIAL_PREPOST_DEPTH
  9.4.5 VIADEV_USE_SHARED_MEM
  9.4.6 VIADEV_PROGRESS_THRESHOLD
  9.4.7 VIADEV_USE_COALESCE
  9.4.8 VIADEV_USE_COALESCE_SAME
  9.4.9 VIADEV_COALESCE_THRESHOLD_SQ
  9.4.10 VIADEV_COALESCE_THRESHOLD_SIZE
 9.5 SRQ (Shared Receive Queue) Control Parameters
  9.5.1 VIADEV_USE_SRQ
  9.5.2 VIADEV_SRQ_MAX_SIZE
  9.5.3 VIADEV_SRQ_SIZE
  9.5.4 VIADEV_SRQ_LIMIT
  9.5.5 VIADEV_MAX_R3_OUST_SEND
  9.5.6 VIADEV_SRQ_ZERO_POST_MAX
  9.5.7 VIADEV_MAX_R3_PENDING_DATA
 9.6 Shared Memory Control Parameters
  9.6.1 VIADEV_SMP_EAGERSIZE
  9.6.2 VIADEV_SMPI_LENGTH_QUEUE
  9.6.3 SMP_SEND_BUF_SIZE
  9.6.4 VIADEV_SMP_NUM_SEND_BUFFER
  9.6.5 VIADEV_USE_AFFINITY
  9.6.6 VIADEV_CPU_MAPPING
 9.7 Multi-Rail Usage Parameters
  9.7.1 STRIPING_THRESHOLD
  9.7.2 NUM_QP_PER_PORT
  9.7.3 NUM_PORTS
  9.7.4 NUM_HCAS
  9.7.5 SM_SCHEDULING
  9.7.6 LM_SCHEDULING
 9.8 Run time parameters for Collectives
  9.8.1 VIADEV_USE_SHMEM_COLL
  9.8.2 VIADEV_USE_SHMEM_BARRIER
  9.8.3 VIADEV_USE_SHMEM_ALLREDUCE
  9.8.4 VIADEV_USE_SHMEM_REDUCE
  9.8.5 VIADEV_USE_ALLGATHER_NEW
  9.8.6 VIADEV_MAX_SHMEM_COLL_COMM
  9.8.7 VIADEV_SHMEM_COLL_MAX_MSG_SIZE
  9.8.8 VIADEV_SHMEM_COLL_REDUCE_THRESHOLD
  9.8.9 VIADEV_SHMEM_COLL_ALLREDUCE_THRESHOLD
  9.8.10 VIADEV_BCAST_KNOMIAL
  9.8.11 MPIR_ALLTOALL_SHORT_MSG
  9.8.12 MPIR_ALLTOALL_MEDIUM_MSG
  9.8.13 MPIR_AllTOALL_BASIC
  9.8.14 MPIR_ALLTOALL_MCORE_OPT
 9.9 CM Control Parameters
  9.9.1 VIADEV_CM_RECV_BUFFERS
  9.9.2 VIADEV_CM_MAX_SPIN_COUNT
  9.9.3 VIADEV_CM_TIMEOUT
 9.10 Other Parameters
  9.10.1 VIADEV_CLUSTER_SIZE
  9.10.2 VIADEV_PREPOST_DEPTH
  9.10.3 VIADEV_MAX_SPIN_COUNT
  9.10.4 VIADEV_PT2PT_FAILOVER
  9.10.5 DAPL_PROVIDER
10 MVAPICH Gen2-UD Parameters
 10.1 InfiniBand HCA and Network Parameters
  10.1.1 MV_DEVICE
  10.1.2 MV_MTU
 10.2 Reliability Parameters
  10.2.1 MV_PROGRESS_TIMEOUT
  10.2.2 MV_RETRY_TIMEOUT
  10.2.3 MV_MAX_RETRY_COUNT
  10.2.4 MV_ACK_AFTER_RECV
  10.2.5 MV_ACK_AFTER_PROGRESS
 10.3 Large Message Transfer Parameters
  10.3.1 MV_USE_UD_ZCOPY
  10.3.2 MV_UD_ZCOPY_QPS
  10.3.3 MV_UD_ZCOPY_THRESHOLD
  10.3.4 MV_USE_REG_CACHE
 10.4 Performance and General Parameters
  10.4.1 MV_USE_HEADERS
  10.4.2 MV_NUM_UD_QPS
  10.4.3 MV_RNDV_THRESHOLD
 10.5 QP and Buffer Parameters
  10.5.1 MV_UD_SQ_SIZE
  10.5.2 MV_UD_RQ_SIZE
  10.5.3 MV_UD_CQ_SIZE
  10.5.4 MV_USE_LMC
 10.6 Shared Memory Control Parameters
  10.6.1 MV_USE_SHARED_MEMORY
  10.6.2 MV_SMP_EAGERSIZE
  10.6.3 MV_SMPI_LENGTH_QUEUE
  10.6.4 SMP_SEND_BUF_SIZE
  10.6.5 MV_SMP_NUM_SEND_BUFFER
  10.6.6 MV_USE_AFFINITY

1 Overview of the Open-Source MVAPICH Project

InfiniBand is emerging as a high-performance interconnect delivering low latency and high bandwidth. It is also getting widespread acceptance due to its open standard.

MVAPICH (pronounced “em-vah-pich”) is open-source MPI software that exploits the novel features and mechanisms of InfiniBand and other RDMA-enabled interconnects to deliver performance and scalability to MPI applications. This software is developed in the Network-Based Computing Laboratory (NBCL), headed by Prof. Dhabaleswar K. (DK) Panda.

Currently, there are two versions of this MPI: MVAPICH with MPI-1 semantics and MVAPICH2 with MPI-2 semantics. This open-source MPI software project started in 2001, and a first high-performance implementation was demonstrated at the Supercomputing ’02 conference. Since then, this software has been steadily gaining acceptance in the HPC and InfiniBand community. As of 05/29/2008, more than 690 organizations (National Labs, Universities, and Industry) in 41 countries have downloaded this software directly from OSU’s web site. In addition, many IBA vendors, server vendors, and systems integrators have been incorporating MVAPICH/MVAPICH2 into their software stacks and distributing it. Several InfiniBand systems using MVAPICH have obtained positions in the TOP 500 ranking. The current version of MVAPICH is also being made available with the OpenFabrics/Gen2 stack. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.

More details on the MVAPICH/MVAPICH2 software, the users list, sample performance numbers on a wide range of platforms and interconnects, a set of OSU benchmarks, related publications, and other InfiniBand-related projects (parallel file systems, storage, data centers) can be obtained from the following URL:

http://mvapich.cse.ohio-state.edu/

This document contains the information MVAPICH users need to download, install, test, use, and tune MVAPICH 1.0. As we get feedback from users and take care of bug fixes, we introduce new patches against our released distribution and also continuously update this document. We therefore strongly recommend that you refer to our web page for updates.

2 How to use this User Guide?

This guide is designed to take the user through all the steps involved in configuring, installing, running and tuning MPI applications over InfiniBand using MVAPICH-1.0.

In Section 3 we describe all the features in MVAPICH 1.0. As you read through this section, please note our new features (highlighted as NEW). Some of these features are designed to optimize specific types of MPI applications and achieve greater scalability. Section 4 describes in detail the configuration and installation steps. This section enables the user to identify specific compilation flags which can be used to turn some of the features on or off. Usage instructions for MVAPICH are explained in Section 5. Apart from describing how to run simple MPI applications, this section also covers running MVAPICH with some of the advanced features. Section 6 describes the usage of the OSU Benchmarks. If you have any problems using MVAPICH, please check Section 7, where we list some of the common problems users face. In Section 8 we suggest some tuning techniques for multi-thousand node clusters using some of our new features. In Section 9, we list important run-time and compile-time parameters for the Gen2, VAPI, and QLogic devices, with their default values and a short description of each parameter. Finally, Section 10 lists the parameters and tuning options for the OpenFabrics/Gen2-UD device.

3 MVAPICH 1.0 Features

MVAPICH (MPI-1 over InfiniBand) is an MPI-1 implementation based on MPICH and MVICH. MVAPICH 1.0 is available as a single integrated package (with the latest MPICH 1.2.7 and MVICH).

A complete set of features of MVAPICH 1.0 are:

The MVAPICH 1.0 package and the project also includes the following provisions:

4 Installation Instructions

4.1 Download MVAPICH source code

The MVAPICH 1.0 source code package includes the latest MPICH 1.2.7 version and also the required MVICH files from LBNL. Thus, there is no need to download any other files except MVAPICH 1.0 source code.

You can go to the MVAPICH website to obtain the source code.

4.2 Prepare MVAPICH source code

Untar the archive you have downloaded from the web page using the following command. You will have a directory named mvapich-1.0 after executing this command.

$ tar xzf mvapich-1.0.tar.gz

4.3 Getting MVAPICH source updates

As we enhance and improve MVAPICH, we update the available source code on our public SVN repository. In order to obtain these updates, please install an SVN client on your machine. The latest MVAPICH sources may be obtained from the “trunk” of the SVN repository using the following command:

$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk

The “trunk” may contain newer features and bug fixes. However, it is likely to be lightly tested. If you are interested in obtaining stable and major bug fixes to any release version, you should update your sources from the “branch” of the SVN using the following command:

$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/branches/1.0

MVAPICH 1.0 provides support for eight different ADI devices, namely Gen2 Single-Rail (ch_gen2), Gen2 Multi-Rail (ch_gen2_multirail), Gen2/UD (ch_gen2_ud), Shared memory device (ch_smp), VAPI Single-Rail (vapi), VAPI Multi-Rail (vapi_multirail), uDAPL (udapl) and QLogic InfiniPath (psm). Additionally, you can also configure MVAPICH over the standard TCP/IP interface and use it over IPoIB.

4.4 Build MVAPICH

There are several options to build MVAPICH 1.0 based on the underlying InfiniBand libraries you want to utilize. In this section we describe in detail the steps you need to perform to correctly build MVAPICH on your choice of InfiniBand libraries, namely OpenFabrics/Gen2, OpenFabrics/Gen2-UD, Mellanox VAPI, uDAPL, Shared Memory or QLogic InfiniPath.

In the following subsection, we describe how to build and configure the Single-Rail device. In later subsections, we describe the building and configuration of the other devices: Multi-Rail with OpenFabrics/Gen2 (4.4.2), Gen2/UD (4.4.3), InfiniPath (4.4.4), VAPI-single-rail (4.4.5), VAPI-multi-rail (4.4.6), uDAPL (4.4.7), Shared memory (4.4.8) and TCP (4.4.9).

4.4.1 Build MVAPICH with Single-Rail configuration on OpenFabrics Gen2

There are several methods to configure MVAPICH 1.0.

After setting all the parameters, the script make.mvapich.gen2 configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.2 Build MVAPICH with Multi-Rail Configuration on OpenFabrics Gen2

There are several methods to configure MVAPICH 1.0 with multi-rail device on OpenFabrics Gen2.

After setting all the parameters, the script make.mvapich.gen2_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.

MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Please refer to Section 5.10 for more details.

4.4.3 Build MVAPICH with OpenFabrics/Gen2 over the Unreliable Datagram Transport (Gen2-UD)

There are several methods to configure MVAPICH 1.0.

After setting all the parameters, the script make.mvapich.gen2_ud configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.4 Build MVAPICH with QLogic InfiniPath

There are several methods to configure MVAPICH 1.0.

After setting all the parameters, the script make.mvapich.psm configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.5 Build MVAPICH with Single-Rail Configuration on VAPI

There are several methods to configure MVAPICH 1.0 on VAPI.

After setting all the parameters, the script make.mvapich.vapi configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.6 Build MVAPICH with Multi-Rail Configuration on VAPI

There are several methods to configure MVAPICH 1.0 with multi-rail device.

After setting all the parameters, the script make.mvapich.vapi_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.7 Build MVAPICH with Single-Rail Configuration on uDAPL

Before installing MVAPICH with uDAPL, please make sure you have the uDAPL library installed properly.

There are several methods to configure MVAPICH 1.0 with uDAPL.

After setting all the parameters, the script make.mvapich.udapl configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.8 Build MVAPICH with Shared Memory Device

In the mvapich-1.0 directory, we have provided a script make.mvapich.smp for building MVAPICH over shared memory intended for single SMP systems. The script make.mvapich.smp takes care of different platforms, compilers and architectures. By default, the compilation script uses gcc. In order to select your compiler, please set the variable CC in the script to use either Intel, PathScale or PGI compilers. The platform/architecture is detected automatically. The usage of the shared memory device can be found in 5.2.

4.4.9 Build MVAPICH with TCP/IPoIB

In the mvapich-1.0 directory, we have provided a script make.mvapich.tcp for building MVAPICH over TCP/IP intended for use over IPoIB (IP over InfiniBand). To select a compiler other than GCC, please set the CC variable in that script. Simply execute this script (e.g. ./make.mvapich.tcp) to complete your build.

5 Usage Instructions

This section discusses the usage methods for the various features provided by MVAPICH. If you face any problem while following these instructions, please refer to Section 7.

5.1 Compile MPI applications

Use mpicc, mpif77, mpiCC, or mpif90 to compile applications. They can be found under mvapich-1.0/bin.

There are several options to run MPI applications. Please select one of the following options based on your need.

5.2 Run MPI applications using mpirun_rsh

Prerequisites:

Examples of running programs using mpirun_rsh:

$ mpirun_rsh -np 4 n0 n1 n2 n3 ./cpi

The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. By default ssh is used.

$ mpirun_rsh -rsh -np 4 n0 n1 n2 n3 ./cpi

The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. rsh is used regardless of whether ssh or rsh was selected when MVAPICH was compiled.

$ mpirun_rsh -np 4 -hostfile hosts ./cpi

The file hosts lists the nodes, one per line. MPI ranks are assigned in the order of the hosts listed in the hosts file or in the order they are passed to mpirun_rsh. For example, if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, rank 0 and rank 2, whereas n1 will have ranks 1 and 3. This rank distribution is known as “cyclic”. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as “block”.
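The rank placement described above can be modeled in a few lines. The following Python sketch is illustrative only: `ranks_by_host` is a hypothetical helper, not part of MVAPICH, and it simply assigns rank i to the i-th host entry as the text describes.

```python
from collections import defaultdict


def ranks_by_host(hosts):
    """Model the documented placement: rank i runs on the i-th listed host.

    `hosts` is the host list in the order given on the command line or in
    the hosts file. This is a model of the documented behavior, not code
    taken from mpirun_rsh.
    """
    placement = defaultdict(list)
    for rank, host in enumerate(hosts):
        placement[host].append(rank)
    return dict(placement)


# "Cyclic" distribution: hosts alternate in the list.
cyclic = ranks_by_host(["n0", "n1", "n0", "n1"])
# "Block" distribution: each host is repeated consecutively.
block = ranks_by_host(["n0", "n0", "n1", "n1"])

print(cyclic)  # {'n0': [0, 2], 'n1': [1, 3]}
print(block)   # {'n0': [0, 1], 'n1': [2, 3]}
```

Running such a helper on a hosts file before launching a job is a quick way to confirm which distribution a given ordering produces.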

If you are using the shared memory device, then host names can be omitted:

$ mpirun_rsh -np 4 ./cpi

Many parameters of the MPI library can be configured at run time using environment variables. To pass an environment variable to the application, put the variable names and values just before the executable name, as in the following example:

$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./cpi

Note that the environment variables should be put immediately before the executable.
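When launch commands are generated by a script, the ordering rule above is easy to get wrong. The following Python sketch shows one way to assemble the argument list so that the NAME=value pairs always sit directly before the executable. `build_mpirun_rsh_command` is a hypothetical helper written for this illustration; only the mpirun_rsh options already shown in this section are used.

```python
import shlex


def build_mpirun_rsh_command(nprocs, hostfile, executable, env=None, args=()):
    """Return an argv list with NAME=value pairs right before the executable."""
    cmd = ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile]
    # Environment variables go immediately before the executable name.
    cmd += [f"{name}={value}" for name, value in (env or {}).items()]
    cmd.append(executable)
    # Anything after the executable is passed to the application itself.
    cmd += list(args)
    return cmd


cmd = build_mpirun_rsh_command(
    4, "hosts", "./cpi", env={"ENV1": "value", "ENV2": "value"}
)
print(shlex.join(cmd))
# mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./cpi
```

The resulting list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues.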

Alternatively, you may also place environment variables in your shell environment (e.g. .bashrc). These will be automatically picked up when the application starts executing.

Please note that there are many different parameters which could be used to improve the performance of applications depending upon their requirements from the MPI library. For a discussion on how to identify which variables may be of interest to you, please take a look at Section 8.

Other options of mpirun_rsh can be obtained using

$ mpirun_rsh --help

5.3 Run MPI applications using SLURM

SLURM is an open-source resource manager designed by Lawrence Livermore National Laboratory. The SLURM software package and its related documents can be downloaded from: http://www.llnl.gov/linux/slurm/

Once SLURM is installed and the daemons are started, applications compiled with MVAPICH can be launched by SLURM, e.g.

$ srun -n2 --mpi=mvapich ./a.out

The use of SLURM enables many good features such as explicit CPU and memory binding. For example, if you have two processes and want to bind the first process to CPU 0 and Memory 0, and the second process to CPU 4 and Memory 1, then it can be achieved by:

$ srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=mvapich ./a.out
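The binding command above can also be generated from a per-process list of bindings. The Python sketch below is a hedged illustration: `build_srun_command` is a hypothetical helper, and it only reproduces the option syntax shown in the example above (one CPU id and one memory id per process, in rank order).

```python
def build_srun_command(bindings, executable):
    """Build the srun argv for explicit CPU and memory binding.

    `bindings` is a list of (cpu, memory) pairs, one per process, in rank
    order. The option spelling follows the example in this section.
    """
    cpus = ",".join(str(cpu) for cpu, _ in bindings)
    mems = ",".join(str(mem) for _, mem in bindings)
    return [
        "srun",
        f"--cpu_bind=v,map_cpu:{cpus}",
        f"--mem_bind=v,map_mem:{mems}",
        f"-n{len(bindings)}",
        "--mpi=mvapich",
        executable,
    ]


# First process on CPU 0 / Memory 0, second on CPU 4 / Memory 1.
srun_cmd = build_srun_command([(0, 0), (4, 1)], "./a.out")
print(" ".join(srun_cmd))
# srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=mvapich ./a.out
```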

For more information about SLURM and its features, please visit the SLURM website.

5.4 Run MPI applications with Scalable Collectives

MVAPICH provides shared memory implementations of important collectives: MPI_Allreduce, MPI_Reduce, MPI_Barrier and MPI_Bcast. It also has support for Enhanced MPI_Allgather. These collective operations are enabled by default. Shared Memory Collectives are supported over Gen2, Gen2/UD, PSM and Shared Memory devices. The PSM device currently only has MPI_Barrier and MPI_Bcast shared memory implementations.

These operations can be disabled all at once by setting VIADEV_USE_SHMEM_COLL to 0 or one at a time by using the following environment variables:

Please refer to section 9.8 for tuning the various environment variables.

5.5 Run MPI applications using shared library support

MVAPICH provides shared library support. This feature allows you to build your application on top of the MPI shared library. If you choose this option, you will still be able to compile applications with static libraries. By default, however, when shared library support is enabled, your applications will be built on top of shared libraries automatically. The following commands provide some examples of how to build and run your application with shared library support.

5.6 Run MPI applications using ADIO driver for Lustre

MVAPICH contains optimized Lustre ADIO support for the OpenFabrics/Gen2 device. The Lustre directory should be mounted on all nodes on which MVAPICH processes will be running. Compile MVAPICH with ADIO support for Lustre as described in Section 4.4.1. If your Lustre mount is /mnt/datafs on nodes n0 and n1, on node n0, you can compile and run your program as follows:

$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 <path to perf>/perf -fname /mnt/datafs/testfile

If you have enabled support for multiple file systems, prepend the prefix “lustre:” to the name of the file. For example:

$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 ./perf -fname lustre:/mnt/datafs/testfile

5.7 Run MPI applications using TotalView Debugger support

MVAPICH provides TotalView support for the OpenFabrics/Gen2 (mpid/ch_gen2), OpenFabrics/Gen2-UD (mpid/ch_gen2_ud), Single-rail VAPI (mpid/vapi), InfiniPath (mpid/psm) and Shared-Memory (mpid/ch_smp) devices. You need to use mpirun_rsh when running TotalView. The following commands also provide an example of how to build and run your application with TotalView support. Note: running TotalView requires a correct setup in your environment. If you encounter any problem with your setup, please check with your system administrator for help.

5.8 Run MPI applications with Multi-Pathing Support for Multi-Core Architectures

MVAPICH provides a multi-rail device with advanced scheduling policies for data transfer (Section 5.10). However, even with the single-rail configuration, multi-pathing (multiple ports, adapters and multiple paths provided by the LMC mechanism) can be used for multi-core systems. With this support, processes executing on the same node can leverage the above configurations by binding to one of the available configurations. MVAPICH provides multiple choices to the user for leveraging this functionality, which are described in the upcoming examples. This functionality is currently available only in the single-rail gen2 device.

5.9 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics Gen2-IB Device)

MVAPICH supports network fault recovery by using the InfiniBand Automatic Path Migration (APM) mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.

To enable this functionality, set the run-time variable VIADEV_USE_APM (Section 9.2.8) to 1, as shown in the following example:

$ mpirun_rsh -np 2 VIADEV_USE_APM=1 ./cpi

MVAPICH also supports testing Automatic Path Migration in the subnet in the absence of network faults. This can be controlled by using a run-time variable VIADEV_USE_APM_TEST (section 9.2.9). This should be combined with VIADEV_USE_APM as follows:

$ mpirun_rsh -np 2 VIADEV_USE_APM=1 VIADEV_USE_APM_TEST=1 ./cpi

5.10 Run MPI applications on Multi-Rail Configurations

MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Run-time parameters are provided to control the policies. They are further divided into policies for small and large messages. These policies are available in the multi-rail devices for gen2 and VAPI.

5.11 Run Memory Intensive Applications on Multi-core Systems

Process-to-CPU mapping may affect application performance on multi-core systems, especially for memory-intensive applications. If the number of processes is smaller than the number of CPUs/cores, it is preferable to distribute the processes across different chips to avoid memory contention, because CPUs/cores on the same chip usually share the memory controller. MVAPICH provides flexible user-defined CPU mapping. To use it, first make sure CPU affinity is set (Section 9.6.5). Then use the run-time environment variable VIADEV_CPU_MAPPING to specify the CPU/core mapping. For example, on a quad-core system in which cores [0-3] are on one chip and cores [4-7] are on another chip, if you need to run an application with 2 processes, then the following mapping will give the best performance:

$ mpirun_rsh -np 2 n0 n0 VIADEV_CPU_MAPPING=0,4 ./a.out

In this case process 0 will be mapped to core 0 and process 1 will be mapped to core 4.

More information about VIADEV_CPU_MAPPING can be found in Section 9.6.6.
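The reasoning behind the 0,4 mapping can be sketched in code. The following Python example is illustrative only and makes two assumptions: the mapping string is a comma-separated list of core ids in process order (as in the example above), and the chip layout is supplied by the user. Both `parse_cpu_mapping` and `spread_across_chips` are hypothetical helpers, not MVAPICH functions, and the library's own parsing may differ.

```python
def parse_cpu_mapping(mapping):
    """Map process index -> core id for a string such as "0,4".

    Assumes the comma-separated form used in the example above.
    """
    cores = [int(core) for core in mapping.split(",")]
    return dict(enumerate(cores))


def spread_across_chips(nprocs, chips):
    """Pick cores round-robin across chips to limit memory contention.

    `chips` is a user-supplied list of core-id lists, one list per chip.
    Returns a comma-separated mapping string.
    """
    if nprocs > sum(len(chip) for chip in chips):
        raise ValueError("more processes than cores")
    picked = []
    next_free = [0] * len(chips)  # next unused core index on each chip
    chip = 0
    while len(picked) < nprocs:
        if next_free[chip] < len(chips[chip]):
            picked.append(chips[chip][next_free[chip]])
            next_free[chip] += 1
        chip = (chip + 1) % len(chips)
    return ",".join(str(core) for core in picked)


layout = [[0, 1, 2, 3], [4, 5, 6, 7]]  # two quad-core chips
mapping = spread_across_chips(2, layout)
print(mapping)                     # 0,4
print(parse_cpu_mapping(mapping))  # {0: 0, 1: 4}
```

With four processes on the same layout, the helper returns 0,4,1,5, so that each chip hosts two processes rather than one chip hosting all four.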

6 Using OSU Benchmarks

If you have arrived at this point, you have successfully installed MVAPICH. Congratulations!! In the mvapich-1.0/osu_benchmarks directory, we provide five basic performance tests: a one-way latency test, a uni-directional bandwidth test, a bi-directional bandwidth test, a multiple bandwidth/message rate test, and an MPI-level broadcast latency test. You can compile and run these tests on your machines to evaluate the basic performance of MVAPICH.

These benchmarks, as well as other benchmarks (such as those for one-sided operations in MPI-2), are available on our project’s web page. Sample performance numbers for these benchmarks on representative platforms and IBA gear are also included there. You are welcome to compare your performance numbers with ours. If you see any large discrepancy, please let the MVAPICH community know by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.

7 FAQ and Troubleshooting

Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact the MVAPICH community by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.

MVAPICH can be used over multiple underlying InfiniBand libraries, namely OpenFabrics (Gen2), OpenFabrics (Gen2-UD), VAPI, uDAPL and QLogic InfiniPath. Based on the underlying library being utilized, the troubleshooting steps may be different. However, some of the troubleshooting hints are common to all underlying libraries. Thus, in this section, we have divided the troubleshooting tips into five sections: general troubleshooting, followed by troubleshooting for each of four underlying libraries (OpenFabrics Gen2, VAPI, uDAPL and QLogic InfiniPath).

7.1 General Questions and Troubleshooting

7.1.1 How can I check what version I am using?

Running the following command will provide you with the version of MVAPICH that is being used.

$ mpirun_rsh -v

7.1.2 Are fork() and system() supported?

fork() and system() are supported for the Gen2 and Gen2-UD devices as long as the kernel in use is Linux 2.6.16 or newer. Additionally, the version of OFED used should be 1.2 or higher. The environment variable IBV_FORK_SAFE=1 must also be set to enable fork support.
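Two of the three conditions above (kernel version and IBV_FORK_SAFE) can be checked from a script before a job is launched. The Python sketch below is a minimal illustration under stated assumptions: `kernel_at_least` and `fork_prerequisites` are hypothetical helpers, the kernel release is assumed to start with a dotted version number, and the OFED version is not checked because there is no single portable way to query it.

```python
import os
import re

MIN_KERNEL = (2, 6, 16)  # minimum Linux kernel stated in this section


def kernel_at_least(release, minimum=MIN_KERNEL):
    """Compare a release string such as "2.6.18-92.el5" with `minimum`."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = tuple(int(part or 0) for part in match.groups())
    return version >= minimum


def fork_prerequisites(release, environ):
    """Report the checks that can be made locally (OFED is not covered)."""
    return {
        "kernel_ok": kernel_at_least(release),
        "ibv_fork_safe_set": environ.get("IBV_FORK_SAFE") == "1",
    }


if __name__ == "__main__":
    import platform

    # Inspect the node this script runs on.
    print(fork_prerequisites(platform.release(), os.environ))
```

Because IBV_FORK_SAFE must be visible to the MPI processes, it is worth running such a check in the same environment that the job itself will use.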

7.1.3 My application cannot pass MPI_Init

This is a common symptom of several setup issues related to job startup. Please make sure of the following things:

7.1.4 My application hangs/aborts in Collectives

MVAPICH implements highly optimized RDMA collective algorithms for frequently used collectives such as MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Bcast and MPI_Allgather. The optimized implementations have been well tested and tuned. However, if you face any problems in these collectives for your application, please disable the optimized collectives. For example, to disable the optimized MPI_Allreduce, you can run:

$ mpirun_rsh -np 8 -hostfile hf VIADEV_USE_SHMEM_ALLREDUCE=0 ./a.out

The complete list of all such parameters is given in Section 9.8.

7.1.5 Building MVAPICH with g77/gfortran

The gfortran compiler can be used for F77 and F90. In order to make this work, the following environment variables should be set prior to running the build script:

$ export F77=gfortran
$ export F90=gfortran
$ export F77_GETARGDECL=" "

If g77 and gfortran are used together for F77 and F90 respectively, it might be necessary to set the following environment variable in order to get around possible compatibility issues:

$ export F90FLAGS="-ff2c"

7.1.6 Running MPI programs built with gfortran

MPI programs built with gfortran might not appear to run correctly due to the default output buffering used by gfortran. If it seems there is an issue with program output, the GFORTRAN_UNBUFFERED_ALL variable can be set to “y” when using mpirun_rsh to fix the problem. Running the pi3f90 example program using this variable setting is shown below:

$ mpirun_rsh -np 2 n1 n2 GFORTRAN_UNBUFFERED_ALL=y ./pi3f90

7.1.7 Timeout During Debugging

If a debug session is terminated with an alarm message, mpirun_rsh may have timed out waiting for the job launch to complete. Use a larger MPIRUN_TIMEOUT (Section 9.1.2) to work around this problem.

7.1.8 Unexpected exit status

If an application task terminates unexpectedly during job launch, mpirun_rsh may print the message:

mpispawn.c:303 Unexpected exit status

This usually indicates a problem with the application. Other error messages around this (if any) might point to the actual issue.

7.1.9 /usr/bin/env: mpispawn: No such file or directory

If mpirun_rsh fails with this error message, it was unable to locate a necessary utility. This can be fixed by ensuring that all MVAPICH executables are in the PATH on all nodes.

If the PATH cannot be set up as described, then invoke mpirun_rsh with an explicit path. For example:

$ /path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc

or

$ ../../path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc
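Whether the launcher utilities are visible on the PATH can be checked before starting a job. The following Python sketch is a hedged illustration: `missing_tools` is a hypothetical helper built on the standard `shutil.which` function, and it only inspects the node it runs on, so it would need to be run on every node of the job.

```python
import shutil


def missing_tools(tools, path=None):
    """Return the tools that are not found as executables on `path`.

    When `path` is None, the PATH of the current environment is searched.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


# An empty list means both utilities were found on this node's PATH.
print(missing_tools(["mpirun_rsh", "mpispawn"]))
```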

7.1.10 **io No such file or directory*?

If you are using ADIO support for Lustre, please make sure that:
– Lustre is set up correctly, and you are able to create, read from and write to files in the Lustre mounted directory.
– The Lustre directory is mounted on all nodes on which MVAPICH processes with ADIO support for Lustre are running.
– The path to the file is correctly specified.
– The permissions for the file or directory are correctly specified.

7.1.11 My program segfaults with: File locking failed in ADIOI_Set_lock?

If you are using ADIO support for Lustre, recent Lustre releases require an additional mount option for file locks to work correctly. Please include the following option in your Lustre mount command: “-o localflock”. For example:

$ mount -o localflock -t lustre xxxx@o2ib:/datafs /mnt/datafs

7.1.12 MPI+OpenMP shows bad performance

MVAPICH uses CPU affinity to improve performance for single-threaded programs. For multi-threaded programs, such as those using the MPI+OpenMP model, this may cause all the threads of a process to be scheduled on the same CPU. CPU affinity should be disabled in this case to solve the problem, e.g.

$ mpirun_rsh -np 2 n1 n2 VIADEV_USE_AFFINITY=0 ./a.out

More information about CPU affinity and CPU binding can be found in Sections 9.6.5 and 9.6.6.

7.1.13 Fortran support is disabled with Sun Studio 12 compilers

Please replace the -Wl,-rpath option in the build scripts (e.g. make.mvapich.gen2) with -R when Sun Studio 12 compilers are used.

7.1.14 Other MPICH problems

Several well-known MPICH related problems on different platforms and environments have already been identified by Argonne. They are available on the MPICH patch webpage.

7.2 Troubleshooting with MVAPICH/OpenFabrics(Gen2)

In this section, we discuss the general error conditions for MVAPICH based on OpenFabrics Gen2.

7.2.1 No IB Devices found

This error is generated by MVAPICH when it cannot find any Gen2 InfiniBand devices. If you are experiencing this error, please make sure that your Gen2 installation is working correctly. You can verify this by doing the following: