MVAPICH 1.0 User and Tuning Guide

MVAPICH Team
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
http://mvapich.cse.ohio-state.edu/
Copyright ©2002-2008
Network-Based Computing Laboratory,
headed by Dr. D. K. Panda.
All rights reserved.

Last Revised: May 29, 2008

Contents

1 Overview of the Open-Source MVAPICH Project
2 How to use this User Guide?
3 MVAPICH 1.0 Features
4 Installation Instructions
 4.1 Download MVAPICH source code
 4.2 Prepare MVAPICH source code
 4.3 Getting MVAPICH source updates
 4.4 Build MVAPICH
  4.4.1 Build MVAPICH with Single-Rail configuration on OpenFabrics Gen2
  4.4.2 Build MVAPICH with Multi-Rail Configuration on OpenFabrics Gen2
  4.4.3 Build MVAPICH with OpenFabrics/Gen2 over the Unreliable Datagram Transport (Gen2-UD)
  4.4.4 Build MVAPICH with QLogic InfiniPath
  4.4.5 Build MVAPICH with Single-Rail Configuration on VAPI
  4.4.6 Build MVAPICH with Multi-Rail Configuration on VAPI
  4.4.7 Build MVAPICH with Single-Rail Configuration on uDAPL
  4.4.8 Build MVAPICH with Shared Memory Device
  4.4.9 Build MVAPICH with TCP/IPoIB
5 Usage Instructions
 5.1 Compile MPI applications
 5.2 Run MPI applications using mpirun_rsh
 5.3 Run MPI applications using SLURM
 5.4 Run MPI applications with Scalable Collectives
 5.5 Run MPI applications using shared library support
 5.6 Run MPI applications using ADIO driver for Lustre
 5.7 Run MPI applications using TotalView Debugger support
 5.8 Run MPI applications with Multi-Pathing Support for Multi-Core Architectures
 5.9 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics Gen2-IB Device)
 5.10 Run MPI applications on Multi-Rail Configurations
 5.11 Run Memory Intensive Applications on Multi-core Systems
6 Using OSU Benchmarks
7 FAQ and Troubleshooting
 7.1 General Questions and Troubleshooting
  7.1.1 How can I check what version I am using?
  7.1.2 Are fork() and system() supported?
  7.1.3 My application cannot pass MPI_Init
  7.1.4 My application hangs/aborts in Collectives
  7.1.5 Building MVAPICH with g77/gfortran
  7.1.6 Running MPI programs built with gfortran
  7.1.7 Timeout During Debugging
  7.1.8 Unexpected exit status
  7.1.9 /usr/bin/env: mpispawn: No such file or directory
   7.1.10 io: No such file or directory?
   7.1.11 My program segfaults with: File locking failed in ADIOI_Set_lock?
  7.1.12 MPI+OpenMP shows bad performance
  7.1.13 Fortran support is disabled with Sun Studio 12 compilers
  7.1.14 Other MPICH problems
 7.2 Troubleshooting with MVAPICH/OpenFabrics(Gen2)
  7.2.1 No IB Devices found
  7.2.2 Error getting HCA Context
  7.2.3 CQ or QP Creation failure
  7.2.4 No Active Port found
  7.2.5 Couldn’t modify SRQ limit
  7.2.6 Got completion with error code 12
  7.2.7 Hang with VIADEV_USE_LMC=1
  7.2.8 Failure with Automatic Path Migration
  7.2.9 Problems with mpirun_rsh
 7.3 Troubleshooting with MVAPICH/VAPI
  7.3.1 Cannot Open HCA
  7.3.2 Cannot include vapi.h
  7.3.3 Program aborts with VAPI_RETRY_EXEC_ERROR
  7.3.4 ld:multiple definitions of symbol _calloc error on MacOS
  7.3.5 No Fortran interface on the MacOS platform
 7.4 Troubleshooting with MVAPICH/uDAPL
  7.4.1 Cannot Open IA
  7.4.2 DAT Insufficient Resource
  7.4.3 Cannot find libdat.so
  7.4.4 Cannot compile with DAPL-v2.0
 7.5 Troubleshooting with MVAPICH/QLogic InfiniPath
  7.5.1 Low Bandwidth
  7.5.2 Cannot find -lpsm_infinipath
  7.5.3 Mandatory variables not set
  7.5.4 Can’t open /dev/ipath, Network Down
  7.5.5 No ports available on /dev/ipath
8 Tuning and Scalability Features for Large Clusters
 8.1 Job Launch Tuning
 8.2 Network Point-to-point Tuning
  8.2.1 Shared Receive Queue (SRQ) Tuning
  8.2.2 On-Demand Connection Tuning
  8.2.3 Adaptive RDMA Tuning
 8.3 Shared Memory Point-to-point Tuning
 8.4 Scalable Collectives Tuning
9 MVAPICH Parameters
 9.1 Job Launch Parameters
  9.1.1 MT_DEGREE
  9.1.2 MPIRUN_TIMEOUT
 9.2 InfiniBand HCA and Network Parameters
  9.2.1 VIADEV_DEVICE
  9.2.2 VIADEV_DEFAULT_PORT
  9.2.3 VIADEV_MAX_PORTS
  9.2.4 VIADEV_USE_MULTIHCA
  9.2.5 VIADEV_USE_MULTIPORT
  9.2.6 VIADEV_USE_LMC
  9.2.7 VIADEV_DEFAULT_MTU
  9.2.8 VIADEV_USE_APM
  9.2.9 VIADEV_USE_APM_TEST
 9.3 Memory Usage and Performance Control Parameters
  9.3.1 VIADEV_NUM_RDMA_BUFFER
  9.3.2 VIADEV_VBUF_TOTAL_SIZE
  9.3.3 VIADEV_RNDV_PROTOCOL
  9.3.4 VIADEV_RENDEZVOUS_THRESHOLD
  9.3.5 VIADEV_MAX_RDMA_SIZE
  9.3.6 VIADEV_R3_NOCACHE_THRESHOLD
  9.3.7 VIADEV_VBUF_POOL_SIZE
  9.3.8 VIADEV_VBUF_SECONDARY_POOL_SIZE
  9.3.9 VIADEV_USE_DREG_CACHE
  9.3.10 LAZY_MEM_UNREGISTER
  9.3.11 VIADEV_NDREG_ENTRIES
  9.3.12 VIADEV_DREG_CACHE_LIMIT
  9.3.13 VIADEV_VBUF_MAX
  9.3.14 VIADEV_ON_DEMAND_THRESHOLD
  9.3.15 VIADEV_MAX_INLINE_SIZE
  9.3.16 VIADEV_NO_INLINE_THRESHOLD
  9.3.17 VIADEV_USE_BLOCKING
  9.3.18 VIADEV_ADAPTIVE_RDMA_LIMIT
  9.3.19 VIADEV_ADAPTIVE_RDMA_THRESHOLD
  9.3.20 VIADEV_ADAPTIVE_ENABLE_LIMIT
  9.3.21 VIADEV_SQ_SIZE
 9.4 Send/Receive Control Parameters
  9.4.1 VIADEV_CREDIT_PRESERVE
  9.4.2 VIADEV_CREDIT_NOTIFY_THRESHOLD
  9.4.3 VIADEV_DYNAMIC_CREDIT_THRESHOLD
  9.4.4 VIADEV_INITIAL_PREPOST_DEPTH
  9.4.5 VIADEV_USE_SHARED_MEM
  9.4.6 VIADEV_PROGRESS_THRESHOLD
  9.4.7 VIADEV_USE_COALESCE
  9.4.8 VIADEV_USE_COALESCE_SAME
  9.4.9 VIADEV_COALESCE_THRESHOLD_SQ
  9.4.10 VIADEV_COALESCE_THRESHOLD_SIZE
 9.5 SRQ (Shared Receive Queue) Control Parameters
  9.5.1 VIADEV_USE_SRQ
  9.5.2 VIADEV_SRQ_MAX_SIZE
  9.5.3 VIADEV_SRQ_SIZE
  9.5.4 VIADEV_SRQ_LIMIT
  9.5.5 VIADEV_MAX_R3_OUST_SEND
  9.5.6 VIADEV_SRQ_ZERO_POST_MAX
  9.5.7 VIADEV_MAX_R3_PENDING_DATA
 9.6 Shared Memory Control Parameters
  9.6.1 VIADEV_SMP_EAGERSIZE
  9.6.2 VIADEV_SMPI_LENGTH_QUEUE
  9.6.3 SMP_SEND_BUF_SIZE
  9.6.4 VIADEV_SMP_NUM_SEND_BUFFER
  9.6.5 VIADEV_USE_AFFINITY
  9.6.6 VIADEV_CPU_MAPPING
 9.7 Multi-Rail Usage Parameters
  9.7.1 STRIPING_THRESHOLD
  9.7.2 NUM_QP_PER_PORT
  9.7.3 NUM_PORTS
  9.7.4 NUM_HCAS
  9.7.5 SM_SCHEDULING
  9.7.6 LM_SCHEDULING
 9.8 Run time parameters for Collectives
  9.8.1 VIADEV_USE_SHMEM_COLL
  9.8.2 VIADEV_USE_SHMEM_BARRIER
  9.8.3 VIADEV_USE_SHMEM_ALLREDUCE
  9.8.4 VIADEV_USE_SHMEM_REDUCE
  9.8.5 VIADEV_USE_ALLGATHER_NEW
  9.8.6 VIADEV_MAX_SHMEM_COLL_COMM
  9.8.7 VIADEV_SHMEM_COLL_MAX_MSG_SIZE
  9.8.8 VIADEV_SHMEM_COLL_REDUCE_THRESHOLD
  9.8.9 VIADEV_SHMEM_COLL_ALLREDUCE_THRESHOLD
  9.8.10 VIADEV_BCAST_KNOMIAL
  9.8.11 MPIR_ALLTOALL_SHORT_MSG
  9.8.12 MPIR_ALLTOALL_MEDIUM_MSG
   9.8.13 MPIR_ALLTOALL_BASIC
  9.8.14 MPIR_ALLTOALL_MCORE_OPT
 9.9 CM Control Parameters
  9.9.1 VIADEV_CM_RECV_BUFFERS
  9.9.2 VIADEV_CM_MAX_SPIN_COUNT
  9.9.3 VIADEV_CM_TIMEOUT
 9.10 Other Parameters
  9.10.1 VIADEV_CLUSTER_SIZE
  9.10.2 VIADEV_PREPOST_DEPTH
  9.10.3 VIADEV_MAX_SPIN_COUNT
  9.10.4 VIADEV_PT2PT_FAILOVER
  9.10.5 DAPL_PROVIDER
10 MVAPICH Gen2-UD Parameters
 10.1 InfiniBand HCA and Network Parameters
  10.1.1 MV_DEVICE
  10.1.2 MV_MTU
 10.2 Reliability Parameters
  10.2.1 MV_PROGRESS_TIMEOUT
  10.2.2 MV_RETRY_TIMEOUT
  10.2.3 MV_MAX_RETRY_COUNT
  10.2.4 MV_ACK_AFTER_RECV
  10.2.5 MV_ACK_AFTER_PROGRESS
 10.3 Large Message Transfer Parameters
  10.3.1 MV_USE_UD_ZCOPY
  10.3.2 MV_UD_ZCOPY_QPS
  10.3.3 MV_UD_ZCOPY_THRESHOLD
  10.3.4 MV_USE_REG_CACHE
 10.4 Performance and General Parameters
  10.4.1 MV_USE_HEADERS
  10.4.2 MV_NUM_UD_QPS
  10.4.3 MV_RNDV_THRESHOLD
 10.5 QP and Buffer Parameters
  10.5.1 MV_UD_SQ_SIZE
  10.5.2 MV_UD_RQ_SIZE
  10.5.3 MV_UD_CQ_SIZE
  10.5.4 MV_USE_LMC
 10.6 Shared Memory Control Parameters
  10.6.1 MV_USE_SHARED_MEMORY
  10.6.2 MV_SMP_EAGERSIZE
  10.6.3 MV_SMPI_LENGTH_QUEUE
  10.6.4 SMP_SEND_BUF_SIZE
  10.6.5 MV_SMP_NUM_SEND_BUFFER
  10.6.6 MV_USE_AFFINITY

1 Overview of the Open-Source MVAPICH Project

InfiniBand is emerging as a high-performance interconnect delivering low latency and high bandwidth. It is also getting widespread acceptance due to its open standard.

MVAPICH (pronounced as “em-vah-pich”) is an open-source MPI implementation that exploits the novel features and mechanisms of InfiniBand and other RDMA-enabled interconnects to deliver performance and scalability to MPI applications. This software is developed in the Network-Based Computing Laboratory (NBCL), headed by Prof. Dhabaleswar K. (DK) Panda.

Currently, there are two versions of this MPI: MVAPICH with MPI-1 semantics and MVAPICH2 with MPI-2 semantics. This open-source MPI software project started in 2001, and a first high-performance implementation was demonstrated at the Supercomputing ’02 conference. Since then, this software has been steadily gaining acceptance in the HPC and InfiniBand community. As of 05/29/2008, more than 690 organizations (National Labs, Universities, and Industry) in 41 countries have downloaded this software from OSU’s web site directly. In addition, many IBA vendors, server vendors, and systems integrators have been incorporating MVAPICH/MVAPICH2 into their software stacks and distributing it. Several InfiniBand systems using MVAPICH have obtained positions in the TOP500 ranking. The current version of MVAPICH is also being made available with the OpenFabrics/Gen2 stack. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.

More details on MVAPICH/MVAPICH2 software, users list, sample performance numbers on a wide range of platforms and interconnect, a set of OSU benchmarks, related publications, and other InfiniBand-related projects (parallel file systems, storage, data centers) can be obtained from the following URL:

http://mvapich.cse.ohio-state.edu/

This document contains necessary information for MVAPICH users to download, install, test, use, and tune MVAPICH 1.0. As we get feedback from users and take care of bug-fixes, we introduce new patches against our released distribution and also continuously update this document. Thus, we strongly request you to refer to our web page for updates.

2 How to use this User Guide?

This guide is designed to take the user through all the steps involved in configuring, installing, running and tuning MPI applications over InfiniBand using MVAPICH-1.0.

In Section 3 we describe all the features in MVAPICH 1.0. As you read through this section, please note our new features (highlighted as NEW). Some of these features are designed to optimize specific types of MPI applications and achieve greater scalability. Section 4 describes in detail the configuration and installation steps. This section enables the user to identify specific compilation flags which can be used to turn some of the features on or off. Usage instructions for MVAPICH are explained in Section 5. Apart from describing how to run simple MPI applications, this section also talks about running MVAPICH with some of the advanced features. Section 6 describes the usage of the OSU Benchmarks. If you have any problems using MVAPICH, please check Section 7 where we list some of the common problems users face. In Section 8 we suggest some tuning techniques for multi-thousand node clusters using some of our new features. In Section 9, we list important run-time and compile-time parameters for the Gen2, VAPI, and QLogic devices, their default values, and a short description of each parameter. Finally, Section 10 lists the parameters and tuning options for the OpenFabrics/Gen2-UD device.

3 MVAPICH 1.0 Features

MVAPICH (MPI-1 over InfiniBand) is an MPI-1 implementation based on MPICH and MVICH. MVAPICH 1.0 is available as a single integrated package (with the latest MPICH 1.2.7 and MVICH).

A complete set of features of MVAPICH 1.0 are:

The MVAPICH 1.0 package and the project also includes the following provisions:

4 Installation Instructions

4.1 Download MVAPICH source code

The MVAPICH 1.0 source code package includes the latest MPICH 1.2.7 version and also the required MVICH files from LBNL. Thus, there is no need to download any other files except MVAPICH 1.0 source code.

You can go to the MVAPICH website to obtain the source code.

4.2 Prepare MVAPICH source code

Untar the archive you have downloaded from the web page using the following command. You will have a directory named mvapich-1.0 after executing this command.

$ tar xzf mvapich-1.0.tar.gz

4.3 Getting MVAPICH source updates

As we enhance and improve MVAPICH, we update the available source code on our public SVN repository. In order to obtain these updates, please install an SVN client on your machine. The latest MVAPICH sources may be obtained from the “trunk” of the SVN using the following command:

$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk

The “trunk” may contain newer features and bug fixes. However, it is likely to be lightly tested. If you are interested in obtaining stable and major bug fixes to any release version, you should update your sources from the “branch” of the SVN using the following command:

$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/branches/1.0

MVAPICH 1.0 provides support for eight different ADI devices, namely: Gen2 Single-Rail (ch_gen2), Gen2 Multi-Rail (ch_gen2_multirail), Gen2/UD (ch_gen2_ud), Shared Memory (ch_smp), VAPI Single-Rail (vapi), VAPI Multi-Rail (vapi_multirail), uDAPL (udapl), and QLogic InfiniPath (psm). Additionally, you can also configure MVAPICH over the standard TCP/IP interface and use it over IPoIB.

4.4 Build MVAPICH

There are several options to build MVAPICH 1.0 based on the underlying InfiniBand libraries you want to utilize. In this section we describe in detail the steps you need to perform to correctly build MVAPICH on your choice of InfiniBand libraries, namely OpenFabrics/Gen2, OpenFabrics/Gen2-UD, Mellanox VAPI, uDAPL, Shared Memory or QLogic InfiniPath.

In the following subsection, we describe how to build and configure the Single-Rail device. In later subsections, we describe the building and configuration of the other devices: Multi-Rail with OpenFabrics/Gen2 (4.4.2), Gen2/UD (4.4.3), InfiniPath (4.4.4), VAPI-single-rail (4.4.5), VAPI-multi-rail (4.4.6), uDAPL (4.4.7), Shared memory (4.4.8) and TCP (4.4.9).

4.4.1 Build MVAPICH with Single-Rail configuration on OpenFabrics Gen2

There are several methods to configure MVAPICH 1.0.

After setting all the parameters, the script make.mvapich.gen2 configures, builds and installs the entire package in the directory specified by the variable PREFIX.
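
For illustration, a typical sequence might look like the following (a sketch; PREFIX, CC and the other variables are set by editing the script itself, and the exact set of variables depends on your copy of make.mvapich.gen2):

$ cd mvapich-1.0
$ vi make.mvapich.gen2   (set PREFIX, CC and the path to your OpenFabrics installation)
$ ./make.mvapich.gen2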

4.4.2 Build MVAPICH with Multi-Rail Configuration on OpenFabrics Gen2

There are several methods to configure MVAPICH 1.0 with multi-rail device on OpenFabrics Gen2.

After setting all the parameters, the script make.mvapich.gen2_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.

MVAPICH provides multiple scheduling policies for communication, in the presence of multiple ports/adapters/paths with the multi-rail configuration. Please refer to 5.10 for more details.

4.4.3 Build MVAPICH with OpenFabrics/Gen2 over the Unreliable Datagram Transport (Gen2-UD)

There are several methods to configure MVAPICH 1.0.

After setting all the parameters, the script make.mvapich.gen2_ud configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.4 Build MVAPICH with QLogic InfiniPath

There are several methods to configure MVAPICH 1.0.

After setting all the parameters, the script make.mvapich.psm configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.5 Build MVAPICH with Single-Rail Configuration on VAPI

There are several methods to configure MVAPICH 1.0 on VAPI.

After setting all the parameters, the script make.mvapich.vapi configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.6 Build MVAPICH with Multi-Rail Configuration on VAPI

There are several methods to configure MVAPICH 1.0 with multi-rail device.

After setting all the parameters, the script make.mvapich.vapi_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.7 Build MVAPICH with Single-Rail Configuration on uDAPL

Before installing MVAPICH with uDAPL, please make sure you have the uDAPL library installed properly.

There are several methods to configure MVAPICH 1.0 with uDAPL.

After setting all the parameters, the script make.mvapich.udapl configures, builds and installs the entire package in the directory specified by the variable PREFIX.

4.4.8 Build MVAPICH with Shared Memory Device

In the mvapich-1.0 directory, we have provided a script make.mvapich.smp for building MVAPICH over shared memory intended for single SMP systems. The script make.mvapich.smp takes care of different platforms, compilers and architectures. By default, the compilation script uses gcc. In order to select your compiler, please set the variable CC in the script to use either Intel, PathScale or PGI compilers. The platform/architecture is detected automatically. The usage of the shared memory device can be found in 5.2.

4.4.9 Build MVAPICH with TCP/IPoIB

In the mvapich-1.0 directory, we have provided a script make.mvapich.tcp for building MVAPICH over TCP/IP intended for use over IPoIB (IP over InfiniBand). In order to select any other compiler than GCC, please set your CC variable in that script. Simply execute this script (e.g. ./make.mvapich.tcp) for completing your build.

5 Usage Instructions

This section discusses the usage methods for the various features provided by MVAPICH. If you face any problem while following these instructions, please refer to Section 7.

5.1 Compile MPI applications

Use mpicc, mpif77, mpiCC, or mpif90 to compile applications. They can be found under mvapich-1.0/bin.
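
For example, to compile the cpi and pi3f90 example programs referred to later in this guide (assuming mvapich-1.0/bin is in your PATH and the example sources are in your working directory):

$ mpicc -o cpi cpi.c
$ mpif90 -o pi3f90 pi3f90.f90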

There are several options to run MPI applications. Please select one of the following options based on your need.

5.2 Run MPI applications using mpirun_rsh

Prerequisites:

Examples of running programs using mpirun_rsh:

$ mpirun_rsh -np 4 n0 n1 n2 n3 ./cpi

The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. By default ssh is used.

$ mpirun_rsh -rsh -np 4 n0 n1 n2 n3 ./cpi

The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. rsh is used regardless of whether ssh or rsh was selected when compiling MVAPICH.

$ mpirun_rsh -np 4 -hostfile hosts ./cpi

A list of nodes is given in hosts, one per line. MPI ranks are assigned in the order the hosts are listed in the hosts file or in the order they are passed to mpirun_rsh. That is, if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, rank 0 and rank 2, whereas n1 will have ranks 1 and 3. This rank distribution is known as “cyclic”. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as “block”.

If you are using the shared memory device, then host names can be omitted:

$ mpirun_rsh -np 4 ./cpi

Many parameters of the MPI library can be easily configured at run time using environment variables. In order to pass any environment variable to the application, simply put the variable names and values just before the executable name, as in the following example:

$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./cpi

Note that the environmental variables should be put immediately before the executable.

Alternatively, you may also place environmental variables in your shell environment (e.g. .bashrc). These will be automatically picked up when the application starts executing.

Please note that there are many different parameters which could be used to improve the performance of applications depending upon their requirements from the MPI library. For a discussion on how to identify which variables may be of interest to you, please take a look at Section 8.

Other options of mpirun_rsh can be obtained using

$ mpirun_rsh --help

5.3 Run MPI applications using SLURM

SLURM is an open-source resource manager designed by Lawrence Livermore National Laboratory. SLURM software package and its related documents can be downloaded from: http://www.llnl.gov/linux/slurm/

Once SLURM is installed and the daemons are started, applications compiled with MVAPICH can be launched by SLURM, e.g.

$ srun -n2 --mpi=mvapich ./a.out

The use of SLURM enables many good features such as explicit CPU and memory binding. For example, if you have two processes and want to bind the first process to CPU 0 and Memory 0, and the second process to CPU 4 and Memory 1, then it can be achieved by:

$ srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=mvapich ./a.out

For more information about SLURM and its features please visit SLURM website.

5.4 Run MPI applications with Scalable Collectives

MVAPICH provides shared memory implementations of important collectives:
MPI_Allreduce, MPI_Reduce, MPI_Barrier and MPI_Bcast. It also has support for Enhanced MPI_Allgather. These collective operations are enabled by default. Shared Memory Collectives are supported over the Gen2, Gen2/UD, PSM and Shared Memory devices. The PSM device currently has shared memory implementations of MPI_Barrier and MPI_Bcast only.

These operations can be disabled all at once by setting VIADEV_USE_SHMEM_COLL to 0, or one at a time by using the per-collective environment variables described in Section 9.8 (for example, VIADEV_USE_SHMEM_BARRIER, VIADEV_USE_SHMEM_ALLREDUCE, VIADEV_USE_SHMEM_REDUCE, and VIADEV_USE_ALLGATHER_NEW).
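
For example, to disable all shared memory collectives for a single run:

$ mpirun_rsh -np 4 -hostfile hosts VIADEV_USE_SHMEM_COLL=0 ./a.out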

Please refer to section 9.8 for tuning the various environment variables.

5.5 Run MPI applications using shared library support

MVAPICH provides shared library support. This feature allows you to build your application on top of the MPI shared library. If you choose this option, you will still be able to compile applications with static libraries. However, by default, when shared library support is enabled, your applications will be built on top of the shared libraries automatically. The following commands provide some examples of how to build and run your application with shared library support.
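
A minimal sketch (the installation prefix /usr/local/mvapich and the lib/shared subdirectory are only examples; substitute the locations from your own build):

$ mpicc -o cpi cpi.c
$ mpirun_rsh -np 2 n0 n1 LD_LIBRARY_PATH=/usr/local/mvapich/lib/shared ./cpi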

5.6 Run MPI applications using ADIO driver for Lustre

MVAPICH contains optimized Lustre ADIO support for the OpenFabrics/Gen2 device. The Lustre directory should be mounted on all nodes on which MVAPICH processes will be running. Compile MVAPICH with ADIO support for Lustre as described in Section 4.4.1. If your Lustre mount is /mnt/datafs on nodes n0 and n1, on node n0, you can compile and run your program as follows:

$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 <path to perf>/perf -fname /mnt/datafs/testfile

If you have enabled support for multiple file systems, append the prefix "lustre:" to the name of the file. For example:

$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 ./perf -fname lustre:/mnt/datafs/testfile

5.7 Run MPI applications using TotalView Debugger support

MVAPICH provides TotalView support for the OpenFabrics/Gen2 (mpid/ch_gen2),
OpenFabrics/Gen2-UD (mpid/ch_gen2_ud), Single-rail VAPI (mpid/vapi), InfiniPath (mpid/psm) and Shared-Memory devices (mpid/ch_smp). You need to use mpirun_rsh when running TotalView. The following commands provide an example of how to build and run your application with TotalView support. Note: running TotalView demands a correct setup in your environment; if you encounter any problem with your setup, please check with your system administrator for help.
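
For illustration only (a sketch; the exact invocation depends on your TotalView installation, and the application should be compiled with -g):

$ mpicc -g -o cpi cpi.c
$ totalview mpirun_rsh -a -np 2 n0 n1 ./cpi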

5.8 Run MPI applications with Multi-Pathing Support for Multi-Core Architectures

MVAPICH provides a multi-rail device with advanced scheduling policies for data transfer (Section 5.10). However, even with the single-rail configuration, multi-pathing (multiple ports, multiple adapters, and multiple paths provided by the LMC mechanism) can be used on multi-core systems. With this support, processes executing on the same node can leverage the above configurations by binding to one of the available paths. MVAPICH provides multiple choices to the user for leveraging this functionality, illustrated in the example below. This functionality is currently available only in the single-rail gen2 device.
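
For example, using the parameters described in Sections 9.2.4-9.2.6 (a sketch; which of these applies depends on whether your nodes have multiple HCAs, multiple ports, or an LMC value configured on the subnet):

$ mpirun_rsh -np 4 n0 n0 n1 n1 VIADEV_USE_MULTIPORT=1 ./a.out
$ mpirun_rsh -np 4 n0 n0 n1 n1 VIADEV_USE_LMC=1 ./a.out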

5.9 Run MPI Application with Network Fault Tolerance Support (for OpenFabrics Gen2-IB Device)

MVAPICH supports network fault recovery by using the InfiniBand Automatic Path Migration (APM) mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.

To enable this functionality, the run-time variable VIADEV_USE_APM (Section 9.2.8) can be enabled, as shown in the following example:

$ mpirun_rsh -np 2 VIADEV_USE_APM=1 ./cpi

MVAPICH also supports testing Automatic Path Migration in the subnet in the absence of network faults. This can be controlled by using a run-time variable VIADEV_USE_APM_TEST (section 9.2.9). This should be combined with VIADEV_USE_APM as follows:

$ mpirun_rsh -np 2 VIADEV_USE_APM=1 VIADEV_USE_APM_TEST=1 ./cpi

5.10 Run MPI applications on Multi-Rail Configurations

MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Run-time parameters are provided to control these policies. They are further divided into policies for small and large messages. These policies are available in the multirail devices for gen2 and VAPI.
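
For example, to use two HCAs per node with the multi-rail device (a sketch; the scheduling policy parameters SM_SCHEDULING and LM_SCHEDULING described in Section 9.7 can be passed in the same way):

$ mpirun_rsh -np 4 n0 n0 n1 n1 NUM_HCAS=2 ./a.out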

5.11 Run Memory Intensive Applications on Multi-core Systems

Process to CPU mapping may affect application performance on multi-core systems, especially for memory intensive applications. If the number of processes is smaller than the number of CPUs/cores, it is preferable to distribute the processes across different chips to avoid memory contention, because CPUs/cores on the same chip usually share the memory controller. MVAPICH provides flexible user-defined CPU mapping. To use it, first make sure CPU affinity is set (Section 9.6.5). Then use the run-time environment variable VIADEV_CPU_MAPPING to specify the CPU/core mapping. For example, if it is a quad-core system in which cores [0-3] are on one chip and cores [4-7] are on another chip, and you need to run an application with 2 processes, then the following mapping will give the best performance:

$ mpirun_rsh -np 2 n0 n0 VIADEV_CPU_MAPPING=0,4 ./a.out

In this case process 0 will be mapped to core 0 and process 1 will be mapped to core 4.

More information about VIADEV_CPU_MAPPING can be found in Section 9.6.6.

6 Using OSU Benchmarks

If you have arrived at this point, you have successfully installed MVAPICH. Congratulations!! In the mvapich-1.0/osu_benchmarks directory, we provide five basic performance tests: a one-way latency test, a uni-directional bandwidth test, a bi-directional bandwidth test, a multiple bandwidth/message rate test, and an MPI-level broadcast latency test. You can compile and run these tests on your machines to evaluate the basic performance of MVAPICH.
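
For example, to build and run the latency test (a sketch; the exact source file names may differ in your copy of the osu_benchmarks directory):

$ mpicc -o osu_latency osu_benchmarks/osu_latency.c
$ mpirun_rsh -np 2 n0 n1 ./osu_latency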

These benchmarks as well as other benchmarks (such as for one-sided operations in MPI-2) are available on our projects’ web page. Sample performance numbers for these benchmarks on representative platforms and IBA gears are also included on our projects’ web page. You are welcome to compare your performance numbers with our numbers. If you see any big discrepancy, please let the MVAPICH community know by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.

7 FAQ and Troubleshooting

Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact the MVAPICH community by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.

MVAPICH can be used over multiple underlying InfiniBand libraries, namely OpenFabrics (Gen2), OpenFabrics (Gen2-UD), VAPI, uDAPL and QLogic InfiniPath. Based on the underlying library being utilized, the troubleshooting steps may be different. However, some of the troubleshooting hints are common to all underlying libraries. Thus, in this section, we have divided the troubleshooting tips into five parts: general troubleshooting, followed by troubleshooting for each of the underlying InfiniBand libraries (Gen2, VAPI, uDAPL, and QLogic InfiniPath).

7.1 General Questions and Troubleshooting

7.1.1 How can I check what version I am using?

Running the following command will provide you with the version of MVAPICH that is being used.

$ mpirun_rsh -v

7.1.2 Are fork() and system() supported?

fork() and system() are supported for the Gen2 and Gen2-UD devices as long as the kernel being used is Linux 2.6.16 or newer. Additionally, the version of OFED used should be 1.2 or higher. The environment variable IBV_FORK_SAFE=1 must also be set to enable fork support.
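
For example:

$ mpirun_rsh -np 2 n0 n1 IBV_FORK_SAFE=1 ./a.out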

7.1.3 My application cannot pass MPI_Init

This is a common symptom of several setup issues related to job startup. Please make sure of the following things:

7.1.4 My application hangs/aborts in Collectives

MVAPICH implements highly optimized RDMA collective algorithms for frequently used collectives such as MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Bcast, and MPI_Allgather. The optimized implementations have been well tested and tuned. However, if you face any problems in these collectives for your application, please disable the optimized collectives. For example, if you want to disable MPI_Allreduce, you can do:

$ mpirun_rsh -np 8 -hostfile hf VIADEV_USE_SHMEM_ALLREDUCE=0 ./a.out

The complete list of all such parameters is given in Section 9.8.

7.1.5 Building MVAPICH with g77/gfortran

The gfortran compiler can be used for F77 and F90. In order to make this work, the following environment variables should be set prior to running the build script:

$ export F77=gfortran
$ export F90=gfortran
$ export F77_GETARGDECL=" "

If g77 and gfortran are used together for F77 and F90 respectively, it might be necessary to set the following environment variable in order to get around possible compatibility issues:

$ export F90FLAGS="-ff2c"

7.1.6 Running MPI programs built with gfortran

MPI programs built with gfortran might not appear to run correctly due to the default output buffering used by gfortran. If it seems there is an issue with program output, the GFORTRAN_UNBUFFERED_ALL variable can be set to “y” when using mpirun_rsh to fix the problem. Running the pi3f90 example program using this variable setting is shown below:

$ mpirun_rsh -np 2 n1 n2 GFORTRAN_UNBUFFERED_ALL=y ./pi3f90

7.1.7 Timeout During Debugging

If a debug session is terminated with an alarm message, mpirun_rsh may have timed out waiting for the job launch to complete. Use a larger MPIRUN_TIMEOUT (section 9.1.2) to work around this problem.

7.1.8 Unexpected exit status

If an application task terminates unexpectedly during job launch, mpirun_rsh may print the message:

mpispawn.c:303 Unexpected exit status

This usually indicates a problem with the application. Other error messages around this (if any) might point to the actual issue.

7.1.9 /usr/bin/env: mpispawn: No such file or directory

If mpirun_rsh fails with this error message, it was unable to locate a necessary utility. This can be fixed by ensuring that all MVAPICH executables are in the PATH on all nodes.

If PATHs cannot be set up as mentioned, then invoke mpirun_rsh with a path. For example:

$ /path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc

or

$ ../../path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc

7.1.10 io: No such file or directory?

If you are using ADIO support for Lustre, please make sure that:
– Lustre is set up correctly, and that you are able to create, read from and write to files in the Lustre mounted directory.
– The Lustre directory is mounted on all nodes on which MVAPICH processes with ADIO support for Lustre are running.
– The path to the file is correctly specified.
– The permissions for the file or directory are correctly specified.

7.1.11 My program segfaults with: File locking failed in ADIOI_Set_lock?

If you are using ADIO support for Lustre, recent Lustre releases require an additional mount option to obtain correct file locks. Please include the option "-o localflock" with your Lustre mount command. For example:

$ mount -o localflock -t lustre xxxx@o2ib:/datafs /mnt/datafs

7.1.12 MPI+OpenMP shows bad performance

MVAPICH uses CPU affinity to obtain better performance for single-threaded programs. For multi-threaded programs, such as those using the MPI+OpenMP model, it may schedule all the threads of a process to run on the same CPU. CPU affinity should be disabled in this case to solve the problem, e.g.

$ mpirun_rsh -np 2 n1 n2 VIADEV_USE_AFFINITY=0 ./a.out

More information about CPU affinity and CPU binding can be found in Sections 9.6.5 and 9.6.6.

7.1.13 Fortran support is disabled with Sun Studio 12 compilers

Please replace the -Wl,-rpath option in the build scripts (e.g. make.mvapich.gen2) with -R when Sun Studio 12 compilers are used.

7.1.14 Other MPICH problems

Several well-known MPICH related problems on different platforms and environments have already been identified by Argonne. They are available on the MPICH patch webpage.

7.2 Troubleshooting with MVAPICH/OpenFabrics(Gen2)

In this section, we discuss the general error conditions for MVAPICH based on OpenFabrics Gen2.

7.2.1 No IB Devices found

This error is generated by MVAPICH when it cannot find any Gen2 InfiniBand devices. If you are experiencing this error, please make sure that your Gen2 installation is correct. You can verify this as follows:

$ locate libibverbs

This tells you if you have installed libibverbs (the Gen2 verbs layer) or not. By default it installs in /usr/local.

If you have installed libibverbs, then please check if the OpenFabrics Gen2 drivers are loaded. You can do so by:

$ lsmod | grep ib

If this command does not list ib_uverbs, then probably you haven’t started all OpenFabrics Gen2 services. Please refer to the OpenFabrics Wiki installation cheat sheet for more details on setting up the OpenFabrics Gen2 stack.

7.2.2 Error getting HCA Context

This error is generated when MVAPICH cannot “open” the HCA (or the InfiniBand communication device). Please execute:

$ ls -l /dev/infiniband

If this command does not show a device such as uverbs0 with read/write permissions for users, as shown below, please consult the “Loading kernel components” section of the OpenFabrics Wiki installation cheat sheet.

crw-rw-rw- 1 root root 231, 192 Feb 24 14:31 uverbs0

7.2.3 CQ or QP Creation failure

If you encounter this error, then you need to set the maximum available locked memory value for your system. The usual Linux defaults are quite low compared to what is required for HPC applications. One way to do this is to edit the file /etc/security/limits.conf and enter the following line:

* soft memlock phys_mem_size_in_KB

where phys_mem_size_in_KB is the MemTotal value reported by /proc/meminfo. In addition, you need to enter the following line in /etc/init.d/sshd and then restart sshd.

ulimit -l phys_mem_size_in_KB
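
As a quick sanity check (n0 is an example node name), you can verify the limit seen by remotely launched processes with:

$ ssh n0 ulimit -l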

7.2.4 No Active Port found

MVAPICH generates this error when it cannot find any port active for the specific HCA being used for communication. This probably means that the ports are not configured to be a part of the InfiniBand subnet and thus are not “Active”. You can check whether the port is active or not, by using the following command:

$ ibstat

Please look at the “State” field for the status of the port being used. To bring a port to “Active” status, on any node in the same InfiniBand subnet, execute the following command:

# opensm -o 1

Please note that you need superuser privilege for this command. This command invokes the InfiniBand subnet manager (OpenSM) and asks it to sweep the subnet once and make all ports “Active”. OpenSM is usually installed in /usr/local/bin.

7.2.5 Couldn’t modify SRQ limit

This means that your HCA doesn’t support the ibv_modify_srq feature. Please upgrade the firmware version and OpenFabrics Gen2 libraries on your cluster. You can obtain the latest Mellanox firmware images from this webpage.

Even after updating your firmware and OpenFabrics Gen2 libraries, if you continue to experience this problem, please edit make.mvapich.gcc and replace -DMEMORY_SCALE with -DADAPTIVE_RDMA_FAST_PATH. After making this change you need to re-build the MVAPICH library. Note that you should first try to update your firmware and OpenFabrics Gen2 libraries before taking this measure.

If you believe that your HCA supports this feature and yet you are experiencing this problem, please contact the MVAPICH community by posting a note to mvapich-discuss@cse.ohio-state.edu mailing list.

7.2.6 Got completion with error code 12

The error code 12 indicates that the InfiniBand HCA has given up after attempting to send the packet after several tries. This can be caused by either loose or faulty cables. Please check the InfiniBand connectivity of your cluster. Additionally, you may check the error rates at the respective HCAs using:

$ ibchecknet

This utility (usually installed in /usr/local/bin) sweeps the InfiniBand subnet and reports ports that are OK or if they have errors. You may try to quiesce the entire cluster and bring it up after an InfiniBand switch reboot.

7.2.7 Hang with VIADEV_USE_LMC=1

The VIADEV_USE_LMC parameter allows the usage of multiple paths for multi-core and multi-way SMP systems, provided the subnet manager is set up with LMC support (Section 9.2.6). The subnet manager allows different routing engines to be used (the Min-Hop routing algorithm by default). We have noticed hangs using this parameter with the Up/Down routing algorithm of OpenSM. There are two ways to fix this problem:

7.2.8 Failure with Automatic Path Migration

MVAPICH provides network fault tolerance with Automatic Path Migration (APM). However, APM is supported only from OFED 1.2 onwards. With OFED 1.1 and prior versions of OpenFabrics drivers, APM functionality is not completely supported. Please refer to Sections 9.2.8 and 9.2.9.

7.2.9 Problems with mpirun_rsh

MVAPICH 1.0 provides a new, more scalable startup procedure by default. If for some reason the old version is desired, it can be enabled using the -legacy flag to mpirun_rsh.

$ mpirun_rsh -legacy ...

7.3 Troubleshooting with MVAPICH/VAPI

7.3.1 Cannot Open HCA

The above error reports that the InfiniBand Adapter is not ready for communication. Make sure that the drivers are up. This can be done by executing:

$ locate libvapi

This command gives the path at which the drivers are set up. Additionally, you may try to use the command vstat to check the availability of HCAs.

$ vstat

7.3.2 Cannot include vapi.h

This error is generated during compilation, if the correct path to the InfiniBand library installation is not given.

Please setup the environment variable MTHOME as

$ export MTHOME=/usr/local/ibgd/driver/infinihost

If the problem persists, please contact your system administrator or reinstall your copy of IBGD. You can get IBGD from Mellanox website.

7.3.3 Program aborts with VAPI_RETRY_EXEC_ERROR

This error usually indicates that all InfiniBand links the MPI application is trying to use are not in the PORT_ACTIVE state. Please make sure that all ports show PORT_ACTIVE with the VAPI utility vstat. If you are using Multi-Rail support, please keep in mind that all ports of all adapters you are using need to show PORT_ACTIVE.

7.3.4 ld:multiple definitions of symbol _calloc error on MacOS

Please make sure that the environment variable "MAC_OSX" is set before configuring. If you use manual configuration instead of make.mvapich.macosx, you must configure MVAPICH in the following way:

$ export MAC_OSX=yes; ./configure; make; make install

If you encounter this problem while compiling your own applications, as shown below, it is likely that you have explicitly included "-lm". You should remove it.

"ld: multiple definitions of symbol _calloc
/usr/lib/libm.dylib(malloc.So) definition of _calloc
/tmp/mvapich-0.9.5/mvapich/lib/libmpich.a(dreg-g5.o)
definition of _calloc in section (__TEXT,__text)
ld: multiple definitions of symbol _free
/usr/lib/libm.dylib(malloc.So) definition of _free
/tmp/mvapich-0.9.5/mvapich/lib/libmpich.a(dreg-g5.o)
definition of _free in section (__TEXT,__text) "

7.3.5 No Fortran interface on the MacOS platform

To enable Fortran support, you need to install the IBM compiler (a 60-day free trial version is available from IBM).

Once you unpack the tar ball, you can customize and use make.mvapich.vapi to compile and install the package or manually configure, compile and install the package.

7.4 Troubleshooting with MVAPICH/uDAPL

7.4.1 Cannot Open IA

If you configure MVAPICH 1.0 with uDAPL and see this error, you need to check whether you have specified the correct uDAPL service provider. If you have specified the uDAPL provider but still see this error, you need to check whether the specified network is working or not.

In addition, please check the contents of the file /etc/dat.conf. It should contain the name of the IA e.g. ib0. A typical entry would look like the following:

ib0 u1.2 nonthreadsafe default /usr/lib/libdapl.so ri.1.1 "mthca0 1" ""

7.4.2 DAT Insufficient Resource

If you configure MVAPICH 1.0 with uDAPL and see this error, you need to reduce the value of the environmental variable RDMA_DEFAULT_MAX_WQE depending on the underlying network.

7.4.3 Cannot find libdat.so

If you get the error: “error while loading shared libraries, libdat.so”, the location of the dat shared library is incorrect. You need to find the correct path of libdat.so and export LD_LIBRARY_PATH to this correct location. For example:

$ export LD_LIBRARY_PATH=/path/to/libdat.so:$LD_LIBRARY_PATH

$ mpirun_rsh -np 2 n0 n1 ./a.out

7.4.4 Cannot compile with DAPL-v2.0

MVAPICH 1.0 currently does not support DAPL-v2.0. You need to build MVAPICH against DAPL-v1.2 library. Support for DAPL-v2.0 is available in MVAPICH2.

7.5 Troubleshooting with MVAPICH/QLogic InfiniPath

7.5.1 Low Bandwidth

Incorrect settings of MTRR mapping may result in achieving a low bandwidth with InfiniPath hardware. To alleviate this situation, BIOS settings for MTRR mapping may be edited to “Discrete”. For further details, please refer to the InfiniPath User Guide.

7.5.2 Cannot find -lpsm_infinipath

This error indicates that the variable IBHOME_LIB in the make.mvapich.psm file does not point to the correct location. IBHOME_LIB should point to the directory containing the InfiniPath device libraries. By default, they are installed in /usr/lib or /usr/lib64.

7.5.3 Mandatory variables not set

IBHOME, PREFIX, CC and F77 are mandatory variables required by the installation script and must be set in the file make.mvapich.psm.
– IBHOME: directory which contains the InfiniPath header file include directory. By default, the InfiniPath include directory is under /usr.
– PREFIX: directory where MVAPICH should be installed.
– CC: C compiler. Typically set to gcc.
– F77: Fortran compiler. Typically set to g77.

7.5.4 Can’t open /dev/ipath, Network Down

This probably means that the ports are not configured to be a part of the InfiniBand subnet and thus are not “Active”. You can check whether the port is active or not, by using the following command on that node:

$ ipath_control -i

Please look at the “Status” field for the status of the port being used. To bring a port to “Active” status, on any node in the same InfiniBand subnet, execute the following command:

# opensm -o

Please note that you may need superuser privilege for this command. This command invokes the InfiniBand subnet manager (OpenSM) and asks it to sweep the subnet once and make all ports “Active”. OpenSM is usually installed in /usr/local/bin. You may also look at the file /sys/bus/pci/drivers/ib_ipath/status_str to verify that the InfiniPath software is loaded correctly. For details, please refer to the InfiniPath user guide, downloadable from www.qlogic.com.

7.5.5 No ports available on /dev/ipath

This is a limitation of InfiniPath Release 2.0. By default it allows a maximum of eight processes per QHT7140 HCA and four processes with QLE7140 HCA. To overcome this, please consult your InfiniPath support provider.

8 Tuning and Scalability Features for Large Clusters

MVAPICH supports many different parameters for tuning and extracting the best performance for a wide range of platforms and applications. These parameters can be either compile time parameters or run time parameters. Please refer to section 9 for a complete description of all the parameters. In this section we classify these parameters depending on what you are tuning for and provide guidelines on how to use them.

8.1 Job Launch Tuning

MVAPICH 1.0 has a new, scalable mpirun_rsh which uses a tree based mechanism to spawn processes. The degree of this tree is determined dynamically to keep the depth low. For large clusters, it might be beneficial to further flatten the tree by specifying a higher degree. The degree can be overridden with the environment variable MT_DEGREE (section 9.1.1).
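
For example, to use a flatter tree on a large cluster (the value 32 is only illustrative):

$ export MT_DEGREE=32
$ mpirun_rsh -np 1024 -hostfile hosts ./a.out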

8.2 Network Point-to-point Tuning

In MVAPICH we use Shared Receive Queue (SRQ) support which consumes less memory than other methods. It can lead to a significant reduction in the memory footprint of MVAPICH.

To enable this mode, please include -DMEMORY_SCALE in your make.mvapich.gcc (it is included by default). Once you have enabled the scalable memory mode in MVAPICH, there are four aspects by which you can customize the memory usage and performance ratio according to the needs of your cluster.

8.2.1 Shared Receive Queue (SRQ) Tuning

The main environmental parameters controlling the behavior of the Shared Receive Queue design are:

Starting with 1.0, MVAPICH dynamically re-sizes the number of buffers used for the SRQ by default. The parameter VIADEV_SRQ_MAX_SIZE is the maximum size of the Shared Receive Queue (default 4096). You may increase this value to 8192 if the application requires a very large number of processes (8K and beyond). The application will start by using only VIADEV_SRQ_SIZE buffers (default 256) and will double this value on every SRQ limit event (up to VIADEV_SRQ_MAX_SIZE). For long running applications this re-sizing should show little effect. If needed, VIADEV_SRQ_SIZE can be increased to 1024 or higher.

VIADEV_SRQ_LIMIT defines the low watermark for the flow control handler. This can be reduced if your aim is to reduce the number of interrupts.

VIADEV_VBUF_POOL_SIZE is the size of a fixed pool of vbufs. These vbufs can be shared among all the connections depending on the communication needs of each connection. You may want to increase this number for large scale clusters (4K processes and beyond).
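
For instance, on a very large run both limits can be raised at run time (the values follow the guidance above and are only a starting point):

$ mpirun_rsh -np 8192 -hostfile hosts VIADEV_SRQ_MAX_SIZE=8192 VIADEV_VBUF_POOL_SIZE=8192 ./a.out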

8.2.2 On-Demand Connection Tuning

The major environmental variables controlling the behavior of the connection management in MVAPICH are:

VIADEV_CM_RECV_BUFFERS is the number of buffers used by the connection manager to establish new connections. These buffers are very small (around 20 bytes) and they are shared among all InfiniBand connections, so this value may be increased to 8192 for large clusters to avoid retries in case of packet drops.

VIADEV_CM_MAX_SPIN_COUNT is the number of times the connection manager polls for new incoming connections. This may be increased to reduce the interrupt overhead when a lot of incoming connections are started at the same time.

VIADEV_CM_TIMEOUT is the timeout value associated with connection request messages on the UD channel. Decreasing this may lead to faster retries, but at the cost of generating duplicate messages. Similarly, increasing this may lead to slower retries, but a lower chance of duplicate messages.
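
For example (the value follows the guidance above for large clusters):

$ mpirun_rsh -np 2048 -hostfile hosts VIADEV_CM_RECV_BUFFERS=8192 ./a.out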

8.2.3 Adaptive RDMA Tuning

MVAPICH implements dynamic allocation and utilization of the RDMA mechanism for short messages. It can lead to a significant reduction in the memory footprint of MVAPICH.

There are two environmental parameters:

These two parameters control the behavior of this dynamic scheme.
VIADEV_ADAPTIVE_RDMA_LIMIT controls the maximum number of processes for which the “fast” RDMA buffers are allocated. For very large scale clusters, it is suggested to set this value to -1, which means RDMA buffers will be allocated for log(n) number of connections (where n is the number of processes in the application).
VIADEV_ADAPTIVE_RDMA_THRESHOLD is the number of messages exchanged per connection before RDMA buffers are allocated for that connection. For very large scale clusters, it is suggested that this value be increased so that only very frequently communicating connections allocate RDMA buffers.

In addition, the following parameters are also important in tuning the memory requirement: VIADEV_VBUF_TOTAL_SIZE (9.3.2) and VIADEV_NUM_RDMA_BUFFER (9.3.1).

The product of VIADEV_VBUF_TOTAL_SIZE and VIADEV_NUM_RDMA_BUFFER is generally a measure of the amount of memory registered for eager message passing. These buffers are not shared across connections.

To provide the best performance (latency/bandwidth) to memory ratio, we have chosen a set of default values for these parameters. These parameters often depend on the execution platform. To use preset values for small, medium and large clusters (1-64, 64-256, and 256 and above processes), please set VIADEV_CLUSTER_SIZE (9.10.1) to SMALL, MEDIUM or LARGE, respectively.
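
For example, on a very large cluster either of the following may be used (a sketch; see Sections 9.3.18-9.3.20 and 9.10.1):

$ mpirun_rsh -np 4096 -hostfile hosts VIADEV_ADAPTIVE_RDMA_LIMIT=-1 ./a.out
$ mpirun_rsh -np 4096 -hostfile hosts VIADEV_CLUSTER_SIZE=LARGE ./a.out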

8.3 Shared Memory Point-to-point Tuning

MVAPICH uses a shared memory communication channel to achieve high-performance message passing among processes that are on the same physical node. The two main parameters used for tuning shared memory performance for small messages are VIADEV_SMPI_LENGTH_QUEUE (9.6.2) and VIADEV_SMP_EAGERSIZE (9.6.1). The two main parameters used for tuning shared memory performance for large messages are SMP_SEND_BUF_SIZE (9.6.3) and VIADEV_SMP_NUM_SEND_BUFFER (9.6.4).

VIADEV_SMPI_LENGTH_QUEUE is the size of the shared memory buffer which is used to store outstanding small and control messages. VIADEV_SMP_EAGERSIZE defines the switch point from the Eager protocol to the Rendezvous protocol.

Messages larger than VIADEV_SMP_EAGERSIZE are packetized and sent out in a pipelined manner. SMP_SEND_BUF_SIZE is the packet size, i.e. the send buffer size. VIADEV_SMP_NUM_SEND_BUFFER is the number of send buffers. Shared memory communication can be disabled at run time by the parameter VIADEV_USE_SHARED_MEM (9.4.5).
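
For example, to disable the shared memory channel at run time:

$ mpirun_rsh -np 2 n0 n0 VIADEV_USE_SHARED_MEM=0 ./a.out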

Performance of some applications is sensitive to the rank distribution, depending on their communication pattern. It is advisable that the processes that communicate most use the shared memory path, since it offers lower latencies compared to the network path. To adjust the process rank distribution, please refer to Section 5.2 to decide whether the “cyclic” or “block” distribution suits the communication pattern of your application. In particular, we have found that the performance of HPL (Linpack) is better when using the “block” distribution.

8.4 Scalable Collectives Tuning

MVAPICH uses shared memory to get the best performance for many collective operations: MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Bcast.

The important parameters for tuning these collectives are as follows. For MPI_Allreduce, the optimized shared memory algorithm is used for message sizes up to VIADEV_SHMEM_COLL_ALLREDUCE_THRESHOLD (9.8.9).

Similarly for MPI_Reduce the corresponding threshold is
VIADEV_SHMEM_COLL_REDUCE_THRESHOLD (9.8.8).

For MPI_Bcast, the important parameter is the degree of the tree used for inter-node data movement. This parameter is VIADEV_BCAST_KNOMIAL (9.8.10).

For MPI_Alltoall, the two main parameters are MPIR_ALLTOALL_SHORT_MSG (9.8.11) and MPIR_ALLTOALL_MEDIUM_MSG (9.8.12). There are three main algorithms used for MPI_Alltoall: short message, medium message and long message. The short message algorithm is used for messages up to MPIR_ALLTOALL_SHORT_MSG, and the medium message algorithm is used for messages up to MPIR_ALLTOALL_MEDIUM_MSG. These thresholds can be tuned appropriately to get the best performance.
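
For example (a sketch; the threshold is assumed here to be specified in bytes, and the value 2048 is only illustrative):

$ mpirun_rsh -np 16 -hostfile hosts MPIR_ALLTOALL_SHORT_MSG=2048 ./a.out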

9 MVAPICH Parameters

9.1 Job Launch Parameters

9.1.1 MT_DEGREE

The degree of the hierarchical tree used by mpirun_rsh. By default, mpirun_rsh uses a value that tries to keep the depth of the tree to 4. Note that unlike most other parameters described in this section, this is an environment variable that has to be set in the runtime environment (e.g. through export in the bash shell).

9.1.2 MPIRUN_TIMEOUT

The number of seconds after which mpirun_rsh aborts job launch. Note that unlike most other parameters described in this section, this is an environment variable that has to be set in the runtime environment (e.g. through export in the bash shell).

9.2 InfiniBand HCA and Network Parameters

9.2.1 VIADEV_DEVICE

Name of the InfiniBand device. e.g. mthca0, mthca1 or ehca0 (for IBM ehca).

9.2.2 VIADEV_DEFAULT_PORT

The default port on the InfiniBand device to be used for communication.

9.2.3 VIADEV_MAX_PORTS

This variable allows changing the maximum number of supported ports per adapter.

9.2.4 VIADEV_USE_MULTIHCA

This variable allows a user to bind processes on a node to ports attached to different HCAs on the node. It allows an efficient utilization of HCA ports in a round-robin fashion. VIADEV_MULTIHCA is an alias for this variable for backward compatibility. However, if VIADEV_USE_MULTIHCA is defined, the value of VIADEV_MULTIHCA will be overridden.

9.2.5 VIADEV_USE_MULTIPORT

This variable allows a user to bind processes on a node to different HCA ports on the node. It allows an efficient utilization of HCA ports in a round-robin fashion. VIADEV_MULTIPORT is an alias for this variable for backward compatibility. However, if VIADEV_USE_MULTIPORT is defined, the value of VIADEV_MULTIPORT will be overridden.

9.2.6 VIADEV_USE_LMC

This variable allows the usage of multiple paths between end nodes for multi-core/multi-way SMP systems. The path selection is on the basis of source and destination ranks.

9.2.7 VIADEV_DEFAULT_MTU

The internal MTU used for IB. This parameter should be a string instead of an integer. Valid values are: MTU256, MTU512, MTU1024, MTU2048, MTU4096.

9.2.8 VIADEV_USE_APM

This parameter is used for recovery from network faults using Automatic Path Migration. This functionality is beneficial in the presence of multiple paths in the network, which can be enabled by using the LMC mechanism.

9.2.9 VIADEV_USE_APM_TEST

This parameter is used for testing the Automatic Path Migration functionality. It periodically makes the alternate path the primary path of communication and loads another alternate path.

9.3 Memory Usage and Performance Control Parameters

9.3.1 VIADEV_NUM_RDMA_BUFFER

The number of RDMA buffers used for the RDMA fast path. This fast path is used to reduce latency and overhead of small data and control messages. This value is effective only when macro RDMA_FAST_PATH or ADAPTIVE_RDMA_FAST_PATH is defined.

9.3.2 VIADEV_VBUF_TOTAL_SIZE

This macro defines the size of each vbuf.

Different presets for this value are available for different sizes of clusters: VIADEV_CLUSTER_SIZE = (SMALL, MEDIUM, LARGE, AUTO).

9.3.3 VIADEV_RNDV_PROTOCOL

This parameter chooses the underlying Rendezvous protocol

Options are:

NOTE: ASYNC is only available if the library was compiled with the -DASYNC CFLAG (not defined by default)

9.3.4 VIADEV_RENDEZVOUS_THRESHOLD

This specifies the switch point between eager and rendezvous protocol in MVAPICH.

9.3.5 VIADEV_MAX_RDMA_SIZE

Maximum size of an RDMA put message (RPUT) in the rendezvous protocol. Note that this variable should be set in bytes.

9.3.6 VIADEV_R3_NOCACHE_THRESHOLD

This is the message size (in bytes) which will be sent using the R3 mode if the registration cache is turned off, i.e. VIADEV_USE_DREG_CACHE=0

9.3.7 VIADEV_VBUF_POOL_SIZE

The number of vbufs in the initial pool. This pool is shared among all the connections.

9.3.8 VIADEV_VBUF_SECONDARY_POOL_SIZE

The number of vbufs allocated each time the initial (global) pool runs out. These secondary allocations are also shared among all the connections.

9.3.9 VIADEV_USE_DREG_CACHE

This indicates whether registration cache is to be used or not. The registration cache speeds up zero copy operations if user memory is re-used many times.

9.3.10 LAZY_MEM_UNREGISTER

Memory registration cache will be used if this flag is defined. We recommend not to use this flag for vapi and vapi_multirail devices.

9.3.11 VIADEV_NDREG_ENTRIES

This defines the total number of buffers that can be stored in the registration cache. A larger value will lead to more infrequent lazy de-registration.

9.3.12 VIADEV_DREG_CACHE_LIMIT

This sets a limit on the number of pages kept registered by the registration cache. If you set it to 0, that implies no limits on the number of pages registered.

9.3.13 VIADEV_VBUF_MAX

Max (total) number of VBUFs to allocate after which the process terminates with a fatal error. -1 means no limit.

9.3.14 VIADEV_ON_DEMAND_THRESHOLD

Number of processes beyond which on-demand connection management will be used.

9.3.15 VIADEV_MAX_INLINE_SIZE

Maximum size of a message (in bytes) that may be sent INLINE with the message descriptor. Lowering this increases message latency, but can lower memory requirements. Also see VIADEV_NO_INLINE_THRESHOLD, which will override this value in some cases.

9.3.16 VIADEV_NO_INLINE_THRESHOLD

This parameter automatically changes VIADEV_MAX_INLINE_SIZE once the number of connections exceeds VIADEV_NO_INLINE_THRESHOLD. The behavior differs slightly depending on whether on-demand connection setup is used.

9.3.17 VIADEV_USE_BLOCKING

Use blocking mode progress, instead of polling. This allows MPI to yield CPU to other processes if there are no more incoming messages.
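
For instance, blocking progress may be useful when a node is oversubscribed with more MPI processes than cores; the host names and executable below are placeholders:

    # 8 processes sharing two nodes; the library yields the CPU while waiting for messages
    $ mpirun_rsh -np 8 n0 n0 n0 n0 n1 n1 n1 n1 VIADEV_USE_BLOCKING=1 ./a.out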

9.3.18 VIADEV_ADAPTIVE_RDMA_LIMIT

This is the maximum number of RDMA paths that will be established in the entire MPI application. A value of -1 implies that at most log(n) paths will be established, where n is the number of processes in the MPI application.

9.3.19 VIADEV_ADAPTIVE_RDMA_THRESHOLD

This is the number of messages exchanged per connection after which the RDMA path is established.

9.3.20 VIADEV_ADAPTIVE_ENABLE_LIMIT

The default value is the number of processes (np) in the application. If the job size exceeds this limit, adaptive flow will be enabled. To enable adaptive flow for any job size, set VIADEV_ADAPTIVE_ENABLE_LIMIT=0.

9.3.21 VIADEV_SQ_SIZE

To control the number of allowable outstanding send operations to the device.

9.4 Send/Receive Control Parameters

9.4.1 VIADEV_CREDIT_PRESERVE

This parameter specifies the number of credits per connection that are preserved for non-data control packets. If SRQ is not used, the default is 10.

9.4.2 VIADEV_CREDIT_NOTIFY_THRESHOLD

Flow control information is usually sent via piggybacking with other messages. This parameter is used, along with VIADEV_DYNAMIC_CREDIT_THRESHOLD, to determine when to send explicit flow control update messages.

9.4.3 VIADEV_DYNAMIC_CREDIT_THRESHOLD

Flow control information is usually sent via piggybacking with other messages. This parameter is used, along with VIADEV_CREDIT_NOTIFY_THRESHOLD, to determine when to send explicit flow control update messages.

9.4.4 VIADEV_INITIAL_PREPOST_DEPTH

This defines the initial number of pre-posted receive buffers for each connection. If communication happens on a particular connection, the number of buffers will be increased to VIADEV_PREPOST_DEPTH.

9.4.5 VIADEV_USE_SHARED_MEM

When _SMP_ is defined, shared memory communication can be disabled by setting VIADEV_USE_SHARED_MEM=0.

9.4.6 VIADEV_PROGRESS_THRESHOLD

This value determines whether additional MPI progress engine calls are made when posting send operations. If this many or more send operations are queued, progress is attempted.

9.4.7 VIADEV_USE_COALESCE

This setting turns the coalescing of messages on (1) or off (0). Leaving this feature on can help applications that make many consecutive send operations to the same host.

9.4.8 VIADEV_USE_COALESCE_SAME

If VIADEV_USE_COALESCE is enabled, this flag will enable coalescing only for messages of the same tag, communicator, and size. This also increases VIADEV_PROGRESS_THRESHOLD to 2.

9.4.9 VIADEV_COALESCE_THRESHOLD_SQ

If there are more than this number of small messages outstanding to another task, messages will be coalesced until one of the previous sends completes.

9.4.10 VIADEV_COALESCE_THRESHOLD_SIZE

Attempt to coalesce messages under this size. If this number is greater than
VIADEV_VBUF_TOTAL_SIZE, then it is set to VIADEV_VBUF_TOTAL_SIZE. This has no effect if message coalescing is turned off.

9.5 SRQ (Shared Receive Queue) Control Parameters

9.5.1 VIADEV_USE_SRQ

Indicates whether Shared Receive Queue is to be used or not. Users are strongly encouraged to use this as long as the InfiniBand software/hardware supports this feature.
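
Should SRQ need to be disabled (for example, on an IB stack without SRQ support), a minimal sketch is shown below; the host names and executable are placeholders:

    # set to 0 only if the InfiniBand software/hardware does not support SRQ
    $ mpirun_rsh -np 2 n0 n1 VIADEV_USE_SRQ=0 ./a.out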

9.5.2 VIADEV_SRQ_MAX_SIZE

This is the maximum number of work requests allowed on the Shared Receive Queue. Upon receiving an SRQ limit event, the current value of VIADEV_SRQ_SIZE is doubled or increased to VIADEV_SRQ_MAX_SIZE, whichever is smaller.

9.5.3 VIADEV_SRQ_SIZE

This is the maximum number of work requests posted to the Shared Receive Queue initially. This value will dynamically re-size up to VIADEV_SRQ_MAX_SIZE.

9.5.4 VIADEV_SRQ_LIMIT

This is the low watermark limit for the Shared Receive Queue. If the number of available work entries on the SRQ drops below this limit, the flow control will be activated.

9.5.5 VIADEV_MAX_R3_OUST_SEND

This is the maximum number of R3 packets which are outstanding when using Shared Receive Queues.

9.5.6 VIADEV_SRQ_ZERO_POST_MAX

Maximum number of unsuccessful SRQ posts that an async thread can make before going to sleep.

9.5.7 VIADEV_MAX_R3_PENDING_DATA

This is the maximum amount of R3 data that may be sent out unacknowledged.

9.6 Shared Memory Control Parameters

9.6.1 VIADEV_SMP_EAGERSIZE

This has no effect if macro _SMP_ is not defined. It defines the switch point from Eager protocol to Rendezvous protocol for intra-node communication. If macro _SMP_RNDV_ is defined, then for messages larger than VIADEV_SMP_EAGERSIZE, SMP Rendezvous protocol is used. Note that this variable should be set in KBytes.
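
An illustrative run that raises the intra-node switch point is shown below; remember that the unit is KBytes, and the host name and executable are placeholders:

    # 64 here means 64 KBytes
    $ mpirun_rsh -np 4 n0 n0 n0 n0 VIADEV_SMP_EAGERSIZE=64 ./a.out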

9.6.2 VIADEV_SMPI_LENGTH_QUEUE

This has no effect if macro _SMP_ is not defined. It defines the size of shared buffer between every two processes on the same node for transferring messages smaller than or equal to VIADEV_SMP_EAGERSIZE. Note that this variable should be set in KBytes.

9.6.3 SMP_SEND_BUF_SIZE

This has no effect if macro _SMP_ is not defined. It defines the packet size when sending intra-node messages larger than VIADEV_SMP_EAGERSIZE. Note that this variable should be set in Bytes.

9.6.4 VIADEV_SMP_NUM_SEND_BUFFER

This has no effect if macro _SMP_ is not defined. It defines the number of internal send buffers for sending intra-node messages larger than VIADEV_SMP_EAGERSIZE.

9.6.5 VIADEV_USE_AFFINITY

Enable CPU affinity by setting VIADEV_USE_AFFINITY=1 or disable it by setting VIADEV_USE_AFFINITY=0. VIADEV_USE_AFFINITY does not take effect when _AFFINITY_ is not defined.

9.6.6 VIADEV_CPU_MAPPING

Users can specify the process-to-CPU mapping within a node, which may help applications achieve the best performance on multi-core systems. For example, if we set
VIADEV_CPU_MAPPING=0,4,1,5, then process 0 on each node will be mapped to CPU 0, process 1 will be mapped to CPU 4, process 2 will be mapped to CPU 1, and process 3 will be mapped to CPU 5. The CPU numbers should be separated by a single “,”. This parameter does not take effect when _AFFINITY_ is not defined or VIADEV_USE_AFFINITY is set to 0.
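
The mapping described above could be requested as sketched below; the host names and executable are placeholders for a two-node run with four processes per node:

    # binds processes 0-3 on each node to cores 0, 4, 1 and 5 respectively
    $ mpirun_rsh -np 8 n0 n0 n0 n0 n1 n1 n1 n1 VIADEV_CPU_MAPPING=0,4,1,5 ./a.out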

9.7 Multi-Rail Usage Parameters

9.7.1 STRIPING_THRESHOLD

For a class of messages, a user may want to use the Rendezvous protocol without striping the data across multiple ports/adapters. For messages of size equal to or above this value, the data is striped across multiple paths. This value should be at least equal to VIADEV_RENDEZVOUS_THRESHOLD; the value of STRIPING_THRESHOLD is currently set equal to VIADEV_RENDEZVOUS_THRESHOLD. For optimal performance, this value may need to be changed depending upon the multi-rail setup (i.e., the number of ports and adapters) in the system.

9.7.2 NUM_QP_PER_PORT

This parameter indicates number of queue pairs per port to be used for communication on an end node. This parameter has no effect if Multi-Rail configuration is not enabled.

9.7.3 NUM_PORTS

This parameter indicates number of ports to be used for communication per adapter on an end node. This parameter has no effect if Multi-Rail configuration is not enabled.

9.7.4 NUM_HCAS

This parameter indicates number of adapters to be used for communication on an end node. This parameter has no effect if Multi-Rail configuration is not enabled.

9.7.5 SM_SCHEDULING

To control the scheduling policy being used for small messages for Multi-Rail device. Valid policies are USE_FIRST (only use the first sub channel), ROUND_ROBIN (use subchannels in a round-robin manner) and PROCESS_BINDING (bind processes to a specific port of the HCAs). This parameter is only valid for the OpenFabrics/Gen2 Multi-Rail device.

9.7.6 LM_SCHEDULING

To control the scheduling policy being used for large messages for Multi-Rail device. Valid policies are ROUND_ROBIN (use subchannels in a round-robin manner), WEIGHTED_STRIPING (weight subchannels according to their link rates), EVEN_STRIPING (use equal weights for all subchannels), STRIPE_BLOCKING (stripe messages based on whether they are blocking or non-blocking MPI messages), ADAPTIVE_STRIPING (adaptively change the weights based on network congestion) and PROCESS_BINDING (bind processes to a specific port of the HCAs).
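
As an example for the OpenFabrics/Gen2 Multi-Rail device, the two scheduling policies could be combined as shown below; the best choice depends on the rail setup, and the host names and executable are placeholders:

    # round-robin for small messages, link-rate weighted striping for large messages
    $ mpirun_rsh -np 2 n0 n1 SM_SCHEDULING=ROUND_ROBIN LM_SCHEDULING=WEIGHTED_STRIPING ./a.out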

9.8 Run time parameters for Collectives

9.8.1 VIADEV_USE_SHMEM_COLL

To disable shmem based collectives, set this to 0.
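
For instance, shared memory collectives can be disabled for a run as sketched below; the host names and executable are placeholders:

    # disables all shared memory based collective optimizations
    $ mpirun_rsh -np 4 n0 n0 n1 n1 VIADEV_USE_SHMEM_COLL=0 ./a.out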

9.8.2 VIADEV_USE_SHMEM_BARRIER

To disable shmem based Barrier, set this to 0.

9.8.3 VIADEV_USE_SHMEM_ALLREDUCE

To disable shmem based Allreduce, set this to 0.

9.8.4 VIADEV_USE_SHMEM_REDUCE

To disable shmem based Reduce, set this to 0.

9.8.5 VIADEV_USE_ALLGATHER_NEW

To disable the new Allgather, set this to 0.

9.8.6 VIADEV_MAX_SHMEM_COLL_COMM

This parameter configures the number of communicators that can use shared memory collectives.

9.8.7 VIADEV_SHMEM_COLL_MAX_MSG_SIZE

This parameter tunes the maximum message size used for the shared memory collectives.

9.8.8 VIADEV_SHMEM_COLL_REDUCE_THRESHOLD

The shared memory Reduce is used for messages smaller than this threshold. This threshold can be tuned appropriately, but should be less than VIADEV_SHMEM_COLL_MAX_MSG_SIZE (9.8.7 above).

9.8.9 VIADEV_SHMEM_COLL_ALLREDUCE_THRESHOLD

The shared memory Allreduce is used for messages smaller than this threshold. This threshold can be tuned appropriately, but should be less than VIADEV_SHMEM_COLL_MAX_MSG_SIZE (9.8.7 above).

9.8.10 VIADEV_BCAST_KNOMIAL

To control the degree k of the k-nomial Broadcast algorithm. It should always be an integer greater than or equal to 2.

9.8.11 MPIR_ALLTOALL_SHORT_MSG

9.8.12 MPIR_ALLTOALL_MEDIUM_MSG

9.8.13 MPIR_ALLTOALL_BASIC

Turning this option on sets the MPIR_ALLTOALL_SHORT_MSG to 256 and
MPIR_ALLTOALL_MEDIUM_MSG to 32768. This setting is for dual node clusters. This parameter is not present for PSM device.

9.8.14 MPIR_ALLTOALL_MCORE_OPT

Turning this option on sets the MPIR_ALLTOALL_SHORT_MSG to 8192 and
MPIR_ALLTOALL_MEDIUM_MSG to 8192. This setting is for multi-core clusters. This parameter is not present for PSM device.

9.9 CM Control Parameters

9.9.1 VIADEV_CM_RECV_BUFFERS

To control the number of receive buffers dedicated to UD based connection manager. Each buffer is only several tens of bytes.

9.9.2 VIADEV_CM_MAX_SPIN_COUNT

9.9.3 VIADEV_CM_TIMEOUT

To control the timeout value for UD messages.

9.10 Other Parameters

9.10.1 VIADEV_CLUSTER_SIZE

This controls the preset values for vbuf size, number of RDMA buffers and Rendezvous threshold for various cluster sizes. It can be set to “SMALL” (1-64), “MEDIUM” (64-256) and “LARGE” (256 and beyond). In addition, there is an “AUTO” option which will automatically set the appropriate parameters based on number of processes in the MPI application.
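
A sketch of selecting the presets explicitly is shown below; the hostfile name ("hosts") and executable are placeholders:

    # AUTO picks the presets based on the number of processes
    $ mpirun_rsh -np 128 -hostfile hosts VIADEV_CLUSTER_SIZE=AUTO ./a.out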

9.10.2 VIADEV_PREPOST_DEPTH

This defines the number of buffers pre-posted for each connection to handle send/receive operations.

9.10.3 VIADEV_MAX_SPIN_COUNT

This parameter is only effective when blocking mode progress is used. This parameter indicates the number of polls made by MVAPICH before yielding the CPU to other applications.

9.10.4 VIADEV_PT2PT_FAILOVER

This is the memory size of RDMA-based implementations for Alltoall and Allgather after which the default point-to-point mechanism is used instead of RDMA.

9.10.5 DAPL_PROVIDER

This is to specify the underlying uDAPL library that the user would like to use if MVAPICH is built with uDAPL.
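
For example, assuming a uDAPL build, the provider could be selected as below; the provider name varies with the uDAPL installation and must match an entry in the system's dat.conf, and the host names and executable are placeholders:

    # OpenIB-cma is only a commonly seen provider name; check dat.conf on your system
    $ mpirun_rsh -np 2 n0 n1 DAPL_PROVIDER=OpenIB-cma ./a.out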

10 MVAPICH Gen2-UD Parameters

10.1 InfiniBand HCA and Network Parameters

10.1.1 MV_DEVICE

Name of the InfiniBand device, e.g., mthca0 or mthca1.

10.1.2 MV_MTU

MTU size in bytes that should be used (e.g. 1024, 2048, 4096). Must be less than or equal to the value supported by the HCA.
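
An illustrative Gen2-UD run selecting the device and MTU is shown below; the device name, host names, and executable are placeholders for the local setup:

    # unlike VIADEV_DEFAULT_MTU, MV_MTU is given as a plain byte count
    $ mpirun_rsh -np 2 n0 n1 MV_DEVICE=mthca0 MV_MTU=2048 ./a.out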

10.2 Reliability Parameters

Reliability is always enabled and cannot be turned off since UD is an unreliable transport. The following are various options to tune it.

10.2.1 MV_PROGRESS_TIMEOUT

Time (usec) until ACK status is checked (and ACKs are sent if needed)

10.2.2 MV_RETRY_TIMEOUT

Time (usec) after which an unacknowledged message will be retried

10.2.3 MV_MAX_RETRY_COUNT

Number of retries of a message before the job is aborted. This is needed in case an HCA fails.

10.2.4 MV_ACK_AFTER_RECV

After this number of messages is received, an ACK is sent back to the sender, regardless of MV_PROGRESS_TIMEOUT.

10.2.5 MV_ACK_AFTER_PROGRESS

After a message receive is detected and before control is returned to the application, this is the number of messages that can be received before an ACK is transmitted to the sender, regardless of MV_PROGRESS_TIMEOUT.

10.3 Large Message Transfer Parameters

10.3.1 MV_USE_UD_ZCOPY

Whether or not to use the zero-copy transfer mechanism to transfer large messages.

10.3.2 MV_UD_ZCOPY_QPS

How many zero-copy large message transfers can be concurrently outstanding to a single process.

10.3.3 MV_UD_ZCOPY_THRESHOLD

Messages of this size and above should be transmitted along the zero-copy path (if MV_USE_UD_ZCOPY is set).
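
A hedged sketch combining the two settings is shown below; the threshold value is illustrative only, and the host names and executable are placeholders:

    # enable zero-copy for large messages; 32768 is an illustrative threshold
    $ mpirun_rsh -np 2 n0 n1 MV_USE_UD_ZCOPY=1 MV_UD_ZCOPY_THRESHOLD=32768 ./a.out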

10.3.4 MV_USE_REG_CACHE

Whether buffer registrations should be cached in the MPI library to increase performance

10.4 Performance and General Parameters

10.4.1 MV_USE_HEADERS

Whether message headers should use the immediate data field of InfiniBand or be encoded into the body of the message.

10.4.2 MV_NUM_UD_QPS

How many UD QPs should be created and used (in round-robin fashion) for message transfer. Often more than one is required to achieve full bandwidth.

10.4.3 MV_RNDV_THRESHOLD

For messages over this size the sender should verify that the receive has been posted before sending the message.

10.5 QP and Buffer Parameters

10.5.1 MV_UD_SQ_SIZE

How many send operations can be outstanding at any given time

10.5.2 MV_UD_RQ_SIZE

Maximum number of receive buffers that can be posted at a single time

10.5.3 MV_UD_CQ_SIZE

Maximum number of completions that can be expected. Generally set to MV_UD_SQ_SIZE + MV_UD_RQ_SIZE.

10.5.4 MV_USE_LMC

If the LID Mask Count (LMC) value is above 0, this controls whether multiple paths should be used through the network.

10.6 Shared Memory Control Parameters

10.6.1 MV_USE_SHARED_MEMORY

Whether or not shared memory should be used for communication with peers on the same node (instead of network loopback)

10.6.2 MV_SMP_EAGERSIZE

This has no effect if macro _SMP_ is not defined. It defines the switch point from Eager protocol to Rendezvous protocol for intra-node communication. If macro _SMP_RNDV_ is defined, then for messages larger than MV_SMP_EAGERSIZE, SMP Rendezvous protocol is used. Note that this variable should be set in KBytes.

10.6.3 MV_SMPI_LENGTH_QUEUE

This has no effect if macro _SMP_ is not defined. It defines the size of shared buffer between every two processes on the same node for transferring messages smaller than or equal to MV_SMP_EAGERSIZE. Note that this variable should be set in KBytes.

10.6.4 SMP_SEND_BUF_SIZE

This has no effect if macro _SMP_ is not defined. It defines the packet size when sending intra-node messages larger than MV_SMP_EAGERSIZE. Note that this variable should be set in Bytes.

10.6.5 MV_SMP_NUM_SEND_BUFFER

This has no effect if macro _SMP_ is not defined. It defines the number of internal send buffers for sending intra-node messages larger than MV_SMP_EAGERSIZE.

10.6.6 MV_USE_AFFINITY

Enable CPU affinity by setting MV_USE_AFFINITY=1 or disable it by setting
MV_USE_AFFINITY=0. MV_USE_AFFINITY does not take effect when _AFFINITY_ is not defined.