MVAPICH 1.0 User and Tuning Guide
MVAPICH Team
Last Revised: May 29, 2008
InfiniBand is emerging as a high-performance interconnect delivering low latency and high bandwidth. It is also getting widespread acceptance due to its open standard.
MVAPICH (pronounced “em-vah-pich”) is open-source MPI software that exploits the novel features and mechanisms of InfiniBand and other RDMA-enabled interconnects to deliver performance and scalability to MPI applications. This software is developed in the Network-Based Computing Laboratory (NBCL), headed by Prof. Dhabaleswar K. (DK) Panda.
Currently, there are two versions of this MPI: MVAPICH with MPI-1 semantics and MVAPICH2 with MPI-2 semantics. This open-source MPI software project started in 2001, and a first high-performance implementation was demonstrated at the Supercomputing ’02 conference. Since then, this software has been steadily gaining acceptance in the HPC and InfiniBand community. As of 05/29/2008, more than 690 organizations (National Labs, Universities, and Industry) in 41 countries have downloaded this software directly from OSU’s web site. In addition, many IBA vendors, server vendors, and systems integrators have been incorporating MVAPICH/MVAPICH2 into their software stacks and distributing it. Several InfiniBand systems using MVAPICH have obtained positions in the TOP 500 ranking. The current version of MVAPICH is also being made available with the OpenFabrics/Gen2 stack. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.
More details on MVAPICH/MVAPICH2 software, the users list, sample performance numbers on a wide range of platforms and interconnects, a set of OSU benchmarks, related publications, and other InfiniBand-related projects (parallel file systems, storage, data centers) can be obtained from the following URL:
http://mvapich.cse.ohio-state.edu/
This document contains necessary information for MVAPICH users to download, install, test, use, and tune MVAPICH 1.0. As we get feedback from users and take care of bug-fixes, we introduce new patches against our released distribution and also continuously update this document. Thus, we strongly request you to refer to our web page for updates.
This guide is designed to take the user through all the steps involved in configuring, installing, running and tuning MPI applications over InfiniBand using MVAPICH-1.0.
In Section 3 we describe all the features in MVAPICH 1.0. As you read through this section, please note our new features (highlighted as NEW). Some of these features are designed to optimize specific types of MPI applications and achieve greater scalability. Section 4 describes in detail the configuration and installation steps. This section enables the user to identify specific compilation flags which can be used to turn some of the features on or off. Usage instructions for MVAPICH are explained in Section 5. Apart from describing how to run simple MPI applications, this section also talks about running MVAPICH with some of the advanced features. Section 6 describes the usage of the OSU Benchmarks. If you have any problems using MVAPICH, please check Section 7, where we list some of the common problems users face. In Section 8 we suggest some tuning techniques for multi-thousand node clusters using some of our new features. In Section 9, we list important run-time and compile-time parameters for the Gen2, VAPI, and QLogic devices, their default values and a short description of each parameter. Finally, Section 10 lists the parameters and tuning options for the OpenFabrics/Gen2-UD device.
MVAPICH (MPI-1 over InfiniBand) is an MPI-1 implementation based on MPICH and MVICH. MVAPICH 1.0 is available as a single integrated package (with the latest MPICH 1.2.7 and MVICH).
A complete set of features of MVAPICH 1.0 are:
This uDAPL support is generic and can work with other networks that provide a uDAPL interface. Please note that the stability and performance of MVAPICH with uDAPL depend on the stability and performance of the underlying uDAPL library being used.
The MVAPICH 1.0 package and the project also include the following provisions:
The MVAPICH 1.0 source code package includes the latest MPICH 1.2.7 version and also the required MVICH files from LBNL. Thus, there is no need to download any other files except MVAPICH 1.0 source code.
You can go to the MVAPICH website to obtain the source code.
Untar the archive you have downloaded from the web page using the following command. You will have a directory named mvapich-1.0 after executing this command.
$ tar xzf mvapich-1.0.tar.gz
As we enhance and improve MVAPICH, we update the available source code on our public SVN repository. In order to obtain these updates, please install an SVN client on your machine. The latest MVAPICH sources may be obtained from the “trunk” of the SVN using the following command:
$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk
The “trunk” may contain newer features and bug fixes. However, it is likely to be lightly tested. If you are interested in obtaining stable and major bug fixes to any release version, you should update your sources from the “branch” of the SVN using the following command:
$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/branches/1.0
MVAPICH 1.0 provides support for eight different ADI devices: Gen2 Single-Rail (ch_gen2), Gen2 Multi-Rail (ch_gen2_multirail), Gen2/UD (ch_gen2_ud), Shared memory device (ch_smp), VAPI Single-Rail (vapi), VAPI Multi-Rail (vapi_multirail), uDAPL (udapl) and QLogic InfiniPath (psm). Additionally, you can configure MVAPICH over the standard TCP/IP interface and use it over IPoIB.
There are several options to build MVAPICH 1.0 based on the underlying InfiniBand libraries you want to utilize. In this section we describe in detail the steps you need to perform to correctly build MVAPICH on your choice of InfiniBand libraries, namely OpenFabrics/Gen2, OpenFabrics/Gen2-UD, Mellanox VAPI, uDAPL, Shared Memory or QLogic InfiniPath.
In the following subsection, we describe how to build and configure the Single-Rail device. In later subsections, we describe the building and configuration of the other devices: Multi-Rail with OpenFabrics/Gen2 (4.4.2), Gen2/UD (4.4.3), InfiniPath (4.4.4), VAPI-single-rail (4.4.5), VAPI-multi-rail (4.4.6), uDAPL (4.4.7), Shared memory (4.4.8) and TCP (4.4.9).
There are several methods to configure MVAPICH 1.0.
After setting all the parameters, the script make.mvapich.gen2 configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 with multi-rail device on OpenFabrics Gen2.
After setting all the parameters, the script make.mvapich.gen2_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.
MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Please refer to Section 5.10 for more details.
There are several methods to configure MVAPICH 1.0.
After setting all the parameters, the script make.mvapich.gen2_ud configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0.
After setting all the parameters, the script make.mvapich.psm configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 on VAPI.
After setting all the parameters, the script make.mvapich.vapi configures, builds and installs the entire package in the directory specified by the variable PREFIX.
There are several methods to configure MVAPICH 1.0 with multi-rail device.
After setting all the parameters, the script make.mvapich.vapi_multirail configures, builds and installs the entire package in the directory specified by the variable PREFIX.
Before installing MVAPICH with uDAPL, please make sure you have the uDAPL library installed properly.
There are several methods to configure MVAPICH 1.0 with uDAPL.
After setting all the parameters, the script make.mvapich.udapl configures, builds and installs the entire package in the directory specified by the variable PREFIX.
In the mvapich-1.0 directory, we have provided a script make.mvapich.smp for building MVAPICH over shared memory, intended for single SMP systems. The script make.mvapich.smp takes care of different platforms, compilers and architectures. By default, the compilation script uses gcc. To select a different compiler, please set the variable CC in the script to use the Intel, PathScale or PGI compilers. The platform/architecture is detected automatically. The usage of the shared memory device can be found in Section 5.2.
In the mvapich-1.0 directory, we have provided a script make.mvapich.tcp for building MVAPICH over TCP/IP intended for use over IPoIB (IP over InfiniBand). In order to select any other compiler than GCC, please set your CC variable in that script. Simply execute this script (e.g. ./make.mvapich.tcp) for completing your build.
This section discusses the usage methods for the various features provided by MVAPICH. If you face any problem while following these instructions, please refer to Section 7.
Use mpicc, mpif77, mpiCC, or mpif90 to compile applications. They can be found under mvapich-1.0/bin.
There are several options to run MPI applications. Please select one of the following options based on your need.
Prerequisites:
Examples of running programs using mpirun_rsh:
$ mpirun_rsh -np 4 n0 n1 n2 n3 ./cpi
The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. By default ssh is used.
$ mpirun_rsh -rsh -np 4 n0 n1 n2 n3 ./cpi
The above command runs cpi on nodes n0, n1, n2 and n3, one process per node. rsh is used regardless of whether ssh or rsh was selected when MVAPICH was compiled.
$ mpirun_rsh -np 4 -hostfile hosts ./cpi
The file hosts contains a list of nodes, one per line. MPI ranks are assigned in the order of the hosts listed in the hosts file, or in the order they are passed to mpirun_rsh. For example, if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, rank 0 and rank 2, whereas n1 will have ranks 1 and 3. This rank distribution is known as “cyclic”. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as “block”.
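The two rank distributions described above can be illustrated with a short helper that writes out the host order for a hosts file. This is a minimal sketch in Python; the function names host_list and ranks_by_node are illustrative and not part of MVAPICH.

```python
def host_list(nodes, procs_per_node, layout="block"):
    """Return one host name per MPI rank, in rank order."""
    if layout == "block":
        # n0 n0 n1 n1: consecutive ranks share a node
        return [node for node in nodes for _ in range(procs_per_node)]
    if layout == "cyclic":
        # n0 n1 n0 n1: consecutive ranks alternate between nodes
        return [node for _ in range(procs_per_node) for node in nodes]
    raise ValueError(f"unknown layout: {layout}")


def ranks_by_node(hosts):
    """Map each node to the ranks it receives, given hosts in rank order."""
    placement = {}
    for rank, node in enumerate(hosts):
        placement.setdefault(node, []).append(rank)
    return placement


if __name__ == "__main__":
    hosts = host_list(["n0", "n1"], 2, layout="cyclic")
    print("\n".join(hosts))      # contents for a hosts file, one node per line
    print(ranks_by_node(hosts))  # which ranks land on which node
```

Writing the printed host names to a file, one per line, gives a file that can be passed with -hostfile as in the example above.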
If you are using the shared memory device, then host names can be omitted:
$ mpirun_rsh -np 4 ./cpi
Many parameters of the MPI library can be very easily configured during run-time using environmental variables. In order to pass any environment variable to the application, simply put the variable names and values just before the executable name, like in the following example:
$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./cpi
Note that the environmental variables should be put immediately before the executable.
Alternatively, you may also place environmental variables in your shell environment (e.g. .bashrc). These will be automatically picked up when the application starts executing.
Please note that there are many different parameters which could be used to improve the performance of applications depending upon their requirements from the MPI library. For a discussion on how to identify which variables may be of interest to you, please take a look at Section 8.
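For scripts that launch jobs, the placement rule described above (NAME=value pairs immediately before the executable) can be captured in a small command builder. This is a minimal sketch in Python; the helper name mpirun_rsh_argv is hypothetical, and only the -np and -hostfile options shown in this section are used.

```python
import shlex


def mpirun_rsh_argv(np, executable, hostfile=None, hosts=(), env=None, app_args=()):
    """Build an mpirun_rsh argument list with env settings before the executable."""
    argv = ["mpirun_rsh", "-np", str(np)]
    if hostfile is not None:
        argv += ["-hostfile", hostfile]
    else:
        argv += list(hosts)
    # NAME=value pairs must come immediately before the executable name
    argv += [f"{name}={value}" for name, value in (env or {}).items()]
    argv.append(executable)
    argv += list(app_args)
    return argv


if __name__ == "__main__":
    argv = mpirun_rsh_argv(4, "./cpi", hostfile="hosts",
                           env={"ENV1": "value", "ENV2": "value"})
    print(shlex.join(argv))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues in the variable values.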
Other options of mpirun_rsh can be obtained using
$ mpirun_rsh --help
SLURM is an open-source resource manager designed by Lawrence Livermore National Laboratory. SLURM software package and its related documents can be downloaded from: http://www.llnl.gov/linux/slurm/
Once SLURM is installed and the daemons are started, applications compiled with MVAPICH can be launched by SLURM, e.g.
$ srun -n2 --mpi=mvapich ./a.out
The use of SLURM enables many good features such as explicit CPU and memory binding. For example, if you have two processes and want to bind the first process to CPU 0 and Memory 0, and the second process to CPU 4 and Memory 1, then it can be achieved by:
$ srun --cpu_bind=v,map_cpu:0,4 --mem_bind=v,map_mem:0,1 -n2 --mpi=mvapich ./a.out
For more information about SLURM and its features please visit SLURM website.
MVAPICH provides shared memory implementations of important collectives: MPI_Allreduce, MPI_Reduce, MPI_Barrier and MPI_Bcast. It also has support for enhanced MPI_Allgather. These collective operations are enabled by default. Shared memory collectives are supported over the Gen2, Gen2/UD, PSM and Shared Memory devices. The PSM device currently only has shared memory implementations of MPI_Barrier and MPI_Bcast.
These operations can be disabled all at once by setting VIADEV_USE_SHMEM_COLL to 0 or one at a time by using the following environment variables:
Please refer to section 9.8 for tuning the various environment variables.
MVAPICH provides shared library support. This feature allows you to build your application on top of the MPI shared library. If you choose this option, you will still be able to compile applications with static libraries. By default, however, when shared library support is enabled, your applications will be built on top of the shared libraries automatically. The following commands provide some examples of how to build and run your application with shared library support.
$ mpicc -o cpi cpi.c
For example,
$ mpirun_rsh -np 2 n0 n1 LD_LIBRARY_PATH=$MVAPICH_BUILD/lib/shared ./cpi
Again, note that “LD_LIBRARY_PATH=path-to-shared-libraries” should be put immediately before the executable file.
$ mpicc -noshlib -o cpi cpi.c
MVAPICH contains optimized Lustre ADIO support for the OpenFabrics/Gen2 device. The Lustre directory should be mounted on all nodes on which MVAPICH processes will be running. Compile MVAPICH with ADIO support for Lustre as described in Section 4.4.1. If your Lustre mount is /mnt/datafs on nodes n0 and n1, on node n0, you can compile and run your program as follows:
$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 <path to perf>/perf -fname /mnt/datafs/testfile
If you have enabled support for multiple file systems, add the prefix “lustre:” to the name of the file. For example:
$ mpicc -o perf romio/test/perf.c
$ mpirun_rsh -np 2 n0 n1 ./perf -fname lustre:/mnt/datafs/testfile
MVAPICH provides TotalView support for the OpenFabrics/Gen2 (mpid/ch_gen2), OpenFabrics/Gen2-UD (mpid/ch_gen2_ud), Single-rail VAPI (mpid/vapi), InfiniPath (mpid/psm) and Shared-Memory (mpid/ch_smp) devices. You need to use mpirun_rsh when running TotalView. The following commands provide an example of how to build and run your application with TotalView support. Note: running TotalView requires a correct setup in your environment. If you encounter any problem with your setup, please check with your system administrator for help.
$ mpirun_rsh -tv -np 2 n0 n1 LD_LIBRARY_PATH=$MVAPICH_BUILD/lib/shared:$MVAPICH_BUILD/lib prog
MVAPICH provides a multi-rail device with advanced scheduling policies for data transfer (Section 5.10). However, even with the single-rail configuration, multi-pathing (multiple ports, multiple adapters and multiple paths provided by the LMC mechanism) can be used for multi-core systems. With this support, processes executing on the same node can leverage the above configurations by binding to one of the available configurations. MVAPICH provides multiple choices to the user for leveraging this functionality, which are described in the upcoming examples. This functionality is currently available only in the single-rail gen2 device.
$ mpirun_rsh -np 4 n0 n0 n1 n1 VIADEV_USE_MULTIHCA=1 VIADEV_USE_MULTIPORT=1 ./cpi
The usage of multiple paths is disabled by default. Its usage can be controlled by using the parameter VIADEV_USE_LMC (Section 9.2.6).
$ cat hosts
n0:mthca0:1
n0:mthca1:2
n1:mthca0:2
n1:mthca1:1
With this specification, process 0 would be bound to port 1 of adapter “mthca0”, process 1 to port 2 of adapter “mthca1” and so on.
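To check which adapter and port each process will be bound to before launching a job, the hosts file format shown above can be parsed with a few lines of Python. This is a minimal sketch; the function names are illustrative, and accepting a bare host name without a binding is an assumption made here rather than documented behavior.

```python
def parse_host_entry(entry):
    """Split one hosts-file line into (host, hca, port).

    Accepts "host:hca:port" as in the example above. A bare "host" is also
    accepted here (an assumption) and yields no explicit binding.
    """
    parts = entry.strip().split(":")
    if len(parts) == 1 and parts[0]:
        return parts[0], None, None
    if len(parts) == 3 and all(parts):
        return parts[0], parts[1], int(parts[2])
    raise ValueError(f"unrecognized hosts entry: {entry!r}")


def describe_bindings(lines):
    """Return one description per process; process i takes the i-th entry."""
    descriptions = []
    for rank, line in enumerate(lines):
        host, hca, port = parse_host_entry(line)
        if hca is None:
            descriptions.append(f"process {rank}: {host} (no explicit binding)")
        else:
            descriptions.append(f"process {rank}: {host}, adapter {hca}, port {port}")
    return descriptions


if __name__ == "__main__":
    sample = ["n0:mthca0:1", "n0:mthca1:2", "n1:mthca0:2", "n1:mthca1:1"]
    print("\n".join(describe_bindings(sample)))
```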
MVAPICH supports network fault recovery by using the InfiniBand Automatic Path Migration (APM) mechanism. This support is available for MPI applications using the OpenFabrics stack and InfiniBand adapters.
To enable this functionality, set the run-time variable VIADEV_USE_APM (Section 9.2.8), as shown in the following example:
$ mpirun_rsh -np 2 VIADEV_USE_APM=1 ./cpi
MVAPICH also supports testing Automatic Path Migration in the subnet in the absence of network faults. This can be controlled by using a run-time variable VIADEV_USE_APM_TEST (section 9.2.9). This should be combined with VIADEV_USE_APM as follows:
$ mpirun_rsh -np 2 VIADEV_USE_APM=1 VIADEV_USE_APM_TEST=1 ./cpi
MVAPICH provides multiple scheduling policies for communication in the presence of multiple ports/adapters/paths with the multi-rail configuration. Run-time parameters are provided to control the policies. They are further divided into policies for small and large messages. These policies are available in the multirail devices for gen2 and VAPI.
Process to CPU mapping may affect application performance on multi-core systems, especially for memory-intensive applications. If the number of processes is smaller than the number of CPUs/cores, it is preferable to distribute the processes on different chips to avoid memory contention, because CPUs/cores on the same chip usually share the memory controller. MVAPICH provides flexible user-defined CPU mapping. To use it, first make sure CPU affinity is set (Section 9.6.5). Then use the run-time environment variable VIADEV_CPU_MAPPING to specify the CPU/core mapping. For example, on a quad-core system in which cores [0-3] are on one chip and cores [4-7] are on another chip, if you need to run an application with 2 processes, then the following mapping will give the best performance:
$ mpirun_rsh -np 2 n0 n0 VIADEV_CPU_MAPPING=0,4 ./a.out
In this case process 0 will be mapped to core 0 and process 1 will be mapped to core 4.
More information about VIADEV_CPU_MAPPING can be found in Section 9.6.6.
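The idea of spreading processes across chips can be sketched as a round-robin choice over the per-chip core lists. The following Python sketch is illustrative only: the chip layout must be supplied by the user (it is not detected), the function name spread_mapping is hypothetical, and the comma-separated output follows the VIADEV_CPU_MAPPING=0,4 example above.

```python
def spread_mapping(chips, nprocs):
    """Pick one core per process, visiting the chips round-robin.

    chips is a list of core-id lists, one list per chip, e.g.
    [[0, 1, 2, 3], [4, 5, 6, 7]] for the layout described above.
    """
    total_cores = sum(len(cores) for cores in chips)
    if nprocs > total_cores:
        raise ValueError("more processes than cores")
    picks = []
    depth = 0  # index of the next unused core within each chip
    while len(picks) < nprocs:
        for cores in chips:
            if depth < len(cores) and len(picks) < nprocs:
                picks.append(cores[depth])
        depth += 1
    return ",".join(str(core) for core in picks)


if __name__ == "__main__":
    layout = [[0, 1, 2, 3], [4, 5, 6, 7]]
    print("VIADEV_CPU_MAPPING=" + spread_mapping(layout, 2))
```

With two processes this reproduces the 0,4 mapping from the example; with four processes it alternates between the two chips.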
If you have arrived at this point, you have successfully installed MVAPICH. Congratulations! In the mvapich-1.0/osu_benchmarks directory, we provide the following basic performance tests: one-way latency test, uni-directional bandwidth test, bi-directional bandwidth test, multiple bandwidth/message rate test, and MPI-level broadcast latency test. You can compile and run these tests on your machines to evaluate the basic performance of MVAPICH.
These benchmarks, as well as other benchmarks (such as those for one-sided operations in MPI-2), are available on our project’s web page. Sample performance numbers for these benchmarks on representative platforms and IBA gear are also included on our project’s web page. You are welcome to compare your performance numbers with our numbers. If you see any big discrepancy, please let the MVAPICH community know by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.
Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact the MVAPICH community by sending an email to the mailing list mvapich-discuss@cse.ohio-state.edu.
MVAPICH can be used over multiple underlying InfiniBand libraries, namely OpenFabrics (Gen2), OpenFabrics (Gen2-UD), VAPI, uDAPL and QLogic InfiniPath. Based on the underlying library being utilized, the troubleshooting steps may be different. However, some of the troubleshooting hints are common for all underlying libraries. Thus, in this section, we have divided the troubleshooting tips into general troubleshooting and troubleshooting specific to each underlying library.
Running the following command will provide you with the version of MVAPICH that is being used.
$ mpirun_rsh -v
fork() and system() are supported for the Gen2 and Gen2-UD devices as long as the kernel being used is Linux 2.6.16 or newer. Additionally, the version of OFED used should be 1.2 or higher. The environment variable IBV_FORK_SAFE=1 must also be set to enable fork support.
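Two of the prerequisites above (kernel version and IBV_FORK_SAFE) can be checked from a script before launching a job. The following Python sketch is illustrative only; the function names are hypothetical, and the OFED version is deliberately not checked because how to query it depends on the installation.

```python
import os
import platform
import re


def kernel_tuple(release):
    """Turn a release string such as '2.6.18-92.el5' into (2, 6, 18)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"cannot parse kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def fork_prereqs(release, environ):
    """Report the fork() prerequisites that can be checked locally."""
    return {
        "kernel >= 2.6.16": kernel_tuple(release) >= (2, 6, 16),
        "IBV_FORK_SAFE=1": environ.get("IBV_FORK_SAFE") == "1",
    }


if __name__ == "__main__":
    for name, ok in fork_prereqs(platform.release(), os.environ).items():
        print(f"{name}: {'ok' if ok else 'NOT satisfied'}")
```

Note that the environment check only covers the node where the script runs; the variable still has to reach the MPI processes, for example by placing it before the executable on the mpirun_rsh command line.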
This is a common symptom of several setup issues related to job startup. Please make sure of the following things:
MVAPICH implements highly optimized RDMA collective algorithms for frequently used collectives such as MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Bcast and MPI_Allgather. The optimized implementations have been well tested and tuned. However, if you face any problems with these collectives in your application, please disable the optimized collectives. For example, if you want to disable the optimized MPI_Allreduce, you can do:
$ mpirun_rsh -np 8 -hostfile hf VIADEV_USE_SHMEM_ALLREDUCE=0 ./a.out
The complete list of all such parameters is given in Section 9.8.
The gfortran compiler can be used for F77 and F90. In order to make this work, the following environment variables should be set prior to running the build script:
$ export F77=gfortran
$ export F90=gfortran
$ export F77_GETARGDECL=" "
If g77 and gfortran are used together for F77 and F90 respectively, it might be necessary to set the following environment variable in order to get around possible compatibility issues:
$ export F90FLAGS="-ff2c"
MPI programs built with gfortran might not appear to run correctly due to the default output buffering used by gfortran. If it seems there is an issue with program output, the GFORTRAN_UNBUFFERED_ALL variable can be set to “y” when using mpirun_rsh to fix the problem. Running the pi3f90 example program using this variable setting is shown below:
$ mpirun_rsh -np 2 n1 n2 GFORTRAN_UNBUFFERED_ALL=y ./pi3f90
If a debug session is terminated with an alarm message, mpirun_rsh may have timed out waiting for the job launch to complete. Use a larger MPIRUN_TIMEOUT (Section 9.1.2) to work around this problem.
If an application task terminates unexpectedly during job launch, mpirun_rsh may print the message:
mpispawn.c:303 Unexpected exit status
This usually indicates a problem with the application. Other error messages around this (if any) might point to the actual issue.
If mpirun_rsh fails with this error message, it was unable to locate a necessary utility. This can be fixed by ensuring that all MVAPICH executables are in the PATH on all nodes.
If the PATH cannot be set up as mentioned, then invoke mpirun_rsh with a path. For example:
/path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc
or
../../path/to/mpirun_rsh -np 2 node1 node2 ./mpi_proc
If you are using ADIO support for Lustre, please make sure that:
– Lustre is set up correctly, and you are able to create, read from and write to files in the Lustre mounted directory.
– The Lustre directory is mounted on all nodes on which MVAPICH processes with ADIO support for Lustre are running.
– The path to the file is correctly specified.
– The permissions for the file or directory are correctly specified.
If you are using ADIO support for Lustre, the recent Lustre releases require an additional mount option to have correct file locks. So please include the following option with your lustre mount command: “-o localflock”. For example:
$ mount -o localflock -t lustre xxxx@o2ib:/datafs /mnt/datafs
MVAPICH uses CPU affinity to achieve better performance for single-threaded programs. For multi-threaded programs, such as those using the MPI+OpenMP model, it may schedule all the threads of a process to run on the same CPU. CPU affinity should be disabled in this case to solve the problem, e.g.
$ mpirun_rsh -np 2 n1 n2 VIADEV_USE_AFFINITY=0 ./a.out
More information about CPU affinity and CPU binding can be found in Sections 9.6.5 and 9.6.6.
Please replace the -Wl,-rpath option in the build scripts (e.g. make.mvapich.gen2) with -R when Sun Studio 12 compilers are used.
Several well-known MPICH related problems on different platforms and environments have already been identified by Argonne. They are available on the MPICH patch webpage.
In this section, we discuss the general error conditions for MVAPICH based on OpenFabrics Gen2.
This error is generated by MVAPICH when it cannot find any Gen2 InfiniBand devices. If you are experiencing this error, then please make sure that your Gen2 installation is set up correctly. You can do so as follows:
$ locate libibverbs
This tells you if you have installed libibverbs (the Gen2 verbs layer) or not. By default it installs in /usr/local.
If you have installed libibverbs, then please check if the OpenFabrics Gen2 drivers are loaded. You can do so by:
$ lsmod | grep ib
If this command does not list ib_uverbs, then probably you haven’t started all OpenFabrics Gen2 services. Please refer to the OpenFabrics Wiki installation cheat sheet for more details on setting up the OpenFabrics Gen2 stack.
This error is generated when MVAPICH cannot “open” the HCA (or the InfiniBand communication device). Please execute:
$ ls -l /dev/infiniband
If this command does not show a uverbs0 device with read/write permissions for users as shown below, please consult the “Loading kernel components” section of the OpenFabrics Wiki installation cheat sheet.
crw-rw-rw- 1 root root 231, 192 Feb 24 14:31 uverbs0
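The permission check described above can also be done from a script. This is a minimal Python sketch using only the standard library; the function names are illustrative, and the default path assumes the first device is named uverbs0 as in the listing above.

```python
import os
import stat


def is_world_rw_chardev(mode):
    """True if mode is a character device that all users can read and write."""
    everyone_rw = stat.S_IROTH | stat.S_IWOTH
    return stat.S_ISCHR(mode) and (mode & everyone_rw) == everyone_rw


def check_uverbs(path="/dev/infiniband/uverbs0"):
    """Return a one-line status for the given verbs device file."""
    try:
        mode = os.stat(path).st_mode
    except OSError as err:
        return f"{path}: cannot stat ({err.strerror})"
    status = "ok" if is_world_rw_chardev(mode) else "unexpected type or permissions"
    # stat.filemode renders the mode like ls -l does, e.g. crw-rw-rw-
    return f"{path}: {stat.filemode(mode)} ({status})"


if __name__ == "__main__":
    print(check_uverbs())
```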
If you encounter this error, then you need to set the maximum available locked memory value for your system. The usual Linux defaults are quite low compared to what is required for HPC applications. One way to do this is to edit the file /etc/security/limits.conf and enter the following line:
* soft memlock phys_mem_size_in_KB
Here, phys_mem_size_in_KB is the MemTotal value reported by /proc/meminfo. In addition, you need to enter the following line in /etc/init.d/sshd and then restart sshd.
ulimit -l phys_mem_size_in_KB
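The two lines above can be generated from the local MemTotal value instead of being filled in by hand. The following Python sketch only prints the lines and does not modify any system file; the function names are illustrative.

```python
import re
from pathlib import Path


def memtotal_kb(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo contents."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1))


def memlock_lines(kb):
    """Return the limits.conf entry and the ulimit command for kb kilobytes."""
    return (f"* soft memlock {kb}", f"ulimit -l {kb}")


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        limits_line, ulimit_line = memlock_lines(memtotal_kb(meminfo.read_text()))
        print("/etc/security/limits.conf:", limits_line)
        print("/etc/init.d/sshd:", ulimit_line)
```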
MVAPICH generates this error when it cannot find any port active for the specific HCA being used for communication. This probably means that the ports are not configured to be a part of the InfiniBand subnet and thus are not “Active”. You can check whether the port is active or not, by using the following command:
$ ibstat
Please look at the “State” field for the status of the port being used. To bring a port to “Active” status, on any node in the same InfiniBand subnet, execute the following command:
# opensm -o 1
Please note that you need superuser privilege for this command. This command invokes the InfiniBand subnet manager (OpenSM) and asks it to sweep the subnet once and make all ports “Active”. OpenSM is usually installed in /usr/local/bin.
This means that your HCA doesn’t support the ibv_modify_srq feature. Please upgrade the firmware version and OpenFabrics Gen2 libraries on your cluster. You can obtain the latest Mellanox firmware images from this webpage.
Even after updating your firmware and OpenFabrics Gen2 libraries, if you continue to experience this problem, please edit make.mvapich.gcc and replace -DMEMORY_SCALE with -DADAPTIVE_RDMA_FAST_PATH. After making this change you need to re-build the MVAPICH library. Note that you should first try to update your firmware and OpenFabrics Gen2 libraries before taking this measure.
If you believe that your HCA supports this feature and yet you are experiencing this problem, please contact the MVAPICH community by posting a note to mvapich-discuss@cse.ohio-state.edu mailing list.
The error code 12 indicates that the InfiniBand HCA has given up after several attempts to send the packet. This can be caused by loose or faulty cables. Please check the InfiniBand connectivity of your cluster. Additionally, you may check the error rates at the respective HCAs using:
$ ibchecknet
This utility (usually installed in /usr/local/bin) sweeps the InfiniBand subnet and reports which ports are OK and which have errors. You may try to quiesce the entire cluster and bring it up after an InfiniBand switch reboot.
The VIADEV_USE_LMC parameter (Section 9.2.6) allows the usage of multiple paths, set up by the subnet manager, for multi-core and multi-way SMP systems. The subnet manager allows different routing engines to be used (the Min-Hop routing algorithm by default). We have noticed hangs when using this parameter with the Up/Down routing algorithm of OpenSM. There are two ways to fix this problem:
# mpirun_rsh -np 2 n0 n1 VIADEV_USE_LMC=0 ./prog
# opensm -o -l4 -r
MVAPICH provides network fault tolerance with Automatic Path Migration (APM). However, APM is supported only with OFED 1.2 onwards. With OFED 1.1 and prior versions of the OpenFabrics drivers, APM functionality is not completely supported. Please refer to Sections 9.2.8 and 9.2.9.
MVAPICH 1.0 provides a new, more scalable startup procedure by default. If for some reason the old version is desired, it can be enabled using the -legacy flag to mpirun_rsh.
$ mpirun_rsh -legacy ...
The above error reports that the InfiniBand Adapter is not ready for communication. Make sure that the drivers are up. This can be done by executing:
$ locate libvapi
This command gives the path at which drivers are setup. Additionally, you may try to use the command vstat to check the availability of HCAs.
$ vstat
This error is generated during compilation, if the correct path to the InfiniBand library installation is not given.
Please set up the environment variable MTHOME as follows:
$ export MTHOME=/usr/local/ibgd/driver/infinihost
If the problem persists, please contact your system administrator or reinstall your copy of IBGD. You can get IBGD from the Mellanox website.
This error usually indicates that not all InfiniBand links the MPI application is trying to use are in the PORT_ACTIVE state. Please make sure that all ports show PORT_ACTIVE with the VAPI utility vstat. If you are using Multi-Rail support, please keep in mind that all ports of all adapters you are using need to show PORT_ACTIVE.
Please make sure that the environmental variable "MAC_OSX" is set before your configuration. If you use manual configuration and not mvapich.make.macosx, you must configure MVAPICH in the following way:
$ export MAC_OSX=yes; ./configure; make; make install
If you encounter a problem like the one shown below when compiling your own applications, it is likely that you have explicitly included "-lm". You should remove it.
"ld: multiple definitions of symbol _calloc
/usr/lib/libm.dylib(malloc.So) definition of _calloc
/tmp/mvapich-0.9.5/mvapich/lib/libmpich.a(dreg-g5.o)
definition of _calloc in section (__TEXT,__text)
ld: multiple definitions of symbol _free
/usr/lib/libm.dylib(malloc.So) definition of _free
/tmp/mvapich-0.9.5/mvapich/lib/libmpich.a(dreg-g5.o)
definition of _free in section (__TEXT,__text) "
To enable Fortran support, you need to install the IBM compiler, available from IBM (there is a 60-day free trial version).
Once you unpack the tar ball, you can customize and use make.mvapich.vapi to compile and install the package or manually configure, compile and install the package.
If you configure MVAPICH 1.0 with uDAPL and see this error, you need to check whether you have specified the correct uDAPL service provider. If you have specified the uDAPL provider but still see this error, you need to check whether the specified network is working or not.
In addition, please check the contents of the file /etc/dat.conf. It should contain the name of the IA e.g. ib0. A typical entry would look like the following:
ib0 u1.2 nonthreadsafe default /usr/lib/libdapl.so ri.1.1 "mthca0 1" ""
If you configure MVAPICH 1.0 with uDAPL and see this error, you need to reduce the value of the environmental variable RDMA_DEFAULT_MAX_WQE depending on the underlying network.
If you get the error “error while loading shared libraries, libdat.so”, the dat shared library cannot be found in the library search path. You need to find the correct location of libdat.so and add it to LD_LIBRARY_PATH. For example:
$ export LD_LIBRARY_PATH=/path/to/libdat.so:$LD_LIBRARY_PATH
$ mpirun_rsh -np 2 n0 n1 ./a.out
MVAPICH 1.0 currently does not support DAPL-v2.0. You need to build MVAPICH against the DAPL-v1.2 library. Support for DAPL-v2.0 is available in MVAPICH2.
Incorrect MTRR mapping settings may result in low bandwidth with InfiniPath hardware. To fix this, set the BIOS MTRR mapping to “Discrete”. For further details, please refer to the InfiniPath User Guide.
The variable IBHOME_LIB in the make.mvapich.psm file does not point to the correct location. IBHOME_LIB should point to the directory containing the InfiniPath device libraries. By default they are installed in /usr/lib or /usr/lib64.
IBHOME, PREFIX, CC, and F77 are mandatory variables required by the installation script and must be set in the file make.mvapich.psm:
IBHOME - the directory which contains the InfiniPath header file include directory. By default, the InfiniPath include directory is in /usr.
PREFIX - the directory where MVAPICH should be installed.
CC - the C compiler. Typically set to gcc.
F77 - the Fortran compiler. Typically set to g77.
This probably means that the ports are not configured to be a part of the InfiniBand subnet and thus are not “Active”. You can check whether the port is active or not, by using the following command on that node:
$ ipath_control -i
Please look at the “Status” field for the status of the port being used. To bring a port to “Active” status, on any node in the same InfiniBand subnet, execute the following command:
# opensm -o
Please note that you may need superuser privilege for this command. This command invokes the InfiniBand subnet manager (OpenSM) and asks it to sweep the subnet once and make all ports “Active”. OpenSM is usually installed in /usr/local/bin. You may also look at the file /sys/bus/pci/drivers/ib_ipath/status_str to verify that the InfiniPath software is loaded correctly. For details, please refer to the InfiniPath user guide, downloadable from www.qlogic.com.
This is a limitation of InfiniPath Release 2.0. By default it allows a maximum of eight processes per QHT7140 HCA and four processes with QLE7140 HCA. To overcome this, please consult your InfiniPath support provider.
MVAPICH supports many different parameters for tuning and extracting the best performance for a wide range of platforms and applications. These parameters can be either compile time parameters or run time parameters. Please refer to section 9 for a complete description of all the parameters. In this section we classify these parameters depending on what you are tuning for and provide guidelines on how to use them.
MVAPICH 1.0 has a new, scalable mpirun_rsh which uses a tree based mechanism to spawn processes. The degree of this tree is determined dynamically to keep the depth low. For large clusters, it might be beneficial to further flatten the tree by specifying a higher degree. The degree can be overridden with the environment variable MT_DEGREE (section 9.1.1).
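To see how the degree trades off against depth, the sketch below computes the depth of an idealized launch tree. It assumes a simple model in which every launcher spawns exactly `degree` children and the processes sit at the leaves; the real mpirun_rsh tree may be counted differently, and the function name tree_depth is ours.

```python
def tree_depth(num_procs, degree):
    """Smallest depth d such that degree ** d >= num_procs, i.e. the
    depth of an idealized spawn tree with the given fan-out."""
    if num_procs < 1 or degree < 2:
        raise ValueError("need num_procs >= 1 and degree >= 2")
    depth, capacity = 0, 1
    while capacity < num_procs:
        capacity *= degree
        depth += 1
    return depth

# Depth for a 4096-process job at several degrees.
for degree in (2, 8, 32):
    print("degree %2d -> depth %d" % (degree, tree_depth(4096, degree)))
```

Under this model, a higher degree gives a flatter tree at the cost of more children per launcher.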
In MVAPICH we use Shared Receive Queue (SRQ) support which consumes less memory than other methods. It can lead to a significant reduction in the memory footprint of MVAPICH.
To enable this mode, please include -DMEMORY_SCALE in your make.mvapich.gcc (it is included by default). Once you have enabled the scalable memory mode in MVAPICH, there are four aspects by which you can customize the memory usage and performance ratio according to the needs of your cluster.
The main environmental parameters controlling the behavior of the Shared Receive Queue design are:
Starting with 1.0, MVAPICH dynamically resizes the number of buffers used for the SRQ by default. The parameter VIADEV_SRQ_MAX_SIZE is the maximum size of the Shared Receive Queue (default 4096). You may increase this value to 8192 if the application runs on a very large number of processors (8K and beyond). The application starts with only VIADEV_SRQ_SIZE buffers (default 256) and doubles this value on every SRQ limit event (up to VIADEV_SRQ_MAX_SIZE). For long-running applications this resizing should have little effect. If needed, VIADEV_SRQ_SIZE can be increased to 1024 or higher.
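The doubling behavior can be illustrated with a few lines of Python. This is only a model of the sizes described above, not the library's code; the function name srq_growth is ours.

```python
def srq_growth(initial_size=256, max_size=4096):
    """Sequence of SRQ sizes: start at initial_size (VIADEV_SRQ_SIZE) and
    double on each SRQ limit event, capped at max_size
    (VIADEV_SRQ_MAX_SIZE)."""
    if initial_size < 1 or max_size < initial_size:
        raise ValueError("need 1 <= initial_size <= max_size")
    sizes = [initial_size]
    while sizes[-1] < max_size:
        sizes.append(min(sizes[-1] * 2, max_size))
    return sizes

print(srq_growth())            # defaults quoted in the text
print(srq_growth(1024, 8192))  # larger start and maximum
```

With the defaults, the maximum is reached after four limit events, which is why the resizing matters little for long-running applications.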
VIADEV_SRQ_LIMIT defines the low watermark for the flow control handler. This can be reduced if your aim is to reduce the number of interrupts.
VIADEV_VBUF_POOL_SIZE is the fixed number of vbufs in the pool. These vbufs can be shared among all connections depending on the communication needs of each connection. You may want to increase this number for large scale clusters (4K and beyond).
The major environmental variables controlling the behavior of the connection management in MVAPICH are:
VIADEV_CM_RECV_BUFFERS is the number of buffers used by the connection manager to establish new connections. These buffers are very small (around 20 bytes) and they are shared for all InfiniBand connections, so this value may be increased to 8192 for large clusters to avoid retries in case of packet drops.
VIADEV_CM_MAX_SPIN_COUNT is the number of times the connection manager polls for new incoming connections. This may be increased to reduce the interrupt overhead when a lot of incoming connections are started at the same time.
VIADEV_CM_TIMEOUT is the timeout value associated with connection request messages on the UD channel. Decreasing this may lead to faster retries, but at the cost of generating duplicate messages. Similarly, increasing this may lead to slower retries but a lower chance of duplicate messages.
MVAPICH implements a dynamic allocation and utilization of the RDMA mechanism for short messages. It can lead to significant reduction in memory footprint of MVAPICH.
There are two environmental parameters:
These two parameters control the behavior of this dynamic scheme.
VIADEV_ADAPTIVE_RDMA_LIMIT controls the maximum number of processes for which the
“fast” RDMA buffers are allocated. For very large scale clusters, it is suggested to set this value to
-1, which means RDMA buffers will be allocated for log(n) number of connections (where n is the
number of processes in the application).
VIADEV_ADAPTIVE_RDMA_THRESHOLD is the number of messages exchanged per
connection before RDMA buffers are allocated for that connection. For very large scale clusters, it
is suggested that this value be increased so that only very frequently communicating connections
allocate RDMA buffers.
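As a rough illustration of what the -1 setting means at scale, the sketch below evaluates log(n) for a few job sizes. The guide does not state the base of the logarithm; base 2 (rounded up) is our assumption, and the function name rdma_limit is hypothetical.

```python
import math

def rdma_limit(adaptive_rdma_limit, num_procs):
    """Effective limit on connections that get fast-path RDMA buffers.
    -1 selects log(n); base 2, rounded up, is an assumption here."""
    if num_procs < 1:
        raise ValueError("num_procs must be >= 1")
    if adaptive_rdma_limit == -1:
        return math.ceil(math.log2(num_procs))
    return adaptive_rdma_limit

for n in (64, 1024, 4096):
    print("n = %4d -> limit %d" % (n, rdma_limit(-1, n)))
```

Even for thousands of processes, the limit stays in the low tens, which is what keeps the memory footprint small.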
In addition, the following parameters are also important in tuning the memory requirement: VIADEV_VBUF_TOTAL_SIZE (9.3.2) and VIADEV_NUM_RDMA_BUFFER (9.3.1).
The product of VIADEV_VBUF_TOTAL_SIZE and VIADEV_NUM_RDMA_BUFFER is generally a measure of the amount of memory registered per connection for eager message passing. These buffers are not shared across connections.
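Because these buffers are per connection, the total grows with the number of connections that use the RDMA fast path. The following back-of-the-envelope sketch shows the arithmetic; the input values are hypothetical and are not MVAPICH defaults, and the estimate ignores any additional buffers the library may allocate.

```python
def eager_rdma_bytes(vbuf_total_size, num_rdma_buffer, num_connections):
    """Return (bytes per connection, bytes for all connections) for the
    eager RDMA buffers of one process."""
    per_connection = vbuf_total_size * num_rdma_buffer
    return per_connection, per_connection * num_connections

# Hypothetical values: 12 KB vbufs, 32 buffers, 1024 connections.
per_conn, total = eager_rdma_bytes(12 * 1024, 32, 1024)
print("per connection: %d KB" % (per_conn // 1024))
print("for 1024 connections: %d MB" % (total // (1024 * 1024)))
```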
To provide the best performance (latency/bandwidth) to memory ratio, we have decided on a set of default values for these parameters. These parameters are often dependent on the execution platform. To use preset values for small, medium and large clusters (1-64, 64-256, 256 and beyond), please set VIADEV_CLUSTER_SIZE (9.10.1) to SMALL, MEDIUM or LARGE, respectively.
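A job script can pick the preset from the process count. The sketch below is one way to do it; since the documented ranges share their end points, treating 64 as SMALL and 256 as MEDIUM is our assumption, and the function name is hypothetical.

```python
def cluster_size_preset(num_procs):
    """Map a process count to a VIADEV_CLUSTER_SIZE preset name.
    Boundary handling (64 -> SMALL, 256 -> MEDIUM) is an assumption."""
    if num_procs < 1:
        raise ValueError("num_procs must be >= 1")
    if num_procs <= 64:
        return "SMALL"
    if num_procs <= 256:
        return "MEDIUM"
    return "LARGE"

for n in (16, 128, 2048):
    print("np=%d -> VIADEV_CLUSTER_SIZE=%s" % (n, cluster_size_preset(n)))
```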
MVAPICH uses a shared memory communication channel to achieve high-performance message passing among processes that are on the same physical node. The two main parameters which are used for tuning shared memory performance for small messages are VIADEV_SMPI_LENGTH_QUEUE (10.6.3) and VIADEV_SMP_EAGER_SIZE (10.6.2). The two main parameters which are used for tuning shared memory performance for large messages are SMP_SEND_BUF_SIZE (10.6.4) and VIADEV_SMP_NUM_SEND_BUFFER (10.6.5).
VIADEV_SMPI_LENGTH_QUEUE is the size of the shared memory buffer which is used to store outstanding small and control messages. VIADEV_SMP_EAGER_SIZE defines the switch point from Eager protocol to Rendezvous protocol.
Messages larger than VIADEV_SMP_EAGER_SIZE are packetized and sent out in a pipelined manner. SMP_SEND_BUF_SIZE is the packet size, i.e. the send buffer size. VIADEV_SMP_NUM_SEND_BUFFER is the number of send buffers. Shared memory communication can be disabled at run time with the parameter VIADEV_USE_SHARED_MEM (9.4.5).
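The following sketch shows how a message size maps to the two intra-node paths and how many packets a large message needs. It follows the units given in the parameter reference (eager size in KBytes, send buffer size in Bytes); taking 1 KByte as 1024 bytes and the sample values are our assumptions.

```python
def smp_transfer_plan(msg_bytes, eager_size_kbytes, send_buf_bytes):
    """Return (path, packets) for an intra-node message.
    Messages up to the eager size use the eager path; larger ones are
    split into packets of send_buf_bytes each."""
    eager_bytes = eager_size_kbytes * 1024  # assumes 1 KByte = 1024 bytes
    if msg_bytes <= eager_bytes:
        return "eager", 1
    packets = -(-msg_bytes // send_buf_bytes)  # ceiling division
    return "pipelined", packets

# Hypothetical settings: 32 KByte eager size, 8192-byte send buffers.
print(smp_transfer_plan(4096, 32, 8192))
print(smp_transfer_plan(1024 * 1024, 32, 8192))
```

If the packet count is much larger than VIADEV_SMP_NUM_SEND_BUFFER, the sender has to reuse buffers as the receiver drains them.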
The performance of some applications is sensitive to the rank distribution, depending on their communication pattern. It is advisable that the processes that communicate most use the shared memory path, since it offers lower latencies than the network path. To adjust the process rank distribution, please refer to Section 5.2 to decide which distribution, “cyclic” or “block”, suits the communication pattern of your application. In particular, we have found that the performance of HPL (Linpack) is better when using the “block” distribution.
MVAPICH uses shared memory to get the best performance for many collective operations: MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Bcast.
The important parameters for tuning these collectives are as follows. For MPI_Allreduce, the
optimized shared memory algorithm is used for messages smaller than
VIADEV_SHMEM_COLL_ALLREDUCE_THRESHOLD (9.8.9).
Similarly, for MPI_Reduce the corresponding threshold is
VIADEV_SHMEM_COLL_REDUCE_THRESHOLD (9.8.8).
For MPI_Bcast, the important parameter is the degree of the tree used for inter-node data movement. This parameter is VIADEV_BCAST_KNOMIAL (9.8.10).
For MPI_Alltoall, the two main parameters are MPIR_ALLTOALL_SHORT_MSG (9.8.11) and MPIR_ALLTOALL_MEDIUM_MSG (9.8.12). There are three main algorithms used for MPI_Alltoall: short message, medium message and long message. The short message algorithm is used until MPIR_ALLTOALL_SHORT_MSG and from then on the medium message algorithm is used until MPIR_ALLTOALL_MEDIUM_MSG. These thresholds can be tuned appropriately to get the best performance.
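The threshold logic for MPI_Alltoall can be written down as a small selection function. This is a sketch of the rule as described, not the library's code; whether a message exactly at a threshold falls into the lower class is our assumption. The sample thresholds are the preset pairs listed in the parameter reference (256/32768 and 8192/8192).

```python
def alltoall_algorithm(msg_bytes, short_msg, medium_msg):
    """Pick the Alltoall algorithm class for a message size, given
    MPIR_ALLTOALL_SHORT_MSG and MPIR_ALLTOALL_MEDIUM_MSG."""
    if msg_bytes <= short_msg:   # inclusive boundary is an assumption
        return "short"
    if msg_bytes <= medium_msg:
        return "medium"
    return "long"

for size in (128, 1024, 65536):
    print(size, alltoall_algorithm(size, 256, 32768))

# With equal thresholds the medium algorithm is never selected.
print(8193, alltoall_algorithm(8193, 8192, 8192))
```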
The degree of the hierarchical tree used by mpirun_rsh. By default mpirun_rsh uses a value that tries to keep the depth of the tree to 4. Note that unlike most other parameters described in this section, this is an environment variable that has to be set in the runtime environment (e.g., through export in the bash shell).
The number of seconds after which mpirun_rsh aborts job launch. Note that unlike most other parameters described in this section, this is an environment variable that has to be set in the runtime environment (e.g., through export in the bash shell).
Name of the InfiniBand device. e.g. mthca0, mthca1 or ehca0 (for IBM ehca).
The default port on the InfiniBand device to be used for communication.
This variable changes the maximum number of ports supported per adapter.
This variable allows a user to bind processes on a node to ports attached to different HCAs on a node. It allows an efficient utilization of HCA ports in a round-robin fashion. VIADEV_MULTIHCA is an alias for this variable for backward compatibility. However, if VIADEV_USE_MULTIHCA is defined, the value of VIADEV_MULTIHCA will be overwritten.
This variable allows a user to bind processes on a node to ports attached to different HCAs on a node. It allows an efficient utilization of HCA ports in a round-robin fashion. VIADEV_MULTIPORT is an alias for this variable for backward compatibility. However, if VIADEV_USE_MULTIPORT is defined, the value of VIADEV_MULTIPORT will be overwritten.
This variable allows the usage of multiple paths between end nodes for multi-core/multi-way SMP systems. The path selection is on the basis of source and destination ranks.
The internal MTU used for IB. This parameter should be a string instead of an integer. Valid values are: MTU256, MTU512, MTU1024, MTU2048, MTU4096.
This parameter is used for recovery from network faults using Automatic Path Migration. This functionality is beneficial when the network provides multiple paths, which can be enabled by using the LMC mechanism.
This parameter is used for testing the Automatic Path Migration functionality. It periodically moves the alternate path as the primary path of communication and re-loads another alternate path.
The number of RDMA buffers used for the RDMA fast path. This fast path is used to reduce latency and overhead of small data and control messages. This value is effective only when macro RDMA_FAST_PATH or ADAPTIVE_RDMA_FAST_PATH is defined.
This macro defines the size of each vbuf.
Different presets for this value are available for different sizes of clusters
VIADEV_CLUSTER_SIZE = (SMALL, MEDIUM, LARGE, AUTO).
This parameter chooses the underlying Rendezvous protocol
Options are:
NOTE: ASYNC is only available if the library was compiled with the -DASYNC CFLAG (not defined by default)
This specifies the switch point between eager and rendezvous protocol in MVAPICH.
Maximum size of an RDMA put message (RPUT) in the rendezvous protocol. Note that this variable should be set in bytes.
This is the message size (in bytes) which will be sent using the R3 mode if the registration cache is turned off, i.e. VIADEV_USE_DREG_CACHE=0
The number of vbufs in the initial pool. This pool is shared among all the connections.
The number of vbufs allocated each time the global pool runs out of the vbufs in the initial pool. These vbufs are also shared among all the connections.
This indicates whether registration cache is to be used or not. The registration cache speeds up zero copy operations if user memory is re-used many times.
The memory registration cache will be used if this flag is defined. We recommend not using this flag for the vapi and vapi_multirail devices.
This defines the total number of buffers that can be stored in the registration cache. A larger value will lead to more infrequent lazy de-registration.
This sets a limit on the number of pages kept registered by the registration cache. If you set it to 0, that implies no limits on the number of pages registered.
Max (total) number of VBUFs to allocate after which the process terminates with a fatal error. -1 means no limit.
Number of processes beyond which on-demand connection management will be used.
Maximum size of a message (in bytes) that may be sent INLINE with the message descriptor. Lowering this increases message latency, but can lower memory requirements. Also see VIADEV_NO_INLINE_THRESHOLD, which will override this value in some cases.
This parameter automatically changes the VIADEV_MAX_INLINE_SIZE after the number of connections exceeds VIADEV_NO_INLINE_THRESHOLD. Behavior is slightly different depending on whether on-demand connection setup is used:
Use blocking mode progress, instead of polling. This allows MPI to yield CPU to other processes if there are no more incoming messages.
This is the maximum number of RDMA paths that will be established in the entire MPI application. A value of -1 means that at most log(n) paths will be established, where n is the number of processes in the MPI application.
This is the number of messages exchanged per connection after which the RDMA path is established.
Default value: Number of processes (np) in the application. If the number of jobs exceeds this limit, adaptive flow will be enabled. To enable adaptive flow for any number of jobs, define: VIADEV_ADAPTIVE_ENABLE_LIMIT=0
To control the number of allowable outstanding send operations to the device.
This parameter records the number of credits per connection that will be preserved for non-data, control packets. If SRQ is not used, this default is 10.
Flow control information is usually sent via piggybacking with other messages. This parameter is used, along with VIADEV_DYNAMIC_CREDIT_THRESHOLD, to determine when to send explicit flow control update messages.
Flow control information is usually sent via piggybacking with other messages. These two parameters are used to determine when to send explicit flow control update messages.
This defines the initial number of pre-posted receive buffers for each connection. If communication happens on a particular connection, the number of buffers will be increased to VIADEV_PREPOST_DEPTH.
When _SMP_ is defined, shared memory communication can be disabled by setting VIADEV_USE_SHARED_MEM=0.
This value determines if additional MPI progress engine calls are made when making send operations. If there are this number or more queued send operations then progress is attempted.
This setting turns the coalescing of messages on (1) or off (0). Leaving this feature on can help applications that make many consecutive send operations to the same host.
If VIADEV_USE_COALESCE is enabled, this flag will enable coalescing only for messages of the same tag, communicator, and size. This also increases VIADEV_PROGRESS_THRESHOLD to 2.
If there are more than this number of small messages outstanding to another task, messages will be coalesced until one of the previous sends completes.
Attempt to coalesce messages under this size. If this number is greater than
VIADEV_VBUF_TOTAL_SIZE, then it is set to VIADEV_VBUF_TOTAL_SIZE. This has no
effect if message coalescing is turned off.
Indicates whether Shared Receive Queue is to be used or not. Users are strongly encouraged to use this as long as the InfiniBand software/hardware supports this feature.
This is the maximum number of work requests allowed on the Shared Receive Queue. Upon receiving a SRQ limit event, the current value of VIADEV_SRQ_SIZE will be doubled or moved to the maximum of VIADEV_SRQ_MAX_SIZE, whichever is smaller.
This is the maximum number of work requests posted to the Shared Receive Queue initially. This value will dynamically re-size up to VIADEV_SRQ_MAX_SIZE.
This is the low watermark limit for the Shared Receive Queue. If the number of available work entries on the SRQ drops below this limit, the flow control will be activated.
This is the maximum number of R3 packets which are outstanding when using Shared Receive Queues.
Maximum number of unsuccessful SRQ posts that an async thread can make before going to sleep.
This is the maximum amount of R3 data that is sent out un-acked
This has no effect if macro _SMP_ is not defined. It defines the switch point from Eager protocol to Rendezvous protocol for intra-node communication. If macro _SMP_RNDV_ is defined, then for messages larger than VIADEV_SMP_EAGERSIZE, SMP Rendezvous protocol is used. Note that this variable should be set in KBytes.
This has no effect if macro _SMP_ is not defined. It defines the size of shared buffer between every two processes on the same node for transferring messages smaller than or equal to VIADEV_SMP_EAGERSIZE. Note that this variable should be set in KBytes.
This has no effect if macro _SMP_ is not defined. It defines the packet size when sending intra-node messages larger than VIADEV_SMP_EAGERSIZE. Note that this variable should be set in Bytes.
This has no effect if macro _SMP_ is not defined. It defines the number of internal send buffers for sending intra-node messages larger than VIADEV_SMP_EAGERSIZE.
Enable CPU affinity by setting VIADEV_USE_AFFINITY=1 or disable it by setting VIADEV_USE_AFFINITY=0. VIADEV_USE_AFFINITY does not take effect when _AFFINITY_ is not defined.
User can specify process to CPU mapping within a node. This may help the applications to get the
best performance on multi-core systems. For example, if we set
VIADEV_CPU_MAPPING=0,4,1,5, then process 0 on each node will be mapped to CPU 0,
process 1 will be mapped to CPU 4, process 2 will be mapped to CPU 1, and process 3 will be
mapped to CPU 5. The CPU numbers should be separated by a single “,”. This parameter does not
take effect when _AFFINITY_ is not defined or VIADEV_USE_AFFINITY is set to
0.
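The mapping string is simply an ordered list: the i-th entry is the CPU for the i-th process on the node. The sketch below parses such a string and reproduces the example above; the function name parse_cpu_mapping is ours and only illustrates the format.

```python
def parse_cpu_mapping(mapping):
    """Parse a VIADEV_CPU_MAPPING-style string such as "0,4,1,5" into a
    dict {local process index: CPU number}. Raises ValueError on empty
    or non-numeric fields."""
    cpus = [int(field) for field in mapping.split(",")]
    return {local_rank: cpu for local_rank, cpu in enumerate(cpus)}

for local_rank, cpu in sorted(parse_cpu_mapping("0,4,1,5").items()):
    print("process %d on each node -> CPU %d" % (local_rank, cpu))
```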
For a class of messages, a user may want to use Rendezvous protocol and not stripe the data across multiple ports/adapters. For messages of size equal and above this value, the data is striped across multiple paths. This value should at least be equal to the VIADEV_RENDEZVOUS_THRESHOLD. The value of STRIPING_THRESHOLD is currently equal to VIADEV_RENDEZVOUS_THRESHOLD. For optimal performance, this value may need to be changed depending upon the multi-rail setup (i.e. the number of ports and number of adapters) in the system.
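The relationship between the two thresholds can be summarized in a few lines. This is a sketch of the decision as described, with hypothetical threshold values; treating a message exactly at the rendezvous threshold as a rendezvous message is our assumption.

```python
def transfer_mode(msg_bytes, rendezvous_threshold, striping_threshold):
    """Classify a message as eager, rendezvous on a single path, or
    rendezvous striped across multiple paths."""
    if striping_threshold < rendezvous_threshold:
        raise ValueError("striping threshold must be >= rendezvous threshold")
    if msg_bytes < rendezvous_threshold:
        return "eager"
    if msg_bytes < striping_threshold:
        return "rendezvous, single path"
    return "rendezvous, striped"

# Hypothetical thresholds: rendezvous at 8 KB, striping at 64 KB.
for size in (1024, 16384, 262144):
    print(size, transfer_mode(size, 8192, 65536))
```

When both thresholds are equal, which is the current default relationship, every rendezvous message is striped.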
This parameter indicates number of queue pairs per port to be used for communication on an end node. This parameter has no effect if Multi-Rail configuration is not enabled.
This parameter indicates number of ports to be used for communication per adapter on an end node. This parameter has no effect if Multi-Rail configuration is not enabled.
This parameter indicates number of adapters to be used for communication on an end node. This parameter has no effect if Multi-Rail configuration is not enabled.
To control the scheduling policy being used for small messages for the Multi-Rail device. Valid policies are USE_FIRST (only use the first subchannel), ROUND_ROBIN (use subchannels in a round-robin manner) and PROCESS_BINDING (bind processes to a specific port of the HCAs). This parameter is only valid for the OpenFabrics/Gen2 Multi-Rail device.
To control the scheduling policy being used for large messages for Multi-Rail device. Valid policies are ROUND_ROBIN (use subchannels in a round-robin manner), WEIGHTED_STRIPING (weight subchannels according to their link rates), EVEN_STRIPING (use equal weights for all subchannels), STRIPE_BLOCKING (stripe messages based on whether they are blocking or non-blocking MPI messages), ADAPTIVE_STRIPING (adaptively change the weights based on network congestion) and PROCESS_BINDING (bind processes to a specific port of the HCAs).
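To make the difference between EVEN_STRIPING and WEIGHTED_STRIPING concrete, the sketch below splits a message across subchannels in proportion to a list of weights. It illustrates the idea only; the function name, the rounding rule, and the sample link rates are our assumptions and do not reflect the library's internal scheduling code.

```python
def stripe_sizes(msg_bytes, weights):
    """Split msg_bytes across subchannels in proportion to weights.
    Rounding leftovers are given to the last subchannel so that the
    pieces always add up to msg_bytes."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    total = sum(weights)
    sizes = [msg_bytes * w // total for w in weights]
    sizes[-1] += msg_bytes - sum(sizes)
    return sizes

one_mb = 1024 * 1024
print(stripe_sizes(one_mb, [1, 1]))    # even striping over two subchannels
print(stripe_sizes(one_mb, [10, 20]))  # weighted by hypothetical link rates
```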
To disable shmem based collectives, set this to 0.
To disable shmem based Barrier, set this to 0.
To disable shmem based Allreduce, set this to 0.
To disable shmem based Reduce, set this to 0.
To disable the new Allgather, set this to 0.
This parameter configures the number of communicators that use shared memory collectives.
This parameter allows the maximum message to be tuned for the shared memory collectives.
The shmem reduce is taken for messages less than this threshold. This threshold can be tuned appropriately but should be less than that of 9.8.7 above.
The shmem allreduce is taken for messages less than this threshold. This threshold can be tuned appropriately but should be less than that of 9.8.7 above.
To control the degree k of the k-nomial Broadcast algorithm. It should always be an integer greater than or equal to 2.
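For readers unfamiliar with k-nomial trees, the sketch below computes the parent and children of a rank in a textbook k-nomial tree rooted at rank 0. It is meant to show how the degree k shapes the tree; it is not taken from the MVAPICH sources, which may number nodes differently.

```python
def knomial_tree(rank, num_procs, k):
    """Return (parent, children) of rank in a k-nomial tree rooted at 0.
    parent is None for the root."""
    if k < 2:
        raise ValueError("k must be an integer >= 2")
    if not 0 <= rank < num_procs:
        raise ValueError("rank out of range")
    parent = None
    mask = 1
    # Find the lowest level at which this rank is not a subtree root.
    while mask < num_procs:
        if rank % (k * mask) != 0:
            parent = rank - rank % (k * mask)
            break
        mask *= k
    mask //= k
    # Children sit at all lower levels, up to k - 1 per level.
    children = []
    while mask > 0:
        for j in range(1, k):
            child = rank + j * mask
            if child < num_procs:
                children.append(child)
        mask //= k
    return parent, children

print(knomial_tree(0, 8, 2))  # binomial tree (k = 2)
print(knomial_tree(0, 9, 3))  # wider, shallower tree (k = 3)
```

A larger k means the root sends to more children per level, so the broadcast needs fewer levels but each node does more sends.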
Turning this option on sets the MPIR_ALLTOALL_SHORT_MSG to 256 and
MPIR_ALLTOALL_MEDIUM_MSG to 32768. This setting is for dual node clusters. This
parameter is not present for PSM device.
Turning this option on sets the MPIR_ALLTOALL_SHORT_MSG to 8192 and
MPIR_ALLTOALL_MEDIUM_MSG to 8192. This setting is for multi-core clusters. This
parameter is not present for PSM device.
To control the number of receive buffers dedicated to UD based connection manager. Each buffer is only several tens of bytes.
To control the timeout value for UD messages.
This controls the preset values for vbuf size, number of RDMA buffers and Rendezvous threshold for various cluster sizes. It can be set to “SMALL” (1-64), “MEDIUM” (64-256) and “LARGE” (256 and beyond). In addition, there is an “AUTO” option which will automatically set the appropriate parameters based on number of processes in the MPI application.
This defines the number of buffers pre-posted for each connection to handle send/receive operations.
This parameter is only effective when blocking mode progress is used. This parameter indicates the number of polls made by MVAPICH before yielding the CPU to other applications.
This is the memory size of RDMA-based implementations for Alltoall and Allgather after which the default point-to-point mechanism is used instead of RDMA.
This is to specify the underlying uDAPL library that the user would like to use if MVAPICH is built with uDAPL.
Name of the InfiniBand device. e.g. mthca0, mthca1.
MTU size in bytes that should be used (e.g. 1024, 2048, 4096). Must be less than or equal to the value supported by the HCA.
Reliability is always enabled and cannot be turned off since UD is an unreliable transport. The following are various options to tune it.
Time (usec) until ACK status is checked (and ACKs are sent if needed)
Time (usec) after which an unacknowledged message will be retried
Number of retries of a message before the job is aborted. This is needed in case an HCA fails.
After this number of messages is received an ACK is sent back to the sender – regardless of MV_PROGRESS_TIMEOUT.
After a message receive is detected and before control is returned to the application, how many messages can be received before an ACK is transmitted to the sender – regardless of MV_PROGRESS_TIMEOUT.
Whether or not to use the zero-copy transfer mechanism to transfer large messages.
How many zero-copy large message transfers can be currently outstanding to a single process.
Messages of this size and above should be transmitted along the zero-copy path
(if MV_USE_UD_ZCOPY is set)
Whether buffer registrations should be cached in the MPI library to increase performance
Whether messages should use the immediate data field of InfiniBand or encode the data into the body of the message.
How many UD QPs should be created and used (in round-robin) for message transfer? Often more than one is required to get full bandwidth.
For messages over this size the sender should verify that the receive has been posted before sending the message.
How many send operations can be outstanding at any given time
Maximum number of receive buffers that can be posted at a single time
Maximum number of completions that can be expected. Generally set to MV_UD_SQ_SIZE + MV_UD_RQ_SIZE.
Whether multiple paths through the network should be used when the LID Mask Count (LMC) value is above 0.
Whether or not shared memory should be used for communication with peers on the same node (instead of network loopback)
This has no effect if macro _SMP_ is not defined. It defines the switch point from Eager protocol to Rendezvous protocol for intra-node communication. If macro _SMP_RNDV_ is defined, then for messages larger than MV_SMP_EAGERSIZE, SMP Rendezvous protocol is used. Note that this variable should be set in KBytes.
This has no effect if macro _SMP_ is not defined. It defines the size of shared buffer between every two processes on the same node for transferring messages smaller than or equal to MV_SMP_EAGERSIZE. Note that this variable should be set in KBytes.
This has no effect if macro _SMP_ is not defined. It defines the packet size when sending intra-node messages larger than MV_SMP_EAGERSIZE. Note that this variable should be set in Bytes.
This has no effect if macro _SMP_ is not defined. It defines the number of internal send buffers for sending intra-node messages larger than MV_SMP_EAGERSIZE.
Enable CPU affinity by setting MV_USE_AFFINITY=1 or disable it by setting
MV_USE_AFFINITY=0. MV_USE_AFFINITY does not take effect when _AFFINITY_ is not
defined.