1. Overview

The MVAPICH2-GDR 2.3a binary release is based on MVAPICH2 2.2 and incorporates designs that take advantage of the new GPUDirect RDMA technology, which enables direct P2P communication between NVIDIA GPUs and Mellanox InfiniBand adapters. MVAPICH2-GDR 2.3a offers significant improvements in latency and bandwidth for GPU-buffer based intranode and internode MPI communication involving small and medium message sizes. For more information on the GPUDirect RDMA technology, refer to http://www.mellanox.com/page/products_dyn?product_family=116.

MVAPICH2-GDR 2.3a provides efficient support for Non-Blocking Collectives (NBC) from GPU buffers to achieve maximal overlap. It uses novel designs that combine GPUDirect RDMA and Core-Direct technologies. Further, MVAPICH2-GDR 2.3a provides support for CUDA Managed Memory features and optimizes large message collectives targeting Deep Learning frameworks.

Note that this release is for GPU clusters with GPUDirect RDMA support. If your cluster does not have this support, please use the default MVAPICH2 library. For more details please refer to http://mvapich.cse.ohio-state.edu/.

2. System Requirements

MVAPICH2-GDR 2.3a binary release requires the following software to be installed on your system:

List of Mellanox InfiniBand adapters and NVIDIA GPU devices which support GPUDirect RDMA can be found here.

MVAPICH2-GDR 2.3a boosts the performance by taking advantage of the new GDRCOPY module from NVIDIA. In order to take advantage of this feature, please download and install this module from:

(https://github.com/NVIDIA/gdrcopy)

After installing this module, add its installation path to your LD_LIBRARY_PATH, or use MV2_GPUDIRECT_GDRCOPY_LIB to pass this path to the MPI library at runtime. For more details please refer to section (GDRCOPY USAGE AND TUNING) of this README.
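
The two options can be sketched as follows. GDRCOPY_HOME is a hypothetical variable introduced only for illustration; the actual install path, and whether the library sits directly in it or in a subdirectory, depends on how gdrcopy was installed on your system.

```shell
# Hypothetical install location of the gdrcopy module; adjust for your system.
GDRCOPY_HOME=/path/to/GDRCOPY/install

# Option 1: make the library visible through the dynamic loader search path,
# keeping any entries that are already present.
export LD_LIBRARY_PATH="$GDRCOPY_HOME${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Option 2: pass the path to the MPI library directly at runtime.
export MV2_GPUDIRECT_GDRCOPY_LIB="$GDRCOPY_HOME"
```

Either option should be sufficient on its own; both are shown together here only for comparison.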

Note that even if this module is not available, MVAPICH2-GDR 2.3a will deliver very good performance by taking advantage of the Loopback feature. For more details refer to section (LOOPBACK FEATURE) of this README.

4. Installing MVAPICH2-GDR library

To install the MVAPICH2-GDR library you simply need to select the correct library (MOFED version, Compiler version, etc.) for your system and install the RPM using your favorite RPM tool. Please use the downloads page (http://mvapich.cse.ohio-state.edu/downloads/) to find the appropriate RPM link and follow the instructions below.

Install library built using PGI compilers against CUDA 9.0 runtime for MOFED 4.0 and CentOS 7
$ wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3a/mofed4.0/mvapich2-gdr-mcast.cuda9.0.176_384.81.mofed4.0.pgi17.10-2.3a_pre-1.el7.centos.x86_64.rpm

$ rpm -Uvh --nodeps mvapich2-gdr-mcast.cuda9.0.176_384.81.mofed4.0.pgi17.10-2.3a_pre-1.el7.centos.x86_64.rpm

The RPMs are relocatable and can be installed using a prefix other than the default of /opt/mvapich2/ used by the library in the previous example.

Install library specifying custom prefix
$ rpm --prefix /custom/install/prefix -Uvh --nodeps mvapich2-gdr-mcast.cuda9.0.176_384.81.mofed4.0.pgi17.10-2.3a_pre-1.el7.centos.x86_64.rpm

If you do not have root permission you can use rpm2cpio to extract the library.

Use rpm2cpio to extract the library
$ rpm2cpio mvapich2-gdr-mcast.cuda9.0.176_384.81.mofed4.0.pgi17.10-2.3a_pre-1.el7.centos.x86_64.rpm | cpio -id

When using the rpm2cpio method, you will need to update the MPI compiler scripts, such as mpicc, so that they point to the path where you placed the library.
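
One possible way to do this is sketched below. It assumes that the wrapper script embeds the original install prefix as plain text, which you should verify by inspecting your own copy of mpicc. The helper name relocate_wrapper, the stand-in file and the /home/user/extract path are all hypothetical.

```shell
# Rewrite every literal occurrence of an old install prefix in a script.
# Assumption: the wrapper embeds the original prefix as plain text.
relocate_wrapper() {
    script=$1
    old_prefix=$2
    new_prefix=$3
    tmp="$script.tmp.$$"
    # Write back through the original file so its permissions are preserved.
    sed "s|$old_prefix|$new_prefix|g" "$script" > "$tmp" &&
        cat "$tmp" > "$script" &&
        rm -f "$tmp"
}

# Demonstration on a stand-in file, not a real mpicc wrapper.
demo_dir=$(mktemp -d)
printf 'prefix=/opt/mvapich2/gdr/2.3a/gnu\n' > "$demo_dir/mpicc"
relocate_wrapper "$demo_dir/mpicc" \
    /opt/mvapich2/gdr/2.3a/gnu \
    /home/user/extract/opt/mvapich2/gdr/2.3a/gnu
cat "$demo_dir/mpicc"
```

The same helper could then be applied to the other wrapper scripts in the extracted bin directory, after checking that they follow the same pattern.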

Tip
If you are using a Debian-based system such as Ubuntu, you can convert the rpm to a deb using a tool such as alien, or follow the rpm2cpio instructions above.

5. Running applications

Here are some examples running applications with the MVAPICH2-GDR software.

5.1. Example running OSU Micro Benchmark

To run the osu_latency test to measure internode MPI Send/Recv latency between GPUs, with GPUDirect RDMA-based designs enabled in MVAPICH2-GDR 2.3a:

$ export MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu
$ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install
$ export MV2_USE_CUDA=1

$ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
        $MV2_PATH/libexec/mvapich2/get_local_rank \
        $MV2_PATH/libexec/mvapich2/osu_latency D D

5.2. Example use of LD_PRELOAD

Some cases may require LD_PRELOAD to be set to the path of the MVAPICH2 library. This should be limited to cases where the application uses an interpreter such as Python to load the CUDA library.

Try setting LD_PRELOAD if you find that MVAPICH2 fails when using your CUDA device buffers in MPI calls.

Example running HOOMD-BLUE with LD_PRELOAD
$ export MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu
$ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install
$ export MV2_USE_CUDA=1

$ $MV2_PATH/bin/mpirun_rsh -n 4 hostA hostA hostB hostB \
        LD_PRELOAD=$MV2_PATH/lib/libmpi.so hoomd lj_liquid_bmark.hoomd

6. Supported Configurations

Below is the list of currently supported configurations. The application is free to use any method to select the GPU. For the HCA, the required parameters have to be exported as shown below. Note that MVAPICH2-GDR 2.3a automatically and dynamically selects the best binding by default, and prints a warning if the user specifies a binding that is not the best mapping. See CPU Binding and Mapping Parameters for more information.

6.1. Single GPU / Single HCA

With single HCA, the default example shown earlier will work fine since there is no HCA selection involved for Single HCA configurations.

Single GPU / Single HCA : (a) 1 GPU and 1 HCA in Different Sockets (b) 1 GPU and 1 HCA in Same Socket

6.2. Single GPU / Multiple HCA

As mentioned earlier, MVAPICH2-GDR 2.3a automatically tries to take advantage of the multirail configuration for small and large message sizes. Note that if one HCA is far from the GPU (on a different socket), the default multirail selection might hurt performance. In that case, please select the HCAs explicitly, listing the nearest HCA first. To explicitly select the near HCA, use the MV2_IBA_HCA or MV2_PROCESS_TO_RAIL_MAPPING parameter.

$ export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
$        or
$ export MV2_IBA_HCA=mlx5_0:mlx5_1

Single GPU / Multiple HCA : (a) 1 GPU and 2 HCAs in Different Sockets (b) 1 GPU and 2 HCAs in Same Socket

6.3. Multiple GPU / Single HCA

For this configuration, users can use the same parameters as for Single GPU / Single HCA.

Multiple GPU / Single HCA : (a) 2 GPUs and 1 HCA in Different Sockets (b) 2 GPUs and 1 HCA in Same Socket

6.4. Multiple GPU / Multiple HCA

By default, MVAPICH2-GDR 2.3a automatically selects the HCA nearest to the GPU selected by the process and binds that process to a core on the same socket. Further, as MVAPICH2-GDR supports dynamic initialization of CUDA devices, the HCA re-selection is performed transparently and dynamically. The parameters below explicitly request the same behavior as the default.

$ export MV2_RAIL_SHARING_POLICY=FIXED_MAPPING
$ export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
$        or
$ export MV2_IBA_HCA=mlx5_0:mlx5_1

6.5. Examples using OSU micro-benchmarks with multi-rail support

To run the osu_mbw_mr test with two processes per node, each process exclusively using a different IB card available on the node:

$ export MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu
$ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install
$ export MV2_USE_CUDA=1
$ export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1

$ $MV2_PATH/bin/mpirun_rsh -export -np 4 hostA hostA hostB hostB \
        $MV2_PATH/libexec/mvapich2/get_local_rank \
        $MV2_PATH/libexec/mvapich2/osu_mbw_mr -d cuda

Multiple GPU / Multiple HCA : 2 GPUs and 2 HCAs in Different Sockets

For more information about running OSU micro-benchmarks to measure MPI communication performance on NVIDIA GPU clusters, please refer to section (CUDA and OpenACC Extensions to OMB) of this README.

7. Enabling Support on GPU-Clusters using regular OFED (without GPUDirect RDMA)

This support targets GPU-based systems that cannot use the GPUDirect RDMA feature: A) systems that use regular OFED instead of Mellanox OFED (MOFED), and B) systems that use older GPUs that do not support GPUDirect RDMA.

For such systems, MVAPICH2-GDR 2.3a efficiently takes advantage of CUDA IPC and GDRCOPY features. Please download and install the appropriate package.

8. Enabling Support for Managed Memory

The Managed Memory feature provides high productivity in developing CUDA kernels by allowing the same memory allocation (pointer) to be used on both CPU and GPU. To enable this productivity for distributed memory programming, MVAPICH2-GDR 2.3a provides CUDA-Aware MPI support with managed memory allocation. In other words, MPI calls can be performed directly on managed memory buffers. Further, it allows mixing different memory allocations in a single application. For instance, one can perform a send operation from a managed memory buffer and receive it in a buffer allocated directly in GPU memory.

To enable this feature, please set MV2_CUDA_ENABLE_MANAGED=1 when executing your application.

MVAPICH2-GDR 2.3a enhances the intra-node support for managed memory when all GPUs in a node have peer access to one another. The design takes advantage of CUDA-IPC to boost the performance of data movement operations from/to managed buffers. To enable this feature, please add MV2_CUDA_MANAGED_IPC=1 to your execution command line.
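
As a rough sketch, the managed memory settings can be passed on the mpirun_rsh command line in the same way LD_PRELOAD is passed in the HOOMD-BLUE example. The application name ./my_app is hypothetical, and the command is only printed here rather than executed.

```shell
MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu

# ./my_app is a hypothetical application. hostA appears twice so that both
# processes share a node, the intra-node case MV2_CUDA_MANAGED_IPC applies to.
launch_cmd="$MV2_PATH/bin/mpirun_rsh -np 2 hostA hostA \
MV2_USE_CUDA=1 MV2_CUDA_ENABLE_MANAGED=1 MV2_CUDA_MANAGED_IPC=1 ./my_app"

# Print the command for inspection instead of running it.
echo "$launch_cmd"
```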

9. Enabling Support for InfiniBand hardware UD-Multicast based collectives

InfiniBand hardware UD-Multicast (IB-MCAST) is a feature for designing highly scalable collective operations. Combining IB-MCAST with other features in MVAPICH2-GDR is extremely helpful for high-performance streaming applications that use MPI_Bcast.

This feature is disabled by default. Basic support for the IB-MCAST feature with GDR can be enabled by using the following parameters. Please refer to the MVAPICH2 User Guide for more details (http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html).

  • MV2_USE_MCAST

    • Default: 0

    • Set this to 1 to enable hardware multicast support in collective communication.

  • MV2_MCAST_NUM_NODES_THRESHOLD

    • Default: 8

    • This defines the threshold for enabling multicast support in collective communication. When MV2_USE_MCAST is set to 1 and the number of nodes in the job is greater than or equal to the threshold value, it uses multicast support in collective communication.
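
The interaction between these two parameters can be sketched as a small shell check. The function name mcast_applies is hypothetical and only mirrors the rule stated above; the real decision is made inside the MPI library.

```shell
# Succeeds (exit status 0) when, per the rule above, multicast would be used
# for a job spanning the given number of nodes.
mcast_applies() {
    nodes=$1
    use_mcast=${MV2_USE_MCAST:-0}                  # documented default: 0
    threshold=${MV2_MCAST_NUM_NODES_THRESHOLD:-8}  # documented default: 8
    [ "$use_mcast" -eq 1 ] && [ "$nodes" -ge "$threshold" ]
}

MV2_USE_MCAST=1
mcast_applies 16 && echo "16 nodes: multicast"
mcast_applies 4  || echo "4 nodes: no multicast"
```

With the default threshold of 8, a 4-node job does not use multicast even though MV2_USE_MCAST is set to 1.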

In addition to the basic support, advanced features can be enabled by using the following parameters:

  • MV2_GPUDIRECT_MCAST_PIPELINE

    • Default: 1 (Enabled)

    • To disable the support for efficient pipelined IB-MCAST from NVIDIA GPUs, set to 0.

  • MV2_GPUDIRECT_MCAST_RECV_TYPE

    • Default: 1 (Basic support)

    • To toggle support for IB-MCAST from/to NVIDIA GPUs. For enabling efficient zero-copy design, set to 2.

  • MV2_MCAST_RELIABILITY_TYPE

    • Default: 1 (Negative acknowledgement-based scheme)

    • To toggle reliability support for UD-based IB-MCAST from/to NVIDIA GPUs. For enabling RMA-based reliability scheme, set to 2.

10. Tuning and Usage Parameters

Note that MVAPICH2-GDR selects the optimal value for each of the following parameters based on architecture detection.

10.1. Basic Usage

  • MV2_USE_CUDA

    • Default: 0 (Disabled)

    • To toggle support for communication from NVIDIA GPUs. For enabling, set to 1.

  • MV2_CUDA_BLOCK_SIZE

    • Default: 262144

    • To tune pipelined internode transfers between NVIDIA GPUs. Higher values may help applications that use larger messages and are bandwidth critical.

  • MV2_GPUDIRECT_LIMIT

    • Default: 8192

    • To tune the hybrid design that uses pipelining and GPUDirect RDMA for maximum performance while overcoming P2P bandwidth bottlenecks seen on modern systems. GPUDirect RDMA is used only for messages with size less than or equal to this limit. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

  • MV2_USE_GPUDIRECT_RECEIVE_LIMIT

    • Default: 131072

    • To tune the hybrid design that uses pipelining and GPUDirect RDMA for maximum performance while overcoming P2P read bandwidth bottlenecks seen on modern systems. Lower values (16384) may help improve performance on nodes with multiple GPUs and IB adapters. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

  • MV2_CUDA_IPC_THRESHOLD

    • Default: 32768

    • To tune the usage of IPC communication path for intranode configuration. Note that if you have processes sharing the same GPU, please increase this parameter to a higher value like 524288.

  • MV2_CUDA_USE_IPC_BCAST

    • Default: 0 (Disabled)

    • To toggle support for IPC-based intra-node broadcast among NVIDIA GPUs. For enabling, set to 1.
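
To make the role of MV2_GPUDIRECT_LIMIT concrete, the following sketch classifies a few message sizes against the limit. The function name gdr_path is hypothetical. Labeling larger messages as "pipelined" is an inference from the description of the hybrid design, and the actual runtime also takes other thresholds into account.

```shell
# Limit in bytes; falls back to the documented default when unset.
limit=${MV2_GPUDIRECT_LIMIT:-8192}

# Report which side of the limit a message size (in bytes) falls on.
gdr_path() {
    if [ "$1" -le "$limit" ]; then
        echo "GPUDirect RDMA"
    else
        echo "pipelined"
    fi
}

for size in 4096 8192 8193 1048576; do
    printf '%8d bytes: %s\n' "$size" "$(gdr_path "$size")"
done
```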

10.2. GDRCOPY Feature

  • MV2_USE_GPUDIRECT_GDRCOPY_LIMIT

    • Default: 8192

    • To tune the local transfer threshold using gdrcopy module between GPU and CPU for point to point communications. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

  • MV2_USE_GPUDIRECT_GDRCOPY_NAIVE_LIMIT

    • Default: 8192

    • To tune the local transfer threshold using gdrcopy module between GPU and CPU for collective communications. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

10.3. Loopback Feature

  • MV2_USE_GPUDIRECT_LOOPBACK_LIMIT

    • Default: 8192

    • To tune the transfer threshold using loopback design for point to point communications. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

  • MV2_USE_GPUDIRECT_LOOPBACK_NAIVE_LIMIT

    • Default: 8192

    • To tune the transfer threshold using loopback design for collective communications. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

10.4. Non-Blocking Collective Support with GDR and Core-Direct

  • MV2_USE_CORE_DIRECT

    • Default: 0

    • To turn this feature on, set it to 1.

    • To enable efficient support while working around the P2P bottleneck, set it to 2.

    • To provide maximal overlap for simultaneous CPU and GPU computing while avoiding the P2P bottlenecks, set it to 3.

10.5. CPU Binding and Mapping Parameters

When experimenting on nodes with multiple NVIDIA GPUs and InfiniBand adapters, selecting the right NVIDIA GPU and IB adapter for each MPI process can be important to achieve good performance. The following parameters help users bind processes to different IB HCAs. GPU device selection is expected to be made in the application using CUDA interfaces like cudaSetDevice. For IB selection, there are two scenarios:

1) Multi-IB and Multi-GPU scenario: on a system with 2 IBs and 2 GPUs, achieving the best performance requires each process to use the GPU closest to the IB. To do so:

  • MV2_PROCESS_TO_RAIL_MAPPING

    • Default: NONE

    • Value Domain: BUNCH, SCATTER, <CUSTOM LIST>

    • When MV2_RAIL_SHARING_POLICY is set to the value FIXED_MAPPING, this variable decides the manner in which the HCAs will be mapped to the rails. The <CUSTOM LIST> is a colon (:) separated list of HCA ranks (e.g. 0:1:1:0) or HCA names (e.g. mlx5_0:mlx5_1:mlx5_0:mlx5_1). For more information on this parameter, refer to the following section of the MVAPICH2 2.3b user guide: http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-21500011.40
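
A custom list such as the one in the example above can be generated rather than typed by hand. The helper below is a hypothetical sketch that cycles through the given HCA names, producing one entry per local process. That reading of the list is an assumption; please consult the MVAPICH2 user guide for the exact semantics.

```shell
# Build a colon-separated HCA list with one entry per process,
# cycling through the given HCA names in order.
rail_mapping() {
    nprocs=$1
    shift
    [ "$#" -gt 0 ] || return 1    # at least one HCA name is required
    mapping=""
    while [ "$nprocs" -gt 0 ]; do
        for hca in "$@"; do
            [ "$nprocs" -gt 0 ] || break
            mapping="${mapping:+$mapping:}$hca"
            nprocs=$(( nprocs - 1 ))
        done
    done
    printf '%s\n' "$mapping"
}

export MV2_RAIL_SHARING_POLICY=FIXED_MAPPING
export MV2_PROCESS_TO_RAIL_MAPPING="$(rail_mapping 4 mlx5_0 mlx5_1)"
echo "$MV2_PROCESS_TO_RAIL_MAPPING"
```

For four processes and two HCAs this prints mlx5_0:mlx5_1:mlx5_0:mlx5_1, matching the named example above.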

2) Multi-IB and 1 GPU scenario: with multi-rail support, sending a large message from/to a GPU will take advantage of both IBs. In addition to the MV2_PROCESS_TO_RAIL_MAPPING parameter, the following parameters can be used:

  • MV2_RAIL_SHARING_POLICY

  • MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD

    • Default: 16K

    • This specifies the message size beyond which striping of messages across all available rails will take place.

It is also important to bind MPI processes as close to the GPU and IB adapter as possible. The following parameter allows you to manually control process-to-core mapping. MVAPICH2-GDR 2.3a automatically selects the best binding by default, and prints a warning if the user specifies a binding that is not the best mapping.

10.6. GPU Datatype Processing Feature

  • MV2_CUDA_DIRECT_DT_THRESHOLD

    • Default: 8

    • To tune the direct transfer scheme using asynchronous CUDA memory copy for datatype packing/unpacking. Direct transfer scheme can avoid the kernel invocation overhead for dense datatypes. It has to be tuned based on the node architecture, the processor, the GPU and the IB card.

  • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE

    • Default: 1024

    • To tune the thread block size for vector/hvector packing/unpacking kernels. It has to be tuned based on the vector/hvector datatype shape and the GPU.

  • MV2_CUDA_KERNEL_VECTOR_YSIZE

    • Default: 32

    • To tune the y dimension of thread block for vector/hvector packing/unpacking kernels. This value is automatically tuned based on the block length of vector/hvector datatypes. It also can be tuned based on the vector/hvector datatype shape and the GPU.

  • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE

    • Default: 1024

    • To tune the thread block size for subarray packing/unpacking kernels. It has to be tuned based on the subarray datatype dimension, shape and the GPU.

  • MV2_CUDA_KERNEL_SUBARR_XDIM

    • Default: 8 (3D) /16 (2D) /256 (1D)

    • To tune the x dimension of thread block for subarray packing/unpacking kernels. It has to be tuned based on the subarray datatype dimension, shape and the GPU.

  • MV2_CUDA_KERNEL_SUBARR_YDIM

    • Default: 8 (3D) /32 (2D) /4 (1D)

    • To tune the y dimension of thread block for subarray packing/unpacking kernels. It has to be tuned based on the subarray datatype dimension, shape and the GPU.

  • MV2_CUDA_KERNEL_SUBARR_ZDIM

    • Default: 16 (3D) /1 (2D) /1 (1D)

    • To tune the z dimension of thread block for subarray packing/unpacking kernels. It has to be tuned based on the subarray datatype dimension, shape and the GPU.

  • MV2_CUDA_KERNEL_ALL_XDIM

    • Default: 16

    • To tune the x dimension of thread block for all datatypes except vector/hvector, indexed_block/hindexed_block and subarray. It has to be tuned based on the datatype shape and the GPU.

  • MV2_CUDA_KERNEL_IDXBLK_XDIM

    • Default: 1

    • To tune the x dimension of thread block for indexed_block/hindexed_block packing/unpacking kernels. It has to be tuned based on the indexed_block/hindexed_block datatype shape and the GPU.

11. CUDA-Aware MPI Primitives

The following CUDA-aware MPI primitives are available as a part of the library:

  • Point to Point

    • MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Ssend, MPI_Rsend, MPI_Issend, MPI_Bsend, MPI_Sendrecv, MPI_Send_init, MPI_Recv_init, MPI_Ssend_init, MPI_Bsend_init, MPI_Rsend_init

  • Collectives

    • MPI_Bcast, MPI_Scatter, MPI_Scatterv, MPI_Gather, MPI_Gatherv, MPI_Reduce, MPI_Allreduce, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scan, MPI_Exscan, MPI_Reduce_scatter, MPI_Reduce_scatter_block

  • Non-Blocking Collectives (NBC)

    • MPI_Ibcast, MPI_Iscatter, MPI_Igather, MPI_Iallgather, MPI_Ialltoall

  • One Sided Communications
    [True one sided implementation for transfers from/to GPU]

    Allocation Primitives
    • MPI_Win_create, MPI_Win_allocate, MPI_Win_allocate_shared, MPI_Win_create_dynamic

    Communication Primitives
    • MPI_Get, MPI_Put, MPI_Accumulate

    Synchronization Primitives
    • MPI_Win_lock/MPI_Win_unlock, MPI_Win_lock_all/MPI_Win_unlock_all, MPI_Win_flush, MPI_Win_fence, MPI_Win_flush_all, MPI_Win_flush_local, MPI_Win_flush_local_all, MPI_Win_post/MPI_Win_start/MPI_Win_wait/MPI_Win_complete

  • Enhanced support of MPI datatypes for GPU

    • MPI_Type_vector, MPI_Type_hvector, MPI_Type_create_struct, MPI_Type_create_subarray, MPI_Type_indexed, MPI_Type_hindexed, MPI_Type_create_hindexed_block

12. CUDA and OpenACC Extensions to OMB

The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers on NVIDIA GPU devices.

osu_bibw            Bidirectional Bandwidth Test
osu_bw              Bandwidth Test
osu_latency         Latency Test
osu_put_latency     Latency Test for Put
osu_get_latency     Latency Test for Get
osu_put_bw          Bandwidth Test for Put
osu_get_bw          Bandwidth Test for Get
osu_put_bibw        Bidirectional Bandwidth Test for Put
osu_acc_latency     Latency Test for Accumulate
osu_cas_latency     Latency Test for Compare and Swap
osu_fop_latency     Latency Test for Fetch and Op
osu_allgather       MPI_Allgather Latency Test
osu_allgatherv      MPI_Allgatherv Latency Test
osu_allreduce       MPI_Allreduce Latency Test
osu_alltoall        MPI_Alltoall Latency Test
osu_alltoallv       MPI_Alltoallv Latency Test
osu_bcast           MPI_Bcast Latency Test
osu_gather          MPI_Gather Latency Test
osu_gatherv         MPI_Gatherv Latency Test
osu_reduce          MPI_Reduce Latency Test
osu_reduce_scatter  MPI_Reduce_scatter Latency Test
osu_scatter         MPI_Scatter Latency Test
osu_scatterv        MPI_Scatterv Latency Test
osu_iallgather      MPI_Iallgather Latency and Overlap Test
osu_iallgatherv     MPI_Iallgatherv Latency and Overlap Test
osu_iallreduce      MPI_Iallreduce Latency and Overlap Test
osu_ialltoall       MPI_Ialltoall Latency and Overlap Test
osu_ialltoallv      MPI_Ialltoallv Latency and Overlap Test
osu_ibcast          MPI_Ibcast Latency and Overlap Test
osu_igather         MPI_Igather Latency and Overlap Test
osu_igatherv        MPI_Igatherv Latency and Overlap Test
osu_iscatter        MPI_Iscatter Latency and Overlap Test
osu_iscatterv       MPI_Iscatterv Latency and Overlap Test

Some directions for usage are:

  • The CUDA extensions are enabled when the benchmark suite is configured with --enable-cuda option.

  • The OpenACC extensions are enabled when --enable-openacc is specified. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time.

  • Each of the pt2pt benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each of these parameters can be either H or D to indicate if the buffers are to be on the host or on the device respectively. When no parameters are specified, the buffers are allocated on the host.

  • The collective benchmarks will use buffers allocated on the device if the -d option is used; otherwise, the buffers will be allocated on the host.

  • The non-blocking collective benchmarks can also use -t for MPI_Test() calls and -r option for setting the target of dummy computation.
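
The pt2pt buffer-location parameters can be illustrated with a small helper that assembles a benchmark command and rejects anything other than H or D. The helper pt2pt_cmd and the variable OMB_DIR are hypothetical; the command is printed rather than executed, and in practice it would be passed to mpirun_rsh as in the earlier examples.

```shell
MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu
OMB_DIR=$MV2_PATH/libexec/mvapich2

# Print (not run) a pt2pt benchmark command.
# Arguments: benchmark name, buffer location at rank 0, buffer location at rank 1.
pt2pt_cmd() {
    bench=$1
    rank0=$2
    rank1=$3
    for loc in "$rank0" "$rank1"; do
        case $loc in
            H|D) ;;   # host or device buffers
            *) echo "buffer location must be H or D, got: $loc" >&2
               return 1 ;;
        esac
    done
    echo "$OMB_DIR/$bench $rank0 $rank1"
}

# Device buffers at rank 0, host buffers at rank 1.
pt2pt_cmd osu_bw D H
```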

13. Managed Memory Extensions to OMB

In addition to the CUDA and OpenACC support, the benchmarks listed later in this section support managed memory allocation. To enable this support in the MVAPICH2 runtime, set the environment variable mentioned earlier (MV2_CUDA_ENABLE_MANAGED=1). To make the benchmarks allocate memory using the CUDA managed memory API, use the flags below.

  • For pt2pt operations use the M flag

  • For collective operations use the -d managed flag

13.1. Example running OSU Micro Benchmarks with Managed Memory Support

To run the osu_latency test to measure internode MPI Send/Recv latency between GPUs, with managed memory allocations enabled on both the sender and receiver:

$ export MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu
$ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install
$ export MV2_USE_CUDA=1
$ export MV2_CUDA_ENABLE_MANAGED=1

$ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
        $MV2_PATH/libexec/mvapich2/get_local_rank \
        $MV2_PATH/libexec/mvapich2/osu_latency M M

For collectives, see the example below:

$ export MV2_PATH=/opt/mvapich2/gdr/2.3a/gnu
$ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install
$ export MV2_USE_CUDA=1
$ export MV2_CUDA_ENABLE_MANAGED=1

$ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
        $MV2_PATH/libexec/mvapich2/get_local_rank \
        $MV2_PATH/libexec/mvapich2/osu_bcast -d managed

The following benchmarks have been extended to evaluate performance of MPI Communications to and from buffers allocated with CUDA Managed memory.

osu_bibw            Bidirectional Bandwidth Test
osu_bw              Bandwidth Test
osu_latency         Latency Test
osu_allgather       MPI_Allgather Latency Test
osu_allgatherv      MPI_Allgatherv Latency Test
osu_allreduce       MPI_Allreduce Latency Test
osu_alltoall        MPI_Alltoall Latency Test
osu_alltoallv       MPI_Alltoallv Latency Test
osu_bcast           MPI_Bcast Latency Test
osu_gather          MPI_Gather Latency Test
osu_gatherv         MPI_Gatherv Latency Test
osu_reduce          MPI_Reduce Latency Test
osu_reduce_scatter  MPI_Reduce_scatter Latency Test
osu_scatter         MPI_Scatter Latency Test
osu_scatterv        MPI_Scatterv Latency Test