MVAPICH2-GDR 2.0
----------------

The MVAPICH2-GDR 2.0 binary release is based on MVAPICH2 2.0 and incorporates
designs that take advantage of the new GPUDirect RDMA technology, which enables
direct P2P communication between NVIDIA GPUs and Mellanox InfiniBand adapters.
MVAPICH2-GDR 2.0 offers significant improvements in latency and bandwidth for
GPU-buffer based MPI communication involving small and medium message sizes.
For more information on the GPUDirect RDMA technology, refer to:
http://www.mellanox.com/page/products_dyn?product_family=116

Note that this release is intended for GPU clusters with GPUDirect RDMA
support. If your cluster does not have this support, please use the default
MVAPICH2 library. For more details, please refer to:
http://mvapich.cse.ohio-state.edu/

System Requirements
-------------------

The MVAPICH2-GDR 2.0 binary release requires the following software to be
installed on your system:

1) Mellanox OFED 2.1 or later
   (http://www.mellanox.com/page/products_dyn?product_family=26)

2) NVIDIA Driver 331.20 or later
   (http://www.nvidia.com/Download/driverResults.aspx/69372/)

3) NVIDIA CUDA Toolkit 6.0 or later
   (https://developer.nvidia.com/cuda-toolkit)

4) Plugin module to enable GPUDirect RDMA
   (http://www.mellanox.com/page/products_dyn?product_family=116)

The list of Mellanox InfiniBand adapters and NVIDIA GPU devices that support
GPUDirect RDMA can be found here:
http://www.mellanox.com/page/products_dyn?product_family=116

Strongly Recommended System Features
------------------------------------

MVAPICH2-GDR 2.0 boosts performance by taking advantage of the new GDRCOPY
module from NVIDIA. To take advantage of this feature, please download and
install this module from:
https://github.com/NVIDIA/gdrcopy

After installing this module, add its path to your LD_LIBRARY_PATH or use
MV2_GPUDIRECT_GDRCOPY_LIB to pass the path to the MPI library at runtime. For
more details, please refer to the section (GDRCOPY feature: Usage and Tuning
Parameters) of this README.

Note that even if this module is not available, MVAPICH2-GDR 2.0 will still
deliver very good performance by taking advantage of the Loopback feature. For
more details, refer to the section (Loopback Feature: Usage and Tuning
Parameters) of this README.

Example for running OSU Micro Benchmark
---------------------------------------

To run the osu_latency test for measuring internode MPI Send/Recv latency
between GPUs, with the GPUDirect RDMA-based designs in MVAPICH2-GDR 2.0
enabled:

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_GPUDIRECT_GDRCOPY_LIB=path/to/GDRCOPY/install
export MV2_USE_CUDA=1

$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_latency D D

Basic Usage and Tuning Parameters
---------------------------------

- MV2_USE_CUDA
  * Default: 0 (Disabled)
  * To toggle support for communication from NVIDIA GPUs. Set to 1 to enable.

- MV2_CUDA_BLOCK_SIZE
  * Default: 262144
  * To tune pipelined internode transfers between NVIDIA GPUs. Higher values
    may help applications that use larger messages and are bandwidth critical.

- MV2_GPUDIRECT_LIMIT
  * Default: 8192
  * To tune the hybrid design that uses pipelining and GPUDirect RDMA for
    maximum performance while overcoming the P2P bandwidth bottlenecks seen on
    modern systems. GPUDirect RDMA is used only for messages whose size is
    less than or equal to this limit. It has to be tuned based on the node
    architecture, the processor, the GPU and the IB card.

- MV2_USE_GPUDIRECT_RECEIVE_LIMIT
  * Default: 131072
  * To tune the hybrid design that uses pipelining and GPUDirect RDMA for
    maximum performance while overcoming the P2P read bandwidth bottlenecks
    seen on modern systems. Lower values (e.g. 16384) may help improve
    performance on nodes with multiple GPUs and IB adapters. It has to be
    tuned based on the node architecture, the processor, the GPU and the IB
    card.
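The tuning parameters above can be combined on the command line in the same
way as the osu_latency example. The sketch below is illustrative only: the
values shown are not recommendations and have to be re-tuned for each system,
and it assumes that osu_bw is installed next to osu_latency under
$MV2_PATH/libexec/mvapich2/.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
# Illustrative values only; the defaults are 262144 and 8192 respectively.
export MV2_CUDA_BLOCK_SIZE=524288
export MV2_GPUDIRECT_LIMIT=8192

# osu_bw path assumed, following the osu_latency example above
$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_bw D D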
GDRCOPY feature: Usage and Tuning Parameters
--------------------------------------------

- MV2_USE_GPUDIRECT_GDRCOPY_LIMIT
  * Default: 8192
  * To tune the threshold for local transfers between GPU and CPU using the
    gdrcopy module for point-to-point communications. It has to be tuned based
    on the node architecture, the processor, the GPU and the IB card.

- MV2_USE_GPUDIRECT_GDRCOPY_NAIVE_LIMIT
  * Default: 8192
  * To tune the threshold for local transfers between GPU and CPU using the
    gdrcopy module for collective communications. It has to be tuned based on
    the node architecture, the processor, the GPU and the IB card.

Loopback Feature: Usage and Tuning Parameters
---------------------------------------------

- MV2_USE_GPUDIRECT_LOOPBACK_LIMIT
  * Default: 8192
  * To tune the transfer threshold for the loopback design for point-to-point
    communications. It has to be tuned based on the node architecture, the
    processor, the GPU and the IB card.

- MV2_USE_GPUDIRECT_LOOPBACK_NAIVE_LIMIT
  * Default: 8192
  * To tune the transfer threshold for the loopback design for collective
    communications. It has to be tuned based on the node architecture, the
    processor, the GPU and the IB card.
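As a sketch of how the GDRCOPY and loopback thresholds might be raised for a
small-message latency experiment, the run below mirrors the earlier
osu_latency example. The threshold values are purely illustrative and have to
be tuned per system.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=path/to/GDRCOPY/install
# Illustrative thresholds; the default for both parameters is 8192.
export MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
export MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=16384

$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_latency D D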
CPU Binding and Mapping Parameters
----------------------------------

When experimenting on nodes with multiple NVIDIA GPUs and InfiniBand adapters,
selecting the right NVIDIA GPU and IB adapter for each MPI process can be
important to achieve good performance. The following parameters help users
bind processes to different IB HCAs. GPU device selection is expected to be
made in the application using CUDA interfaces like cudaSetDevice.

For the IB selection, we have the following scenarios:

1) Multi-IB and Multi-GPU scenario: on a system with 2 IB adapters and 2 GPUs,
   achieving the best performance requires each process to use the GPU closest
   to its IB adapter. To do so:

   - MV2_PROCESS_TO_RAIL_MAPPING
     * Default: NONE
     * Value Domain: BUNCH, SCATTER, or a colon-separated custom list
     * When MV2_RAIL_SHARING_POLICY is set to the value FIXED_MAPPING, this
       variable decides the manner in which the HCAs will be mapped to the
       rails. This is a colon (:) separated list with the HCA ranks
       (e.g. 0:1:1:0) or HCA names (e.g. mlx5_0:mlx5_1:mlx5_0:mlx5_1)
       specified.

   For more information on this parameter, refer to the following section of
   the MVAPICH2 2.0 user guide:
   http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-670006.14

2) Multi-IB and 1 GPU scenario: in this scenario, to take advantage of the
   multi-rail support, sending a large message from/to a GPU will use both IB
   adapters. In addition to the MV2_PROCESS_TO_RAIL_MAPPING parameter, the
   following parameters can be used:

   - MV2_RAIL_SHARING_POLICY
     * Default: ROUND_ROBIN
     * Value Domain: USE_FIRST, ROUND_ROBIN, FIXED_MAPPING
     * This specifies the policy that will be used to assign HCAs to each of
       the processes.

   For more information on this parameter, refer to the following section of
   the MVAPICH2 2.0 user guide:
   http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-670006.14

   - MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD
     * Default: 16K
     * This specifies the message size beyond which striping of messages
       across all available rails will take place.

It is also important to bind MPI processes as close to the GPU and IB adapter
as possible. The following parameter allows you to manually control the
process-to-core mapping. MVAPICH2-GDR 2.0 automatically applies the best
binding by default and prints a warning if the user specifies a binding that
is not the best mapping.

- MV2_CPU_MAPPING
  * Default: Unset
  * This allows users to specify the process-to-CPU (core) mapping. The
    detailed usage of this parameter is described in:
    http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-570006.5.2

GPU Datatype Processing Feature: Usage and Tuning
-------------------------------------------------

- MV2_CUDA_DIRECT_DT_THRESHOLD
  * Default: 8
  * To tune the direct transfer scheme that uses asynchronous CUDA memory
    copies for datatype packing/unpacking. The direct transfer scheme can
    avoid the kernel invocation overhead for dense datatypes. It has to be
    tuned based on the node architecture, the processor, the GPU and the IB
    card.

- MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
  * Default: 1024
  * To tune the thread block size for vector/hvector packing/unpacking
    kernels. It has to be tuned based on the vector/hvector datatype shape and
    the GPU.

- MV2_CUDA_KERNEL_VECTOR_YSIZE
  * Default: 32
  * To tune the y dimension of the thread block for vector/hvector
    packing/unpacking kernels. This value is automatically tuned based on the
    block length of vector/hvector datatypes. It can also be tuned based on
    the vector/hvector datatype shape and the GPU.

- MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
  * Default: 1024
  * To tune the thread block size for subarray packing/unpacking kernels. It
    has to be tuned based on the subarray datatype dimension, shape and the
    GPU.

- MV2_CUDA_KERNEL_SUBARR_XDIM
  * Default: 8 (3D) / 16 (2D) / 256 (1D)
  * To tune the x dimension of the thread block for subarray packing/unpacking
    kernels. It has to be tuned based on the subarray datatype dimension,
    shape and the GPU.

- MV2_CUDA_KERNEL_SUBARR_YDIM
  * Default: 8 (3D) / 32 (2D) / 4 (1D)
  * To tune the y dimension of the thread block for subarray packing/unpacking
    kernels. It has to be tuned based on the subarray datatype dimension,
    shape and the GPU.

- MV2_CUDA_KERNEL_SUBARR_ZDIM
  * Default: 16 (3D) / 1 (2D) / 1 (1D)
  * To tune the z dimension of the thread block for subarray packing/unpacking
    kernels. It has to be tuned based on the subarray datatype dimension,
    shape and the GPU.

- MV2_CUDA_KERNEL_ALL_XDIM
  * Default: 16
  * To tune the x dimension of the thread block for all datatypes except
    vector/hvector, indexed_block/hindexed_block and subarray. It has to be
    tuned based on the datatype shape and the GPU.

- MV2_CUDA_KERNEL_IDXBLK_XDIM
  * Default: 1
  * To tune the x dimension of the thread block for
    indexed_block/hindexed_block packing/unpacking kernels. It has to be tuned
    based on the indexed_block/hindexed_block datatype shape and the GPU.
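The datatype-kernel parameters above only take effect for applications that
communicate non-contiguous GPU data through MPI derived datatypes. The sketch
below shows how such an application might be launched with adjusted kernel
geometry; ./my_ddt_app is a hypothetical application binary and the values are
illustrative, not recommendations.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
# Illustrative values; the defaults are 1024 and 32 respectively.
export MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE=512
export MV2_CUDA_KERNEL_VECTOR_YSIZE=16

# ./my_ddt_app is a placeholder for an application using derived datatypes
$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    ./my_ddt_app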
Examples using OSU micro-benchmarks with multi-rail support
-----------------------------------------------------------

To run the osu_mbw_mr test with two processes per node, with each process
exclusively using a different IB card available on the nodes:

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_GPUDIRECT_GDRCOPY_LIB=path/to/GDRCOPY/install
export MV2_USE_CUDA=1
export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G

$MV2_PATH/bin/mpirun_rsh -export -np 4 hostA hostA hostB hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_mbw_mr -d cuda

For more information about running OSU micro-benchmarks to measure MPI
communication performance on NVIDIA GPU clusters, please refer to:
http://mvapich.cse.ohio-state.edu/benchmarks/
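As an additional illustrative sketch for the Multi-IB and 1 GPU scenario
described in the CPU Binding and Mapping Parameters section, large messages
from a single GPU can be striped across both HCAs. The threshold value and the
osu_bw path below are assumptions patterned on the examples above, not
recommendations.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
# ROUND_ROBIN is already the default; it is set here only to make the policy explicit.
export MV2_RAIL_SHARING_POLICY=ROUND_ROBIN
# Stripe messages larger than 16K across all available rails (illustrative value).
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=16K

$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_bw D D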