MVAPICH2-GDR 2.0
----------------

The MVAPICH2-GDR 2.0 binary release is based on MVAPICH2 2.0 and incorporates
designs that take advantage of the new GPUDirect RDMA technology, which enables
direct P2P communication between NVIDIA GPUs and Mellanox InfiniBand adapters.
MVAPICH2-GDR 2.0 offers significant improvements in latency and bandwidth for
GPU-buffer based MPI communication involving small and medium message sizes.
For more information on the GPUDirect RDMA technology, refer to:
http://www.mellanox.com/page/products_dyn?product_family=116

Note that this release is intended for GPU clusters with GPUDirect RDMA
support. If your cluster does not have this support, please use the default
MVAPICH2 library. For more details, please refer to:
http://mvapich.cse.ohio-state.edu/

System Requirements
-------------------

The MVAPICH2-GDR 2.0 binary release requires the following software to be
installed on your system:

1) Mellanox OFED 2.1 or later
   (http://www.mellanox.com/page/products_dyn?product_family=26)

2) NVIDIA Driver 331.20 or later
   (http://www.nvidia.com/Download/driverResults.aspx/69372/)

3) NVIDIA CUDA Toolkit 6.0 or later
   (https://developer.nvidia.com/cuda-toolkit)

4) Plugin module to enable GPUDirect RDMA
   (http://www.mellanox.com/page/products_dyn?product_family=116)

The list of Mellanox InfiniBand adapters and NVIDIA GPU devices that support
GPUDirect RDMA can be found here:
http://www.mellanox.com/page/products_dyn?product_family=116

Strongly Recommended System Features
------------------------------------

MVAPICH2-GDR 2.0 boosts performance by taking advantage of the new GDRCOPY
module from NVIDIA. To take advantage of this feature, please download and
install this module from:
https://github.com/NVIDIA/gdrcopy

After installing this module, add its path to your LD_LIBRARY_PATH or use
MV2_GPUDIRECT_GDRCOPY_LIB to pass the path to the MPI library at runtime. For
more details, please refer to the section (GDRCOPY feature: Usage and Tuning
Parameters) of this README.

Note that even if this module is not available, MVAPICH2-GDR 2.0 will still
deliver very good performance by taking advantage of the Loopback feature. For
more details, refer to the section (Loopback Feature: Usage and Tuning
Parameters) of this README.

Example for running OSU Micro Benchmark
---------------------------------------

To run the osu_latency test for measuring internode MPI Send/Recv latency
between GPUs, with the GPUDirect RDMA-based designs in MVAPICH2-GDR 2.0
enabled:

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_GPUDIRECT_GDRCOPY_LIB=path/to/GDRCOPY/install
export MV2_USE_CUDA=1

$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_latency D D

Basic Usage and Tuning Parameters
---------------------------------

- MV2_USE_CUDA
  * Default: 0 (Disabled)
  * To toggle support for communication from NVIDIA GPUs. Set to 1 to enable.

- MV2_CUDA_BLOCK_SIZE
  * Default: 262144
  * To tune pipelined internode transfers between NVIDIA GPUs. Higher values
    may help applications that use larger messages and are bandwidth critical.

- MV2_GPUDIRECT_LIMIT
  * Default: 8192
  * To tune the hybrid design that uses pipelining and GPUDirect RDMA for
    maximum performance while overcoming the P2P bandwidth bottlenecks seen on
    modern systems. GPUDirect RDMA is used only for messages whose size is
    less than or equal to this limit. It has to be tuned based on the node
    architecture, the processor, the GPU and the IB card.

- MV2_USE_GPUDIRECT_RECEIVE_LIMIT
  * Default: 131072
  * To tune the hybrid design that uses pipelining and GPUDirect RDMA for
    maximum performance while overcoming the P2P read bandwidth bottlenecks
    seen on modern systems. Lower values (e.g. 16384) may help improve
    performance on nodes with multiple GPUs and IB adapters. It has to be
    tuned based on the node architecture, the processor, the GPU and the IB
    card.
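The tuning parameters above can be combined on the command line in the same
way as the osu_latency example. The sketch below is illustrative only: the
values shown are not recommendations and have to be re-tuned for each system,
and it assumes that osu_bw is installed next to osu_latency under
$MV2_PATH/libexec/mvapich2/.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
# Illustrative values only; the defaults are 262144 and 8192 respectively.
export MV2_CUDA_BLOCK_SIZE=524288
export MV2_GPUDIRECT_LIMIT=8192

# osu_bw path assumed, following the osu_latency example above
$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_bw D D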
GDRCOPY feature: Usage and Tuning Parameters
--------------------------------------------

- MV2_USE_GPUDIRECT_GDRCOPY_LIMIT
  * Default: 8192
  * To tune the threshold for local transfers between GPU and CPU using the
    gdrcopy module for point-to-point communications. It has to be tuned based
    on the node architecture, the processor, the GPU and the IB card.

- MV2_USE_GPUDIRECT_GDRCOPY_NAIVE_LIMIT
  * Default: 8192
  * To tune the threshold for local transfers between GPU and CPU using the
    gdrcopy module for collective communications. It has to be tuned based on
    the node architecture, the processor, the GPU and the IB card.

Loopback Feature: Usage and Tuning Parameters
---------------------------------------------

- MV2_USE_GPUDIRECT_LOOPBACK_LIMIT
  * Default: 8192
  * To tune the transfer threshold for the loopback design for point-to-point
    communications. It has to be tuned based on the node architecture, the
    processor, the GPU and the IB card.

- MV2_USE_GPUDIRECT_LOOPBACK_NAIVE_LIMIT
  * Default: 8192
  * To tune the transfer threshold for the loopback design for collective
    communications. It has to be tuned based on the node architecture, the
    processor, the GPU and the IB card.
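As a sketch of how the GDRCOPY and loopback thresholds might be raised for a
small-message latency experiment, the run below mirrors the earlier
osu_latency example. The threshold values are purely illustrative and have to
be tuned per system.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=path/to/GDRCOPY/install
# Illustrative thresholds; the default for both parameters is 8192.
export MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
export MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=16384

$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_latency D D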
CPU Binding and Mapping Parameters
----------------------------------

When experimenting on nodes with multiple NVIDIA GPUs and InfiniBand adapters,
selecting the right NVIDIA GPU and IB adapter for each MPI process can be
important to achieve good performance. The following parameters help users
bind processes to different IB HCAs. GPU device selection is expected to be
made in the application using CUDA interfaces like cudaSetDevice.

For the IB selection, we have the following scenarios:

1) Multi-IB and Multi-GPU scenario: on a system with 2 IB adapters and 2 GPUs,
   achieving the best performance requires each process to use the GPU closest
   to its IB adapter. To do so:

   - MV2_PROCESS_TO_RAIL_MAPPING
     * Default: NONE
     * Value Domain: BUNCH, SCATTER, or a colon-separated custom list
     * When MV2_RAIL_SHARING_POLICY is set to the value FIXED_MAPPING, this
       variable decides the manner in which the HCAs will be mapped to the
       rails. This is a colon (:) separated list with the HCA ranks
       (e.g. 0:1:1:0) or HCA names (e.g. mlx5_0:mlx5_1:mlx5_0:mlx5_1)
       specified.

   For more information on this parameter, refer to the following section of
   the MVAPICH2 2.0 user guide:
   http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-670006.14

2) Multi-IB and 1 GPU scenario: in this scenario, to take advantage of the
   multi-rail support, sending a large message from/to a GPU will use both IB
   adapters. In addition to the MV2_PROCESS_TO_RAIL_MAPPING parameter, the
   following parameters can be used:

   - MV2_RAIL_SHARING_POLICY
     * Default: ROUND_ROBIN
     * Value Domain: USE_FIRST, ROUND_ROBIN, FIXED_MAPPING
     * This specifies the policy that will be used to assign HCAs to each of
       the processes.

   For more information on this parameter, refer to the following section of
   the MVAPICH2 2.0 user guide:
   http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-670006.14

   - MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD
     * Default: 16K
     * This specifies the message size beyond which striping of messages
       across all available rails will take place.

It is also important to bind MPI processes as close to the GPU and IB adapter
as possible. The following parameter allows you to manually control the
process-to-core mapping. MVAPICH2-GDR 2.0 automatically applies the best
binding by default and prints a warning if the user specifies a binding that
is not the best mapping.

- MV2_CPU_MAPPING
  * Default: Unset
  * This allows users to specify the process-to-CPU (core) mapping. The
    detailed usage of this parameter is described in:
    http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-570006.5.2

GPU Datatype Processing Feature: Usage and Tuning
-------------------------------------------------

- MV2_CUDA_DIRECT_DT_THRESHOLD
  * Default: 8
  * To tune the direct transfer scheme that uses asynchronous CUDA memory
    copies for datatype packing/unpacking. The direct transfer scheme can
    avoid the kernel invocation overhead for dense datatypes. It has to be
    tuned based on the node architecture, the processor, the GPU and the IB
    card.

- MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
  * Default: 1024
  * To tune the thread block size for vector/hvector packing/unpacking
    kernels. It has to be tuned based on the vector/hvector datatype shape and
    the GPU.

- MV2_CUDA_KERNEL_VECTOR_YSIZE
  * Default: 32
  * To tune the y dimension of the thread block for vector/hvector
    packing/unpacking kernels. This value is automatically tuned based on the
    block length of vector/hvector datatypes. It can also be tuned based on
    the vector/hvector datatype shape and the GPU.

- MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
  * Default: 1024
  * To tune the thread block size for subarray packing/unpacking kernels. It
    has to be tuned based on the subarray datatype dimension, shape and the
    GPU.

- MV2_CUDA_KERNEL_SUBARR_XDIM
  * Default: 8 (3D) / 16 (2D) / 256 (1D)
  * To tune the x dimension of the thread block for subarray packing/unpacking
    kernels. It has to be tuned based on the subarray datatype dimension,
    shape and the GPU.

- MV2_CUDA_KERNEL_SUBARR_YDIM
  * Default: 8 (3D) / 32 (2D) / 4 (1D)
  * To tune the y dimension of the thread block for subarray packing/unpacking
    kernels. It has to be tuned based on the subarray datatype dimension,
    shape and the GPU.

- MV2_CUDA_KERNEL_SUBARR_ZDIM
  * Default: 16 (3D) / 1 (2D) / 1 (1D)
  * To tune the z dimension of the thread block for subarray packing/unpacking
    kernels. It has to be tuned based on the subarray datatype dimension,
    shape and the GPU.

- MV2_CUDA_KERNEL_ALL_XDIM
  * Default: 16
  * To tune the x dimension of the thread block for all datatypes except
    vector/hvector, indexed_block/hindexed_block and subarray. It has to be
    tuned based on the datatype shape and the GPU.

- MV2_CUDA_KERNEL_IDXBLK_XDIM
  * Default: 1
  * To tune the x dimension of the thread block for
    indexed_block/hindexed_block packing/unpacking kernels. It has to be tuned
    based on the indexed_block/hindexed_block datatype shape and the GPU.
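The datatype-kernel parameters above only take effect for applications that
communicate non-contiguous GPU data through MPI derived datatypes. The sketch
below shows how such an application might be launched with adjusted kernel
geometry; ./my_ddt_app is a hypothetical application binary and the values are
illustrative, not recommendations.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
# Illustrative values; the defaults are 1024 and 32 respectively.
export MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE=512
export MV2_CUDA_KERNEL_VECTOR_YSIZE=16

# ./my_ddt_app is a placeholder for an application using derived datatypes
$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    ./my_ddt_app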
Examples using OSU micro-benchmarks with multi-rail support
-----------------------------------------------------------

To run the osu_mbw_mr test with two processes per node, with each process
exclusively using a different IB card available on the nodes:

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_GPUDIRECT_GDRCOPY_LIB=path/to/GDRCOPY/install
export MV2_USE_CUDA=1
export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G

$MV2_PATH/bin/mpirun_rsh -export -np 4 hostA hostA hostB hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_mbw_mr -d cuda

For more information about running OSU micro-benchmarks to measure MPI
communication performance on NVIDIA GPU clusters, please refer to:
http://mvapich.cse.ohio-state.edu/benchmarks/
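As an additional illustrative sketch for the Multi-IB and 1 GPU scenario
described in the CPU Binding and Mapping Parameters section, large messages
from a single GPU can be striped across both HCAs. The threshold value and the
osu_bw path below are assumptions patterned on the examples above, not
recommendations.

export MV2_PATH=/opt/mvapich2/gdr/2.0/gnu
export MV2_USE_CUDA=1
# ROUND_ROBIN is already the default; it is set here only to make the policy explicit.
export MV2_RAIL_SHARING_POLICY=ROUND_ROBIN
# Stripe messages larger than 16K across all available rails (illustrative value).
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=16K

$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
    $MV2_PATH/libexec/mvapich2/get_local_rank \
    $MV2_PATH/libexec/mvapich2/osu_bw D D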