MVAPICH2-GDR Changelog
----------------------

This file briefly describes the changes to the MVAPICH2-GDR software
package. The logs are arranged in the "most recent first" order.

MVAPICH2-GDR 2.3.7 (05/27/2022)

* Features and Enhancements (Since 2.3.6)
    - Enhanced performance for GPU-aware MPI_Alltoall and MPI_Alltoallv
    - Added automatic rebinding of processes to cores based on the GPU
      NUMA domain
        - Enabled by setting the environment variable
          MV2_GPU_AUTO_REBIND=1
    - Added NCCL communication substrate for various non-blocking MPI
      collectives (see the usage sketch after this entry)
        - MPI_Iallreduce, MPI_Ireduce, MPI_Iallgather, MPI_Iallgatherv,
          MPI_Ialltoall, MPI_Ialltoallv, MPI_Iscatter, MPI_Iscatterv,
          MPI_Igather, MPI_Igatherv, and MPI_Ibcast
    - Enhanced point-to-point and collective tuning for AMD Milan
      processors with NVIDIA A100 and AMD Mi100 GPUs
    - Enhanced point-to-point and collective tuning for NVIDIA DGX-A100
      systems
    - Added support for the Cray Slingshot-10 interconnect
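
The non-blocking GPU-aware collectives listed above take CUDA device
pointers directly, with MVAPICH2-GDR selecting the NCCL-, CUDA-, or IB
verbs-based path internally. The following minimal sketch (not part of
the MVAPICH2-GDR distribution; buffer size and reduction operation are
illustrative) shows application-side usage of MPI_Iallreduce on a
cudaMalloc'ed buffer. Settings such as MV2_GPU_AUTO_REBIND=1 and
MV2_USE_CUDA=1 would be exported in the job environment rather than set
in code.

    /* Sketch: GPU-aware non-blocking Allreduce on a CUDA device buffer.
     * Assumes an MVAPICH2-GDR build with CUDA support; error checking
     * is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;                 /* illustrative size */
        double *d_buf;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        cudaMalloc((void **)&d_buf, count * sizeof(double));
        cudaMemset(d_buf, 0, count * sizeof(double)); /* real data here */

        /* The device pointer is passed directly to MPI; no explicit
         * staging through host memory is needed in the application. */
        MPI_Iallreduce(MPI_IN_PLACE, d_buf, count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Independent work can overlap here before completion. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }
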
MVAPICH2-GDR 2.3.6 (08/12/2021)

* Features and Enhancements (Since 2.3.5)
    - Based on MVAPICH2 2.3.6
    - Added support for 'on-the-fly' compression of point-to-point
      messages used for GPU-to-GPU communication
        - Applicable to NVIDIA GPUs
    - Added NCCL communication substrate for various MPI collectives
        - Support for hybrid communication protocols using NCCL-based,
          CUDA-based, and IB verbs-based primitives
        - MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv,
          MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv,
          MPI_Gather, MPI_Gatherv, and MPI_Bcast
    - Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2
      A-100 systems
    - Enhanced architecture detection, process placement, and HCA
      selection
    - Enhanced intra-node and inter-node point-to-point tuning
    - Enhanced collective tuning
    - Introduced architecture detection, point-to-point tuning, and
      collective tuning for ThetaGPU @ANL
    - Enhanced point-to-point and collective tuning for NVIDIA GPUs on
      Frontera @TACC, Lassen @LLNL, and Sierra @LLNL
    - Enhanced point-to-point and collective tuning for Mi50 and Mi60
      AMD GPUs on Corona @LLNL
    - Added several new MPI_T PVARs
    - Added support for CUDA 11.3
    - Added support for ROCm 4.1
    - Enhanced output for runtime variable MV2_SHOW_ENV_INFO
    - Tested with Horovod and common DL frameworks
        - TensorFlow, PyTorch, and MXNet
    - Tested with MPI4Dask 0.2
        - MPI4Dask is a custom Dask Distributed package with MPI support
    - Tested with MPI4cuML 0.1
        - MPI4cuML is a custom cuML package with MPI support

* Bug Fixes (Since 2.3.5)
    - Fix a bug where GPUs and HCAs were incorrectly identified as being
      on different sockets
        - Thanks to Chris Chambreau @LLNL for the report
    - Fix issues in collective tuning tables
    - Fix issues with adaptive HCA selection
    - Fix compilation warnings and memory leaks

MVAPICH2-GDR 2.3.5 (12/11/2020)

* Features and Enhancements (Since 2.3.4)
    - Based on MVAPICH2 2.3.5
    - Added support for AMD GPUs via the Radeon Open Compute (ROCm)
      platform (see the sketch after this entry)
    - Added support for ROCm PeerDirect, ROCm IPC, and unified
      memory-based device-to-device communication for AMD GPUs
    - Enhanced designs for GPU-aware MPI_Alltoall
    - Enhanced designs for GPU-aware MPI_Allgather
    - Added support for enhanced MPI derived datatype processing via
      kernel fusion
    - Added architecture-specific flags to improve the performance of
      CUDA operations
    - Added support for the Apache MXNet Deep Learning Framework
    - Added GPU-based point-to-point tuning for AMD Mi50 and Mi60 GPUs
    - Enhanced GPU-based Alltoall and Allgather tuning for POWER9
      systems
    - Enhanced GPU-based Allreduce tuning for the Frontera RTX system
    - Tested with PyTorch and the DeepSpeed framework for distributed
      Deep Learning

* Bug Fixes (Since 2.3.4)
    - Fix performance degradation in the first CUDA call due to CUDA JIT
      compilation for PTX compatibility
    - Fix validation issue with kernel-based datatype processing
    - Fix validation issue with GPU-based MPI_Scatter
    - Fix a potential issue when using MPI_Win_allocate
        - Thanks to Bert Wesarg at TU Dresden and George Katevenisi at
          ICS Forth for reporting the issue and providing the initial
          patch
    - Fix out-of-memory issue when allocating CUDA events
    - Fix compilation errors with PGI 20.x compilers
    - Fix compilation warnings and memory leaks
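
The ROCm support added in this release lets HIP device pointers be
passed straight to MPI calls, mirroring the existing CUDA-aware path. A
minimal sketch (not from the MVAPICH2-GDR sources; size and operation
are illustrative) of a GPU-aware MPI_Allreduce on an AMD GPU buffer,
assuming a build configured with ROCm support:

    /* Sketch: GPU-aware MPI_Allreduce on a ROCm (HIP) device buffer.
     * Error checking is omitted for brevity. */
    #include <mpi.h>
    #include <hip/hip_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;          /* illustrative message size */
        double *d_buf;

        MPI_Init(&argc, &argv);
        hipMalloc((void **)&d_buf, count * sizeof(double));
        hipMemset(d_buf, 0, count * sizeof(double)); /* real data here */

        /* The HIP device pointer is handed directly to MPI. */
        MPI_Allreduce(MPI_IN_PLACE, d_buf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        hipFree(d_buf);
        MPI_Finalize();
        return 0;
    }
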
MVAPICH2-GDR 2.3.4 (06/04/2020)

* Features and Enhancements (Since 2.3.3)
    - Based on MVAPICH2 2.3.4
    - Enhanced MPI_Allreduce performance on DGX-2 systems
    - Enhanced MPI_Allreduce performance on POWER9 systems
    - Reduced the CUDA interception overhead for non-CUDA symbols
    - Enhanced performance for point-to-point and collective operations
      on Frontera's RTX nodes
    - Added new runtime variable 'MV2_SUPPORT_DL' to replace
      'MV2_SUPPORT_TENSOR_FLOW'
    - Added compilation and runtime methods for checking CUDA support
    - Enhanced GDR output for runtime variable MV2_SHOW_ENV_INFO
    - Tested with Horovod and common DL frameworks (TensorFlow, PyTorch,
      and MXNet)
    - Tested with PyTorch Distributed

* Bug Fixes (Since 2.3.3)
    - Fix hang caused by the use of multiple communicators
    - Fix detection of Intel CPU model name
    - Fix intermediate buffer size for Allreduce when a DL workload is
      expected
    - Fix random hangs in IMB4-RMA tests
    - Fix hang in OMP offloading
    - Fix hang with the '-w dynamic' option when using one-sided
      benchmarks on device buffers
    - Add proper fallback and warning message when a shared RMA window
      cannot be created
    - Fix potential FP exception error in MPI_Allreduce
        - Thanks to Shinichiro Takizawa@AIST for the report
    - Fix data validation issue in MPI_Allreduce
        - Thanks to Andreas Herten@JSC for the report
    - Fix the need for preloading libmpi.so
        - Thanks to Andreas Herten@JSC for the feedback
    - Fix compilation issue with the PGI compiler
    - Fix compilation warnings and memory leaks

MVAPICH2-GDR 2.3.3 (01/09/2020)

* Features and Enhancements (Since 2.3.2)
    - Based on MVAPICH2 2.3.3
    - Support for GDRCopy v2.0
    - Support for jsrun
    - Enhanced datatype support for CUDA kernel-based Allreduce
    - Enhanced inter-node point-to-point performance for CUDA managed
      buffers on POWER9 systems
    - Enhanced CUDA-Aware MPI_Allreduce on NVLink-enabled GPU systems
    - Enhanced CUDA-Aware MPI_Pack and MPI_Unpack

* Bug Fixes (Since 2.3.2)
    - Fix data validation in datatype support for device buffers
    - Fix segfault for MPI_Gather on device buffers
    - Fix segfault and hang in intra-node communication on device
      buffers
    - Fix issues in intra-node MPI_Bcast design on device buffers
    - Fix incorrect initialization of CUDA support for collective
      communication
    - Fix compilation warnings

MVAPICH2-GDR 2.3.2 (08/08/2019)

* Features and Enhancements (Since 2.3.1)
    - Based on MVAPICH2 2.3.1
    - Support for CUDA 10.1
    - Support for PGI 19.x
    - Enhanced intra-node and inter-node point-to-point performance
    - Enhanced MPI_Allreduce performance for DGX-2 systems
    - Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode
    - Enhanced performance of datatype support for GPU-resident data
      (see the sketch after this entry)
        - Zero-copy transfer when P2P access is available between GPUs
          through NVLink/PCIe
    - Enhanced GPU-based point-to-point and collective tuning on
      OpenPOWER systems such as ORNL Summit and LLNL Sierra, the ABCI
      system @AIST, and the Owens and Pitzer systems @Ohio Supercomputer
      Center
    - Scaled Allreduce to 24,576 Volta GPUs on Summit

* Bug Fixes (Since 2.3.1)
    - Fix hang issue in host-based MPI_Alltoallv
    - Fix GPU communication progress in MPI_THREAD_MULTIPLE mode
    - Fix potential failures in GDRCopy registration
    - Fix compilation warnings
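
The datatype support referenced above allows non-contiguous data to be
communicated directly from GPU memory, with the library handling the
packing. A minimal sketch (not from the MVAPICH2-GDR sources; the matrix
dimensions and two-rank exchange are illustrative) of sending one
strided column of a device-resident matrix via an MPI derived datatype:

    /* Sketch: strided transfer from a CUDA device buffer using
     * MPI_Type_vector. Run with at least two ranks; error checking
     * is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int rows = 64, cols = 64;  /* illustrative matrix size */
        double *d_mat;
        MPI_Datatype column;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_mat, rows * cols * sizeof(double));

        /* One element per row, strided by the row length: one column
         * of a row-major matrix. */
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_mat, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_mat, 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_mat);
        MPI_Finalize();
        return 0;
    }
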
MVAPICH2-GDR 2.3.1 (03/16/2019)

* Features and Enhancements (Since 2.3)
    - Based on MVAPICH2 2.3.1
    - Enhanced intra-node and inter-node point-to-point performance for
      DGX-2, IBM POWER8, and IBM POWER9 systems
    - Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9
      systems
    - Enhanced small-message performance for CUDA-Aware MPI_Put and
      MPI_Get (see the sketch after this entry)
    - Support for PGI 18.10
    - Added new runtime variables
        - 'MV2_GDRCOPY_LIMIT' to replace
          'MV2_USE_GPUDIRECT_GDRCOPY_LIMIT'
        - 'MV2_GDRCOPY_NAIVE_LIMIT' to replace
          'MV2_USE_GPUDIRECT_GDRCOPY_NAIVE_LIMIT'
        - 'MV2_USE_GDRCOPY' to replace 'MV2_USE_GPUDIRECT_GDRCOPY'
    - Flexible support for running TensorFlow (Horovod) jobs

* Bug Fixes (Since 2.3)
    - Fix data validation issue in CUDA-Aware MPI_Reduce
    - Fix hang in CUDA-Aware MPI_Get_accumulate
    - Fix compilation errors with clang
    - Fix compilation warnings
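
CUDA-Aware one-sided communication means both the window memory and the
origin buffer may reside on the GPU. A minimal sketch (not from the
MVAPICH2-GDR sources; buffer size and fence-based synchronization are
illustrative) of an MPI_Put between device buffers, in the spirit of the
one-sided device-buffer benchmarks mentioned elsewhere in this log:

    /* Sketch: CUDA-Aware MPI_Put between device buffers. Run with at
     * least two ranks; error checking is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 256;           /* illustrative message size */
        double *d_win_buf, *d_origin;
        MPI_Win win;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_win_buf, count * sizeof(double));
        cudaMalloc((void **)&d_origin, count * sizeof(double));

        /* Expose the device buffer as an RMA window. */
        MPI_Win_create(d_win_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)  /* write rank 0's device data into rank 1's window */
            MPI_Put(d_origin, count, MPI_DOUBLE, 1, 0, count, MPI_DOUBLE,
                    win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        cudaFree(d_win_buf);
        cudaFree(d_origin);
        MPI_Finalize();
        return 0;
    }
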
MVAPICH2-GDR 2.3 (11/10/2018)

* Features and Enhancements (Since 2.3rc1)
    - Support for CUDA 10.0

* Bug Fixes (Since 2.3rc1)
    - Fix memory leaks in CUDA-based collectives
    - Fix memory leaks in CUDA IPC cache designs
    - Fix segfault when freeing NULL IPC resources

MVAPICH2-GDR 2.3rc1 (09/22/2018)

* Features and Enhancements (Since 2.3a)
    - Based on MVAPICH2 2.3
    - Support for CUDA 9.2
    - Support for OpenPOWER9 with NVLink
    - Support for IBM XLC and PGI compilers with CUDA kernel features
    - Enhanced point-to-point performance for small messages
    - Enhanced Alltoallv operation for host buffers
    - Enhanced CUDA-based collective tuning on OpenPOWER8/9 systems
    - Enhanced large-message Reduce, Broadcast, and Allreduce for Deep
      Learning workloads
    - Added new runtime variables 'MV2_USE_GPUDIRECT_RDMA' and
      'MV2_USE_GDR' to replace 'MV2_USE_GPUDIRECT'
    - Support for collective offload using Mellanox's SHArP for
      Allreduce on host buffers
    - Enhanced tuning framework for Allreduce using SHArP
    - Enhanced host-based collectives for IBM POWER8/9, Intel Skylake,
      Intel KNL, and Intel Broadwell architectures

* Bug Fixes (Since 2.3a)
    - Fix issues with InfiniBand Multicast (IB-MCAST) based designs for
      GPU-based Broadcast
    - Fix hang issue with the zero-copy Broadcast operation
    - Fix issue with datatype processing for host buffers
    - Fix application crash with the GDRCopy feature
    - Fix memory leak in CUDA-based Allreduce algorithms
    - Fix data validation issue for Allreduce algorithms
    - Fix data validation issue for non-blocking Gather operation

MVAPICH2-GDR 2.3a (11/09/2017)

* Features and Enhancements (Since 2.2)
    - Based on MVAPICH2 2.2
    - Support for CUDA 9.0
    - Added support for the Volta (V100) GPU
    - Support for OpenPOWER with NVLink
    - Efficient multiple CUDA stream-based IPC communication for
      multi-GPU systems with and without NVLink
    - Enhanced performance of GPU-based point-to-point communication
    - Leverage the Linux Cross Memory Attach (CMA) feature for enhanced
      host-based communication
    - Enhanced performance of MPI_Allreduce for GPU-resident data
    - InfiniBand Multicast (IB-MCAST) based designs for GPU-based
      broadcast and streaming applications
        * Basic support for IB-MCAST designs with GPUDirect RDMA
        * Advanced support for zero-copy IB-MCAST designs with GPUDirect
          RDMA
        * Advanced reliability support for IB-MCAST designs
    - Efficient broadcast designs for Deep Learning applications
    - Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1
      systems

* Bug Fixes (Since 2.2)
    - Fix issue with MPI_Finalize when MV2_USE_GPUDIRECT=0
    - Fix data validation issue with GDRCOPY and loopback
    - Fix issue with runtime error when MV2_USE_CUDA=0
    - Fix issue with MPI_Allreduce for the R3 protocol
    - Fix warning message when the GDRCOPY module cannot be used

MVAPICH2-GDR 2.3a (Pre-release Version) (08/10/2017)

* Features and Enhancements (Since 2.2)
    - Based on MVAPICH2 2.2
    - InfiniBand Multicast (IB-MCAST) based designs for GPU-based
      broadcast
        * Basic support for IB-MCAST designs with GPUDirect RDMA
        * Advanced support for zero-copy IB-MCAST designs with GPUDirect
          RDMA
        * Advanced reliability support for IB-MCAST designs
    - Efficient broadcast designs for DL applications
    - Enhanced tuning for the broadcast collective on
        * RI2@OSU
        * CSCS systems
        * OpenPOWER systems at LLNL

MVAPICH2-GDR 2.2 (10/25/2016)

* Features and Enhancements (Since 2.2rc1)
    - Based on MVAPICH2 2.2
    - Added support for CUDA 8.0
    - Added support for the Pascal (P100) GPU
    - Efficient support for CUDA-aware large-message collectives
      targeting Deep Learning frameworks
        - Efficient support for Bcast
        - Efficient support for Reduce
        - Efficient support for Allreduce
    - Introduced a GPU-based tuning framework for the Reduce collective
      on Broadwell+EDR+K80 based systems

* Bug Fixes (Since 2.2rc1)
    - Correctly guard Core-Direct support for GPU-based non-blocking
      collectives
    - Minor fixes to the GPU-based collective tuning framework for
      Wilkes and CSCS systems
    - Fix compilation warnings

MVAPICH2-GDR 2.2rc1 (05/27/2016)

* Features and Enhancements (Since 2.2b)
    - Based on MVAPICH2 2.2rc1
    - Support for high-performance non-blocking send operations from
      GPU buffers
    - Enhanced intra-node CUDA Managed-aware communication using a new
      CUDA IPC-based design
    - Added support for RDMA-CM communication
    - Introduced support for RoCE-V1 and RoCE-V2
    - Introduced a GPU-based tuning framework for the Bcast operation
    - Introduced a GPU-based tuning framework for the Gather operation

* Bug Fixes (Since 2.2b)
    - Properly handle socket/NUMA node binding
    - Remove the usage of the default stream during communication
    - Fix compile warnings
    - Properly handle out-of-WQE scenarios
    - Fix memory leaks in the multicast code path

MVAPICH2-GDR 2.2b (02/04/2016)

* Features and Enhancements (Since MVAPICH2-GDR 2.2a)
    - Based on MVAPICH2 2.2b
    - Support for CUDA-Aware Managed memory (see the sketch after this
      entry)
        - Efficient intra-node and inter-node communication directly
          from/to Managed pointers
        - Automatic support for heterogeneous programs with both managed
          and non-managed (traditional) memory allocations

* Bug Fixes (Since MVAPICH2-GDR 2.2a)
    - Provide a clearer message when the GDR library is unable to use
      the GDR kernel driver
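
With CUDA-Aware Managed memory support, a pointer obtained from
cudaMallocManaged can be handed to MPI like any host or device buffer. A
minimal sketch (not from the MVAPICH2-GDR sources; size and the two-rank
exchange are illustrative):

    /* Sketch: MPI point-to-point communication on a CUDA managed
     * buffer. Run with at least two ranks; error checking is omitted
     * for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;          /* illustrative message size */
        double *m_buf;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMallocManaged((void **)&m_buf, count * sizeof(double),
                          cudaMemAttachGlobal);

        /* The managed pointer is used like any other MPI buffer; it is
         * also directly usable from host code and CUDA kernels. */
        if (rank == 0)
            MPI_Send(m_buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(m_buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(m_buf);
        MPI_Finalize();
        return 0;
    }
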
MVAPICH2-GDR 2.2a (11/09/2015)

* Features and Enhancements (Since MVAPICH2-GDR 2.1)
    - Based on MVAPICH2 2.2a
    - Support for efficient non-blocking collectives for device buffers
        - Exploit both Core-Direct and GPUDirect RDMA features
        - Maximal overlap of communication and computation on both CPU
          and GPU
    - Enable support on GPU clusters using regular OFED (without
      GPUDirect RDMA)
        - Capability to use IPC
        - Capability to use GDRCOPY
    - Tuned IPC thresholds for multi-GPU nodes

* Bug Fixes (Since MVAPICH2-GDR 2.1)
    - Fix IPC interaction with RMA lock synchronization
        - Thanks to Akihiro Tabuchi@University of Tsukuba

MVAPICH2-GDR 2.1 (10/16/2015)

* Features and Enhancements (Since MVAPICH2-GDR 2.1rc2)
    - Based on MVAPICH2 2.1
    - CUDA 7.5 compatibility
    - Multi-rail support for the loopback design
    - Multi-rail support for medium and large message sizes for H-H,
      H-D, D-H, and D-D transfers
    - Automatic and dynamic rail and CPU binding
    - Support, optimization, and tuning for CS-Storm architectures
    - Optimization and tuning for point-to-point and collective
      operations
    - Support for the SLURM launcher environment

* Bug Fixes (Since MVAPICH2-GDR 2.1rc2)
    - Fix IPC interaction with SLURM
        - Thanks to Mark Klein@CSCS for the report
    - Fix heterogeneous support where some processes do not use GPUs
        - Thanks to Xavier Lapillonne@MeteoSwiss for the report
    - Fix for on-demand connection with the loopback design
        - Thanks to Mark Klein@CSCS for the report

MVAPICH2-GDR 2.1rc2 (06/24/2015)

* Features and Enhancements (Since MVAPICH2-GDR 2.1a)
    - Based on MVAPICH2 2.1rc2
    - CUDA 7.0 compatibility
    - CUDA-Aware support for MPI_Rsend and MPI_Irsend primitives
    - Parallel intra-node communication channels (shared memory for H-H
      and GDR for D-D)
    - Optimized H-H, H-D, and D-H communication
    - Optimized intra-node D-D communication
    - Optimization and tuning for point-to-point and collective
      operations
    - Updated sm_20 kernel optimization for datatype processing

* Bug Fixes (Since MVAPICH2-GDR 2.1a)
    - Fix data validation in some scenarios for GDR-based communication
    - Fix the overlap performance issue for IPC-based communication
    - Fix RMA operations with the R3 protocol

MVAPICH2-GDR 2.1a (12/20/2014)

* Features and Enhancements (Since MVAPICH2-GDR 2.0)
    - Based on MVAPICH2 2.1a
    - Optimized design for GPU-based small message transfers
    - Added R3 support for GPU-based packetized transfers
    - Enhanced performance for small message host-to-device transfers
    - Support for MPI_Scan and MPI_Exscan collective operations from
      GPU buffers
    - Optimization of collectives with new copy designs

* Bug Fixes (Since MVAPICH2-GDR 2.0)
    - Fix issue in one-sided tests with post-start-complete-wait
      synchronization
    - Fix error with void pointer arithmetic for the PGI compiler
    - Fix issue in CUDA-aware MPI_Unpack
    - Fix compiler warnings and memory leaks

MVAPICH2-GDR 2.0 (08/23/2014)

* Features and Enhancements (Since MVAPICH2-GDR 2.0b)
    - Based on MVAPICH2 2.0 (OFA-IB-CH3 interface)
    - Support for efficient MPI-3 RMA (one-sided) communication using
      GPUDirect RDMA and pipelining
    - Efficient small-message inter-node communication using the new
      NVIDIA GDRCOPY module
    - Efficient small-message inter-node communication using the
      loopback design
    - Automatic communication channel selection for different GPU
      communication modes (DD, HH, and HD) in different configurations
      (intra-IOH and inter-IOH)
    - Automatic selection and binding of the best GPU/HCA pair in a
      multi-GPU/HCA system configuration
    - Optimized and tuned support for collective communication from GPU
      buffers
    - Enhanced and efficient support for datatype processing on GPU
      buffers, including support for vector/h-vector, index/h-index,
      array, and subarray types

* Bug Fixes (Since MVAPICH2-GDR 2.0b)
    - Fix a bug in the registration cache for GPU buffers