MVAPICH2-GDR Changelog
----------------------

This file briefly describes the changes to the MVAPICH2-GDR software
package. The logs are arranged in the "most recent first" order.

MVAPICH2-GDR 2.3.7 (05/27/2022)

* Features and Enhancements (Since 2.3.6)
    - Enhanced performance for GPU-aware MPI_Alltoall and MPI_Alltoallv
    - Added automatic rebinding of processes to cores based on the GPU
      NUMA domain
        - Enabled by setting the environment variable
          MV2_GPU_AUTO_REBIND=1
    - Added NCCL communication substrate for various non-blocking MPI
      collectives (see the usage sketch after this entry)
        - MPI_Iallreduce, MPI_Ireduce, MPI_Iallgather, MPI_Iallgatherv,
          MPI_Ialltoall, MPI_Ialltoallv, MPI_Iscatter, MPI_Iscatterv,
          MPI_Igather, MPI_Igatherv, and MPI_Ibcast
    - Enhanced point-to-point and collective tuning for AMD Milan
      processors with NVIDIA A100 and AMD Mi100 GPUs
    - Enhanced point-to-point and collective tuning for NVIDIA DGX-A100
      systems
    - Added support for the Cray Slingshot-10 interconnect
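
The non-blocking GPU-aware collectives listed above take CUDA device
pointers directly, with MVAPICH2-GDR selecting the NCCL-, CUDA-, or IB
verbs-based path internally. The following minimal sketch (not part of
the MVAPICH2-GDR distribution; buffer size and reduction operation are
illustrative) shows application-side usage of MPI_Iallreduce on a
cudaMalloc'ed buffer. Settings such as MV2_GPU_AUTO_REBIND=1 and
MV2_USE_CUDA=1 would be exported in the job environment rather than set
in code.

    /* Sketch: GPU-aware non-blocking Allreduce on a CUDA device buffer.
     * Assumes an MVAPICH2-GDR build with CUDA support; error checking
     * is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;                 /* illustrative size */
        double *d_buf;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        cudaMalloc((void **)&d_buf, count * sizeof(double));
        cudaMemset(d_buf, 0, count * sizeof(double)); /* real data here */

        /* The device pointer is passed directly to MPI; no explicit
         * staging through host memory is needed in the application. */
        MPI_Iallreduce(MPI_IN_PLACE, d_buf, count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Independent work can overlap here before completion. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }
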
MVAPICH2-GDR 2.3.6 (08/12/2021)

* Features and Enhancements (Since 2.3.5)
    - Based on MVAPICH2 2.3.6
    - Added support for 'on-the-fly' compression of point-to-point
      messages used for GPU-to-GPU communication
        - Applicable to NVIDIA GPUs
    - Added NCCL communication substrate for various MPI collectives
        - Support for hybrid communication protocols using NCCL-based,
          CUDA-based, and IB verbs-based primitives
        - MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv,
          MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv,
          MPI_Gather, MPI_Gatherv, and MPI_Bcast
    - Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2
      A-100 systems
    - Enhanced architecture detection, process placement, and HCA
      selection
    - Enhanced intra-node and inter-node point-to-point tuning
    - Enhanced collective tuning
    - Introduced architecture detection, point-to-point tuning, and
      collective tuning for ThetaGPU @ANL
    - Enhanced point-to-point and collective tuning for NVIDIA GPUs on
      Frontera @TACC, Lassen @LLNL, and Sierra @LLNL
    - Enhanced point-to-point and collective tuning for Mi50 and Mi60
      AMD GPUs on Corona @LLNL
    - Added several new MPI_T PVARs
    - Added support for CUDA 11.3
    - Added support for ROCm 4.1
    - Enhanced output for runtime variable MV2_SHOW_ENV_INFO
    - Tested with Horovod and common DL frameworks
        - TensorFlow, PyTorch, and MXNet
    - Tested with MPI4Dask 0.2
        - MPI4Dask is a custom Dask Distributed package with MPI support
    - Tested with MPI4cuML 0.1
        - MPI4cuML is a custom cuML package with MPI support

* Bug Fixes (Since 2.3.5)
    - Fix a bug where GPUs and HCAs were incorrectly identified as being
      on different sockets
        - Thanks to Chris Chambreau @LLNL for the report
    - Fix issues in collective tuning tables
    - Fix issues with adaptive HCA selection
    - Fix compilation warnings and memory leaks

MVAPICH2-GDR 2.3.5 (12/11/2020)

* Features and Enhancements (Since 2.3.4)
    - Based on MVAPICH2 2.3.5
    - Added support for AMD GPUs via the Radeon Open Compute (ROCm)
      platform (see the sketch after this entry)
    - Added support for ROCm PeerDirect, ROCm IPC, and unified
      memory-based device-to-device communication for AMD GPUs
    - Enhanced designs for GPU-aware MPI_Alltoall
    - Enhanced designs for GPU-aware MPI_Allgather
    - Added support for enhanced MPI derived datatype processing via
      kernel fusion
    - Added architecture-specific flags to improve the performance of
      CUDA operations
    - Added support for the Apache MXNet Deep Learning Framework
    - Added GPU-based point-to-point tuning for AMD Mi50 and Mi60 GPUs
    - Enhanced GPU-based Alltoall and Allgather tuning for POWER9
      systems
    - Enhanced GPU-based Allreduce tuning for the Frontera RTX system
    - Tested with PyTorch and the DeepSpeed framework for distributed
      Deep Learning

* Bug Fixes (Since 2.3.4)
    - Fix performance degradation in the first CUDA call due to CUDA JIT
      compilation for PTX compatibility
    - Fix validation issue with kernel-based datatype processing
    - Fix validation issue with GPU-based MPI_Scatter
    - Fix a potential issue when using MPI_Win_allocate
        - Thanks to Bert Wesarg at TU Dresden and George Katevenisi at
          ICS Forth for reporting the issue and providing the initial
          patch
    - Fix out-of-memory issue when allocating CUDA events
    - Fix compilation errors with PGI 20.x compilers
    - Fix compilation warnings and memory leaks
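
The ROCm support added in this release lets HIP device pointers be
passed straight to MPI calls, mirroring the existing CUDA-aware path. A
minimal sketch (not from the MVAPICH2-GDR sources; size and operation
are illustrative) of a GPU-aware MPI_Allreduce on an AMD GPU buffer,
assuming a build configured with ROCm support:

    /* Sketch: GPU-aware MPI_Allreduce on a ROCm (HIP) device buffer.
     * Error checking is omitted for brevity. */
    #include <mpi.h>
    #include <hip/hip_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;          /* illustrative message size */
        double *d_buf;

        MPI_Init(&argc, &argv);
        hipMalloc((void **)&d_buf, count * sizeof(double));
        hipMemset(d_buf, 0, count * sizeof(double)); /* real data here */

        /* The HIP device pointer is handed directly to MPI. */
        MPI_Allreduce(MPI_IN_PLACE, d_buf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        hipFree(d_buf);
        MPI_Finalize();
        return 0;
    }
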
MVAPICH2-GDR 2.3.4 (06/04/2020)

* Features and Enhancements (Since 2.3.3)
    - Based on MVAPICH2 2.3.4
    - Enhanced MPI_Allreduce performance on DGX-2 systems
    - Enhanced MPI_Allreduce performance on POWER9 systems
    - Reduced the CUDA interception overhead for non-CUDA symbols
    - Enhanced performance for point-to-point and collective operations
      on Frontera's RTX nodes
    - Added new runtime variable 'MV2_SUPPORT_DL' to replace
      'MV2_SUPPORT_TENSOR_FLOW'
    - Added compilation and runtime methods for checking CUDA support
    - Enhanced GDR output for runtime variable MV2_SHOW_ENV_INFO
    - Tested with Horovod and common DL frameworks (TensorFlow, PyTorch,
      and MXNet)
    - Tested with PyTorch Distributed

* Bug Fixes (Since 2.3.3)
    - Fix hang caused by the use of multiple communicators
    - Fix detection of Intel CPU model name
    - Fix intermediate buffer size for Allreduce when a DL workload is
      expected
    - Fix random hangs in IMB4-RMA tests
    - Fix hang in OMP offloading
    - Fix hang with the '-w dynamic' option when using one-sided
      benchmarks on device buffers
    - Add proper fallback and warning message when a shared RMA window
      cannot be created
    - Fix potential FP exception error in MPI_Allreduce
        - Thanks to Shinichiro Takizawa@AIST for the report
    - Fix data validation issue in MPI_Allreduce
        - Thanks to Andreas Herten@JSC for the report
    - Fix the need for preloading libmpi.so
        - Thanks to Andreas Herten@JSC for the feedback
    - Fix compilation issue with the PGI compiler
    - Fix compilation warnings and memory leaks

MVAPICH2-GDR 2.3.3 (01/09/2020)

* Features and Enhancements (Since 2.3.2)
    - Based on MVAPICH2 2.3.3
    - Support for GDRCopy v2.0
    - Support for jsrun
    - Enhanced datatype support for CUDA kernel-based Allreduce
    - Enhanced inter-node point-to-point performance for CUDA managed
      buffers on POWER9 systems
    - Enhanced CUDA-Aware MPI_Allreduce on NVLink-enabled GPU systems
    - Enhanced CUDA-Aware MPI_Pack and MPI_Unpack

* Bug Fixes (Since 2.3.2)
    - Fix data validation in datatype support for device buffers
    - Fix segfault for MPI_Gather on device buffers
    - Fix segfault and hang in intra-node communication on device
      buffers
    - Fix issues in intra-node MPI_Bcast design on device buffers
    - Fix incorrect initialization of CUDA support for collective
      communication
    - Fix compilation warnings

MVAPICH2-GDR 2.3.2 (08/08/2019)

* Features and Enhancements (Since 2.3.1)
    - Based on MVAPICH2 2.3.1
    - Support for CUDA 10.1
    - Support for PGI 19.x
    - Enhanced intra-node and inter-node point-to-point performance
    - Enhanced MPI_Allreduce performance for DGX-2 systems
    - Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode
    - Enhanced performance of datatype support for GPU-resident data
      (see the sketch after this entry)
        - Zero-copy transfer when P2P access is available between GPUs
          through NVLink/PCIe
    - Enhanced GPU-based point-to-point and collective tuning on
      OpenPOWER systems such as ORNL Summit and LLNL Sierra, the ABCI
      system @AIST, and the Owens and Pitzer systems @Ohio Supercomputer
      Center
    - Scaled Allreduce to 24,576 Volta GPUs on Summit

* Bug Fixes (Since 2.3.1)
    - Fix hang issue in host-based MPI_Alltoallv
    - Fix GPU communication progress in MPI_THREAD_MULTIPLE mode
    - Fix potential failures in GDRCopy registration
    - Fix compilation warnings
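
The datatype support referenced above allows non-contiguous data to be
communicated directly from GPU memory, with the library handling the
packing. A minimal sketch (not from the MVAPICH2-GDR sources; the matrix
dimensions and two-rank exchange are illustrative) of sending one
strided column of a device-resident matrix via an MPI derived datatype:

    /* Sketch: strided transfer from a CUDA device buffer using
     * MPI_Type_vector. Run with at least two ranks; error checking
     * is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int rows = 64, cols = 64;  /* illustrative matrix size */
        double *d_mat;
        MPI_Datatype column;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_mat, rows * cols * sizeof(double));

        /* One element per row, strided by the row length: one column
         * of a row-major matrix. */
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_mat, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_mat, 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_mat);
        MPI_Finalize();
        return 0;
    }
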
MVAPICH2-GDR 2.3.1 (03/16/2019)

* Features and Enhancements (Since 2.3)
    - Based on MVAPICH2 2.3.1
    - Enhanced intra-node and inter-node point-to-point performance for
      DGX-2, IBM POWER8, and IBM POWER9 systems
    - Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9
      systems
    - Enhanced small-message performance for CUDA-Aware MPI_Put and
      MPI_Get (see the sketch after this entry)
    - Support for PGI 18.10
    - Added new runtime variables
        - 'MV2_GDRCOPY_LIMIT' to replace
          'MV2_USE_GPUDIRECT_GDRCOPY_LIMIT'
        - 'MV2_GDRCOPY_NAIVE_LIMIT' to replace
          'MV2_USE_GPUDIRECT_GDRCOPY_NAIVE_LIMIT'
        - 'MV2_USE_GDRCOPY' to replace 'MV2_USE_GPUDIRECT_GDRCOPY'
    - Flexible support for running TensorFlow (Horovod) jobs

* Bug Fixes (Since 2.3)
    - Fix data validation issue in CUDA-Aware MPI_Reduce
    - Fix hang in CUDA-Aware MPI_Get_accumulate
    - Fix compilation errors with clang
    - Fix compilation warnings
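
CUDA-Aware one-sided communication means both the window memory and the
origin buffer may reside on the GPU. A minimal sketch (not from the
MVAPICH2-GDR sources; buffer size and fence-based synchronization are
illustrative) of an MPI_Put between device buffers, in the spirit of the
one-sided device-buffer benchmarks mentioned elsewhere in this log:

    /* Sketch: CUDA-Aware MPI_Put between device buffers. Run with at
     * least two ranks; error checking is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 256;           /* illustrative message size */
        double *d_win_buf, *d_origin;
        MPI_Win win;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_win_buf, count * sizeof(double));
        cudaMalloc((void **)&d_origin, count * sizeof(double));

        /* Expose the device buffer as an RMA window. */
        MPI_Win_create(d_win_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)  /* write rank 0's device data into rank 1's window */
            MPI_Put(d_origin, count, MPI_DOUBLE, 1, 0, count, MPI_DOUBLE,
                    win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        cudaFree(d_win_buf);
        cudaFree(d_origin);
        MPI_Finalize();
        return 0;
    }
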
MVAPICH2-GDR 2.3 (11/10/2018)

* Features and Enhancements (Since 2.3rc1)
    - Support for CUDA 10.0

* Bug Fixes (Since 2.3rc1)
    - Fix memory leaks in CUDA-based collectives
    - Fix memory leaks in CUDA IPC cache designs
    - Fix segfault when freeing NULL IPC resources

MVAPICH2-GDR 2.3rc1 (09/22/2018)

* Features and Enhancements (Since 2.3a)
    - Based on MVAPICH2 2.3
    - Support for CUDA 9.2
    - Support for OpenPOWER9 with NVLink
    - Support for IBM XLC and PGI compilers with CUDA kernel features
    - Enhanced point-to-point performance for small messages
    - Enhanced Alltoallv operation for host buffers
    - Enhanced CUDA-based collective tuning on OpenPOWER8/9 systems
    - Enhanced large-message Reduce, Broadcast, and Allreduce for Deep
      Learning workloads
    - Added new runtime variables 'MV2_USE_GPUDIRECT_RDMA' and
      'MV2_USE_GDR' to replace 'MV2_USE_GPUDIRECT'
    - Support for collective offload using Mellanox's SHArP for
      Allreduce on host buffers
    - Enhanced tuning framework for Allreduce using SHArP
    - Enhanced host-based collectives for IBM POWER8/9, Intel Skylake,
      Intel KNL, and Intel Broadwell architectures

* Bug Fixes (Since 2.3a)
    - Fix issues with InfiniBand Multicast (IB-MCAST) based designs for
      GPU-based Broadcast
    - Fix hang issue with the zero-copy Broadcast operation
    - Fix issue with datatype processing for host buffers
    - Fix application crash with the GDRCopy feature
    - Fix memory leak in CUDA-based Allreduce algorithms
    - Fix data validation issue for Allreduce algorithms
    - Fix data validation issue for non-blocking Gather operation

MVAPICH2-GDR 2.3a (11/09/2017)

* Features and Enhancements (Since 2.2)
    - Based on MVAPICH2 2.2
    - Support for CUDA 9.0
    - Added support for the Volta (V100) GPU
    - Support for OpenPOWER with NVLink
    - Efficient multiple CUDA stream-based IPC communication for
      multi-GPU systems with and without NVLink
    - Enhanced performance of GPU-based point-to-point communication
    - Leverage the Linux Cross Memory Attach (CMA) feature for enhanced
      host-based communication
    - Enhanced performance of MPI_Allreduce for GPU-resident data
    - InfiniBand Multicast (IB-MCAST) based designs for GPU-based
      broadcast and streaming applications
        * Basic support for IB-MCAST designs with GPUDirect RDMA
        * Advanced support for zero-copy IB-MCAST designs with GPUDirect
          RDMA
        * Advanced reliability support for IB-MCAST designs
    - Efficient broadcast designs for Deep Learning applications
    - Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1
      systems

* Bug Fixes (Since 2.2)
    - Fix issue with MPI_Finalize when MV2_USE_GPUDIRECT=0
    - Fix data validation issue with GDRCOPY and loopback
    - Fix issue with runtime error when MV2_USE_CUDA=0
    - Fix issue with MPI_Allreduce for the R3 protocol
    - Fix warning message when the GDRCOPY module cannot be used

MVAPICH2-GDR 2.3a (Pre-release Version) (08/10/2017)

* Features and Enhancements (Since 2.2)
    - Based on MVAPICH2 2.2
    - InfiniBand Multicast (IB-MCAST) based designs for GPU-based
      broadcast
        * Basic support for IB-MCAST designs with GPUDirect RDMA
        * Advanced support for zero-copy IB-MCAST designs with GPUDirect
          RDMA
        * Advanced reliability support for IB-MCAST designs
    - Efficient broadcast designs for DL applications
    - Enhanced tuning for the broadcast collective on
        * RI2@OSU
        * CSCS systems
        * OpenPOWER systems at LLNL

MVAPICH2-GDR 2.2 (10/25/2016)

* Features and Enhancements (Since 2.2rc1)
    - Based on MVAPICH2 2.2
    - Added support for CUDA 8.0
    - Added support for the Pascal (P100) GPU
    - Efficient support for CUDA-aware large-message collectives
      targeting Deep Learning frameworks
        - Efficient support for Bcast
        - Efficient support for Reduce
        - Efficient support for Allreduce
    - Introduced a GPU-based tuning framework for the Reduce collective
      on Broadwell+EDR+K80 based systems

* Bug Fixes (Since 2.2rc1)
    - Correctly guard Core-Direct support for GPU-based non-blocking
      collectives
    - Minor fixes to the GPU-based collective tuning framework for
      Wilkes and CSCS systems
    - Fix compilation warnings

MVAPICH2-GDR 2.2rc1 (05/27/2016)

* Features and Enhancements (Since 2.2b)
    - Based on MVAPICH2 2.2rc1
    - Support for high-performance non-blocking send operations from
      GPU buffers
    - Enhanced intra-node CUDA Managed-aware communication using a new
      CUDA IPC-based design
    - Added support for RDMA-CM communication
    - Introduced support for RoCE-V1 and RoCE-V2
    - Introduced a GPU-based tuning framework for the Bcast operation
    - Introduced a GPU-based tuning framework for the Gather operation

* Bug Fixes (Since 2.2b)
    - Properly handle socket/NUMA node binding
    - Remove the usage of the default stream during communication
    - Fix compile warnings
    - Properly handle out-of-WQE scenarios
    - Fix memory leaks in the multicast code path

MVAPICH2-GDR 2.2b (02/04/2016)

* Features and Enhancements (Since MVAPICH2-GDR 2.2a)
    - Based on MVAPICH2 2.2b
    - Support for CUDA-Aware Managed memory (see the sketch after this
      entry)
        - Efficient intra-node and inter-node communication directly
          from/to Managed pointers
        - Automatic support for heterogeneous programs with both managed
          and non-managed (traditional) memory allocations

* Bug Fixes (Since MVAPICH2-GDR 2.2a)
    - Provide a clearer message when the GDR library is unable to use
      the GDR kernel driver
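
With CUDA-Aware Managed memory support, a pointer obtained from
cudaMallocManaged can be handed to MPI like any host or device buffer. A
minimal sketch (not from the MVAPICH2-GDR sources; size and the two-rank
exchange are illustrative):

    /* Sketch: MPI point-to-point communication on a CUDA managed
     * buffer. Run with at least two ranks; error checking is omitted
     * for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;          /* illustrative message size */
        double *m_buf;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMallocManaged((void **)&m_buf, count * sizeof(double),
                          cudaMemAttachGlobal);

        /* The managed pointer is used like any other MPI buffer; it is
         * also directly usable from host code and CUDA kernels. */
        if (rank == 0)
            MPI_Send(m_buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(m_buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(m_buf);
        MPI_Finalize();
        return 0;
    }
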
MVAPICH2-GDR 2.2a (11/09/2015)

* Features and Enhancements (Since MVAPICH2-GDR 2.1)
    - Based on MVAPICH2 2.2a
    - Support for efficient non-blocking collectives for device buffers
        - Exploit both Core-Direct and GPUDirect RDMA features
        - Maximal overlap of communication and computation on both CPU
          and GPU
    - Enable support on GPU clusters using regular OFED (without
      GPUDirect RDMA)
        - Capability to use IPC
        - Capability to use GDRCOPY
    - Tuned IPC thresholds for multi-GPU nodes

* Bug Fixes (Since MVAPICH2-GDR 2.1)
    - Fix IPC interaction with RMA lock synchronization
        - Thanks to Akihiro Tabuchi@University of Tsukuba

MVAPICH2-GDR 2.1 (10/16/2015)

* Features and Enhancements (Since MVAPICH2-GDR 2.1rc2)
    - Based on MVAPICH2 2.1
    - CUDA 7.5 compatibility
    - Multi-rail support for the loopback design
    - Multi-rail support for medium and large message sizes for H-H,
      H-D, D-H, and D-D transfers
    - Automatic and dynamic rail and CPU binding
    - Support, optimization, and tuning for CS-Storm architectures
    - Optimization and tuning for point-to-point and collective
      operations
    - Support for the SLURM launcher environment

* Bug Fixes (Since MVAPICH2-GDR 2.1rc2)
    - Fix IPC interaction with SLURM
        - Thanks to Mark Klein@CSCS for the report
    - Fix heterogeneous support where some processes do not use GPUs
        - Thanks to Xavier Lapillonne@MeteoSwiss for the report
    - Fix for on-demand connection with the loopback design
        - Thanks to Mark Klein@CSCS for the report

MVAPICH2-GDR 2.1rc2 (06/24/2015)

* Features and Enhancements (Since MVAPICH2-GDR 2.1a)
    - Based on MVAPICH2 2.1rc2
    - CUDA 7.0 compatibility
    - CUDA-Aware support for MPI_Rsend and MPI_Irsend primitives
    - Parallel intra-node communication channels (shared memory for H-H
      and GDR for D-D)
    - Optimized H-H, H-D, and D-H communication
    - Optimized intra-node D-D communication
    - Optimization and tuning for point-to-point and collective
      operations
    - Updated sm_20 kernel optimization for datatype processing

* Bug Fixes (Since MVAPICH2-GDR 2.1a)
    - Fix data validation in some scenarios for GDR-based communication
    - Fix the overlap performance issue for IPC-based communication
    - Fix RMA operations with the R3 protocol

MVAPICH2-GDR 2.1a (12/20/2014)

* Features and Enhancements (Since MVAPICH2-GDR 2.0)
    - Based on MVAPICH2 2.1a
    - Optimized design for GPU-based small message transfers
    - Added R3 support for GPU-based packetized transfers
    - Enhanced performance for small message host-to-device transfers
    - Support for MPI_Scan and MPI_Exscan collective operations from
      GPU buffers
    - Optimization of collectives with new copy designs

* Bug Fixes (Since MVAPICH2-GDR 2.0)
    - Fix issue in one-sided tests with post-start-complete-wait
      synchronization
    - Fix error with void pointer arithmetic for the PGI compiler
    - Fix issue in CUDA-aware MPI_Unpack
    - Fix compiler warnings and memory leaks

MVAPICH2-GDR 2.0 (08/23/2014)

* Features and Enhancements (Since MVAPICH2-GDR 2.0b)
    - Based on MVAPICH2 2.0 (OFA-IB-CH3 interface)
    - Support for efficient MPI-3 RMA (one-sided) communication using
      GPUDirect RDMA and pipelining
    - Efficient small-message inter-node communication using the new
      NVIDIA GDRCOPY module
    - Efficient small-message inter-node communication using the
      loopback design
    - Automatic communication channel selection for different GPU
      communication modes (DD, HH, and HD) in different configurations
      (intra-IOH and inter-IOH)
    - Automatic selection and binding of the best GPU/HCA pair in a
      multi-GPU/HCA system configuration
    - Optimized and tuned support for collective communication from GPU
      buffers
    - Enhanced and efficient support for datatype processing on GPU
      buffers, including support for vector/h-vector, index/h-index,
      array, and subarray types

* Bug Fixes (Since MVAPICH2-GDR 2.0b)
    - Fix a bug in the registration cache for GPU buffers