MVAPICH2 2.3.7 Features and Supported Platforms
New features and enhancements compared to the MVAPICH2 2.2 release are marked as (NEW).
MVAPICH2 (MPI-3.1 over OpenFabrics-IB, Omni-Path, OpenFabrics-iWARP, PSM, and TCP/IP) is an MPI-3.1 implementation based on the MPICH ADI3 layer. MVAPICH2 2.3.7 is available as a single integrated package (with MPICH-3.2.1). The current release supports the following ten underlying transport interfaces (a minimal example program appears after the list):
- OFA-IB-CH3: This interface supports all InfiniBand-compliant devices based on the OpenFabrics libibverbs layer with the CH3 channel (OSU enhanced) of the MPICH2 stack. This interface has the most features and is the most widely used. For example, it can be used over all Mellanox InfiniBand adapters, IBM eHCA adapters and Intel TrueScale adapters.
- OFA-IB-Nemesis (Deprecated): This interface supports all InfiniBand compliant devices based on the OpenFabrics libibverbs layer with the emerging Nemesis channel of the MPICH2 stack. This interface can be used by all Mellanox InfiniBand adapters.
- OFA-iWARP-CH3: This interface supports all iWARP compliant devices supported by OpenFabrics. For example, this layer supports Chelsio T3 adapters with the native iWARP mode.
- OFA-RoCE-CH3: This interface supports the emerging RDMA over Converged Ethernet (RoCE) interface for Mellanox ConnectX-EN adapters with 10/40GigE switches. It provides support for RoCE v1 and v2.
- TrueScale(PSM-CH3): This interface provides native support for TrueScale adapters from Intel over the PSM interface. It provides high-performance point-to-point communication for both one-sided and two-sided operations.
- Omni-Path(PSM2-CH3): This interface provides native support for Omni-Path adapters from Intel over the PSM2 interface. It provides high-performance point-to-point communication for both one-sided and two-sided operations.
- Shared-Memory-CH3: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems, laptops, etc.
- TCP/IP-CH3: The standard TCP/IP interface (provided by the MPICH2 CH3 channel) to work with a range of network adapters supporting the TCP/IP interface. This interface can also be used with IPoIB (TCP/IP over InfiniBand) support of InfiniBand. However, it will not deliver good performance/scalability compared to the other interfaces.
- TCP/IP-Nemesis: The standard TCP/IP interface (provided by the MPICH2 Nemesis channel) to work with a range of network adapters supporting the TCP/IP interface. This interface can also be used with IPoIB (TCP/IP over InfiniBand) support of InfiniBand. However, it will not deliver good performance/scalability compared to the other interfaces.
- Shared-Memory-Nemesis: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems, laptops, etc.
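The choice among these interfaces is made when the library is configured and the job is launched; application code does not change. The minimal sketch below (a hypothetical hello.c, compiled with the mpicc wrapper shipped with MVAPICH2) runs over any of the interfaces listed above.

```c
/* hello.c: minimal MPI program; it runs unchanged over any of the
 * transport interfaces listed above, since the transport is selected
 * at configure/launch time rather than in application code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```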
MVAPICH2 2.3.7 is compliant with the MPI-3.1 standard. In addition, MVAPICH2 2.3.7 provides support and optimizations for NVIDIA GPUs, multi-threading and fault tolerance (checkpoint-restart, job-pause-migration-resume). New features compared to 2.2 are indicated as (NEW). The complete set of features of MVAPICH2 2.3.7 is:
- (NEW) Based on and ABI compatible with MPICH-3.2.1
- Based on MPI-3.1 standard
- Nonblocking collectives
- Neighborhood collectives
- MPI_Comm_split_type support
- Support for MPI_Type_create_hindexed_block
- (NEW) Enhanced support for MPI_T PVARs and CVARs
- (NEW) Enhanced performance for Allreduce, Reduce_scatter_block, Allgather, Allgatherv through new algorithms
- (NEW) Improved performance for small message collective operations
- (NEW) Improved performance of data transfers from/to non-contiguous buffers used by user-defined datatypes
- Nonblocking communicator duplication routine MPI_Comm_idup (will only work for single-threaded programs)
- MPI_Comm_create_group support
- Support for matched probe functionality (illustrated in the sketch after this feature group)
- Support for "Const" (disabled by default)
- CH3-level design for scaling to multi-thousand cores with highest performance and reduced memory usage
- Support for MPI-3.1 RMA in OFA-IB-CH3, OFA-IWARP-CH3, OFA-RoCE-CH3 (v1 and v2), TrueScale(PSM-CH3) and Omni-Path(PSM2-CH3)
- Support for Omni-Path architecture
- Introduce a new PSM2-CH3 channel for Omni-Path
- (NEW) Support for Cray Slingshot 10 interconnect
- (NEW) Support for Rockport Networks switchless networking
- (NEW) Support for Marvell QEDR RoCE adapters
- (NEW) Support for PMIx protocol for SLURM and JSM
- (NEW) Support for RDMA_CM based multicast group creation
- Support for OpenPower architecture
- (NEW) Support IBM POWER9 and POWER8 architecture
- (NEW) Support Microsoft Azure HPC cloud platform
- (NEW) Support Cavium ARM (ThunderX2) systems
- (NEW) Support Intel Skylake architecture
- (NEW) Support Intel Cascade Lake architecture
- (NEW) Support AMD EPYC architectures
- (NEW) Support for Broadcom NetXtreme RoCE HCA
- (NEW) Enhanced inter-node point-to-point for Broadcom NetXtreme RoCE HCA
- (NEW) Support architecture detection for Fujitsu A64fx processor
- (NEW) Enhanced point-to-point and collective tuning for Fujitsu A64fx processor
- (NEW) Support architecture detection for Oracle BM.HPC2 cloud shape
- (NEW) Enhanced point-to-point and collective tuning for Oracle BM.HPC2 cloud shape
- Support for Intel Knights Landing architecture
- (NEW) Efficient support for different Intel Knights Landing (KNL) models
- Optimized inter-node and intra-node communication
- (NEW) Enhance large message intra-node performance with CH3-IB-Gen2 channel on Intel Knights Landing
- (NEW) Support for executing MPI jobs in Singularity
- Exposing several performance and control variables to the MPI-3.1 tools information interface (MPI_T); a usage sketch follows this group
- Enhanced PVAR support
- (NEW) Add multiple MPI_T PVARs and CVARs for point-to-point and collective operations
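The MPI_T variables mentioned above are reached through the standard MPI-3.1 tool information interface. Below is a minimal sketch that enumerates the control variables (CVARs) exposed by the library; performance variables (PVARs) follow the same pattern via MPI_T_pvar_get_num/MPI_T_pvar_get_info. The buffer sizes are arbitrary choices for the example.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvars, i;

    /* The MPI_T interface can be initialized independently of MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&ncvars);
    for (i = 0; i < ncvars; i++) {
        char name[256], desc[512];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        printf("CVAR %d: %s\n", i, name);
    }

    MPI_T_finalize();
    return 0;
}
```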
- (NEW) Enhance performance of point-to-point operations for CH3-Gen2 (InfiniBand), CH3-PSM, and CH3-PSM2 (Omni-Path) channels
- Enhanced performance for small messages
- Reduced memory footprint
- Flexibility to use internal communication buffers of different size
- Tuning internal communication buffer size for performance
- Enhanced performance for MPI_Comm_split through new bitonic algorithm
- Enable support for multiple MPI initializations
- Improve communication performance by removing locks from critical path
- Enhanced communication performance for small/medium message sizes
- (NEW) Multi-rail support for UD-Hybrid channel
- (NEW) Enhanced performance for UD-Hybrid code
- Support for InfiniBand hardware UD-Multicast based collectives
- (NEW) Gracefully handle any number of HCAs
- HugePage support
- Integrated Hybrid (UD-RC/XRC) design to get best performance on large-scale systems with reduced/constant memory footprint
- Support for running with UD only mode
- Support for MPI-2 Dynamic Process Management on InfiniBand Clusters
- eXtended Reliable Connection (XRC) support
- Enable XRC by default at configure time
- Multiple CQ-based design for Chelsio 10GigE/iWARP
- Multi-port support for Chelsio 10GigE/iWARP
- Enhanced iWARP design for scalability to higher process count
- Support iWARP interoperability between Intel NE020 and Chelsio T4 adapters
- Support for 3D torus topology with appropriate SL settings
- (NEW) Capability to run MPI jobs across multiple InfiniBand subnets
- Quality of Service (QoS) support with multiple InfiniBand SL
- On-demand Connection Management: This feature enables InfiniBand connections to be set up dynamically, enhancing the scalability of MVAPICH2 on clusters of thousands of nodes.
- Support for backing on-demand UD CM information with shared memory for minimizing memory footprint
- Improved on-demand InfiniBand connection setup
- On-demand connection management support with IB CM (RoCE Interface)
- Native InfiniBand Unreliable Datagram (UD) based asynchronous connection management for OpenFabrics-IB interface.
- RDMA CM based on-demand connection management for OpenFabrics-IB and OpenFabrics-iWARP interfaces.
- (NEW) Support to automatically detect IP address of IB/RoCE interfaces when RDMA_CM is enabled without relying on mv2.conf file
- Enabling support for intra-node communications in RoCE mode without shared memory
- Message coalescing support to reduce the number of per-queue-pair send queues, lowering the memory requirement on large-scale clusters. This design also increases the small-message messaging rate significantly. Available for OFA-IB-CH3 interface.
- RDMA Read utilized for increased overlap of computation and communication for OpenFabrics devices. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Shared Receive Queue (SRQ) with flow control. This design uses significantly less memory for the MPI library. Available for OFA-IB-CH3 interface.
- Adaptive RDMA Fast Path with Polling Set for low-latency messaging. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Header caching for low latency
- CH3 shared memory channel for standalone hosts (including SMP-only systems and laptops) without any InfiniBand adapters
- Unify process affinity support in OFA-IB-CH3, PSM-CH3 and PSM2-CH3 channels
- Support to enable affinity with asynchronous progress thread
- Allow processes to request MPI_THREAD_MULTIPLE when socket or NUMA node level affinity is specified
- Reorganized HCA-aware process mapping
- Dynamic identification of maximum read/atomic operations supported by HCA
- (NEW) Update maximum HCAs supported by default from 4 to 10
- Enhanced scalability for RDMA-based direct one-sided communication with less communication resource. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Removed libibumad dependency for building the library
- Option for selecting non-default gid-index in a loss-less fabric setup in RoCE mode
- Option to disable signal handler setup
- Option to use IP address as a fallback if hostname cannot be resolved
- (NEW) Improved job-startup performance
- (NEW) Gracefully handle RDMA_CM failures during job-startup
- Enhanced startup time for UD-Hybrid channel
- Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
- Improved debug messages and error reporting
- Supporting large data transfers (> 2GB)
- Support for MPI communication from NVIDIA GPU device memory (see the sketch after this feature group)
- (NEW) Improved performance for Host buffers when CUDA is enabled
- (NEW) Add custom API to identify if MVAPICH2 has in-built CUDA support
- Support for MPI_Scan and MPI_Exscan collective operations from GPU buffers
- Multi-rail support for GPU communication
- Support for non-blocking streams in asynchronous CUDA transfers for better overlap
- Dynamic CUDA initialization. Support GPU device selection after MPI_Init
- Support for running on heterogeneous clusters with GPU and non-GPU nodes
- Tunable CUDA kernels for vector datatype processing for GPU communication
- Optimized sub-array data-type processing for GPU-to-GPU communication
- Added options to specify CUDA library paths
- Efficient vector, hindexed datatype processing on GPU buffers
- Tuned MPI performance on Kepler GPUs
- Improved intra-node communication with GPU buffers using pipelined design
- Improved inter-node communication with GPU buffers with non-blocking CUDA copies
- Improved small message communication performance with CUDA IPC design
- Improved automatic GPU device selection and CUDA context management
- Optimal communication channel selection for different GPU communication modes (DD, HH and HD) in different configurations (intra-IOH and inter-IOH)
- Provided option to use CUDA library call instead of CUDA driver to check buffer pointer type
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Enhanced designs for Alltoall and Allgather collective communication from GPU device buffers
- Optimized and tuned support for collective communication from GPU buffers
- Non-contiguous datatype support in point-to-point and collective communication from GPU buffers
- Updated to sm_20 kernel optimizations for MPI Datatypes
- Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Efficient synchronization mechanism using CUDA Events for pipelined device data transfers
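The GPU features listed above allow device pointers to be passed directly to MPI calls. The sketch below is a minimal CUDA-aware point-to-point exchange, assuming a CUDA-enabled build of the library, at least two ranks, and one GPU per rank; depending on the build and hardware, the transfer uses pipelining, CUDA IPC, or GPUDirect RDMA, and the GPU path is typically enabled at run time (e.g. by setting MV2_USE_CUDA=1).

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 1 << 20;           /* 1M floats */
    float *dbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate the communication buffer directly in GPU device memory. */
    cudaMalloc((void **)&dbuf, n * sizeof(float));

    if (size >= 2) {
        if (rank == 0)
            MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```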
- OFA-IB-Nemesis interface design (Deprecated)
- OpenFabrics InfiniBand network module support for MPICH2 Nemesis modular design
- Optimized adaptive RDMA fast path with Polling Set for high-performance inter-node communication
- Shared Receive Queue (SRQ) support with flow control, uses significantly less memory for MPI library
- Header caching for low latency
- Support for additional features (such as hwloc, hierarchical collectives, one-sided, multi-threading, etc.), as included in the MPICH2 Nemesis channel
- Support of Shared-Memory-Nemesis interface on multi-core platforms requiring intra-node communication only (SMP-only systems, laptops, etc.)
- Support for 3D torus topology with appropriate SL settings
- Quality of Service (QoS) support with multiple InfiniBand SL
- Automatic inter-node communication parameter tuning based on platform and adapter detection
- Flexible HCA selection
- Checkpoint-Restart support
- Run-through stabilization support to handle process failures
- Enhancements to handle IB errors gracefully
- Flexible process manager support
- Support for PMI-2 based startup with SLURM
- Enhanced startup performance with SLURM
- Support for PMIX_Iallgather and PMIX_Ifence
- Enhanced startup performance and reduced memory footprint for storing InfiniBand end-point information with SLURM
- Support for shared memory based PMI operations
- (NEW) On-demand connection management for PSM-CH3 and PSM2-CH3 channels
- (NEW) Support for JSM and Flux resource managers
- (NEW) Enhanced job-startup performance for the Flux job launcher
- (NEW) Improved launch time for large-scale jobs with mpirun_rsh
- (NEW) Support in mpirun_rsh for using srun daemons to launch jobs
- (NEW) Support in mpirun_rsh for hostfile-less launch with the '-ppn' option
- (NEW) Improved job startup time for OFA-IB-CH3, PSM-CH3, and PSM2-CH3
- Improved startup performance for TrueScale(PSM-CH3) channel
- Improved hierarchical job startup performance
- Enhanced hierarchical ssh-based robust mpirun_rsh framework to work with any interface (CH3 and Nemesis channel-based) including OFA-IB-Nemesis, TCP/IP-CH3 and TCP/IP-Nemesis to launch jobs on multi-thousand core clusters
- Introduced option to export environment variables automatically with mpirun_rsh
- Support for automatic detection of path to utilities (rsh, ssh, xterm, TotalView) used by mpirun_rsh during configuration
- Support for launching jobs on heterogeneous networks with mpirun_rsh
- MPMD job launch capability
- Hydra process manager to work with any of the ten interfaces (CH3 and Nemesis channel-based) including OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3 and TCP/IP-CH3
- Improved debug message output in process management and fault tolerance functionality
- Better handling of process signals and error management in mpispawn
- Flexibility for process execution with alternate group IDs
- Using in-band IB communication with MPD
- SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated hosts without specifying a hostfile
- Support added to automatically use PBS_NODEFILE in Torque and PBS environments
- Support for suspend/resume functionality with mpirun_rsh framework
- Exporting local rank, local size, global rank and global size through environment variables (both mpirun_rsh and hydra)
- Support for various job launchers and job schedulers (such as SGE and OpenPBS/Torque)
- Configuration file support (similar to the one available in MVAPICH). Provides a convenient method for handling all runtime variables through a configuration file.
- Fault-tolerance support
- Checkpoint-Restart Support with DMTCP (Distributed MultiThreaded CheckPointing)
- Enable hierarchical SSH-based startup with Checkpoint-Restart
- Enable the use of Hydra launcher with Checkpoint-Restart for OFA-IB-CH3 and OFA-IB-Nemesis interfaces
- Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
- Support for application-level checkpointing
- Support for hierarchical system-level checkpointing
- Checkpoint-restart support for application-transparent system-level fault tolerance. BLCR-based support using OFA-IB-CH3 and OFA-IB-Nemesis interfaces.
- Scalable Checkpoint-restart with mpirun_rsh framework
- Checkpoint-restart with Fault-Tolerance Backplane (FTB-CR) support
- Checkpoint-restart with intra-node shared memory (user-level) support
- Checkpoint-restart with intra-node shared memory (kernel-level with LiMIC2) support
- Checkpoint-restart support with pure SMP mode
- Allows best performance and scalability with fault-tolerance support
- Run-through stabilization support to handle process failures using OFA-IB-Nemesis interface
- Enhancements to handle IB errors gracefully using OFA-IB-Nemesis interface
- Application-initiated system-level checkpointing is also supported. A user application can request a whole-program checkpoint synchronously by calling special MVAPICH2 functions.
- Flexible interface to work with different file systems. Tested with ext3 (local disk), NFS and PVFS2.
- Network-Level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand.
- Fast Checkpoint-Restart support with aggregation scheme
- Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance
- Enable signal-triggered (SIGUSR2) migration
- Fast process migration using RDMA
- Support for new standardized Fault Tolerant Backplane (FTB) Events for Checkpoint-Restart and Job Pause-Migration-Restart Framework
- Enhancements to software installation
- Revamped Build system:
- Uses automake instead of simplemake
- Allows for parallel builds ("make -j8" and similar)
- Full autoconf-based configuration
- Automatically detects system architecture and adapter types and optimizes MVAPICH2 for any particular installation.
- A utility (mpiname) for querying the MVAPICH2 library version and configuration information
- Automatically builds and installs OSU Benchmarks for end-user convenience
- Optimized intra-node communication support by taking advantage of shared-memory communication. Available for all interfaces (IB and iWARP).
- (NEW) Improve support for large processes per node and hugepages on SMP systems
- Enhanced intra-node SMP performance
- Tuned SMP eager threshold parameters
- New shared memory design for enhanced intra-node small message performance
- (NEW) Enhanced performance for shared-memory collectives
- Support for single-copy intra-node communication using the Linux-supported CMA (Cross Memory Attach) feature
- Enabled by default
- Give preference to CMA if LiMIC2 and CMA are enabled at the same time
- Kernel-level single-copy intra-node communication solution based on LiMIC2
- Upgraded to LiMIC2 version 0.5.6 to support unlocked ioctl calls
- LiMIC2 is designed and developed jointly by The Ohio State University and the System Software Laboratory at Konkuk University, Korea.
- Efficient Buffer Organization for Memory Scalability of Intra-node Communication
- Multi-core optimized
- Adjust shared-memory communication block size at runtime
- (NEW) Enhanced intra-node and inter-node tuning for PSM-CH3 and PSM2-CH3 channels
- (NEW) Added logic to detect heterogeneous CPU/HFI configurations in PSM-CH3 and PSM2-CH3 channels
- (NEW) Support for process-placement-aware HCA selection
- Automatic intra-node communication parameter tuning based on platform
- Efficient connection set-up for multi-core systems
- (NEW) Portable Hardware Locality support for defining CPU affinity
- (NEW) NUMA-aware hybrid binding policy for dense NUMA systems such as AMD EPYC
- (NEW) Add support to select hwloc v1 and hwloc v2 at configure time
- (NEW) Update hwloc v1 code to v1.11.14
- (NEW) Update hwloc v2 code to v2.4.2
- (NEW) Efficient CPU binding policies (spread, bunch, scatter, and numa) to specify CPU binding per job for modern multi-core platforms with SMT support
- (NEW) Improved heterogeneity detection logic for HCA
- (NEW) Improved multi-rail selection logic
- Enhanced support for CPU binding with socket and numanode level granularity
- Show current CPU bindings with MV2_SHOW_CPU_BINDING
- (NEW) Improve performance of architecture detection
- (NEW) Enhance MV2_SHOW_CPU_BINDING to enable display of CPU bindings on all nodes
- (NEW) Enhance MV2_SHOW_CPU_BINDING to display number of OMP threads
- (NEW) Enhance process mapping support for multi-threaded MPI applications
- (NEW) Improve support for process to core mapping on many-core systems
- New environment variable MV2_HYBRID_BINDING_POLICY for multi-threaded MPI and MPI+OpenMP applications; see the sketch after this block
- Support 'spread', 'linear', and 'compact' placement of threads
- Warn user if oversubscription of cores is detected
- (NEW) Introduce MV2_CPU_BINDING_POLICY=hybrid
- (NEW) Introduce MV2_THREADS_PER_PROCESS
- Improved usability of process to CPU mapping with support for delimiters (',' and '-') in the CPU listing
- Also allows user-defined CPU binding
- Optimized for bus-based SMP and NUMA-based SMP systems
- Efficient support for diskless clusters
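The hybrid binding policies above target programs of the following shape. The sketch below shows a multi-threaded MPI+OpenMP rank that requests MPI_THREAD_MULTIPLE; it assumes the binding-related runtime variables listed above (e.g. MV2_CPU_BINDING_POLICY=hybrid, MV2_THREADS_PER_PROCESS, MV2_HYBRID_BINDING_POLICY) are exported in the job environment and that the program is compiled with OpenMP enabled.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request full thread support; the library reports what it granted. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI rank spawns OpenMP threads; where the threads land on the
     * cores is governed by the MV2_* binding variables, not by the code. */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```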
- Optimized collective communication operations. Available for OFA-IB-CH3, OFA-iWARP-CH3, and OFA-RoCE-CH3 interfaces.
- (NEW) Enhance collective tuning for OpenPOWER (POWER8 and POWER9), Intel Skylake, and Cavium ARM (ThunderX2) systems
- (NEW) Enhance point-to-point and collective tuning for AMD EPYC Rome, Frontera@TACC, Longhorn@TACC, Mayer@Sandia, Pitzer@OSC, Catalyst@EPCC, Summit@ORNL, Lassen@LLNL, Sierra@LLNL, Expanse@SDSC, Ookami@StonyBrook, and bb5@EPFL systems
- (NEW) Enhance collective tuning for Intel Knights Landing and Intel Omni-Path based systems
- (NEW) Enhance collective tuning for Bebop@ANL, Bridges@PSC, and Stampede2@TACC systems
- (NEW) Enhanced small message performance for Alltoallv
- (NEW) Support collective offload using Mellanox's SHARP for Allreduce and Barrier
- (NEW) Support collective offload using Mellanox's SHARP for Reduce and Bcast
- (NEW) Support collective offload using Mellanox's SHARP for Scatter and Scatterv
- (NEW) Support non-blocking collective offload using Mellanox's SHARP for Iallreduce, Ibarrier, Ibcast, and Ireduce
- (NEW) Enhanced tuning framework for Reduce and Bcast using SHARP
- (NEW) Enhance collective tuning for OpenPOWER (POWER8 and POWER9), AMD EPYC, Intel Skylake and Cavium ARM (ThunderX2) systems
- (NEW) Efficient CPU binding policies
- Optimized collectives (bcast, reduce, and allreduce) for 4K processes
- Optimized and tuned blocking and non-blocking collectives for OFA-IB-CH3, OFA-IB-Nemesis and TrueScale(PSM-CH3) channels
- Enhanced MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather performance
- Hardware UD-Multicast based designs for collectives - Bcast, Allreduce and Scatter
- Intra-node Zero-Copy designs (using LiMIC) for MPI_Gather collective
- Enhancements and optimizations for point-to-point designs for Broadcast
- Improved performance for shared-memory based collectives - Broadcast, Barrier, Allreduce, Reduce
- Performance improvements in Scatterv and Gatherv collectives for CH3 interface
- Enhancements and optimizations for collectives (Alltoallv, Allgather)
- Tuned Bcast, Alltoall, Scatter, Allgather, Allgatherv, Reduce, Reduce_Scatter, Allreduce collectives
- Integrated multi-rail communication support. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Supports multiple queue pairs per port and multiple ports per adapter
- Supports multiple adapters
- Support to selectively use some or all rails according to user specification
- Support for both one-sided and point-to-point operations
- Reduced stack size of internal threads to dramatically reduce memory requirement on multi-rail systems
- Dynamic detection of multiple InfiniBand adapters and using these by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces)
- Support for process-to-rail binding policy (bunch, scatter and user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces)
- (NEW) Enhance HCA detection to handle cases where node has both IB and RoCE HCAs
- (NEW) Add support to auto-detect RoCE HCAs and auto-detect GID index
- (NEW) Add support to use RoCE/Ethernet and InfiniBand HCAs at the same time
- Support for InfiniBand Quality of Service (QoS) with multiple lanes
- Multi-threading support. Available for all interfaces (IB and iWARP), including TCP/IP.
- Enhanced support for multi-threaded applications
- High-performance optimized and scalable support for one-sided communication: Put, Get and Accumulate. Supported synchronization calls: Fence, Active Target, Passive (lock and unlock). Available for all interfaces. A sketch follows this group.
- (NEW) Improve performance for MPI-3.1 RMA operations
- Support for handling very large messages in RMA
- Enhanced direct RDMA based designs for MPI_Put and MPI_Get operations in OFA-IB-CH3 channel
- Optimized communication when using MPI_Win_allocate for OFA-IB-CH3 channel
- MPI-3.1 RMA support for TrueScale(PSM-CH3) channel
- Direct RDMA based One-sided communication support for OpenFabrics Gen2-iWARP and RDMA CM (with Gen2-IB)
- Shared memory backed Windows for one-sided communication
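A minimal sketch of the one-sided model described above: each rank exposes one integer through MPI_Win_allocate, writes its rank into its right neighbor's window with MPI_Put, and synchronizes with fences (active target). Passive-target access follows the same pattern with MPI_Win_lock/MPI_Win_unlock.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, *base, value, right;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Window memory allocated by the library (may be shared-memory backed). */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1;

    right = (rank + 1) % size;
    value = rank;

    MPI_Win_fence(0, win);
    MPI_Put(&value, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    /* *base now holds the rank of the left neighbor. */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```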
- Two modes of communication progress
- Polling
- Blocking (enables running multiple MPI processes/processor). Available for OpenFabrics (IB and iWARP) interfaces.
- Advanced AVL tree-based Resource-aware registration cache
- Adaptive number of registration cache entries based on job size
- Automatic detection and tuning for 24-core Haswell architecture
- Automatic detection and tuning for 28-core Broadwell architecture
- Automatic detection and tuning for Intel Knights Landing architecture
- Automatic tuning based on both platform type and network adapter
- Progress engine optimization for TrueScale(PSM-CH3) interface
- Improved performance for medium size messages for TrueScale(PSM-CH3) channel
- Multi-core-aware collective support for TrueScale(PSM-CH3) channel
- Collective optimization for TrueScale(PSM-CH3) channel
- Memory Hook Support provided by integration with ptmalloc2 library. This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that heavily use malloc and free operations.
- Warn and continue when ptmalloc fails to initialize
- (NEW) Add support to intercept aligned_alloc in ptmalloc
- (NEW) Remove dependency on underlying libibverbs, libibmad, libibumad, and librdmacm libraries using dlopen
- Support for TotalView debugger with mpirun_rsh framework
- Support for linking Intel Trace Analyzer and Collector
- Shared library support to allow existing binary MPI application programs to run
- Enhanced debugging config options to generate core files and back-traces
- Use of gfortran as the default F77 compiler
- (NEW) Add support for MPI_REAL16 based reduction operations for Fortran programs
- (NEW) Supports AMD Optimizing C/C++ (AOCC) compiler v2.1.0
- (NEW) Enhanced support for SHArP v2.1.0
- ROMIO Support for MPI-IO.
- (NEW) Support for DDN Infinite Memory Engine (IME)
- Optimized, high-performance ADIO driver for Lustre
- Single code base for the following platforms (Architecture, OS, Compilers, Devices and InfiniBand adapters)
- Architecture: Knights Landing, OpenPOWER (POWER8 and POWER9), EM64T, Cavium ARM (ThunderX2), x86_64, and x86
- Operating Systems: (tested with) Linux
- Compilers: GCC, Intel, PGI, Ekopath and Open64
- (NEW) Support for GCC compiler v11
- (NEW) Support for Intel IFX compiler v11
- Devices: OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3, TrueScale(PSM-CH3), Omni-Path(PSM2-CH3), TCP/IP-CH3, OFA-IB-Nemesis and TCP/IP-Nemesis
- InfiniBand adapters (tested with):
- Mellanox InfiniHost adapters (SDR and DDR)
- Mellanox ConnectX (DDR and QDR with PCIe2)
- Mellanox ConnectX-2 (QDR with PCIe2)
- Mellanox ConnectX-3 (FDR with PCIe3)
- Mellanox Connect-IB (Dual FDR with PCIe3)
- Mellanox ConnectX-4 (EDR with PCIe3)
- Mellanox ConnectX-5 (EDR with PCIe3)
- Mellanox ConnectX-6 (HDR with PCIe3)
- Intel TrueScale adapter (SDR)
- Intel TrueScale adapter (DDR and QDR with PCIe2)
- Intel Omni-Path adapters
- Intel Omni-Path adapter (100 Gbps with PCIe3)
- 10GigE (iWARP and RoCE) adapters:
- (tested with) Chelsio T3 and T4 adapter with iWARP support
- (tested with) Mellanox ConnectX-EN 10GigE adapter
- (tested with) Intel NE020 adapter with iWARP support
- 40GigE RoCE adapters:
- (tested with) Mellanox ConnectX-EN 40GigE adapter
MVAPICH2-X 2.3 Features
MVAPICH2-X provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port parts of large MPI applications that are suited to the PGAS programming model. This minimizes the development overheads that have been a huge deterrent in porting MPI applications to use PGAS models. The unified runtime also delivers superior performance compared to using separate MPI, UPC, UPC++, OpenSHMEM and CAF libraries by optimizing use of network and memory resources. MVAPICH2-X supports pure MPI programs, MPI+OpenMP programs, pure UPC, pure OpenSHMEM, pure CAF as well as hybrid MPI(+OpenMP) + PGAS programs. MVAPICH2-X supports UPC, UPC++, OpenSHMEM and CAF as PGAS models. High-level features of MVAPICH2-X are listed below. New features compared to MVAPICH2-X 2.2 are indicated as (NEW).
- MPI Features
- Support for MPI-3 features
- (NEW) Based on MVAPICH2 2.3.4. MPI programs can take advantage of all the features enabled by default in OFA-IB-CH3, OFA-RoCE-CH3, PSM-CH3, and PSM2-CH3 interfaces of MVAPICH2 2.3.4
- (NEW) Support for ARM architecture
- (NEW) Collective tuning for ARM architecture
- (NEW) Collective tuning for Intel Skylake architecture
- (NEW) Support for Efficient Asynchronous Communication Progress
- (NEW) Improved Alltoallv algorithm for small messages
- (NEW) Enhanced point-to-point and collective tunings for AMD EPYC, Catalyst@EPCC, Mayer@Sandia, Azure@Microsoft, AWS, and Frontera@TACC
- Support for Omni-Path architecture
- Introduction of a new PSM2 channel for Omni-Path
- (NEW) Optimized support for OpenPower architecture
- (NEW) Improved performance for Intra- and Inter-node communication
- (NEW) Optimized inter-node and intra-node communication
- High performance two-sided communication scalable to multi-thousand nodes
- Optimized collective communication operations
- Shared-memory optimized algorithms for barrier, broadcast, reduce and allreduce operations
- Optimized two-level designs for scatter and gather operations
- Improved implementation of allgather, alltoall operations
- High-performance and scalable support for one-sided communication
- Direct RDMA based designs for one-sided communication
- Shared memory backed Windows for One-Sided communication
- Support for truly passive locking for intra-node RMA in shared memory backed windows
- Multi-threading support
- Enhanced support for multi-threaded MPI applications
- (NEW) Add support to enable fork safety in MVAPICH2 using environment variable
- (NEW) MPI Advanced Features
- (NEW) Improved performance of large message communication
- Support for advanced co-operative (COOP) rendezvous protocols in SMP channel (OFA-IB-CH3 and OFA-RoCE-CH3 interfaces)
- (NEW) Support for RGET, RPUT, and COOP protocols for CMA and XPMEM
- Support for load balanced and dynamic rendezvous protocol selection (OFA-IB-CH3 and OFA-RoCE-CH3 interfaces)
- (NEW) Support for Shared Address Space based MPI Communication Using XPMEM
- Support for pt-to-pt communication (OFA-IB-CH3 and OFA-RoCE-CH3 interfaces)
- Support for Reduce, Allreduce, Broadcast, Gather, Scatter, Allgather collectives (OFA-IB-CH3, OFA-RoCE-CH3, PSM-CH3, and PSM2-CH3 interfaces)
- Support for Dynamically Connected (DC) transport protocol (OFA-IB-CH3 interface)
- Support for pt-to-pt, RMA and collectives
- (NEW) Improved connection establishment for DC transport
- (NEW) Improved performance for communication using DC transport
- (NEW) Support Data Partitioning-based Multi-Leader Design (DPML) for MPI collectives (OFA-IB-CH3, PSM-CH3, and PSM2-CH3 interfaces)
- (NEW) Support Contention Aware Kernel-Assisted MPI collectives (OFA-IB-CH3, OFA-RoCE-CH3, PSM-CH3, and PSM2-CH3 interfaces)
- (NEW) Enhanced support for AWS EFA adapter and SRD transport protocol
- (NEW) Enhanced point-to-point and collective tuning for AWS EFA adapter and SRD transport protocol
- (NEW) Support for Efficient Asynchronous Communication Progress
- (NEW) Improved Alltoallv algorithm for small messages
- (NEW) Optimized support for large message MPI_Allreduce and MPI_Reduce
- Support for Hybrid mode with RC/DC/UD/XRC
- Support User Mode Memory Registration (UMR) for high performance non-contiguous data transfer
- Efficient support for On Demand Paging (ODP) feature of Mellanox for point-to-point and RMA operations
- Support for Core-Direct based Non-blocking collectives
- Support available for Ibcast, Ibarrier, Iscatter, Igather, Ialltoall, Iallgather, Iallgatherv, Ialltoallv, Igatherv, Iscatterv, and Ialltoallw
- (NEW) Add multiple MPI_T PVARs and CVARs for point-to-point and collective operations
- (NEW) Tuning for MPI collective operations for Intel Broadwell, Intel CascadeLake, Azure HB (AMD EPYC), and Azure HC (Intel Skylake) systems
- Unified Parallel C (UPC) Features
- UPC Language Specification v1.2 standard compliance
- Based on Berkeley UPC v2.20.2 (contains changes/additions in preparation for UPC 1.3 specification)
- Support for OpenPower architecture
- Support for Intel Knights Landing Architecture
- Optimized inter-node and intra-node communication
- Optimized RDMA-based implementation of UPC data movement routines
- Improved UPC memput design for small/medium size messages
- Support for GNU UPC translator
- Optimized UPC collectives (Improved performance for upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, and upc_all_exchange)
- Support for RoCE (v1 and v2)
- Support for Dynamically Connected (DC) transport protocol
- (NEW) Support for XPMEM-based collectives operations (Broadcast, Collect, Scatter, Gather)
- Unified Parallel C++ (UPC++) Features
- Based on Berkeley UPC++ v0.1
- Introduce UPC++ level support for new scatter collective operation (upcxx_scatter)
- Optimized UPC collectives (improved performance for upcxx_reduce, upcxx_bcast, upcxx_gather, upcxx_allgather, upcxx_alltoall)
- Support for RoCE (v1 and v2)
- Support for Dynamically Connected (DC) transport protocol
- Support for OpenPower Architecture
- Support for Intel Knights Landing Architecture
- (NEW) Support for XPMEM-based collectives operations (Broadcast, Collect, Scatter, Gather)
- OpenSHMEM Features (a usage sketch follows this group)
- (NEW) Based on OpenSHMEM v1.3 standard
- (NEW) Support Non-Blocking remote memory access routines
- Support for on-demand establishment of connections
- Improved job start up and memory footprint
- Optimized RDMA-based implementation of OpenSHMEM data movement routines
- Support for OpenSHMEM 'shmem_ptr' functionality
- Support for OpenPower architecture
- Support for Intel Knights Landing Architecture
- Optimized inter-node and intra-node communication
- Support for RoCE (v1 and v2)
- Support for Dynamically Connected (DC) transport protocol
- Efficient implementation of OpenSHMEM atomics using RDMA atomics
- Optimized OpenSHMEM put routines for small/medium message sizes
- Optimized OpenSHMEM collectives (Improved performance for shmem_collect, shmem_fcollect, shmem_barrier, shmem_reduce and shmem_broadcast)
- Optimized 'shmalloc' routine
- Improved intra-node communication performance using shared memory and CMA designs
- (NEW) Support for XPMEM-based collectives operations (Broadcast, Collect, Reduce_all, Reduce, Scatter, Gather)
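A minimal usage sketch of the OpenSHMEM 1.3 model supported here: symmetric heap allocation, a one-sided put to the right-hand neighbor PE, and barrier synchronization. It assumes the program is built with an OpenSHMEM compiler wrapper (e.g. oshcc) and launched with oshrun, the process manager listed later in this section.

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE allocates the same remotely
     * accessible object on the symmetric heap. */
    int *dest = (int *)shmem_malloc(sizeof(int));
    *dest = -1;
    shmem_barrier_all();

    /* One-sided put of my PE id into my right neighbor's copy of dest. */
    int src = me;
    shmem_int_put(dest, &src, 1, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d received %d\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```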
- CAF Features
- Based on University of Houston CAF implementation
- Efficient point-to-point read/write operations
- Efficient CO_REDUCE and CO_BROADCAST collective operations
- Support for RoCE (v1 and v2)
- Support for Dynamically Connected (DC) transport protocol
- Support for Intel Knights Landing Architecture
- Hybrid Program Features
- Introduce support for hybrid MPI+UPC++ applications
- Support OpenPower architecture for hybrid MPI+UPC and MPI+OpenSHMEM applications
- Support Intel Knights Landing architecture for hybrid MPI+PGAS applications
- (NEW) Optimized inter-node and intra-node communication
- Supports hybrid programming using MPI(+OpenMP), MPI(+OpenMP)+UPC, MPI(+OpenMP)+OpenSHMEM and MPI(+OpenMP)+CAF
- Compliance with MPI-3, UPC v1.2, OpenSHMEM v1.0 and CAF Fortran 2015 standards
- Optimized network resource utilization through the unified communication runtime
- Efficient deadlock-free progress of MPI and UPC/OpenSHMEM calls
- Unified Runtime Features
- Introduce support for UPC++ programming model
- (NEW) Based on MVAPICH2 2.3.4 (OFA-IB-CH3 and OFA-RoCE-CH3 interfaces). MPI, UPC, OpenSHMEM and Hybrid programs benefit from its features listed below
- Scalable inter-node communication with highest performance and reduced memory usage
- Integrated RC/XRC design to get best performance on large-scale systems with reduced/constant memory footprint
- RDMA Fast Path connections for efficient small message communication
- Shared Receive Queue (SRQ) with flow control to significantly reduce memory footprint of the library
- AVL tree-based resource-aware registration cache
- Automatic tuning based on network adapter and host architecture
- (NEW) The advanced MPI features listed in Section ”MPI Advanced Features” are available with the unified runtime
- Optimized intra-node communication support by taking advantage of shared-memory communication
- Efficient Buffer Organization for Memory Scalability of Intra-node Communication
- Automatic intra-node communication parameter tuning based on platform
- Flexible CPU binding capabilities
- (NEW) Portable Hardware Locality (hwloc v1.11.6) support for defining CPU affinity
- Efficient CPU binding policies (bunch and scatter patterns, socket and numanode granularities) to specify CPU binding per job for modern multi-core platforms
- Allow user-defined flexible processor affinity
- Two modes of communication progress
- Polling
- Blocking (enables running multiple processes/processor)
- Scalable inter-node communication with highest performance and reduced memory usage
- Flexible process manager support
- Support for mpirun_rsh, hydra, upcrun and oshrun process managers
- (NEW) Support for InfiniBand Network Analysis and Management (INAM) Tool v0.9.6
- Capability to profile and report node-level, job-level and process-level activities for MPI communication (Point-to-Point, Collectives and RMA) at user specified granularity in conjunction with OSU INAM
- Capability to profile and report process to node communication matrix for MPI processes at user specified granularity in conjunction with OSU INAM
- Capability to classify data flowing over a network link at job level and process level granularity in conjunction with OSU INAM
- Capability to profile and report the following parameters of MPI processes at node-level, job-level and process-level at user specified granularity in conjunction with OSU INAM
- CPU Utilization
- Memory Utilization
- Inter-node communication buffer usage for RC transport
- Inter-node communication buffer usage for UD transport
MVAPICH2-GDR 2.3.7 Features
Features for supporting GPU-GPU communication on clusters with NVIDIA and AMD GPUs.
MVAPICH2-GDR 2.3.7 derives from MVAPICH2 2.3.7, which is an MPI-3 implementation based on the MPICH ADI3 layer. All the features available with the OFA-IB-CH3 channel of MVAPICH2 2.3.7 are available in this release, which also incorporates designs that take advantage of GPUDirect RDMA (GDR) technology for inter-node data movement on NVIDIA GPU clusters with Mellanox InfiniBand interconnects. MVAPICH2-GDR 2.3.7 also adds support for AMD GPUs via the Radeon Open Compute (ROCm) software stack and exploits ROCmRDMA technology for direct communication between AMD GPUs and Mellanox InfiniBand adapters. It also provides support for OpenPower and NVLink, efficient intra-node CUDA-Aware unified memory communication, and support for RDMA_CM, RoCE-V1, and RoCE-V2. Further, MVAPICH2-GDR 2.3.7 provides optimized large message collectives (broadcast, reduce and allreduce) for emerging Deep Learning frameworks like TensorFlow and PyTorch on NVIDIA DGX-2, the ABCI system @AIST, and POWER9 systems like Sierra and Lassen @LLNL and Summit @ORNL. It also enhances the performance of dense collectives, e.g., Alltoall and Allgather, on multi-GPU systems like Lassen @LLNL and Summit @ORNL. Further, it provides efficient support for Non-Blocking Collectives (NBC) from GPU buffers by combining GPUDirect RDMA and Core-Direct features. It also supports the CUDA managed memory feature and optimizes large message collectives targeting Deep Learning frameworks. New features compared to MVAPICH2-GDR 2.2 are indicated as (NEW). The list of features for supporting MPI communication from NVIDIA and AMD GPU device memory is provided below.
- (NEW) Support for automatic GPU rebinding for optimal performance
- (NEW) Support for 'on-the-fly' compression of point-to-point messages for NVIDIA GPU to GPU communications
- (NEW) Support for NCCL and RCCL communication substrate for various MPI Collectives
- Support for tuning-based hybrid communication protocols using NCCL-based, CUDA-based, and IB verbs-based primitives
- MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv, MPI_Gather, MPI_Gatherv, and MPI_Bcast
- (NEW) Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2 A-100 systems
- Enhanced architecture detection, process placement, and HCA selection
- Enhanced intra-node and inter-node point-to-point tuning
- Enhanced collective tuning
- (NEW) Support for AMD GPUs via Radeon Open Compute (ROCm) platform
- (NEW) Support for ROCm PeerDirect, ROCm IPC, and unified memory based device-to-device communication for AMD GPUs
- (NEW) Support for enhanced MPI derived datatype processing via kernel fusion on NVIDIA GPUs
- (NEW) Support for Apache MXNet Deep Learning Framework
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) using GPUDirect RDMA / ROCmRDMA and pipelining
- Support for MPI-3 RMA (one-sided) and atomic operations using GPU buffers with GPUDirect RDMA and pipelining
- Support for efficient Non-Blocking Collectives for Device buffers
- Exploiting Core-Direct and GPUDirect RDMA features
- Maximal overlap of communication and computation on both CPU and GPU
- Support for CUDA-Aware Managed Memory
- Efficient CUDA-Aware Managed Memory communication using a new CUDA-IPC-based design
- Efficient inter-node communication directly from/to Managed pointers
- Automatic support for heterogeneous programs with both managed and non-managed (traditional) memory allocations
- (NEW) Enhanced support for CUDA-aware large message collectives targeting Deep Learning frameworks (see the sketch after this group)
- Efficient support for Alltoall
- Efficient support for Allgather
- Efficient support for Bcast
- Efficient support for Reduce
- Efficient support for Allreduce
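The collectives above target patterns such as gradient averaging in Deep Learning frameworks. The sketch below is a large-message MPI_Allreduce performed directly on a GPU buffer; it assumes a CUDA-aware MVAPICH2-GDR build with one GPU per rank, and the library internally selects the appropriate path (e.g. NCCL/RCCL-based, CUDA kernel-based, or IB-verbs-based) as described earlier in this section.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const size_t n = 8 * 1024 * 1024;   /* 32 MB of float "gradients" */
    float *d_grad;

    MPI_Init(&argc, &argv);

    cudaMalloc((void **)&d_grad, n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));

    /* In-place sum-allreduce directly on device memory across all ranks. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)n, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}
```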
- (NEW) Support for GDRCopy v2.0
- (NEW) Support for jsrun
- (NEW) Enhanced CUDA-Aware MPI_Pack and MPI_Unpack
- (NEW) Enhanced MPI_Allreduce performance on DGX-2 and POWER9 systems
- (NEW) Reduced the CUDA interception overhead for non-CUDA symbols
- (NEW) Enhanced performance for point-to-point and collective operations on Frontera's RTX nodes
- (NEW) Added compilation and runtime methods for checking CUDA support
- (NEW) Enhanced GDR output for runtime variable MV2_SHOW_ENV_INFO
- (NEW) Tested with Horovod and common DL Frameworks (TensorFlow, PyTorch, and MXNet)
- (NEW) Tested with PyTorch Distributed
- (NEW) Enhanced GPU-based point-to-point and collective tuning on:
- OpenPOWER systems such as Summit @ORNL, Sierra and Lassen @LLNL
- ABCI system @AIST, Japan
- Owens and Pitzer systems @Ohio Supercomputer Center
- (NEW) Provide MPI_Allreduce that scales up to 24,576 Volta GPUs on Summit
- (NEW) Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode
- (NEW) Flexible support for running TensorFlow (Horovod) jobs
- (NEW) Efficient support for CUDA-aware large message Allreduce on DGX-1, DGX-2 and POWER9 systems with NVLink interconnect
- (NEW) InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
- Basic support for IB-MCAST designs with GPUDirect RDMA
- Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
- Advanced reliability support for IB-MCAST designs
- (NEW) Support IBM XLC and PGI compilers with CUDA kernel features
- (NEW) Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
- (NEW) Enhanced performance of GPU-based point-to-point communication
- (NEW) Enhanced point-to-point performance for CPU-based small messages
- (NEW) Enhanced inter-node point-to-point performance for CUDA managed buffers on POWER9 system
- (NEW) Enhanced Alltoallv operation for host buffers
- (NEW) Enhanced CUDA-based collective tuning on OpenPOWER9 systems
- (NEW) Add new runtime variable 'MV2_SUPPORT_DL' to replace 'MV2_SUPPORT_TENSOR_FLOW'
- (NEW) Support collective offload using Mellanox's SHArP for Allreduce on host buffers
- Enhance tuning framework for Allreduce using SHArP
- (NEW) Enhanced host-based collectives for IBM POWER9/POWER8, Intel Skylake, Intel KNL, and Intel Broadwell architectures
- (NEW) Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
- Support for high-performance non-blocking send operations from GPU buffers
- Multirail support for loopback design
- Multirail support for medium and large message size for H-H, H-D, D-H and D-D
- (NEW) Support for CUDA 11
- (NEW) Support for NVIDIA Ampere (A100) GPU
- (NEW) Support for OpenPOWER with NVLink
- Adding R3 support for GPU based packetized transfer
- Optimized design for GPU based small message transfers
- Optimized intranode D-D communication
- Efficient small message inter-node communication using the new NVIDIA GDR-COPY module
- Efficient small message inter-node communication using loopback design
- Enabling Support on GPU-Clusters using regular OFED (without GPUDirect RDMA)
- Capability to use IPC
- Capability to use GDRCOPY
- (NEW) Enhanced intra-node and inter-node point-to-point performance for DGX-1, DGX-2, IBM POWER8, and IBM POWER9 systems
- (NEW) Enhanced small message performance for CUDA-Aware MPI_Put and MPI_Get
- CUDA-Aware support for MPI_Rsend and MPI_Irsend primitives
- Adding support for RDMA_CM communication
- Introducing support for RoCE-V1 and RoCE-V2
- Support for MPI_Scan and MPI_Exscan collective operations from GPU buffers
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) using CUDA IPC and pipelining
- Automatic communication channel selection for different GPU communication modes (DD, HH and HD) in different configurations (intra-IOH and inter-IOH)
- Automatic and dynamic rail and CPU selection and binding
- Support, optimization and tuning for CS-Storm architectures
- Optimized and tuned support for collective communication from GPU buffers
- Tuning of IPC thresholds for multi-GPU nodes
- Introducing GPU-based tuning framework for Bcast, Gather and Reduce operations
- Enhanced designs for Alltoall and Allgather collective communication from GPU device buffers
- Enhanced and Efficient support for datatype processing on GPU buffers including support for vector/h-vector, index/h-index, array and subarray
- (NEW) Enhanced performance of datatype support for GPU-resident data
- Zero-copy transfer when P2P access is available between GPUs through NVLink/PCIe
- (NEW) Enhanced datatype support for CUDA kernel-based Allreduce
- Dynamic CUDA initialization. Support GPU device selection after MPI_Init
- Support for non-blocking streams in asynchronous CUDA transfers for better overlap
- Efficient synchronization using CUDA Events for pipelined device data transfers
- Support for running on heterogeneous clusters with GPU and non-GPU nodes
- Support Slurm launcher environment
- Update to sm_20 kernel optimization for Datatype processing
MVAPICH2-J Features
MVAPICH2-J 2.3.7 provides Java bindings (a Java wrapper) to the MVAPICH2 MPI library and the other variations that we offer. The Java MPI library, MVAPICH2-J, relies on the Java Native Interface (JNI) to allow for an implementation of a Java MPI effort that is lean and easy to develop and maintain. The Java bindings inherit all of the features for communication that our native family of MPI libraries supports. The features of MVAPICH2-J are listed below.
- (NEW) Provides Java bindings to the MVAPICH2 family of libraries
- (NEW) Support for communication of basic Java datatypes and Java new I/O (NIO) package direct ByteBuffers
- (NEW) Support for blocking and non-blocking point-to-point communication protocols
- (NEW) Support for blocking collective and strided collective communication protocols
- (NEW) Support for Dynamic Process Management (DPM) functionality
- (NEW) Relies on the explicit memory management buffering layer from the MPJ Express software
- (NEW) Utilizes a pool of direct Java NIO ByteBuffers used in communication of basic datatype Java arrays
- (NEW) Support for all high-speed interconnects that MVAPICH2 supports including InfiniBand, Internet Wide-area RDMA Protocol (iWARP), RDMA over Converged Ethernet (RoCE), Intel's Performance Scaled Messaging (PSM), Omni-Path, etc.
MVAPICH2-MIC 2.0 Features
- Based on MVAPICH2 2.0.1
- Support for native, symmetric and offload modes of MIC usage
- Optimized intra-MIC communication using SCIF and shared-memory channels
- Optimized intra-Node Host-to-MIC communication using SCIF and IB channels
- Enhanced mpirun_rsh to launch jobs in symmetric mode from the host
- Support for proxy-based communication for inter-node transfers
- Active-proxy, 1-hop and 2-hop designs (actively using host CPU)
- Passive-proxy (passively using host CPU)
- Support for MIC-aware MPI_Bcast()
- Improved SCIF performance for pipelined communication
- Optimized shared-memory communication performance for single-MIC jobs
- Supports an explicit CPU-binding mechanism for MIC processes
- Tuned pt-to-pt intra-MIC, intra-node, and inter-node transfers
- Supports hwloc v1.9
MVAPICH2-Virt 2.2 Features
MVAPICH2-Virt 2.2 derives from MVAPICH2, which incorporates designs that take advantage of the new features and mechanisms of high-performance networking technologies with SR-IOV as well as other virtualization technologies such as Inter-VM Shared Memory (IVSHMEM), IPC-enabled Inter-Container Shared Memory (IPC-SHM), Cross Memory Attach (CMA), and OpenStack. For SR-IOV-enabled InfiniBand virtual machine environments and InfiniBand-based Docker/Singularity container environments, MVAPICH2-Virt has very little overhead compared to MVAPICH2 running over InfiniBand in native mode. MVAPICH2-Virt can deliver the best performance and scalability to MPI applications running inside both virtual machines and Docker/Singularity containers over SR-IOV enabled InfiniBand clusters. MVAPICH2-Virt also inherits all the features for communication on HPC clusters that are available in the MVAPICH2 software stack. New features compared to MVAPICH2-Virt 2.2rc1 are indicated as (NEW). The list of features for supporting MPI communication in virtualized environments is provided below.
- (NEW) Based on MVAPICH2 2.2
- Support for efficient MPI communication over SR-IOV enabled InfiniBand networks
- High-performance and locality-aware MPI communication with IVSHMEM for virtual machines
- High-performance and locality-aware MPI communication with IPC-SHM and CMA for containers
- Support for auto-detection of IVSHMEM device in virtual machines
- Support for locality auto-detection in containers
- Automatic communication channel selection among SR-IOV, IVSHMEM, and CMA/LiMIC2 in virtual machines
- Automatic communication channel selection among IPC-SHM, CMA, and HCA in containers
- (NEW) Support for both Docker and Singularity Container Technologies
- Support for integration with OpenStack
- Support for easy configuration through runtime parameters
- Tested with
- Docker 1.9.1 and 1.10.3
- (NEW) Singularity development branch
- (NEW) Mellanox InfiniBand adapters (ConnectX-3 (56Gbps), ConnectX-4 (56Gbps))
- OpenStack Juno
MVAPICH2-EA 2.1 Features
- Based on MVAPICH2 2.1
- Energy-Efficient support for IB, RoCE and iWARP
- Energy-Efficient blocking and non-blocking point-to-point communication protocols
- Energy-Efficient collective communication protocols
- User-defined Energy-Performance trade-off levels using a tunable overhead tolerance parameter
- Compatible with OSU Energy Monitoring Tool (OEMT-0.8)