MVAPICH2 1.8 Features and Supported Platforms
New features and enhancements compared to the MVAPICH2 1.7 release are marked as (NEW).
MVAPICH2 (MPI-2 over InfiniBand) is an MPI-2 implementation based on the MPICH2 ADI3 layer. The latest release is compliant with the MPI 2.2 standard. It also supports all MPI-1 functionality. MVAPICH2 1.8 is available as a single integrated package (with MPICH2-1.4.1p1). The current release supports the following ten underlying transport interfaces:
- OFA-IB-CH3: This interface supports all InfiniBand-compliant devices based on the OpenFabrics libibverbs layer with the CH3 channel (OSU enhanced) of the MPICH2 stack. This interface has the most features and is the most widely used. For example, it can be used over all Mellanox InfiniBand adapters, IBM eHCA adapters and QLogic adapters.
- OFA-IB-Nemesis: This interface supports all InfiniBand compliant devices based on the OpenFabrics libibverbs layer with the emerging Nemesis channel of the MPICH2 stack. This interface can be used by all Mellanox InfiniBand adapters.
- OFA-iWARP-CH3: This interface supports all iWARP compliant devices supported by OpenFabrics. For example, this layer supports Chelsio T3 adapters with the native iWARP mode.
- OFA-RoCE-CH3: This interface supports the emerging RDMA over Converged Ethernet (RoCE) interface for Mellanox ConnectX-EN adapters with 10/40GigE switches.
- PSM-CH3: This interface provides native support for InfiniPath adapters from QLogic over the PSM interface. It provides high-performance point-to-point communication for both one-sided and two-sided operations.
- uDAPL-CH3: This interface supports all network-adapters and software stacks which implement the portable DAPL interface from the DAT Collaborative. For example, this interface can be used over all Mellanox adapters and Chelsio adapters. It can also be used with Solaris uDAPL-IBTL implementation over InfiniBand adapters.
- Shared-Memory-CH3: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems and laptops.
- TCP/IP-CH3: The standard TCP/IP interface (provided by the MPICH2 CH3 channel), which works with a range of network adapters supporting TCP/IP. It can also be used with the IPoIB (TCP/IP over InfiniBand network) support of InfiniBand. However, it will not deliver performance or scalability comparable to the other interfaces.
- TCP/IP-Nemesis: The standard TCP/IP interface (provided by the MPICH2 Nemesis channel), which works with a range of network adapters supporting TCP/IP. It can also be used with the IPoIB (TCP/IP over InfiniBand network) support of InfiniBand. However, it will not deliver performance or scalability comparable to the other interfaces.
- Shared-Memory-Nemesis: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems and laptops.
MVAPICH2 1.8 delivers better performance (especially with one-copy intra-node communication support with LiMIC2) compared to MVAPICH 1.2, the latest release package of MVAPICH supporting the MPI-1 standard. MVAPICH2 1.8 is compliant with the MPI 2.2 standard. In addition, MVAPICH2 1.8 provides support and optimizations for other MPI-2 features, NVIDIA GPUs, multi-threading and fault tolerance (Checkpoint-restart, Job-pause-migration-resume). The complete set of features of MVAPICH2 1.8 is listed below; new features compared to 1.7 are marked as (NEW).
- MPI-2.2 standard compliance
- Based on MPICH2-1.4.1p1
- OFA-IB-Nemesis interface design
- OpenFabrics InfiniBand network module support for MPICH2 Nemesis modular design
- Support for high-performance intra-node shared memory communication provided by the Nemesis design
- Adaptive RDMA fast path with Polling Set for high-performance inter-node communication
- Tuned RDMA Fast Path buffer size to get better performance with less memory footprint
- Optimization to limit number of RDMA Fast Path connections for very large clusters
- Shared Receive Queue (SRQ) support with flow control, uses significantly less memory for MPI library
- Header caching for low-latency
- Advanced AVL tree-based Resource-aware registration cache
- Memory Hook Support provided by integration with ptmalloc2 library. This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that heavily use malloc and free operations.
- Support for TotalView debugger
- Shared Library Support for existing binary MPI application programs to run
- ROMIO Support for MPI-IO
- Support for additional features (such as hwloc, hierarchical collectives, one-sided, multi-threading, etc.), as included in the MPICH2 1.2.1p1 Nemesis channel
- Support of Shared-Memory-Nemesis interface on multi-core platforms requiring intra-node communication only (SMP-only systems, laptops, etc.)
- Support for 3D torus topology with appropriate SL settings
- Quality of Service (QoS) support with multiple InfiniBand SL
- Automatic inter-node communication parameter tuning based on platform and adapter detection
- (NEW) Flexible HCA selection
- (NEW) Checkpoint-Restart support
- (NEW) Run-through stabilization support to handle process failures
- (NEW) Enhancements to handle IB errors gracefully
- Flexible process manager support
- mpirun_rsh to work with any of the ten interfaces (CH3 and Nemesis channel-based) including OFA-IB-Nemesis, TCP/IP-CH3 and TCP/IP-Nemesis
- Hydra process manager to work with any of the ten interfaces (CH3 and Nemesis channel-based) including OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3 and TCP/IP-CH3
- XRC support with Hydra Process Manager
- Improved debug message output in process management and fault tolerance functionality
- Better handling of process signals and error management in mpispawn
- Enhanced mpirun_rsh design to avoid race conditions and support for improved debug messages
- Support for various job launchers and job schedulers (such as SGE and OpenPBS/Torque)
- CH3 shared memory channel for standalone hosts (including SMP-only systems and laptops) without any InfiniBand adapters
- CH3-level design for scaling to multi-thousand cores with highest performance and reduced memory usage.
- HugePage support
- Integrated Hybrid (UD-RC/XRC) design to get best performance on large-scale systems with reduced/constant memory footprint
- (NEW) Support for running with UD only mode
- Support for MPI-2 Dynamic Process Management on InfiniBand Clusters
- eXtended Reliable Connection (XRC) support
- (NEW) Enable XRC by default at configure time
- Multiple CQ-based design for Chelsio 10GigE/iWARP
- Multi-port support for Chelsio 10GigE/iWARP
- Enhanced iWARP design for scalability to higher process count
- (NEW) Support iWARP interoperability between Intel NE020 and Chelsio T4 adapters
- Support for 3D torus topology with appropriate SL settings
- Quality of Service (QoS) support with multiple InfiniBand SL
- Scalable and robust daemon-less job startup
- Enhanced and robust mpirun_rsh framework (non-MPD-based) to provide scalable job launching on multi-thousand core clusters
- Hierarchical ssh to nodes to speed up job start-up
- MPMD job launch capability
- Available for all CH3- and Nemesis-channel-based interfaces
- Optimization to limit number of RDMA Fast Path Connections for very large clusters
- Tuned RDMA Fast Path buffer size to get better performance with less memory footprint
- (NEW) Optimization in buffer usage to achieve a smaller memory footprint
- On-demand Connection Management: This feature enables InfiniBand connections to be set up dynamically, enhancing the scalability of MVAPICH2 on clusters of thousands of nodes.
- Improved on-demand InfiniBand connection setup
- On-demand connection management support with IB CM (RoCE Interface)
- Native InfiniBand Unreliable Datagram (UD) based asynchronous connection management for OpenFabrics-IB interface.
- RDMA CM based on-demand connection management for OpenFabrics-IB and OpenFabrics-iWARP interfaces.
- uDAPL on-demand connection management based on standard uDAPL interface
- Message coalescing support to reduce per-queue-pair send queues, lowering the memory requirement on large-scale clusters. This design also significantly increases the small-message rate. Available for the OFA-IB-CH3 interface.
- Hot-Spot Avoidance Mechanism (HSAM) for alleviating network-congestion in large scale clusters. Available for OFA-IB-CH3 interface.
- RDMA Read utilized for increased overlap of computation and communication on OpenFabrics devices. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Shared Receive Queue (SRQ) with flow control. This design uses significantly less memory for MPI library. Available for OFA-IB-CH3 interface.
- Adaptive RDMA Fast Path with Polling Set for low-latency messaging. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Enhanced scalability for RDMA-based direct one-sided communication with less communication resource. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Supporting large data transfers (> 2GB)
- (NEW) Support for fallback to R3 rendezvous protocol if RGET fails
- (NEW) Support for MPI communication from NVIDIA GPU device memory
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Enhanced designs for Alltoall and Allgather collective communication from GPU device buffers
- Optimized and tuned support for collective communication from GPU buffers
- Non-contiguous datatype support in point-to-point and collective communication from GPU buffers
- Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Efficient synchronization mechanism using CUDA Events for pipelined device data transfers
- Dynamic Process Management (DPM) support with mpirun_rsh framework. Available for OFA-IB-CH3 interface.
- Configuration file support (similar to the one available in MVAPICH). Provides a convenient method for handling all runtime variables through a configuration file.
- Fault-tolerance support
- Checkpoint-restart support for application transparent systems-level fault tolerance. BLCR-based support using OFA-IB-CH3 and OFA-IB-Nemesis interfaces.
- Scalable Checkpoint-restart with mpirun_rsh framework
- Checkpoint-restart with Fault-Tolerance Backplane (FTB-CR) support
- Checkpoint-restart with intra-node shared memory (user-level) support
- Checkpoint-restart with intra-node shared memory (kernel-level with LiMIC2) support
- Checkpoint-restart support with pure SMP mode
- Allows best performance and scalability with fault-tolerance support
- (NEW) Run-through stabilization support to handle process failures using OFA-IB-Nemesis interface
- (NEW) Enhancements to handle IB errors gracefully using OFA-IB-Nemesis interface
- Application-initiated system-level checkpointing is also supported. A user application can request a whole-program checkpoint synchronously by calling special MVAPICH2 functions.
- Flexible interface to work with different file systems. Tested with ext3 (local disk), NFS and PVFS2.
- Network-Level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand.
- Fast Checkpoint-Restart support with aggregation scheme
- Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance
- (NEW) Enable signal-triggered (SIGUSR2) migration
- Fast process migration using RDMA
- Support for new standardized Fault Tolerant Backplane (FTB) Events for Checkpoint-Restart and Job Pause-Migration-Restart Framework
- Enhancement to software installation
- Full autoconf-based configuration
- Automatically detects system architecture and adapter types and optimizes MVAPICH2 for any particular installation.
- A utility (mpiname) for querying the MVAPICH2 library version and configuration information
- Automatically builds and installs OSU Benchmarks for end-user convenience
- Optimized intra-node communication support by taking advantage of shared-memory communication. Available for all interfaces (IB, iWARP and uDAPL).
- (NEW) New shared memory design for enhanced intra-node small message performance
- Kernel-level single-copy intra-node communication solution based on LiMIC2
- Upgraded to LiMIC2 version 0.5.5 to support large intra-node message (>2GB) transfers
- LiMIC2 is designed and developed jointly by The Ohio State University and System Software Laboratory at Konkuk University, Korea.
- Efficient Buffer Organization for Memory Scalability of Intra-node Communication
- Multi-core optimized
- (NEW) Adjust shared-memory communication block size at runtime
- Automatic intra-node communication parameter tuning based on platform
- Efficient connection set-up for multi-core systems
- Portable Hardware Locality (hwloc) support for defining CPU affinity
- (NEW) Integrated with Portable Hardware Locality (hwloc v1.4)
- Efficient CPU binding policies (bunch and scatter) to specify CPU binding per job for modern multi-core platforms
- (NEW) Enhanced support for CPU binding with socket and numanode level granularity
- (NEW) Show current CPU bindings with MV2_SHOW_CPU_BINDING
- Improved Bunch/Scatter mapping for process binding with HWLOC and SMT support
- Improved usability of process-to-CPU mapping with support for delimiters (',' and '-') in CPU listing
- Also allows user-defined CPU binding
- Optimized for Bus-based SMP and NUMA-Based SMP systems.
- Efficient support for diskless clusters
- Optimized collective communication operations. Available for all interfaces.
- Shared-memory aware K-nomial tree-based solution together with shared memory-based broadcast for scalable MPI_Bcast operations
- Shared-memory optimized algorithms and optimizations for Barrier, Reduce and All-reduce operations
- Performance improvements in Scatterv and Gatherv collectives for CH3 interface
- (NEW) Enhancements and optimizations for collectives (Bcast and Alltoallv)
- Integrated multi-rail communication support. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Supports multiple queue pairs per port and multiple ports per adapter
- Supports multiple adapters
- Support to selectively use some or all rails according to user specification
- Support for both one-sided and point-to-point operations
- Reduced stack size of internal threads to dramatically reduce memory requirement on multi-rail systems
- Dynamic detection of multiple InfiniBand adapters and using these by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces)
- Support for process-to-rail binding policy (bunch, scatter and user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces)
- Support for InfiniBand Quality of Service (QoS) with multiple lanes
- Multi-threading support. Available for all interfaces (IB, iWARP and uDAPL), including TCP/IP.
- Enhanced support for multi-threaded applications
- High-performance optimized and scalable support for one-sided communication: Put, Get and Accumulate. Supported synchronization calls: Fence, Active Target, Passive (lock and unlock). Available for all interfaces.
- Direct RDMA based One-sided communication support for OpenFabrics Gen2-iWARP and RDMA CM (with Gen2-IB)
- Enhanced scalability for RDMA-based direct one-sided communication with less communication resource.
- Enhancement to the design of Win_complete for RMA operations
- Removing the limitation on number of concurrent windows in RMA operations
- (NEW) Enhanced one-sided communication design with reduced memory requirement
- Using LiMIC2 for efficient intra-node RMA transfer to avoid extra memory copies
- Optimized Fence synchronization (with and without LiMIC2 support)
- Shared memory backed Windows for one-sided communication
- Support for truly passive locking for intra-node RMA in shared memory and LiMIC2-based Windows
- Two modes of communication progress
- Polling
- Blocking (enables running multiple MPI processes/processor). Available for OpenFabrics (IB and iWARP) interfaces.
- Scalable job startup schemes
- Enhanced and robust mpirun_rsh framework
- Hierarchical ssh-based schemes to nodes
- Flexibility for process execution with alternate group IDs
- Ring-based startup for RoCE
- Using in-band IB communication with MPD
- Support for SLURM
- (NEW) SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated hosts without specifying a hostfile
- (NEW) Support added to automatically use PBS_NODEFILE in Torque and PBS environments
- (NEW) Support for suspend/resume functionality with mpirun_rsh framework
- (NEW) Exporting local rank, local size, global rank and global size through environment variables (both mpirun_rsh and hydra)
- Advanced AVL tree-based Resource-aware registration cache
- Automatic tuning based on both platform type and network adapter
- Progress loop optimization for PSM-CH3 interface
- Improved performance for medium size messages for QLogic PSM
- Multi-core-aware collective support for QLogic PSM
- Memory Hook Support provided by integration with ptmalloc2 library. This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that heavily use malloc and free operations.
- High Performance and Portable Support for multiple networks and operating systems through uDAPL interface.
- InfiniBand (tested with)
- uDAPL over OpenFabrics-IB on Linux
- uDAPL over IBTL on Solaris
- Support for TotalView debugger with mpirun_rsh framework
- Shared Library Support for existing binary MPI application programs to run.
- Enhanced debugging config options to generate core files and back-traces
- Use of gfortran as the default F77 compiler
- ROMIO Support for MPI-IO.
- Optimized, high-performance ADIO driver for Lustre
- Single code base for the following platforms (Architecture, OS,
Compilers, Devices and InfiniBand adapters)
- Architecture: EM64T, x86_64 and x86
- Operating Systems: (tested with) Linux
- Compilers: GCC, Intel, PGI, Ekopath and Open64
- Devices: OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3, PSM-CH3, uDAPL-CH3, TCP/IP-CH3, OFA-IB-Nemesis and TCP/IP-Nemesis
- InfiniBand adapters (tested with):
- Mellanox InfiniHost adapters (SDR and DDR)
- Mellanox ConnectX (DDR and QDR with PCIe2)
- Mellanox ConnectX-2 (QDR with PCIe2)
- Mellanox ConnectX-3 (FDR with PCIe3)
- QLogic adapter (SDR)
- QLogic adapter (DDR and QDR with PCIe2)
- 10GigE (iWARP and RoCE) adapters:
- (tested with) Chelsio T3 and T4 adapter with iWARP support
- (tested with) Mellanox ConnectX-EN 10GigE adapter
- (tested with) Intel NE020 adapter with iWARP support
- 40GigE RoCE adapters:
- (tested with) Mellanox ConnectX-EN 40GigE adapter
- Public SVN access of the codebase
- A set of micro-benchmarks (including multi-threading latency test) for carrying out MPI-level performance evaluation after the installation
- Public mvapich-discuss mailing list, where MVAPICH users and developers can
- Ask for help and support from each other and get prompt responses
- Contribute patches and enhancements
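
The bunch and scatter CPU binding policies listed above can be pictured with a small model. The sketch below is only an illustration under stated assumptions, not MVAPICH2's implementation: it assumes a node whose core ids are numbered contiguously within each socket, and the helper names `bind_ranks` and `parse_cpu_list` are hypothetical. It also shows how a CPU listing that uses the ',' and '-' delimiters might be expanded; the exact syntax accepted by the library should be checked against the user guide.

```python
def bind_ranks(num_ranks, sockets, cores_per_socket, policy):
    """Map local rank -> core id in a simplified node model (hypothetical helper).

    Assumption: core ids are contiguous within a socket, so socket s owns
    cores [s * cores_per_socket, (s + 1) * cores_per_socket).
    """
    if num_ranks > sockets * cores_per_socket:
        raise ValueError("more ranks than cores in this simple model")
    mapping = []
    for rank in range(num_ranks):
        if policy == "bunch":
            # Pack ranks onto consecutive cores, filling one socket first.
            core = rank
        elif policy == "scatter":
            # Deal ranks out round-robin across sockets.
            socket, slot = rank % sockets, rank // sockets
            core = socket * cores_per_socket + slot
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        mapping.append(core)
    return mapping


def parse_cpu_list(spec):
    """Expand a CPU listing such as "0,2-4" into [0, 2, 3, 4]."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


if __name__ == "__main__":
    print(bind_ranks(4, sockets=2, cores_per_socket=4, policy="bunch"))
    print(bind_ranks(4, sockets=2, cores_per_socket=4, policy="scatter"))
    print(parse_cpu_list("0,2-4"))
```

In this model, with two sockets of four cores each, four ranks land on cores 0, 1, 2, 3 under bunch and on cores 0, 4, 1, 5 under scatter. Bunch keeps neighbouring ranks close together, while scatter spreads them across sockets.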
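
The K-nomial tree mentioned above for MPI_Bcast can be illustrated by computing which ranks each process forwards a message to. This is a generic textbook-style sketch, not the MVAPICH2 code: it assumes the root is rank 0, ignores the shared-memory stage inside each node, and uses the hypothetical helper names `knomial_children` and `tree_depth`.

```python
def knomial_children(rank, size, k):
    """Children of `rank` in a k-nomial tree over ranks 0..size-1, rooted at 0."""
    children = []
    mask = 1
    while mask < size:
        if rank % (mask * k) != 0:
            # The base-k digit at this position is nonzero: this is the level
            # at which `rank` receives, so it forwards nothing here or above.
            break
        for digit in range(1, k):
            child = rank + digit * mask
            if child < size:
                children.append(child)
        mask *= k
    return children


def tree_depth(size, k):
    """Number of forwarding levels needed to reach every rank from the root."""
    depth, frontier = 0, [0]
    while True:
        next_frontier = [c for r in frontier for c in knomial_children(r, size, k)]
        if not next_frontier:
            return depth
        frontier, depth = next_frontier, depth + 1


if __name__ == "__main__":
    for r in range(8):
        print(r, knomial_children(r, size=8, k=2))
    print("depth k=2:", tree_depth(8, 2), "depth k=4:", tree_depth(8, 4))
```

A larger radix k gives a shallower tree at the cost of more sends per process, which is the trade-off a tuned broadcast has to balance. With k = 2 the tree reduces to the familiar binomial tree.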
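
The local rank, local size, global rank and global size exported through environment variables (see the job startup features above) can be read by a process before MPI is initialized. The sketch below assumes the variables are named `MV2_COMM_WORLD_LOCAL_RANK`, `MV2_COMM_WORLD_LOCAL_SIZE`, `MV2_COMM_WORLD_RANK` and `MV2_COMM_WORLD_SIZE`; these names are not stated in this document and should be verified against the user guide for the installed version. The helper `read_rank_info` is hypothetical.

```python
import os

# Assumed variable names (not given in the feature list); verify before use.
MV2_VARS = {
    "local_rank": "MV2_COMM_WORLD_LOCAL_RANK",
    "local_size": "MV2_COMM_WORLD_LOCAL_SIZE",
    "global_rank": "MV2_COMM_WORLD_RANK",
    "global_size": "MV2_COMM_WORLD_SIZE",
}


def read_rank_info(env=None):
    """Return rank/size values as ints, or None where a variable is unset."""
    env = os.environ if env is None else env
    info = {}
    for key, name in MV2_VARS.items():
        value = env.get(name)
        info[key] = int(value) if value is not None else None
    return info


if __name__ == "__main__":
    # Outside a launched job the variables are unset, so every value is None.
    print(read_rank_info())
```

One plausible use is to choose a per-process resource, such as a GPU on a multi-GPU node, from the local rank before any MPI call is made.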

