MVAPICH2 2.0rc1 Features and Supported Platforms
New features and enhancements compared to the MVAPICH2 1.9 release are marked as (NEW). MVAPICH2 (MPI-3 over InfiniBand) is an MPI-3 implementation based on the MPICH ADI3 layer. MVAPICH2 2.0rc1 is available as a single integrated package (with MPICH-3.1). The current release supports the following ten underlying transport interfaces:
- OFA-IB-CH3: This interface supports all InfiniBand compliant devices based on the OpenFabrics libibverbs layer with the CH3 channel (OSU enhanced) of the MPICH2 stack. It has the most features and is the most widely used. For example, it can be used over all Mellanox InfiniBand adapters, IBM eHCA adapters and QLogic adapters.
- OFA-IB-Nemesis: This interface supports all InfiniBand compliant devices based on the OpenFabrics libibverbs layer with the emerging Nemesis channel of the MPICH2 stack. This interface can be used by all Mellanox InfiniBand adapters.
- OFA-iWARP-CH3: This interface supports all iWARP compliant devices supported by OpenFabrics. For example, this layer supports Chelsio T3 adapters with the native iWARP mode.
- OFA-RoCE-CH3: This interface supports the emerging RDMA over Converged Ethernet (RoCE) interface for Mellanox ConnectX-EN adapters with 10/40GigE switches.
- PSM-CH3: This interface provides native support for InfiniPath adapters from QLogic over the PSM interface. It provides high-performance point-to-point communication for both one-sided and two-sided operations.
- Shared-Memory-CH3: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems and laptops.
- TCP/IP-CH3: The standard TCP/IP interface (provided by the MPICH2 CH3 channel), which works with a range of network adapters that support TCP/IP. This interface can also be used with the IPoIB (TCP/IP over InfiniBand network) support of InfiniBand. However, it will not deliver performance or scalability comparable to the other interfaces.
- TCP/IP-Nemesis: The standard TCP/IP interface (provided by the MPICH2 Nemesis channel), which works with a range of network adapters that support TCP/IP. This interface can also be used with the IPoIB (TCP/IP over InfiniBand network) support of InfiniBand. However, it will not deliver performance or scalability comparable to the other interfaces.
- Shared-Memory-Nemesis: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems and laptops.
MVAPICH2 2.0rc1 is compliant with the MPI-3 standard. In addition, MVAPICH2 2.0rc1 provides support and optimizations for NVIDIA GPUs, multi-threading and fault tolerance (Checkpoint-restart, Job-pause-migration-resume). New features compared to 1.9 are indicated as (NEW). The complete set of features of MVAPICH2 2.0rc1 is:
- (NEW) Based on MPICH-3.1
- MPI-3 standard compliance
- Nonblocking collectives
- Neighborhood collectives
- MPI_Comm_split_type support
- Support for MPI_Type_create_hindexed_block
- Nonblocking communicator duplication routine MPI_Comm_idup (will only work for single-threaded programs)
- MPI_Comm_create_group support
- Support for matched probe functionality
- Support for "Const" (disabled by default)
- CH3-level design for scaling to multi-thousand cores with highest performance and reduced memory usage.
- (NEW) Support for MPI-3 RMA in OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3, and PSM-CH3
- (NEW) Exposing some internal performance variables to MPI-3 Tools information interface (MPIT)
- (NEW) Reduced memory footprint
- (NEW) Multi-rail support for UD-Hybrid channel
- Support for InfiniBand hardware UD-Multicast based collectives
- HugePage support
- Integrated Hybrid (UD-RC/XRC) design to get best performance on large-scale systems with reduced/constant memory footprint
- Support for running with UD only mode
- Support for MPI-2 Dynamic Process Management on InfiniBand Clusters
- eXtended Reliable Connection (XRC) support
- Enable XRC by default at configure time
- Multiple CQ-based design for Chelsio 10GigE/iWARP
- Multi-port support for Chelsio 10GigE/iWARP
- Enhanced iWARP design for scalability to higher process count
- Support iWARP interoperability between Intel NE020 and Chelsio T4 adapters
- Support for 3D torus topology with appropriate SL settings
- Quality of Service (QoS) support with multiple InfiniBand SL
- On-demand Connection Management: This feature enables InfiniBand connections to be set up dynamically, enhancing the scalability of MVAPICH2 on clusters of thousands of nodes.
- Improved on-demand InfiniBand connection setup
- On-demand connection management support with IB CM (RoCE Interface)
- Native InfiniBand Unreliable Datagram (UD) based asynchronous connection management for OpenFabrics-IB interface.
- RDMA CM based on-demand connection management for OpenFabrics-IB and OpenFabrics-iWARP interfaces.
- Message coalescing support to reduce per-queue-pair send queues and thereby the memory requirement on large-scale clusters. This design also significantly increases the small-message rate. Available for the OFA-IB-CH3 interface.
- RDMA Read utilized for increased overlap of computation and communication on OpenFabrics devices. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Shared Receive Queue (SRQ) with flow control. This design uses significantly less memory for MPI library. Available for OFA-IB-CH3 interface.
- Adaptive RDMA Fast Path with Polling Set for low-latency messaging. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Header caching for low-latency
- CH3 shared memory channel for standalone hosts (including SMP-only systems and laptops) without any InfiniBand adapters
- Enhanced scalability for RDMA-based direct one-sided communication with less communication resource. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Removed libibumad dependency for building the library
- Option for selecting non-default gid-index in a loss-less fabric setup in RoCE mode
- Option to disable signal handler setup
- Option to use IP address as a fallback if hostname cannot be resolved
- (NEW) Improved job-startup performance
- Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
- Improved debug messages and error reporting
- Supporting large data transfers (> 2GB)
- Support for MPI communication from NVIDIA GPU device memory
- (NEW) Multi-rail support for GPU communication
- (NEW) Support for non-blocking streams in asynchronous CUDA transfers for better overlap
- (NEW) Dynamic CUDA initialization. Support GPU device selection after MPI_Init
- (NEW) Support for running on heterogeneous clusters with GPU and non-GPU nodes
- (NEW) Tunable CUDA kernels for vector datatype processing for GPU communication
- (NEW) Optimized sub-array data-type processing for GPU-to-GPU communication
- (NEW) Added options to specify CUDA library paths
- Efficient vector, hindexed datatype processing on GPU buffers
- Tuned MPI performance on Kepler GPUs
- Improved intra-node communication with GPU buffers using pipelined design
- Improved inter-node communication with GPU buffers with non-blocking CUDA copies
- Improved small message communication performance with CUDA IPC design
- Improved automatic GPU device selection and CUDA context management
- Optimal communication channel selection for different GPU communication modes (DD, HH and HD) in different configurations (intra-IOH and inter-IOH)
- Provided option to use CUDA library call instead of CUDA driver to check buffer pointer type
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Enhanced designs for Alltoall and Allgather collective communication from GPU device buffers
- Optimized and tuned support for collective communication from GPU buffers
- Non-contiguous datatype support in point-to-point and collective communication from GPU buffers
- Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Efficient synchronization mechanism using CUDA Events for pipelined device data transfers
- OFA-IB-Nemesis interface design
- OpenFabrics InfiniBand network module support for MPICH2 Nemesis modular design
- Optimized adaptive RDMA fast path with Polling Set for high-performance inter-node communication
- Shared Receive Queue (SRQ) support with flow control, uses significantly less memory for MPI library
- Header caching for low-latency
- Support for additional features (such as hwloc, hierarchical collectives, one-sided, multi-threading, etc.), as included in the MPICH2 Nemesis channel
- Support of Shared-Memory-Nemesis interface on multi-core platforms requiring intra-node communication only (SMP-only systems, laptops, etc.)
- Support for 3D torus topology with appropriate SL settings
- Quality of Service (QoS) support with multiple InfiniBand SL
- Automatic inter-node communication parameter tuning based on platform and adapter detection
- Flexible HCA selection
- Checkpoint-Restart support
- Run-through stabilization support to handle process failures
- Enhancements to handle IB errors gracefully
- Flexible process manager support
- (NEW) Improved hierarchical job startup performance
- Enhanced hierarchical ssh-based robust mpirun_rsh framework to work with any interface (CH3 and Nemesis channel-based) including OFA-IB-Nemesis, TCP/IP-CH3 and TCP/IP-Nemesis to launch jobs on multi-thousand core clusters
- Introduced option to export environment variables automatically with mpirun_rsh
- Support for automatic detection of paths to utilities (rsh, ssh, xterm, TotalView) used by mpirun_rsh during configuration
- Support for launching jobs on heterogeneous networks with mpirun_rsh
- MPMD job launch capability
- Hydra process manager to work with any of the ten interfaces (CH3 and Nemesis channel-based) including OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3 and TCP/IP-CH3
- Improved debug message output in process management and fault tolerance functionality
- Better handling of process signals and error management in mpispawn
- Flexibility for process execution with alternate group IDs
- Using in-band IB communication with MPD
- SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated hosts without specifying a hostfile
- Support added to automatically use PBS_NODEFILE in Torque and PBS environments
- Support for suspend/resume functionality with mpirun_rsh framework
- Exporting local rank, local size, global rank and global size through environment variables (both mpirun_rsh and hydra)
- Support for various job launchers and job schedulers (such as SGE and OpenPBS/Torque)
- Configuration file support (similar to one available in MVAPICH). Provides a convenient method for handling all runtime variables through a configuration file.
- (NEW) Enable hierarchical SSH-based startup with Checkpoint-Restart
- (NEW) Enable the use of Hydra launcher with Checkpoint-Restart for OFA-IB-CH3 and OFA-IB-Nemesis interfaces
- Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
- Support for application-level checkpointing
- Support for hierarchical system-level checkpointing
- Checkpoint-restart support for application transparent systems-level fault tolerance. BLCR-based support using OFA-IB-CH3 and OFA-IB-Nemesis interfaces.
- Scalable Checkpoint-restart with mpirun_rsh framework
- Checkpoint-restart with Fault-Tolerance Backplane (FTB-CR) support
- Checkpoint-restart with intra-node shared memory (user-level) support
- Checkpoint-restart with intra-node shared memory (kernel-level with LiMIC2) support
- Checkpoint-restart support with pure SMP mode
- Allows best performance and scalability with fault-tolerance support
- Run-through stabilization support to handle process failures using OFA-IB-Nemesis interface
- Enhancements to handle IB errors gracefully using OFA-IB-Nemesis interface
- Application-initiated system-level checkpointing is also supported. A user application can request a whole-program checkpoint synchronously by calling special MVAPICH2 functions.
- Flexible interface to work with different file systems. Tested with ext3 (local disk), NFS and PVFS2.
- Network-Level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand.
- Fast Checkpoint-Restart support with aggregation scheme
- Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance
- Enable signal-triggered (SIGUSR2) migration
- Fast process migration using RDMA
- Support for new standardized Fault Tolerant Backplane (FTB) Events for Checkpoint-Restart and Job Pause-Migration-Restart Framework
- Enhancement to software installation
- Revamped Build system:
- Uses automake instead of simplemake
- Allows for parallel builds ("make -j8" and similar)
- Full autoconf-based configuration
- Automatically detects system architecture and adapter types and optimizes MVAPICH2 for any particular installation.
- A utility (mpiname) for querying the MVAPICH2 library version and configuration information
- Automatically builds and installs OSU Benchmarks for end-user convenience
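The launcher features above include exporting the local rank, local size, global rank and global size through environment variables. A minimal sketch of how a program might read them follows. The variable names used here (MV2_COMM_WORLD_RANK, MV2_COMM_WORLD_SIZE, MV2_COMM_WORLD_LOCAL_RANK, MV2_COMM_WORLD_LOCAL_SIZE) are not stated in this list and are an assumption about the names the launcher exports; the helpers launcher_ranks and pick_device are hypothetical and only illustrate one possible use (choosing a device per node-local rank).

```python
import os


def launcher_ranks(env=None):
    """Return rank information exported by the launcher, or None if absent.

    The MV2_COMM_WORLD_* names are an assumption; this feature list only
    says that such values are exported through environment variables.
    """
    env = os.environ if env is None else env
    names = {
        "global_rank": "MV2_COMM_WORLD_RANK",
        "global_size": "MV2_COMM_WORLD_SIZE",
        "local_rank": "MV2_COMM_WORLD_LOCAL_RANK",
        "local_size": "MV2_COMM_WORLD_LOCAL_SIZE",
    }
    try:
        return {key: int(env[name]) for key, name in names.items()}
    except KeyError:
        # At least one variable is missing, e.g. the program was not
        # started by a launcher that exports these values.
        return None


def pick_device(local_rank, devices_per_node):
    """Map a node-local rank to a device index in round-robin order."""
    if devices_per_node <= 0:
        raise ValueError("devices_per_node must be positive")
    return local_rank % devices_per_node
```

Reading the values from the environment lets a program decide on per-node resources before any MPI call is made, which is one reason a launcher might export them.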
Optimized intra-node communication support by taking advantage of shared-memory communication. Available for all interfaces (IB and iWARP).
- (NEW) Enhanced intra-node SMP performance
- (NEW) Tuned SMP eager threshold parameters
- New shared memory design for enhanced intra-node small message performance
- Support for single copy intra-node communication using Linux supported CMA (Cross Memory Attach)
- Kernel-level single-copy intra-node communication solution based on LiMIC2
- Upgraded to LiMIC2 version 0.5.6 to support unlocked ioctl calls
- LiMIC2 is designed and developed jointly by The Ohio State University and System Software Laboratory at Konkuk University, Korea.
- Efficient Buffer Organization for Memory Scalability of Intra-node Communication
- Multi-core optimized
- Adjust shared-memory communication block size at runtime
- Automatic intra-node communication parameter tuning based on platform
- Efficient connection set-up for multi-core systems
- (NEW) Portable Hardware Locality (hwloc v1.8) support for defining CPU affinity
- Efficient CPU binding policies (bunch and scatter) to specify CPU binding per job for modern multi-core platforms with SMT support
- Enhanced support for CPU binding with socket and numanode level granularity
- Show current CPU bindings with MV2_SHOW_CPU_BINDING
- Improved usability of process to CPU mapping with support of delimiters (',' , '-') in CPU listing
- Also allows user-defined CPU binding
- Optimized for Bus-based SMP and NUMA-Based SMP systems.
- Efficient support for diskless clusters
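The CPU binding items above mention that a CPU listing may use the delimiters ',' and '-'. The following is a minimal sketch of how such a listing could be expanded into individual core numbers. It is an illustration only, not MVAPICH2's parser; the function name parse_cpu_list and the exact accepted syntax (comma-separated entries, each a single core or an inclusive low-high range) are assumptions.

```python
def parse_cpu_list(spec):
    """Expand a CPU listing such as "0,2-4" into a list of core numbers.

    Assumed syntax: entries separated by ',', where each entry is either
    one core number or an inclusive range written as "low-high".
    """
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in CPU list: {spec!r}")
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            # Ranges are treated as inclusive on both ends.
            cpus.extend(range(low, high + 1))
        else:
            cpus.append(int(part))
    return cpus
```

Under these assumptions, a listing of "0,2-4" names four cores (0, 2, 3 and 4), which is shorter to write than enumerating each core.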
Optimized collective communication operations. Available for OFA-IB-CH3, OFA-iWARP-CH3, and OFA-RoCE-CH3
- (NEW) Optimized and tuned blocking and non-blocking collectives for OFA-IB-CH3, OFA-IB-Nemesis and PSM-CH3 channels
- (NEW) Enhanced MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather performance
- Hardware UD-Multicast based designs for collectives - Bcast, Allreduce and Scatter
- Intra-node Zero-Copy designs (using LiMIC) for MPI_Gather collective
- Enhancements and optimizations for point-to-point designs for Broadcast
- Improved performance for shared-memory based collectives - Broadcast, Barrier, Allreduce, Reduce
- Performance improvements in Scatterv and Gatherv collectives for CH3 interface
- Enhancements and optimizations for collectives (Alltoallv, Allgather)
- Tuned Bcast, Alltoall, Scatter, Allgather, Allgatherv, Reduce, Reduce_Scatter, Allreduce collectives
Integrated multi-rail communication support. Available for OFA-IB-CH3 and OFA-iWARP-CH3 interfaces.
- Supports multiple queue pairs per port and multiple ports per adapter
- Supports multiple adapters
- Support to selectively use some or all rails according to user specification
- Support for both one-sided and point-to-point operations
- Reduced stack size of internal threads to dramatically reduce memory requirement on multi-rail systems
- Dynamic detection of multiple InfiniBand adapters and using these by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces)
- Support for process-to-rail binding policy (bunch, scatter and user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces)
- Support for InfiniBand Quality of Service (QoS) with multiple lanes
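The multi-rail items above name two built-in process-to-rail binding policies, bunch and scatter. The sketch below shows one plausible reading of them: bunch assigns rails to node-local processes in contiguous blocks, while scatter assigns them cyclically. This interpretation and the function rail_for_process are assumptions made for illustration and are not taken from the MVAPICH2 source.

```python
def rail_for_process(local_rank, local_size, num_rails, policy="bunch"):
    """Pick a rail index for a node-local process (illustrative only).

    Assumed semantics:
      bunch   - neighbouring local ranks share a rail (block order)
      scatter - neighbouring local ranks use different rails (cyclic order)
    """
    if num_rails <= 0 or local_size <= 0:
        raise ValueError("num_rails and local_size must be positive")
    if not 0 <= local_rank < local_size:
        raise ValueError("local_rank must be in [0, local_size)")
    if policy == "scatter":
        return local_rank % num_rails
    if policy == "bunch":
        # Split the local ranks into num_rails contiguous blocks.
        return local_rank * num_rails // local_size
    raise ValueError(f"unknown policy: {policy!r}")
```

With four local processes and two rails, this sketch places ranks 0 and 1 on rail 0 and ranks 2 and 3 on rail 1 under bunch, and alternates rails 0, 1, 0, 1 under scatter.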
Multi-threading support. Available for all interfaces (IB and iWARP).
- Enhanced support for multi-threaded applications
High-performance optimized and scalable support for one-sided communication: Put, Get and Accumulate. Supported synchronization calls: Fence, Active Target, Passive (lock and unlock). Available for all interfaces.
- (NEW) Enhanced direct RDMA based designs for MPI_Put and MPI_Get operations in OFA-IB-CH3 channel
- (NEW) Optimized communication when using MPI_Win_allocate for OFA-IB-CH3 channel
- (NEW) MPI-3 RMA support for PSM-CH3 channel
- Direct RDMA based One-sided communication support for OpenFabrics Gen2-iWARP and RDMA CM (with Gen2-IB)
- Shared memory backed Windows for one-sided communication
Two modes of communication progress
- Blocking (enables running multiple MPI processes/processor). Available for OpenFabrics (IB and iWARP) interfaces.
- Advanced AVL tree-based Resource-aware registration cache
- Adaptive number of registration cache entries based on job size
- Automatic tuning based on both platform type and network adapter
- Progress engine optimization for PSM-CH3 interface
- Improved performance for medium size messages for QLogic PSM
- Multi-core-aware collective support for QLogic PSM
- Memory Hook Support provided by integration with ptmalloc2 library. This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that heavily use malloc and free operations.
- (NEW) Warn and continue when ptmalloc fails to initialize
- Support for TotalView debugger with mpirun_rsh framework
- Shared library support, allowing existing binary MPI application programs to run.
- Enhanced debugging config options to generate core files and back-traces
- Use of gfortran as the default F77 compiler
- ROMIO Support for MPI-IO.
- Optimized, high-performance ADIO driver for Lustre
- Single code base for the following platforms (Architecture, OS, Compilers, Devices and InfiniBand adapters)
- Architecture: EM64T, x86_64 and x86
- Operating Systems: (tested with) Linux
- Compilers: GCC, Intel, PGI, Ekopath and Open64
- Devices: OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3, PSM-CH3, TCP/IP-CH3, OFA-IB-Nemesis and TCP/IP-Nemesis
- InfiniBand adapters (tested with):
- Mellanox InfiniHost adapters (SDR and DDR)
- Mellanox ConnectX (DDR and QDR with PCIe2)
- Mellanox ConnectX-2 (QDR with PCIe2)
- Mellanox ConnectX-3 (FDR with PCIe3)
- Mellanox Connect-IB (Dual FDR with PCIe3)
- QLogic adapter (SDR)
- QLogic adapter (DDR and QDR with PCIe2)
- 10GigE (iWARP and RoCE) adapters:
- (tested with) Chelsio T3 and T4 adapter with iWARP support
- (tested with) Mellanox ConnectX-EN 10GigE adapter
- (tested with) Intel NE020 adapter with iWARP support
- 40GigE RoCE adapters:
- (tested with) Mellanox ConnectX-EN 40GigE adapter
- Public SVN access of the codebase
- A set of micro-benchmarks (including multi-threading latency test) for carrying out MPI-level performance evaluation after the installation
- Public mvapich-discuss mailing list, which lets MVAPICH users and developers:
- Ask for help and support from each other and get prompt responses
- Contribute patches and enhancements