MVAPICH2 (MPI-3 over OpenFabrics-IB, OpenFabrics-iWARP, PSM, and TCP/IP)
This is an MPI-3 implementation. The latest release is MVAPICH2 2.1rc1 (includes MPICH-3.1.3). It is available under BSD licensing.The current release supports the following ten underlying transport interfaces:
- OFA-IB-CH3: This interface supports all InfiniBand compliant devices based on the OpenFabrics Gen2 layer. This interface has the most features and is most widely used. For example, this interface can be used over all Mellanox InfiniBand adapters, IBM eHCA adapters and Qlogic adapters.
This interface supports all InfiniBand
compliant devices based on the
OpenFabrics libibverbs layer with the emerging
Nemesis channel of the
MPICH2 stack. This interface can be used by all Mellanox
- OFA-iWARP-CH3: This interface supports all iWARP compliant devices supported by OpenFabrics. For example, this layer supports Chelsio T3 adapters with the native iWARP mode.
- OFA-RoCE-CH3: This interface supports the emerging RoCE (RDMA over Convergence Ethernet) interface for Mellanox ConnectX-EN adapters with 10/40GigE switches.
- PSM-CH3: This interface provides native support
for InfiniPath adapters from QLogic over PSM interface. It provides
high-performance point-to-point communication for both
one-sided and two-sided operations.
- Shared-Memory-CH3: This interface provides native shared
memory support on multi-core platforms where communication is
required only within a node. Such as SMP-only systems, laptops,
- TCP/IP-CH3: The standard TCP/IP interface (provided by MPICH2) to work with a range of network adapters supporting TCP/IP interface. This interface can be used with IPoIB (TCP/IP over InfiniBand network) support of InfiniBand also. However, it will not deliver good performance/ scalability as compared to the other interfaces.
- TCP/IP-Nemesis: The standard TCP/IP interface (provided by MPICH2 Nemesis channel) to work with a range of network adapters supporting TCP/IP interface. This interface can be used with IPoIB (TCP/IP over InfiniBand network) support of InfiniBand also. However, it will not deliver good performance/ scalability as compared to the other interfaces.
- Shared-Memory-Nemesis: This interface provides native shared memory support on multi-core platforms where communication is required only within a node. Such as SMP-only systems, laptops, etc.
MVAPICH2 2.1rc1 provides many features including MPI-3 standard compliance, single copy intra-node communication using Linux supported CMA (Cross Memory Attach), Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR), high-performance and scalable InfiniBand hardware multicast-based collectives, enhanced shared-memory-aware and intra-node Zero-Copy collectives (using LiMIC), high-performance communication support for NVIDIA GPU with IPC, collective and non-contiguous datatype support, integrated hybrid UD-RC/XRC design, support for UD only mode, nemesis-based interface, shared memory interface, scalable and robust daemon-less job startup (mpirun-rsh), flexible process manager support (mpirun-rsh and hydra.mpiexec), full autoconf-based configuration, portable hardware locality (hwloc) with flexible CPU granularity policies (core, socket and numanode) and binding policies (bunch and scatter) with SMT support, flexible rail binding with processes for multirail configurations, message coalescing, dynamic process migration, fast process-level fault-tolerance with checkpoint-restart, fast job-pause-migration-resume framework for pro-active fault-tolerance, suspend/resume, network-level fault-tolerance with Automatic Path Migration (APM), RDMA CM support, iWARP support, optimized collectives, on-demand connection management, multi-pathing, RDMA Read-based and RDMA-write-based designs, polling and blocking-based communication progress, multi-core optimized and scalable shared memory support, LiMIC2-based kernel-level shared memory support for both two-sided and one-sided operations, shared memory backed Windows for one-Sided communication, HugePage support, and memory hook with ptmalloc2 library support. The ADI-3-level design of MVAPICH2 2.1rc1 supports many features including: MPI-2 functionalities (one-sided, dynamic process management, collectives and datatype), multi-threading and all MPI-1 functionalities. It also supports a wide range of platforms, architectures, OS, compilers, InfiniBand adapters (Mellanox and QLogic), iWARP adapters (including the new Chelsio T4 adapter) and RoCE adapters. A complete set of features and supported platforms can be found here..
The complete MVAPICH2 2.1rc1 package is available through public anonymous MVAPICH SVN.
Successive versions with additional features (such as collective offload with CORE-Direct and support of advanced features for Nemesis) will be available soon.
MVAPICH2-X (Unified MPI+PGAS Communication Runtime over OpenFabrics/Gen2 for Exascale Systems)
Message Passing Interface (MPI) has been the most popular programming model for developing parallel scientific applications. Partitioned Global Address Space (PGAS) programming models are an attractive alternative for designing applications with irregular communication patterns. They improve programmability by providing a shared memory abstraction while exposing locality control required for performance. It is widely believed that hybrid programming model (MPI+X, where X is a PGAS model) is optimal for many scientific computing problems, especially for exascale computing.
MVAPICH2-X provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port parts of large MPI applications that are suited for PGAS programming model. This minimizes the development overheads that have been a huge deterrent in porting MPI applications to use PGAS models. The unified runtime also delivers superior performance compared to using separate MPI and PGAS libraries by optimizing use of network and memory resources.
MVAPICH2-X supports Unified Parallel C (UPC), OpenSHMEM and Coarray Fortran (CAF) as PGAS models. It can be used to run pure MPI, MPI+OpenMP, pure UPC, pure OpenSHMEM, pure CAF as well as hybrid MPI(+OpenMP) + PGAS applications. MVAPICH2-X derives from the popular MVAPICH2 library and inherits many of its features for performance and scalability of MPI communication. It takes advantage of the RDMA features offered by the InfiniBand interconnect to support UPC/OpenSHMEM/CAF data transfer and atomic operations. It also provides a high-performance shared memory channel for multi-core InfiniBand clusters.
The MPI implementation of MVAPICH2-X is based on MVAPICH2, which supports all MPI-3 features. The UPC implementation is UPC Language Specification v1.2 standard compliant and is based on Berkeley UPC v2.18.0. OpenSHMEM implementation is OpenSHMEM v1.0 standard compliant and is based on OpenSHMEM Reference Implementation v1.0f. CAF implementation is Coarray Fortran 2015 standard compliant and is based on UH CAF 3.0.39 Reference Implementation. The current release supports communication using InfiniBand Transport (inter-node) and Shared Memory (intra-node). The overall architecture of MVAPICH2-X is shown in the figure below.
MVAPICH2-GDR (MVAPICH2 with GPUDirect RDMA)
MVAPICH2-GDR, based on the standard MVAPICH2 software stack, incorporates designs that take advantage of the new GPUDirect RDMA technology for inter-node data movement on NVIDIA GPUs clusters with Mellanox InfiniBand interconnect. GPUDirect RDMA completely by-passes the host memory, providing low-latency and completely offloaded communication between NVIDIA GPUs on a cluster. MVAPICH2-GDR reaps the benefits of this new fast communication path while offering hybrid designs that help work around peer-to-peer bandwidth bottlenecks seen on modern node architectures. It provides significantly improved performance for small and medium messages while achieving close to peak network bandwidth for large messages.
MVAPICH2-GDR also inherits all the features for communication on NVIDIA GPU clusters that are available in the MVAPICH2 software stack. A complete list of these features can be found here.
MVAPICH2-MIC (MVAPICH2 with Xeon Phi (MIC) Support)
MVAPICH2-MIC, based on the standard MVAPICH2 software stack, incorporates hybrid designs that use Shared memory, SCIF and IB channels to optimize intra-node and inter-node data movement on Xeon Phi clusters.
MVAPICH2-MIC supports the three modes of MIC-based system usage: Offload, Native and Symmetric. In Offload mode, MPI processes are executed on the host processors. The Xeon Phi works as an accelerator to offload computation onto. This mode is similar to most GPGPU clusters where GPUs are only used to accelerate computation. The Coprocessor-only (Native) mode means to run all MPI processes only on the MIC architecture and the Hosts are not involved in the execution. In the Symmetric mode of operation, MPI processes are uniformly spawned over both the host and the coprocessor architectures. The developer has to explicitly manage parallelism across the two different architectures.
With the two last modes, in order to optimize the intra-MIC communication, MVAPICH2-MIC uses hybrid schemes that mix shared memory and SCIF channels. Similar hybrid designs with SCIF and IB are used to efficiently support Intra-node Host-MIC communications. To overcome the Bandwidth limitation of the current chipsets (SandyBridge and Ivybridge), MVAPICH2-MIC, employs 3 different proxy-based designs : active 1-hop, active 2-hops and passive that efficiently and transparently reroute the communication through the host. The proxy designs use a hybrid scheme which efficiently pipelines the communication utilizing SCIF and IB channels. For more information on the usage and tuning of MVAPICH2-MIC please refer to the README.
MVAPICH2-MIC also inherits all the features for communication on HPC Clusters available in the MVAPICH2 software stack. A complete list of these features can be found here