MVAPICH2 1.4 Features and Supported Platforms

Features that are new compared to the MVAPICH2 1.2 series are marked (NEW).

MVAPICH2 (MPI-2 over InfiniBand) is an MPI-2 implementation based on the MPICH2 ADI3 layer. The latest release is compliant with the MPI 2.1 standard and also supports all MPI-1 functionality. MVAPICH2 1.4 is available as a single integrated package (with MPICH2 1.0.8p1). The current release supports the following six underlying transport interfaces:
  • OpenFabrics-IB: This interface supports all InfiniBand-compliant devices based on the OpenFabrics libibverbs layer. It has the most features and is the most widely used. For example, it can be used over all Mellanox InfiniBand adapters, IBM eHCA adapters and QLogic adapters.
  • OpenFabrics-iWARP: This interface supports all iWARP-compliant devices supported by OpenFabrics. For example, it supports Chelsio T3 adapters in native iWARP mode.
  • (NEW) OpenFabrics Gen2-RDMAoE: This interface supports the emerging RDMAoE (RDMA over Ethernet) interface for Mellanox ConnectX-EN adapters with 10GigE switches.
  • (NEW) QLogic InfiniPath: This interface provides native support for InfiniPath adapters from QLogic over the PSM interface. It provides high-performance point-to-point communication for both one-sided and two-sided operations.
  • uDAPL: This interface supports all network adapters and software stacks that implement the portable DAPL interface from the DAT Collaborative. For example, it can be used over all Mellanox adapters, Chelsio adapters and NetEffect adapters. It can also be used with the Solaris uDAPL-IBTL implementation over InfiniBand adapters.
  • TCP/IP: The standard TCP/IP interface (provided by MPICH2), which works with a range of network adapters that support TCP/IP. It can also be used with the IPoIB (TCP/IP over InfiniBand) support of InfiniBand. However, it will not deliver performance or scalability comparable to the other interfaces.
Please note that support for the VAPI interface has been deprecated in MVAPICH2 (since version 1.2) because the OpenFabrics interface has become more popular. MVAPICH2 users still using the VAPI interface are strongly encouraged to migrate to the OpenFabrics-IB interface.

MVAPICH2 1.4 delivers better performance (especially with one-copy intra-node communication support based on LiMIC2) than MVAPICH 1.1, the latest release of MVAPICH, which supports the MPI-1 standard. In addition, MVAPICH2 1.4 provides support and optimizations for other MPI-2 features (including dynamic process management), multi-threading and fault tolerance (checkpoint-restart). The complete set of features of MVAPICH2 1.4 is:

  • Design for scaling to multi-thousand cores with highest performance and reduced memory usage.
    • (NEW) Support for MPI-2 Dynamic Process Management on InfiniBand Clusters
    • (NEW) eXtended Reliable Connection (XRC) support
    • (NEW) Multiple CQ-based design for Chelsio 10GigE/iWARP
    • Scalable and robust daemon-less job startup
      • Enhanced and robust mpirun_rsh framework (non-MPD-based) to provide scalable job launching on multi-thousand core clusters
      • (NEW) Hierarchical ssh to nodes to speed up job start-up
      • Available for OpenFabrics (IB and iWARP) and uDAPL interfaces (including Solaris)
    • On-demand Connection Management: This feature enables InfiniBand connections to be set up dynamically, enhancing the scalability of MVAPICH2 on clusters of thousands of nodes.
      • Native InfiniBand Unreliable Datagram (UD) based asynchronous connection management for OpenFabrics-IB interface.
      • RDMA CM based on-demand connection management for OpenFabrics-IB and OpenFabrics-iWARP interfaces.
      • uDAPL on-demand connection management based on standard uDAPL interface (including uDAPL-IB support in Solaris).
    • Message coalescing support, which reduces the size of the per-queue-pair send queues and thereby lowers memory requirements on large-scale clusters. This design also significantly increases the small-message messaging rate. Available for OpenFabrics-IB interface.
    • Hot-Spot Avoidance Mechanism (HSAM) for alleviating network-congestion in large scale clusters. Available for OpenFabrics-IB interface.
    • RDMA Read is used to increase the overlap of computation and communication. Available for OpenFabrics (IB and iWARP) interfaces.
    • Shared Receive Queue (SRQ) with flow control. This design uses significantly less memory for the MPI library. Available for OpenFabrics-IB interface.
    • Adaptive RDMA Fast Path with Polling Set for low-latency messaging. Available for OpenFabrics (IB and iWARP) interfaces.
    • Enhanced scalability for RDMA-based direct one-sided communication using fewer communication resources. Available for OpenFabrics (IB and iWARP) interfaces.
  • (NEW) Dynamic Process Management (DPM) support with mpirun_rsh framework. Available for OpenFabrics IB interface.
  • Fault tolerance support
    • Checkpoint-restart support for application-transparent system-level fault tolerance. BLCR-based support using OpenFabrics-IB interface.
      • (NEW) Scalable Checkpoint-restart with mpirun_rsh framework
      • (NEW) Checkpoint-restart with Fault-Tolerant Backplane (FTB-CR) support
      • Checkpoint-restart with intra-node shared memory (user-level) support
      • (NEW) Checkpoint-restart with intra-node shared memory (kernel-level with LiMIC2) support
      • Allows best performance and scalability with fault-tolerance support
    • Application-initiated system-level checkpointing is also supported. A user application can request a whole-program checkpoint synchronously by calling special MVAPICH2 functions.
      • Flexible interface to work with different file systems. Tested with ext3 (local disk), NFS and PVFS2.
    • Network-Level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand.
  • Enhancement to software installation
    • Automatically detects system architecture and adapter types and optimizes MVAPICH2 for any particular installation.
    • An application (mpiname) for querying the MVAPICH2 library version and configuration information
  • Optimized intra-node communication support by taking advantage of shared-memory communication. Available for all interfaces (IB, iWARP and uDAPL).
    • (NEW) Kernel-level single-copy intra-node communication solution based on LiMIC2
    • Efficient Buffer Organization for Memory Scalability of Intra-node Communication
    • Multi-core optimized
    • (NEW) Efficient CPU binding for different multi-core clusters
    • Optimized for Bus-based SMP and NUMA-Based SMP systems.
    • Efficient support for diskless clusters
    • Enhanced processor affinity using PLPA for multi-core architectures
      • Allows user-defined flexible processor affinity
  • Shared memory optimizations for collective communication operations. Available for all interfaces (IB, iWARP and uDAPL).
    • (NEW) K-nomial tree-based solution together with shared memory-based broadcast for scalable MPI_Bcast operations
    • Optimized and tuned MPI_Alltoall
    • Efficient algorithms and optimizations for Barrier, Reduce and All-reduce operations
  • Integrated multi-rail communication support. Available for OpenFabrics (IB and iWARP) interfaces.
    • Multiple queue pairs per port
    • Multiple ports per adapter
    • Multiple adapters
    • Support for both one-sided and point-to-point operations
    • Support for OpenFabrics-iWARP interface and RDMA CM (with IB)
  • Multi-threading support. Available for all interfaces (IB, iWARP and uDAPL), including TCP/IP.
  • High-performance optimized and scalable support for one-sided communication: Put, Get and Accumulate. Supported synchronization calls: Fence, Active Target, Passive (lock and unlock). Available for all interfaces.
    • Support for RDMA based direct one-sided communication with iWARP and RDMA CM (with IB)
    • Enhanced scalability for RDMA-based direct one-sided communication using fewer communication resources.
  • Two modes of communication progress
    • Polling
    • Blocking (enables running multiple MPI processes per processor). Available for OpenFabrics (IB and iWARP) interfaces.
  • Scalable job startup schemes
    • Enhanced and robust mpirun_rsh framework
    • (NEW) Hierarchical ssh-based schemes to nodes
    • Using in-band IB communication with MPD
    • Support for SLURM
  • Advanced AVL tree-based Resource-aware registration cache
  • Memory Hook Support provided by integration with the ptmalloc2 library. This provides safe release of memory to the operating system and is expected to benefit the memory usage of applications that make heavy use of malloc and free operations.
  • High Performance and Portable Support for multiple networks and operating systems through the uDAPL interface.
    • InfiniBand (tested with)
      • uDAPL over OpenFabrics-IB on Linux
      • uDAPL over IBTL on Solaris
    This uDAPL support is generic and can work with other networks that provide a uDAPL interface. Please note that the stability and performance of MVAPICH2 with uDAPL depend on the stability and performance of the uDAPL library used.
  • Support for TotalView debugger with mpirun_rsh framework
  • Shared Library Support, which allows existing binary MPI application programs to run.
  • ROMIO Support for MPI-IO.
    • Optimized, high-performance ADIO driver for Lustre
  • Single code base for the following platforms (Architecture, OS, Compilers, Devices and InfiniBand adapters)
    • Architecture: EM64T, Opteron, IA-32 and IBM PPC
    • Operating Systems: Linux and Solaris
    • Compilers: gcc, intel, pathscale, pgi and sun studio
    • Devices: OpenFabrics-IB, OpenFabrics-iWARP, and uDAPL; and TCP/IP
    • InfiniBand adapters (tested with):
      • Mellanox adapters with PCI-X and PCI-Express (SDR and DDR with mem-full and mem-free cards)
      • Mellanox ConnectX (DDR)
      • Mellanox ConnectX (QDR) with PCI-Express Gen2
      • (NEW) QLogic adapter (SDR)
      • (NEW) QLogic adapter (DDR) with PCI-Express Gen2
    • 10GigE adapters:
      • (tested with) Chelsio T3 adapter with iWARP support
      • (tested with) (NEW) Mellanox ConnectX-EN adapter (DDR)
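
The K-nomial tree-based broadcast listed above can be illustrated with a small sketch. This is not MVAPICH2 code: it only shows how a k-nomial tree rooted at rank 0 could be laid out and how many hops each rank is from the root. The function names (knomial_tree, hops_from_root) and the choice of radix are hypothetical, and the shared-memory stage that MVAPICH2 combines with the tree is not modeled.

```python
def knomial_tree(num_ranks, radix):
    """Build a k-nomial tree rooted at rank 0.

    Returns (parent, children): parent[r] is the rank that sends to r
    (None for the root), children[r] lists the ranks r forwards to.
    Illustrative only; not the MVAPICH2 implementation.
    """
    if num_ranks < 1 or radix < 2:
        raise ValueError("need num_ranks >= 1 and radix >= 2")
    parent = {0: None}
    children = {}
    for rank in range(num_ranks):
        # Find the lowest level at which this rank is not a subtree root;
        # its parent is the subtree root at that level.
        mask = 1
        while mask < num_ranks:
            if rank % (radix * mask) != 0:
                parent[rank] = rank - rank % (radix * mask)
                break
            mask *= radix
        # At every lower level the rank forwards to up to (radix - 1) children.
        kids = []
        mask //= radix
        while mask > 0:
            for j in range(1, radix):
                child = rank + j * mask
                if child < num_ranks:
                    kids.append(child)
            mask //= radix
        children[rank] = kids
    return parent, children


def hops_from_root(rank, parent):
    """Number of tree edges between the root and the given rank."""
    hops = 0
    while parent[rank] is not None:
        rank = parent[rank]
        hops += 1
    return hops


if __name__ == "__main__":
    parent, children = knomial_tree(16, 4)
    for rank in range(16):
        print(rank, "parent:", parent[rank], "children:", children[rank])
    print("max hops:", max(hops_from_root(r, parent) for r in range(16)))
```

With radix 2 the layout reduces to the familiar binomial tree. A larger radix makes the tree shallower, at the cost of each parent sending to more children per level.
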
The MVAPICH2 1.4 package and the project also include the following provisions:
  • Public SVN access of the codebase
  • A set of micro-benchmarks (including multi-threading latency test) for carrying out MPI-level performance evaluation after the installation
  • Public mvapich-discuss mailing list, where MVAPICH users and developers can
    • Ask each other for help and support and get prompt responses
    • Contribute patches and enhancements
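
As a rough illustration of what a latency micro-benchmark like those mentioned above measures, the sketch below times a ping-pong exchange and reports half of the average round-trip time. It is not the benchmark code shipped with MVAPICH2: it assumes a local socket pair and a helper thread instead of two MPI processes, and all names (pingpong_latency_us, echo_peer, recv_exact) are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Receive exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo_peer(sock, rounds, size):
    """Play the 'pong' side: return every message to the sender."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))


def pingpong_latency_us(size=8, iterations=1000, warmup=100):
    """Return the one-way latency estimate in microseconds.

    The estimate is (total time for the timed round trips) divided by
    (2 * iterations); warm-up rounds are excluded from the timing.
    """
    if iterations < 1 or size < 1 or warmup < 0:
        raise ValueError("need iterations >= 1, size >= 1, warmup >= 0")
    ping, pong = socket.socketpair()
    peer = threading.Thread(
        target=echo_peer, args=(pong, iterations + warmup, size)
    )
    peer.start()
    payload = b"x" * size
    try:
        for i in range(iterations + warmup):
            if i == warmup:
                start = time.perf_counter()  # start timing after warm-up
            ping.sendall(payload)
            recv_exact(ping, size)
        elapsed = time.perf_counter() - start
    finally:
        ping.close()
        peer.join()
        pong.close()
    return elapsed / iterations / 2 * 1e6


if __name__ == "__main__":
    for size in (1, 64, 1024):
        print(size, "bytes:", round(pingpong_latency_us(size), 2), "us")
```

The numbers this prints reflect the local socket and the Python interpreter, not any network adapter. Only the measurement pattern carries over to an MPI-level test.
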