MVAPICH/MVAPICH2 Project
Ohio State University




MVAPICH 1.2 Features and Supported Platforms

New features compared to MVAPICH 1.1 are marked as (NEW).

  • Single code base with multiple underlying transport interfaces:
    • OpenFabrics/Gen2
      • This interface offers the highest-performing and most scalable feature set, such as eXtended Reliable Connection (XRC), SRQ, multi-core-aware shared-memory collectives, on-demand connection management, lock-free asynchronous progress, etc.
    • OpenFabrics/Gen2-Hybrid
      • This interface is targeted at emerging clusters with multi-thousand cores, delivering the best performance and scalability with a constant memory footprint for communication contexts.
      • Provides capabilities to use the Unreliable Datagram (UD), Reliable Connection (RC) and eXtended Reliable Connection (XRC) transports of InfiniBand.
    • (NEW) OpenFabrics/Gen2-RDMAoE
      • This interface supports the emerging RDMAoE (RDMA over Ethernet) interface for Mellanox ConnectX-EN adapters with 10GigE switches.
    • Shared-Memory only channel
      • This interface is useful for running MPI jobs on multi-processor systems without any high-performance network: for example, multi-core servers, desktops, and laptops, as well as clusters with serial nodes.
    • QLogic InfiniPath
      • This interface provides native support for InfiniPath adapters from QLogic. It provides high-performance point-to-point communication as well as optimized collectives (MPI_Bcast and MPI_Barrier) with k-nomial algorithms while exploiting multi-core architecture.
    • TCP/IP
      • The standard TCP/IP interface (provided by MPICH) works with a wide range of networks and can also be used with the IPoIB support of InfiniBand. However, it does not deliver performance and scalability comparable to the other interfaces.
    (Please note that the VAPI (single-rail and multi-rail), OpenFabrics/Gen2 (multi-rail) and uDAPL interfaces have been deprecated from the MVAPICH code base starting with version 1.1. To take advantage of integrated multi-rail support with OpenFabrics/Gen2 and of uDAPL support, users can use the MVAPICH2 codebase.)
  • Scalable and robust job startup
    • Enhanced and robust mpirun_rsh framework to provide scalable launching on multi-thousand node clusters
      • Running time of an "MPI Hello World" program is around 4 sec on 1K cores and around 80 sec on 32K cores (a generic sketch of such a program is shown below)
      • Available for OpenFabrics/Gen2, OpenFabrics/Gen2-Hybrid and QLogic InfiniPath devices
    • Support for SLURM
      • Available for OpenFabrics/Gen2, OpenFabrics/Gen2-Hybrid and QLogic InfiniPath devices
    • Flexibility for using rsh/ssh-based startup
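The "MPI Hello World" program referred to in the timing numbers above is a trivial MPI program along the following lines. This is only a generic sketch, not code shipped with MVAPICH; the file name and the build/launch commands in the comment are illustrative of the usual MPI workflow rather than MVAPICH-specific syntax.

    /* Minimal "MPI Hello World" of the kind timed above.
     * Typically built with the MPI compiler wrapper (e.g. mpicc hello.c -o hello)
     * and launched with mpirun_rsh or SLURM. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }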
  • Designs for scaling to multi-thousand nodes with highest performance and minimal memory usage (using OpenFabrics/Gen2-Hybrid interface)
    • Delivers performance and scalability with near constant memory footprint for communication contexts
    • Adaptive selection during run-time (based on application and systems characteristics) to switch between RC and UD (or between XRC and UD) transports
    • Zero-copy protocol with UD for large data transfer
    • Multiple buffer organizations with XRC support
    • Shared memory communication between cores within a node
    • Multi-core optimized collectives (MPI_Bcast, MPI_Barrier, MPI_Reduce and MPI_Allreduce)
    • Enhanced MPI_Allgather collective
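As an illustration of the multi-core optimized collectives listed above, the following generic fragment exercises MPI_Bcast and MPI_Allreduce. It uses only standard MPI calls; the shared-memory and multi-core optimizations are applied transparently inside the library, so no MVAPICH-specific API is involved.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        long local_sum, global_sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Root broadcasts a value; intra-node parts may use the
         * shared-memory, multi-core aware MPI_Bcast described above. */
        if (rank == 0)
            value = 42;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Each rank contributes to a global sum via MPI_Allreduce. */
        local_sum = rank + value;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("broadcast value %d, global sum %ld\n", value, global_sum);

        MPI_Finalize();
        return 0;
    }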
  • Designs for scaling to multi-thousand nodes with highest performance and reduced memory usage (using OpenFabrics/Gen2 interface)
    • eXtended Reliable Connection (XRC) support
    • Message coalescing support to reduce the number of per-queue-pair send queues, lowering the memory requirement on large-scale clusters. This design also significantly increases the small-message messaging rate.
    • Enhanced coalescing support with varying degree of coalescing
    • Asynchronous and scalable on-demand connection management using native InfiniBand Unreliable Datagram (UD) support. This feature enables InfiniBand connections to be set up dynamically, enhancing the scalability of MVAPICH on clusters of thousands of nodes.
    • Shared Receive Queue (SRQ) support with flow control. The new design uses significantly less memory for the MPI library.
    • Adaptive RDMA Fast Path
    • Lock-free design to provide support for asynchronous progress at both sender and receiver to overlap computation and communication
    • Multi-pathing support leveraging LMC mechanism to avoid hotspots on large fabrics
    • Multi-port/Multi-HCA support for enabling user processes to bind to different IB ports for balanced communication performance on multi-core platforms with multiple HCAs and/or ports.
  • Optimized intra-node communication support by taking advantage of shared-memory communication
    • Multi-core aware scalable shared memory design
    • Efficient support for diskless clusters
    • Bus-based SMP systems
    • NUMA-based SMP systems
    • Processor affinity, with flexible user-defined mappings for better resource utilization on multi-core systems (see the sketch after this list)
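Processor affinity itself is configured through MVAPICH run-time settings documented in the user guide. The sketch below is only a generic way for each rank to report the CPU set it is actually bound to, using the standard Linux sched_getaffinity() call; it is independent of MVAPICH and added here purely for illustration.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, cpu;
        cpu_set_t mask;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Report which CPUs this rank is allowed to run on. */
        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
            for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask))
                    printf("rank %d may run on CPU %d\n", rank, cpu);
        }

        MPI_Finalize();
        return 0;
    }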
  • Support for Fault Tolerance
    • Mem-to-mem reliable data transfer (detection of I/O bus errors with 32-bit CRC). This mode enables MVAPICH to deliver messages reliably in the presence of I/O errors.
    • Network-level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand
    • (NEW) Network Fault Resiliency (NFR) for tolerating transient network failures. Using this feature, long-running MPI codes can tolerate network failures: InfiniBand HCAs are automatically reset and MPI programs can continue computation without having to restart.
  • Single codebase for the following platforms (architectures, operating systems, compilers, devices, and InfiniBand adapters):
    • Architecture: EM64T, Opteron, IA-32 and IBM PPC
    • Operating Systems: Linux and Mac OSX
    • Compilers: gcc, intel, pathscale, pgi and sun studio
    • Devices: OpenFabrics/Gen2, OpenFabrics/Gen2-Hybrid, (NEW) OpenFabrics/Gen2-RDMAoE, shared-memory, QLogic/InfiniPath and TCP/IP
    • InfiniBand adapters (tested with):
      • Mellanox adapters with PCI-X and PCI-Express (SDR and DDR with mem-full and mem-free cards)
      • Mellanox ConnectX (DDR)
      • Mellanox ConnectX (QDR) with PCI-Express Gen2
      • QLogic/InfiniPath (DDR) with PCI-Express Gen2
    • 10GigE adapters (tested with):
      • (NEW) Mellanox ConnectX-EN adapter (DDR)
  • Optimized RDMA Write-based scheme for Eager protocol (short message transfer)
  • Optimized implementation of Rendezvous protocol (large message transfer) for better computation-communication overlap and progress
    • RDMA Write-based
    • RDMA Read-based
    • RDMA Read with Asynchronous Progress
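The computation-communication overlap enabled by the RDMA Read based rendezvous designs above is exploited at the application level through non-blocking operations. The generic pattern below assumes at least two ranks and a message large enough to cross the rendezvous threshold; the message size and the placeholder compute routine are arbitrary.

    #include <mpi.h>
    #include <stdlib.h>

    #define COUNT (1 << 20)   /* large message, expected to use the rendezvous protocol */

    static void do_independent_compute(void) { /* application work goes here */ }

    int main(int argc, char **argv)
    {
        int rank, size;
        double *buf;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        buf = malloc(COUNT * sizeof(double));

        if (size >= 2 && rank == 0)
            /* Start a large non-blocking send ... */
            MPI_Isend(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        else if (size >= 2 && rank == 1)
            /* ... or the matching non-blocking receive ... */
            MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        if (size >= 2 && rank < 2) {
            /* ... then compute while the library progresses the transfer. */
            do_independent_compute();
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }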
  • Two modes of communication progress
    • Polling
    • Blocking (enables running multiple MPI processes per processor)
  • Advanced AVL-tree-based, resource-aware registration cache
    • Memory hook support provided by integration with the ptmalloc2 library. This provides safe release of memory to the operating system and is expected to benefit the memory usage of applications that frequently call malloc and free.
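The registration cache and memory hooks are aimed at communication patterns that repeatedly allocate, use, and free buffers. The generic pattern below is the kind of code these designs are meant to keep both safe and fast; the message size and iteration count are arbitrary.

    #include <mpi.h>
    #include <stdlib.h>

    #define N     65536
    #define ITERS 100

    int main(int argc, char **argv)
    {
        int rank, size, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Ranks 0 and 1 repeatedly allocate, communicate from, and free a
         * buffer.  A registration cache avoids re-registering the same pages
         * on every iteration, and memory hooks keep the cache consistent when
         * free() returns pages to the operating system. */
        for (i = 0; i < ITERS && size >= 2 && rank < 2; i++) {
            double *buf = malloc(N * sizeof(double));
            if (rank == 0)
                MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }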
  • High performance and scalable collective communication support
    • Optimized, high-performance collective operations for multi-core platforms: Shared Memory MPI_Bcast and Enhanced MPI_Allgather.
    • Tuning and Optimization of various collective algorithms for a wide range of system sizes and network adapter characteristics
  • Schemes for minimizing memory resource usage on large scale systems
    • Automatic tuning for small, medium and large clusters
    • Shared Receive Queue (SRQ) support
    • On-Demand Connection Management
  • Shared library support, allowing existing MPI application binaries to run
  • Shared library support for Solaris
  • ROMIO support for MPI-IO (a generic usage sketch follows this item)
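ROMIO exposes the standard MPI-IO interface. The sketch below writes one block per rank at a rank-dependent offset; it is generic MPI-IO code, and the file name "out.dat" is just an example.

    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, i, data[N];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < N; i++)
            data[i] = rank;

        /* Each rank writes its own block of the shared file. */
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(data), data, N,
                          MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }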
  • Enhanced support for TotalView debugger
  • Integrated and easy-to-use build script which automatically detects the system architecture and InfiniBand adapter type and optimizes MVAPICH for the particular installation
  • Tuned thresholds and associated optimizations for
    • different architectures/platforms mentioned above
    • different memory/system bus characteristics
    • different network interfaces (PCI-X and PCI-Express with SDR and DDR; PCI-Express Gen2 with QDR; and the IBM eHCA adapter with GX interface)
    • different networks enabled by multiple devices/interfaces
  • Incorporates a set of runtime and compile time tunable parameters (at MPI and network layers) for convenient tuning on
    • large scale systems
    • future platforms
The MVAPICH 1.2 package and the project also include the following provisions:
  • Public SVN access of the codebase
  • A well-documented user guide
  • A set of micro-benchmarks for carrying out MPI-level performance evaluation after the installation (a simplified latency test in this spirit is sketched after this list)
  • Public mvapich-discuss mailing list for MVAPICH users to
    • ask for help and support from each other and get prompt responses
    • enable users and developers to contribute patches and enhancements
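The bundled micro-benchmarks mentioned above include point-to-point latency and bandwidth tests. The following is only a simplified ping-pong sketch in the same spirit, not the actual benchmark code; the message size and iteration count are arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 1000
    #define SIZE  8        /* message size in bytes */

    int main(int argc, char **argv)
    {
        int rank, i;
        char buf[SIZE];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ping-pong between ranks 0 and 1; half the average round-trip
         * time estimates the one-way latency. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("average one-way latency: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }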