MVAPICH2-X 2.0rc1 Features

MVAPICH2-X provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port the parts of large MPI applications that are suited to the PGAS programming model, minimizing the development overhead that has been a major deterrent to adopting PGAS models in existing MPI applications. The unified runtime also delivers better performance than combining separate MPI, UPC, and OpenSHMEM libraries by optimizing the use of network and memory resources. MVAPICH2-X supports pure MPI programs, MPI+OpenMP programs, pure UPC, pure OpenSHMEM, as well as hybrid MPI(+OpenMP) + PGAS programs. MVAPICH2-X supports UPC and OpenSHMEM as PGAS models. High-level features of MVAPICH2-X are listed below.
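
As a concrete flavor of this hybrid model, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in a single program served by one runtime. It is a minimal illustration, not an example from the MVAPICH2-X distribution: the initialization order shown (MPI_Init followed by start_pes) and the use of the symmetric heap follow the MPI-3 and OpenSHMEM 1.0 APIs, but the MVAPICH2-X user guide remains the authoritative reference for the supported hybrid usage.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        /* Hybrid initialization: both models are served by the same unified
         * runtime in MVAPICH2-X (the ordering here is an assumption; see the
         * user guide for the supported sequence). */
        MPI_Init(&argc, &argv);
        start_pes(0);                       /* OpenSHMEM 1.0-style initialization */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Symmetric-heap buffer visible to all PEs; assumes MPI ranks and
         * OpenSHMEM PE numbers coincide, as under a unified runtime. */
        long *target = (long *) shmalloc(sizeof(long));
        *target = 0;
        shmem_barrier_all();

        /* One-sided OpenSHMEM put: PE 0 writes into its right neighbor. */
        long value = 42;
        if (rank == 0)
            shmem_long_put(target, &value, 1, (rank + 1) % size);
        shmem_barrier_all();

        /* MPI collective over the same data in the same program. */
        long sum = 0;
        MPI_Allreduce(target, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of target values across %d ranks = %ld\n", size, sum);

        shfree(target);
        MPI_Finalize();
        return 0;
    }

Such a program is typically built with the wrapper compilers shipped with MVAPICH2-X and launched with a single process manager (for example mpirun_rsh or oshrun; see the unified-runtime features below).
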
  • MPI Features
    • Support for MPI-3 features
    • Based on MVAPICH2 2.0rc1 (OFA-IB-CH3 interface). MPI programs can take advantage of all the features enabled by default in the OFA-IB-CH3 interface of MVAPICH2 2.0rc1
      • High-performance two-sided communication scalable to thousands of nodes
      • Optimized collective communication operations
        • Shared-memory optimized algorithms for barrier, broadcast, reduce and allreduce operations
        • Optimized two-level designs for scatter and gather operations
        • Improved implementation of allgather, alltoall operations
      • High-performance and scalable support for one-sided communication
        • Direct RDMA based designs for one-sided communication
        • Shared-memory-backed windows for one-sided communication
        • Support for truly passive locking for intra-node RMA in shared-memory-backed windows (a minimal sketch of passive-target RMA appears after this feature list)
      • Multi-threading support
        • Enhanced support for multi-threaded MPI applications
  • Unified Parallel C (UPC) Features
    • UPC Language Specification v1.2 standard compliance
    • (NEW) Based on Berkeley UPC v2.18.0 (contains changes/additions in preparation for UPC 1.3 specification)
    • Optimized RDMA-based implementation of UPC data movement routines
    • Improved UPC memput design for small and medium message sizes (see the UPC sketch after this feature list)
    • (NEW) Support for GNU UPC translator
    • (NEW) Optimized UPC collectives (Improved performance for upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, and upc_all_exchange)
  • OpenSHMEM Features
    • OpenSHMEM v1.0 standard compliance
    • Based on OpenSHMEM reference implementation v1.0f
    • Optimized RDMA-based implementation of OpenSHMEM data movement routines
    • Support for OpenSHMEM 'shmem_ptr' functionality (see the OpenSHMEM sketch after this feature list)
    • Efficient implementation of OpenSHMEM atomics using RDMA atomics
    • Optimized OpenSHMEM put routines for small/medium message sizes
    • (NEW) Optimized OpenSHMEM collectives (Improved performance for shmem_collect, shmem_fcollect, shmem_barrier, shmem_reduce and shmem_broadcast)
    • (NEW) Optimized 'shmalloc' routine
    • (NEW) Improved intra-node communication performance using shared memory and CMA designs
  • Hybrid Program Features
    • Supports hybrid programming using MPI(+OpenMP), MPI(+OpenMP)+UPC and MPI(+OpenMP)+OpenSHMEM
    • Compliance with the MPI-3, UPC v1.2, and OpenSHMEM v1.0 standards
    • Optimized network resource utilization through the unified communication runtime
    • Efficient deadlock-free progress of MPI and UPC/OpenSHMEM calls
  • Unified Runtime Features
    • Based on MVAPICH2 2.0rc1 (OFA-IB-CH3 interface). MPI, UPC, OpenSHMEM, and hybrid programs benefit from the features listed below
      • Scalable inter-node communication with high performance and reduced memory usage
        • Integrated RC/XRC design to achieve the best performance on large-scale systems with a reduced/constant memory footprint
        • RDMA Fast Path connections for efficient small message communication
        • Shared Receive Queue (SRQ) with flow control to significantly reduce the memory footprint of the library
        • AVL tree-based resource-aware registration cache
        • Automatic tuning based on network adapter and host architecture
      • Optimized intra-node communication support by taking advantage of shared-memory communication
        • Efficient buffer organization for memory scalability of intra-node communication
        • Automatic intra-node communication parameter tuning based on platform
      • Flexible CPU binding capabilities
        • Portable Hardware Locality (hwloc v1.8) support for defining CPU affinity
        • Efficient CPU binding policies (bunch and scatter patterns, socket and numanode granularities) to specify CPU binding per job for modern multi-core platforms
        • Allow user-defined flexible processor affinity
      • Two modes of communication progress
        • Polling
        • Blocking (enables running multiple processes per processor)
    • Flexible process manager support
      • Support for mpirun_rsh, hydra, upcrun, and oshrun process managers
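
The one-sided MPI features above (direct RDMA-based design, shared-memory-backed windows, and truly passive locking) are exercised through standard MPI-3 RMA calls. The following is a minimal sketch using only the MPI-3 API, with no MVAPICH2-X-specific calls: rank 0 updates every other rank's window inside passive-target lock/unlock epochs, and MPI_Win_allocate lets the library place the window memory where it can best serve on-node peers, for example in shared memory.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Let the library allocate the window memory so it can be backed by
         * shared memory for ranks on the same node. */
        int *base;
        MPI_Win win;
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = 0;
        MPI_Barrier(MPI_COMM_WORLD);

        /* Passive-target epoch: rank 0 writes into every other rank's window
         * without any matching call on the target side. */
        if (rank == 0) {
            for (int peer = 1; peer < size; peer++) {
                int val = peer * 10;
                MPI_Win_lock(MPI_LOCK_EXCLUSIVE, peer, 0, win);
                MPI_Put(&val, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
                MPI_Win_unlock(peer, win);
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);

        /* Lock the local window before reading so the remote update is
         * visible regardless of the window memory model. */
        MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
        printf("rank %d sees %d in its window\n", rank, *base);
        MPI_Win_unlock(rank, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }
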
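The UPC data-movement and collective features can be pictured with the sketch below, written against the UPC 1.2 language and collectives library only; the array sizes and names are illustrative, and the program would be compiled with the UPC compiler shipped with MVAPICH2-X. Each thread pushes a private buffer into a neighbor's shared block with upc_memput, and thread 0's block is then distributed with upc_all_broadcast.

    #include <upc.h>
    #include <upc_collective.h>
    #include <stdio.h>

    #define NELEMS 16

    shared [NELEMS] int bulk[NELEMS * THREADS];   /* target of upc_memput      */
    shared [NELEMS] int bcast[NELEMS * THREADS];  /* destination of broadcast  */
    shared [NELEMS] int src[NELEMS * THREADS];    /* source block on thread 0  */

    int main(void)
    {
        int local[NELEMS];

        /* Each thread fills a private buffer and pushes it into the block
         * owned by its right neighbor; upc_memput is one of the data-movement
         * routines MVAPICH2-X implements directly over RDMA. */
        for (int i = 0; i < NELEMS; i++)
            local[i] = MYTHREAD * 100 + i;
        upc_memput(&bulk[((MYTHREAD + 1) % THREADS) * NELEMS],
                   local, NELEMS * sizeof(int));
        upc_barrier;

        /* upc_all_broadcast: thread 0's block of src is copied into every
         * thread's block of bcast. */
        if (MYTHREAD == 0)
            for (int i = 0; i < NELEMS; i++)
                src[i] = i;
        upc_barrier;
        upc_all_broadcast(bcast, src, NELEMS * sizeof(int),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

        if (MYTHREAD == 0)
            printf("thread 0: bulk[0] = %d, bcast[0] = %d\n", bulk[0], bcast[0]);
        return 0;
    }
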
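Similarly, the OpenSHMEM atomics and shmem_ptr features map onto ordinary OpenSHMEM 1.0 calls. The sketch below is illustrative only: every PE performs an atomic fetch-and-add on a symmetric counter owned by PE 0 (which MVAPICH2-X can serve with InfiniBand RDMA atomics), and shmem_ptr is used to obtain a direct load/store address for a peer when the library provides one, for example for an on-node PE.

    #include <shmem.h>
    #include <stdio.h>

    /* Symmetric variable: exists at the same address on every PE. */
    long counter = 0;

    int main(void)
    {
        start_pes(0);                        /* OpenSHMEM 1.0 initialization */
        int me   = _my_pe();
        int npes = _num_pes();

        /* Atomic fetch-and-add on PE 0's counter (the fetched value is
         * ignored here); this is the kind of operation that can be mapped
         * onto network RDMA atomics. */
        shmem_long_fadd(&counter, 1, 0);
        shmem_barrier_all();

        if (me == 0)
            printf("counter on PE 0 = %ld (expected %d)\n", counter, npes);

        /* shmem_ptr returns a load/store-able address when the remote PE's
         * memory is directly reachable (e.g. an on-node peer); otherwise NULL. */
        long *peer = (long *) shmem_ptr(&counter, (me + 1) % npes);
        if (me == 0 && peer != NULL)
            printf("PE 0 can access PE %d's counter directly: %ld\n",
                   (me + 1) % npes, *peer);

        shmem_barrier_all();
        return 0;
    }
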