Job Startup


  • Near-constant MPI and OpenSHMEM initialization time at any process count
  • 10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
  • Memory consumption reduced by O(ppn), where ppn is the number of processes per node
  • 1 GB of memory saved per node at 1M processes with 16 processes per node

Best Practices for Enhancing Job Startup Performance with MVAPICH2

Launcher Agnostic Optimizations

  • Configuration options
    • ./configure --disable-rdma-cm
      • Disable RDMA_CM
  • Runtime parameters
      • Set to 1 for homogeneous clusters
      • Job size after which on-demand connection management should be enabled
      • Set to 1 so that jobs of all sizes use on-demand connection management
      • Set to 1 to enable on-demand creation of UD address handles when MVAPICH2 is run in hybrid mode
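
The idea behind on-demand connection management can be illustrated with a toy count of connection endpoints. This is only a sketch, not MVAPICH2 code: the function names and the nearest-neighbor ring pattern are hypothetical, chosen to show why creating connections lazily on first use helps when each process talks to only a few peers.

```python
# Toy model: count connection endpoints set up eagerly versus on demand.
# Not MVAPICH2 code; the ring pattern and all names are illustrative assumptions.

def eager_connections(nprocs: int) -> int:
    """Every process sets up an endpoint to every other process at startup."""
    return nprocs * (nprocs - 1)


def on_demand_connections(nprocs: int, peers_of) -> int:
    """Each process sets up endpoints only to the peers it actually sends to."""
    established = set()
    for rank in range(nprocs):
        for peer in peers_of(rank, nprocs):
            if peer != rank:
                established.add((rank, peer))  # created lazily on first send
    return len(established)


def ring_peers(rank: int, nprocs: int):
    """Hypothetical pattern: each rank talks to its left and right neighbor."""
    return [(rank - 1) % nprocs, (rank + 1) % nprocs]


if __name__ == "__main__":
    n = 16384
    print("eager:    ", eager_connections(n))
    print("on demand:", on_demand_connections(n, ring_peers))
```

Under this assumed pattern the eager count grows quadratically with the job size, while the on-demand count grows linearly. Applications with denser communication patterns would see a smaller gap.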

Optimizations for SLURM

  • Use PMI2
    • ./configure --with-pm=slurm --with-pmi=pmi2
    • srun --mpi=pmi2 ./a.out
  • Use PMI Extensions
    • A patch is available for SLURM 15 and 16
    • PMI Extensions are automatically detected and used by MVAPICH2 2.2rc1 and higher

Optimizations for mpirun_rsh

    • Degree of the hierarchical tree used by mpirun_rsh
    • Number of nodes beyond which hierarchical-ssh scheme is used
    • Number of nodes beyond which file-based communication is used for hierarchical-ssh during start up
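
To get a feel for how the tree degree affects a hierarchical launch, the following sketch counts how many levels a launch tree needs to reach a given number of nodes. It assumes a simplified model in which every launcher at one level starts `degree` launchers at the next; the function name and the model are illustrative and do not describe how mpirun_rsh is actually implemented.

```python
# Simplified model of a hierarchical launch tree (an assumption, not mpirun_rsh code).

def tree_levels(num_nodes: int, degree: int) -> int:
    """Smallest number of levels L such that degree**L >= num_nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if degree < 2:
        raise ValueError("degree must be at least 2")
    levels = 0
    reachable = 1  # nodes reachable with zero levels of fan-out
    while reachable < num_nodes:
        reachable *= degree
        levels += 1
    return levels


if __name__ == "__main__":
    for degree in (2, 8, 32):
        print(f"degree {degree:2d}: {tree_levels(1024, degree)} levels for 1024 nodes")
```

A larger degree gives a shallower tree but makes each launcher start more children, so the best value depends on the cluster.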

Machine Specifications

CPU Model | CPU Core Info | Memory | IB Card | IB Switch | OS | OFED
Intel E5-2680 | 2x8 @ 2.7 GHz | 32 GB | Mellanox ConnectX-3 (56 Gbps) | Mellanox FDR IB Switch | CentOS 6.7 | OFED

Shared Memory based PMI

  • Introduced in MVAPICH2-2.2rc1, available with srun when the SLURM patch is applied
  • PMI Get takes 0.25 ms with 32 ppn
  • 1,000 times reduction in PMI Get latency compared to the default socket-based protocol
  • Memory footprint reduced by O(processes per node), approximately 1 GB at 1M processes with 16 ppn
  • Backward compatible, negligible overhead
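
One way to read the O(processes per node) memory reduction is with a back-of-envelope model. The sketch below assumes that, without shared memory, every process on a node keeps its own copy of one key-value entry per process in the job, and that with shared memory the node keeps a single copy. Both this model and the 64-byte entry size are assumptions made for illustration; they are not stated in the source.

```python
# Back-of-envelope model (an assumption): per-process copies vs one shared copy per node.

def saved_bytes_per_node(nprocs: int, ppn: int, entry_bytes: int) -> int:
    """Memory saved on one node when ppn private copies shrink to one shared copy."""
    private_total = ppn * nprocs * entry_bytes  # every local process holds all entries
    shared_total = nprocs * entry_bytes         # one copy shared by the whole node
    return private_total - shared_total


def implied_entry_bytes(saved_bytes: float, nprocs: int, ppn: int) -> float:
    """Entry size that would produce a given saving under the same model."""
    return saved_bytes / ((ppn - 1) * nprocs)


if __name__ == "__main__":
    saved = saved_bytes_per_node(nprocs=1_000_000, ppn=16, entry_bytes=64)
    print(f"saved per node: {saved / 1e9:.2f} GB")
    print(f"entry size implied by 1 GB: {implied_entry_bytes(1e9, 1_000_000, 16):.1f} bytes")
```

Under these assumptions, an entry of a few tens of bytes per process is enough to reach a saving on the order of 1 GB per node at 1M processes and 16 ppn.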

Non-blocking PMI

  • Introduced in MVAPICH2-2.2b; enabled by default with mpirun_rsh, and with srun when the SLURM patch is applied
  • Near-constant MPI_Init at any scale
  • MPI_Init with Iallgather performs 288% better than the default based on Fence
  • Blocking Allgather is 38% faster than blocking Fence

PMI Ring Extension

  • Superseded by non-blocking PMI
  • MPI_Init based on PMIX_Ring performs 34% better compared to the default PMI2_KVS_Fence
  • Hello World runs 33% faster with 8K processes
  • Up to 20% improvement in total execution time of NAS parallel benchmarks

On-demand Connection

  • Introduced in MVAPICH2-X 2.1rc1
  • Initialization is 29.6 times faster
  • Hello World runs 8.31 times faster
  • Execution time of NAS benchmarks improved by up to 35% with 256 processes and class B data
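
The figures above mix "times faster" and "percent improved". Assuming that "N times faster" means the old time divided by the new time equals N, which the source does not state explicitly, a speedup factor can be converted to a percentage reduction in time as follows. The helper name is illustrative.

```python
# Convert a speedup factor into a percentage reduction in time.
# Assumes speedup = old_time / new_time (an interpretation, not stated in the source).

def time_reduction_percent(speedup: float) -> float:
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    for factor in (29.6, 8.31):
        print(f"{factor}x faster -> {time_reduction_percent(factor):.1f}% less time")
```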