MVAPICH :: Benchmarks

OSU Micro-Benchmarks 7.5 (11/01/24) [Tarball]

Please see CHANGES for the full changelog.
You may also take a look at the appropriate README files for more information.

C Benchmarks README.
Java Benchmarks README.
Python Benchmarks README.

The benchmarks are available under the BSD license.

Here, we list the benchmarks that are part of the OMB package in the C, Java, and Python programming languages for parallel programming models such as MPI, OpenSHMEM, UPC, UPC++, and NCCL. A high-level description of these benchmarks is provided below:
- C Benchmarks
  - MPI
    - Host-based Benchmarks
      - Point-to-Point MPI Benchmarks: Latency, multi-threaded latency, multi-pair latency, multiple bandwidth / message rate test, bandwidth, bidirectional bandwidth
      - Blocking Collective MPI Benchmarks: Collective latency tests for various MPI collective operations such as MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter, and vector collectives.
      - Non-Blocking Collective (NBC) MPI Benchmarks: Collective latency and Overlap tests for various MPI collective operations such as MPI_Iallgather, MPI_Iallreduce, MPI_Ialltoall, MPI_Ibarrier, MPI_Ibcast, MPI_Igather, MPI_Ireduce, MPI_Iscatter and vector collectives.
      - One-sided MPI Benchmarks: one-sided put latency, one-sided put bandwidth, one-sided put bidirectional bandwidth, one-sided get latency, one-sided get bandwidth, one-sided accumulate latency, compare and swap latency, fetch and operate and get_accumulate latency for MVAPICH2 (MPI-2 and MPI-3).
      - Startup Benchmarks: osu_init, osu_hello
    - Device-based Benchmarks
      - CUDA, ROCm, and OpenACC Extensions to OSU Micro Benchmarks
      - Support for CUDA Managed Memory
  - OpenSHMEM Benchmarks
    - Point-to-Point OpenSHMEM Benchmarks: put latency, get latency, message rate, atomics
    - Collective OpenSHMEM Benchmarks: collect latency, broadcast latency, reduce latency, and barrier latency
  - UPC Benchmarks
    - Point-to-Point UPC Benchmarks: put latency, get latency
    - Collective UPC Benchmarks: broadcast latency, scatter latency, gather latency, all_gather latency, and exchange latency
  - UPC++ Benchmarks
    - Point-to-Point UPC++ Benchmarks: async copy put latency, async copy get latency
    - Collective UPC++ Benchmarks: broadcast latency, scatter latency, gather latency, reduce latency, all_gather latency, and all_to_all latency
  - NCCL Benchmarks
    - Point-to-Point NCCL Benchmarks: NCCL-based Latency, bandwidth, bidirectional bandwidth
    - Collective NCCL Benchmarks: Collective latency tests for various NCCL collective operations such as Allgather, Allreduce, Bcast, Reduce, Reduce_Scatter, Alltoall.
- Java Benchmarks
  - Point-to-Point Java Bindings Benchmarks: Latency, Bandwidth, Bidirectional Bandwidth, Bandwidth Test for Open MPI Java Bindings, Bidirectional Bandwidth Test for Open MPI Java Bindings
  - Collective Java Bindings Benchmarks: Collective latency tests for various MPI collective operations such as MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter, MPI_Scatterv
- Python Benchmarks
  - Point-to-Point Python Benchmarks: Latency, Bandwidth, Bidirectional Bandwidth, Multi-pair Latency
  - Collective Python Benchmarks: Collective latency tests for various MPI collective operations such as MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter, MPI_Scatterv

Please note that there are many different ways to measure these performance parameters. For example, the bandwidth test can have different variations regarding the types of MPI calls (blocking vs. non-blocking) being used, total number of back-to-back messages sent in one iteration, number of iterations, etc. Other ways to measure bandwidth may give different numbers. Readers are welcome to use other tests, as appropriate to their application environments.
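
To illustrate why such variations change the reported number, here is a minimal sketch of how a windowed bandwidth figure can be derived from the test parameters. The function name, the decimal-megabyte convention, and the timings are assumptions made for illustration; they are not taken from the OMB sources or from any measurement.

```python
def bandwidth_mb_per_s(message_bytes, window_size, iterations, elapsed_seconds):
    """Return bandwidth in MB/s (1 MB = 1e6 bytes) for a windowed bandwidth test.

    Assumption: each timed iteration sends `window_size` back-to-back messages
    of `message_bytes` bytes, and `elapsed_seconds` covers all timed iterations.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = message_bytes * window_size * iterations
    return total_bytes / 1e6 / elapsed_seconds


# Same total data volume, different test parameters. The elapsed times are
# made-up inputs, chosen only to show that the result depends on the setup.
windowed = bandwidth_mb_per_s(1_000_000, 64, 100, 0.8)
single = bandwidth_mb_per_s(1_000_000, 1, 6400, 1.6)
print(f"windowed: {windowed:.1f} MB/s, single: {single:.1f} MB/s")
```

Two tests that move the same number of bytes can therefore report different bandwidths when the window size or iteration count changes how well transfers are pipelined.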

C Benchmarks

All C Benchmarks have the ability to evaluate the correctness of the data exchanged through in-built data validation schemes in addition to evaluating the communication performance.

MPI Benchmarks

Host-based Benchmarks

Point-to-Point MPI Benchmarks

osu_latency - Latency Test
osu_latency_mt - Multi-threaded Latency Test
osu_latency_mp - Multi-process Latency Test
osu_bw - Bandwidth Test
osu_bibw - Bidirectional Bandwidth Test
osu_mbw_mr - Multiple Bandwidth / Message Rate Test
osu_multi_lat - Multi-pair Latency Test

Blocking Collective MPI Benchmarks

The latest OMB version includes benchmarks for various MPI blocking collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter, and vector collectives). These benchmarks work as follows. When the osu_bcast benchmark is run with N processes, it measures the minimum, maximum, and average latency of the MPI_Bcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options:
- "-f" reports additional statistics of the benchmark, such as min and max latencies and the number of iterations.
- "-m" sets the minimum and maximum message length to be used in a benchmark. By default, the benchmarks report latencies for message lengths up to 1MB. Examples:
  - -m 128 // min = default, max = 128
  - -m 2:128 // min = 2, max = 128
  - -m 2: // min = 2, max = default
- "-x" sets the number of warmup iterations to skip for each message length.
- "-i" sets the number of iterations to run for each message length.
- "-M" sets the per-process maximum memory consumption. By default, the benchmarks are limited to 512MB allocations.
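
The "-m" forms listed above can be read as a small grammar. The following sketch parses them in Python; it is not OMB's actual option parser, and the default minimum of 1 byte and the reading of 1MB as 2^20 bytes are assumptions made for illustration.

```python
DEFAULT_MIN = 1          # assumed default minimum message length (bytes)
DEFAULT_MAX = 1 << 20    # assumed reading of the 1MB default maximum (bytes)


def parse_message_range(arg, default_min=DEFAULT_MIN, default_max=DEFAULT_MAX):
    """Parse a '-m' style argument into (min_size, max_size).

    'MAX'      -> (default_min, MAX)
    'MIN:MAX'  -> (MIN, MAX)
    'MIN:'     -> (MIN, default_max)
    """
    if ":" in arg:
        low, high = arg.split(":", 1)
        min_size = int(low) if low else default_min
        max_size = int(high) if high else default_max
    else:
        min_size, max_size = default_min, int(arg)
    if min_size < 0 or max_size < min_size:
        raise ValueError(f"invalid message range: {arg!r}")
    return min_size, max_size


for example in ("128", "2:128", "2:"):
    print(example, "->", parse_message_range(example))
```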

osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_barrier - MPI_Barrier Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test

Non-Blocking Collective MPI Benchmarks

In addition to the blocking collective latency tests mentioned above, we provide benchmarks for several non-blocking collectives (NBC): MPI_Iallgather, MPI_Iallgatherv, MPI_Iallreduce, MPI_Ialltoall, MPI_Ialltoallv, MPI_Ialltoallw, MPI_Ibarrier, MPI_Ibcast, MPI_Igather, MPI_Igatherv, MPI_Ireduce, MPI_Iscatter, and MPI_Iscatterv. These evaluate the same metrics as the blocking operations, plus an additional metric, "overlap", defined as the amount of computation that can be performed while the communication progresses in the background. These benchmarks have the following additional options:
- "-t" sets the number of MPI_Test() calls during the dummy computation; set CALLS to 100, 1000, or any number > 0.
- "-r" sets the target for the dummy computation, which imitates useful computation that can be overlapped with the communication. Because CUDA-aware support is provided for NBC as well, this option can be set to CPU, GPU, or BOTH.
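
One way to turn the overlap definition above into a percentage is sketched below. The formula and the three timing inputs are assumptions chosen for illustration; the text above does not state the exact expression the benchmarks use.

```python
def overlap_percent(comm_time, compute_time, total_time):
    """Estimate how much of the communication was hidden behind computation.

    comm_time:    latency of the collective alone (no computation)
    compute_time: duration of the dummy computation alone
    total_time:   duration of start + dummy computation + wait together

    The part of total_time not explained by computation is treated as exposed
    communication; the result is clamped to the range 0..100.
    """
    if comm_time <= 0:
        raise ValueError("comm_time must be positive")
    exposed = total_time - compute_time
    return max(0.0, min(100.0, 100.0 - exposed / comm_time * 100.0))


# Hypothetical timings in microseconds, not measurements.
print(overlap_percent(10.0, 10.0, 10.0))  # communication fully hidden
print(overlap_percent(10.0, 10.0, 15.0))  # half of it hidden
print(overlap_percent(10.0, 10.0, 20.0))  # nothing hidden
```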

osu_iallgather - MPI_Iallgather Latency Test
osu_iallgatherv - MPI_Iallgatherv Latency Test
osu_iallreduce - MPI_Iallreduce Latency Test
osu_ialltoall - MPI_Ialltoall Latency Test
osu_ialltoallv - MPI_Ialltoallv Latency Test
osu_ialltoallw - MPI_Ialltoallw Latency Test
osu_ibarrier - MPI_Ibarrier Latency Test
osu_ibcast - MPI_Ibcast Latency Test
osu_igather - MPI_Igather Latency Test
osu_igatherv - MPI_Igatherv Latency Test
osu_ireduce - MPI_Ireduce Latency Test
osu_iscatter - MPI_Iscatter Latency Test
osu_iscatterv - MPI_Iscatterv Latency Test

One-sided MPI Benchmarks

osu_put_latency - Latency Test for Put with Active/Passive Synchronization
osu_get_latency - Latency Test for Get with Active/Passive Synchronization
osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization
osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization
osu_put_bibw - Bi-directional Bandwidth Test for Put with Active Synchronization
osu_acc_latency - Latency Test for Accumulate with Active/Passive Synchronization
osu_cas_latency - Latency Test for Compare and Swap with Active/Passive Synchronization
osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization
osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization

Startup MPI Benchmarks

osu_init - This benchmark measures the minimum, maximum, and average time each process takes to complete MPI_Init.
osu_hello - This benchmark measures the time it takes for all processes to execute MPI_Init + MPI_Finalize.

Device-based Benchmarks

CUDA, ROCm, and OpenACC Extensions to OSU Micro Benchmarks

The CUDA extensions are enabled when the benchmark suite is configured with the --enable-cuda option. The OpenACC extensions are enabled when --enable-openacc is specified. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time.
Each of the pt2pt benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each of these parameters can be either 'H' or 'D' to indicate if the buffers are to be on the host or on the device respectively. When no parameters are specified, the buffers are allocated on the host.
The collective benchmarks use buffers allocated on the device if the -d option is used; otherwise, the buffers are allocated on the host.
The non-blocking collective benchmarks can also use -t for MPI_Test() calls and -r option for setting the target of dummy computation.
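
As a small illustration of the two placement parameters for the pt2pt benchmarks, the sketch below assembles a command line and validates the 'H'/'D' values. The helper name, the `mpirun -np 2` launcher, and the `./osu_latency` path are assumptions; substitute the launcher and install path used on your system.

```python
VALID_PLACEMENTS = ("H", "D")  # host or device, as described above


def pt2pt_command(benchmark, rank0="H", rank1="H",
                  launcher=("mpirun", "-np", "2")):
    """Build an argument list for a pt2pt benchmark with buffer placements.

    `launcher` is a hypothetical default; job launchers differ between sites.
    """
    for name, placement in (("rank0", rank0), ("rank1", rank1)):
        if placement not in VALID_PLACEMENTS:
            raise ValueError(
                f"{name} placement must be 'H' or 'D', got {placement!r}")
    return [*launcher, benchmark, rank0, rank1]


# Device buffers on both ranks.
print(" ".join(pt2pt_command("./osu_latency", "D", "D")))
```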
The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers on NVIDIA and AMD GPU devices.
- osu_bibw - Bidirectional Bandwidth Test
- osu_bw - Bandwidth Test
- osu_latency - Latency Test
- osu_mbw_mr - Multiple Bandwidth / Message Rate Test
- osu_multi_lat - Multi-pair Latency Test
- osu_put_latency - Latency Test for Put
- osu_get_latency - Latency Test for Get
- osu_put_bw - Bandwidth Test for Put
- osu_get_bw - Bandwidth Test for Get
- osu_put_bibw - Bidirectional Bandwidth Test for Put
- osu_acc_latency - Latency Test for Accumulate
- osu_cas_latency - Latency Test for Compare and Swap
- osu_fop_latency - Latency Test for Fetch and Op
- osu_allgather - MPI_Allgather Latency Test
- osu_allgatherv - MPI_Allgatherv Latency Test
- osu_allreduce - MPI_Allreduce Latency Test
- osu_alltoall - MPI_Alltoall Latency Test
- osu_alltoallv - MPI_Alltoallv Latency Test
- osu_bcast - MPI_Bcast Latency Test
- osu_gather - MPI_Gather Latency Test
- osu_gatherv - MPI_Gatherv Latency Test
- osu_reduce - MPI_Reduce Latency Test
- osu_reduce_scatter - MPI_Reduce_scatter Latency Test
- osu_scatter - MPI_Scatter Latency Test
- osu_scatterv - MPI_Scatterv Latency Test
- osu_iallgather - MPI_Iallgather Latency Test
- osu_iallgatherv - MPI_Iallgatherv Latency Test
- osu_iallreduce - MPI_Iallreduce Latency Test
- osu_ialltoall - MPI_Ialltoall Latency Test
- osu_ialltoallv - MPI_Ialltoallv Latency Test
- osu_ialltoallw - MPI_Ialltoallw Latency Test
- osu_ibcast - MPI_Ibcast Latency Test
- osu_igather - MPI_Igather Latency Test
- osu_igatherv - MPI_Igatherv Latency Test
- osu_ireduce - MPI_Ireduce Latency Test
- osu_iscatter - MPI_Iscatter Latency Test
- osu_iscatterv - MPI_Iscatterv Latency Test

Support for CUDA Managed Memory

In addition to support for communications to and from GPU memories allocated using CUDA or OpenACC, we now provide the additional capability of performing communications to and from buffers allocated using the CUDA Managed Memory concept. CUDA Managed (or Unified) Memory allows applications to allocate memory on either CPU or GPU memories using the cudaMallocManaged() call. The memory buffer can then move between the CPU and GPU without user involvement. Currently, we offer benchmarking with CUDA Managed Memory using the tests listed below. These benchmarks have the following additional options:
- "M" allocates a send or receive buffer as managed for point-to-point communication.
- "-d managed" uses managed memory buffers to perform collective communications.

osu_bibw - Bidirectional Bandwidth Test
osu_bw - Bandwidth Test
osu_latency - Latency Test
osu_mbw_mr - Multiple Bandwidth / Message Rate Test
osu_multi_lat - Multi-pair Latency Test
osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test

OpenSHMEM Benchmarks

Point-to-Point OpenSHMEM Benchmarks

osu_oshm_put - Latency Test for OpenSHMEM Put Routine
osu_oshm_put_nb - Latency Test for OpenSHMEM Non-blocking Put Routine
osu_oshm_get - Latency Test for OpenSHMEM Get Routine
osu_oshm_get_nb - Latency Test for OpenSHMEM Non-blocking Get Routine
osu_oshm_put_mr - Message Rate Test for OpenSHMEM Put Routine
osu_oshm_put_mr_nb - Message Rate Test for Non-blocking OpenSHMEM Put Routine
osu_oshm_get_mr_nb - Message Rate Test for Non-blocking OpenSHMEM Get Routine
osu_oshm_put_overlap - Non-blocking Message Rate Overlap Test. This benchmark measures the aggregate uni-directional operations rate overlap for OpenSHMEM Put between pairs of PEs, for different data sizes. The user should select for communication buffers to be in global memory and heap memory as with the earlier benchmarks. This test requires number of PEs. The benchmark prints statistics for different phases of communication, computation, and overlap at the end.
osu_oshm_atomics - Latency and Operation Rate Test for OpenSHMEM Atomics Routines. This benchmark measures the performance of atomic fetch-and-operate and atomic operate routines supported in OpenSHMEM for the integer and long datatypes. The buffers can be selected to be in heap memory or global memory. The PEs are paired as in the Put Operation Rate benchmark, and the first PE in each pair issues back-to-back atomic operations of one type to its peer PE. The average latency per atomic operation and the aggregate operation rate are reported. This is repeated for each of the fadd, finc, add, inc, cswap, swap, set, and fetch routines.
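
The two figures reported by the atomics benchmark can be related as in the sketch below. The function and its inputs are hypothetical: it assumes every issuing PE performs the same number of back-to-back operations and records its own elapsed time, which is an illustrative model and not the benchmark's code.

```python
def atomic_stats(iterations, elapsed_seconds_per_pair):
    """Return (average latency per operation in us, aggregate operations/s).

    elapsed_seconds_per_pair holds one elapsed time per PE pair, measured by
    the PE that issues `iterations` back-to-back atomic operations.
    """
    if iterations <= 0 or not elapsed_seconds_per_pair:
        raise ValueError("need at least one iteration and one pair")
    latencies_us = [t / iterations * 1e6 for t in elapsed_seconds_per_pair]
    avg_latency_us = sum(latencies_us) / len(latencies_us)
    # Each pair contributes its own rate; the aggregate is their sum.
    aggregate_rate = sum(iterations / t for t in elapsed_seconds_per_pair)
    return avg_latency_us, aggregate_rate


# Made-up timings for two pairs, not measurements.
latency_us, ops_per_s = atomic_stats(1000, [0.001, 0.002])
print(f"{latency_us:.2f} us per op, {ops_per_s:.0f} ops/s aggregate")
```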

Collective OpenSHMEM Benchmarks

Collective Latency Tests

osu_oshm_collect - OpenSHMEM Collect Latency Test
osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
osu_oshm_reduce - OpenSHMEM Reduce Latency Test
osu_oshm_barrier - OpenSHMEM Barrier Latency Test

UPC Benchmarks

Point-to-Point Unified Parallel C (UPC) Benchmarks

osu_upc_memput - Latency Test for UPC Put Routine
osu_upc_memget - Latency Test for UPC Get Routine

Collective Unified Parallel C (UPC) Benchmarks

Collective Latency Tests

osu_upc_all_barrier, osu_upc_all_broadcast, osu_upc_all_exchange, osu_upc_all_gather_all, osu_upc_all_gather, osu_upc_all_reduce, and osu_upc_all_scatter

osu_upc_all_barrier - UPC Barrier Latency Test
osu_upc_all_broadcast - UPC Broadcast Latency Test
osu_upc_all_exchange - UPC Exchange Latency Test
osu_upc_all_gather_all - UPC GatherAll Latency Test
osu_upc_all_gather - UPC Gather Latency Test
osu_upc_all_reduce - UPC Reduce Latency Test
osu_upc_all_scatter - UPC Scatter Latency Test

UPC++ Benchmarks

Point-to-Point UPC++ Benchmarks

osu_upcxx_async_copy_put - Latency Test for UPC++ Put
osu_upcxx_async_copy_get - Latency Test for UPC++ Get

Collective UPC++ Benchmarks

Collective Latency Tests

osu_upcxx_bcast - UPC++ Broadcast Latency Test
osu_upcxx_reduce - UPC++ Reduce Latency Test
osu_upcxx_allgather - UPC++ Allgather Latency Test
osu_upcxx_gather - UPC++ Gather Latency Test
osu_upcxx_scatter - UPC++ Scatter Latency Test
osu_upcxx_alltoall - UPC++ AlltoAll (exchange) Latency Test

NCCL Benchmarks

Point-to-Point NCCL Benchmarks

osu_nccl_latency - Latency Test
osu_nccl_bw - Bandwidth Test
osu_nccl_bibw - Bidirectional Bandwidth Test

Collective NCCL Benchmarks

The latest OMB version includes benchmarks for various NCCL collective operations (NCCL Allgather, NCCL Allreduce, NCCL Bcast, NCCL Reduce, NCCL Reduce_Scatter, NCCL Alltoall). These benchmarks work as follows. When the osu_nccl_bcast benchmark is run with N processes, it measures the minimum, maximum, and average latency of the NCCL Bcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options:
- "-f" reports additional statistics of the benchmark, such as min and max latencies and the number of iterations.
- "-m" sets the minimum and maximum message length to be used in a benchmark. By default, the benchmarks report latencies for message lengths up to 1MB. Examples:
  - -m 128 // min = default, max = 128
  - -m 2:128 // min = 2, max = 128
  - -m 2: // min = 2, max = default
- "-x" sets the number of warmup iterations to skip for each message length.
- "-i" sets the number of iterations to run for each message length.
- "-M" sets the per-process maximum memory consumption. By default, the benchmarks are limited to 512MB allocations.
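
The min, max, and average figures described above can be modeled as a reduction over per-process timings. The sketch below performs that reduction on a plain Python list; the function name and the sample timings are assumptions, and a real run would gather the values across processes with the communication library rather than from a list.

```python
def summarize_latency(per_process_seconds, iterations):
    """Summarize collective latency for one message length.

    per_process_seconds: total time each process spent in the timed
    iterations. Returns per-call latencies in microseconds.
    """
    if iterations <= 0 or not per_process_seconds:
        raise ValueError("need at least one iteration and one process")
    per_call_us = [t / iterations * 1e6 for t in per_process_seconds]
    return {
        "avg": sum(per_call_us) / len(per_call_us),  # reported by default
        "min": min(per_call_us),                     # shown with "-f"
        "max": max(per_call_us),                     # shown with "-f"
        "iterations": iterations,                    # shown with "-f"
    }


# Made-up totals for N = 3 processes over 1000 timed iterations.
stats = summarize_latency([0.010, 0.020, 0.030], iterations=1000)
print(stats)
```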

osu_nccl_allgather - NCCL Allgather Latency Test
osu_nccl_allreduce - NCCL Allreduce Latency Test
osu_nccl_bcast - NCCL Bcast Latency Test
osu_nccl_reduce - NCCL Reduce Latency Test
osu_nccl_reduce_scatter - NCCL Reduce_scatter Latency Test
osu_nccl_alltoall - NCCL Alltoall Latency Test

Java Benchmarks

Point-to-Point Java Bindings Benchmarks

The following are the point-to-point benchmarks for Java MPI libraries such as MVAPICH2-J and the Open MPI Java bindings. There are separate custom bandwidth and bi-bandwidth benchmarks for Open MPI because the API does not support communicating Java arrays using non-blocking point-to-point primitives.

OSULatency - Latency Test
OSUBandwidth - Bandwidth Test
  - OSUBandwidthOMPI (exclusively for the Open MPI Java bindings)
OSUBiBandwidth - Bidirectional Bandwidth Test
  - OSUBiBandwidthOMPI (exclusively for the Open MPI Java bindings)
OSUOMPIBandwidth - Bandwidth Test for Open MPI Java Bindings
OSUOMPIBiBandwidth - Bidirectional Bandwidth Test for Open MPI Java Bindings

Collective Java Bindings Benchmarks

The following are the collective benchmarks for Java MPI libraries such as MVAPICH2-J and the Open MPI Java bindings.

OSUAllgather - MPI_Allgather Latency Test
OSUAllgatherv - MPI_Allgatherv Latency Test
OSUAllReduce - MPI_Allreduce Latency Test
OSUAlltoall - MPI_Alltoall Latency Test
OSUAlltoallv - MPI_Alltoallv Latency Test
OSUBarrier - MPI_Barrier Latency Test
OSUBcast - MPI_Bcast Latency Test
OSUGather - MPI_Gather Latency Test
OSUGatherv - MPI_Gatherv Latency Test
OSUReduce - MPI_Reduce Latency Test
OSUReduceScatter - MPI_Reduce_scatter Latency Test
OSUScatter - MPI_Scatter Latency Test
OSUScatterv - MPI_Scatterv Latency Test

Python Benchmarks

The OMB Python extension offers a variety of point-to-point and collective benchmarks to evaluate the communication performance of MPI-based parallel applications in Python. This extension uses the mpi4py library to provide Python bindings for the MPI standard. The extension supports a variety of Python buffers, including NumPy, CuPy, Numba, and PyCUDA. In addition to the CPU tests, GPU benchmarks are supported by selecting the CUDA-aware buffers. Tests with serialized communicated objects are also supported by using the --pickle runtime flag. The --min and --max flags specify the lower and upper bounds for the tested message size. The --iterations and --skip flags set the number of testing and warmup iterations for each message size. To enable the Python extension, please configure OMB with the --enable-python option.
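
The roles of the iteration and skip counts can be sketched as a simple timing loop: warmup calls run first and are excluded, then the timed calls are averaged. This is an illustrative model using only the standard library, not the extension's code; `time_operation` is a hypothetical helper, and the pickle round trip merely stands in for the serialization work that pickled objects add to a transfer.

```python
import pickle
import time


def time_operation(operation, iterations, skip):
    """Return the average time per call in microseconds.

    The first `skip` calls are warmup and are not timed; the next
    `iterations` calls are timed together.
    """
    if iterations <= 0 or skip < 0:
        raise ValueError("iterations must be > 0 and skip must be >= 0")
    for _ in range(skip):
        operation()
    start = time.perf_counter()
    for _ in range(iterations):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6


# Stand-in workload: serialize and deserialize a payload locally.
payload = list(range(1024))
avg_us = time_operation(lambda: pickle.loads(pickle.dumps(payload)),
                        iterations=100, skip=10)
print(f"average serialization round trip: {avg_us:.2f} us")
```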

Point-to-Point Python Benchmarks

The following are the point-to-point benchmarks to evaluate performance of MPI communication in Python for both CPU and GPU using the mpi4py bindings.

osu_latency - Latency Test
osu_bw - Bandwidth Test
osu_bibw - Bidirectional Bandwidth Test
osu_multi_lat - Multi-pair Latency Test

Collective Python Benchmarks

The following are the collective benchmarks to evaluate performance of MPI communication in Python for both CPU and GPU using the mpi4py bindings.

osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_barrier - MPI_Barrier Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test

MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, RoCE, and Slingshot