MVAPICH/MVAPICH2 Project
Ohio State University



Benchmarks | Network-Based Computing Laboratory

Benchmarks

  • OSU Micro-Benchmarks 3.6 (04/30/12)
    • This version extends the osu_latency, osu_bw, and osu_bibw benchmarks to evaluate the performance of MPI_Send/MPI_Recv operations on NVIDIA GPU devices with CUDA support.
    • Please see CHANGES for the full changelog.
    • You may also take a look at the README for more information.
    • The benchmarks are available under the BSD license.
  • This page contains descriptions of the following MPI-level tests included in the OMB package:
    • Latency, bandwidth, bidirectional bandwidth, multiple bandwidth / message rate test, multi-pair latency and collective latency tests for MVAPICH (MPI-1)
    • Latency, multi-threaded latency, multi-pair latency, multiple bandwidth / message rate test, bandwidth, bidirectional bandwidth, one-sided put latency (active/passive), one-sided put bandwidth (active/passive), one-sided put bidirectional bandwidth, one-sided get latency (active/passive), one-sided get bandwidth (active/passive), one-sided accumulate latency (active/passive) for MVAPICH2 (MPI-2).
    • Collective latency tests for various MPI collective operations such as MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter and vector collectives.
  • CUDA Extensions to OMB
    • The following benchmarks have been extended to evaluate performance of MPI_Send/MPI_Recv from and to buffers on NVIDIA GPU devices.
      • osu_bibw - Bidirectional Bandwidth Test
      • osu_bw - Bandwidth Test
      • osu_latency - Latency Test
    • These extensions are enabled when the benchmark suite is configured with the --enable-cuda option. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time.
    • Each of these benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each of these parameters can be either 'H' or 'D' to indicate if the buffers are to be on the host or on the device respectively. When no parameters are specified, the buffers are allocated on the host.
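
The parameter handling described above can be sketched as follows. This is a minimal illustration in Python rather than the benchmarks' own source; the function name parse_buffer_locations is hypothetical, and the only behavior assumed is what the text states: two parameters, each 'H' or 'D', with host buffers as the default.

```python
def parse_buffer_locations(args):
    """Return (rank0_location, rank1_location) for the two ranks.

    Hypothetical helper: 'H' selects host memory, 'D' selects GPU device
    memory. With no parameters, both ranks use host buffers.
    """
    if not args:
        return ("H", "H")
    if len(args) != 2:
        raise ValueError("expected two parameters, one per rank")
    for location in args:
        if location not in ("H", "D"):
            raise ValueError(f"invalid buffer location {location!r}; use 'H' or 'D'")
    return (args[0], args[1])


# Rank 0 uses device buffers, rank 1 uses host buffers.
print(parse_buffer_locations(["D", "H"]))  # ('D', 'H')
```

Rejecting anything other than exactly zero or two parameters is a design choice of this sketch, not a documented behavior of the benchmarks.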

Latency Test

  • The latency tests were carried out in a ping-pong fashion. The sender sends a message with a certain data size to the receiver and waits for a reply from the receiver. The receiver receives the message from the sender and sends back a reply with the same data size. Many iterations of this ping-pong test were carried out and average one-way latency numbers were obtained. Blocking versions of MPI functions (MPI_Send and MPI_Recv) were used in the tests.
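
The arithmetic behind the reported number can be sketched as follows. This assumes that each timed iteration is one full round trip, so the one-way latency is half of the per-iteration time; the function name is hypothetical, and any warm-up iterations are assumed to be excluded from the timing.

```python
def one_way_latency_us(elapsed_seconds, iterations):
    """Average one-way latency in microseconds for a ping-pong test.

    Each iteration covers a message and its reply, so the one-way
    time is half of the average round-trip time.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    round_trip_seconds = elapsed_seconds / iterations
    return round_trip_seconds / 2 * 1e6


# 10,000 round trips in 0.05 s give a one-way latency of about 2.5 microseconds.
print(one_way_latency_us(0.05, 10000))
```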

Multi-threaded Latency Test (MPI-2 Benchmark [requires threading support])

  • The multi-threaded latency test performs a ping-pong test with a single sender process and multiple threads on the receiving process. In this test the sending process sends a message of a given data size to the receiver and waits for a reply from the receiver process. The receiving process has a variable number of receiving threads (set by default to 2), where each thread calls MPI_Recv and upon receiving a message sends back a response of equal size. Many iterations are performed and the average one-way latency numbers are reported.

Bandwidth Test

  • The bandwidth tests were carried out by having the sender send a fixed number (equal to the window size) of back-to-back messages to the receiver and then wait for a reply from the receiver. The receiver sends the reply only after receiving all these messages. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time (from the time the sender sends the first message until the time it receives the reply back from the receiver) and the number of bytes sent by the sender. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. Thus, non-blocking versions of MPI functions (MPI_Isend and MPI_Irecv) were used in the test.
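
The calculation described above can be sketched as follows. The function name is hypothetical, and the unit is an assumption of this sketch: one MB is taken to be 10^6 bytes.

```python
def bandwidth_mb_per_s(message_bytes, window_size, iterations, elapsed_seconds):
    """Bandwidth from the bytes sent by the sender and the elapsed time.

    In every iteration the sender issues `window_size` back-to-back
    messages of `message_bytes` bytes before waiting for the reply.
    Assumes 1 MB = 10**6 bytes.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = message_bytes * window_size * iterations
    return total_bytes / 1e6 / elapsed_seconds


# 100 iterations of 64 messages of 1,000,000 bytes each, completed in 2 s.
print(bandwidth_mb_per_s(1_000_000, 64, 100, 2.0))  # 3200.0
```

The small reply message is not counted in the byte total; only the data sent by the sender contributes, as the text states.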

Bidirectional Bandwidth Test

  • The bidirectional bandwidth test is similar to the bandwidth test, except that both nodes involved send out a fixed number of back-to-back messages and wait for the reply. This test measures the maximum sustainable aggregate bandwidth between two nodes.

Multiple Bandwidth / Message Rate test

  • The multi-pair bandwidth and message rate test evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver. This process is repeated for several iterations. The objective of this benchmark is to determine the achieved bandwidth and message rate from one node to another node with a configurable number of processes running on each node.
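
The two aggregate figures can be sketched as follows. The function name is hypothetical, and the sketch assumes that all pairs run for the same elapsed time and that one MB is 10^6 bytes.

```python
def aggregate_rates(message_bytes, window_size, iterations, pairs, elapsed_seconds):
    """Return (bandwidth in MB/s, message rate in messages/s) over all pairs.

    Every sending process issues `window_size` messages per iteration,
    so the total message count scales with the number of pairs.
    Assumes 1 MB = 10**6 bytes.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_messages = window_size * iterations * pairs
    bandwidth = message_bytes * total_messages / 1e6 / elapsed_seconds
    message_rate = total_messages / elapsed_seconds
    return bandwidth, message_rate


# 4 pairs, 8-byte messages, window of 64, 1000 iterations, 0.5 s elapsed.
bandwidth, message_rate = aggregate_rates(8, 64, 1000, 4, 0.5)
print(bandwidth, message_rate)
```

For small messages the message rate is usually the more informative figure, while for large messages the bandwidth dominates.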

Multi-pair Latency Test

  • This test is very similar to the latency test. However, multiple pairs of processes perform the same test simultaneously. In order to perform the test across just two nodes, the hostnames must be specified in block fashion.

Latency Test for Put with Active Synchronization (MPI-2 Benchmark)

  • Post-Wait/Start-Complete synchronization is used in this test. The origin process calls MPI_Put to directly place data of a certain size in the remote process's window. It waits on the MPI_Win_complete call to ensure the local completion of the message. The target process calls MPI_Win_wait to make sure the message has been received in its window. Then the origin and target are interchanged and the communication happens in the opposite direction. Several iterations of this test are carried out and the average put latency is obtained. The latency includes the synchronization time. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This hint is specific to MVAPICH2 and is ignored by other MPI libraries. MVAPICH2 takes advantage of this to optimize intra-node one-sided communication. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Latency Test for Get with Active Synchronization (MPI-2 Benchmark)

  • Post-Wait/Start-Complete synchronization is used in this test. The origin process calls MPI_Get to directly fetch data of a certain size from the target process's window into a local buffer. It then waits on a synchronization call (MPI_Win_complete) for local completion of the Gets. The remote process waits on an MPI_Win_wait call. After the synchronization calls, the target and origin processes are switched for a message in the opposite direction. Several iterations of this test are carried out and the average get latency is obtained. The latency includes the synchronization time. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This hint is specific to MVAPICH2 and is ignored by other MPI libraries. MVAPICH2 takes advantage of this to optimize intra-node one-sided communication. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Bandwidth Test for Put with Active Synchronization (MPI-2 Benchmark)

  • Post-Wait/Start-Complete synchronization is used in this test. The test is carried out by the origin process calling a fixed number of back-to-back MPI_Puts on the remote window and then waiting on a synchronization call (MPI_Win_complete) for their completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes put by the origin process. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This hint is specific to MVAPICH2 and is ignored by other MPI libraries. MVAPICH2 takes advantage of this to optimize intra-node one-sided communication. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Bandwidth Test for Get with Active Synchronization (MPI-2 Benchmark)

  • Post-Wait/Start-Complete synchronization is used in this test. The test is carried out by the origin process calling a fixed number of back-to-back MPI_Gets and then waiting on a synchronization call (MPI_Win_complete) for their completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes received by the origin process. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This hint is specific to MVAPICH2 and is ignored by other MPI libraries. MVAPICH2 takes advantage of this to optimize intra-node one-sided communication. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Bidirectional Bandwidth Test for Put with Active Synchronization (MPI-2 Benchmark)

  • Post-Wait/Start-Complete synchronization is used in this test. This test is similar to the bandwidth test, except that both processes involved issue a fixed number of back-to-back MPI_Puts and wait for their completion. This test measures the maximum sustainable aggregate bandwidth between two processes. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This hint is specific to MVAPICH2 and is ignored by other MPI libraries. MVAPICH2 takes advantage of this to optimize intra-node one-sided communication. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Latency Test for Accumulate with Active Synchronization (MPI-2 Benchmark)

  • Post-Wait/Start-Complete synchronization is used in this test. The origin process calls MPI_Accumulate to combine data from the local buffer with the data in the remote window and store it in the remote window. The combining operation used in the test is MPI_SUM. The origin process then waits on a synchronization call (MPI_Win_complete) for local completion of the operations. The remote process waits on an MPI_Win_wait call. After the synchronization call, the target and origin processes are switched for the pong message. Several iterations of this test are carried out and the average accumulate latency is obtained. The latency includes the synchronization time. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This hint is specific to MVAPICH2 and is ignored by other MPI libraries. MVAPICH2 takes advantage of this to optimize intra-node one-sided communication. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Latency Test for Put with Passive Synchronization (MPI-2 Benchmark)

  • The origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Put to directly place data of a certain size in the window. Then it calls MPI_Win_unlock to ensure completion of the Put and release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Put + MPI_Win_unlock calls is measured. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This optimization is specific to MVAPICH2 and the hint is ignored by other MPI libraries. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Latency Test for Get with Passive Synchronization (MPI-2 Benchmark)

  • The origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Get to directly read data of a certain size from the window. Then it calls MPI_Win_unlock to ensure completion of the Get and release the lock on the remote window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Get + MPI_Win_unlock calls is measured. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This optimization is specific to MVAPICH2 and the hint is ignored by other MPI libraries. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Bandwidth Test for Put with Passive Synchronization (MPI-2 Benchmark)

  • The origin process calls MPI_Win_lock to lock the target process's window and calls a fixed number of back-to-back MPI_Puts to directly place data in the window. Then it calls MPI_Win_unlock to ensure completion of the Puts and release the lock on the remote window. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes put by the origin process. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This optimization is specific to MVAPICH2 and the hint is ignored by other MPI libraries. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Bandwidth Test for Get with Passive Synchronization (MPI-2 Benchmark)

  • The origin process calls MPI_Win_lock to lock the target process's window and calls a fixed number of back-to-back MPI_Gets to directly get data from the window. Then it calls MPI_Win_unlock to ensure completion of the Gets and release the lock on the window. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes read by the origin process. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This optimization is specific to MVAPICH2 and the hint is ignored by other MPI libraries. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

Latency Test for Accumulate with Passive Synchronization (MPI-2 Benchmark)

  • The origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Accumulate to combine data from a local buffer with the data in the remote window and store it in the remote window. Then it calls MPI_Win_unlock to ensure completion of the Accumulate and release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Accumulate + MPI_Win_unlock calls is measured. By default, the window memory is allocated in shared memory by providing the 'alloc_shm' hint to the MPI_Alloc_mem call. This optimization is specific to MVAPICH2 and the hint is ignored by other MPI libraries. It does not have an impact on internode performance. This optimization can be disabled by providing the '-n' or '-no-hints' option to the benchmark.

CUDA Extensions to OSU Micro Benchmarks

  • Latency, Bandwidth and Bi-directional Bandwidth tests have been extended to evaluate performance of MPI_Send/MPI_Recv from and to buffers on NVIDIA GPU devices. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time. Each of these benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each of these parameters can be either 'H' or 'D' to indicate if the buffers are to be on the host or on the device respectively. When no parameters are specified, the buffers are allocated on the host.

Please note that there are many different ways to measure these performance parameters. For example, the bandwidth test can have different variations with respect to the types of MPI calls (blocking vs. non-blocking) being used, the total number of back-to-back messages sent in one iteration, the number of iterations, etc. Other ways to measure bandwidth may give different numbers. Readers are welcome to use other tests, as appropriate to their application environments.

Collective Latency Tests

  • The latest OMB version includes benchmarks for various MPI collective operations (MPI_Allgather(*), MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather(*), MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter(*) and vector collectives). These benchmarks work in the following manner. When the osu_bcast benchmark is run with N processes, for example, it measures the min, max and average latency of the MPI_Bcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options:
    • "-f" can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations.
    • "-m" option can be used to set the maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for up to 1MB message lengths.
    • "-i" can be used to set the number of iterations to run for each message length.
    • "-h" option displays the list of options and their descriptions.
    • "-v" reports the benchmark version.
    (*) OMB version 3.6 uses the MPI-2 MPI_IN_PLACE feature for three collective benchmarks: osu_allgather, osu_gather and osu_scatter.
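
The min, max and average statistics described above can be sketched as follows. This assumes that each of the N processes first averages its own timings over all iterations for a given message length, and that the per-process averages are then combined; the function name and the input list are hypothetical.

```python
def summarize_collective_latency(per_process_avg_us):
    """Combine per-process average latencies (microseconds) into min, max and avg.

    Hypothetical helper: the input holds one value per process, each already
    averaged over the iterations run for a single message length.
    """
    if not per_process_avg_us:
        raise ValueError("at least one process is required")
    count = len(per_process_avg_us)
    return {
        "min": min(per_process_avg_us),
        "max": max(per_process_avg_us),
        "avg": sum(per_process_avg_us) / count,
    }


# Three processes reporting 10, 12 and 14 microseconds for one message length.
print(summarize_collective_latency([10.0, 12.0, 14.0]))
# {'min': 10.0, 'max': 14.0, 'avg': 12.0}
```

In the default output only the "avg" value would be shown for each message length; the "min" and "max" values correspond to the additional statistics enabled with "-f".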