MVAPICH/MVAPICH2 Project
Ohio State University




Benchmarks

  • OSU Micro-Benchmarks 4.3 (03/24/14)
    • This version features new UPC collective benchmarks (osu_upc_all_broadcast, osu_upc_all_scatter, osu_upc_all_gather, osu_upc_all_gather_all, osu_upc_all_exchange, osu_upc_all_barrier, and osu_upc_all_reduce).
    • It includes several new (or updated) benchmarks to measure performance of MPI-3 RMA communication operations with options to select different window creation and synchronization functions in each benchmark.
    • Please see CHANGES for the full changelog.
    • You may also take a look at the README for more information.
    • The benchmarks are available under the BSD license.
  • This page contains descriptions of the following MPI, OpenSHMEM and UPC tests included in the OMB package:
    • Point-to-Point MPI Benchmarks: Latency, multi-threaded latency, multi-pair latency, multiple bandwidth / message rate test, bandwidth, and bidirectional bandwidth
    • Collective MPI Benchmarks: Collective latency tests for various MPI collective operations such as MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter and vector collectives.
    • One-sided MPI Benchmarks: one-sided put latency, one-sided put bandwidth, one-sided put bidirectional bandwidth, one-sided get latency, one-sided get bandwidth, one-sided accumulate latency, compare-and-swap latency, fetch-and-op latency, and get_accumulate latency for MVAPICH2 (MPI-2 and MPI-3).
    • Point-to-Point OpenSHMEM Benchmarks: put latency, get latency, message rate, and atomics
    • Collective OpenSHMEM Benchmarks: collect latency, broadcast latency, reduce latency, and barrier latency
    • Point-to-Point UPC Benchmarks: put latency, get latency
    • Collective UPC Benchmarks: broadcast latency, scatter latency, gather latency, all_gather latency, and exchange latency
  • CUDA and OpenACC Extensions to OMB
    • The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers on NVIDIA GPU devices.
      • osu_bibw - Bidirectional Bandwidth Test
      • osu_bw - Bandwidth Test
      • osu_latency - Latency Test
      • osu_allgather - MPI_Allgather Latency Test
      • osu_allgatherv - MPI_Allgatherv Latency Test
      • osu_allreduce - MPI_Allreduce Latency Test
      • osu_alltoall - MPI_Alltoall Latency Test
      • osu_alltoallv - MPI_Alltoallv Latency Test
      • osu_bcast - MPI_Bcast Latency Test
      • osu_gather - MPI_Gather Latency Test
      • osu_gatherv - MPI_Gatherv Latency Test
      • osu_reduce - MPI_Reduce Latency Test
      • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
      • osu_scatter - MPI_Scatter Latency Test
      • osu_scatterv - MPI_Scatterv Latency Test

Point-to-Point MPI Benchmarks

  • osu_latency - Latency Test
  • The latency tests are carried out in a ping-pong fashion. The sender sends a message with a certain data size to the receiver and waits for a reply from the receiver. The receiver receives the message from the sender and sends back a reply with the same data size. Many iterations of this ping-pong test are carried out and the average one-way latency is reported. The blocking versions of the MPI functions (MPI_Send and MPI_Recv) are used in the tests. A simplified sketch of this ping-pong loop appears after this list.
  • osu_latency_mt - Multi-threaded Latency Test (requires threading support from MPI-2 and MPI-3)
  • The multi-threaded latency test performs a ping-pong test with a single sender process and multiple threads on the receiving process. In this test the sending process sends a message of a given data size to the receiver and waits for a reply from the receiver process. The receiving process has a variable number of receiving threads (set by default to 2), where each thread calls MPI_Recv and upon receiving a message sends back a response of equal size. Many iterations are performed and the average one-way latency numbers are reported.
  • osu_bw - Bandwidth Test
  • The bandwidth tests are carried out by having the sender send a fixed number (equal to the window size) of back-to-back messages to the receiver and then wait for a reply from the receiver. The receiver sends the reply only after receiving all of these messages. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time (from the time the sender sends the first message until the time it receives the reply back from the receiver) and the number of bytes sent by the sender. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. Thus, the non-blocking versions of the MPI functions (MPI_Isend and MPI_Irecv) are used in the test. This windowed measurement loop is also sketched after this list.
  • osu_bibw - Bidirectional Bandwidth Test
  • The bidirectional bandwidth test is similar to the bandwidth test, except that both the nodes involved send out a fixed number of back-to-back messages and wait for the reply. This test measures the maximum sustainable aggregate bandwidth by two nodes.
  • osu_mbw_mr - Multiple Bandwidth / Message Rate Test
  • The multi-pair bandwidth and message rate test evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver. This process is repeated for several iterations. The objective of this benchmark is to determine the achieved bandwidth and message rate from one node to another node with a configurable number of processes running on each node.
  • osu_multi_lat - Multi-pair Latency Test
  • This test is very similar to the latency test. However, at the same instant multiple pairs are performing the same test simultaneously. In order to perform the test across just two nodes the hostnames must be specified in block fashion.
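
The following is a minimal sketch of the two core measurement loops described above: the blocking ping-pong used by osu_latency and the windowed non-blocking scheme used by osu_bw. It is not the OMB source; the message size, iteration count, and window size are illustrative placeholders, and warm-up iterations and the message-size sweep are omitted.

```c
/* Sketch of the osu_latency / osu_bw measurement loops (illustrative only). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS   1000   /* placeholder iteration count */
#define WINDOW  64     /* back-to-back messages per bandwidth iteration */

int main(int argc, char **argv)
{
    int rank, size = 8192;              /* one illustrative message size */
    char *sbuf, *rbuf;
    MPI_Request reqs[WINDOW];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sbuf = malloc(size);
    rbuf = malloc(size);

    /* --- ping-pong latency: blocking MPI_Send / MPI_Recv --- */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(sbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(rbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(rbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(sbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)  /* one-way latency = half the average round-trip time */
        printf("latency: %f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    /* --- bandwidth: WINDOW non-blocking transfers, then a short reply --- */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(sbuf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(rbuf, 4, MPI_CHAR, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* the receive buffer is reused across the window; its contents are never examined */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(rbuf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(sbuf, 4, MPI_CHAR, 0, 2, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)  /* MB/s based on the bytes pushed by the sender */
        printf("bandwidth: %f MB/s\n",
               (double)size * WINDOW * ITERS / 1e6 / (t1 - t0));

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```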

Collective MPI Benchmarks

  • osu_allgather - MPI_Allgather Latency Test
  • osu_allgatherv - MPI_Allgatherv Latency Test
  • osu_allreduce - MPI_Allreduce Latency Test
  • osu_alltoall - MPI_Alltoall Latency Test
  • osu_alltoallv - MPI_Alltoallv Latency Test
  • osu_barrier - MPI_Barrier Latency Test
  • osu_bcast - MPI_Bcast Latency Test
  • osu_gather - MPI_Gather Latency Test
  • osu_gatherv - MPI_Gatherv Latency Test
  • osu_reduce - MPI_Reduce Latency Test
  • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
  • osu_scatter - MPI_Scatter Latency Test
  • osu_scatterv - MPI_Scatterv Latency Test
  • Collective Latency Tests
  • The latest OMB version includes benchmarks for various MPI-1, MPI-2 and MPI-3 collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter and vector collectives). These benchmarks work in the following manner. Suppose a user runs the osu_bcast benchmark with N processes; the benchmark measures the min, max and average latency of the MPI_Bcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics, such as the min and max latencies and the number of iterations. The "-m" option can be used to set the maximum message length used in a benchmark; by default the benchmarks report latencies for message lengths up to 1MB. "-i" can be used to set the number of iterations to run for each message length. "-M" can be used to set the per-process maximum memory consumption; by default the benchmarks are limited to 512MB allocations. A simplified sketch of this measurement loop is shown below.
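
The sketch below illustrates the collective measurement pattern just described, using MPI_Bcast as the example. It is not the OMB source; the iteration count and message size are placeholders, and only a single message length is timed. The final reductions mirror the min/max/average statistics the benchmarks report.

```c
/* Sketch of the collective latency measurement pattern (osu_bcast-style). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 1024;   /* placeholder iteration count / message size */
    int rank, nprocs;
    double t0, elapsed, avg, tmin, tmax;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    elapsed = (MPI_Wtime() - t0) * 1e6 / iters;   /* per-call latency in us on this rank */

    /* Combine per-rank latencies into the min / max / average that the benchmark reports */
    MPI_Reduce(&elapsed, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &avg,  1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d bytes: avg %.2f us, min %.2f us, max %.2f us\n",
               size, avg / nprocs, tmin, tmax);

    free(buf);
    MPI_Finalize();
    return 0;
}
```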

One-sided MPI Benchmarks

  • osu_put_latency - Latency Test for Put with Active/Passive Synchronization
  • The put latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the origin process calls MPI_Put to directly place data of a certain size in the remote process's window and then waits on a synchronization call (MPI_Win_complete) for completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. Several iterations of this test are carried out and the average put latency is reported. The latency includes the synchronization time. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Put to directly place data of a certain size in the window. Then it calls MPI_Win_unlock to ensure completion of the put and to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Put + MPI_Win_unlock sequence is measured (a simplified sketch of this passive-synchronization sequence appears after this list). The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_get_latency - Latency Test for Get with Active/Passive Synchronization
  • The get latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the origin process calls MPI_Get to directly fetch data of a certain size from the target process's window into a local buffer. It then waits on a synchronization call (MPI_Win_complete) for local completion of the Gets. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. Several iterations of this test are carried out and the average get latency is reported. The latency includes the synchronization time. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Get to directly read data of a certain size from the window. Then it calls MPI_Win_unlock to ensure completion of the get and releases the lock on the remote window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Get + MPI_Win_unlock sequence is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization
  • The put bandwidth benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the test is carried out by the origin process calling a fixed number of back-to-back MPI_Puts on the remote window and then waiting on a synchronization call (MPI_Win_complete) for their completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes put by the origin process. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls a fixed number of back-to-back MPI_Puts to directly place data in the window. Then it calls MPI_Win_unlock to ensure completion of the puts and to release the lock on the remote window. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes put by the origin process. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization
  • The get bandwidth benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the test is carried out by the origin process calling a fixed number of back-to-back MPI_Gets and then waiting on a synchronization call (MPI_Win_complete) for their completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes received by the origin process. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls a fixed number of back-to-back MPI_Gets to directly read data from the window. Then it calls MPI_Win_unlock to ensure completion of the gets and to release the lock on the window. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time and the number of bytes read by the origin process. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_put_bibw - Bi-directional Bandwidth Test for Put with Active Synchronization
  • The put bi-directional bandwidth benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). This test is similar to the bandwidth test, except that both the processes involved send out a fixed number of back-to-back MPI_Puts and wait for their completion. This test measures the maximum sustainable aggregate bandwidth by two processes. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_Post/Start/Complete/Wait. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_acc_latency - Latency Test for Accumulate with Active/Passive Synchronization
  • The accumulate latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the origin process calls MPI_Accumulate to combine data from the local buffer with the data in the remote window and store it in the remote window. The combining operation used in the test is MPI_SUM. The origin process then waits on a synchronization call (MPI_Win_complete) for completion of the operations. The remote process waits on a MPI_Win_wait call. Several iterations of this test are carried out and the average accumulate latency is obtained. The latency includes the synchronization time. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Accumulate to combine data from a local buffer with the data in the remote window and store it in the remote window. Then it calls MPI_Win_unlock to ensure completion of the accumulate and to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Accumulate + MPI_Win_unlock sequence is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_cas_latency - Latency Test for Compare and Swap with Active/Passive Synchronization
  • The Compare_and_swap latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the origin process calls MPI_Compare_and_swap to place one element from the origin buffer into the target buffer. The initial value in the target buffer is returned to the calling process. The origin process then waits on a synchronization call (MPI_Win_complete) for local completion of the operations. The remote process waits on a MPI_Win_wait call. Several iterations of this test are carried out and the average Compare_and_swap latency is obtained. The latency includes the synchronization time. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Compare_and_swap to place one element from the origin buffer into the target buffer. The initial value in the target buffer is returned to the calling process. Then it calls MPI_Win_flush to ensure completion of the Compare_and_swap. In the end, it calls MPI_Win_unlock to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Compare_and_swap + MPI_Win_flush sequence is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization
  • The Fetch_and_op latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the origin process calls MPI_Fetch_and_op to increment the element in the target buffer by 1. The initial value from the target buffer is returned to the calling process. The origin process waits on a synchronization call (MPI_Win_complete) for completion of the operations. The remote process waits on a MPI_Win_wait call. Several iterations of this test are carried out and the average Fetch_and_op latency is obtained. The latency includes the synchronization time. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Fetch_and_op to increment the element in the target buffer by 1. The initial value from the target buffer is returned to the calling process. Then it calls MPI_Win_flush to ensure completion of the Fetch_and_op. In the end, it calls MPI_Win_unlock to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Fetch_and_op + MPI_Win_flush sequence is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
  • osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization
  • The Get_accumulate latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait: the origin process calls MPI_Get_accumulate to combine data from the local buffer with the data in the remote window and store it in the remote window. The combining operation used in the test is MPI_SUM. The initial value from the target buffer is returned to the calling process. The origin process waits on a synchronization call (MPI_Win_complete) for local completion of the operations. The remote process waits on a MPI_Win_wait call. Several iterations of this test are carried out and the average Get_accumulate latency is obtained. The latency includes the synchronization time. For passive synchronization, suppose users run with MPI_Win_lock/unlock: the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Get_accumulate to combine data from a local buffer with the data in the remote window and store it in the remote window. The initial value from the target buffer is returned to the calling process. Then it calls MPI_Win_unlock to ensure completion of the Get_accumulate and to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Get_accumulate + MPI_Win_unlock sequence is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options: "-w create" use MPI_Win_create to create an MPI Window object. "-w allocate" use MPI_Win_allocate to create an MPI Window object. "-w dynamic" use MPI_Win_create_dynamic to create an MPI Window object. "-s lock" use MPI_Win_lock/unlock synchronization calls. "-s flush" use MPI_Win_flush synchronization call. "-s flush_local" use MPI_Win_flush_local synchronization call. "-s lock_all" use MPI_Win_lock_all/unlock_all synchronization calls. "-s pscw" use Post/Start/Complete/Wait synchronization calls. "-s fence" use MPI_Win_fence synchronization call.
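
The sketch below illustrates the passive-synchronization put latency pattern referenced in the osu_put_latency description above (the "-w allocate" / "-s lock" style path). It is not the OMB source: the iteration count and message size are placeholders, only a single message length is timed, and warm-up iterations are omitted.

```c
/* Sketch of a passive-synchronization put latency loop:
 * window created with MPI_Win_allocate; MPI_Win_lock + MPI_Put + MPI_Win_unlock timed.
 * Illustrative only; intended to run with exactly two processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 4096;    /* placeholders */
    int rank;
    char *winbuf, *sbuf;
    MPI_Win win;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sbuf = malloc(size);

    /* Each process exposes a window of 'size' bytes (default initialization: MPI_Win_allocate) */
    MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {                         /* origin process */
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            MPI_Put(sbuf, size, MPI_CHAR, 1, 0, size, MPI_CHAR, win);
            MPI_Win_unlock(1, win);          /* ensures completion of the put at the target */
        }
        printf("put latency: %.2f us\n", (MPI_Wtime() - t0) * 1e6 / iters);
    }
    MPI_Barrier(MPI_COMM_WORLD);             /* keep the target alive until the origin finishes */

    MPI_Win_free(&win);
    free(sbuf);
    MPI_Finalize();
    return 0;
}
```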

CUDA and OpenACC Extensions to OSU Micro Benchmarks

  • The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers on NVIDIA GPU devices.
    • osu_bibw - Bidirectional Bandwidth Test
    • osu_bw - Bandwidth Test
    • osu_latency - Latency Test
    • osu_allgather - MPI_Allgather Latency Test
    • osu_allgatherv - MPI_Allgatherv Latency Test
    • osu_allreduce - MPI_Allreduce Latency Test
    • osu_alltoall - MPI_Alltoall Latency Test
    • osu_alltoallv - MPI_Alltoallv Latency Test
    • osu_bcast - MPI_Bcast Latency Test
    • osu_gather - MPI_Gather Latency Test
    • osu_gatherv - MPI_Gatherv Latency Test
    • osu_reduce - MPI_Reduce Latency Test
    • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
    • osu_scatter - MPI_Scatter Latency Test
    • osu_scatterv - MPI_Scatterv Latency Test
  • The CUDA extensions are enabled when the benchmark suite is configured with the --enable-cuda option. The OpenACC extensions are enabled when --enable-openacc is specified. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time.
  • Each of the pt2pt benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each of these parameters can be either 'H' or 'D' to indicate whether the buffers are to be on the host or on the device, respectively. When no parameters are specified, the buffers are allocated on the host.
  • The collective benchmarks will use buffers allocated on the device if the -d option is used; otherwise, the buffers will be allocated on the host. A hypothetical sketch of this host/device buffer selection follows.
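
The sketch below shows how a benchmark might honor the 'H'/'D' placement arguments described above when built with CUDA support. It is not taken from the OMB sources: the alloc_buffer helper and the _ENABLE_CUDA_ guard are assumed names chosen for illustration only.

```c
/* Hypothetical sketch of host vs. device buffer allocation for a pt2pt benchmark.
 * Names here (alloc_buffer, _ENABLE_CUDA_) are illustrative, not from OMB. */
#include <stdlib.h>
#ifdef _ENABLE_CUDA_          /* assumed build-time guard, set when configured with --enable-cuda */
#include <cuda_runtime.h>
#endif

/* Allocate a communication buffer on the host ('H') or on the GPU ('D'). */
static void *alloc_buffer(char where, size_t bytes)
{
#ifdef _ENABLE_CUDA_
    if (where == 'D') {
        void *dptr = NULL;
        if (cudaMalloc(&dptr, bytes) != cudaSuccess)
            return NULL;
        return dptr;          /* a CUDA-aware MPI can pass this pointer directly to MPI calls */
    }
#endif
    (void)where;
    return malloc(bytes);     /* default: host memory */
}
```

Per the parameter description above, running a pt2pt benchmark as, for example, "osu_latency D H" would place rank 0's buffers on the device and rank 1's buffers on the host.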

Point-to-Point OpenSHMEM Benchmarks

  • osu_oshm_put - Latency Test for OpenSHMEM Put Routine
  • This benchmark measures the latency of a shmem_putmem operation for different data sizes. The user is required to select, through a parameter, whether the communication buffers should be allocated in global memory or heap memory. The test requires exactly two PEs. PE 0 issues shmem_putmem to write data at PE 1 and then calls shmem_quiet. This is repeated for a fixed number of iterations, depending on the data size. The average latency per iteration is reported. A few warm-up iterations are run without timing to ignore any start-up overheads. Both PEs call shmem_barrier_all after the test for each message size. A simplified sketch of this loop appears after this list.
  • osu_oshm_get - Latency Test for OpenSHMEM Get Routine
  • This benchmark is similar to the one above except that PE 0 does a shmem_getmem operation to read data from PE 1 in each iteration. The average latency per iteration is reported.
  • osu_oshm_put_mr - Message Rate Test for OpenSHMEM Put Routine
  • This benchmark measures the aggregate uni-directional operation rate of OpenSHMEM Put between pairs of PEs, for different data sizes. The user can select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. This test requires the number of PEs to be even. The PEs are paired, with PE 0 pairing with PE n/2 and so on, where n is the total number of PEs. The first PE in each pair issues back-to-back shmem_putmem operations to its peer PE. The total time for the put operations is measured and the operation rate per second is reported. All PEs call shmem_barrier_all after the test for each message size.
  • osu_oshm_atomics - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
  • This benchmark measures the performance of the atomic fetch-and-operate and atomic operate routines supported in OpenSHMEM for the integer datatype. The buffers can be selected to be in heap memory or global memory. The PEs are paired as in the Put Operation Rate benchmark, and the first PE in each pair issues back-to-back atomic operations of a given type to its peer PE. The average latency per atomic operation and the aggregate operation rate are reported. This is repeated for each of the fadd, finc, add, inc, cswap and swap routines.
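
The sketch below illustrates the put latency loop described for osu_oshm_put. It is not the OMB source; it uses OpenSHMEM 1.2-style names (shmem_init, shmem_my_pe, shmem_malloc), whereas older implementations spell these start_pes and shmalloc, and the iteration count and message size are placeholders.

```c
/* Sketch of the OpenSHMEM put latency pattern (osu_oshm_put-style). Illustrative only. */
#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    const int iters = 1000;                  /* placeholder iteration count */
    const size_t size = 4096;                /* placeholder message size */
    shmem_init();
    int me = shmem_my_pe();

    /* Symmetric heap allocation; the real benchmark can also use global (data-segment) buffers */
    char *src = shmem_malloc(size);
    char *dst = shmem_malloc(size);

    shmem_barrier_all();
    double t0 = now_us();
    if (me == 0) {
        for (int i = 0; i < iters; i++) {
            shmem_putmem(dst, src, size, 1); /* write into PE 1's symmetric buffer */
            shmem_quiet();                   /* wait for completion of the put */
        }
        printf("put latency: %.2f us\n", (now_us() - t0) / iters);
    }
    shmem_barrier_all();                     /* both PEs synchronize after each message size */

    shmem_free(src);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```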

Collective OpenSHMEM Benchmarks

  • osu_oshm_collect - OpenSHMEM Collect Latency Test
  • osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
  • osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
  • osu_oshm_reduce - OpenSHMEM Reduce Latency Test
  • osu_oshm_barrier - OpenSHMEM Barrier Latency Test
  • Collective Latency Tests
  • The latest OMB version includes benchmarks for various OpenSHMEM collective operations (shmem_collect, shmem_fcollect, shmem_broadcast, shmem_reduce and shmem_barrier). These benchmarks work in the following manner. Suppose a user runs the osu_oshm_broadcast benchmark with N processes; the benchmark measures the min, max and average latency of the shmem_broadcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics, such as the min and max latencies and the number of iterations. The "-m" option can be used to set the maximum message length used in a benchmark; by default the benchmarks report latencies for message lengths up to 1MB. "-i" can be used to set the number of iterations to run for each message length. A simplified sketch of this measurement pattern is shown below.
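
The sketch below shows the OpenSHMEM collective measurement pattern using a 32-bit broadcast as the example. It is not the OMB source: it uses OpenSHMEM 1.2-style constants (SHMEM_BCAST_SYNC_SIZE, SHMEM_SYNC_VALUE), the element count and iteration count are placeholders, and only the PE 0 timing is printed, whereas the real benchmark also reports min/max/average across PEs and sweeps message sizes.

```c
/* Sketch of the OpenSHMEM collective latency pattern (osu_oshm_broadcast-style). Illustrative only. */
#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

#define NELEMS 256                            /* placeholder: 32-bit elements per broadcast */

/* pSync must be a symmetric object, initialized before first use */
static long pSync[SHMEM_BCAST_SYNC_SIZE];

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    const int iters = 1000;                   /* placeholder */
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;

    int *src = shmem_malloc(NELEMS * sizeof(int));
    int *dst = shmem_malloc(NELEMS * sizeof(int));
    shmem_barrier_all();                      /* ensure pSync is initialized everywhere */

    double total = 0.0;
    for (int i = 0; i < iters; i++) {
        double t0 = now_us();
        /* root PE 0 broadcasts to the active set covering PEs 0..npes-1 (stride 0) */
        shmem_broadcast32(dst, src, NELEMS, 0, 0, 0, npes, pSync);
        total += now_us() - t0;
        shmem_barrier_all();                  /* make pSync safe to reuse in the next iteration */
    }
    if (me == 0)
        printf("broadcast latency: %.2f us\n", total / iters);

    shmem_free(src);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```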

Point-to-Point Unified Parallel C (UPC) Benchmarks

  • osu_upc_memput - Latency Test for UPC Put Routine
  • This benchmark measures the latency of the UPC put operation between multiple UPC threads. In this benchmark, UPC threads with ranks less than (THREADS/2) issue upc_memput operations to peer UPC threads. Peer threads are identified as (MYTHREAD+THREADS/2). This is repeated for a fixed number of iterations, for varying data sizes. The average latency per iteration is reported. A few warm-up iterations are run without timing to ignore any start-up overheads. All UPC threads call upc_barrier after the test for each message size.
  • osu_upc_memget - Latency Test for UPC Get Routine
  • This benchmark is similar to the UPC put benchmark described above. The difference is that the shared string handling function used is upc_memget. The average get operation latency per iteration is reported.

Collective Unified Parallel C (UPC) Benchmarks

  • osu_upc_all_broadcast - UPC Broadcast Latency Test
  • osu_upc_all_scatter - UPC Scatter Latency Test
  • osu_upc_all_gather - UPC Gather Latency Test
  • osu_upc_all_gather_all - UPC GatherAll Latency Test
  • osu_upc_all_exchange - UPC Exchange Latency Test
  • Collective Latency Tests
  • The latest OMB version includes benchmarks for various UPC collective operations (upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, and upc_all_exchange). These benchmarks work in the following manner. Suppose a user runs the osu_upc_all_broadcast benchmark with N processes; the benchmark measures the min, max and average latency of the upc_all_broadcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics, such as the min and max latencies and the number of iterations. The "-m" option can be used to set the maximum message length used in a benchmark; by default the benchmarks report latencies for message lengths up to 1MB. "-i" can be used to set the number of iterations to run for each message length.

Please note that there are many different ways to measure these performance parameters. For example, the bandwidth test can have different variations with respect to the types of MPI calls (blocking vs. non-blocking) being used, the total number of back-to-back messages sent in one iteration, the number of iterations, etc. Other ways of measuring bandwidth may give different numbers. Readers are welcome to use other tests, as appropriate to their application environments.