Please note that there are many different ways to measure these performance parameters. For example, the bandwidth test can have different variations regarding the types of MPI calls (blocking vs. non-blocking) being used, total number of back-to-back messages sent in one iteration, number of iterations, etc. Other ways to measure bandwidth may give different numbers. Readers are welcome to use other tests, as appropriate to their application environments.

All C benchmarks can validate the correctness of the exchanged data through built-in data validation schemes, in addition to evaluating communication performance.
  • osu_latency - Latency Test
  • The latency tests are carried out in a ping-pong fashion. The sender sends a message with a certain data size to the receiver and waits for a reply. The receiver receives the message and sends back a reply with the same data size. Many iterations of this ping-pong test are carried out and the average one-way latency numbers are obtained. The blocking versions of the MPI functions (MPI_Send and MPI_Recv) are used in the tests.
  • osu_latency_mt - Multi-threaded Latency Test
  • The multi-threaded latency test performs a ping-pong test with a single sender process and multiple threads on the receiving process. In this test the sending process sends a message of a given data size to the receiver and waits for a reply from the receiver process. The receiving process has a variable number of receiving threads (2 by default), where each thread calls MPI_Recv and, upon receiving a message, sends back a response of equal size. Many iterations are performed and the average one-way latency numbers are reported. Users can change the number of communicating threads with the "-t" runtime option. Examples:
    • -t 4 // receiver threads = 4 and sender threads = 1
    • -t 4:6 // sender threads = 4 and receiver threads = 6
    • -t 2: // not defined
  • osu_latency_mp - Multi-process Latency Test
  • The multi-process latency test performs a ping-pong test with a single sender process and a single receiver process, each of which has one or more child processes spawned using the fork() system call. In this test the sending (parent) process sends a message of a given data size to the receiving (parent) process and waits for a reply. Both the sending and receiving processes have a variable number of child processes (1 by default); each child process sleeps for 2 seconds after the fork call and then exits. The parent processes carry out the ping-pong test over many iterations, and the average one-way latency numbers are reported. The "-t" option sets the number of sender and receiver processes, including the parent processes, to be used in a benchmark. Examples:
    • -t 4 // receiver processes = 4 and sender processes = 1
    • -t 4:6 // sender processes = 4 and receiver processes = 6
    • -t 2: // not defined
    The purpose of this test is to check whether the underlying MPI communication runtime takes care of fork safety even if the application does not. The environment variable "MV2_SUPPORT_FORK_SAFETY" was introduced with MVAPICH2 2.3.4 to make MVAPICH2 take care of fork safety for applications that require it. Fork safety support is disabled by default in MVAPICH2 for performance reasons. When running osu_latency_mp with MVAPICH2, set the environment variable MV2_SUPPORT_FORK_SAFETY to 1. When running osu_latency_mp with other MPI libraries that do not support fork safety, set the environment variable RDMAV_FORK_SAFE or IBV_FORK_SAFE to 1.
  • osu_bw - Bandwidth Test
  • The bandwidth tests are carried out by having the sender send a fixed number (equal to the window size) of back-to-back messages to the receiver and then wait for a reply from the receiver. The receiver sends the reply only after receiving all these messages. This process is repeated for several iterations, and the bandwidth is calculated from the elapsed time (from the time the sender sends the first message until the time it receives the reply from the receiver) and the number of bytes sent by the sender. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. For this reason, the non-blocking versions of the MPI functions (MPI_Isend and MPI_Irecv) are used in the test.
  • osu_bibw - Bidirectional Bandwidth Test
  • The bidirectional bandwidth test is similar to the bandwidth test, except that both nodes involved send out a fixed number of back-to-back messages and wait for the reply. This test measures the maximum sustainable aggregate bandwidth between two nodes.
  • osu_mbw_mr - Multiple Bandwidth / Message Rate Test
  • The multi-pair bandwidth and message rate test evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver. This process is repeated for several iterations. The objective of this benchmark is to determine the achieved bandwidth and message rate from one node to another node with a configurable number of processes running on each node.
  • osu_multi_lat - Multi-pair Latency Test
  • This test is very similar to the latency test, except that multiple pairs of processes perform the same test simultaneously. To run the test across just two nodes, the hostnames must be specified in block fashion.
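The arithmetic behind the latency and bandwidth tests described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the OMB source: the function names are hypothetical, and the choice of microseconds and MB/s (with 1 MB taken as 10^6 bytes) as reporting units is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Average one-way latency in microseconds for a ping-pong test.
 * Each iteration is one full round trip, so the per-iteration time
 * is halved to obtain the one-way figure. */
double one_way_latency_us(double elapsed_sec, int iterations)
{
    assert(iterations > 0);
    return elapsed_sec * 1e6 / (2.0 * iterations);
}

/* Bandwidth in MB/s (1 MB = 1e6 bytes, an assumption of this sketch).
 * The sender posts window_size back-to-back messages per iteration,
 * so the bytes sent are msg_bytes * window_size * iterations. */
double bandwidth_mb_per_sec(size_t msg_bytes, int window_size,
                            int iterations, double elapsed_sec)
{
    assert(elapsed_sec > 0.0);
    double total_bytes = (double)msg_bytes * window_size * iterations;
    return total_bytes / 1e6 / elapsed_sec;
}
```

For instance, one million round trips completed in 2 seconds correspond to an average one-way latency of 1 microsecond.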
The latest OMB version includes benchmarks for various MPI blocking collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter and vector collectives). These benchmarks work as follows. When users run the osu_bcast benchmark with N processes, for example, the benchmark measures the min, max and average latency of the MPI_Bcast collective operation across the N processes, for various message lengths, over a large number of iterations. By default, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options:
  • "-f" reports additional statistics of the benchmark, such as min and max latencies and the number of iterations.
  • "-m" sets the minimum and maximum message length to be used in a benchmark. By default, the benchmarks report latencies for message lengths up to 1MB. Examples:
    • -m 128 // min = default, max = 128
    • -m 2:128 // min = 2, max = 128
    • -m 2: // min = 2, max = default
  • "-x" sets the number of warmup iterations to skip for each message length.
  • "-i" sets the number of iterations to run for each message length.
  • "-M" sets the per-process maximum memory consumption. By default, the benchmarks are limited to 512MB allocations.
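The three "-m" forms shown in the examples can be parsed with a few lines of C. The sketch below is an illustration only, not the OMB implementation: the function name parse_size_range is hypothetical, the caller is assumed to preload the defaults, and rejecting any form not listed in the examples (such as ":128") is a choice made here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse an "-m" style argument into *min_len and *max_len.
 * Accepted forms: "MAX", "MIN:MAX" and "MIN:".
 * A value that is not given keeps the default the caller preloaded.
 * Returns 0 on success and -1 on malformed input; the outputs are
 * only modified on success. */
int parse_size_range(const char *arg, long *min_len, long *max_len)
{
    assert(arg != NULL && min_len != NULL && max_len != NULL);

    long lo = *min_len, hi = *max_len;
    const char *colon = strchr(arg, ':');
    const char *max_str = colon ? colon + 1 : arg;
    char *end;

    if (colon != NULL) {
        /* Text before the colon must be a complete number. */
        lo = strtol(arg, &end, 10);
        if (end == arg || end != colon)
            return -1;
    }
    if (*max_str != '\0') {
        hi = strtol(max_str, &end, 10);
        if (end == max_str || *end != '\0')
            return -1;
    } else if (colon == NULL) {
        return -1;              /* empty argument */
    }
    if (lo < 0 || hi < lo)
        return -1;

    *min_len = lo;
    *max_len = hi;
    return 0;
}
```

A caller would set the two variables to its defaults first (the 1MB maximum comes from the text; the minimum default is not stated there and has to be chosen by the caller) and then pass the option string, for example parse_size_range("2:128", &min_len, &max_len).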

  • osu_allgather - MPI_Allgather Latency Test
  • osu_allgatherv - MPI_Allgatherv Latency Test
  • osu_allreduce - MPI_Allreduce Latency Test
  • osu_alltoall - MPI_Alltoall Latency Test
  • osu_alltoallv - MPI_Alltoallv Latency Test
  • osu_barrier - MPI_Barrier Latency Test
  • osu_bcast - MPI_Bcast Latency Test
  • osu_gather - MPI_Gather Latency Test
  • osu_gatherv - MPI_Gatherv Latency Test
  • osu_reduce - MPI_Reduce Latency Test
  • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
  • osu_scatter - MPI_Scatter Latency Test
  • osu_scatterv - MPI_Scatterv Latency Test
    In addition to the blocking collective latency tests mentioned above, we provide several non-blocking collectives (NBC): MPI_Iallgather, MPI_Iallgatherv, MPI_Iallreduce, MPI_Ialltoall, MPI_Ialltoallv, MPI_Ialltoallw, MPI_Ibarrier, MPI_Ibcast, MPI_Igather, MPI_Igatherv, MPI_Ireduce, MPI_Iscatter, and MPI_Iscatterv. These evaluate the same metrics as the blocking operations, plus the additional metric "overlap", defined as the amount of computation that can be performed while the communication progresses in the background. These benchmarks have the following additional options:
    • "-t" sets the number of MPI_Test() calls during the dummy computation; CALLS can be set to 100, 1000, or any number > 0.
    • "-r" sets the target for the dummy computation that imitates the effect of useful computation that can be overlapped with the communication. Because CUDA-Aware support is provided for NBC as well, this option can be set to CPU, GPU, or BOTH.
  • osu_iallgather - MPI_Iallgather Latency Test
  • osu_iallgatherv - MPI_Iallgatherv Latency Test
  • osu_iallreduce - MPI_Iallreduce Latency Test
  • osu_ialltoall - MPI_Ialltoall Latency Test
  • osu_ialltoallv - MPI_Ialltoallv Latency Test
  • osu_ialltoallw - MPI_Ialltoallw Latency Test
  • osu_ibarrier - MPI_Ibarrier Latency Test
  • osu_ibcast - MPI_Ibcast Latency Test
  • osu_igather - MPI_Igather Latency Test
  • osu_igatherv - MPI_Igatherv Latency Test
  • osu_ireduce - MPI_Ireduce Latency Test
  • osu_iscatter - MPI_Iscatter Latency Test
  • osu_iscatterv - MPI_Iscatterv Latency Test
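The "overlap" metric of the non-blocking collective tests can be expressed as a percentage. The formula below is one plausible formulation and an assumption of this sketch, not a quotation from the OMB source; the function name and parameter names are hypothetical. It treats any time the combined run takes beyond the pure computation time as exposed (non-overlapped) communication.

```c
#include <assert.h>

/* Percentage of communication time hidden behind computation.
 *   t_pure_comm: time of the non-blocking collective run on its own
 *   t_compute:   time of the dummy computation run on its own
 *   t_total:     time of collective and computation run together
 * The result is clamped to the range [0, 100]. */
double overlap_percent(double t_pure_comm, double t_compute, double t_total)
{
    assert(t_pure_comm > 0.0);

    /* Time the combined run spent beyond the computation alone. */
    double exposed = t_total - t_compute;
    if (exposed < 0.0)
        exposed = 0.0;

    double pct = 100.0 * (1.0 - exposed / t_pure_comm);
    return pct < 0.0 ? 0.0 : pct;
}
```

With this definition, a combined run that takes no longer than the computation alone yields 100, and a combined run that takes the sum of both times yields 0.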
  • osu_put_latency - Latency Test for Put with Active/Passive Synchronization
  • The put latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process calls MPI_Put to directly place data of a certain size in the remote process's window and then waits on a synchronization call (MPI_Win_complete) for completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. Several iterations of this test are carried out and the average put latency is reported. The latency includes the synchronization time. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Put to directly place data of a certain size in the window. It then calls MPI_Win_unlock to ensure completion of the Put and release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Put + MPI_Win_unlock calls is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
    • "-x" sets the number of warmup iterations to skip for each message length.
    • "-i" sets the number of iterations to run for each message length.
  • osu_get_latency - Latency Test for Get with Active/Passive Synchronization
  • The get latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process calls MPI_Get to directly fetch data of a certain size from the target process's window into a local buffer. It then waits on a synchronization call (MPI_Win_complete) for local completion of the Gets. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. Several iterations of this test are carried out and the average get latency is reported. The latency includes the synchronization time. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Get to directly read data of a certain size from the window. It then calls MPI_Win_unlock to ensure completion of the Get and release the lock on the remote window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Get + MPI_Win_unlock calls is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization
  • The put bandwidth benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process issues a fixed number of back-to-back MPI_Puts on the remote window and then waits on a synchronization call (MPI_Win_complete) for their completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. This process is repeated for several iterations and the bandwidth is calculated from the elapsed time and the number of bytes put by the origin process. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and issues a fixed number of back-to-back MPI_Puts to directly place data in the window. It then calls MPI_Win_unlock to ensure completion of the Puts and release the lock on the remote window. This process is repeated for several iterations and the bandwidth is calculated from the elapsed time and the number of bytes put by the origin process. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization
  • The get bandwidth benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process issues a fixed number of back-to-back MPI_Gets and then waits on a synchronization call (MPI_Win_complete) for their completion. The remote process participates in synchronization with MPI_Win_post and MPI_Win_wait calls. This process is repeated for several iterations and the bandwidth is calculated from the elapsed time and the number of bytes received by the origin process. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and issues a fixed number of back-to-back MPI_Gets to directly get data from the window. It then calls MPI_Win_unlock to ensure completion of the Gets and release the lock on the window. This process is repeated for several iterations and the bandwidth is calculated from the elapsed time and the number of bytes read by the origin process. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_put_bibw - Bi-directional Bandwidth Test for Put with Active Synchronization
  • The put bi-directional bandwidth benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_post/start/complete/wait and MPI_Win_fence). This test is similar to the bandwidth test, except that both processes involved issue a fixed number of back-to-back MPI_Puts and wait for their completion. This test measures the maximum sustainable aggregate bandwidth between two processes. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_post/start/complete/wait. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
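The "-w" and "-s" values accepted by the one-sided benchmarks map onto MPI calls in a fixed way, which can be captured as a pair of lookup tables. The sketch below is illustrative only: the struct and function names are hypothetical and do not come from the OMB source, and only the option values and MPI call names are taken from the descriptions. Not every benchmark accepts every value; the bi-directional put test, for instance, only lists pscw and fence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct rma_option {
    const char *value;      /* value passed after -w or -s */
    const char *mpi_calls;  /* MPI call(s) that the value selects */
};

/* Values for "-w": how the MPI window object is created. */
static const struct rma_option window_options[] = {
    {"create",   "MPI_Win_create"},
    {"allocate", "MPI_Win_allocate"},
    {"dynamic",  "MPI_Win_create_dynamic"},
};

/* Values for "-s": which synchronization calls are used. */
static const struct rma_option sync_options[] = {
    {"lock",        "MPI_Win_lock/unlock"},
    {"flush",       "MPI_Win_flush"},
    {"flush_local", "MPI_Win_flush_local"},
    {"lock_all",    "MPI_Win_lock_all/unlock_all"},
    {"pscw",        "MPI_Win_post/start/complete/wait"},
    {"fence",       "MPI_Win_fence"},
};

/* Linear search; returns NULL when the value is not in the table. */
static const char *lookup(const struct rma_option *table, size_t count,
                          const char *value)
{
    assert(value != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].value, value) == 0)
            return table[i].mpi_calls;
    }
    return NULL;
}

const char *window_calls(const char *value)
{
    return lookup(window_options,
                  sizeof window_options / sizeof window_options[0], value);
}

const char *sync_calls(const char *value)
{
    return lookup(sync_options,
                  sizeof sync_options / sizeof sync_options[0], value);
}
```

A NULL return signals an unrecognized value, which a command-line front end could turn into a usage message.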
  • osu_acc_latency - Latency Test for Accumulate with Active/Passive Synchronization
  • The accumulate latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process calls MPI_Accumulate to combine data from the local buffer with the data in the remote window and store the result in the remote window. The combining operation used in the test is MPI_SUM. The origin process then waits on a synchronization call (MPI_Win_complete) for completion of the operations. The remote process waits on an MPI_Win_wait call. Several iterations of this test are carried out and the average accumulate latency is obtained. The latency includes the synchronization time. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Accumulate to combine data from a local buffer with the data in the remote window and store the result in the remote window. It then calls MPI_Win_unlock to ensure completion of the Accumulate and release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Accumulate + MPI_Win_unlock calls is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_cas_latency - Latency Test for Compare and Swap with Active/Passive Synchronization
  • The Compare_and_swap latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process calls MPI_Compare_and_swap to place one element from the origin buffer into the target buffer. The initial value in the target buffer is returned to the calling process. The origin process then waits on a synchronization call (MPI_Win_complete) for local completion of the operations. The remote process waits on an MPI_Win_wait call. Several iterations of this test are carried out and the average Compare_and_swap latency is obtained. The latency includes the synchronization time. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Compare_and_swap to place one element from the origin buffer into the target buffer. The initial value in the target buffer is returned to the calling process. It then calls MPI_Win_flush to ensure completion of the Compare_and_swap. Finally, it calls MPI_Win_unlock to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Compare_and_swap + MPI_Win_flush calls is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization
  • The Fetch_and_op latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process calls MPI_Fetch_and_op to increase the element in the target buffer by 1. The initial value from the target buffer is returned to the calling process. The origin process waits on a synchronization call (MPI_Win_complete) for completion of the operations. The remote process waits on an MPI_Win_wait call. Several iterations of this test are carried out and the average Fetch_and_op latency is obtained. The latency includes the synchronization time. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Fetch_and_op to increase the element in the target buffer by 1. The initial value in the target buffer is returned to the calling process. It then calls MPI_Win_flush to ensure completion of the Fetch_and_op. Finally, it calls MPI_Win_unlock to release the lock on the window. This is carried out for several iterations and the average time for the MPI_Fetch_and_op + MPI_Win_flush calls is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization
  • The Get_accumulate latency benchmark includes window initialization operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush, MPI_Win_flush_local, MPI_Win_lock_all/unlock_all, MPI_Win_post/start/complete/wait and MPI_Win_fence). For active synchronization, for example with MPI_Win_post/start/complete/wait, the origin process calls MPI_Get_accumulate to combine data from the local buffer with the data in the remote window and store the result in the remote window. The combining operation used in the test is MPI_SUM. The initial value from the target buffer is returned to the calling process. The origin process waits on a synchronization call (MPI_Win_complete) for local completion of the operations. The remote process waits on an MPI_Win_wait call. Several iterations of this test are carried out and the average get accumulate latency is obtained. The latency includes the synchronization time. For passive synchronization, for example with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the target process's window and calls MPI_Get_accumulate to combine data from a local buffer with the data in the remote window and store the result in the remote window. The initial value from the target buffer is returned to the calling process. It then calls MPI_Win_unlock to ensure completion of the Get_accumulate and release the lock on the window. This is carried out for several iterations and the average time for the MPI_Win_lock + MPI_Get_accumulate + MPI_Win_unlock calls is measured. The default window initialization and synchronization operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following options:
    • "-w create" uses MPI_Win_create to create an MPI window object.
    • "-w allocate" uses MPI_Win_allocate to create an MPI window object.
    • "-w dynamic" uses MPI_Win_create_dynamic to create an MPI window object.
    • "-s lock" uses MPI_Win_lock/unlock synchronization calls.
    • "-s flush" uses the MPI_Win_flush synchronization call.
    • "-s flush_local" uses the MPI_Win_flush_local synchronization call.
    • "-s lock_all" uses MPI_Win_lock_all/unlock_all synchronization calls.
    • "-s pscw" uses Post/Start/Complete/Wait synchronization calls.
    • "-s fence" uses the MPI_Win_fence synchronization call.
  • osu_init - This benchmark measures the minimum, maximum, and average time each process takes to complete MPI_Init.
  • osu_hello - This benchmark measures the time it takes for all processes to execute MPI_Init + MPI_Finalize.
  • The CUDA extensions are enabled when the benchmark suite is configured with the --enable-cuda option. The OpenACC extensions are enabled when --enable-openacc is specified. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time.
  • Each of the pt2pt benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each of these parameters can be either 'H' or 'D' to indicate if the buffers are to be on the host or on the device respectively. When no parameters are specified, the buffers are allocated on the host.
  • The collective benchmarks use buffers allocated on the device if the -d option is used; otherwise, the buffers are allocated on the host.
  • The non-blocking collective benchmarks can also use the -t option to set the number of MPI_Test() calls and the -r option to set the target of the dummy computation.
  • The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers on NVIDIA and AMD GPU devices.
    • osu_bibw - Bidirectional Bandwidth Test
    • osu_bw - Bandwidth Test
    • osu_latency - Latency Test
    • osu_mbw_mr - Multiple Bandwidth / Message Rate Test
    • osu_multi_lat - Multi-pair Latency Test
    • osu_put_latency - Latency Test for Put
    • osu_get_latency - Latency Test for Get
    • osu_put_bw - Bandwidth Test for Put
    • osu_get_bw - Bandwidth Test for Get
    • osu_put_bibw - Bidirectional Bandwidth Test for Put
    • osu_acc_latency - Latency Test for Accumulate
    • osu_cas_latency - Latency Test for Compare and Swap
    • osu_fop_latency - Latency Test for Fetch and Op
    • osu_allgather - MPI_Allgather Latency Test
    • osu_allgatherv - MPI_Allgatherv Latency Test
    • osu_allreduce - MPI_Allreduce Latency Test
    • osu_alltoall - MPI_Alltoall Latency Test
    • osu_alltoallv - MPI_Alltoallv Latency Test
    • osu_bcast - MPI_Bcast Latency Test
    • osu_gather - MPI_Gather Latency Test
    • osu_gatherv - MPI_Gatherv Latency Test
    • osu_reduce - MPI_Reduce Latency Test
    • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
    • osu_scatter - MPI_Scatter Latency Test
    • osu_scatterv - MPI_Scatterv Latency Test
    • osu_iallgather - MPI_Iallgather Latency Test
    • osu_iallgatherv - MPI_Iallgatherv Latency Test
    • osu_iallreduce - MPI_Iallreduce Latency Test
    • osu_ialltoall - MPI_Ialltoall Latency Test
    • osu_ialltoallv - MPI_Ialltoallv Latency Test
    • osu_ialltoallw - MPI_Ialltoallw Latency Test
    • osu_ibcast - MPI_Ibcast Latency Test
    • osu_igather - MPI_Igather Latency Test
    • osu_igatherv - MPI_Igatherv Latency Test
    • osu_ireduce - MPI_Ireduce Latency Test
    • osu_iscatter - MPI_Iscatter Latency Test
    • osu_iscatterv - MPI_Iscatterv Latency Test
    In addition to supporting communication to and from GPU memory allocated using CUDA or OpenACC, the benchmarks can now communicate to and from buffers allocated using CUDA Managed Memory. CUDA Managed (or Unified) Memory allows applications to allocate memory in either CPU or GPU memory using the cudaMallocManaged() call, so that the buffer is transferred between the CPU and the GPU without user involvement. Currently, benchmarking with CUDA Managed Memory is offered for the tests listed below. These benchmarks have the following additional options:
    • "M" allocates a send or receive buffer as managed for point-to-point communication.
    • "-d managed" uses managed memory buffers to perform collective communications.
    The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers allocated using CUDA Managed Memory.
    • osu_bibw - Bidirectional Bandwidth Test
    • osu_bw - Bandwidth Test
    • osu_latency - Latency Test
    • osu_mbw_mr - Multiple Bandwidth / Message Rate Test
    • osu_multi_lat - Multi-pair Latency Test
    • osu_allgather - MPI_Allgather Latency Test
    • osu_allgatherv - MPI_Allgatherv Latency Test
    • osu_allreduce - MPI_Allreduce Latency Test
    • osu_alltoall - MPI_Alltoall Latency Test
    • osu_alltoallv - MPI_Alltoallv Latency Test
    • osu_bcast - MPI_Bcast Latency Test
    • osu_gather - MPI_Gather Latency Test
    • osu_gatherv - MPI_Gatherv Latency Test
    • osu_reduce - MPI_Reduce Latency Test
    • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
    • osu_scatter - MPI_Scatter Latency Test
    • osu_scatterv - MPI_Scatterv Latency Test
  • osu_oshm_put - Latency Test for OpenSHMEM Put Routine
  • This benchmark measures the latency of a shmem_putmem operation for different data sizes. The user is required to select, through a parameter, whether the communication buffers should be allocated in global memory or heap memory. The test requires exactly two PEs. PE 0 issues shmem_putmem to write data at PE 1 and then calls shmem_quiet. This is repeated for a fixed number of iterations, depending on the data size. The average latency per iteration is reported. A few warm-up iterations are run without timing to exclude any start-up overheads. Both PEs call shmem_barrier_all after the test for each message size.
  • osu_oshm_put_nb - Latency Test for OpenSHMEM Non-blocking Put Routine
  • This benchmark measures the non-blocking latency of a shmem_putmem_nbi operation for different data sizes. The user is required to select, through a parameter, whether the communication buffers should be allocated in global memory or heap memory. The test requires exactly two PEs. PE 0 issues shmem_putmem_nbi to write data at PE 1 and then calls shmem_quiet. This is repeated for a fixed number of iterations, depending on the data size. The average latency per iteration is reported. A few warm-up iterations are run without timing to exclude any start-up overheads. Both PEs call shmem_barrier_all after the test for each message size.
  • osu_oshm_get - Latency Test for OpenSHMEM Get Routine
  • This benchmark is similar to the put latency benchmark (osu_oshm_put) except that PE 0 does a shmem_getmem operation to read data from PE 1 in each iteration. The average latency per iteration is reported.
  • osu_oshm_get_nb - Latency Test for OpenSHMEM Non-blocking Get Routine
  • This benchmark is similar to the non-blocking put latency benchmark (osu_oshm_put_nb) except that PE 0 does a shmem_getmem_nbi operation to read data from PE 1 in each iteration. The average latency per iteration is reported.
  • osu_oshm_put_mr - Message Rate Test for OpenSHMEM Put Routine
  • This benchmark measures the aggregate uni-directional operation rate of OpenSHMEM Put between pairs of PEs, for different data sizes. The user should select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. This test requires the number of PEs to be even. The PEs are paired, with PE 0 pairing with PE n/2 and so on, where n is the total number of PEs. The first PE in each pair issues back-to-back shmem_putmem operations to its peer PE. The total time for the put operations is measured and the operation rate per second is reported. All PEs call shmem_barrier_all after the test for each message size.
  • osu_oshm_put_mr_nb - Message Rate Test for Non-blocking OpenSHMEM Put Routine
  • This benchmark measures the aggregate uni-directional operation rate of OpenSHMEM Non-blocking Put between pairs of PEs, for different data sizes. The user should select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. This test requires the number of PEs to be even. The PEs are paired, with PE 0 pairing with PE n/2 and so on, where n is the total number of PEs. The first PE in each pair issues back-to-back shmem_putmem_nbi operations to its peer PE, up to the window size. A call to shmem_quiet is placed after the window loop to ensure completion of the issued operations. The total time for the non-blocking put operations is measured and the operation rate per second is reported. All PEs call shmem_barrier_all after the test for each message size.
  • osu_oshm_get_mr_nb - Message Rate Test for Non-blocking OpenSHMEM Get Routine
  • This benchmark measures the aggregate uni-directional operation rate of OpenSHMEM Non-blocking Get between pairs of PEs, for different data sizes. The user should select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. This test requires the number of PEs to be even. The PEs are paired, with PE 0 pairing with PE n/2 and so on, where n is the total number of PEs. The first PE in each pair issues back-to-back shmem_getmem_nbi operations to its peer PE, up to the window size. A call to shmem_quiet is placed after the window loop to ensure completion of the issued operations. The total time for the non-blocking get operations is measured and the operation rate per second is reported. All PEs call shmem_barrier_all after the test for each message size.
  • osu_oshm_put_overlap - Non-blocking Message Rate Overlap Test
  • This benchmark measures the aggregate uni-directional operation rate overlap for OpenSHMEM Put between pairs of PEs, for different data sizes. The user should select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. Since the PEs are paired, this test requires the number of PEs to be even. The benchmark prints statistics for the different phases of communication, computation and overlap at the end.
  • osu_oshm_atomics - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
  • This benchmark measures the performance of the atomic fetch-and-operate and atomic operate routines supported in OpenSHMEM for the integer and long datatypes. The buffers can be selected to be in heap memory or global memory. The PEs are paired as in the Put Operation Rate benchmark, and the first PE in each pair issues back-to-back atomic operations of a type to its peer PE. The average latency per atomic operation and the aggregate operation rate are reported. This is repeated for each of the fadd, finc, add, inc, cswap, swap, set, and fetch routines.
  • Collective Latency Tests
  • The latest OMB version includes benchmarks for various OpenSHMEM collective operations (shmem_collect, shmem_fcollect, shmem_broadcast, shmem_reduce and shmem_barrier). These benchmarks work in the following manner. When users run the osu_oshm_broadcast benchmark with N processes, the benchmark measures the min, max and average latency of the shmem_broadcast collective operation across the N processes, for various message lengths, over a large number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations. "-m" can be used to set the maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for message lengths up to 1MB. "-i" can be used to set the number of iterations to run for each message length.
  • osu_oshm_collect - OpenSHMEM Collect Latency Test
  • osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
  • osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
  • osu_oshm_reduce - OpenSHMEM Reduce Latency Test
  • osu_oshm_barrier - OpenSHMEM Barrier Latency Test
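The pairing rule used by the OpenSHMEM message rate tests above (PE 0 with PE n/2, and so on) can be sketched as follows. This is a minimal illustration in Python rather than code taken from the benchmarks: the function names pair_pes and operation_rate are hypothetical, and the rate is assumed to be the total number of issued operations divided by the elapsed time.

```python
def pair_pes(n_pes):
    """Return (initiator, target) pairs where PE i is paired with PE i + n/2.

    Hypothetical helper: the benchmarks compute the peer inside each PE,
    this sketch simply lists all pairs for a given PE count.
    """
    if n_pes < 2 or n_pes % 2 != 0:
        raise ValueError("the message rate tests need an even number of PEs")
    half = n_pes // 2
    return [(pe, pe + half) for pe in range(half)]


def operation_rate(total_operations, elapsed_seconds):
    """Aggregate operations per second (assumed definition of the rate)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_operations / elapsed_seconds


if __name__ == "__main__":
    # With 8 PEs, PEs 0..3 issue operations to PEs 4..7.
    print(pair_pes(8))
    # 4 pairs, each issuing 1000 operations, timed at 0.5 seconds in total.
    print(operation_rate(4 * 1000, 0.5))
```

Only the first PE of each pair issues operations, which is why the measured rate is uni-directional.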
  • osu_upc_memput - Latency Test for UPC Put Routine
  • This benchmark measures the latency of the UPC put operation between multiple UPC threads. In this benchmark, UPC threads with ranks less than (THREADS/2) issue upc_memput operations to peer UPC threads. Peer threads are identified as (MYTHREAD+THREADS/2). This is repeated for a fixed number of iterations, for varying data sizes. The average latency per iteration is reported. A few warm-up iterations are run without timing to exclude any start-up overheads. All UPC threads call a UPC barrier after the test for each message size.
  • osu_upc_memget - Latency Test for UPC Get Routine
  • This benchmark is similar to the UPC put benchmark described above. The difference is that the shared string handling function used is upc_memget. The average get operation latency per iteration is reported.
  • Collective Latency Tests
  • The latest OMB version includes the following benchmarks for various UPC collective operations (osu_upc_all_barrier, osu_upc_all_broadcast, osu_upc_all_exchange, osu_upc_all_gather_all, osu_upc_all_gather, osu_upc_all_reduce, and osu_upc_all_scatter). These benchmarks work in the following manner. When users run the osu_upc_all_broadcast benchmark with N processes, the benchmark measures the min, max and average latency of the upc_all_broadcast collective operation across the N processes, for various message lengths, over a large number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations. "-m" can be used to set the maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for message lengths up to 1MB. "-i" can be used to set the number of iterations to run for each message length.
  • osu_upc_all_barrier - UPC Barrier Latency Test
  • osu_upc_all_broadcast - UPC Broadcast Latency Test
  • osu_upc_all_exchange - UPC Exchange Latency Test
  • osu_upc_all_gather_all - UPC GatherAll Latency Test
  • osu_upc_all_gather - UPC Gather Latency Test
  • osu_upc_all_reduce - UPC Reduce Latency Test
  • osu_upc_all_scatter - UPC Scatter Latency Test
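As a rough illustration of the min, max and average latencies that the collective tests report across N processes, the reduction can be sketched in Python. This is not the benchmarks' code: summarize_latencies is a hypothetical name, and the sketch assumes each process contributes a single average latency value (in microseconds) for a given message length.

```python
def summarize_latencies(per_process_us):
    """Reduce one latency value per process to min, max and average.

    Hypothetical helper; assumes per_process_us holds one averaged
    latency (in microseconds) for each of the N participating processes.
    """
    if not per_process_us:
        raise ValueError("at least one process latency is required")
    return {
        "min": min(per_process_us),
        "max": max(per_process_us),
        "avg": sum(per_process_us) / len(per_process_us),
    }


if __name__ == "__main__":
    # Four processes reporting their latency for one message length.
    print(summarize_latencies([2.0, 4.0, 6.0, 8.0]))
```

In the default mode only the average would be printed, while the min and max values correspond to the extra statistics enabled with "-f".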
  • osu_upcxx_async_copy_put - Latency Test for UPC++ Put
  • This benchmark measures the latency of async_copy (memput) operation between multiple UPC++ threads. In this benchmark, UPC++ threads with ranks less than (ranks()/2) copy data from their local memory to their peer thread’s memory using async_copy operation. By changing the source and destination buffers in async_copy, we can mimic the behavior of upc_memput and upc_memget. Peer threads are identified as (myrank()+ranks()/2). This is repeated for a fixed number of iterations, for varying data sizes. The average latency per iteration is reported. A few warm-up iterations are run without timing to ignore any start-up overheads. All UPC++ threads call barrier() function after the test for each message size.
  • osu_upcxx_async_copy_get - Latency Test for UPC++ Get
  • Similar to osu_upcxx_async_copy_put, this benchmark mimics the behavior of upc_memget and measures the latency of async_copy (memget) operation between multiple UPC++ threads. The only difference is that the source and destination buffers in async_copy are swapped. In this benchmark, UPC++ threads with ranks less than (ranks()/2) copy data from their peer thread's memory to their local memory using async_copy operation. The rest of the details are same as discussed above. The average get operation latency per iteration is reported.
  • Collective Latency Tests
  • The latest OMB version includes the following benchmarks for various UPC++ collective operations (upcxx_reduce, upcxx_bcast, upcxx_gather, upcxx_allgather, upcxx_alltoall, upcxx_scatter). These benchmarks work in the following manner. When users run the osu_upcxx_bcast benchmark with N processes, the benchmark measures the min, max and average latency of the upcxx_bcast collective operation across the N processes, for various message lengths, over a large number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations. "-m" can be used to set the maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for message lengths up to 1MB. "-i" can be used to set the number of iterations to run for each message length.
  • osu_upcxx_bcast - UPC++ Broadcast Latency Test
  • osu_upcxx_reduce - UPC++ Reduce Latency Test
  • osu_upcxx_allgather - UPC++ Allgather Latency Test
  • osu_upcxx_gather - UPC++ Gather Latency Test
  • osu_upcxx_scatter - UPC++ Scatter Latency Test
  • osu_upcxx_alltoall - UPC++ AlltoAll (exchange) Latency Test
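The timing pattern described for the latency tests above (a few untimed warm-up iterations followed by a timed loop whose average is reported) can be sketched in Python. This is an illustration under stated assumptions, not the benchmarks' implementation: average_latency_us is a hypothetical name, the operation is any callable standing in for a put or get, and time.perf_counter is assumed as the clock.

```python
import time


def average_latency_us(operation, iterations, skip):
    """Return the average time per call of operation, in microseconds.

    The first `skip` calls are warm-up iterations and are not timed,
    so that start-up overheads do not affect the reported average.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    for _ in range(skip):
        operation()  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(iterations):
        operation()  # timed iterations
    elapsed = time.perf_counter() - start
    return elapsed * 1e6 / iterations


if __name__ == "__main__":
    # Stand-in operation: copy a small buffer, as a placeholder for a put.
    source = bytes(1024)
    print(average_latency_us(lambda: bytearray(source), iterations=1000, skip=100))
```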
  • osu_nccl_latency - Latency Test
  • The latency tests are carried out in a ping-pong fashion. The sender sends a message with a certain data size to the receiver and waits for a reply from the receiver. The receiver receives the message from the sender and sends back a reply with the same data size. Many iterations of this ping-pong test are carried out and average one-way latency numbers are obtained. Non-blocking versions of the NCCL functions (ncclSend and ncclRecv) are used in the tests.
  • osu_nccl_bw - Bandwidth Test
  • The bandwidth tests are carried out by having the sender send out a fixed number (equal to the window size) of back-to-back messages to the receiver and then wait for a reply from the receiver. The receiver sends the reply only after receiving all these messages. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time (from the time the sender sends the first message until the time it receives the reply back from the receiver) and the number of bytes sent by the sender. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. Thus, non-blocking versions of the NCCL functions (ncclSend and ncclRecv) are used in the test.
  • osu_nccl_bibw - Bidirectional Bandwidth Test
  • The bidirectional bandwidth test is similar to the bandwidth test, except that both the nodes involved send out a fixed number of back-to-back messages and wait for the reply. This test measures the maximum sustainable aggregate bandwidth by two nodes.
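The arithmetic behind the ping-pong latency and windowed bandwidth measurements described above can be sketched as follows. This is a hedged illustration in Python: the function names are hypothetical, one-way latency is assumed to be half of the average round-trip time, and bandwidth is assumed to be reported in MB/s with 1 MB taken as 10^6 bytes.

```python
def one_way_latency_us(total_round_trip_seconds, iterations):
    """Average one-way latency in microseconds.

    Assumes each ping-pong iteration is one full round trip, so the
    one-way latency is half of the average round-trip time.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return total_round_trip_seconds * 1e6 / (2 * iterations)


def bandwidth_mb_per_s(message_bytes, window_size, iterations, elapsed_seconds):
    """Bandwidth from the bytes sent by the sender and the elapsed time.

    Each iteration sends `window_size` back-to-back messages of
    `message_bytes` bytes. 1 MB = 10**6 bytes is an assumption here.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    total_bytes = message_bytes * window_size * iterations
    return total_bytes / 1e6 / elapsed_seconds


if __name__ == "__main__":
    print(one_way_latency_us(0.002, 1000))
    print(bandwidth_mb_per_s(1_000_000, 64, 10, 2.0))
```

For the bidirectional test, both nodes send, so under the same assumptions the aggregate figure would count the bytes sent by both sides.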
    The latest OMB version includes benchmarks for various NCCL collective operations (NCCL Allgather, NCCL Allreduce, NCCL Bcast, NCCL Reduce, NCCL Reduce_Scatter, NCCL Alltoall). These benchmarks work in the following manner. When users run the osu_nccl_bcast benchmark with N processes, the benchmark measures the min, max and average latency of the NCCL Bcast collective operation across the N processes, for various message lengths, over a large number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options: "-f" can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations. "-m" can be used to set the minimum and maximum message lengths to be used in a benchmark. In the default version, the benchmarks report the latencies for message lengths up to 1MB. Examples: -m 128 // min = default, max = 128 -m 2:128 // min = 2, max = 128 -m 2: // min = 2, max = default "-x" can be used to set the number of warmup iterations to skip for each message length. "-i" can be used to set the number of iterations to run for each message length. "-M" can be used to set the per-process maximum memory consumption. By default the benchmarks are limited to 512MB allocations.
  • osu_nccl_allgather - NCCL Allgather Latency Test
  • osu_nccl_allreduce - NCCL Allreduce Latency Test
  • osu_nccl_bcast - NCCL Bcast Latency Test
  • osu_nccl_reduce - NCCL Reduce Latency Test
  • osu_nccl_reduce_scatter - NCCL Reduce_scatter Latency Test
  • osu_nccl_alltoall - NCCL Alltoall Latency Test
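The three "-m" forms shown in the examples above (max only, min:max, and min with a trailing colon) can be parsed along the following lines. This Python sketch is illustrative only and is not the benchmarks' own parser: parse_message_range is a hypothetical name, the default minimum of 1 byte is an assumption, and the 1MB default maximum is taken as 2^20 bytes.

```python
def parse_message_range(arg, default_min=1, default_max=1 << 20):
    """Parse an "-m" style argument into (min_length, max_length).

    "128"   -> (default_min, 128)
    "2:128" -> (2, 128)
    "2:"    -> (2, default_max)
    The defaults are assumptions made for this sketch.
    """
    if ":" in arg:
        low_text, high_text = arg.split(":", 1)
        low = int(low_text) if low_text else default_min
        high = int(high_text) if high_text else default_max
    else:
        low, high = default_min, int(arg)
    if low < 0 or high < low:
        raise ValueError("invalid message length range: " + arg)
    return low, high


if __name__ == "__main__":
    for example in ("128", "2:128", "2:"):
        print(example, parse_message_range(example))
```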
The following are the point-to-point benchmarks for Java MPI libraries such as MVAPICH2-J and the Open MPI Java bindings. There are separate custom bandwidth and bi-bandwidth benchmarks for Open MPI because the API does not support communicating Java arrays using non-blocking point-to-point primitives.
  • OSULatency - Latency Test
  • OSUBandwidth - Bandwidth Test
    • OSUBandwidthOMPI (exclusively for the Open MPI Java bindings)
  • OSUBiBandwidth - Bidirectional Bandwidth Test
    • OSUBiBandwidthOMPI (exclusively for the Open MPI Java bindings)
  • OSUOMPIBandwidth - Bandwidth Test for Open MPI Java Bindings
  • OSUOMPIBiBandwidth - Bidirectional Bandwidth Test for Open MPI Java Bindings
The following are the collective benchmarks for Java MPI libraries such as MVAPICH2-J and the Open MPI Java bindings.
  • OSUAllgather - MPI_Allgather Latency Test
  • OSUAllgatherv - MPI_Allgatherv Latency Test
  • OSUAllReduce - MPI_Allreduce Latency Test
  • OSUAlltoall - MPI_Alltoall Latency Test
  • OSUAlltoallv - MPI_Alltoallv Latency Test
  • OSUBarrier - MPI_Barrier Latency Test
  • OSUBcast - MPI_Bcast Latency Test
  • OSUGather - MPI_Gather Latency Test
  • OSUGatherv - MPI_Gatherv Latency Test
  • OSUReduce - MPI_Reduce Latency Test
  • OSUReduceScatter - MPI_Reduce_scatter Latency Test
  • OSUScatter - MPI_Scatter Latency Test
  • OSUScatterv - MPI_Scatterv Latency Test
The OMB Python extension offers a variety of point-to-point and collective benchmarks to evaluate the communication performance of MPI-based parallel applications in Python. This extension utilizes the mpi4py library to provide Python bindings for the MPI standard. The extension supports a variety of Python buffers including NumPy, CuPy, Numba, and PyCUDA. In addition to the CPU tests, GPU benchmarks are supported by selecting the CUDA-aware buffers. Tests with serialized communicated objects are also supported by using the --pickle runtime flag. The --min and --max flags are used to specify the lower and upper bounds for the tested message size. The --iterations and --skip flags are used to set the number of testing and warmup iterations for each message size. To enable the Python extension, please configure OMB with the --enable-python option.
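To make the runtime flags named above concrete, here is a minimal sketch of how such options could be declared with Python's argparse module, written with double-hyphen long options. It is not the extension's actual command-line code: build_parser is a hypothetical name and every default value below is an assumption chosen for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the flags described in the text.

    Hypothetical sketch: the flag names come from the description above,
    while all default values are assumptions.
    """
    parser = argparse.ArgumentParser(description="OMB-style benchmark options (sketch)")
    parser.add_argument("--min", type=int, default=1,
                        help="lower bound for the tested message size")
    parser.add_argument("--max", type=int, default=1 << 20,
                        help="upper bound for the tested message size")
    parser.add_argument("--iterations", type=int, default=1000,
                        help="number of timed iterations per message size")
    parser.add_argument("--skip", type=int, default=100,
                        help="number of warmup iterations per message size")
    parser.add_argument("--pickle", action="store_true",
                        help="communicate serialized (pickled) objects")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--min", "4", "--max", "1024", "--pickle"])
    print(args.min, args.max, args.iterations, args.skip, args.pickle)
```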
The following are the point-to-point benchmarks to evaluate performance of MPI communication in Python for both CPU and GPU using the mpi4py bindings.
  • osu_latency - Latency Test
  • osu_bw - Bandwidth Test
  • osu_bibw - Bidirectional Bandwidth Test
  • osu_multi_lat - Multi-pair Latency Test
The following are the collective benchmarks to evaluate performance of MPI communication in Python for both CPU and GPU using the mpi4py bindings.
  • osu_allgather - MPI_Allgather Latency Test
  • osu_allgatherv - MPI_Allgatherv Latency Test
  • osu_allreduce - MPI_Allreduce Latency Test
  • osu_alltoall - MPI_Alltoall Latency Test
  • osu_alltoallv - MPI_Alltoallv Latency Test
  • osu_barrier - MPI_Barrier Latency Test
  • osu_bcast - MPI_Bcast Latency Test
  • osu_gather - MPI_Gather Latency Test
  • osu_gatherv - MPI_Gatherv Latency Test
  • osu_reduce - MPI_Reduce Latency Test
  • osu_reduce_scatter - MPI_Reduce_scatter Latency Test
  • osu_scatter - MPI_Scatter Latency Test
  • osu_scatterv - MPI_Scatterv Latency Test