- Experimental Testbed: Each node of our testbed has 16 cores (dual octa-core, 2.60 GHz) and 32 GB of main memory. The CPUs are based on the Sandy Bridge architecture and run in 64-bit mode. The nodes support PCI Express Gen3 x16 interfaces, and each node is equipped with a Mellanox ConnectX-3 FDR HCA and two NVIDIA Tesla K20c GPUs. The nodes have CUDA Toolkit 5.0 and CUDA driver 310.44 installed, and are connected through a Mellanox FDR InfiniBand switch. The operating system is Red Hat Enterprise Linux Server release 6.3 (Santiago).
- The results reported are for MPI communication between device memory on two NVIDIA GPUs with ECC enabled. MVAPICH2 currently delivers a one-way latency of 25.14 microseconds for 4-byte messages. It achieves a unidirectional bandwidth of up to 4895.78 Million Bytes/sec and a bidirectional bandwidth of up to 9858.04 Million Bytes/sec. (1 Megabyte = 1,048,576 Bytes; 1 Million Bytes = 1,000,000 Bytes)
- The OSU Micro-Benchmarks have been extended to measure the performance of MPI communication from NVIDIA GPU devices and are available for download as OMB v4.0.1.
- The processes are bound to core 1 on both nodes.