Intra-node intra-IOH GPU-GPU performance numbers of MVAPICH2 on Sandy Bridge Architecture (05/06/13)
- Experimental Testbed: Each node of our testbed has 16 cores (dual octa-core, 2.60 GHz) and 32 GB of main memory. The CPUs are based on the Sandy Bridge architecture and run in 64-bit mode. The nodes support x16 PCI Express Gen3 interfaces, and each node is equipped with a Mellanox ConnectX-3 FDR HCA and two NVIDIA Tesla K20c GPUs. They have CUDA Toolkit 5.0 and CUDA Driver 310.44 installed. The nodes are connected using a Mellanox FDR InfiniBand switch. The operating system used is RedHat Enterprise Linux Server release 6.3 (Santiago).
- The results reported are for MPI communication between the device memories of two GPUs with ECC enabled. MVAPICH2 currently delivers a one-way latency of 25.4 microseconds for 4 bytes. It achieves a unidirectional bandwidth of up to 4531.74 Million Bytes/sec and a bidirectional bandwidth of up to 6372.73 Million Bytes/sec. (1 Mega Byte = 1,048,576 Bytes; 1 Million Bytes = 1,000,000 Bytes)
- The OSU Micro Benchmarks have been modified to measure the performance of MPI communication from NVIDIA GPU devices and are available for download as part of OMB v4.0.1.
- The two GPU devices on this node were connected to the same I/O Hub. The two processes were mapped onto cores 1 and 2 (intra-socket). Process 0 used GPU Device 0 and Process 1 used GPU Device 1.
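A setup along the lines above can be reproduced with the CUDA-enabled OSU Micro-Benchmarks; the following is a sketch only, in which the installation paths and the hostname node1 are placeholders, and the exact configure options depend on the OMB and MVAPICH2 versions installed:

```shell
# Build OMB with CUDA support (paths are illustrative placeholders)
./configure CC=/path/to/mvapich2/install/bin/mpicc \
            --enable-cuda --with-cuda=/usr/local/cuda
make

# One-way latency between device (D) buffers on the two GPUs of a
# single node; MV2_USE_CUDA=1 enables MVAPICH2's CUDA device support,
# and MV2_CPU_MAPPING pins the two ranks to cores 1 and 2 (intra-socket).
mpirun_rsh -np 2 node1 node1 MV2_USE_CUDA=1 MV2_CPU_MAPPING=1:2 \
    ./osu_latency D D

# Unidirectional and bidirectional bandwidth, device to device
mpirun_rsh -np 2 node1 node1 MV2_USE_CUDA=1 MV2_CPU_MAPPING=1:2 ./osu_bw D D
mpirun_rsh -np 2 node1 node1 MV2_USE_CUDA=1 MV2_CPU_MAPPING=1:2 ./osu_bibw D D
```

The "D D" arguments select device memory for both the send and receive buffers; each benchmark process selects its local GPU, matching the Process 0 / Device 0 and Process 1 / Device 1 mapping described above.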