AllReduce | Collectives | Performance | Network-Based Computing Laboratory

Shared memory Allreduce on the TACC Ranger (10/29/09)

  • Experimental Testbed: Each node contains four quad-core 64-bit AMD Opteron processors (16 cores in all) on a single board, as an SMP unit. The core frequency is 2.3 GHz, and each node has 32 GB of memory. The memory subsystem has a 1.0 GHz HyperTransport system bus and 2 channels with 667 MHz DDR2 DIMMs. Each socket has an independent memory controller connected directly to an L3 cache. TACC uses a 7-stage, full fat-tree topology with InfiniBand SDR interconnects and two large Sun InfiniBand Datacenter switches. The following plot compares the small-message latency of the MPI_Bcast operation between versions 1.2p1 and 1.4 on 1K processes. Version 1.2p1 used the conventional binomial-tree-based algorithm; version 1.4 uses the 2-level k-nomial algorithm.
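
The difference between a binomial tree and a k-nomial tree can be illustrated with a small, MPI-free sketch. It is not the library's implementation: the function names (`knomial_parent`, `knomial_children`, `knomial_rounds`) and the radix values are assumptions chosen for illustration, the root is fixed at rank 0, and the 2-level (intra-node plus inter-node) split is not modeled. A binomial tree is simply the k = 2 case.

```python
def knomial_parent(rank, k):
    """Return the parent of `rank` in a k-nomial tree rooted at rank 0.

    The parent is obtained by zeroing the lowest nonzero base-k digit
    of `rank`. The root has no parent, so None is returned for rank 0.
    """
    if rank == 0:
        return None
    weight = 1
    # Skip over trailing zero digits in the base-k representation.
    while (rank // weight) % k == 0:
        weight *= k
    digit = (rank // weight) % k
    return rank - digit * weight


def knomial_children(rank, size, k):
    """Return the children of `rank` among `size` processes.

    A rank owns one group of up to k - 1 children for every digit
    position below its lowest nonzero base-k digit (the root owns
    every position). Children are listed nearest first; a real
    broadcast may choose a different send order.
    """
    children = []
    weight = 1
    while weight < size:
        if rank != 0 and (rank // weight) % k != 0:
            break  # reached the digit that links this rank to its parent
        for j in range(1, k):
            child = rank + j * weight
            if child < size:
                children.append(child)
        weight *= k
    return children


def knomial_rounds(size, k):
    """Number of rounds needed to reach `size` processes when the set of
    processes holding the data can grow by a factor of k per round."""
    rounds, reach = 0, 1
    while reach < size:
        reach *= k
        rounds += 1
    return rounds
```

Under these assumptions, 1,024 processes need 10 rounds with k = 2 (binomial) and 5 rounds with k = 4, at the cost of each sender issuing up to k - 1 messages per round. Whether that trade-off pays off depends on the message size and the interconnect, which is what the latency comparison measures.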