AllReduce | Collectives | Performance | Network-Based Computing Laboratory

Shared memory Allreduce on the TACC Ranger (10/29/09)

  • Experimental Testbed: Each node contains four quad-core, 64-bit AMD Opteron processors (16 cores in all) on a single board, operating as an SMP unit. The core frequency is 2.3 GHz and each node has 32 GB of memory. The memory subsystem has a 1.0 GHz HyperTransport system bus and 2 channels of 667 MHz DDR2 DIMMs. Each socket has an independent memory controller connected directly to an L3 cache. TACC uses a 7-stage, full fat-tree topology with InfiniBand SDR interconnects and two large Sun InfiniBand Datacenter switches. The following plot compares the latency of the MPI_Reduce operation for small messages between the default point-to-point based algorithms and the shared-memory based algorithms.
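
To make the comparison above more concrete, here is a minimal single-process sketch in Python of the two styles of reduction within one 16-core node. It is a toy model, not the library's implementation: the page does not say which point-to-point algorithm is the default, so a binomial tree is assumed purely for illustration, and the function names `binomial_tree_reduce` and `shared_memory_reduce` are hypothetical.

```python
def binomial_tree_reduce(values):
    """Model a point-to-point reduce as a binomial tree (an assumed algorithm).

    In each round, every rank at an odd multiple of `step` "sends" its partial
    sum to the rank `step` below it. Returns (result at rank 0, number of rounds).
    """
    partial = list(values)
    n = len(partial)
    rounds = 0
    step = 1
    while step < n:
        for rank in range(0, n, 2 * step):
            peer = rank + step
            if peer < n:
                # Models one point-to-point message from `peer` to `rank`.
                partial[rank] += partial[peer]
        step *= 2
        rounds += 1
    return partial[0], rounds


def shared_memory_reduce(values):
    """Model a shared-memory reduce.

    Every rank stores its contribution into its own slot of a shared buffer,
    then a single leader (rank 0) combines all slots.
    """
    shared_buffer = [0] * len(values)
    for rank, value in enumerate(values):
        shared_buffer[rank] = value  # models a store into node-local shared memory
    return sum(shared_buffer)        # leader combines the slots


# Four quad-core sockets per node, as in the testbed description.
CORES_PER_NODE = 4 * 4
contributions = list(range(1, CORES_PER_NODE + 1))

tree_sum, tree_rounds = binomial_tree_reduce(contributions)
shm_sum = shared_memory_reduce(contributions)
print(tree_sum, tree_rounds, shm_sum)
```

Both functions return the same sum. The difference the sketch highlights is structural: the tree version needs several dependent message rounds among the 16 ranks, while the shared-memory version replaces those messages with stores into a common buffer followed by one combining pass. The sketch says nothing about actual latency, which is what the plot measures.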