Shared memory Allreduce on the TACC Ranger (10/29/09)
Experimental Testbed:
Each node contains four quad-core 64-bit AMD Opteron processors (16 cores in
all) on a single board, operating as an SMP unit. The core frequency is 2.3 GHz
and each node has 32 GB of memory. The memory subsystem has a 1.0 GHz
HyperTransport system bus and 2 channels of 667 MHz DDR2 DIMMs. Each socket
has an independent memory controller connected directly to an L3 cache. Ranger
uses a 7-stage, full fat-tree topology with InfiniBand SDR interconnects and
two large Sun InfiniBand Datacenter switches.
The following plot compares the latency of the MPI_Allreduce operation for
small messages under the default point-to-point-based algorithms and under the
shared-memory-based algorithms on 4K processes.
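
To illustrate why a shared-memory design can help at this scale, here is a
minimal toy model in Python. It is a sketch, not the library's implementation.
The text above does not describe the internals of either algorithm, so
everything here is an assumption: that "4K" means 4096 processes on fully
populated 16-core nodes, that the point-to-point baseline behaves like
recursive doubling, and that the shared-memory variant is a two-level scheme
(combine within a node, exchange between one leader per node, publish the
result within the node). The function names are hypothetical, and the model
counts exchange rounds only. It does not measure latency.

```python
# Toy model: ranks are list indices and "messages" are plain list reads.
# All names and the choice of algorithms are illustrative assumptions.
CORES_PER_NODE = 16  # from the testbed description: 16 cores per node


def recursive_doubling_allreduce(values):
    """Sum-allreduce over a power-of-two number of participants.

    Returns (results, rounds), where rounds counts the exchange steps.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "participant count must be a power of two"
    current = list(values)
    rounds = 0
    distance = 1
    while distance < n:
        # Each participant combines its partial sum with that of the
        # partner whose index differs in exactly one bit.
        current = [current[i] + current[i ^ distance] for i in range(n)]
        distance *= 2
        rounds += 1
    return current, rounds


def two_level_allreduce(values, cores_per_node=CORES_PER_NODE):
    """Two-level sum-allreduce: intra-node combine, then leaders exchange.

    Returns (results, inter_node_rounds).
    """
    n = len(values)
    assert n % cores_per_node == 0, "nodes are assumed to be fully populated"
    # Step 1: each node combines its local contributions. The local sum
    # stands in for ranks writing into a shared-memory buffer.
    node_sums = [
        sum(values[start:start + cores_per_node])
        for start in range(0, n, cores_per_node)
    ]
    # Step 2: one leader per node takes part in the network exchange.
    leader_results, rounds = recursive_doubling_allreduce(node_sums)
    # Step 3: every rank reads the final value published by its node leader.
    results = [leader_results[rank // cores_per_node] for rank in range(n)]
    return results, rounds


if __name__ == "__main__":
    contributions = list(range(4096))  # one value per process
    flat, flat_rounds = recursive_doubling_allreduce(contributions)
    hier, hier_rounds = two_level_allreduce(contributions)
    print("results agree:", flat == hier)
    print("exchange rounds, flat:", flat_rounds)                # 12
    print("inter-node rounds, two-level:", hier_rounds)         # 8
```

Under these assumptions the flat exchange needs 12 rounds for 4096 processes,
while the two-level scheme needs 8 network rounds among 256 node leaders plus
two intra-node steps that stay in memory. Whether this translates into lower
latency on real hardware depends on factors the model ignores, such as memory
contention and synchronization cost.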

