Performance of Hardware Multicast-aware Broadcast Collective on TACC Stampede (05/06/13)
- Experimental Testbed: Each node of our testbed has sixteen cores (2.70 GHz dual octa-core) and 32 GB of main memory. The CPUs are based on the Sandy Bridge architecture and run in 64-bit mode. The nodes support 16x PCI Express Gen3 interfaces and are equipped with Mellanox ConnectX-3 FDR HCAs with PCI Express Gen3 interfaces. The operating system used was CentOS release 6.3 (Final). The processes are bound to the cores using the default ``block'' mapping.
- The following graphs demonstrate the improvement in the latency of the MPI_Bcast operation obtained by using InfiniBand hardware UD-Multicast.

