CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters. C. Chu, K. Hamidouche, A. Venkatesh, A. Awan, and D. Panda. 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016.