Abstract
Parallelizing a sequential application involves handling concurrent updates to shared memory objects. One important type of parallelism can be exploited when the order of memory updates to the same location does not change the output of the program; this is known as reduction-type parallelism, and it exists in many important applications such as data mining, numerical analysis, and scientific simulation. On multi-core architectures, these applications are typically implemented with thread-private data objects that hold partial results, followed by a sequential final stage that aggregates those partial results. Porting this type of application to massively parallel GPUs raises new challenges. One major challenge is work partitioning, which aims to minimize both the communication between threads that run in parallel and the work left to the sequential reduction stage that aggregates all partial results. However, when the number of threads grows to thousands or millions, workload partitioning becomes far more complicated than it is for fewer than ten threads. This can lead to load imbalance or to extra control code for handling boundary cases in irregular applications. Extra control code increases coding complexity and, at runtime, causes thread divergence and thus serious performance degradation. In this paper, we propose a novel approach to handling concurrent shared memory objects on GPUs that yields both good performance and good programmability. The approach uses atomic operations extensively. Atomic operations on shared memory are known to be expensive and may serialize parallel threads. However, we found that, with appropriate atomic-collision elimination techniques, we can match or even exceed the performance of traditional implementations that avoid atomics.
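The two strategies the abstract contrasts, privatized partial results with a sequential merge versus direct atomic updates to one shared object, can be sketched with ordinary CPU threads. This is an illustrative analogue only, not the paper's GPU implementation; the function names `sum_privatized` and `sum_atomic` are ours:

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Privatization: each thread accumulates into its own partial result,
// then a sequential final stage aggregates the partials.
long sum_privatized(const std::vector<int>& data, int num_threads) {
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                partial[t] += data[i];  // private to thread t: no collisions
        });
    }
    for (auto& w : workers) w.join();
    // Sequential reduction over the partial results.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}

// Atomic updates: all threads update one shared accumulator directly,
// so no work partitioning for partials and no final merge are needed.
long sum_atomic(const std::vector<int>& data, int num_threads) {
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                total.fetch_add(data[i], std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Because addition is order-insensitive, both versions produce the same result; the paper's contribution is making the atomic form competitive on GPUs by eliminating collisions between concurrent updaters.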
We implemented these techniques as a library of functions with a simple interface. Programmers can call these functions to update shared memory objects without worrying about the order of the updates or about workload balancing, while still obtaining the significant performance gains of massive GPU parallelism.
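To illustrate the kind of interface the abstract describes, here is a hypothetical sketch: callers update a shared object through a single call, and the library hides update ordering and collision handling. The class name and signatures are invented for illustration; the paper's actual API may differ.

```cpp
#include <atomic>

// Hypothetical accumulator in the spirit of the abstract's library:
// thread-safe updates with no caller-side partitioning or final merge.
// This name and interface are assumptions, not the paper's real API.
template <typename T>
class SharedAccumulator {
public:
    explicit SharedAccumulator(T init = T{}) : value_(init) {}

    // Safe to call from any number of concurrent threads; the relaxed
    // ordering is sufficient because reduction is order-insensitive.
    void add(T delta) { value_.fetch_add(delta, std::memory_order_relaxed); }

    T get() const { return value_.load(); }

private:
    std::atomic<T> value_;
};
```

A caller simply invokes `add` from every thread and reads the total with `get` once all threads have finished.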