Abstract
Parallelizing a sequential application involves handling concurrent updates to shared memory objects. One important type of parallelism can be exploited when the order of memory updates to the same location does not change the output of the program; this is known as reduction-type parallelism, and it exists in many important applications such as data mining, numerical analysis, and scientific simulation. On multi-core architectures, these applications are typically implemented with thread-private data objects that hold partial results, followed by a sequential final stage that aggregates those partial results. Porting this type of application to massively parallel GPUs raises new challenges. One major challenge is work partitioning, which aims to minimize both the communication between threads that run in parallel and the work left to the sequential reduction stage that aggregates all partial results. However, when the number of threads grows to thousands or millions, workload partitioning becomes far more complicated than it is for fewer than ten threads. This can lead to load imbalance or to extra control code for handling boundary cases in irregular applications. Extra control code increases coding complexity and, at runtime, causes thread divergence and thus serious performance degradation. In this paper, we propose a novel approach to handling concurrent shared memory objects on GPUs that yields both good performance and good programmability. The approach uses atomic operations extensively. Atomic operations on shared memory are known to be expensive and may serialize parallel threads. However, we found that, with appropriate atomic-collision elimination techniques, we can match or even exceed the performance of traditional implementations that avoid atomics.
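The two strategies the abstract contrasts, privatized partial results with a sequential merge versus direct atomic updates to one shared object, can be sketched with ordinary CPU threads. This is an illustrative analogue only, not the paper's GPU implementation; the function names `sum_privatized` and `sum_atomic` are ours:

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Privatization: each thread accumulates into its own partial result,
// then a sequential final stage aggregates the partials.
long sum_privatized(const std::vector<int>& data, int num_threads) {
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                partial[t] += data[i];  // private to thread t: no collisions
        });
    }
    for (auto& w : workers) w.join();
    // Sequential reduction over the partial results.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}

// Atomic updates: all threads update one shared accumulator directly,
// so no work partitioning for partials and no final merge are needed.
long sum_atomic(const std::vector<int>& data, int num_threads) {
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                total.fetch_add(data[i], std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Because addition is order-insensitive, both versions produce the same result; the paper's contribution is making the atomic form competitive on GPUs by eliminating collisions between concurrent updaters.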
We implemented these techniques as a library of functions with a simple interface. Programmers can call these functions to update shared memory objects without worrying about the order of the updates or about workload balancing, while still obtaining the significant performance gains of massive GPU parallelism.
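To illustrate the kind of interface the abstract describes, here is a hypothetical sketch: callers update a shared object through a single call, and the library hides update ordering and collision handling. The class name and signatures are invented for illustration; the paper's actual API may differ.

```cpp
#include <atomic>

// Hypothetical accumulator in the spirit of the abstract's library:
// thread-safe updates with no caller-side partitioning or final merge.
// This name and interface are assumptions, not the paper's real API.
template <typename T>
class SharedAccumulator {
public:
    explicit SharedAccumulator(T init = T{}) : value_(init) {}

    // Safe to call from any number of concurrent threads; the relaxed
    // ordering is sufficient because reduction is order-insensitive.
    void add(T delta) { value_.fetch_add(delta, std::memory_order_relaxed); }

    T get() const { return value_.load(); }

private:
    std::atomic<T> value_;
};
```

A caller simply invokes `add` from every thread and reads the total with `get` once all threads have finished.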