A faster OpenMP Radix Sort implementation

I was the TA of CS484 Parallel Programming at UIUC. One of the course machine problems is implementing an OpenMP version of radix sort. There are many implementations on the web, and the top result from a Google search for “openmp radix sort” is the version by Alexis Daboville on GitHub.

The code is pretty good. It uses an MSD-based radix sort: it parallelizes the counting part, recurses in parallel, and switches to quicksort when a bin is small enough. It also uses a lot of coding tricks to make it run faster.

But the absolute performance of Alexis’s implementation is not good. The reasons include the limited parallel portion, the MSD algorithm, and the recursive calls. In fact, there are three major phases in a radix sort: counting, prefix sum, and data movement. We should try to parallelize all three phases, not just one. Furthermore, an LSD-based algorithm removes the overhead of the recursive calls. By the way, I’m a compiler guy, and I don’t like the programming tricks here and there in the code to make it run faster. A compiler can do better than these tricks most of the time, and some of them may interfere with the compiler’s analysis.
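To make the three phases concrete, here is a minimal sequential LSD radix sort sketch (not from either implementation; the seq_ names are mine). Each pass handles one 8-bit digit and runs exactly the three phases that the parallel version below distributes across threads:

#include <stddef.h>

#define SEQ_BASE_BITS 8
#define SEQ_BASE (1 << SEQ_BASE_BITS)
#define SEQ_MASK (SEQ_BASE - 1)

void seq_lsd_radix_sort(size_t n, unsigned data[n], unsigned buffer[n]) {
    int total_digits = sizeof(unsigned) * 8;
    for (int shift = 0; shift < total_digits; shift += SEQ_BASE_BITS) {
        size_t bucket[SEQ_BASE] = {0};

        /* Phase 1: counting -- how many keys fall in each digit bin */
        for (size_t i = 0; i < n; i++)
            bucket[(data[i] >> shift) & SEQ_MASK]++;

        /* Phase 2: exclusive prefix sum turns counts into start offsets */
        size_t sum = 0;
        for (size_t j = 0; j < SEQ_BASE; j++) {
            size_t count = bucket[j];
            bucket[j] = sum;
            sum += count;
        }

        /* Phase 3: data movement -- stable scatter into the buffer */
        for (size_t i = 0; i < n; i++)
            buffer[bucket[(data[i] >> shift) & SEQ_MASK]++] = data[i];

        /* Swap roles; after an even number of passes the result is in data. */
        unsigned *tmp = data; data = buffer; buffer = tmp;
    }
}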

Here is my implementation (BSD License). Gist link

#include <stdlib.h>
#include <omp.h>

#define BASE_BITS 8
#define BASE (1 << BASE_BITS)
#define MASK (BASE - 1)
#define DIGITS(v, shift) (((v) >> shift) & MASK)

void omp_lsd_radix_sort(size_t n, unsigned data[n]) {
    unsigned * buffer = malloc(n*sizeof(unsigned));
    int total_digits = sizeof(unsigned)*8;

    //Each thread uses local_bucket to move data
    size_t i;
    for(int shift = 0; shift < total_digits; shift+=BASE_BITS) {
        size_t bucket[BASE] = {0};

        size_t local_bucket[BASE] = {0}; // size needed in each bucket/thread
        //1st pass: scan the whole array and count the digit occurrences
        #pragma omp parallel firstprivate(local_bucket)
        {
            #pragma omp for schedule(static) nowait
            for(i = 0; i < n; i++){
                local_bucket[DIGITS(data[i], shift)]++;
            }
            #pragma omp critical
            for(i = 0; i < BASE; i++) {
                bucket[i] += local_bucket[i];
            }
            #pragma omp barrier
            #pragma omp single
            for (i = 1; i < BASE; i++) {
                bucket[i] += bucket[i - 1];
            }
            int nthreads = omp_get_num_threads();
            int tid = omp_get_thread_num();
            //walk the threads from last to first: peel this thread's counts
            //off the inclusive prefix sums to get its start offsets
            for(int cur_t = nthreads - 1; cur_t >= 0; cur_t--) {
                if(cur_t == tid) {
                    for(i = 0; i < BASE; i++) {
                        bucket[i] -= local_bucket[i];
                        local_bucket[i] = bucket[i];
                    }
                } else { //just do barrier
                    #pragma omp barrier
                }

            }
            #pragma omp for schedule(static)
            for(i = 0; i < n; i++) { //must match the partition of the counting loop
                buffer[local_bucket[DIGITS(data[i], shift)]++] = data[i];
            }
        }
        //swap pointers: the data sorted by this digit is now in buffer
        unsigned* tmp = data;
        data = buffer;
        buffer = tmp;
    }
    free(buffer);
}
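Here is a minimal driver sketch for the function above (mine, not part of the gist). Note that with 8-bit digits and 32-bit keys there are four passes, an even number of swaps, so the sorted result lands back in the caller's array:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 1 << 20;
    unsigned *data = malloc(n * sizeof(unsigned));
    for (size_t i = 0; i < n; i++)
        data[i] = (unsigned)rand();

    omp_lsd_radix_sort(n, data);

    /* Verify the output is non-decreasing. */
    for (size_t i = 1; i < n; i++) {
        if (data[i - 1] > data[i]) {
            printf("not sorted at %zu\n", i);
            return 1;
        }
    }
    printf("sorted %zu keys\n", n);
    free(data);
    return 0;
}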

The above code only supports the unsigned data type, but it parallelizes both the counting phase and the data-movement phase. As for the prefix phase, the sequence is small (only BASE = 256 entries), so the sequential version is still faster. There is a minor point: the code relies on OpenMP's static scheduling partitioning the iterations the same way in the counting phase and in the data-movement phase. The OpenMP specification (3.0 and later) actually guarantees this for two schedule(static) loops with the same iteration count and chunk size in the same parallel region. Another limitation is that the code uses OpenMP runtime functions, but I don't know another way to set up the per-thread data-movement offsets without calling OpenMP runtime functions.
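To check that guarantee on a given toolchain, here is a small standalone sketch (mine, not from the original post) that records which thread executes each iteration in two identical schedule(static) loops and compares the assignments:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    int first[N], second[N];
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            first[i] = omp_get_thread_num();   /* stands in for the counting loop */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            second[i] = omp_get_thread_num();  /* stands in for the data-movement loop */
    }
    /* The two assignments must be identical for the sort to be correct. */
    for (int i = 0; i < N; i++) {
        if (first[i] != second[i]) {
            printf("partitions differ at %d\n", i);
            return 1;
        }
    }
    printf("identical static partitions\n");
    return 0;
}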

A quick performance test: on a server with one Intel E3-1245, I turned off hyper-threading (so only four threads are used) and turned off Turbo Boost (fixed at 3.3 GHz). Sorting 16M unsigned integers, the above code achieves ~218 M/s.
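For reference, the M/s figure is just keys sorted divided by wall-clock seconds. A minimal timing helper in that spirit (report_throughput is a hypothetical name; the original test harness isn't shown) could look like this, assuming omp_lsd_radix_sort from above is in scope:

#include <stddef.h>
#include <stdio.h>
#include <omp.h>

/* Sort once and print throughput in M keys/s. */
void report_throughput(size_t n, unsigned data[n]) {
    double t0 = omp_get_wtime();
    omp_lsd_radix_sort(n, data);
    double t1 = omp_get_wtime();
    printf("%zu keys in %.3f s = %.1f M/s\n",
           n, t1 - t0, (double)n / (t1 - t0) / 1e6);
}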

Can we do better? Yes. Intel published a paper about their implementation in SIGMOD 2010, and there is a more recent report. The reported performance is about 250 M/s on a slower (3.2 GHz i7) processor. One of the important implementation points is optimizing for memory bandwidth. No code is available.

But the CPU version is still much slower than a CUDA implementation. There are many gather/scatter-like operations in a parallel radix sort, which limit the use of CPU SIMD. Maybe we could reach a better CPU peak performance if efficient gather/scatter SIMD operations were available.
