Sharing an implementation of SGEMM using GPU and OpenCL
Sandbo
I just created a project with the XOP and source files for a simple (and un-optimized) version of SGEMM (Single-precision General Matrix Multiplication).
http://www.igorexchange.com/project/OpenCLSGEMM
I am new to C programming and only spent a few days assembling the code from tutorials around the web (I have tried to reference each of them within the source code).
At the moment I only get a factor-of-3 improvement (for large input matrices) using an AMD HD 7950 over an Intel Core i7-4790. While I expect a newer and faster GPU will help quite a lot, the implementation can definitely be improved by a fair amount.
It is done using this kernel (with work per thread set to 4):
https://cnugteren.github.io/tutorial/pages/page5.html
I will try to improve the performance over time, but it would be great if someone who is experienced in using OpenCL and GPUs could give me some hints.
If you are also interested in trying it out, I will be glad to know whether it works for you, and will provide help where possible.
function test(runs)
	variable runs

	variable m
	variable dim = 8192
	make/o/n=(runs) CPUTakes, GPUTakes
	make/o/n=(dim,dim) AA, BB, CC, DD
	variable timer, GPUTook, CPUTook
	for (m = 0; m < runs; m = m + 1)
		timer = startMSTimer
		GPU_SGEMM(AA, BB, CC)
		GPUTook = stopMSTimer(timer)/1e6
		GPUTakes[m] = GPUTook

		MultiThreadingControl setMode=8
		timer = startMSTimer
		MatrixOp/O DD = AA x BB
		CPUTook = stopMSTimer(timer)/1e6
		MultiThreadingControl setMode=0
		CPUTakes[m] = CPUTook
	endfor
	variable/g CPUTake = mean(CPUTakes)
	variable/g GPUTake = mean(GPUTakes)
	variable/g CPUGFLOPS = 2*dim*dim*dim/CPUTake*1e-9
	variable/g GPUGFLOPS = 2*dim*dim*dim/GPUTake*1e-9
	print "CPU SGEMM GFLOPS = " + num2str(CPUGFLOPS)
	print "GPU SGEMM GFLOPS = " + num2str(GPUGFLOPS)
	killwaves/Z CPUTakes, GPUTakes, AA, BB, CC, DD
end
Just some performance metrics:
Intel Core i7-6700k@4.4GHz SGEMM GFLOPS = 239.81
Intel Core i7-4790 SGEMM GFLOPS = 196.84
AMD HD 7950 SGEMM GFLOPS = 585.71
AMD RX 480 SGEMM GFLOPS = 908.5
Also, with Python and NumPy, a Threadripper 1950X gives SGEMM GFLOPS = 238.38 (AMD has to optimize this).
September 20, 2017 at 09:54 pm - Permalink
Have you seen IgorCL (http://www.igorexchange.com/project/IgorCL/, https://github.com/pdedecker…), which allows you to load OpenCL code from within Igor? In addition, with this XOP you can also execute the CL code on the CPU.
Here I get:
•test(1)
Result is equal 1
CPU SGEMM GFLOPS = 216.82
GPU SGEMM GFLOPS = 472.03
with a Radeon (TM) Pro WX 5100 Graphics and a i7-3930k.
When doing these tests it is usually a good idea to compare the results, e.g. with
printf "Result is equal %d\r", EqualWaves(DD, CC, 1, 1e-6)
I've tried fiddling with the TS/WPT parameters in the OpenCL file, but if I do that I always get wrong results.
[Edited]
September 21, 2017 at 05:35 am - Permalink
For further performance tuning, you may want to refer to this page:
https://cnugteren.github.io/tutorial/pages/page1.html
A large portion of my implementation was based on it.
His kernels 6 and 7 are another factor of 2-3 faster than what is in my source now, but I couldn't get correct results with them; I am finding out why. He has also got this: https://github.com/CNugteren/CLTune
Also, it might be good to see whether the clBLAS kernel can be used, though that doesn't look too straightforward at the moment: https://github.com/clMathLibraries/clBLAS
If I could push the GFLOPS of the RX 480 to close to 3000 I would be grateful, and then I could probably convince my boss to get me a Vega :D.
I found a bug in the memory check (I forgot to divide by 8 to get the size in bytes). I will update it soon, but you may just change it in the source.
*Forgot to add: the implementation I used only works for matrices with dimensions divisible by 16 (or 4, I need to check); otherwise the boundary results will be invalid.
This kind of boundary issue also needs to be handled more gracefully.
September 21, 2017 at 07:14 am - Permalink