Sharing an implementation of SGEMM using GPU and OpenCL
Sandbo
I just created a project with the XOP and source files for a simple (and un-optimized) version of SGEMM (Single-precision General Matrix Multiplication).
http://www.igorexchange.com/project/OpenCLSGEMM
I am new to C programming and only spent a few days assembling the code from tutorials around the web (I have tried to reference each of them within the source code).
At the moment I only get a factor-of-3 improvement (for large input matrices) using an AMD HD 7950 over an Intel Core i7-4790. While I expect a newer and faster GPU will help quite a lot, the implementation can definitely be improved by a fair amount.
It is done using this kernel (with work per thread set to 4):
https://cnugteren.github.io/tutorial/pages/page5.html
I will try to improve the performance over time, but it would be great if someone who is experienced in using OpenCL and GPUs could give me some hints.
If you are also interested in trying it out, I will be glad to know whether it works for you, and will provide help where possible.
function test(runs)
	variable runs

	variable m
	variable dim = 8192
	make/o/n=(runs) CPUTakes, GPUTakes
	make/o/n=(dim,dim) AA, BB, CC, DD
	variable timer, GPUTook, CPUTook
	for (m = 0; m < runs; m = m + 1)
		timer = startMSTimer
		GPU_SGEMM(AA, BB, CC)
		GPUTook = stopMSTimer(timer)/1e6
		GPUTakes[m] = GPUTook

		MultiThreadingControl setMode=8
		timer = startMSTimer
		MatrixOp/O DD = AA x BB
		CPUTook = stopMSTimer(timer)/1e6
		MultiThreadingControl setMode=0
		CPUTakes[m] = CPUTook
	endfor
	variable/g CPUTake = mean(CPUTakes)
	variable/g GPUTake = mean(GPUTakes)
	variable/g CPUGFLOPS = 2*dim*dim*dim/CPUTake*1e-9
	variable/g GPUGFLOPS = 2*dim*dim*dim/GPUTake*1e-9
	print "CPU SGEMM GFLOPS = " + num2str(CPUGFLOPS)
	print "GPU SGEMM GFLOPS = " + num2str(GPUGFLOPS)
	killwaves/Z CPUTakes, GPUTakes, AA, BB, CC, DD
end
Just some performance metrics:
Intel Core i7-6700k@4.4GHz SGEMM GFLOPS = 239.81
Intel Core i7-4790 SGEMM GFLOPS = 196.84
AMD HD 7950 SGEMM GFLOPS = 585.71
AMD RX 480 SGEMM GFLOPS = 908.5
Also, with Python and NumPy, a Threadripper 1950X gives SGEMM GFLOPS = 238.38 (AMD has to optimize this).
September 20, 2017 at 09:54 pm - Permalink
Have you seen IgorCL (http://www.igorexchange.com/project/IgorCL/, https://github.com/pdedecker…), which allows you to load OpenCL code from within Igor? In addition, with this XOP you can also execute the CL code on the CPU.
Here I get:
•test(1)
Result is equal 1
CPU SGEMM GFLOPS = 216.82
GPU SGEMM GFLOPS = 472.03
with a Radeon (TM) Pro WX 5100 Graphics and a i7-3930k.
When doing these tests it is usually a good idea to compare the results, e.g. with
printf "Result is equal %d\r", EqualWaves(DD, CC, 1, 1e-6)
I've tried fiddling with the TS/WPT parameters in the OpenCL file, but if I do that I always get wrong results.
[Edited]
September 21, 2017 at 05:35 am - Permalink
For further performance tuning, you may want to refer to this page:
https://cnugteren.github.io/tutorial/pages/page1.html
A large portion of my implementation was based on it.
His kernels 6 and 7 are another factor of 2-3 faster than what is in my source now, but I couldn't get correct results with them; I am finding out why. He has also got this: https://github.com/CNugteren/CLTune
Also, it might be good to see whether the clBLAS kernel can be used, though that doesn't look too straightforward at the moment: https://github.com/clMathLibraries/clBLAS
If I could push the GFLOPS of the RX 480 to close to 3000 I would be grateful, and then I could probably convince my boss to get me a Vega :D.
I found a bug in the memory check (I forgot to divide by 8 to get the size in bytes). I will update it soon, but you may just change it in the source.
*Forgot to add: the implementation I used only works for matrices with dimensions divisible by 16 (or 4, I need to check); otherwise the boundary results will be invalid.
This kind of boundary issue also needs to be handled more gracefully.
September 21, 2017 at 07:14 am - Permalink