parallel processing issue
ChrLie
function GetSum(cube, mask)
wave cube, mask
Make/O/N=(DimSize(cube, 2)) total=0
variable cx, cy
for(cx = 0; cx < DimSize(Cube, 0); cx += 1)
for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
if (mask[cx][cy] ==1) // mask[cx][cy] can be 1 or 0
ImageTransform /Beam ={(cx), (cy)} getBeam cube
wave W_Beam
total+=W_Beam
endif
endfor
endfor
KillWaves/Z W_Beam
end
wave cube, mask
Make/O/N=(DimSize(cube, 2)) total=0
variable cx, cy
for(cx = 0; cx < DimSize(Cube, 0); cx += 1)
for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
if (mask[cx][cy] ==1) // mask[cx][cy] can be 1 or 0
ImageTransform /Beam ={(cx), (cy)} getBeam cube
wave W_Beam
total+=W_Beam
endif
endfor
endfor
KillWaves/Z W_Beam
end
For the size of my 3D wave this process takes about 2-3 seconds. To improve performance I copied and modified the parallel processing example from the help files to this:
Threadsafe function workerFunc(cube, total, mask, col)
wave cube, total, mask
variable col
variable i
for (i=0; i<DimSize(cube, 0);i+=1)
if (mask[i][col] ==1)
ImageTransform /Beam ={(i), (col)} getBeam cube
wave W_Beam
total+=W_Beam
endif
endfor
end
Function MT_GetSum(cube, mask)
WAVE cube, mask
Make/O/N=(DimSize(cube, 2)) total=0
Variable ncol= DimSize(cube,1)
Variable i,col,nthreads= ThreadProcessorCount
Variable threadGroupID= ThreadGroupCreate(nthreads)
for(col=0; col<ncol;)
for(i=0; i<nthreads; i+=1)
ThreadStart threadGroupID,i,workerFunc(cube,total, mask,col)
col+=1
if( col>=ncol )
break
endif
endfor
do
Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
while( threadGroupStatus != 0 )
endfor
Variable dummy= ThreadGroupRelease(threadGroupID)
End
wave cube, total, mask
variable col
variable i
for (i=0; i<DimSize(cube, 0);i+=1)
if (mask[i][col] ==1)
ImageTransform /Beam ={(i), (col)} getBeam cube
wave W_Beam
total+=W_Beam
endif
endfor
end
Function MT_GetSum(cube, mask)
WAVE cube, mask
Make/O/N=(DimSize(cube, 2)) total=0
Variable ncol= DimSize(cube,1)
Variable i,col,nthreads= ThreadProcessorCount
Variable threadGroupID= ThreadGroupCreate(nthreads)
for(col=0; col<ncol;)
for(i=0; i<nthreads; i+=1)
ThreadStart threadGroupID,i,workerFunc(cube,total, mask,col)
col+=1
if( col>=ncol )
break
endif
endfor
do
Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
while( threadGroupStatus != 0 )
endfor
Variable dummy= ThreadGroupRelease(threadGroupID)
End
This gives a 60-70% speedup but it appears that wave "total" from the two methods does not contain identical values. The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.
Any ideas what I'm doing wrong?
Think about what happens in your parallel calculation. Igor spawns a bunch of threads and these start crunching away, independently and regardless of what other threads may be doing. This also means that if different threads write to the same memory, each will modify this memory as it pleases and the result is unpredictable. Read the section called "Example atomic operation" here to understand what that means with respect to your problem.
Igor tries to prevent these problems from occurring by giving each thread its own, private data folder hierarchy, such that all waves created by the thread are accessible only by that thread. The exception, of course, are function arguments. In other words, every thread refers to the same "total" wave.
So when you do this:
The result is undefined because multiple threads can modify it simultaneously.
Your only option is to provide each thread with its own copy of total (so if there are 8 threads, create 8 copies of total, all initialized to zero), and to add all of these waves together into the final result after the threads have completed.
Also, the loop in which you call ThreadStart has issues, but I don't have time now so maybe someone else will chip in on that.
October 19, 2012 at 07:36 am - Permalink
I played with this a bit:
wave cube, mask
Make/O/N=(DimSize(cube, 2))/D total=0
variable cx, cy
for(cx = 0; cx < DimSize(Cube, 0); cx += 1)
for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
if (mask[cx][cy] ==1) // mask[cx][cy] can be 1 or 0
ImageTransform /Beam ={(cx), (cy)} getBeam cube
wave W_Beam
total+=W_Beam[p]
endif
endfor
endfor
KillWaves/Z W_Beam
end
Function GetSum2(cube, mask) // MatrixOP-based
wave cube, mask
Make/O/N=(DimSize(cube, 2))/D total=0
MatrixOP /FREE Multiplied = cube * mask
variable i
for (i = 0; i < DimSize(Multiplied, 2); i+=1)
ImageTransform /P=(i) sumPlane, Multiplied
total[i] += V_value
endfor
End
function GetSum3(cube, mask) // MultiThreaded
wave cube, mask
Make/O/N=(DimSize(cube, 2))/D total=0
MultiThread total = WorkerFunc3(cube, mask, p)
end
ThreadSafe Function WorkerFunc3(cube, mask, plane)
wave cube, mask
variable plane
variable i, j, total = 0
for (i = 0; i < DimSize(cube, 0); i+=1)
for (j = 0; j < DimSize(cube, 1); j+=1)
if (mask[i][j] == 1)
total += cube[i][j][plane]
endif
endfor
endfor
return total
End
function GetSum4(cube, mask) // MultiThreaded + MatrixOP
wave cube, mask
Make/O/N=(DimSize(cube, 2))/D total=0
MultiThread total = WorkerFunc4(cube, mask, p)
end
ThreadSafe Function WorkerFunc4(cube, mask, plane)
wave cube, mask
variable plane
MatrixOP /FREE W_Sum = sum(mask * cube[][][plane])
return W_Sum[0]
End
Function DoTiming()
Make /o/n=(128,128) /D M_Mask = round(enoise(0.5) + 0.5)
Make /o/n=(128,128, 512) /D/O M_cube = enoise(1)
variable timerRef, time1, time2, time3, time4
timerRef = StartMSTimer
GetSum(m_cube, m_mask)
time1 = StopMSTimer(timerRef) / 1e6
timerRef = StartMSTimer
GetSum2(m_cube, m_mask)
time2 = StopMSTimer(timerRef) / 1e6
timerRef = StartMSTimer
GetSum3(m_cube, m_mask)
time3 = StopMSTimer(timerRef) / 1e6
timerRef = StartMSTimer
GetSum4(m_cube, m_mask)
time4 = StopMSTimer(timerRef) / 1e6
Printf "Original: %f, MatrixOP: %f s, MultiThread: %f s, MatrixOP + MultiThread: %f s\r", time1, time2, time3, time4
End
Result on my computer:
Original: 0.343414 s, MatrixOP: 0.056957 s, MultiThread: 0.389801 s, MatrixOP + MultiThread: 0.008289 s
In my implementation, multithreading of the 'naive' Igor code actually leads to a slowdown. Using MatrixOP, however, is much faster than the naive code and multithreaded MatrixOP appears to be the performance winner.
Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.
I think that this is probably due to a mistake in your code. But be wary of using single-precision waves when not necessary. Use doubles (Make /D) when practical.
I'd be interested to hear what the performance difference is on your dataset.
October 22, 2012 at 01:02 pm - Permalink
GetSum4() is awesome!
Thinking of "planes" instead of "beams" clearly makes a difference. The result for my cube:
As for differences in totals in the initial post; I somehow had the feeling that all threads working on one wave (or memory) might be the problem but your solution is much more elegant anyway!
Thanks a lot!
October 23, 2012 at 12:42 am - Permalink