parallel processing issue

I want to extract and sum up "beams" from a 3D wave depending on a ROI or mask wave, which I do using the code below:

function GetSum(cube, mask)
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2)) total=0
        
    variable cx, cy 
    for(cx = 0; cx < DimSize(Cube, 0); cx += 1)         
            for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
                    
                if (mask[cx][cy] ==1)           // mask[cx][cy] can be 1 or 0               
                    ImageTransform /Beam ={(cx), (cy)} getBeam cube
                    wave W_Beam
                    total+=W_Beam   
                endif       
            
            endfor
    endfor      
    
    KillWaves/Z W_Beam
end


For the size of my 3D wave this process takes about 2-3 seconds. To improve performance I copied and modified the parallel processing example from the help files to this:

Threadsafe function workerFunc(cube, total, mask, col)
    wave cube, total, mask
    variable col
    
    variable i
    for (i=0; i<DimSize(cube, 0);i+=1)
        
        if (mask[i][col] ==1)   
            ImageTransform /Beam ={(i), (col)} getBeam cube
            wave W_Beam
            total+=W_Beam   
        endif
    endfor
 
end
 
 
Function MT_GetSum(cube, mask)
    WAVE cube, mask
    
    Make/O/N=(DimSize(cube, 2)) total=0
    
    Variable ncol= DimSize(cube,1)
    Variable i,col,nthreads= ThreadProcessorCount
    Variable threadGroupID= ThreadGroupCreate(nthreads)
    
    for(col=0; col<ncol;)
        for(i=0; i<nthreads; i+=1)
        
            ThreadStart threadGroupID,i,workerFunc(cube,total, mask,col)
            
            col+=1
            if( col>=ncol )
                break
            endif
        endfor
        
        do
            Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
        while( threadGroupStatus != 0 )
    endfor
    Variable dummy= ThreadGroupRelease(threadGroupID)
End



This gives a 60-70% speedup but it appears that wave "total" from the two methods does not contain identical values. The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.
Any ideas what I'm doing wrong?

I don't have a lot of time but from a quick look at your post:

ChrLie wrote:

Threadsafe function workerFunc(cube, total, mask, col)
    wave cube, total, mask
    variable col
    
    variable i
    for (i=0; i<DimSize(cube, 0);i+=1)
        
        if (mask[i][col] ==1)   
            ImageTransform /Beam ={(i), (col)} getBeam cube
            wave W_Beam
            total+=W_Beam   
        endif
    endfor
 
end



Think about what happens in your parallel calculation. Igor spawns a bunch of threads and these start crunching away, independently and regardless of what other threads may be doing. This also means that if different threads write to the same memory, each will modify this memory as it pleases and the result is unpredictable. Read the section called "Example atomic operation" here to understand what that means with respect to your problem.

Igor tries to prevent these problems from occurring by giving each thread its own, private data folder hierarchy, such that all waves created by the thread are accessible only by that thread. The exception, of course, are function arguments. In other words, every thread refers to the same "total" wave.

So when you do this:
total+=W_Beam   

The result is undefined because multiple threads can modify it simultaneously.

Your only option is to provide each thread with its own copy of total (so if there are 8 threads, create 8 copies of total, all initialized to zero), and to add all of these waves together into the final result after the threads have completed.


Also, the loop in which you call ThreadStart has issues, but I don't have time now so maybe someone else will chip in on that.
Looking at this again, your ThreadStart loop does look OK. I misread it due to reading it in a hurry, sorry. However, my comment above still stands.

I played with this a bit:
function GetSum(cube, mask) // this is your original function
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
 
    variable cx, cy 
    for(cx = 0; cx < DimSize(Cube, 0); cx += 1)         
            for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
 
                if (mask[cx][cy] ==1)           // mask[cx][cy] can be 1 or 0               
                    ImageTransform /Beam ={(cx), (cy)} getBeam cube
                    wave W_Beam
                    total+=W_Beam[p]
                endif       
 
            endfor
    endfor      
 
    KillWaves/Z W_Beam
end
 
Function GetSum2(cube, mask)    // MatrixOP-based
    wave cube, mask
    
    Make/O/N=(DimSize(cube, 2))/D total=0
    
    MatrixOP /FREE Multiplied = cube * mask
    
    variable i
    for (i = 0; i < DimSize(Multiplied, 2); i+=1)
        ImageTransform /P=(i) sumPlane, Multiplied
        total[i] += V_value
    endfor
End
 
function GetSum3(cube, mask)    // MultiThreaded
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
    
    MultiThread total = WorkerFunc3(cube, mask, p)
end
 
ThreadSafe Function WorkerFunc3(cube, mask, plane)
    wave cube, mask
    variable plane
    
    variable i, j, total = 0
    for (i = 0; i < DimSize(cube, 0); i+=1)
        for (j = 0; j < DimSize(cube, 1); j+=1)
            if (mask[i][j] == 1)
                total += cube[i][j][plane]
            endif
        endfor
    endfor
    
    return total
End
 
function GetSum4(cube, mask)    // MultiThreaded + MatrixOP
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
    
    MultiThread total = WorkerFunc4(cube, mask, p)
end
 
ThreadSafe Function WorkerFunc4(cube, mask, plane)
    wave cube, mask
    variable plane
    
    MatrixOP /FREE W_Sum = sum(mask * cube[][][plane])
    return W_Sum[0]
End
 
Function DoTiming()
    Make /o/n=(128,128) /D M_Mask = round(enoise(0.5) + 0.5)
    Make /o/n=(128,128, 512) /D/O M_cube = enoise(1)
    
    variable timerRef, time1, time2, time3, time4
    
    timerRef = StartMSTimer
    GetSum(m_cube, m_mask)
    time1 = StopMSTimer(timerRef) / 1e6
    
    timerRef = StartMSTimer
    GetSum2(m_cube, m_mask)
    time2 = StopMSTimer(timerRef) / 1e6
    
    timerRef = StartMSTimer
    GetSum3(m_cube, m_mask)
    time3 = StopMSTimer(timerRef) / 1e6
    
    timerRef = StartMSTimer
    GetSum4(m_cube, m_mask)
    time4 = StopMSTimer(timerRef) / 1e6
    
    Printf "Original: %f, MatrixOP: %f s, MultiThread: %f s, MatrixOP + MultiThread: %f s\r", time1, time2, time3, time4
End


Result on my computer:
•DoTiming()
  Original: 0.343414 s, MatrixOP: 0.056957 s, MultiThread: 0.389801 s, MatrixOP + MultiThread: 0.008289 s


In my implementation, multithreading of the 'naive' Igor code actually leads to a slowdown. Using MatrixOP, however, is much faster than the naive code and multithreaded MatrixOP appears to be the performance winner.

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.

ChrLie wrote:
The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.

I think that this is probably due to a mistake in your code. But be wary of using single-precision waves when not necessary. Use doubles (Make /D) when practical.

I'd be interested to hear what the performance difference is on your dataset.
741 wrote:

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.


GetSum4() is awesome!

Thinking of "planes" instead of "beams" clearly makes a difference. The result for my cube:

  Original: 2.831555, MatrixOP: 0.350114 s, MultiThread: 6.146490 s, MatrixOP + MultiThread: 0.097558 s


As for differences in totals in the initial post; I somehow had the feeling that all threads working on one wave (or memory) might be the problem but your solution is much more elegant anyway!

Thanks a lot!