parallel processing issue

function GetSum(cube, mask) wave cube, mask Make/O/N=(DimSize(cube, 2)) total=0 variable cx, cy for(cx = 0; cx < DimSize(Cube, 0); cx += 1) for(cy = 0; cy < DimSize(Cube, 1); cy += 1) if (mask[cx][cy] ==1) // mask[cx][cy] can be 1 or 0 ImageTransform /Beam ={(cx), (cy)} getBeam cube wave W_Beam total+=W_Beam endif endfor endfor KillWaves/Z W_Beam end

Threadsafe function workerFunc(cube, total, mask, col) wave cube, total, mask variable col variable i for (i=0; i<DimSize(cube, 0);i+=1) if (mask[i][col] ==1) ImageTransform /Beam ={(i), (col)} getBeam cube wave W_Beam total+=W_Beam endif endfor end Function MT_GetSum(cube, mask) WAVE cube, mask Make/O/N=(DimSize(cube, 2)) total=0 Variable ncol= DimSize(cube,1) Variable i,col,nthreads= ThreadProcessorCount Variable threadGroupID= ThreadGroupCreate(nthreads) for(col=0; col<ncol;) for(i=0; i<nthreads; i+=1) ThreadStart threadGroupID,i,workerFunc(cube,total, mask,col) col+=1 if( col>=ncol ) break endif endfor do Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100) while( threadGroupStatus != 0 ) endfor Variable dummy= ThreadGroupRelease(threadGroupID) End

741

I don't have a lot of time but from a quick look at your post:

ChrLie wrote:

Threadsafe function workerFunc(cube, total, mask, col)
    wave cube, total, mask
    variable col
    
    variable i
    for (i=0; i<DimSize(cube, 0);i+=1)
        
        if (mask[i][col] ==1)   
            ImageTransform /Beam ={(i), (col)} getBeam cube
            wave W_Beam
            total+=W_Beam   
        endif
    endfor
 
end

Think about what happens in your parallel calculation. Igor spawns a bunch of threads and these start crunching away, independently and regardless of what other threads may be doing. This also means that if different threads write to the same memory, each will modify this memory as it pleases and the result is unpredictable. Read the section called "Example atomic operation" here to understand what that means with respect to your problem.

Igor tries to prevent these problems from occurring by giving each thread its own, private data folder hierarchy, such that all waves created by the thread are accessible only by that thread. The exception, of course, are function arguments. In other words, every thread refers to the same "total" wave.

So when you do this:

total+=W_Beam

The result is undefined because multiple threads can modify it simultaneously.

Your only option is to provide each thread with its own copy of total (so if there are 8 threads, create 8 copies of total, all initialized to zero), and to add all of these waves together into the final result after the threads have completed.

Also, the loop in which you call ThreadStart has issues, but I don't have time now so maybe someone else will chip in on that.

October 19, 2012 at 07:36 am - Permalink

741

Looking at this again, your ThreadStart loop does look OK. I misread it due to reading it in a hurry, sorry. However, my comment above still stands.

I played with this a bit:

function GetSum(cube, mask) // this is your original function
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
 
    variable cx, cy 
    for(cx = 0; cx < DimSize(Cube, 0); cx += 1)         
            for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
 
                if (mask[cx][cy] ==1)           // mask[cx][cy] can be 1 or 0               
                    ImageTransform /Beam ={(cx), (cy)} getBeam cube
                    wave W_Beam
                    total+=W_Beam[p]
                endif       
 
            endfor
    endfor      
 
    KillWaves/Z W_Beam
end
 
Function GetSum2(cube, mask)    // MatrixOP-based
    wave cube, mask
    
    Make/O/N=(DimSize(cube, 2))/D total=0
    
    MatrixOP /FREE Multiplied = cube * mask
    
    variable i
    for (i = 0; i < DimSize(Multiplied, 2); i+=1)
        ImageTransform /P=(i) sumPlane, Multiplied
        total[i] += V_value
    endfor
End
 
function GetSum3(cube, mask)    // MultiThreaded
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
    
    MultiThread total = WorkerFunc3(cube, mask, p)
end
 
ThreadSafe Function WorkerFunc3(cube, mask, plane)
    wave cube, mask
    variable plane
    
    variable i, j, total = 0
    for (i = 0; i < DimSize(cube, 0); i+=1)
        for (j = 0; j < DimSize(cube, 1); j+=1)
            if (mask[i][j] == 1)
                total += cube[i][j][plane]
            endif
        endfor
    endfor
    
    return total
End
 
function GetSum4(cube, mask)    // MultiThreaded + MatrixOP
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
    
    MultiThread total = WorkerFunc4(cube, mask, p)
end
 
ThreadSafe Function WorkerFunc4(cube, mask, plane)
    wave cube, mask
    variable plane
    
    MatrixOP /FREE W_Sum = sum(mask * cube[][][plane])
    return W_Sum[0]
End
 
Function DoTiming()
    Make /o/n=(128,128) /D M_Mask = round(enoise(0.5) + 0.5)
    Make /o/n=(128,128, 512) /D/O M_cube = enoise(1)
    
    variable timerRef, time1, time2, time3, time4
    
    timerRef = StartMSTimer
    GetSum(m_cube, m_mask)
    time1 = StopMSTimer(timerRef) / 1e6
    
    timerRef = StartMSTimer
    GetSum2(m_cube, m_mask)
    time2 = StopMSTimer(timerRef) / 1e6
    
    timerRef = StartMSTimer
    GetSum3(m_cube, m_mask)
    time3 = StopMSTimer(timerRef) / 1e6
    
    timerRef = StartMSTimer
    GetSum4(m_cube, m_mask)
    time4 = StopMSTimer(timerRef) / 1e6
    
    Printf "Original: %f, MatrixOP: %f s, MultiThread: %f s, MatrixOP + MultiThread: %f s\r", time1, time2, time3, time4
End

Result on my computer:

•DoTiming()
  Original: 0.343414 s, MatrixOP: 0.056957 s, MultiThread: 0.389801 s, MatrixOP + MultiThread: 0.008289 s

In my implementation, multithreading of the 'naive' Igor code actually leads to a slowdown. Using MatrixOP, however, is much faster than the naive code and multithreaded MatrixOP appears to be the performance winner.

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.

ChrLie wrote:
The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.

I think that this is probably due to a mistake in your code. But be wary of using single-precision waves when not necessary. Use doubles (Make /D) when practical.

I'd be interested to hear what the performance difference is on your dataset.

October 22, 2012 at 01:02 pm - Permalink

ChrLie

741 wrote:

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.

GetSum4() is awesome!

Thinking of "planes" instead of "beams" clearly makes a difference. The result for my cube:

  Original: 2.831555, MatrixOP: 0.350114 s, MultiThread: 6.146490 s, MatrixOP + MultiThread: 0.097558 s

As for differences in totals in the initial post; I somehow had the feeling that all threads working on one wave (or memory) might be the problem but your solution is much more elegant anyway!

Thanks a lot!

October 23, 2012 at 12:42 am - Permalink