Python numpy performance using ufunc or numba.

posted on: 2018-09-13 06:19:32

With numpy arithmetic a complicated expression can lead to many calls and loops when one loop and call would suffice.

We start with a somewhat complex function.

def fun1(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1

Each operation takes the input arrays and operates on them in c and returns a new array. This doesn't seem efficient, so here is how Ill bench mark it.

numpy.random.seed(1)
x = numpy.random.random((10000, 10000))
y = numpy.random.random((10000, 10000))

for i in range(5):
    x = fun1(x,y)

And the output ends up being,

prompt$> time python package/check.py 
1.604411024829864e+55

real	1m24.093s
user	1m18.549s
sys	0m5.836s

Can we make this run faster. When reading the numpy docs they mention that copying the arrays can take up a lot of resources and could be improved by using the operators with an output array.

def fun3(x,y):
    T = numpy.power(x,3)
    T2 = numpy.square(x)
    numpy.multiply(T2,y,T2)
    numpy.add(T,T2,T)
    numpy.square(y, T2)
    numpy.multiply(x, T2, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)
    numpy.add(T, T2, T)
    numpy.add(T, -0.1, T)
    return T

Using the new function, the result is hardly noticeable.

1.604411024829864e+55

real	1m19.115s
user	1m16.325s
sys	0m3.052s

The next step is to make a ufunc that performs the same task. I based the following on this official tutorial.

static void double_funny(char **args, npy_intp *dimensions,
                            npy_intp* steps, void* data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in1 = args[0], *in2 = args[1];
    char *out1 = args[2];
    npy_intp in1_step = steps[0], in2_step = steps[1];
    npy_intp out1_step = steps[2];

    double x,y;
    
    for (i = 0; i < n; i++) {
        /*BEGIN main ufunc computation*/
        x = *(double *)in1;
        y = *(double *)in2;
        *((double *)out1) = x*x*x + x*x*y + x*y*y + y*y*y - 0.1;
        /*END main ufunc computation*/

        in1 += in1_step;
        in2 += in2_step;
        out1 += out1_step;
        
    }
}

Then using the same check routine I get:

1.6044110248298627e+55

real	0m5.103s
user	0m3.828s
sys	0m2.322s

Notice that the output number is slightly different!? The amazing thing to notice is we went from ~80 seconds down to 5 seconds so nearly a 16 times speed up. It is a bit of a hassell to build a c extension so I also used numba. The most challenging aspect of numpy was installing a version that uses a version of llvm I could install.

Once numba was up and running. I included three versions that use numba:

@numba.jit
def fun(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1


@numba.jit
def fun3(x,y):
    T = numpy.power(x,3)
    T2 = numpy.power(x, 2)
    numpy.multiply(T2,y,T2)
    numpy.add(T,T2,T)
    numpy.power(y, 2, T2)
    numpy.multiply(x, T2, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)
    numpy.add(T, T2, T)
    numpy.add(T, -0.1, T)
    return T

@numba.jit
def fun4(x,y):
    w,h = x.shape
    z = numpy.zeros(x.shape)
    for i in range(w):
        for j in range(h):
            z[i,j] = x[i,j]**3 + x[i,j]**2*y[i,j] + x[i,j]*y[i,j]**2 + y[i,j]**3 - 0.1
    return z

The first version is just the actual numpy calls, then there is a version where the numpy calls are made but re-using an array for the answer, and the final version is an explicit loop.

#fun output
1.6044110248298622e+55

real	0m5.542s
user	0m4.164s
sys	0m2.595s


#fun3 output
1.604411024829864e+55

real	1m21.265s
user	1m18.629s
sys	0m4.154s

#fun4 output
1.6044110248298622e+55

real	0m5.397s
user	0m4.071s
sys	0m2.557s

Something about the passing back and forth does something to the accuracy. We can see the jit'd versions and ufunc version both have the same output.

source code found on github

Comments

create comment?