We start with a somewhat complex function.

def fun1(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1

Each operation takes the input arrays, operates on them in C, and returns a new array. This doesn't seem efficient, so here is how I'll benchmark it.
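This is easy to see in miniature: every arithmetic operator on arrays hands back a freshly allocated result (a small illustration, not the benchmark itself):

```python
import numpy

x = numpy.ones((100, 100))
y = numpy.ones((100, 100))

a = x**3      # allocates a new array for the result
b = x**2 * y  # two more temporaries: one for x**2, one for the product
assert a is not x
assert b is not y

# fun1 therefore materializes several full-size intermediate arrays
# (one per power, product, and sum) before the final result exists.
```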

import numpy

numpy.random.seed(1)
x = numpy.random.random((10000, 10000))
y = numpy.random.random((10000, 10000))
for i in range(5):
    x = fun1(x, y)

And the output ends up being,

prompt$> time python package/check.py
1.604411024829864e+55

real    1m24.093s
user    1m18.549s
sys     0m5.836s

Can we make this run faster? The numpy docs mention that copying the arrays can take up a lot of resources, and that this can be improved by calling the operators with an output array.
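Concretely, most numpy ufuncs take an optional output argument that writes into an existing buffer instead of allocating a new one:

```python
import numpy

x = numpy.arange(4.0)
buf = numpy.empty_like(x)

# third positional argument is the output array
# (equivalent to numpy.multiply(x, x, out=buf))
r = numpy.multiply(x, x, buf)
assert r is buf  # the result landed in buf; nothing new was allocated
```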

def fun3(x, y):
    T = numpy.power(x, 3)
    T2 = numpy.square(x)
    numpy.multiply(T2, y, T2)
    numpy.add(T, T2, T)
    numpy.square(y, T2)
    numpy.multiply(x, T2, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)
    numpy.add(T, T2, T)
    numpy.add(T, -0.1, T)
    return T
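It is worth a quick sanity check that the rewritten version still computes the same polynomial (my own check on small arrays, reproducing fun3 from above):

```python
import numpy

def fun3(x, y):
    T = numpy.power(x, 3)
    T2 = numpy.square(x)
    numpy.multiply(T2, y, T2)
    numpy.add(T, T2, T)
    numpy.square(y, T2)
    numpy.multiply(x, T2, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)
    numpy.add(T, T2, T)
    numpy.add(T, -0.1, T)
    return T

numpy.random.seed(0)
x = numpy.random.random((4, 4))
y = numpy.random.random((4, 4))

direct = x**3 + x**2*y + x*y**2 + y**3 - 0.1
assert numpy.allclose(fun3(x, y), direct)
```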

Using the new function, the improvement is hardly noticeable.

1.604411024829864e+55

real    1m19.115s
user    1m16.325s
sys     0m3.052s

The next step is to make a ufunc that performs the same task. I based the following on this official tutorial.

static void double_funny(char **args, npy_intp *dimensions,
                         npy_intp *steps, void *data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in1 = args[0], *in2 = args[1];
    char *out1 = args[2];
    npy_intp in1_step = steps[0], in2_step = steps[1];
    npy_intp out1_step = steps[2];
    double x, y;

    for (i = 0; i < n; i++) {
        /* BEGIN main ufunc computation */
        x = *(double *)in1;
        y = *(double *)in2;
        *((double *)out1) = x*x*x + x*x*y + x*y*y + y*y*y - 0.1;
        /* END main ufunc computation */

        in1 += in1_step;
        in2 += in2_step;
        out1 += out1_step;
    }
}

Then using the same check routine I get:

1.6044110248298627e+55

real    0m5.103s
user    0m3.828s
sys     0m2.322s

Notice that the output number is slightly different! The amazing thing is that we went from ~80 seconds down to ~5 seconds, nearly a 16x speedup. It is a bit of a hassle to build a C extension, though, so I also tried numba. The most challenging aspect of numba was installing a version that works with a version of LLVM I could install.

Once numba was up and running, I tried three versions that use numba:

@numba.jit
def fun(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1

@numba.jit
def fun3(x, y):
    T = numpy.power(x, 3)
    T2 = numpy.power(x, 2)
    numpy.multiply(T2, y, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 2, T2)
    numpy.multiply(x, T2, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)
    numpy.add(T, T2, T)
    numpy.add(T, -0.1, T)
    return T

@numba.jit
def fun4(x, y):
    w, h = x.shape
    z = numpy.zeros(x.shape)
    for i in range(w):
        for j in range(h):
            z[i, j] = x[i, j]**3 + x[i, j]**2*y[i, j] + x[i, j]*y[i, j]**2 + y[i, j]**3 - 0.1
    return z

The first version is just the plain numpy calls, the second makes the same numpy calls but re-uses an array for the answer, and the final version is an explicit loop.

# fun output
1.6044110248298622e+55

real    0m5.542s
user    0m4.164s
sys     0m2.595s

# fun3 output
1.604411024829864e+55

real    1m21.265s
user    1m18.629s
sys     0m4.154s

# fun4 output
1.6044110248298622e+55

real    0m5.397s
user    0m4.071s
sys     0m2.557s

Something about the order in which the terms are computed changes the last few digits of the result. The two jit'd versions that evaluate the polynomial directly (fun and fun4) agree with each other, the C ufunc differs from them only in the final digit, and the jit'd fun3, which makes the same numpy calls as before, matches the pure numpy result.
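One plausible source of the difference: on an array, x**3 dispatches to the power ufunc (essentially pow), while the C loop and the jit'd loops compute x*x*x; the two can disagree in the last bit, and over 10^8 elements such differences surface in the printed total. A quick way to look for them:

```python
import numpy

numpy.random.seed(1)
x = numpy.random.random(100000)

a = numpy.power(x, 3)  # what x**3 does on an array
b = x * x * x          # what the hand-written loops compute

# identical to ~15-16 significant digits, but not necessarily bit-for-bit
assert numpy.allclose(a, b)
print("max abs difference:", numpy.abs(a - b).max())
```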