posted on: 2018-09-13 06:19:32
With numpy arithmetic, a complicated expression can turn into many separate calls and loops over the data when a single call and loop would suffice.

```python
def fun1(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1
```

Each operation takes its input arrays, loops over them in C, and returns a new array. This doesn't seem efficient, so here is how I'll benchmark it.
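To make that concrete, here is a rough sketch of what the single expression in `fun1` expands into: each operator is a separate whole-array ufunc call, and each call allocates a fresh temporary (`fun1_expanded` is my name for the illustration, not code from the post).

```python
import numpy

def fun1(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1

def fun1_expanded(x, y):
    # each line below is one C-level loop over the whole array,
    # and each returns a freshly allocated result
    t1 = numpy.power(x, 3)
    t2 = numpy.power(x, 2) * y   # two calls, two temporaries
    t3 = x * numpy.power(y, 2)   # two more
    t4 = numpy.power(y, 3)
    return t1 + t2 + t3 + t4 - 0.1  # four more temporaries

x = numpy.random.random((100, 100))
y = numpy.random.random((100, 100))
assert numpy.allclose(fun1(x, y), fun1_expanded(x, y))
```

Ten array-sized allocations and ten passes over the data, for one polynomial.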

```python
import numpy

numpy.random.seed(1)
x = numpy.random.random((10000, 10000))
y = numpy.random.random((10000, 10000))

for i in range(5):
    x = fun1(x, y)
```

And the output ends up being,

```prompt$> time python package/check.py
1.604411024829864e+55

real	1m24.093s
user	1m18.549s
sys	0m5.836s

```

Can we make this run faster? The numpy docs mention that allocating and copying the intermediate arrays can take up a lot of resources, and that this can be reduced by calling the operators with an explicit output array.

```python
def fun3(x, y):
    T = numpy.power(x, 3)      # T = x**3
    T2 = numpy.square(x)       # T2 is a scratch array, reused for each term
    numpy.multiply(T2, y, T2)  # T2 = x**2 * y
    numpy.add(T, T2, T)        # accumulate into T in place
    numpy.square(y, T2)        # T2 = y**2
    numpy.multiply(x, T2, T2)  # T2 = x * y**2
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)      # T2 = y**3
    numpy.add(T, T2, T)
    numpy.subtract(T, 0.1, T)
    return T
```

Using the new function, the improvement is hardly noticeable:

```1.604411024829864e+55

real	1m19.115s
user	1m16.325s
sys	0m3.052s
```

The next step is to write a C ufunc that performs the whole computation in a single pass. I based the following on the official numpy tutorial on writing ufuncs.

```c
static void double_funny(char **args, npy_intp *dimensions,
                         npy_intp *steps, void *data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in1 = args[0], *in2 = args[1];
    char *out1 = args[2];
    npy_intp in1_step = steps[0], in2_step = steps[1];
    npy_intp out1_step = steps[2];

    double x, y;

    for (i = 0; i < n; i++) {
        /* BEGIN main ufunc computation */
        x = *(double *)in1;
        y = *(double *)in2;
        *((double *)out1) = x*x*x + x*x*y + x*y*y + y*y*y - 0.1;
        /* END main ufunc computation */

        in1 += in1_step;
        in2 += in2_step;
        out1 += out1_step;
    }
}
```

Then using the same check routine I get:

```1.6044110248298627e+55

real	0m5.103s
user	0m3.828s
sys	0m2.322s
```

Notice that the output number is slightly different! The amazing thing is that we went from roughly 84 seconds down to about 5 seconds, nearly a 16x speed-up. It is a bit of a hassle to build a C extension, so I also tried numba. The most challenging aspect of numba was installing a version that works with a version of LLVM I could install.

Once numba was up and running, I tried three versions:

```python
@numba.jit
def fun(x, y):
    return x**3 + x**2*y + x*y**2 + y**3 - 0.1

@numba.jit
def fun3(x, y):
    T = numpy.power(x, 3)
    T2 = numpy.power(x, 2)      # scratch array, reused for each term
    numpy.multiply(T2, y, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 2, T2)
    numpy.multiply(x, T2, T2)
    numpy.add(T, T2, T)
    numpy.power(y, 3, T2)
    numpy.add(T, T2, T)
    numpy.subtract(T, 0.1, T)
    return T

@numba.jit
def fun4(x, y):
    w, h = x.shape
    z = numpy.zeros(x.shape)
    for i in range(w):
        for j in range(h):
            z[i, j] = x[i, j]**3 + x[i, j]**2*y[i, j] + x[i, j]*y[i, j]**2 + y[i, j]**3 - 0.1
    return z
```

The first version is just the plain numpy expression, the second makes the same numpy calls but re-uses an output array, and the third is an explicit element-wise loop.

```#fun output
1.6044110248298622e+55

real	0m5.542s
user	0m4.164s
sys	0m2.595s

#fun3 output
1.604411024829864e+55

real	1m21.265s
user	1m18.629s
sys	0m4.154s

#fun4 output
1.6044110248298622e+55

real	0m5.397s
user	0m4.071s
sys	0m2.557s

```

The reason is rounding order: the fused versions (`fun`, `fun4`, and the C ufunc) evaluate the whole polynomial element by element, while plain numpy, and the jit'd `fun3` that still goes through whole-array calls, rounds after every intermediate operation. The different orderings shift the last few digits, which is why the fused versions differ from the pure numpy result (and can even differ slightly from each other, depending on how the compiler arranges the arithmetic).
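Floating-point arithmetic is not associative, so merely regrouping the same terms changes the low bits. A minimal demonstration:

```python
# the same three terms, grouped two different ways;
# each binary operation rounds its intermediate result
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False
print(a, b)
```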