Image Optimization April 7, 2021 by Alex Karpinsky

Fast Import of Pillow Images to NumPy / OpenCV Arrays

There’s an incredible technique that makes it possible to convert Pillow images to NumPy arrays with just two memory copies! Wait, what do you mean “with only two memory copies”? Isn’t it possible to convert data between libraries while copying memory only once or without copying it at all?

Well, it seems weird, but more traditional image converting methods work 1.5-2.5 times slower (if you need a mutable object). Today, I’m going to dive deep into both libraries and tell you why that happens. Also, I’ll show you a way to get the same result but faster. There won’t be any repositories or packages, just the facts, and working code at the end.

First Things First: Key Concepts 

Pillow is a Python imaging library. It supports different formats, provides lazy loading, and gives access to metadata from a file. Long story short, it does everything you need for image loading/saving.

NumPy is a Python library used for working with multidimensional arrays. It’s a base library for a bunch of scientific, computer vision, and machine learning libraries like SciPy, Pandas, Astropy and many others.

OpenCV is the most popular computer vision library and has a wide range of features. It doesn’t have its own internal storage format for images, instead, it uses NumPy arrays. The common scenario for using this library is when you need to convert an image from Pillow to NumPy so that you can work with it using OpenCV.

Today I’m going to run benchmarks on a Raspberry Pi 4 1800 MHz under a 64-bit OS. After all, where else would you need computer vision, if not on Raspberry? 🙂

How NumPy Conversion Works

Here are the two most common ways to convert a Pillow image to NumPy. If you Google it, you’ll probably find one of them: 

  • numpy.array(im) — makes a copy from an image to a NumPy array.
  • numpy.asarray(im) — the same as numpy.array(im, copy=False). Supposedly, it doesn’t make a copy but uses the memory of the original object instead. But it’s a bit more complicated than that.

One would think that in the second case, the NumPy array becomes kind of a representation of the original image, and if you change the NumPy array, the image will also change. In fact, that’s not the case:

In [1]: from PIL import Image
In [2]: import numpy
In [3]: im = Image.open('./canyon.jpg').resize((4096, 4096))
In [4]: n = numpy.asarray(im)

In [5]: n[:, :, 0] = 255
ValueError: assignment destination is read-only

In [6]: n.flags
Out[6]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

It’s way different from what you get if you use the numpy.array() function:

In [7]: n = numpy.array(im)

In [8]: n[:, :, 0] = 255

In [9]: n.flags
Out[9]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

However, the asarray() function works much faster:

In [10]: %timeit -n 10 n = numpy.array(im)
257 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit -n 10 n = numpy.asarray(im)
179 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

There are fewer memory copies, but there’s a price: you can’t change the array. Still, the conversion time remains horribly long compared to the one-copy method. Let’s figure out why that happens. 

NumPy Arrays Interface

If you look at the dependencies and code behind Pillow, you won’t find any mention of NumPy. (Well, you will, but only in the comments.) The same is true for NumPy. So how are images converted from one format to another? Turns out NumPy has a special interface for that. You create a special property for the object where you explain to NumPy how it should retrieve data, and it retrieves the data that way. Here’s an example of implementing this property from Pillow:

    @property
    def __array_interface__(self):
        shape, typestr = _conv_type_shape(self)
        return {
            "shape": shape,
            "typestr": typestr,
            "version": 3,
            "data": self.tobytes(),
        }

_conv_type_shape() describes the type and size of the array that should be obtained. But the most interesting things happen in the tobytes() method. If you check how long this method is executed, it becomes clear that, in general, NumPy does not add anything from itself:

In [12]: %timeit -n 10 n = im.tobytes()
179 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The time is exactly the same as the time of the asarray() function. Well, it seems like we’ve just found the culprit. The only thing left is to change the function call or speed it up, right? But it’s not that simple. 

Memory Organization in Pillow and NumPy

The memory model in NumPy is quite simple: an array is seen as a continuous chunk of memory starting at some location. Plus, there are strides that separately set offsets for each dimension. 

In Pillow, it’s a whole other thing. The image is stored in a number of chunks, and each chunk contains several image lines. Each pixel takes up 1 or 4 bytes (not from 1 to 4, exactly 1 or 4). This means that some bytes are not used for some color modes. For example, for RGB, the last byte in each pixel isn’t used. For black and white images with an alpha channel (LA mode), the middle two bytes are not used so that the alpha channel is in the last byte of the pixel.

I’m telling you this because I don’t want you to have the illusion that it’s possible to completely solve this issue without rewriting one of the libraries.  

Now you understand why we need the tobytes() method here: it transforms the internal representation of a Pillow image into a continuous flow of bytes without omissions, and that’s exactly what NumPy can use. After getting the bytes object, NumPy can either make a copy or use it in read-only mode. I’m not sure if that’s done to make it impossible to get around object immutability in Python, or if there are some real limitations on the C API level. Either way, if, for example, the input object is bytearray instead of bytes, the array will not be read-only.

But let’s take a look at a simplified version of tobytes():

    def tobytes(self):
        self.load()
        # unpack data
        e = Image._getencoder(self.mode, "raw", self.mode)
        e.setimage(self.im)

        data, bufsize, s = [], 65536, 0
        while not s:
            l, s, d = e.encode(bufsize)
            data.append(d)
        if s < 0:
            raise RuntimeError(f"encoder error {s} in tobytes")
        return b"".join(data)

Here we can see that the “raw” encoder has been created, and it produces image chunks of at least 65 kilobytes of memory. This is the first memory copy: by the end of the functions, we have the entire image in small chunks in the data array. The last line is the second memory copy: all the chunks are collected into one large byte string.

So, Who’s Guilty, and How to Deal With It?

It’s important to remember that libraries are written in such a way that there’s an interface, but there’s no explicit use of one another. I believe that’s almost the optimal solution under such conditions. But still, what if we don’t have such a limitation but want to get the maximum possible speed?

I want to highlight that we can’t give up the encoder: who knows what implementation details it’s hiding from us? Transferring it all to the Python level or rewriting part of it in C is a last resort.

It seems like it would be more reasonable to allocate a buffer of the required size in tobytes() in advance, and then write chunks to it. But it’s obvious that the encoder interface doesn’t work that way: it already returns chunks packed into bytes objects. However, if you don’t store all those chunks, but instead instantly copy it to a buffer, the data won’t be washed out from the L2 cache and will quickly get to the right place. It will look something like this:

def to_mem(im):
    im.load()
    e = Image._getencoder(im.mode, "raw", im.mode)
    e.setimage(im.im)

    mem = ... # we don't know yet

    bufsize, offset, s = 65536, 0, 0
    while not s:
        l, s, d = e.encode(bufsize)
        mem[offset:offset + len(d)] = d
        offset += len(d)
    if s < 0:
        raise RuntimeError(f"encoder error {s} in tobytes")
    return mem

What will we have instead of mem? Ideally, it should be a NumPy array. It’s not a problem to create it; we’ve already seen what parameters it will have in __array_interface__:

In [13]: shape, typestr = Image._conv_type_shape(im)
In [14]: data = numpy.empty(shape, dtype=numpy.dtype(typestr))

But it won’t work if instead of mem you try to take its flat version: 

In [15]: mem = data.reshape((data.size,))
In [16]: mem[0:4] = b'abcd'
ValueError: invalid literal for int() with base 10: b'abcd'

In this case, it seems strange that it’s impossible to put bytes into a byte array.  But keep in mind that, firstly, not only bytes can be on the left, and secondly, the library is called NumPy, meaning it works with numbers. Fortunately, NumPy gives you access to the immediate memory of an array directly from Python. Here’s its data property:

In [17]: data.data
Out[17]: <memory at 0x7f78854d68>

In [18]: data.data[0] = 255
NotImplementedError: sub-views are not implemented

In [19]: data.data.shape
Out[19]: (4096, 4096, 3)

In [20]: data.data[0, 0, 0] = 255

There is a memoryview object. But this memoryview is strange: it’s also multidimensional, like the NumPy array itself, and it has the same object type as the array itself. Fortunately, it’s easy to fix with the cast method:

In [21]: mem = data.data.cast('B', (data.data.nbytes,))

In [22]: mem.nbytes == mem.shape[0]
Out[22]: True

In [23]: mem[0], mem[1]
Out[23]: (255, 0)

In [24]: mem[0:4] = b'1234'

In [25]: mem[0], mem[1]
Out[25]: (49, 50)

Now let’s put it all together:

def to_numpy(im):
    im.load()
    # unpack data
    e = Image._getencoder(im.mode, 'raw', im.mode)
    e.setimage(im.im)

    # NumPy buffer for the result
    shape, typestr = Image._conv_type_shape(im)
    data = numpy.empty(shape, dtype=numpy.dtype(typestr))
    mem = data.data.cast('B', (data.data.nbytes,))

    bufsize, s, offset = 65536, 0, 0
    while not s:
        l, s, d = e.encode(bufsize)
        mem[offset:offset + len(d)] = d
        offset += len(d)
    if s < 0:
        raise RuntimeError("encoder error %d in tobytes" % s)
    return data

And check:

In [26]: n = to_numpy(im)

In [27]: numpy.all(n == numpy.array(im))
Out[27]: True

In [28]: n.flags
Out[28]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [29]: %timeit -n 10 n = to_numpy(im)
101 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Great! It’s 2.5 times faster with the same functionality and a smaller number of allocations.

Benchmarks 

The image I picked for this test was quite large. It’s not because to_numpy doesn’t work faster for smaller images. (It does!) The thing is that, in general, it’s tough to achieve any kind of constant runtime when it comes to memory allocation. The allocator can request new memory from the system or provide preallocated memory. It can decide to fill it with zeros or leave it as it is. From this perspective, working with large arrays at least gives us a stable result: we always get the worst case.

Here’s the code:

In [30]: for i in range(6, 0, -1):
    ...:     i = 128 * 2 ** i
    ...:     print(f'\n\nSize: {i}x{i}   \t{i*i // 1024} KPx')
    ...:     im = Image.new('RGB', (i, i))
    ...:     print('\tnumpy.array()')
    ...:     %timeit n = numpy.array(im)
    ...:     print('\tnumpy.asarray()')
    ...:     %timeit n = numpy.asarray(im)
    ...:     print('\tto_numpy()')
    ...:     %timeit n = to_numpy(im)
    ...:     im = None
    ...: 

Results:

Size numpy.array()numpy .asarray()to_numpy()Acceleration 
8192×8192995 ms683 ms378 ms2.63x
4096×40962571791012.54x
2048×204824.513.410.52.33x
1024×10244.823.452.741.77x
512×5121.341.050.751.79x
256×2560.260.20.181.44x

As a result, we’ve managed to eliminate unnecessary memory allocation, speed up the process from 1.5 to 2.5 times, and figure out how NumPy works with memory along the way. 

Leave a comment

*

*

One developer subscribed to the Uploadcare blog and eventually became a CTO