In the introductory article, I provided a comprehensive summary of the challenge. The story turned out rather long and a bit half-baked: it didn't contain a single line of code. However, it's hard to talk about specific optimizations without that preceding summary. Of course, some techniques, such as caching calculations or reducing branching, can be applied to any code at hand.
I believe certain things cannot be done without an overall understanding of the problem you're dealing with. That is what distinguishes a living person from an optimizing compiler. And that's why manual optimizations still play a critical part: compilers deal with code, humans deal with problems. A compiler cannot decide that a number is sufficiently random, while a human can.
Let us recall that we're talking about optimizing image resize via convolution-based resampling in the Pillow Python library. In this "general optimizations" article, I'm going to tell you about the changes I implemented several years ago. It's not a word-for-word story: I optimized the order of the optimizations for the narrative. I also created a separate branch from version 2.6.2 in the repo; that's our starting point.
If you'd like to make this interactive and not just read but also run some tests, here's the pillow-perf repo.
Because Pillow consists of many modules and can't be compiled incrementally, we use the ccache utility to speed up repeated builds significantly.
pillow-perf allows you to test many different operations, but we're interested in scale specifically, where -n 3 sets the number of test runs. While the code is still pretty slow, let's use a smaller -n so we don't fall asleep. Here are the results for the initial performance, commit bf1df9a:
The results above differ from the ones in the official benchmark for version 2.6. There are two reasons for that:
- The official benchmark uses 64-bit Ubuntu 16.04 with GCC 5.3 while I used 32-bit Ubuntu 14.04 with GCC 4.8 for this article. You’ll understand why I did so by the end of it.
- In this series of articles, I start from a commit containing a bug fix that isn't directly related to optimizations but affects performance.
A significant part of the code we're about to look into is in the Antialias.c file, in the ImagingStretch function specifically. The code of the function can be split into three parts:
As I highlighted in the previous article, convolution-based resize can be done in two passes: the first one changes the image width, and the second one changes the height, or vice versa. The ImagingStretch function deals with one of the dimensions per call. Here, you can see that the function is called twice for every resize operation. First comes a prologue, and then one operation or the other, depending on the input parameters. Quite an unusual way to get rid of repeating code (the prologue, in this case). Within the function, both passes look similar, adjusted for the processing direction. For brevity, here's the code for one of the passes:
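To make the two-pass idea concrete, here is a minimal, self-contained C sketch. It is not Pillow's code: all names are hypothetical, and nearest-neighbor sampling stands in for convolution purely to keep it short. Each pass changes exactly one dimension, and a full resize chains the two:

```c
#include <stdlib.h>

typedef unsigned char UINT8;

/* Change only the width of a single-channel image (nearest neighbor is
   used here just to keep the sketch short; Pillow does convolution). */
static UINT8 *stretch_horizontal(const UINT8 *im, int w, int h, int new_w) {
    UINT8 *out = malloc((size_t) new_w * h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < new_w; x++)
            out[(size_t) y * new_w + x] = im[(size_t) y * w + x * w / new_w];
    return out;
}

/* Change only the height. */
static UINT8 *stretch_vertical(const UINT8 *im, int w, int h, int new_h) {
    UINT8 *out = malloc((size_t) w * new_h);
    for (int y = 0; y < new_h; y++)
        for (int x = 0; x < w; x++)
            out[(size_t) y * w + x] = im[(size_t) (y * h / new_h) * w + x];
    return out;
}

/* A full resize is just two passes, one per dimension. */
static UINT8 *resize(const UINT8 *im, int w, int h, int new_w, int new_h) {
    UINT8 *tmp = stretch_horizontal(im, w, h, new_w);
    UINT8 *out = stretch_vertical(tmp, new_w, h, new_h);
    free(tmp);
    return out;
}
```

In Pillow, a single function, ImagingStretch, plays the role of both stretch helpers, picking the direction from its arguments.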
Here we can see branching over the picture formats supported by Pillow: single-channel 8-bit (shades of gray); multi-channel 8-bit (RGB, RGBA, LA, CMYK, and others); single-channel 32-bit integer; and 32-bit float. We're specifically interested in the multi-channel 8-bit case, since it's the most common image format.
Even though I mentioned the two passes look similar, there’s a significant difference. Let’s check out the vertical pass first:
And, the horizontal one:
While in the vertical pass we iterate through the columns of the resulting image, in the horizontal one we deal with rows. The horizontal pass is thus a serious problem for the processor cache: each step of the inner loop accesses the row below, requesting a value that is far from the one needed on the previous step. That's a problem when working with small convolution windows. The thing is, current processors can only read 64-byte cache lines from RAM. This means that when we convolve fewer than 16 pixels, reading part of the data from RAM into the cache is a waste. Now, imagine we inverted the loops: each convolved pixel would then be not in the next row but the next pixel of the current row. Hence, most of the needed pixels would already be in the cache.
The second negative factor of such code organization shows up with a long convolution row, that is, in the case of significant downscaling. The thing is, the original pixels for adjacent convolutions overlap significantly, and it would be great if that data remained in the cache. But when we move from top to bottom, the data for the old convolutions are gradually pushed out of the cache by the data for the new ones. As a result, when the next outer step begins after the inner loop, the upper rows are no longer in the cache; they have all been replaced by the lower ones. So we must read those upper rows from RAM again. And by the time we get to the lower rows, their data have in turn been pushed out by the upper rows. We end up with a loop where none of the needed data is ever in the cache.
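To see why the access order matters, here's a small self-contained C illustration (not Pillow code; the names are made up). Both functions compute the same sum over a single-channel image, but the row-wise version touches memory sequentially, so every 64-byte cache line it fetches is fully used, while the column-wise version strides a whole row per step, just like the horizontal pass above:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a width x height single-channel image row by row: consecutive
   reads touch adjacent bytes, so each fetched cache line is fully used
   before it can be evicted. */
static uint64_t sum_rowwise(const uint8_t *img, int width, int height) {
    uint64_t sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += img[(size_t) y * width + x];
    return sum;
}

/* Same sum, column by column: each read jumps a whole row ahead, so
   only one byte of every 64-byte cache line is used per fetch. */
static uint64_t sum_colwise(const uint8_t *img, int width, int height) {
    uint64_t sum = 0;
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            sum += img[(size_t) y * width + x];
    return sum;
}
```

On a large enough image, the column-wise version runs several times slower even though both loops do exactly the same arithmetic.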
Why is it so? In the pseudo code above, the second line in both cases stands for the calculation of the convolution coefficients. For the vertical pass, the coefficients depend only on the yy value of the row in the resulting image, and for the horizontal pass, on the xx value of the current column. Hence, we can't just swap the two loops in the horizontal pass: the coefficients would have to be calculated inside the loop over xx. And if we start computing the coefficients in the inner loop, we get a performance degradation, especially with the Lanczos filter, which makes use of trigonometric functions.
So, we shouldn't calculate the coefficients on every step, even though they are identical for every pixel in a column. However, we can calculate the coefficients for all the columns in advance and use them in the inner loop. Let's do this.
There’s a line for allocating the memory for coefficients:
k = malloc(kmax * sizeof(float));
Now we'll need an array of such arrays. We can simplify things by allocating one flat memory fragment and emulating two dimensions via index arithmetic:
kk = malloc(imOut->xsize * kmax * sizeof(float));
We'll also need storage for xmin and xmax, which also depend on xx. Let's make an array for those:
xbounds = malloc(imOut->xsize * 2 * sizeof(float));
Also, inside the loop, a value ww is used for normalization: ww = 1 / Σk. We don't need to store it at all. Instead, we can normalize the coefficients themselves rather than the convolution results. So, once we calculate the coefficients, we iterate through them once more, dividing every single value by their sum. As a result, the sum of all the coefficients becomes 1.0:
Finally, we can rotate the process by 90 degrees:
Here are the performance results we get, commit d35755c:
The fourth column shows the performance improvements.
In some parts of the code, we can see a construction like this:
The snippet limits pixel values to the 8-bit range [0, 255] in case the calculation results fall outside of it. This can happen because the sum of all the positive convolution coefficients can be greater than one while the sum of all the negative ones can be less than zero, so we can sometimes run into an overflow. The overflow is the result of compensating for sharp brightness gradients and is not an error.
Let's take a look at the code. There's one input variable, ss, and a single output, imOut->image[yy]. The output gets assigned values in several places. The problem here is that we're comparing floating-point numbers, while it would be more efficient to convert the value to an integer once and then compare integers. So, the clipping is factored out into a function; here's how it's used:
imOut->image[yy][xx*4+b] = clip8(ss);
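Here's a sketch of what such a clip8 function can look like (close in spirit to the Pillow change, though reconstructed from the description rather than copied): the float is converted to an integer once, and all the range checks then run on cheap integer comparisons.

```c
typedef unsigned char UINT8;

/* Clamp a floating-point convolution result to the 8-bit range.
   The value is converted to an integer exactly once, so the two range
   checks compare integers instead of floats. */
static UINT8 clip8(float in) {
    int out = (int) in;
    if (out >= 255)
        return 255;
    if (out <= 0)
        return 0;
    return (UINT8) out;
}
```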
This optimization also gives us a performance boost, a moderate one this time, commit 54d3b9d:
As you can see, this optimization has a greater effect on filters with smaller windows and larger output resolutions, with the sole exception of 320x200 Bilinear; I haven't researched why it turned out this way. And that's logical: the smaller the filter window and the larger the final resolution, the greater the share of overall runtime that value clipping contributes.
If we again look through the code for the horizontal pass, we find four nested loops:
We're iterating through every row and every column of the output image: every pixel, to be precise. The inner loops iterate through the pixels of the input image that are to be convolved. And what about b? It iterates through the bands of the image. The number of image bands is essentially constant and never exceeds four, due to the way Pillow represents images. Hence, there are four possible cases, one of which deals with single-channel 8-bit images. Those are represented differently, so the final number of cases is three. We can therefore write three separate inner loops: for 2-, 3-, and 4-band images, respectively. Then, we add branching to switch between the modes. Here's the code for the most common case, a 3-band image, so as not to take up too much space:
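Here's a self-contained sketch of what the 3-band inner loop amounts to. The names are hypothetical, and for simplicity it uses a packed 3-bytes-per-pixel layout, whereas Pillow actually stores every pixel in 4 bytes (hence the xx*4+b indexing earlier):

```c
typedef unsigned char UINT8;

/* Convolve one output pixel of a 3-band (e.g. RGB) row. Instead of an
   inner `for (b = 0; b < 3; b++)` loop, each band gets its own
   accumulator, removing a loop the compiler can't always unroll itself.
   `line` is a row of 8-bit RGB pixels, `k` the normalized coefficients,
   and xmin/count come from the precomputed bounds. */
static void convolve_rgb_pixel(const UINT8 *line, const double *k,
                               int xmin, int count, UINT8 out[3]) {
    double ss0 = 0.5, ss1 = 0.5, ss2 = 0.5;  /* 0.5 for rounding */
    for (int x = 0; x < count; x++) {
        ss0 += line[(xmin + x) * 3 + 0] * k[x];
        ss1 += line[(xmin + x) * 3 + 1] * k[x];
        ss2 += line[(xmin + x) * 3 + 2] * k[x];
    }
    out[0] = (UINT8) ss0;  /* a real implementation clamps here */
    out[1] = (UINT8) ss1;
    out[2] = (UINT8) ss2;
}
```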
We can go even further and unroll the branching one more time, lifting it one loop level higher:
Here are the performance improvements we get, commit 95a9e30:
Something similar can be found for the vertical pass. Here’s the initial code:
There's no iterating through channels here. Instead, xx iterates over the width multiplied by four, i.e., xx goes through every channel regardless of the number of bands in the image.
The FIXME in the comment refers to exactly this. We do the same thing as before: add branching that switches depending on the number of input image bands. The code can be found in commit f227c35; here are the results:
I'd like to highlight that the horizontal pass optimizations provide a bigger win when downscaling an image, while the vertical pass ones help more when upscaling.
If we take a look at the inner loop, we can see that ymin and ymax are declared as floats. However, they are converted to integers at every step. Moreover, outside the loop, where the variables get assigned their values, the floor and ceil functions are used. So, every value they are ever assigned is an integer, even though the variables are declared as floats. The same applies to xmin and xmax. Let's change that and check the performance, commit 57e8925:
I admit I was happy about these results. I had managed to speed up the code 2.5 times, and you wouldn't need better hardware or anything like that to get the boost: the same number of cores on the same CPU. The only requirement was updating Pillow to version 2.7. There was still some time before the release, and I wanted to test the new code on the server it was intended for. I cloned the code, compiled it, and at first I even thought I had gotten something wrong:
Lolwut!? Everything performed the same as before any of the optimizations. I re-checked everything ten times and added debug output to make sure I was testing the right code.
Well, glory to Moore, I was not crazy; it was a real bug in the compiler. They fixed it in GCC 4.9, but GCC 4.8 was included in the then-current Ubuntu 14.04 LTS distribution, so GCC 4.8 was most likely what the majority of the library's users had installed. Ignoring this was impractical: how would users benefit from an optimization that doesn't work in the majority of cases, including the very one it was made for? I updated the question on Stack Overflow and sent out a tweet. That's how Vyacheslav Egorov, a former V8 engine developer and a genius of optimization, came to help me.
To understand the problem, we'll need to delve into the history of CPUs and their current architecture. Way back, x86 processors couldn't work with floating-point numbers themselves; coprocessors implementing the x87 command set did it for them. They executed instructions from the same thread as the CPU but were installed on the motherboard as separate devices. Later, coprocessors were built into the central processors, physically forming a single device. Eventually, with the Pentium III, Intel introduced a set of instructions called SSE (Streaming SIMD Extensions). The third article of this series will be devoted to SIMD instructions, by the way. Despite its name, SSE did not only contain SIMD instructions for working with floating-point numbers; it also had their equivalents for scalar computations. That is, SSE held a set of instructions duplicating the x87 set but encoded and behaving differently.
However, compilers were in no hurry to generate SSE code for floating-point calculations and kept using the older x87 set. After all, the presence of SSE in a CPU could not be guaranteed, while the x87 set had been there for decades. Things changed when the 64-bit CPU mode arrived: for this mode, the SSE2 instruction set became mandatory. So, if you write a 64-bit program for x86, the SSE2 instruction set is available to you at the very least. And that's what compilers make use of: they generate SSE instructions for floating-point calculations in 64-bit mode. Again, I'm talking about ordinary scalar calculations, no vectorization involved.
That's what was happening in our case: different instruction sets were used in 32-bit and 64-bit mode. However, this still doesn't explain why the newer SSE code worked many times slower than the outdated x87 code. To understand the phenomenon, let's talk about how CPUs execute instructions.
Some time ago, processors really did "execute" instructions: they would fetch an instruction, decode it, do what it said, and put the results where they belonged. But they could be smarter than that, and so they became. Modern processors are brainier and more complex: they consist of dozens of different subsystems. Even on a single core, without any parallel processing, a CPU executes several instructions in one clock cycle. Different instructions are simply at different stages of execution: while some are being decoded, others are fetching data from the cache, and yet others are being handed to the arithmetic unit. Each processor subsystem deals with its own part of an instruction. This is called the CPU pipeline.
In the picture, colors denote different CPU subsystems. Even though each instruction takes 4-5 cycles to execute, thanks to the pipeline one new instruction can be started, and one completed, every clock cycle.
The more uniformly the pipeline is filled and the fewer subsystems sit idle, the more efficiently it works. There are even CPU subsystems that plan optimal pipeline filling: they reorder, split, or combine instructions.
One of the things that causes CPU pipelines to underperform is data dependency: when one instruction requires the output of another to execute, many processor subsystems sit idle, waiting for that output.
The figure shows the second instruction waiting for the completion of the first, with most subsystems idle. This slows down the execution of the entire instruction chain.
And now, with all of that in mind, let's look at the pseudo code of the instruction that converts an integer into a floating-point number:
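To make the semantics tangible, here's a tiny C model of the instruction's documented behavior (the xmm_t struct and the function are purely illustrative, modeling the 128-bit register as four 32-bit float lanes, not a real API):

```c
/* Model of a 128-bit xmm register as four 32-bit float lanes. */
typedef struct { float lane[4]; } xmm_t;

/* cvtsi2ss dst, a:
 *   dst[31:0]   = Convert_Int32_To_Float(a)
 *   dst[127:32] = dst[127:32]   (preserved!)
 * Because the upper 96 bits are preserved, the instruction has an input
 * dependency on the previous value of dst, even when the caller only
 * cares about the low lane. */
static xmm_t cvtsi2ss(xmm_t dst, int a) {
    dst.lane[0] = (float) a;
    return dst;
}
```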
The result is written into the lower 32 bits. Even though the description goes on to say that the remaining bits of the register a are transferred to the dst register, that's not what actually happens. Looking at the signature of the instruction, we see that it deals with a single xmm register, which means dst and a refer to the same register: the higher 96 bits are not going anywhere. And here's where we get a data dependency. The instruction is designed to preserve the high-order bits, and therefore, to execute it, we need to wait for the results of all other operations working with this register.
However, the thing is, we aren't using any of the higher bits: we're only interested in the lower 32-bit float. The higher bits don't affect our result, so we don't actually care about them. This is called a false data dependency.
Fortunately, we can break this dependency. Before executing the cvtsi2ss conversion instruction, the compiler should clear the register with xorps. I can't call this fix intuitive, or even logical. Most likely, it's not so much a fix as a hack at the decoder level that replaces the xorps + cvtsi2ss pair with some internal instruction with the following pseudo code:
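In the same illustrative C model as above (hypothetical names again), the fused pair behaves like a single instruction that overwrites the whole register, so the result no longer depends on whatever dst held before; the false dependency is broken:

```c
typedef struct { float lane[4]; } xmm_t;

/* xorps dst, dst  followed by  cvtsi2ss dst, a:
 * the register is fully overwritten, so no previous value of dst is
 * needed and the pipeline doesn't have to wait for earlier writes. */
static xmm_t xorps_cvtsi2ss(int a) {
    xmm_t dst = {{0.0f, 0.0f, 0.0f, 0.0f}};  /* xorps dst, dst */
    dst.lane[0] = (float) a;                 /* cvtsi2ss dst, a */
    return dst;
}
```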
The fix for GCC 4.8 is rather ugly: it consists of assembler code plus preprocessor code that checks whether the fix can be applied. I won't show it here, but you can always check out commit 81fc88e. The fix completely cures the 64-bit code. Here's how it affects the performance on the server machine:
The situation I described in this part of the series is rather common. Code that converts integers to floats and then runs calculations on them can be found in almost any program. ImageMagick is one example: its 64-bit versions compiled with GCC 4.9 are 40% faster than the ones compiled with previous GCC versions. In my opinion, that's a serious flaw in SSE.
That's how we got to an average improvement of 2.5x without altering the approach, using general optimizations alone. In the next article, I'll show how to speed that result up another 3.5 times by implementing SIMD techniques.