How we made Blurhash 128x faster
Alex Karpinsky

Blurhash is a nifty way to generate a blurred preview of an image, represented as a compact ASCII string. Originally developed by the Finnish company Wolt, it’s a clever solution for creating smooth, graceful content-loading experiences on the web. For the longest time, I’ve wanted to bring something like this into the Uploadcare URL API, giving our clients an easy way to enhance their sites’ user experience. But every time I looked into it, one thing kept bothering me: performance. It just seemed too slow.
Now, it’s finally time to dig in and uncover what’s behind the sluggishness.
The main Blurhash repository includes reference implementations in C, Swift, Kotlin, and TypeScript. For my own dive into the code, though, I kicked things off with the Python library.
0. C <=> Python interface
Let’s start by understanding how the library is organized. The Blurhash Python implementation interacts with the C library through the CFFI, but not in the most conventional way. Instead of linking to a system library, the project copies the decode.c and encode.c files directly into its source. This approach works fine for small libraries with straightforward APIs, like Blurhash.
The core C function, create_hash_from_pixels, handles the heavy lifting. It accepts:
- The number of hash components (x_components and y_components), in other words, the hash resolution.
- The original image dimensions (width and height).
- The image itself as raw RGB data (rgb).
- bytes_per_row, which accounts for extra padding in images and allows cropping by simply adjusting the pointer to the first pixel and recalculating the row size.
- destination, the output buffer for the resulting Blurhash string.
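To make the boundary between the two languages concrete, here is a rough sketch of how such a binding can be declared with CFFI. The module name, header include, source path, and the int return type below are my assumptions for illustration, not the project’s actual build script:

# build_blurhash.py: a hypothetical CFFI build sketch (API mode)
from cffi import FFI

ffibuilder = FFI()

# The C entry point described above; the int return type is an assumption.
ffibuilder.cdef("""
    int create_hash_from_pixels(
        int x_components, int y_components, int width, int height,
        uint8_t *rgb, size_t bytes_per_row, char *destination);
""")

# Compile the vendored C source directly into the extension module.
ffibuilder.set_source(
    "_blurhash",                # generated module name (assumption)
    '#include "encode.h"',      # header name (assumption)
    sources=["src/encode.c"],   # vendored source path (assumption)
)

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)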
Here’s how the Python code wraps this functionality:
def encode(image, x_components, y_components):
if not isinstance(image, Image.Image):
image = Image.open(image)
if image.mode != 'RGB':
image = image.convert('RGB')
red_band = image.getdata(band=0)
green_band = image.getdata(band=1)
blue_band = image.getdata(band=2)
rgb_data = list(chain.from_iterable(zip(red_band, green_band, blue_band)))
width, height = image.size
image.close()
rgb = _ffi.new('uint8_t[]', rgb_data)
bytes_per_row = _ffi.cast('size_t', width * 3)
destination = _ffi.new("char[]", 2 + 4 + (9 * 9 - 1) * 2 + 1)
result = _lib.create_hash_from_pixels(
x_components, y_components,
width, height, rgb, bytes_per_row,
destination,
)
# ...
At first glance, this looks functional, but it has some inefficiencies. The wrapper iterates through every pixel, extracts RGB values into separate lists, and then recombines them into a C array. This process is resource-intensive and far from optimal.
To assess its performance, I tested the function using a 360×240-pixel image:
from blurhash import encode, Image
im = Image.open('/Code/imgs/bologna2k.jpg').resize((360, 240), Image.BICUBIC)
assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
%timeit encode(im, 6, 4)
Note that I used assert to make sure I wouldn’t break anything during the optimization. However, I encountered an issue right away:
ValueError: Operation on closed image
This error occurred because the image.close() call inside the function prematurely closed the image object.
To work around this, I modified the test to explicitly create a copy of the image before calling the function:
...: assert encode(im.copy(), 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im.copy(), 6, 4)
126 ms ± 995 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The results were disappointing: 126 milliseconds per call. For such a small image, this performance is far too slow, especially for practical use in production.
Upon closer inspection, I realized the process of extracting individual RGB channels and recombining them was unnecessary.
Pillow offers a .tobytes() method that returns the raw pixel data as a bytes object. Replacing the manual channel extraction with this method cleaned up the code and provided a small performance boost:
- red_band = image.getdata(band=0)
- green_band = image.getdata(band=1)
- blue_band = image.getdata(band=2)
- rgb_data = list(chain.from_iterable(zip(red_band, green_band, blue_band)))
+ rgb_data = image.tobytes()
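As a quick sanity check (my own illustration, not code from the library), .tobytes() on an RGB image returns the pixels as a single interleaved R, G, B byte string, which is exactly the layout the C code expects:

from PIL import Image

im = Image.new("RGB", (2, 1))
im.putpixel((0, 0), (1, 2, 3))
im.putpixel((1, 0), (4, 5, 6))
# Two pixels packed as R, G, B, R, G, B
assert im.tobytes() == bytes([1, 2, 3, 4, 5, 6])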
After this change, I tested the performance again:
...: assert encode(im.copy(), 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im.copy(), 6, 4)
109 ms ± 487 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The runtime improved to 109 milliseconds, a modest 15% increase in speed.
While this wasn’t a dramatic improvement, it removed unnecessary complexity and reduced dependencies like itertools and six. But what about image.close()?
It’s not completely useless — it’s meant to handle cases where the function works with both a preloaded image object and an image file opened directly.
However, there are several issues.
First, if the image is passed as a path and isn’t in RGB mode, the function converts the opened image to RGB and then closes the converted copy instead of the file it opened. This leaves the original image object open, which could lead to resource management issues. Second, if you pass an RGB image as an object, it gets closed even though you may not expect it.
To address this, we can use a context manager to handle the image’s lifecycle cleanly, so the function only closes images it opened itself. Here’s how the context is applied:
from contextlib import nullcontext
def encode(image, x_components, y_components):
if isinstance(image, Image.Image):
image_context = nullcontext()
else:
image = Image.open(image)
image_context = image
with image_context:
if image.mode != 'RGB':
image = image.convert('RGB')
rgb_data = image.tobytes()
width, height = image.size
rgb = _ffi.new('uint8_t[]', rgb_data)
# ...
The rest of the code remains unchanged.
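For clarity, here is how both call styles behave after the change (the file name is just a placeholder):

from PIL import Image
from blurhash import encode

# A preloaded image object is left open for the caller to keep using.
im = Image.open("photo.jpg")          # placeholder path
blurhash_string = encode(im, 6, 4)
im.thumbnail((100, 100))              # the image is still usable here

# A path (or file object) is opened and closed by encode() itself.
blurhash_string = encode("photo.jpg", 6, 4)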
However, with this fix, the explicit call to copy() before invoking encode() is no longer necessary. (Note the time spread: ± 30 μs here vs ± 487 μs in the previous run.)
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
108 ms ± 30 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although this change doesn’t significantly improve speed, it makes the results much more stable by eliminating unnecessary memory copying.
1. Cache all the things
Now, let’s dive deeper into the performance bottlenecks.
Most of the runtime is spent in the encode.c file, specifically in the multiplyBasisFunction function. It is called once for every point of the hash, that is, for every combination of xComponents and yComponents. Here’s the code:
static struct RGB multiplyBasisFunction(
int xComponent, int yComponent, int width, int height,
uint8_t *rgb, size_t bytesPerRow
) {
struct RGB result = { 0, 0, 0 };
float normalisation = (xComponent == 0 && yComponent == 0) ? 1 : 2;
for(int y = 0; y < height; y++) {
for(int x = 0; x < width; x++) {
float basis = cosf(M_PI * xComponent * x / width) * cosf(M_PI * yComponent * y / height);
result.r += basis * sRGBToLinear(rgb[3 * x + 0 + y * bytesPerRow]);
result.g += basis * sRGBToLinear(rgb[3 * x + 1 + y * bytesPerRow]);
result.b += basis * sRGBToLinear(rgb[3 * x + 2 + y * bytesPerRow]);
}
}
float scale = normalisation / (width * height);
result.r *= scale;
result.g *= scale;
result.b *= scale;
return result;
}
The code iterates over every pixel of the input image and calculates a coefficient called basis using cosine functions. These values are combined with the pixel values (converted to linear color space using sRGBToLinear) to produce the final result. Essentially, this is a discrete cosine transform (DCT) for the Blurhash.
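Since the sRGBToLinear helper itself isn’t shown above, here is a compact Python paraphrase of what a single multiplyBasisFunction call computes. The conversion follows the standard sRGB formula, which is what the C helper is expected to implement; row padding (bytesPerRow) is ignored for brevity:

import math

def srgb_to_linear(value: int) -> float:
    """Standard sRGB to linear conversion for an 8-bit channel value."""
    v = value / 255.0
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def factor(xc, yc, width, height, rgb):
    """What one multiplyBasisFunction call computes, in plain Python.

    rgb is a flat bytes-like object of interleaved R, G, B values.
    """
    norm = 1.0 if xc == 0 and yc == 0 else 2.0
    r = g = b = 0.0
    for y in range(height):
        for x in range(width):
            basis = (math.cos(math.pi * xc * x / width)
                     * math.cos(math.pi * yc * y / height))
            i = 3 * (y * width + x)
            r += basis * srgb_to_linear(rgb[i + 0])
            g += basis * srgb_to_linear(rgb[i + 1])
            b += basis * srgb_to_linear(rgb[i + 2])
    scale = norm / (width * height)
    return r * scale, g * scale, b * scale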
But here’s the catch: the sRGBToLinear function is called repeatedly for every pixel, even though the input values (8-bit RGB values) only range from 0 to 255. This means we’re recalculating the same results over and over for no reason.
To optimize this, we can precompute and cache the results of sRGBToLinear for all possible input values (0–255). Instead of recalculating the same values during every encode call, we simply look them up in the cache. Since this cache only requires 256 floats (1024 bytes of memory), it’s incredibly lightweight but delivers a significant performance boost.
Here’s how the cache is implemented:
float *sRGBToLinear_cache = NULL;
static void init_sRGBToLinear_cache() {
if (sRGBToLinear_cache != NULL) {
return;
}
sRGBToLinear_cache = (float *)malloc(sizeof(float) * 256);
for (int x = 0; x < 256; x++) {
sRGBToLinear_cache[x] = sRGBToLinear(x);
}
}
And here’s how it’s used in the multiplyBasisFunction function:
static struct RGB multiplyBasisFunction(
int xComponent, int yComponent, int width, int height,
uint8_t *rgb, size_t bytesPerRow
) {
struct RGB result = { 0, 0, 0 };
init_sRGBToLinear_cache();
for(int y = 0; y < height; y++) {
for(int x = 0; x < width; x++) {
float basis = cosf(M_PI * xComponent * x / width) * cosf(M_PI * yComponent * y / height);
result.r += basis * sRGBToLinear_cache[rgb[3 * x + 0 + y * bytesPerRow]];
result.g += basis * sRGBToLinear_cache[rgb[3 * x + 1 + y * bytesPerRow]];
result.b += basis * sRGBToLinear_cache[rgb[3 * x + 2 + y * bytesPerRow]];
}
}
// ...
}
With this optimization in place, I ran the tests again:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
14.2 ms ± 6.37 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The result was astonishing: 14.2 milliseconds per call, an 8.87x speedup compared to the original implementation. This demonstrates how a simple caching mechanism can make a massive difference in performance.
Another interesting observation is that the basis coefficient calculation uses the cosf function, which is computationally expensive. For every pixel, cosf is calculated twice: once for x and once for y. What’s notable is that these values depend only on the loop counters x and y, which repeat across iterations. Fortunately, the compiler can hoist the repeated cos(y) calculation out of the inner loop automatically, but it doesn’t do the same for cos(x). To improve this, we can precompute the cosf values for x and store them in an array, which avoids recalculating them during every iteration:
static struct RGB multiplyBasisFunction(
int xComponent, int yComponent, int width, int height,
uint8_t *rgb, size_t bytesPerRow
) {
struct RGB result = { 0, 0, 0 };
float *cosx = (float *)malloc(sizeof(float) * width);
for(int x = 0; x < width; x++) {
cosx[x] = cosf(M_PI * xComponent * x / width);
}
for(int y = 0; y < height; y++) {
float cosy = cosf(M_PI * yComponent * y / height);
uint8_t *src = rgb + y * bytesPerRow;
for(int x = 0; x < width; x++) {
float basis = cosy * cosx[x];
result.r += basis * sRGBToLinear_cache[src[3 * x + 0]];
result.g += basis * sRGBToLinear_cache[src[3 * x + 1]];
result.b += basis * sRGBToLinear_cache[src[3 * x + 2]];
}
}
free(cosx);
// ...
}
I ran the performance tests again after implementing this change:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
3.45 ms ± 3.25 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.45 milliseconds per call. 36x speedup compared to the original implementation.
At this point, the speed was already sufficient to consider the library optimized for production use. However, curiosity and the thrill of further improvement kept me going. There was still room for optimization!
2. One pass to rule them all
The multiplyBasisFunction function calculates exactly one cosine transform coefficient for each combination of xComponent and yComponent. For every call, it iterates over all pixels of the input image and recalculates the cosf values for the same xComponent and yComponent combinations again and again. This redundancy costs us precious computation time.
To optimize this further, we can precompute the cosine values for all xComponent and yComponent combinations in advance and store them in arrays. By passing these arrays to the function, we eliminate the need to recalculate the same values multiple times, which reduces the overhead significantly. The optimized multiplyBasisFunction function becomes much simpler:
static struct RGB multiplyBasisFunction(
int xComponent, int yComponent, int width, int height,
uint8_t *rgb, size_t bytesPerRow, float *cosx, float *cosy
) {
struct RGB result = { 0, 0, 0 };
for(int y = 0; y < height; y++) {
uint8_t *src = rgb + y * bytesPerRow;
for(int x = 0; x < width; x++) {
float basis = cosy[y] * cosx[x];
result.r += basis * sRGBToLinear_cache[src[3 * x + 0]];
result.g += basis * sRGBToLinear_cache[src[3 * x + 1]];
result.b += basis * sRGBToLinear_cache[src[3 * x + 2]];
}
}
// ...
}
The blurHashForPixels function, which calls multiplyBasisFunction, is also modified to prepare all the coefficients ahead of time:
const char *blurHashForPixels(
int xComponents, int yComponents, int width, int height,
uint8_t *rgb, size_t bytesPerRow, char *destination
) {
if(xComponents < 1 || xComponents > 9) return NULL;
if(yComponents < 1 || yComponents > 9) return NULL;
float factors[9 * 9][3];
init_sRGBToLinear_cache();
float *cosx = (float *)malloc(sizeof(float) * width * xComponents);
if (! cosx) return NULL;
float *cosy = (float *)malloc(sizeof(float) * height);
if (! cosy) {
free(cosx);
return NULL;
}
for(int x = 0; x < xComponents; x++) {
for(int i = 0; i < width; i++) {
cosx[x * width + i] = cosf(M_PI * x * i / width);
}
}
for(int y = 0; y < yComponents; y++) {
for(int i = 0; i < height; i++) {
cosy[i] = cosf(M_PI * y * i / height);
}
for(int x = 0; x < xComponents; x++) {
struct RGB factor = multiplyBasisFunction(x, y, width, height, rgb, bytesPerRow, cosx + x * width, cosy);
factors[y * xComponents + x][0] = factor.r;
factors[y * xComponents + x][1] = factor.g;
factors[y * xComponents + x][2] = factor.b;
}
}
free(cosx);
free(cosy);
// ...
}
After applying these changes, I tested the performance again:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
3.4 ms ± 3.65 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The result was 3.4 milliseconds per call. While the improvement compared to the previous version was marginal, the number of cosine calculations was drastically reduced, and the code became significantly cleaner and more efficient.
With some reflection, it became clear that not only do the cosine values remain constant between different calls to multiplyBasisFunction, but the image pixels remain unchanged as well. In the current implementation, the code iterates over the image and reads each channel value from the sRGBToLinear_cache 24 times (6 × 4). Does this make sense? Absolutely not. It’s redundant and inefficient.
This realization marked the need for the most significant change in the code and its logic so far: performing all calculations in a single pass.
The straightforward solution would be to traverse xComponent and yComponent for each pixel within the image-processing loop. However, this approach creates four levels of nested loops, where the innermost loops cover very small distances. Based on prior experience, such a structure is highly prone to severe performance drops due to frequent CPU branch mispredictions. Although I didn’t test this directly, it’s a well-known issue that would almost certainly slow the code down significantly.
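For illustration only, here is roughly the shape that was ruled out, written as a toy Python sketch with dummy data; in the C version the two innermost loops would run only a handful of iterations each, which is exactly the pattern described above:

# A toy sketch of the rejected "components inside the pixel loop" structure.
width, height = 8, 4
x_components, y_components = 6, 4
linear = [0.5] * (width * height * 3)          # fake linearized pixel data
cos_x = [[1.0] * width for _ in range(x_components)]
cos_y = [[1.0] * height for _ in range(y_components)]
factors = [[0.0] * x_components for _ in range(y_components)]

for y in range(height):
    for x in range(width):
        p = linear[3 * (y * width + x)]        # red channel only, for brevity
        for j in range(y_components):          # the innermost loops are tiny:
            for i in range(x_components):      # just 4 x 6 iterations per pixel
                factors[j][i] += cos_y[j][y] * cos_x[i][x] * p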
To avoid these pitfalls, I adopted a different strategy: collapsing the two inner loops into a single loop and treating xComponent and yComponent as a single flat index. This eliminates the need to handle the two components separately during iteration, simplifying the logic.
However, a new challenge arose: how do we determine the correct xComponent and yComponent values at each step to access the precomputed cosine values? Integer division and modulus operations could theoretically solve this, but they are computationally expensive and would offset the performance gains from collapsing the loops.
The more efficient solution was to duplicate the coefficients for every combination of xComponent and yComponent. By expanding the cosX array to width * factorsCount (and, likewise, cosY to height * factorsCount), where factorsCount is xComponents * yComponents, rather than the original width * xComponents, the indexing problem was resolved. This approach slightly increases memory usage but avoids unnecessary calculations and eliminates the need for division and modulus operations entirely. This tradeoff — modestly higher memory consumption in exchange for reduced computational overhead — is an acceptable compromise, particularly given the performance gains it unlocks.
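The updated code that fills these expanded tables isn’t shown in the listings, so here is a sketch of the layout under the assumptions above: a flat index i runs over all component combinations, with x varying fastest, matching factors[y * xComponents + x] from the earlier code. Note that the division and modulus now happen once per factor while building the tables, not once per pixel:

import math

def build_cos_tables(x_components, y_components, width, height):
    """Duplicate the cosine coefficients for every flat factor index i."""
    factors_count = x_components * y_components
    cos_x = [0.0] * (width * factors_count)
    cos_y = [0.0] * (height * factors_count)
    for i in range(factors_count):
        xc = i % x_components          # which xComponent this flat index means
        yc = i // x_components         # which yComponent this flat index means
        for x in range(width):
            cos_x[x * factors_count + i] = math.cos(math.pi * xc * x / width)
        for y in range(height):
            cos_y[y * factors_count + i] = math.cos(math.pi * yc * y / height)
    return cos_x, cos_y

With the tables laid out like this, the single-pass multiplyBasisFunction only needs cosYLocal[i] * cosXLocal[i] for each factor: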
static void multiplyBasisFunction(
struct RGB *factors, int factorsCount, int width, int height,
uint8_t *rgb, size_t bytesPerRow, float *cosX, float *cosY
) {
for(int y = 0; y < height; y++) {
uint8_t *src = rgb + y * bytesPerRow;
float *cosYLocal = cosY + y * factorsCount;
for(int x = 0; x < width; x++) {
float pixel[3];
float *cosXLocal = cosX + x * factorsCount;
pixel[0] = sRGBToLinear_cache[src[3 * x + 0]];
pixel[1] = sRGBToLinear_cache[src[3 * x + 1]];
pixel[2] = sRGBToLinear_cache[src[3 * x + 2]];
for (int i = 0; i < factorsCount; i++) {
float basis = cosYLocal[i] * cosXLocal[i];
factors[i].r += basis * pixel[0];
factors[i].g += basis * pixel[1];
factors[i].b += basis * pixel[2];
}
}
}
for (int i = 0; i < factorsCount; i++) {
float normalisation = (i == 0) ? 1 : 2;
float scale = normalisation / (width * height);
factors[i].r *= scale;
factors[i].g *= scale;
factors[i].b *= scale;
}
}
With this optimization in place, I tested the function again:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
1.44 ms ± 950 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The results showed a remarkable improvement: 1.44 milliseconds per call. Compared to the initial implementation, this represents an 87.5x speedup. By consolidating redundant operations and computing all coefficients in a single pass, we drastically improved performance while simplifying the logic.
3. SIMD (SSE + Neon)
Finally, we arrive at one of my favorite topics: SIMD. It’s about to get exciting.
I had previously written SIMD optimizations for x86 architectures, but this time, my goal was to implement support for both x86 and ARM. All previous benchmarks had been run under Rosetta emulation on an M1 Pro, so this was the perfect opportunity to explore the ARM platform natively.
SSE implementation
The changes for SSE were minimal. For compatibility, I used the original SSE, ensuring it would run on older processors like the Pentium III. Here’s the updated function:
static void multiplyBasisFunction(
float factors[][4], int factorsCount, int width, int height,
uint8_t *rgb, size_t bytesPerRow, float *cosX, float *cosY
) {
for(int y = 0; y < height; y++) {
uint8_t *src = rgb + y * bytesPerRow;
float *cosYLocal = cosY + y * factorsCount;
int x = 0;
for(; x < width; x++) {
float *cosXLocal = cosX + x * factorsCount;
#ifdef __SSE__
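/* _mm_set_ps lists lanes from the highest to the lowest, so red lands in element 0 and the fourth element stays 0. */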
__m128 pixel = _mm_set_ps(0, sRGBToLinear_cache[src[3 * (x+0) + 2]],
sRGBToLinear_cache[src[3 * (x+0) + 1]], sRGBToLinear_cache[src[3 * (x+0) + 0]]);
for (int i = 0; i < factorsCount; i++) {
__m128 basis = _mm_set1_ps(cosYLocal[i] * cosXLocal[i + 0 * factorsCount]);
__m128 factor = _mm_loadu_ps(factors[i]);
factor = _mm_add_ps(factor, _mm_mul_ps(basis, pixel));
_mm_storeu_ps(factors[i], factor);
}
#else
float pixel[4];
pixel[0] = sRGBToLinear_cache[src[3 * x + 0]];
pixel[1] = sRGBToLinear_cache[src[3 * x + 1]];
pixel[2] = sRGBToLinear_cache[src[3 * x + 2]];
for (int i = 0; i < factorsCount; i++) {
float basis = cosYLocal[i] * cosXLocal[i];
factors[i][0] += basis * pixel[0];
factors[i][1] += basis * pixel[1];
factors[i][2] += basis * pixel[2];
}
#endif
}
}
// ...
}
With this SSE implementation, I achieved significant performance gains. Here’s the benchmark:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
973 μs ± 1.13 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The result was 0.97 milliseconds per call — a remarkable 128x speedup compared to the original version.
Neon implementation
The native implementation without Neon ran even faster on the M1 Pro without the Rosetta translation layer (though these measurements aren’t included in the final results):
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
808 μs ± 454 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The equivalent code for Neon on ARM turned out to be quite straightforward:
#elif defined(__aarch64__)
float32x4_t pixel = {sRGBToLinear_cache[src[3 * (x+0) + 0]],
sRGBToLinear_cache[src[3 * (x+0) + 1]], sRGBToLinear_cache[src[3 * (x+0) + 2]]};
for (int i = 0; i < factorsCount; i++) {
float32x4_t basis = vdupq_n_f32(cosYLocal[i] * cosXLocal[i + 0 * factorsCount]);
float32x4_t factor = vld1q_f32(factors[i]);
factor = vmlaq_f32(factor, basis, pixel);
vst1q_f32(factors[i], factor);
}
#else
I ran the benchmark natively on the M1 Pro, bypassing Rosetta:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
962 μs ± 24.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The function completed in 0.96 milliseconds per call. However, this was slightly slower than the earlier native benchmark without Neon intrinsics: even with the fused multiply-add vmlaq_f32 operation, performance dipped to 0.84x of the plain build.
At this point, I had already moved over to the main Blurhash repository to test the code in pure C. While analyzing the encode.o dump, I discovered that the compiler was smarter than I anticipated. It didn’t just unroll the inner loop by factorsCount; it also used ld4 instructions, which load data much more efficiently. Interestingly, these instructions caused slowdowns in Docker when Rosetta was enabled — a bug I encountered during testing.
Although this attempt didn’t achieve the expected improvement, it showed that compilers are often capable of applying optimizations that would be hard to match by hand. Instead of competing with the compiler, I decided to assist it by structuring the code to allow better vectorization.
4. Untie the compiler’s hands
At this point, it was time to help the compiler do its job more effectively. Modern compilers like GCC and Clang are capable of impressive optimizations, but sometimes they need a little guidance to make the most of the code.
The first inefficiency I noticed was that the factors[i] values were being repeatedly loaded and stored during each iteration of the loop. Ideally, these values should stay in registers for as long as possible. However, when factorsCount becomes large (e.g., 9 × 9 = 81), the available registers are insufficient to hold all combinations of xComponent and yComponent.
#ifdef __SSE__
__m128 pixel = _mm_set_ps(0, sRGBToLinear_cache[src[3 * (x+0) + 2]],
sRGBToLinear_cache[src[3 * (x+0) + 1]], sRGBToLinear_cache[src[3 * (x+0) + 0]]);
for (int i = 0; i < factorsCount; i++) {
__m128 basis = _mm_set1_ps(cosYLocal[i] * cosXLocal[i + 0 * factorsCount]);
__m128 factor = _mm_loadu_ps(factors[i]);
factor = _mm_add_ps(factor, _mm_mul_ps(basis, pixel));
_mm_storeu_ps(factors[i], factor);
}
#else
To address this, I implemented grouping for four consecutive pixels, reducing memory access overhead:
static void multiplyBasisFunction(
float factors[][4], int factorsCount, int width, int height,
uint8_t *rgb, size_t bytesPerRow, float *cosX, float *cosY
) {
for(int y = 0; y < height; y++) {
uint8_t *src = rgb + y * bytesPerRow;
float *cosYLocal = cosY + y * factorsCount;
int x = 0;
for(; x < width - 3; x += 4) {
float *cosXLocal = cosX + x * factorsCount;
float pixel0[4] = {sRGBToLinear_cache[src[3 * (x+0) + 0]], sRGBToLinear_cache[src[3 * (x+0) + 1]], sRGBToLinear_cache[src[3 * (x+0) + 2]]};
float pixel1[4] = {sRGBToLinear_cache[src[3 * (x+1) + 0]], sRGBToLinear_cache[src[3 * (x+1) + 1]], sRGBToLinear_cache[src[3 * (x+1) + 2]]};
float pixel2[4] = {sRGBToLinear_cache[src[3 * (x+2) + 0]], sRGBToLinear_cache[src[3 * (x+2) + 1]], sRGBToLinear_cache[src[3 * (x+2) + 2]]};
float pixel3[4] = {sRGBToLinear_cache[src[3 * (x+3) + 0]], sRGBToLinear_cache[src[3 * (x+3) + 1]], sRGBToLinear_cache[src[3 * (x+3) + 2]]};
for (int i = 0; i < factorsCount; i++) {
float basis0 = cosYLocal[i] * cosXLocal[i + 0 * factorsCount];
float basis1 = cosYLocal[i] * cosXLocal[i + 1 * factorsCount];
float basis2 = cosYLocal[i] * cosXLocal[i + 2 * factorsCount];
float basis3 = cosYLocal[i] * cosXLocal[i + 3 * factorsCount];
factors[i][0] += basis0 * pixel0[0] + basis1 * pixel1[0] + basis2 * pixel2[0] + basis3 * pixel3[0];
factors[i][1] += basis0 * pixel0[1] + basis1 * pixel1[1] + basis2 * pixel2[1] + basis3 * pixel3[1];
factors[i][2] += basis0 * pixel0[2] + basis1 * pixel1[2] + basis2 * pixel2[2] + basis3 * pixel3[2];
}
}
for(; x < width; x++) {
float pixel[4];
float *cosXLocal = cosX + x * factorsCount;
pixel[0] = sRGBToLinear_cache[src[3 * x + 0]];
pixel[1] = sRGBToLinear_cache[src[3 * x + 1]];
pixel[2] = sRGBToLinear_cache[src[3 * x + 2]];
for (int i = 0; i < factorsCount; i++) {
float basis = cosYLocal[i] * cosXLocal[i];
factors[i][0] += basis * pixel[0];
factors[i][1] += basis * pixel[1];
factors[i][2] += basis * pixel[2];
}
}
}
// ...
}
After applying these optimizations, I ran the benchmarks again. Under Rosetta (x86), the performance matches the results of the manually written SSE intrinsics:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
975 μs ± 1.81 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
On ARM, the results are even better: 37% faster than the previous version without this latest optimization:
...: assert encode(im, 6, 4) == 'W7EUfLs:M{8yM{wJ$%MepHrE%LV[NaVtBUXSsBNbVuOrM|NaMeXm'
...: %timeit encode(im, 6, 4)
589 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This improvement demonstrated how small structural changes — like switching from struct RGB to arrays and grouping pixel calculations — can unlock substantial performance gains when paired with compiler optimizations.
Final thoughts
Achieving a 128x speed boost for Blurhash was a rewarding challenge, but there’s still room for improvement.
Can we push it even further? Integer calculations might offer potential gains, though they can complicate the code significantly.
Another avenue could involve keeping portions of the factors array in registers by making several passes over the image. However, my initial attempts in this area didn’t impress the compiler — switching to intrinsics might yield better control and results.
Right now, two pull requests are open: one to the main library and one to the Python library. I haven’t heard back from the authors yet, but I hope the PRs will be merged soon.
Here’s a summary of the optimizations achieved so far:
| Optimization | x86 (GCC 12.2.0) | ARM (Clang 14.0.3) |
|---|---|---|
| No optimization | 126 ms | 55.5 ms |
| Python interface | 108 ms (1.17x) | 47.7 ms (1.16x) |
| Value cache | 3.45 ms (36x) | 2.65 ms (21x) |
| One pass | 1.44 ms (87x) | 1.01 ms (55x) |
| Helping compiler | 0.98 ms (128x) | 0.59 ms (94x) |
If you’re interested, there’s still a lot to explore. For example, similar optimizations could be applied to the remaining implementations, including the decoder. Let’s see how much further we can push Blurhash’s performance.
By the way, this optimization journey ties directly into recent Uploadcare updates: now, you can effortlessly generate image previews. Explore our Blurhash documentation to learn more, and stay tuned for even more image operations to make your websites more performant.