|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original interlacing handling looked like this:
```c
static inline uint32_t f1(int height, int y)
{
if ((y << 3) < height) {
return (y << 3); /* Pass 1 */
}
y -= ((height + 7) >> 3);
if ((y << 3) < (height - 4)) {
return (y << 3) + 4; /* Pass 2 */
}
y -= ((height + 3) >> 3);
if ((y << 2) < (height - 2)) {
return (y << 2) + 2; /* Pass 3 */
}
y -= ((height + 1) >> 2);
return (y << 1) + 1; /* Pass 4 */
}
```
Annotated with comments to show the passes. Thought it should be possible
to do the job without all the calculations and subtractions from `y`, and
I also didn't like that it was not super obvious how it aligned with the
spec or how it dealt with pass 4 last, when it is the most common case,
happening for 50% of image rows.
See the following excerpt from from the GIF89a Specification.
> The rows of an Interlaced images are arranged in the following order:
>
> Group 1 : Every 8th. row, starting with row 0. (Pass 1)
> Group 2 : Every 8th. row, starting with row 4. (Pass 2)
> Group 3 : Every 4th. row, starting with row 2. (Pass 3)
> Group 4 : Every 2nd. row, starting with row 1. (Pass 4)
>
> The Following example illustrates how the rows of an interlaced image are
> ordered.
>
> Row Number Interlace Pass
>
> 0 ----------------------------------------- 1
> 1 ----------------------------------------- 4
> 2 ----------------------------------------- 3
> 3 ----------------------------------------- 4
> 4 ----------------------------------------- 2
> 5 ----------------------------------------- 4
> 6 ----------------------------------------- 3
> 7 ----------------------------------------- 4
> 8 ----------------------------------------- 1
> 9 ----------------------------------------- 4
> 10 ----------------------------------------- 3
> 11 ----------------------------------------- 4
> 12 ----------------------------------------- 2
> 13 ----------------------------------------- 4
> 14 ----------------------------------------- 3
> 15 ----------------------------------------- 4
> 16 ----------------------------------------- 1
> 17 ----------------------------------------- 4
> 18 ----------------------------------------- 3
> 19 ----------------------------------------- 4
-- https://www.w3.org/Graphics/GIF/spec-gif89a.txt
To test performance I ran it for all possible GIF heights:
```c
for (unsigned height = 1; height < UINT16_MAX; height++) {
for (uint32_t y = 0; y < height; y++) {
use_val(f1(height, y));
}
}
```
For the original implementation (`f1`), I got the following performance:
| Compiler | Time |
| -------- | ------ |
| Clang | 3.830s |
| GCC | 4.737s |
All tests used GCC 11.2 and Clang 13.0, with `-O2` optimisation.
Next I rewrote it to more closely resemble my reading of the spec.
The compiler can convert the multiplies to shifts and I found this
more readable than `f1`, and more closely matching the spec.
The `height / N * N` bits all use powers of two for N so they boil
down the a simple bitwise AND with an appropriate mask to round down
to the nearest multiple of N.
```c
static inline uint32_t f2(uint32_t height, uint32_t y)
{
height--;
if (y * 8 <= height) {
return y * 8; /* Pass 1 */
}
if (y * 4 <= height) {
return y * 8 - height / 8 * 8 - 4; /* Pass 2 */
}
if (y * 2 <= height) {
return y * 4 - height / 4 * 4 - 2; /* Pass 3 */
}
return y * 2 - height / 2 * 2 - 1; /* Pass 4 */
}
```
For `f2`, I got the following performance:
| Compiler | Time |
| -------- | ------ |
| Clang | 2.758s |
| GCC | 3.909s |
So it's faster than `f1`, and Clang is still doing better than GCC.
After that I reversed the order, so that pass 4 would be handled first.
```c
static inline uint32_t f3(uint32_t height, uint32_t y)
{
height--;
if (y * 2 > height) {
return y * 2 - height / 2 * 2 - 1;
}
if (y * 4 > height) {
return y * 4 - height / 4 * 4 - 2;
}
if (y * 8 > height) {
return y * 8 - height / 8 * 8 - 4;
}
return y * 8;
}
```
For `f3`, I got the following performance:
| Compiler | Time |
| -------- | ------ |
| Clang | 3.329s |
| GCC | 3.630s |
Interestingly this was consistently faster than `f2` with GCC, but
slower than `f2` when built with Clang. (It's still faster than the
original `f1` with both compilers.)
I thought it might be because the branches in `f2` are more
predictable than in `f3`. For example the first branch in
`f2` is only true one eighth of the time, while all the branches
in `f3` are 50:50.
However, running under `perf` showed this was not the case.
Actually the faster `f2` for Clang had more branch misses than
the Clang `f3`.
| Compiler | Implementation | Branches | Branch misses |
| -------- | -------------- | -------------- | ------------- |
| Clang | `f2` | 13,690,378,148 | 656,517 |
| Clang | `f3` | 10,738,688,711 | 275,147 |
| GCC | `f2` | 10,738,659,995 | 336,992 |
| GCC | `f3` | 10,201,842,339 | 1,314,445 |
After this I tried changing the implementation completely. All of
the previous implementations were based on the idea of mapping from
an uninterlaced row number to an interlaced one.
```c
static inline bool f4(uint32_t height, uint32_t *y, uint8_t *pass)
{
*y += *pass == 0 ? 8 : 16 >> *pass;
if (*y < height) return true;
switch (*pass) {
case 0: *pass += 1; *y = 4; if (*y < height) return true;
case 1: *pass += 1; *y = 2; if (*y < height) return true;
case 2: *pass += 1; *y = 1; return true;
}
return false;
}
```
With `f4` we simply step to the next line in interlaced order.
The `16 >> *pass` generates the correct step size for all but the
first pass, so the first pass is special cased.
The switch statement is for handling advancing from one pass to
the next, so it is reached a maximum of four times per frame.
The rest of the time, we return on the `if (*y < height) return true;`
above. The switch falls through to handle the case where the frame
height is less than the start row for the given pass.
For `f4`, I got the following performance:
| Compiler | Time |
| -------- | ------ |
| Clang | 2.315s |
| GCC | 3.479s |
Again, this was tested for all possible GIF heights, but the `for`
loop for rows changes to a do-while loop:
```c
for (unsigned height = 1; height < UINT16_MAX; height++) {
uint8_t pass = 0;
uint32_t y = 0;
do {
use_val(y);
} while (f4(height, &y, &pass));
}
```
This is the fastest yet for both compilers, with Clang still performing
best.
One thing I didn't like about `f4` is that it calculates the y step
size every call, although the step only actually changes three times,
when we change to the next pass.
So for `f5` the switch was updated to set the step size.
```c
static inline bool f5(uint32_t height, uint32_t *y, uint8_t *step)
{
*y += *step & 0xf;
if (*y < height) return true;
switch (*step) {
case 24: *y = 4; *step = 8; if (*y < height) return true;
case 8: *y = 2; *step = 4; if (*y < height) return true;
case 4: *y = 1; *step = 2; return true;
}
return false;
}
```
To avoid the caller having to maintain extra state, the pass number
is no longer used, and we use the step size to keep track of which
pass we are on. However, the step size for the first two passes is
the same, so that led to the strange use of 24 for the initial step
size, and the masking of step to ensure only the bottom four bits are
used.
For `f5`, I got the following performance:
| Compiler | Time |
| -------- | ------ |
| Clang | 2.364s |
| GCC | 2.891s |
This is the fastest yet for both compilers, with Clang still winning
easily.
Note that when the functions are not allowed to inline, `f2` and `f3`
are the fastest.
|