
Undo jpeg perf regression, add various optimizations #1143

Merged · 20 commits · Mar 13, 2020

Conversation

@antonfirsov (Member) commented Mar 10, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my changes match the existing coding patterns and practice as demonstrated in the repository. These follow strict StyleCop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Taking 1-week-old work out of the fridge; it resulted in a 9% aggregate speedup for LoadResizeSave.

  • Undo regressions introduced by suboptimal implementation of certain memory primitives. Buffer2D<T> and BufferArea<T> should avoid touching MemoryGroup and use this.cachedMemory whenever possible.
  • Speed up Block8x8.ScaledCopyTo() (renamed from Block8x8F.CopyTo()) by consuming pre-calculated references instead of BufferArea<float>
  • Borrow MagicScaler's solution for AVX2-optimized float->byte conversion. /cc @saucecontrol
    • Consequence: change terminology in code. Now the postfix Vector8 is being used for "classic" Vector<float> AVX path.
  • Miscellaneous microoptimizations around memory code

UPDATE
Getting weird, minimal (<1%) pixel differences in seemingly unrelated processors. I'll investigate that later. Marking as [WIP].

@antonfirsov requested a review from a team on March 10, 2020 00:34
@antonfirsov (Member, Author) commented Mar 10, 2020

Wow, today I learned that:

@Sergio0694 (Member)

Ouch, nice catch! Weird that it has never been detected before, though; I'd imagine it's been there for quite a while! I guess we should start adding more DebugGuard checks right before Unsafe.Add calls. At least that way we'd have both speed and safety, just at the cost of some extra verbosity 🤔

@Sergio0694 (Member)

Meanwhile, me checking my own code after agreeing that using too many Unsafe APIs is bad:


@antonfirsov changed the title from "Undo jpeg perf regression, add various optimizations" to "[WIP] Undo jpeg perf regression, add various optimizations" on Mar 10, 2020
@antonfirsov (Member, Author)

Getting weird, minimal (<1%) pixel differences in seemingly unrelated processors. I'll investigate that later. Marking as [WIP].

@JimBobSquarePants (Member)

@antonfirsov It's the new Avx2Intrinsics.BulkConvertNormalizedFloatToByteClampOverflows method.

There must be a small difference between that and the extended intrinsics versions that is enough to accumulate across the convolution methods.

/// Thrown when the backing group is discontiguous.
/// </exception>
[MethodImpl(InliningOptions.ShortMethod)]
internal Span<T> GetSingleSpan()
Member

These new methods cause the Buffer2DTests.CreateEmpty tests to fail.

Comment on lines +95 to +99
private static Vector256<int> ConvertToInt32(Vector256<float> vf, Vector256<float> scale)
{
vf = Avx.Multiply(vf, scale);
return Avx.ConvertToVector256Int32(vf);
}
@JimBobSquarePants (Member) commented Mar 10, 2020

A couple of things I noticed. You're using Avx over Avx2. I noticed that in @saucecontrol's original source too. Wouldn't Avx2 be faster?

We're not clamping like we do in the other implementations. I had a quick go. (I don't know how Vector256<float>.Zero is treated here; should it be passed as a param?) More tests pass but there are still minor differences.

private static Vector256<int> ConvertToInt32(Vector256<float> vf, Vector256<float> scale, Vector256<float> offset)
{
    vf = Avx2.Multiply(vf, scale);
    vf = Avx2.Add(vf, offset);
    vf = Avx2.Min(Avx2.Max(vf, Vector256<float>.Zero), scale);
    return Avx2.ConvertToVector256Int32(vf);
}
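For intuition, here's a scalar sketch in Python (names are mine, and the rounding offset from the snippet above is omitted) of what the proposed per-element conversion does:

```python
def convert_to_int32(vf: float, scale: float) -> int:
    """Scalar model of the clamped conversion sketched above:
    multiply by scale, clamp to [0, scale], then round like CVTPS2DQ."""
    v = vf * scale
    v = min(max(v, 0.0), scale)  # clamp so out-of-range inputs saturate
    return round(v)              # round to nearest (even), like CVTPS2DQ

convert_to_int32(1.5, 255.0)   # 255, saturated instead of wrapping
convert_to_int32(-0.1, 255.0)  # 0
```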

Member

Avx2 inherits from Avx, and that Multiply overload is not part of the AVX2 instruction set; you're just accessing it through the Avx2 class here, but you're calling the same method. In fact, if you use Re#, it suggests simplifying the name to just Avx 👍

Member (Author)

@JimBobSquarePants new x86 instruction sets are extensions to previous families of instructions. Despite the fact that these methods are static, the designers of System.Runtime.Intrinsics have chosen to model this with non-static classes and a nice inheritance chain. Multiply(Vector256<float>, Vector256<float>) is in fact defined in the base class Avx, because the backing VMULPS ymm, ymm, ymm/m256 instruction is already available in AVX. New x86 CPU families improve the performance of existing instructions instead of replacing them with new ones.

Member

I wish the docs were better. You have to look up each method on the intel docs which is just misdirection.

Member (Author)

I don't think it makes sense to duplicate the Intel content because the specification is owned and maintained by Intel. Some instructions do crazy complex stuff. (See: specification of Avx2.PackUnsignedSaturate / _mm256_packus_epi16 under "Operation"). Other instructions are trivially simple (Add, Multiply), while still best specified by the original docs.

Adding a link to related Intel docs would be nice though.

Contributor

Wouldn't Avx2 be faster

In general AVX instructions operate on floats and AVX2 operate on integer types. If seeing them mixed in code bothers you, you can simply using static System.Runtime.Intrinsics.X86.Avx2 and have access to everything without the class name. I personally find it useful to specify the exact ISA for each instruction because it makes it easier to spot when an IsSupported check is missing.

Member

I just realised VS actually suggests Avx anyway, offering to simplify if you use Avx2. Pretty neat!

@saucecontrol (Contributor) commented Mar 10, 2020

There must be a small difference between that and the extended intrinsics versions

That's going to happen a lot as you introduce HWIntrinsics code. Almost always, it's because the new code is more accurate. The Vector<T> conversion and narrowing methods go out of their way to match the behavior of scalar code.

Vector<T> conversion uses CVTTPS2DQ instead of CVTPS2DQ (truncated conversion vs rounded, in case the extra T isn't jumping out in the name). Since Vector<T> doesn't expose a proper Round operation, the behaviors can't be made to match, and of course the Vector<T> version will have to be slower because it requires an extra step.
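In scalar Python terms (an illustration of the behavior, not the library's code), the two conversion modes look like this:

```python
import math

f = 254.7
truncated = math.trunc(f)       # 254, what CVTTPS2DQ (the Vector<T> path) does
rounded = round(f)              # 255, what CVTPS2DQ does (round to nearest even)

# Vector<T> code has to emulate rounding by adding half before truncating,
# which is the extra step mentioned above:
emulated = math.trunc(f + 0.5)  # 255
```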

Likewise, the Vector<T> narrowing methods truncate rather than saturating like the HWIntrinsics do. To achieve that, they shift the values left, shift them back right, then call the narrowing instruction. This causes overflow wraparound, whereas the narrowing instructions correctly clamp by saturation.
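A scalar Python illustration of the two narrowing behaviors (function names are mine):

```python
def narrow_truncate(x: int) -> int:
    """Vector<T>-style narrowing: upper bits are discarded, so overflow wraps."""
    return x & 0xFF

def narrow_saturate(x: int) -> int:
    """Pack-instruction-style narrowing (e.g. VPACKUSWB): overflow clamps."""
    return min(max(x, 0), 255)

narrow_truncate(300)  # 44, wraparound
narrow_saturate(300)  # 255, saturation
```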

You'll find the same when you get into Fused Multiply Add, where the fused operation is not only faster but more accurate than the separate ones.
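The accuracy point can be demonstrated by emulating float32 rounding in Python (a sketch under my own naming; Fraction stands in for the single-rounding fused operation):

```python
import struct
from fractions import Fraction

def f32(x) -> float:
    """Round a value to the nearest float32, as each separate AVX op does."""
    return struct.unpack('f', struct.pack('f', float(x)))[0]

a = f32(1.0000001)
prod = f32(a * a)  # the separately rounded float32 product

# Separate multiply then subtract: a*a rounds to prod first, so the result is 0
separate = f32(f32(a * a) - prod)

# A fused multiply-add computes a*a - prod exactly and rounds only once,
# so the rounding residual of the multiply survives
fused = f32(Fraction(a) * Fraction(a) - Fraction(prod))  # tiny but nonzero
```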

@saucecontrol (Contributor)

@JimBobSquarePants to give you a better picture of the differences, here's the annotated asm of a float->byte conversion fully optimized using Vector<T>

vmovupd     ymm4,ymmword ptr [rdx]     ; load 8xfloat input
vmovupd     ymm5,ymmword ptr [rdx+20h] ; x2
vmovupd     ymm6,ymmword ptr [rdx+40h] ; x3
vmovupd     ymm7,ymmword ptr [rdx+60h] ; x4
add         rdx,80h                    ; increment input ptr
vmulps      ymm4,ymm4,ymm2             ; multiply input by scale 
vaddps      ymm4,ymm4,ymm3             ;   then add half to round
vmulps      ymm5,ymm5,ymm2             ; x2
vaddps      ymm5,ymm5,ymm3             ;   x2
vmulps      ymm6,ymm6,ymm2             ; x3
vaddps      ymm6,ymm6,ymm3             ;   x3
vmulps      ymm7,ymm7,ymm2             ; x4
vaddps      ymm7,ymm7,ymm3             ;   x4
vcvttps2dq  ymm4,ymm4                  ; convert 8xfloat->8xint with truncation
vcvttps2dq  ymm5,ymm5                  ; x2
vcvttps2dq  ymm6,ymm6                  ; x3
vcvttps2dq  ymm7,ymm7                  ; x4
vperm2i128  ymm9,ymm4,ymm5,20h         ; deinterleave 2 chunks of 128-bits (low lanes)  \
vperm2i128  ymm8,ymm4,ymm5,31h         ;   x2 (high lanes)                               \
vpslld      ymm9,ymm9,10h              ; shift left by 16                                 \
vpsrld      ymm9,ymm9,10h              ;   then shift right to clear upper bits            > Vector<int>.Narrow()
vpslld      ymm8,ymm8,10h              ; x2                                               /
vpsrld      ymm8,ymm8,10h              ;   x2                                            /
vpackusdw   ymm4,ymm9,ymm8             ; narrow/pack 2x8xint->16xushort with saturation /
vperm2i128  ymm8,ymm6,ymm7,20h         ; deinterleave 2 chunks of 128-bits (low lanes)  \
vperm2i128  ymm5,ymm6,ymm7,31h         ;   x2 (high lanes)                               \
vpslld      ymm8,ymm8,10h              ; shift left by 16                                 \
vpsrld      ymm8,ymm8,10h              ;   then shift right to clear upper bits            > Vector<int>.Narrow()
vpslld      ymm5,ymm5,10h              ; x2                                               /
vpsrld      ymm5,ymm5,10h              ;   x2                                            /
vpackusdw   ymm6,ymm8,ymm5             ; narrow/pack 2x8xint->16xushort with saturation /
vpmaxsw     ymm4,ymm0,ymm4             ; clamp to min 0 (reinterpreted ushort->short)
vpminsw     ymm4,ymm4,ymm1             ;   and max 255
vpmaxsw     ymm6,ymm0,ymm6             ; x2
vpminsw     ymm6,ymm6,ymm1             ;   x2
vmovaps     ymm5,ymm6                  ; pointless move
vperm2i128  ymm7,ymm4,ymm5,20h         ; deinterleave 2 chunks of 128-bits (low lanes)   \
vperm2i128  ymm6,ymm4,ymm5,31h         ;   x2 (high lanes)                                \
vpsllw      ymm7,ymm7,8                ; shift left by 8                                   \
vpsrlw      ymm7,ymm7,8                ;   then shift right to clear upper bits             > Vector<ushort>.Narrow()
vpsllw      ymm6,ymm6,8                ; x2                                                /
vpsrlw      ymm6,ymm6,8                ;   x2                                             /
vpackuswb   ymm4,ymm7,ymm6             ; narrow/pack 2x16xshort->32xbyte with saturation /
vmovupd     ymmword ptr [r8],ymm4      ; store output
add         r8,20h                     ; increment output pointer

And the same with HWIntrinsics

vmulps      ymm2,ymm0,ymmword ptr [rdx]     ; multiply 8xfloat input by scale (combined load/mul)
vmulps      ymm3,ymm0,ymmword ptr [rdx+20h] ; x2
vmulps      ymm4,ymm0,ymmword ptr [rdx+40h] ; x3
vmulps      ymm5,ymm0,ymmword ptr [rdx+60h] ; x4
add         rdx,80h                         ; increment input ptr
vcvtps2dq   ymm2,ymm2                       ; convert 8xfloat->8xint with rounding  
vcvtps2dq   ymm3,ymm3                       ; x2
vcvtps2dq   ymm4,ymm4                       ; x3 
vcvtps2dq   ymm5,ymm5                       ; x4
vpackssdw   ymm2,ymm2,ymm3                  ; narrow/pack 2x8xint->16xshort with saturation
vpackssdw   ymm3,ymm4,ymm5                  ; x2
vpackuswb   ymm2,ymm2,ymm3                  ; narrow/pack 2x16xshort->32xbyte with saturation
vpermd      ymm2,ymm1,ymm2                  ; deinterleave 8 chunks of 32-bits
vmovdqu     ymmword ptr [r8],ymm2           ; store output
add         r8,20h                          ; increment output ptr

Which is why it's so much faster.

The behavior of the Vector<T> narrowing is particularly strange if you're using it the way we are. It's designed to be safe for cases where the data is actually reinterpreted (like narrowing an array of char to byte by discarding the upper half). Not only is that behavior undesirable for our use case, it's not the default behavior of the SIMD instructions, so the Vector<T> implementation does a lot of extra work to make it so.

The numeric difference is almost certainly down to the rounding. In your modified HWIntrinsics example, you would likely get the same result as Vector<T> by using Avx.ConvertToVector256Int32WithTruncation after adding the half offset. Again, it's less correct, but it would probably match if that's really important to you.

@antonfirsov (Member, Author) commented Mar 10, 2020

[...] but it would probably match if that's really important to you

Fortunately, the number of users wishing for this kind of binary compatibility (with other libraries or between versions/platforms) is very low. We already have deviations between runtimes, especially when it comes to the 32-bit LegacyJIT. We can simply increase the numeric tolerance in our tests to handle this.

However, I want to understand why the errors are more significant in certain scenarios and less significant in others.

There must be a small difference between that and the extended intrinsics versions that is enough to accumulate across the convolution methods.

@JimBobSquarePants is this really the case? I thought those processors were already done with all calculations by the time we convert Vector4 to TPixel.

@JimBobSquarePants (Member)

@saucecontrol Thanks for such a detailed explanation, it's really, really helpful 👍

@antonfirsov I don't know what is happening quite yet, tbh. Like you say, everything happens in between the conversions to and from Vector4.

It's definitely a rounding difference: 255 vs 0. While the percentage difference is small, if you drop the alpha you suddenly get a massive difference in those areas.


Using ConvertToVector256Int32WithTruncation allows additional tests to pass, but we're still a little stuck with the EntropyCrop tests. (I'm beginning to hate that method and am tempted to remove it.) The output produces an image with different dimensions, so we cannot even do a comparison against the reference output.

@JimBobSquarePants (Member)

With entropy crop almost every pixel in the output image is different. This isn't an algorithmic issue since the output is simply a partial copy of the source via cropping to the derived rectangle.


@saucecontrol (Contributor)

In that first image (❤️ BeyondCompare), it looks like the image on the left (your reference, I assume?) is wrong. When you work with premultiplied alpha, the RGB channels should never be anything other than 0 if the A is 0. You're more likely to see rounding errors with premultiplied alpha, simply because the error is also multiplied.

@JimBobSquarePants (Member)

@saucecontrol That's a very valid point.... However

We may be digging into a world of some of my previous hackery in regards to convolution.

We currently convolve in two ways.

  1. Standard convolution using premultiplied alpha. This is used in all blurring and sharpening.
  2. Convolution of color components only preserving the original alpha values.

The second option is what is used in the edge detection algorithms. It could be there's a flaw in my approach (there will be), but when I used premultiplied alpha I lost masses of image data.


With an input image of:
DetectEdges_WorksWithAllFilters_Rgba32_TestPattern100x100_Kirsch

@saucecontrol (Contributor)

Interesting... I can't imagine offhand what would cause that. Of course, when I say it's 'wrong', I mean only insofar as it could be more correct. Since the visual output is the same, it doesn't matter. I just wanted to be sure you know the differences from the HWIntrinsics aren't necessarily something to worry about as long as you can tune your test parameters, because unless something has gone horribly wrong, they will tend to be slightly more accurate.

I struggle as well with automated testing because I have as many as 6 different implementations of some of my processors, optimized for different platforms. My errors are almost always <= 1 step in 8-bit output, but they can be larger in RGBA images, particularly in semi-transparent pixels. If you're premultiplying for your comparisons as well, most of that will wash out.

@antonfirsov (Member, Author) commented Mar 11, 2020

Wondering now:
if we consider the output of the new conversion code "more correct", we should probably replace the reference output. But first we also need to implement the opposite byte -> float path, without the Vector<float> side effects, I guess.

@JimBobSquarePants (Member) commented Mar 11, 2020

I think I'm going to revisit edge detection anyway. I think we should be removing transparency before operating on the image.

We could then optimize our convolution algorithms to work in bulk.

@JimBobSquarePants (Member)

@saucecontrol I remembered why those values would be 255 and also why there was data loss when convolving all four channels. Edge detection uses only 3 channels; if you introduce the alpha channel, the continuous alpha value means that no edges are detected.

In my edge detection approach I convolve the RGB channels but preserve the alpha component of the source. I haven't seen another implementation that does this (normally the result is opaque) but you would get the same result if you saved the image without an alpha component.

@saucecontrol (Contributor)

Ah yes, that makes sense. I believe in that case, you still want to premultiply, even though you are going to ignore the alpha. Imagine the black and white checkerboard in the upper left corner of your test image has all 0 alpha values. There's no edge there because all the pixels are actually transparent. Or imagine their RGBA values are 255,255,255,1 and 0,0,0,2. The true pixel difference there would be 1, not 255, so there should be no edge.
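Worked through in Python (helper name is mine), the 255,255,255,1 vs 0,0,0,2 example looks like:

```python
def premultiply(r, g, b, a):
    """Scale the color channels by alpha so transparent pixels carry no color."""
    return tuple(round(c * a / 255) for c in (r, g, b)) + (a,)

premultiply(255, 255, 255, 1)  # (1, 1, 1, 1)
premultiply(0, 0, 0, 2)        # (0, 0, 0, 2)
# After premultiplying, the per-channel difference is 1, not 255: no edge.
```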

I can't think of any imaging algorithm that isn't more correct with premultiplication. Same goes for evaluating your output in tests... if you're not premultiplying before comparing outputs, you'll show differences that aren't really there. The above example is an extreme one (255,255,255,0 vs 0,0,0,0), but any cases of alpha rounding can cause the color value to be further off numerically than would seem correct. And the more transparent the pixel, the higher that value can be while still being meaningless visually.

Now, if that particular test doesn't do premultiplication at any point in its processing, then your concern over the values flipping from 255 to 0 is perfectly valid. That, I don't have any explanation for...

Comment on lines +28 to +35
// The result dimensions of EntropyCrop may differ on .NET Core 3.1 because of unstable edge detection results.
// TODO: Re-enable this test case if we manage to improve stability.
#if SUPPORTS_RUNTIME_INTRINSICS
if (provider.SourceFileOrDescription.Contains(TestImages.Png.Ducky))
{
return;
}
#endif
@antonfirsov (Member, Author) commented Mar 12, 2020

This is a tricky one. I would consider it a bug, but it should not prevent us from merging this, since jpeg & resize speed is much more important than EntropyCrop stability. Probably worth opening a new issue to track this.

Member

Yup... I can have a look at this.

@antonfirsov changed the title from "[WIP] Undo jpeg perf regression, add various optimizations" to "Undo jpeg perf regression, add various optimizations" on Mar 12, 2020
@antonfirsov (Member, Author)

Fixed all failures by fine-tuning the comparer tolerance depending on input. Ready to review now; we can handle the edge detection and EntropyCrop concerns separately.

{
public static class Avx2Intrinsics
{
private static ReadOnlySpan<byte> PermuteMaskDeinterleave8x32 => new byte[] { 0, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0, 0, 5, 0, 0, 0, 2, 0, 0, 0, 6, 0, 0, 0, 3, 0, 0, 0, 7, 0, 0, 0 };
Member

Note for the future: we should add comments to this kind of stuff so I can understand what it actually does! 😆

@antonfirsov (Member, Author) commented Mar 13, 2020

Well, I would be happy if I could place any meaningful comment here, but the truth is that I have no idea what it does exactly.

All I know is that it's a permutation mask to unshuffle the bytes returned by PackSignedSaturate, which, to my naive eyes, are in a meaningless order for some reason I don't understand and haven't taken the time to research any further. Maybe @saucecontrol has some time to clarify the high-level concept..

@saucecontrol (Contributor) commented Mar 13, 2020

The basic idea is that nearly all AVX instructions operate independently on 2 128-bit lanes rather than on the 256-bit register as a whole. So if you have 4 Vector256<int> that contain pixels 0,1 | 2,3 | 4,5 | 6,7, when you narrow and pack them, they end up in 2 registers as 0,2,1,3 | 4,6,5,7. Then you do that again, and you get 1 register with 0,2,4,6,1,3,5,7.

Permute instructions essentially do a shuffle across lanes, so you give it the order 0,4,1,5,2,6,3,7 to undo the interleaving that happened in the previous steps. That ROS just has those 8 32-bit integers written in little endian order.
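That explanation can be modeled in a few lines of Python (a sketch of the data movement, not the actual intrinsics):

```python
def pack_lanes(a, b):
    """Model an AVX narrowing pack: it works per 128-bit lane, so each result
    lane takes the matching half-lane from each input."""
    half = len(a) // 2
    return a[:half] + b[:half] + a[half:] + b[half:]

# Four "registers" holding the 32 output bytes in order
regs = [list(range(i, i + 8)) for i in (0, 8, 16, 24)]
packed = pack_lanes(pack_lanes(regs[0], regs[1]),
                    pack_lanes(regs[2], regs[3]))   # interleaved by the packs

# View the result as 8 dwords and permute with the order 0,4,1,5,2,6,3,7
# (what PermuteMaskDeinterleave8x32 encodes as little-endian int32s)
dwords = [packed[i:i + 4] for i in range(0, 32, 4)]
mask = [0, 4, 1, 5, 2, 6, 3, 7]
unshuffled = [x for m in mask for x in dwords[m]]   # back to 0..31 in order
```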

Member (Author)

nearly all AVX instructions operate independently on 2 128-bit lanes rather than on the 256-bit register as a whole

Knowing this, the whole thing makes much more sense now.

@codecov bot commented Mar 12, 2020

Codecov Report

Merging #1143 into master will decrease coverage by 0.01%.
The diff coverage is 81.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1143      +/-   ##
==========================================
- Coverage   82.27%   82.26%   -0.02%     
==========================================
  Files         678      679       +1     
  Lines       29216    29256      +40     
  Branches     3278     3281       +3     
==========================================
+ Hits        24038    24067      +29     
- Misses       4483     4498      +15     
+ Partials      695      691       -4     
Flag Coverage Δ
#unittests 82.26% <81.48%> (-0.02%) ⬇️
Impacted Files Coverage Δ
...arp/Common/Helpers/SimdUtils.BasicIntrinsics256.cs 98.21% <0.00%> (ø)
src/ImageSharp/Common/Tuples/Vector4Pair.cs 62.50% <ø> (ø)
...arp/Formats/Jpeg/Components/Block8x8F.Generated.cs 100.00% <ø> (ø)
...Converters/JpegColorConverter.FromYCbCrSimdAvx2.cs 91.66% <0.00%> (ø)
...ents/Decoder/ColorConverters/JpegColorConverter.cs 93.18% <0.00%> (ø)
...y/Allocators/ArrayPoolMemoryAllocator.Buffer{T}.cs 76.19% <0.00%> (-3.81%) ⬇️
src/ImageSharp/Memory/Buffer2DExtensions.cs 73.07% <ø> (+5.42%) ⬆️
...ng/Processors/Transforms/Resize/ResizeKernelMap.cs 94.73% <0.00%> (-1.01%) ⬇️
...olorConverters/JpegColorConverter.FromYCbCrSimd.cs 86.79% <42.85%> (ø)
...arp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs 82.19% <50.00%> (-12.33%) ⬇️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aed6b77...810d3bb

@JimBobSquarePants (Member) left a comment

LGTM. At some point we should probably make our compiler conditionals less verbose and drop the SUPPORTS_ prefix. The feature matrix in the props file is getting huge.

@antonfirsov (Member, Author)

I like the prefix; it clarifies the meaning of the preprocessor condition without any further reading. Maybe we can change SUPPORTS_ to HAS_ to make it shorter.

@James-Jackson-South

Maybe we can change SUPPORTS_ to HAS_ to make it shorter.

Works for me. A task for another time though. Merge away!

5 participants