Undo jpeg perf regression, add various optimizations #1143
Conversation
based on Resize profile
Wow, today I learned that:
Ouch, nice catch! Weird that it has never been detected before though, I'd imagine it's been there for quite a while! I guess we should start adding more …
Getting weird, minimal (<1%) pixel differences in seemingly unrelated processors.
@antonfirsov It's the new float->byte conversion. There must be a small difference between that and the extended intrinsics versions that is enough to accumulate across the convolution methods.
/// Thrown when the backing group is discontiguous.
/// </exception>
[MethodImpl(InliningOptions.ShortMethod)]
internal Span<T> GetSingleSpan()
These new methods cause the Buffer2DTests.CreateEmpty tests to fail.
private static Vector256<int> ConvertToInt32(Vector256<float> vf, Vector256<float> scale)
{
    vf = Avx.Multiply(vf, scale);
    return Avx.ConvertToVector256Int32(vf);
}
A couple of things I noticed. You're using Avx over Avx2; I've noticed that in @saucecontrol's original source also. Wouldn't Avx2 be faster?

We're not clamping like we do in the other implementations. I had a quick go (I don't know how Vector256<float>.Zero is treated here, should it be passed as a param?). More tests pass but there's still minor differences.
private static Vector256<int> ConvertToInt32(Vector256<float> vf, Vector256<float> scale, Vector256<float> offset)
{
    vf = Avx2.Multiply(vf, scale);
    vf = Avx2.Add(vf, offset);
    vf = Avx2.Min(Avx2.Max(vf, Vector256<float>.Zero), scale);
    return Avx2.ConvertToVector256Int32(vf);
}
Avx2 inherits from Avx, and that Multiply overload is not part of the AVX2 instruction set; you're just accessing it through the Avx2 class here, but you're calling the same method. In fact, if you use R# it suggests simplifying the name and just using Avx 👍
@JimBobSquarePants new x86 instruction sets are extensions to previous families of instructions. Despite the fact that these methods are static, the designers of System.Runtime.Intrinsics have chosen to use non-static classes with a nice inheritance chain to model this. Multiply(Vector256<float>, Vector256<float>) is in fact defined in the base class Avx, because the backing VMULPS ymm, ymm, ymm/m256 instruction is already available in AVX. New x86 CPU families improve the performance of existing instructions instead of replacing them with new ones.
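To make the hierarchy concrete, here is a minimal sketch (mine, not from the PR; the class name is hypothetical) showing that Avx2.Multiply and Avx.Multiply bind to the same inherited static method:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class InheritanceDemo
{
    public static Vector256<float> Multiply(Vector256<float> a, Vector256<float> b)
    {
        // Avx2 derives from Avx, so both calls below resolve to the same
        // static method and emit the same VMULPS instruction.
        Vector256<float> viaAvx = Avx.Multiply(a, b);
        Vector256<float> viaAvx2 = Avx2.Multiply(a, b); // analyzers suggest simplifying this to Avx

        return viaAvx; // bitwise identical to viaAvx2
    }
}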
I wish the docs were better. You have to look up each method in the Intel docs, which is just misdirection.
I don't think it makes sense to duplicate the Intel content, because the specification is owned and maintained by Intel. Some instructions do crazy complex stuff (see the specification of Avx2.PackUnsignedSaturate / _mm256_packus_epi16 under "Operation"). Other instructions are trivially simple (Add, Multiply), while still best specified by the original docs.

Adding a link to the related Intel docs would be nice though.
Wouldn't Avx2 be faster

In general, AVX instructions operate on floats and AVX2 instructions operate on integer types. If seeing them mixed in code bothers you, you can simply add using static System.Runtime.Intrinsics.X86.Avx2 and have access to everything without the class name. I personally find it useful to specify the exact ISA for each instruction, because it makes it easier to spot when an IsSupported check is missing.
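As an illustration of that last point, here is a sketch of my own (names are hypothetical) of the usual dispatch pattern, where writing the Avx prefix explicitly makes the IsSupported guard easy to audit:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe class MultiplyKernel
{
    public static void MultiplyBy(float[] values, float factor)
    {
        int i = 0;
        if (Avx.IsSupported)
        {
            Vector256<float> vfactor = Vector256.Create(factor);
            fixed (float* p = values)
            {
                for (; i <= values.Length - 8; i += 8)
                {
                    // The Avx prefix documents exactly which guard covers this call.
                    Avx.Store(p + i, Avx.Multiply(Avx.LoadVector256(p + i), vfactor));
                }
            }
        }

        // Scalar tail, and the fallback when AVX is unavailable.
        for (; i < values.Length; i++)
        {
            values[i] *= factor;
        }
    }
}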
I just realised VS actually suggests Avx anyway, offering to simplify if you use Avx2. Pretty neat!
That's going to happen a lot as you introduce HWIntrinsics code. Almost always, it's because the new code is more accurate. You'll find the same when you get into Fused Multiply Add, where the fused operation is not only faster but more accurate than the separate ones.
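A small sketch of my own (assuming the Fma ISA is available) of that FMA point: the fused form rounds once, the separate form rounds twice:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class FmaDemo
{
    public static Vector256<float> MultiplyAdd(Vector256<float> a, Vector256<float> b, Vector256<float> c)
    {
        if (Fma.IsSupported)
        {
            // One VFMADD instruction: the intermediate product is kept at full
            // precision and only the final result is rounded.
            return Fma.MultiplyAdd(a, b, c);
        }

        // Two instructions, two roundings: a * b is rounded to float before the add.
        return Avx.Add(Avx.Multiply(a, b), c);
    }
}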
@JimBobSquarePants to give you a better picture of the differences, here's the annotated asm of a float->byte conversion fully optimized using Vector<T>:

vmovupd ymm4,ymmword ptr [rdx] ; load 8xfloat input
vmovupd ymm5,ymmword ptr [rdx+20h] ; x2
vmovupd ymm6,ymmword ptr [rdx+40h] ; x3
vmovupd ymm7,ymmword ptr [rdx+60h] ; x4
add rdx,80h ; increment input ptr
vmulps ymm4,ymm4,ymm2 ; multiply input by scale
vaddps ymm4,ymm4,ymm3 ; then add half to round
vmulps ymm5,ymm5,ymm2 ; x2
vaddps ymm5,ymm5,ymm3 ; x2
vmulps ymm6,ymm6,ymm2 ; x3
vaddps ymm6,ymm6,ymm3 ; x3
vmulps ymm7,ymm7,ymm2 ; x4
vaddps ymm7,ymm7,ymm3 ; x4
vcvttps2dq ymm4,ymm4 ; convert 8xfloat->8xint with truncation
vcvttps2dq ymm5,ymm5 ; x2
vcvttps2dq ymm6,ymm6 ; x3
vcvttps2dq ymm7,ymm7 ; x4
vperm2i128 ymm9,ymm4,ymm5,20h ; deinterleave 2 chunks of 128-bits (low lanes) \
vperm2i128 ymm8,ymm4,ymm5,31h ; x2 (high lanes) \
vpslld ymm9,ymm9,10h ; shift left by 16 \
vpsrld ymm9,ymm9,10h ; then shift right to clear upper bits > Vector<int>.Narrow()
vpslld ymm8,ymm8,10h ; x2 /
vpsrld ymm8,ymm8,10h ; x2 /
vpackusdw ymm4,ymm9,ymm8 ; narrow/pack 2x8xint->16xushort with saturation /
vperm2i128 ymm8,ymm6,ymm7,20h ; deinterleave 2 chunks of 128-bits (low lanes) \
vperm2i128 ymm5,ymm6,ymm7,31h ; x2 (high lanes) \
vpslld ymm8,ymm8,10h ; shift left by 16 \
vpsrld ymm8,ymm8,10h ; then shift right to clear upper bits > Vector<int>.Narrow()
vpslld ymm5,ymm5,10h ; x2 /
vpsrld ymm5,ymm5,10h ; x2 /
vpackusdw ymm6,ymm8,ymm5 ; narrow/pack 2x8xint->16xushort with saturation /
vpmaxsw ymm4,ymm0,ymm4 ; clamp to min 0 (reinterpreted ushort->short)
vpminsw ymm4,ymm4,ymm1 ; and max 255
vpmaxsw ymm6,ymm0,ymm6 ; x2
vpminsw ymm6,ymm6,ymm1 ; x2
vmovaps ymm5,ymm6 ; pointless move
vperm2i128 ymm7,ymm4,ymm5,20h ; deinterleave 2 chunks of 128-bits (low lanes) \
vperm2i128 ymm6,ymm4,ymm5,31h ; x2 (high lanes) \
vpsllw ymm7,ymm7,8 ; shift left by 8 \
vpsrlw ymm7,ymm7,8 ; then shift right to clear upper bits > Vector<ushort>.Narrow()
vpsllw ymm6,ymm6,8 ; x2 /
vpsrlw ymm6,ymm6,8 ; x2 /
vpackuswb ymm4,ymm7,ymm6 ; narrow/pack 2x16xshort->32xbyte with saturation /
vmovupd ymmword ptr [r8],ymm4 ; store output
add r8,20h ; increment output pointer

And the same with HWIntrinsics:

vmulps ymm2,ymm0,ymmword ptr [rdx] ; multiply 8xfloat input by scale (combined load/mul)
vmulps ymm3,ymm0,ymmword ptr [rdx+20h] ; x2
vmulps ymm4,ymm0,ymmword ptr [rdx+40h] ; x3
vmulps ymm5,ymm0,ymmword ptr [rdx+60h] ; x4
add rdx,80h ; increment input ptr
vcvtps2dq ymm2,ymm2 ; convert 8xfloat->8xint with rounding
vcvtps2dq ymm3,ymm3 ; x2
vcvtps2dq ymm4,ymm4 ; x3
vcvtps2dq ymm5,ymm5 ; x4
vpackssdw ymm2,ymm2,ymm3 ; narrow/pack 2x8xint->16xshort with saturation
vpackssdw ymm3,ymm4,ymm5 ; x2
vpackuswb ymm2,ymm2,ymm3 ; narrow/pack 2x16xshort->32xbyte with saturation
vpermd ymm2,ymm1,ymm2 ; deinterleave 8 chunks of 32-bits
vmovdqu ymmword ptr [r8],ymm2 ; store output
add r8,20h ; increment output ptr

Which is why it's so much faster. The numeric difference is almost certainly down to the rounding: the Vector<T> version adds half and then converts with truncation (vcvttps2dq), while the HWIntrinsics version converts with rounding (vcvtps2dq). In your modified HWIntrinsics example, you would likely get the same result as the Vector<T> version.
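For readers following along, a sketch of mine of the two conversion strategies the listings above compile from; note that vcvtps2dq uses round-to-nearest-even, so the results can differ from add-half-then-truncate at exact halfway values:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class RoundingDemo
{
    // Classic Vector<T>-style path: Vector<T> only exposes a truncating
    // conversion, so 0.5 is added first (vaddps + vcvttps2dq).
    public static Vector256<int> AddHalfThenTruncate(Vector256<float> scaled, Vector256<float> half)
        => Avx.ConvertToVector256Int32WithTruncation(Avx.Add(scaled, half));

    // HWIntrinsics path: vcvtps2dq rounds directly, no extra add needed.
    public static Vector256<int> RoundDirectly(Vector256<float> scaled)
        => Avx.ConvertToVector256Int32(scaled);
}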
Fortunately, the number of users wishing for this kind of binary compatibility (with other libraries or between versions/platforms) is very low. We already have deviations between runtimes, especially when it comes to the 32-bit LegacyJIT. We can simply increase the numeric tolerance in our tests to handle this. However, I want to understand why the errors are more significant in certain scenarios and less significant in others.
@JimBobSquarePants is this really the case? I thought those processors are already done with all calculations by the time we are converting.
@saucecontrol Thanks for such a detailed explanation, it's really, really helpful 👍

@antonfirsov I don't know what is happening quite yet tbh. Like you say, everything happens in between the conversions. It's definitely a rounding difference: 255 vs 0. While the percentage difference is small, if you drop the alpha you suddenly get a massive difference in those areas.
In that first image (❤️ BeyondCompare), it looks like the image on the left (your reference, I assume?) is wrong. When you work with premultiplied alpha, the RGB channels should never be anything other than 0 if the A is 0. You're more likely to see rounding errors with premultiplied alpha, simply because the error is also multiplied.
@saucecontrol That's a very valid point... However, we may be digging into a world of some of my previous hackery in regards to convolution. We currently convolve in two ways.
The second option is what is used in the edge detection algorithms. It could be that there's a flaw in my approach (there will be), but when I used premultiplied alpha I lost masses of image data.
Interesting... I can't imagine offhand what would cause that. Of course, when I say it's 'wrong', I mean only insofar as it could be more correct. Since the visual output is the same, it doesn't matter. I just wanted to be sure you know the differences from the HWIntrinsics aren't necessarily something to worry about as long as you can tune your test parameters, because unless something has gone horribly wrong, they will tend to be slightly more accurate.

I struggle as well with automated testing because I have as many as 6 different implementations of some of my processors, optimized for different platforms. My errors are almost always <= 1 step in 8-bit output, but they can be larger in RGBA images, particularly in semi-transparent pixels. If you're premultiplying for your comparisons as well, most of that will wash out.
Wondering now: …
I think I'm going to revisit edge detection anyway. I think we should be removing transparency before operating on the image. We could then optimize our convolution algorithms to work in bulk.
@saucecontrol I remembered why those values would be 255, and also why there was data loss when convolving all four channels. Edge detection uses only 3 channels; if you introduce the alpha channel, then the continuous alpha value means that no edges are detected. In my edge detection approach I convolve the RGB channels but preserve the alpha component of the source. I haven't seen another implementation that does this (normally the result is opaque), but you would get the same result if you saved the image without an alpha component.
Ah yes, that makes sense. I believe in that case you still want to premultiply, even though you are going to ignore the alpha. Imagine the black and white checkerboard in the upper left corner of your test image has all 0 alpha values. There's no edge there because all the pixels are actually transparent. Or imagine their RGBA values are 255,255,255,1 and 0,0,0,2. The true pixel difference there would be 1, not 255, so there should be no edge. I can't think of any imaging algorithm that isn't more correct with premultiplication.

Same goes for evaluating your output in tests... if you're not premultiplying before comparing outputs, you'll show differences that aren't really there. The above example is an extreme one (255,255,255,0 vs 0,0,0,0), but any case of alpha rounding can cause the color value to be further off numerically than would seem correct. And the more transparent the pixel, the higher that value can be while still being meaningless visually.

Now, if that particular test doesn't do premultiplication at any point in its processing, then your concern over the values flipping from 255 to 0 is perfectly valid. That, I don't have any explanation for...
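To make the arithmetic in that example concrete, a tiny sketch of my own (using the conventional rounded integer premultiply c * a / 255):

public static class PremultiplyDemo
{
    // c' = round(c * a / 255)
    public static byte Premultiply(byte c, byte a) => (byte)((c * a + 127) / 255);

    public static void Main()
    {
        // RGBA 255,255,255,1 vs 0,0,0,2: a straight red-channel comparison
        // sees |255 - 0| = 255, but the premultiplied values are 1 and 0,
        // so the difference collapses to 1 -- visually there is no edge.
        System.Console.WriteLine(Premultiply(255, 1)); // 1
        System.Console.WriteLine(Premultiply(0, 2));   // 0
    }
}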
// The result dimensions of EntropyCrop may differ on .NET Core 3.1 because of unstable edge detection results.
// TODO: Re-enable this test case if we manage to improve stability.
#if SUPPORTS_RUNTIME_INTRINSICS
if (provider.SourceFileOrDescription.Contains(TestImages.Png.Ducky))
{
    return;
}
#endif
This is a tricky one. I would consider it a bug, but it should not prevent us from merging this, since jpeg & resize speed is much more important than EntropyCrop stability. Probably worth opening a new issue to track this.
Yup... I can have a look at this.
Fixed all failures by fine-tuning the comparer tolerance depending on input. Ready to review now; we can handle the edge detection and EntropyCrop concerns separately.
{
    public static class Avx2Intrinsics
    {
        private static ReadOnlySpan<byte> PermuteMaskDeinterleave8x32 => new byte[] { 0, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0, 0, 5, 0, 0, 0, 2, 0, 0, 0, 6, 0, 0, 0, 3, 0, 0, 0, 7, 0, 0, 0 };
Note for the future: we should add comments to this kind of stuff so I can understand what it actually does! 😆
Well, I would be happy if I could place any meaningful comment here, but the truth is that I have no idea what it does exactly. All I know is that it's a permutation mask to unshuffle the bytes returned by PackSignedSaturate, which are in a meaningless order to my naive eyes for some reason I do not understand, and I haven't taken the time to research it any further. Maybe @saucecontrol has some more time to clarify the high-level concept…
The basic idea is that nearly all AVX instructions operate independently on 2 128-bit lanes rather than on the 256-bit register as a whole. So if you have 4 Vector256<int> that contain pixels 0,1 | 2,3 | 4,5 | 6,7, when you narrow and pack them, they end up in 2 registers as 0,2,1,3 | 4,6,5,7. Then you do that again, and you get 1 register with 0,2,4,6,1,3,5,7.

Permute instructions essentially do a shuffle across lanes, so you give it the order 0,4,1,5,2,6,3,7 to undo the interleaving that happened in the previous steps. That ROS just has those 8 32-bit integers written in little-endian order.
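In code terms, a sketch of mine of that final fix-up step: vpermd via Avx2.PermuteVar8x32 with the control vector 0,4,1,5,2,6,3,7 restores the natural order after the two lane-wise pack steps:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class DeinterleaveDemo
{
    public static Vector256<int> Deinterleave(Vector256<int> packed)
    {
        // packed holds the values in order 0,2,4,6,1,3,5,7 after two
        // lane-wise pack steps; element i of the result takes packed[control[i]].
        Vector256<int> control = Vector256.Create(0, 4, 1, 5, 2, 6, 3, 7);
        return Avx2.PermuteVar8x32(packed, control);
    }
}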
nearly all AVX instructions operate independently on 2 128-bit lanes rather than on the 256-bit register as a whole
Knowing this, the whole thing makes much more sense now.
Codecov Report
@@ Coverage Diff @@
## master #1143 +/- ##
==========================================
- Coverage 82.27% 82.26% -0.02%
==========================================
Files 678 679 +1
Lines 29216 29256 +40
Branches 3278 3281 +3
==========================================
+ Hits 24038 24067 +29
- Misses 4483 4498 +15
+ Partials 695 691 -4
Continue to review full report at Codecov.
LGTM. At some point we should probably make our compiler conditionals less verbose and drop the SUPPORTS_ prefix. The feature matrix in the props file is getting huge.
I like the prefix, it clarifies the meaning of the preprocessor condition without any further reading. Maybe we can change …
Works for me. A task for another time though. Merge away!
Description

Taking out 1-week-old work from the fridge that resulted in a 9% aggregate speedup for LoadResizeSave.

- Buffer2D<T> and BufferArea<T> should avoid touching MemoryGroup and use this.cachedMemory whenever possible.
- Optimized Block8x8.ScaledCopyTo() (renamed from Block8x8F.CopyTo()) by consuming pre-calculated references instead of BufferArea<float>.
- AVX2 float->byte conversion. /cc @saucecontrol
- Vector8 is being used for the "classic" Vector<float> AVX path.

UPDATE

Getting weird, minimal (<1%) pixel differences in seemingly unrelated processors. I'll investigate that later. Marking as [WIP].