We got reports from Oetiker that he was getting horrible performance on the HP t610, which is a high-end machine with lots of power. We're also seeing the same issue on Wyse Z90D, which has exactly the same processor. No clue as to what the problem is at this point. Probably something wrong with the SIMD code.
The problem is in the SSE2 code. Forcing it off gives expected performance.
Problem is in the YUV to RGB conversion routine, which is good as it is one of the simpler ones.
This is the code that is slow for some odd reason:

    pcmpeqb     xmmH,xmmH                   ; xmmH=(all 1's)
    maskmovdqu  xmmA,xmmH                   ; movntdqu XMMWORD [edi], xmmA
    add         edi, byte SIZEOF_XMMWORD    ; outptr
    maskmovdqu  xmmD,xmmH                   ; movntdqu XMMWORD [edi], xmmD
    add         edi, byte SIZEOF_XMMWORD    ; outptr
    maskmovdqu  xmmF,xmmH                   ; movntdqu XMMWORD [edi], xmmF
    add         edi, byte SIZEOF_XMMWORD    ; outptr
The culprit is "maskmovdqu". This is a silly little instruction that serves little purpose, and AMD apparently decided not to spend silicon on making it fast. It is in the JPEG code because the code is trying to emulate "movntdq" without the alignment requirement that instruction normally has. The proper instruction for an unaligned store is "movdqu", but it has slightly different cache properties: it is an ordinary store, not a cache-bypassing one. I'm not sure the cache avoidance of the current code is beneficial, so we need to look further into that. I'm also concerned that this code is in the fallback section for unaligned buffers. That path will never be fast, so we are calling something incorrectly somewhere.
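For reference, a minimal sketch of the three store variants discussed above, written with C SSE2 intrinsics rather than NASM so it can be compiled stand-alone. This is not the libjpeg-turbo code; the function names (store_maskmov, store_unaligned, store_stream, stores_agree) are made up for illustration. It only shows that the three stores write the same bytes and says nothing about their relative speed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* What the slow path does: maskmovdqu with an all-ones mask.
 * Works on unaligned destinations and carries a non-temporal hint. */
void store_maskmov(uint8_t *dst, __m128i v)
{
    __m128i mask = _mm_cmpeq_epi8(v, v);        /* pcmpeqb: all bits set */
    _mm_maskmoveu_si128(v, mask, (char *)dst);  /* maskmovdqu */
}

/* Ordinary unaligned store (movdqu); goes through the cache. */
void store_unaligned(uint8_t *dst, __m128i v)
{
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Non-temporal store (movntdq); dst must be 16-byte aligned. */
void store_stream(uint8_t *dst, __m128i v)
{
    assert(((uintptr_t)dst & 15) == 0);
    _mm_stream_si128((__m128i *)dst, v);
}

/* Returns 1 if all three variants write the same 16 bytes. */
int stores_agree(void)
{
    uint8_t src[16];
    __m128i a, b, c;  /* used as 16-byte aligned destination buffers */
    __m128i v;
    int i;

    for (i = 0; i < 16; i++)
        src[i] = (uint8_t)(i * 7 + 1);
    v = _mm_loadu_si128((const __m128i *)src);

    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    memset(&c, 0, sizeof c);

    store_maskmov((uint8_t *)&a, v);
    store_unaligned((uint8_t *)&b, v);
    store_stream((uint8_t *)&c, v);
    _mm_sfence();  /* order the non-temporal stores before reading back */

    return memcmp(&a, src, 16) == 0 &&
           memcmp(&b, src, 16) == 0 &&
           memcmp(&c, src, 16) == 0;
}
```

Since the results are identical, swapping maskmovdqu for movdqu in the unaligned path should only change timing and cache behaviour, not output.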
maskmovdqu has been eliminated in upstream libjpeg-turbo. Need to upgrade our build system. The performance boost is almost 10x on the Bobcat architecture, but we're also seeing improvements of up to 10% on other CPUs.
Avoiding the cache or not in the rest of the code didn't have any measurable effect on performance for simple tests. It might show something on higher level tests, but we don't have time for that right now. Aligning buffers better was moved to a separate bug.
DRC found that the 32-bit code is producing bad output. Need to investigate.
Several more fixes were done upstream. Brought into our build system in r25415.
A few tests comparing the original 3.4.0 client release with build 3671 reveal a huge difference in performance. The tests involved video and web page scrolling.