Bug 4328 - very slow performance with AMD G-T56N APU
Summary: very slow performance with AMD G-T56N APU
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VNC
Version: pre-1.0
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.0.0
Assignee: Pierre Ossman
Keywords: hean01_tester
Depends on:
Reported: 2012-06-07 13:53 CEST by Pierre Ossman
Modified: 2012-11-28 12:33 CET
0 users

See Also:
Acceptance Criteria:


Description Pierre Ossman cendio 2012-06-07 13:53:05 CEST
We got reports from Oetiker that he was getting horrible performance on the HP t610, which is a high-end machine with lots of power. We're also seeing the same issue on Wyse Z90D, which has exactly the same processor.

No clue as to what the problem is at this point. Probably something wrong with the SIMD code.
Comment 1 Pierre Ossman cendio 2012-06-08 12:53:36 CEST
The problem is in the SSE2 code. Forcing it off gives expected performance.
Comment 2 Pierre Ossman cendio 2012-06-08 13:06:47 CEST
Problem is in the YUV to RGB conversion routine, which is good as it is one of the simpler ones.
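For reference, a scalar sketch of what a YUV (YCbCr) to RGB conversion computes per pixel. It uses the standard full-range JFIF equations; the helper names `clamp8` and `ycc_to_rgb` are made up for illustration and this is not the libjpeg-turbo routine, which works on whole rows with fixed-point SIMD arithmetic.

```c
/* Clamp an intermediate value to the 0..255 range of an 8-bit sample. */
static unsigned char clamp8(int v)
{
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper: convert one full-range (JFIF) YCbCr pixel to RGB.
 * Cb and Cr are stored with an offset of 128. */
static void ycc_to_rgb(int y, int cb, int cr, unsigned char rgb[3])
{
    double r = y + 1.402    * (cr - 128);
    double g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128);
    double b = y + 1.772    * (cb - 128);

    /* Round to nearest; negative results end up clamped to 0 anyway. */
    rgb[0] = clamp8((int)(r + 0.5));
    rgb[1] = clamp8((int)(g + 0.5));
    rgb[2] = clamp8((int)(b + 0.5));
}
```

With neutral chroma (Cb = Cr = 128) the output is a gray with R = G = B = Y, which makes a convenient sanity check.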
Comment 3 Pierre Ossman cendio 2012-06-08 13:47:42 CEST
This is the code that is slow for some odd reason:

	pcmpeqb    xmmH,xmmH			; xmmH=(all 1's)
	maskmovdqu xmmA,xmmH			; movntdqu XMMWORD [edi], xmmA
	add	edi, byte SIZEOF_XMMWORD	; outptr
	maskmovdqu xmmD,xmmH			; movntdqu XMMWORD [edi], xmmD
	add	edi, byte SIZEOF_XMMWORD	; outptr
	maskmovdqu xmmF,xmmH			; movntdqu XMMWORD [edi], xmmF
	add	edi, byte SIZEOF_XMMWORD	; outptr
Comment 4 Pierre Ossman cendio 2012-06-08 17:00:52 CEST
The culprit is "maskmovdqu". This is a silly little instruction that serves little purpose, and AMD therefore decided not to waste silicon on it.

It is in the JPEG code because the code is trying to emulate the instruction "movntdq" without the alignment requirement that instruction normally has. The proper instruction for an unaligned store is "movdqu", but it has slightly different cache properties: it writes through the cache as usual, whereas "movntdq" bypasses it.

I'm not sure the cache avoidance of the current code is beneficial, so we need to look further into that.

I'm also concerned by the fact that this code is in the fallback section for unaligned buffers. That code will never be fast, so we are calling something incorrectly somewhere.
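To make the alternatives concrete, here is a rough sketch using C SSE2 intrinsics rather than the NASM source. It is not the upstream fix; the function names are hypothetical, and it assumes an x86 compiler with SSE2 enabled. All three stores leave the same 16 bytes in memory and differ only in alignment requirement and cache behaviour.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Nonzero if p is 16-byte aligned, which movntdq requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* What the slow path does: maskmovdqu with an all-ones mask, i.e. a
 * non-temporal store of all 16 bytes with no alignment requirement. */
static void store_maskmov(unsigned char *dst, __m128i v)
{
    __m128i all_ones = _mm_cmpeq_epi8(v, v);        /* pcmpeqb: every byte 0xFF */
    _mm_maskmoveu_si128(v, all_ones, (char *)dst);  /* maskmovdqu */
    _mm_sfence();                                   /* order the non-temporal store */
}

/* Plain unaligned store (movdqu); goes through the cache as usual. */
static void store_unaligned(unsigned char *dst, __m128i v)
{
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Non-temporal store (movntdq) when the destination happens to be
 * aligned, otherwise fall back to movdqu. */
static void store_stream_or_unaligned(unsigned char *dst, __m128i v)
{
    if (is_aligned16(dst)) {
        _mm_stream_si128((__m128i *)dst, v);
        _mm_sfence();
    } else {
        _mm_storeu_si128((__m128i *)dst, v);
    }
}
```

Whether the cache bypass is worth keeping is a separate question; swapping `store_maskmov` for `store_unaligned` changes only how the data reaches memory, not what is written.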
Comment 5 Pierre Ossman cendio 2012-06-13 11:18:36 CEST
maskmovdqu has been eliminated in upstream libjpeg-turbo. Need to upgrade our build system.

The performance boost is almost 10x on the Bobcat architecture, but we're also seeing improvements of up to 10% on other CPUs.
Comment 6 Pierre Ossman cendio 2012-06-13 11:20:04 CEST
Avoiding the cache or not in the rest of the code didn't have any measurable effect on performance for simple tests. It might show something on higher level tests, but we don't have time for that right now.

Aligning buffers better was moved to a separate bug.
Comment 7 Pierre Ossman cendio 2012-06-14 10:51:27 CEST
DRC found that the 32-bit code is producing bad output. Need to investigate.
Comment 8 Pierre Ossman cendio 2012-07-02 12:38:21 CEST
Several more fixes were made upstream. Brought into our build system in r25415.
Comment 9 Henrik Andersson cendio 2012-10-04 10:26:14 CEST
A few tests comparing the original 3.4.0 client release with build 3671
reveal a huge difference in performance. The tests involved video and
web page scrolling.
