4328 – very slow performance with AMD G-T56N APU

Summary: very slow performance with AMD G-T56N APU

Status:	CLOSED FIXED

Alias:	None

Product:	ThinLinc
Classification:	Unclassified
Component:	VNC
Version:	pre-1.0
Hardware:	PC Unknown

Importance:	P2 Normal
Target Milestone:	4.0.0
Assignee:	Pierre Ossman

URL:
Keywords:	hean01_tester

Depends on:
Blocks:

Reported:	2012-06-07 13:53 CEST by Pierre Ossman
Modified:	2012-11-28 12:33 CET
CC List:	0 users

See Also:
Acceptance Criteria:

Description Pierre Ossman cendio

2012-06-07 13:53:05 CEST

We got reports from Oetiker that he was getting horrible performance on the HP t610, which is a high-end machine with lots of power. We're also seeing the same issue on Wyse Z90D, which has exactly the same processor.

No clue as to what the problem is at this point. Probably something wrong with the SIMD code.

Comment 1 Pierre Ossman cendio

2012-06-08 12:53:36 CEST

The problem is in the SSE2 code. Forcing it off gives expected performance.

Comment 2 Pierre Ossman cendio

2012-06-08 13:06:47 CEST

Problem is in the YUV to RGB conversion routine, which is good as it is one of the simpler ones.

Comment 3 Pierre Ossman cendio

2012-06-08 13:47:42 CEST

This is the code that is slow for some odd reason:

	pcmpeqb    xmmH,xmmH			; xmmH=(all 1's)
	maskmovdqu xmmA,xmmH			; movntdqu XMMWORD [edi], xmmA
	add	edi, byte SIZEOF_XMMWORD	; outptr
	maskmovdqu xmmD,xmmH			; movntdqu XMMWORD [edi], xmmD
	add	edi, byte SIZEOF_XMMWORD	; outptr
	maskmovdqu xmmF,xmmH			; movntdqu XMMWORD [edi], xmmF
	add	edi, byte SIZEOF_XMMWORD	; outptr

Comment 4 Pierre Ossman cendio

2012-06-08 17:00:52 CEST

The culprit is "maskmovdqu". This is a silly little instruction that serves little purpose, and AMD therefore decided not to waste silicon on it.

It is in the JPEG code because the code is trying to emulate the instruction "movntdq" without the alignment requirement that instruction normally has. The proper instruction for an unaligned store is "movdqu", but it has slightly different cache properties.
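
For readers who want to compare the three stores side by side, here is a minimal C sketch using the standard SSE2 intrinsics. It is not the libjpeg-turbo code; the helper names (store_maskmov, store_unaligned, store_stream, stores_agree) are made up for illustration. It only shows that the three instructions write the same 16 bytes and differ in alignment requirement and cache behaviour.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* maskmovdqu with an all-ones mask: the pattern from the slow code path.
 * No alignment requirement, non-temporal (cache-avoiding) hint. */
static void store_maskmov(unsigned char *dst, __m128i v)
{
    __m128i all_ones = _mm_set1_epi8(-1);  /* same effect as pcmpeqb xmmH,xmmH */
    _mm_maskmoveu_si128(v, all_ones, (char *)dst);
}

/* movdqu: no alignment requirement, ordinary cached store. */
static void store_unaligned(unsigned char *dst, __m128i v)
{
    _mm_storeu_si128((__m128i *)dst, v);
}

/* movntdq: non-temporal store, dst must be 16-byte aligned. */
static void store_stream(unsigned char *dst, __m128i v)
{
    assert(((uintptr_t)dst & 15) == 0);
    _mm_stream_si128((__m128i *)dst, v);
}

/* Illustrative check: returns 1 if all three stores write identical bytes. */
static int stores_agree(void)
{
    __m128i v = _mm_set_epi32(0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100);
    unsigned char a[17] = {0}, b[17] = {0};
    __m128i c_storage = _mm_setzero_si128();       /* __m128i is 16-byte aligned */
    unsigned char *c = (unsigned char *)&c_storage;

    store_maskmov(a + 1, v);     /* offset by one byte on purpose */
    store_unaligned(b + 1, v);
    store_stream(c, v);
    _mm_sfence();                /* make the non-temporal stores globally visible */

    return memcmp(a + 1, b + 1, 16) == 0 &&
           memcmp(b + 1, c, 16) == 0 &&
           a[0] == 0 && b[0] == 0;  /* byte before the target is untouched */
}
```

Calling stores_agree() from a test program should return 1 on any SSE2-capable CPU. Timing the three helpers separately is what would expose the maskmovdqu penalty on Bobcat.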

I'm not sure the cache avoidance of the current code is beneficial, so we need to look further into that.

I'm also concerned by the fact that this code is in the fallback section for unaligned buffers. That code will never be fast, so we are calling something incorrectly somewhere.
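
As a rough illustration of what calling the code "correctly" could look like, the sketch below allocates an image buffer so that every row starts on a 16-byte boundary and the aligned code path can be taken. This is an assumption about one possible fix, not what our code or libjpeg-turbo actually does; all names (SIMD_ALIGN, is_simd_aligned, aligned_stride, alloc_aligned_image) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SIMD_ALIGN 16  /* size of one XMM word in bytes */

/* Nonzero if p is usable with aligned SSE2 loads and stores. */
static int is_simd_aligned(const void *p)
{
    return ((uintptr_t)p % SIMD_ALIGN) == 0;
}

/* Round a row length up so that each following row also starts aligned. */
static size_t aligned_stride(size_t width, size_t bytes_per_pixel)
{
    size_t raw = width * bytes_per_pixel;
    return (raw + SIMD_ALIGN - 1) / SIMD_ALIGN * SIMD_ALIGN;
}

/* Allocate an image whose rows are all 16-byte aligned.
 * The padded stride is returned through stride_out; release with free(). */
static unsigned char *alloc_aligned_image(size_t width, size_t height,
                                          size_t bytes_per_pixel,
                                          size_t *stride_out)
{
    assert(stride_out != NULL);
    size_t stride = aligned_stride(width, bytes_per_pixel);
    *stride_out = stride;
    /* C11 aligned_alloc wants a size that is a multiple of the alignment;
     * stride * height satisfies that because stride does. */
    return aligned_alloc(SIMD_ALIGN, stride * height);
}
```

With a 101 pixel wide RGB image the raw row length is 303 bytes, so the stride is padded to 304 and every row pointer passes is_simd_aligned().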

Comment 5 Pierre Ossman cendio

2012-06-13 11:18:36 CEST

maskmovdqu has been eliminated in upstream libjpeg-turbo. Need to upgrade our build system.

The performance boost is almost 10x on the Bobcat architecture, but we're also seeing improvements of up to 10% on other CPUs.

Comment 6 Pierre Ossman cendio

2012-06-13 11:20:04 CEST

Avoiding the cache or not in the rest of the code didn't have any measurable effect on performance for simple tests. It might show something on higher level tests, but we don't have time for that right now.
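
The simple tests themselves are not recorded here. A micro-benchmark of that kind could look roughly like the sketch below, which times ordinary aligned stores against non-temporal ones over the same buffer. The function names (fill_cached, fill_stream, time_fill) and the fill pattern are assumptions for illustration only, and a loop like this says nothing about the higher level behaviour mentioned above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef void (*fill_fn)(unsigned char *dst, size_t nbytes, __m128i v);

/* Ordinary aligned stores (movdqa): written data ends up in the cache. */
static void fill_cached(unsigned char *dst, size_t nbytes, __m128i v)
{
    for (size_t i = 0; i + 16 <= nbytes; i += 16)
        _mm_store_si128((__m128i *)(dst + i), v);
}

/* Non-temporal stores (movntdq): written data bypasses the cache. */
static void fill_stream(unsigned char *dst, size_t nbytes, __m128i v)
{
    for (size_t i = 0; i + 16 <= nbytes; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), v);
    _mm_sfence();  /* order the streaming stores before returning */
}

/* CPU seconds spent on `rounds` passes of fn over a 16-byte aligned buffer. */
static double time_fill(fill_fn fn, unsigned char *dst, size_t nbytes, int rounds)
{
    assert(((uintptr_t)dst & 15) == 0);
    __m128i v = _mm_set1_epi8(0x5A);  /* arbitrary fill pattern */
    clock_t start = clock();
    for (int r = 0; r < rounds; r++)
        fn(dst, nbytes, v);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To use it, allocate two 16-byte aligned buffers that are larger than the last-level cache, call time_fill() once with each function, and compare the two results. The buffer size and round count need tuning per machine before the numbers mean anything.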

Aligning buffers better was moved to a separate bug.

Comment 7 Pierre Ossman cendio

2012-06-14 10:51:27 CEST

DRC found that the 32-bit code is producing bad output. Need to investigate.

Comment 8 Pierre Ossman cendio

2012-07-02 12:38:21 CEST

Several more fixes were made upstream. Brought into our build system in r25415.

Comment 9 Henrik Andersson cendio

2012-10-04 10:26:14 CEST

A few tests comparing the original 3.4.0 client release with build 3671
reveal a huge difference in performance. The tests involved video playback
and web page scrolling.
