Bug 5618 - multi-threaded VNC rect decoding
Summary: multi-threaded VNC rect decoding
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VNC
Version: pre-1.0
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.6.0
Assignee: Pierre Ossman
URL:
Keywords: derfian_tester, hean01_tester, relnotes
Duplicates: 5876
Depends on: 5360
Blocks: performance
Reported: 2015-08-25 15:35 CEST by Pierre Ossman
Modified: 2016-05-06 07:56 CEST (History)
2 users

See Also:
Acceptance Criteria:


Attachments

Description Pierre Ossman cendio 2015-08-25 15:35:01 CEST
We should try to make use of more than one CPU when decoding incoming rects in the VNC client. The rects are, however, not independent of each other, so there are some issues to overcome (a sketch of one possible scheduling scheme follows the list):

 - The number of bytes used to encode a rect is not known at a high level. Each decoder would have to handle reading its own rects before we can start a parallel decode.

 - Rects might overlap. We need some locking/queueing to handle these in order. We also need to check whether it's better to stall the decoding of these rects or to buffer the writes on a queue.

 - Some encodings might read existing data (e.g. CopyRect). These also need to be handled in a serial manner with regard to other rects that write to these areas.

 - Some encodings might have hidden dependencies between rects. E.g. all zlib-based encodings tend to reuse the same zlib stream, meaning they have to be decoded in a serial manner. Tight makes this extra complex by having four possible streams.
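
As a rough illustration of the serialisation this needs, here is a minimal sketch (hypothetical code, not the actual viewer implementation) of a scheduler that keeps rects in wire order and only releases a rect to a worker thread once nothing ahead of it writes to an overlapping area, is read by it, or shares its decoder stream:

    #include <list>
    #include <mutex>
    #include <condition_variable>

    struct Region { int x1, y1, x2, y2; };

    static bool overlaps(const Region& a, const Region& b)
    {
        return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
    }

    struct PendingRect {
        Region writeArea;  // area this rect modifies
        Region readArea;   // area it reads (e.g. CopyRect source), may be empty
        int streamId;      // shared decoder state (e.g. zlib stream), -1 if none
    };

    class DecodeScheduler {
        std::mutex mutex;
        std::condition_variable cond;
        std::list<PendingRect*> queue;  // all unfinished rects, in wire order

        bool conflicts(const PendingRect* ahead, const PendingRect* r)
        {
            if (overlaps(ahead->writeArea, r->writeArea))
                return true;   // writes must land in wire order
            if (overlaps(ahead->writeArea, r->readArea))
                return true;   // e.g. CopyRect reading a not-yet-written area
            if (ahead->streamId >= 0 && ahead->streamId == r->streamId)
                return true;   // shared zlib stream forces serial decode
            return false;
        }

    public:
        // Called by the read thread, in the order rects arrive off the wire.
        void enqueue(PendingRect* r)
        {
            std::lock_guard<std::mutex> lock(mutex);
            queue.push_back(r);
        }

        // Called by a worker: blocks until no unfinished rect ahead of 'r'
        // conflicts with it.
        void waitForTurn(PendingRect* r)
        {
            std::unique_lock<std::mutex> lock(mutex);
            cond.wait(lock, [&]{
                for (PendingRect* p : queue) {
                    if (p == r)
                        return true;   // reached ourselves: safe to decode
                    if (conflicts(p, r))
                        return false;
                }
                return true;
            });
        }

        // Called by a worker once the rect is fully decoded.
        void finish(PendingRect* r)
        {
            std::lock_guard<std::mutex> lock(mutex);
            queue.remove(r);
            cond.notify_all();
        }
    };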

We also need to fix bug 5615 if we want to use OpenMP for this.


TurboVNC had threaded handling of rects in its Unix client:

https://github.com/TurboVNC/turbovnc/tree/27388e5962cb83abdcf07a25d79aabfaed353cd8/unix/vncviewer

Note that it doesn't seem to handle overlaps. Still, it could be useful to look at.
Comment 1 Pierre Ossman cendio 2015-11-09 14:21:43 CET
OpenMP doesn't seem useful here, unfortunately. We read a single rect at a time and then return, but OpenMP needs the entire parallel section to be contained within one block.

So we'll have to do this manually ourselves, which might not be a major loss given the complex dependencies we have between rects.
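
To illustrate the mismatch (hypothetical names, not code from the tree): OpenMP wants the entire parallel workload visible inside one structured block, while our read loop only ever hands us one rect per call:

    // What OpenMP wants: all rects known up front, inside one block.
    #pragma omp parallel for
    for (int i = 0; i < nRects; i++)
        decodeRect(rects[i]);

    // What we actually have: the main loop delivers one rect at a time
    // and expects us to return, so worker threads must outlive each call
    // and be managed by hand.
    void handleRect(const Rect& r, int encoding)
    {
        readRectData(r, encoding);  // reading off the socket stays serial
        dispatchToWorker(r);        // hand off to a long-lived worker pool
    }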
Comment 2 Pierre Ossman cendio 2015-11-11 17:07:05 CET
Urgh. XP doesn't have condition variables, which makes this a lot more difficult. Perhaps it is time to implement bug 5360 and drop XP support?
Comment 3 Pierre Ossman cendio 2015-11-16 15:17:50 CET
I have something that works now. I need to do more testing though, primarily for regressions. We don't want slow single-core machines to be penalised too much by this change.
Comment 4 Pierre Ossman cendio 2015-11-16 15:19:49 CET
I tested it on a somewhat slow armhf machine and found a fun issue: vncviewer now consumes so much CPU that X and ssh are getting starved. The entire machine becomes very unresponsive when this happens. We probably need to do something about this problem.
Comment 5 Pierre Ossman cendio 2015-11-18 09:06:30 CET
Waiting for the X server seems to solve that issue, as well as the massive tearing, which isn't really surprising after looking at the code: we are using shared memory to hand the pixel data to the X server, yet we don't wait for the X server to finish reading it before we start writing to it again.
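
For reference, the basic pattern looks something like this (a minimal sketch using plain Xlib/MIT-SHM calls, not the actual viewer code): ask XShmPutImage for a completion event and leave the shared segment alone until it arrives.

    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>

    void putImageAndWait(Display* dpy, Window win, GC gc, XImage* img,
                         int x, int y, unsigned int w, unsigned int h)
    {
        int completion = XShmGetEventBase(dpy) + ShmCompletion;

        /* send_event = True asks the server to tell us when it is done */
        XShmPutImage(dpy, win, gc, img, 0, 0, x, y, w, h, True);
        XFlush(dpy);

        /* Only after the matching ShmCompletion event is it safe to write
         * new pixels into the shared segment. A real client would dispatch
         * the other events instead of discarding them. */
        XEvent ev;
        do {
            XNextEvent(dpy, &ev);
        } while (ev.type != completion);
    }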
Comment 6 Pierre Ossman cendio 2015-11-18 16:36:44 CET
However this exposed bug 5720.
Comment 7 Pierre Ossman cendio 2015-11-19 13:40:01 CET
Implementing this bug has screwed with how the bandwidth estimation code works, so I need to have a look at that as well. Perhaps bug 2494 should be included at that point.
Comment 9 Samuel Mannehed cendio 2015-12-03 16:58:50 CET
I have had a look at the auto mode, comparing how it worked prior to these changes with how it works now. Initially I used my phone as a wifi hotspot and limited which mobile networks (2G/3G) it used.

I measured receive bandwidth using the Task Manager on Windows. JPEG quality (JQ), full color (fc), and pixel format depth (pfd) were gathered from tlclient.log.

I had two test cases:
* moving the profile chooser around on the solid background
* playing the first 20 seconds of this video: https://www.youtube.com/watch?v=tQ-WuUHNIz8

Looking at the logfile, I can gather from my tests that the threshold for changing from quality level 8 to 6 seems to be 16000 kbit/s. The thresholds for disabling full color and for changing the pixel format depth both seem to be at 250 kbit/s. This didn't change between the old and the new code.

Windows 10 client tests:


3G networks, using new code:
		| receive	| JQ	| fc	| pfd
----------------+---------------+-------+-------+------
Moving a window	|~350 kbit/s	| 8	| yes	| 24
Playing a video	|~12 Mbit/s	| 6-8	| yes	| 24

The logfile reported heavy fluctuations between quality levels 6 and 8 while playing the test video. It changed back and forth roughly once every half second.


2G networks, using new code:
		| receive	| JQ	| fc	| pfd
----------------+---------------+-------+-------+------
Moving a window	|~200 kbit/s	| 6	| yes	| 24
Playing a video	|~200 kbit/s	| 6	| yes/no| 8-24

The quality level stayed constantly at 6. The pixel format depth and full color fluctuated a few times, though.


3G networks, using old code:
		| receive	| JQ	| fc	| pfd
----------------+---------------+-------+-------+------
Moving a window	|~400 kbit/s	| 8	| yes	| 24
Playing a video	|~12 Mbit/s	| 6-8	| yes	| 24

Same as with the new code.


I have exceeded my data limit for the month, so I will have to continue my tests using a different method.
Comment 10 Pierre Ossman cendio 2015-12-08 16:49:03 CET
A performance report is available here:

https://github.com/TigerVNC/tigervnc/tree/master/tests/results/multicore

The short summary is that we're seeing between 20% and 125% improvement in the tests. Real-world scenarios will probably see less, as there are other bottlenecks besides the CPU.
Comment 11 Pierre Ossman cendio 2015-12-08 16:50:30 CET
I also had a look at whether we should leave one "spare" CPU for ssh and other processes, but I could not find any scenario where they were competing for resources, so for now we'll keep things as they are.
Comment 12 Samuel Mannehed cendio 2015-12-17 11:31:11 CET
Using the throttle machine available in /import/vmware-images/Throttle/ to limit bandwidth, we got the following results when testing the auto setting:

First we played a maximized video in a session with a resolution of 1280x1024. With the new code, the threshold for switching down from level 8 to level 6 JPEG compression was 18 Mbit/s, and the threshold for turning off full color was 128 kbit/s. The old code showed similar results, with a first threshold at 15 Mbit/s and a second threshold at 128 kbit/s.

Lastly, we tested scrolling through text documents at the same session resolution. The new code displayed the same 18 Mbit/s and 128 kbit/s thresholds as in the previous test. The old code had thresholds of 16 Mbit/s and 128 kbit/s.

This is similar enough. The new code does in fact seem a little more accurate at estimating bandwidth. Both are horrible in reality, though.
Comment 13 Karl Mikaelsson cendio 2016-01-21 10:27:00 CET
I've tested the new multi-threaded code on several different terminals and workstations, and the general experience is much better for streaming video when there are more cores to use. Single-core performance feels about the same as before.

Looks good to me.
Comment 14 Pierre Ossman cendio 2016-04-12 14:38:51 CEST
We've decided to support XP for a while longer, so we need some hack to get the viewer running there (albeit without multi-core support).
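
One possible shape for such a hack (a sketch under the assumption that we probe at runtime for the Vista-era condition variable API; names are illustrative):

    #include <windows.h>

    static bool haveConditionVariables()
    {
        HMODULE k32 = GetModuleHandleA("kernel32.dll");
        return k32 != NULL &&
               GetProcAddress(k32, "InitializeConditionVariable") != NULL;
    }

    int decodeThreadCount()
    {
        if (!haveConditionVariables())
            return 1;  // XP: no condition variables, stay single-threaded

        SYSTEM_INFO si;
        GetSystemInfo(&si);
        return (int)si.dwNumberOfProcessors;
    }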
Comment 16 Pierre Ossman cendio 2016-04-18 14:02:15 CEST
Seems to work fine. Linux and modern Windows get multiple threads, and XP gets just the one.
Comment 17 Henrik Andersson cendio 2016-04-19 08:32:37 CEST
When running nightly ThinLinc client 4.5.0post build 5097 on an XP machine I get:

  "Warning: this Windows version is not supported"
Comment 18 Henrik Andersson cendio 2016-04-19 08:59:52 CEST
(In reply to comment #17)
> When running nightly ThinLinc client 4.5.0post build 5097 on an XP machine I get:
> 
>   "Warning: this Windows version is not supported"

This is handled in bug #5360.
Comment 19 Henrik Andersson cendio 2016-04-19 09:05:09 CEST
(In reply to comment #14)
> We've decided to support XP for a while longer, so we need some hack to get the
> viewer running there (albeit without multi-core support).

Tested using ThinLinc client 4.5.0post build 5097.

Verified that the client works as expected on the Windows XP platform (with 2 cores), and verified using Process Explorer that only one thread was used.

Verified that the client on a Windows 10 platform (with 4 cores) works as expected and that vncviewer used more than one thread for decoding.
Comment 20 Pierre Ossman cendio 2016-05-04 14:30:17 CEST
*** Bug 5876 has been marked as a duplicate of this bug. ***
Comment 21 Pierre Ossman cendio 2016-05-04 14:30:58 CEST
Getting crashes on an HP t610 in the update synchronisation code. See bug 5876.
Comment 23 Pierre Ossman cendio 2016-05-04 17:57:25 CEST
The X server on HP's machine has some serious bug. It sends out extra completion events, which screw up our counter. The window id in those extra events is also completely bogus (it seems to be the actual id OR'ed with 0x40000001). For extra fun, you get a segfault in the X server if you run xwininfo on that id.

Fortunately it was fairly easy to work around. I suspect getting a fix from HP anytime soon is not feasible.
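
The workaround boils down to being defensive about which completion events we trust, roughly along these lines (hypothetical sketch, not the committed fix):

    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>

    static unsigned outstanding = 0;  /* bumped on every XShmPutImage */

    bool handleShmCompletion(const XShmCompletionEvent* ev, Drawable expected)
    {
        if (ev->drawable != expected)
            return false;  /* spurious event from the buggy server */
        if (outstanding == 0)
            return false;  /* never let the counter wrap below zero */
        outstanding--;
        return true;
    }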
Comment 24 Pierre Ossman cendio 2016-05-05 11:49:08 CEST
Nightly build works fine.
Comment 25 Henrik Andersson cendio 2016-05-06 07:56:19 CEST
Verified using ThinLinc client 4.6.0 build 5113 on a few client workstations, including an HP terminal. Seems to work as expected.
