This is a tracker bug to document ideas and plans for how we can improve the performance-related aspects of the VNC/RFB portion of ThinLinc. This includes:
- Bandwidth usage
- Latency sensitivity
- CPU usage
- Image quality, perceived or otherwise.
- Evaluate ZRLE (bug 1215)
- Evaluate deferred updates (bug 2931)
- Evaluate Comparing Update Tracker (bug 4020)
- Handle high latency networks (bug 2566)
- SIMD accelerated JPEG (bug 2926)
- Optimise JPEG entropy coding (bug 3009)
- Double buffering (bug 2930)
- Clean up and move magic out of the Tight encoder (bug 4915 and bug 5026)
- Record sessions:
Needed so we can properly evaluate parts of the system with real world data. Such recordings need to losslessly record every change, with timing information.
- Evaluate existing components more thoroughly/systematically
Are we getting a good trade off for CPU/bandwidth for these:
* Comparing Update Tracker
* Deferred updates
* Solid area detection
* Sub-rect analysis
- Evaluate CODECs
Write some framework that allows us to get reliable numbers for what the trade off is between CPU, bandwidth and quality for our CODECs. We need to consider the effects the compression and quality settings have, as well as what type of data might fit each CODEC best.
A suitable metric for quality might be SSIM.
- Evaluate sub-rect classes
Connected to the above, we need to investigate if we are classifying sub-rects in a way that makes sense with regard to real world data, and allows each CODEC to play to its strengths.
We also should look at the optimal size for sub rects. Perhaps we should have a fixed grid of e.g. 64x64 pixels? That could also simplify update tracking and save CPU time there, at the cost of sending a bit more data.
- Server side compression selection
The server is in a much better position than the client to select which CODEC to use and how much compression to apply.
- Automatic lossless refresh (bug 2928)
- Partial updates (bug 4735)
Allows us to multiplex other packets when we have a large update and limited bandwidth. Primarily it allows us to send out more congestion control packets. Congestion control normally stops working as the update size approaches and exceeds the BDP (bandwidth-delay product). A protocol extension will be needed to fix this.
- Socket buffer (bug 4734)
Currently we'll hang the server if we fill the outgoing socket. We should have a memory buffer for when this happens to avoid screwing up timing sensitive things (like the congestion control).
- High level compression improvements:
* Track multiple copy operations (limited to one right now)
* More aggressive search for solid rects.
* Reduce colours before encoding
* Image caching (bug 1814)
* Multi threaded encoding (OpenMP?)
- New CODECs:
* Video CODECs (h.263/264/265, VP7/VP8/VP9, Theora, Dirac, ...)
* Delta compared to previous frame
* JPEG encoding separate from Tight, to encourage use
- Hardware acceleration (GPU, more SIMD) (bug 4982)
- Tweak deflateInit
You can give it hints about what kind of data to expect, and we might not be using that fully today.
- Dithering (bug 3893)
- Lossless hint
The server might do lots of lossy stuff to the data, not just JPEG, so we need a general way for the client to say that it wants pixel perfect data at all times.
- Indicate stalled/slow connection (bug 4956)
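On the "Tweak deflateInit" item above: in zlib the data hint is the strategy argument, which is passed via deflateInit2() (or changed later with deflateParams()) rather than plain deflateInit(). A minimal sketch of how the strategies could be compared, using Python's zlib binding. The sample data and the helper names (make_sample, deflate_with_strategy, compare_strategies) are made up for illustration and are not taken from our encoder:

```python
import zlib

# Strategy hints exposed by zlib (and by Python's zlib module).
STRATEGIES = {
    "default": zlib.Z_DEFAULT_STRATEGY,
    "filtered": zlib.Z_FILTERED,
    "huffman_only": zlib.Z_HUFFMAN_ONLY,
    "rle": zlib.Z_RLE,
}


def make_sample(width=256, height=64):
    """Hypothetical stand-in for screen content: long runs of a few values."""
    row = bytes([0] * (width // 2) + [7] * (width // 4) + [255] * (width // 4))
    return row * height


def deflate_with_strategy(data, level, strategy):
    """Compress data as one zlib stream with an explicit strategy hint."""
    comp = zlib.compressobj(level=level, strategy=strategy)
    return comp.compress(data) + comp.flush()


def compare_strategies(data, level=2):
    """Return the compressed size in bytes for each strategy."""
    return {name: len(deflate_with_strategy(data, level, strategy))
            for name, strategy in STRATEGIES.items()}


for name, size in compare_strategies(make_sample()).items():
    print(f"{name:13s} {size} bytes")
```

The numbers only mean something when fed with real recorded session data, so this ties in with the "Record sessions" item.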
DRC has done some tests with the x264 codec:
* Check the overhead of the JPEG JFIF header. There is a lot of data there which might make JPEG a bad choice for small blocks. Perhaps there are pre-defined quantisation and huffman tables that can be used in such cases?
- Finish SIMD on ARM (it apparently still lacks significant portions)
- The palette code doesn't mask away irrelevant bits. This means that a single colour might get multiple entries and reduce the efficiency of the compression.
- Look at using ORC from GStreamer. It can probably optimise and generalise many heavy loops. Might also be useful in libjpeg-turbo.
- Look at Google's Snappy as an alternative to zlib based encodings.
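To illustrate the palette point above: a rough sketch of what masking away irrelevant bits would look like. The mask is derived from the max/shift fields of an RFB pixel format. The function names are hypothetical and this is not the actual palette code:

```python
def significant_mask(red_max, green_max, blue_max,
                     red_shift, green_shift, blue_shift):
    """Build a mask of the bits that actually carry colour information."""
    return ((red_max << red_shift) |
            (green_max << green_shift) |
            (blue_max << blue_shift))


def build_palette(pixels, mask=None):
    """Map each distinct pixel value to a palette index.

    Without a mask, pixels that differ only in unused bits get
    separate entries even though they are the same colour.
    """
    palette = {}
    for pixel in pixels:
        key = pixel if mask is None else pixel & mask
        palette.setdefault(key, len(palette))
    return palette


# 32 bpp, depth 24: the top 8 bits are padding and may contain anything.
mask = significant_mask(255, 255, 255, 16, 8, 0)
pixels = [0x00FF0000, 0xAAFF0000, 0x0000FF00]

print(len(build_palette(pixels)))        # entries without masking
print(len(build_palette(pixels, mask)))  # entries with masking
```

In this example the two red pixels only differ in the padding byte, so the unmasked palette wastes an entry on them.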
(In reply to comment #6)
> - Finish SIMD on ARM (it apparently still lacks significant portions)
The important bits have been fixed now. There are still some missing pieces, but nothing that is normally used. Coverage for SIMD optimisation in libjpeg-turbo can be seen here:
More Google compression algorithms to look at:
- Zopfli: https://github.com/google/zopfli
- Brotli: http://google-opensource.blogspot.se/2015/09/introducing-brotli-new-compression.html
- FLIF (http://flif.info/)
Both Intel and Cloudflare have implemented optimised versions of zlib that are a drop in replacement:
- The HTML client uses compression level 9 even though we've seen little benefit above level 3. The native client and the server use 2 by default.
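A quick way to re-check the level trade off would be something like the sketch below, which measures size and time for levels 2, 3 and 9. It uses Python's zlib rather than pako, and the sample data is a made up placeholder, so it only shows the method and says nothing about real sessions:

```python
import time
import zlib


def make_sample(width=256, height=128):
    """Hypothetical test data; real measurements need recorded updates."""
    return bytes((x * y) & 0xFF for y in range(height) for x in range(width))


def measure_levels(data, levels=(2, 3, 9)):
    """Return {level: (compressed size in bytes, seconds spent)}."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results[level] = (len(compressed), elapsed)
    return results


for level, (size, elapsed) in measure_levels(make_sample()).items():
    print(f"level {level}: {size} bytes in {elapsed * 1000:.2f} ms")
```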
We might want to see if pixman has some routines that can be used to speed up parts of the pipeline.
More benchmarking of modern compression routines:
Apple joins the fray with their own compression algorithm:
(In reply to comment #2)
> Are we getting a good trade off for CPU/bandwidth for these:
> * Comparing Update Tracker
I did a quick test running YouTube under various desktop environments and measured how much compression the CUT achieved:
Gnome 3: 1:1.6
KDE, Xfce (without composite), MATE (with composite): 1:5
So it seems to do a lot for this use case in most environments. However, Gnome 3 maintained a much lower frame rate, which could also be a large factor.
> Guetzli is a JPEG encoder that aims for excellent compression density at high visual quality.
> Guetzli-generated images are typically 20-30% smaller than images of equivalent quality
> generated by libjpeg. Guetzli generates only sequential (nonprogressive) JPEGs due to faster
> decompression speeds they offer.
And another compression format:
(bug 4333 degraded from own bug to just an idea comment:)
Most SIMD instructions require a specific memory alignment to work optimally.
As discovered on bug 4328, we're not always doing this properly. It could be
worth investigating if we can improve alignment and thereby improve performance.
Microsoft developed a new lossless CODEC as part of RemoteFX which could be interesting to have in VNC as well. Some overview here:
Some alternatives to pako (for zlib handling in Web Access) have popped up that claim to be faster:
The client bandwidth estimation got a tweak upstream, which should hopefully make it a bit more accurate:
The comparing update tracker got tweaked upstream here:
It now spends a bit more CPU in order to reduce the changed area.
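For reference, the general idea behind a comparing update tracker can be sketched as below: the claimed changed rect is compared block by block against a copy of the previous framebuffer, and only blocks that really differ are kept. This is a simplified illustration with one byte per pixel and a hypothetical block size of 16, not the upstream implementation:

```python
BLOCK = 16  # hypothetical block size, chosen only for this sketch


def changed_blocks(old, new, width, rect, block=BLOCK):
    """Return the blocks inside rect that actually differ.

    old and new are row-major framebuffers with one byte per pixel,
    width is the framebuffer width and rect is (x, y, w, h).
    """
    x0, y0, w, h = rect
    result = []
    for by in range(y0, y0 + h, block):
        bh = min(block, y0 + h - by)
        for bx in range(x0, x0 + w, block):
            bw = min(block, x0 + w - bx)
            for row in range(by, by + bh):
                start = row * width + bx
                if old[start:start + bw] != new[start:start + bw]:
                    result.append((bx, by, bw, bh))
                    break  # one differing row is enough for this block
    return result


def reduction_ratio(rect, blocks):
    """Claimed area divided by the area that really changed."""
    claimed = rect[2] * rect[3]
    actual = sum(w * h for _, _, w, h in blocks)
    return claimed / actual if actual else float("inf")


width = height = 64
old = bytearray(width * height)
new = bytearray(old)
new[5 * width + 20] = 1  # a single pixel changes at (20, 5)

rect = (0, 0, width, height)
blocks = changed_blocks(old, new, width, rect)
print(blocks, reduction_ratio(rect, blocks))
```

A smaller block size reduces the changed area further, at the cost of more comparisons, which is the same CPU versus bandwidth trade off discussed above.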
(In reply to Pierre Ossman from comment #4)
> * Check the overhead of the JPEG JFIF header. There is a lot of data there
> which might make JPEG a bad choice for small blocks. Perhaps there are
> pre-defined quantisation and huffman tables that can be used in such cases?
Apparently RealVNC's JPEG encoding has some hacks in this area where some sections of the header are reused between rects. At least according to this PR to noVNC:
The downside to the approach used there though is that it makes it more difficult to decode several JPEG rects concurrently.
Most of the new image codec groups seem to have joined forces in favour of a new standard called JPEG XL. The main contributors are Google with their various formats, and the FLIF people. Performance is claimed to be equivalent to libjpeg-turbo, but with massive gains in quality per bit.
Browser support is experimental in both Chrome and Firefox, but many of the big players (Google, Facebook, Cloudflare) are backing this format.
It is in the later stages of ISO standardisation and the format seems to be fixed at this point.