There is a bug in our version of Xorg that makes fbBlt use an unoptimised code path. fbBlt is a fairly common operation, so it is important for many use cases that it performs well. This is fixed in newer versions of Xorg: https://cgit.freedesktop.org/xorg/xserver/commit/?id=a2880699e8 A quick test here with glxgears, Firefox and some YouTube playback reduced fbBlt's share of overall CPU usage from 17% to 12%, as measured by perf.
Did a test with Xvnc from 4.9.0 and from build 5901. Manually started Xvnc, with no client connected and only glxgears running. perf then reports 34% of the time spent in fbBlt for 4.9.0 vs 17% for build 5901. I can also see the new version calling an SSE2-optimised sub-function. Lastly, glxgears reports a somewhat higher frame rate with the new build (2000 vs 1700). It seems we are indeed getting the faster version now.
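For reference, the figures quoted in the two comments can be put on a common footing as relative changes. The following is only a small arithmetic sketch: the helper name `relative_change` and the labels are made up for illustration, and it assumes 34% and 1700 belong to 4.9.0 while 17% and 2000 belong to build 5901. Note that perf percentages are shares of all samples, so a change in share is not the same thing as a speed-up of fbBlt itself.

```python
def relative_change(before, after):
    """Return the change from before to after as a fraction of before."""
    return (after - before) / before


# Figures taken from the comments above; the labels are illustrative only.
measurements = {
    "fbBlt share, glxgears + firefox + youtube (%)": (17, 12),
    "fbBlt share, Xvnc + glxgears only (%)": (34, 17),
    "glxgears frame rate": (1700, 2000),
}

for name, (before, after) in measurements.items():
    change = relative_change(before, after)
    print(f"{name}: {before} -> {after} ({change:+.1%})")
```

This works out to roughly -29.4% and -50.0% for the fbBlt share in the two tests, and +17.6% for the glxgears frame rate.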