We've noticed on some terminals that SSH's CPU usage is the bottleneck. We should look at using SIMD to improve the performance of our SSH clients. We could also look at improving the server end and pushing such changes upstream. That is a much bigger, long-term project though.
We should profile a client before deciding whether to implement this.
Maybe also have a look at http://www.psc.edu/index.php/hpn-ssh.
Modern AMD and Intel CPUs have an instruction set called AES-NI that accelerates AES operations:

http://en.wikipedia.org/wiki/AES_instruction_set

OpenSSL 1.0.1 and newer has support for this, so it is in our build. The performance gain seems to be about a factor of 2 on modern CPUs. This is on my workstation with an i7-3770:

~ [ossman@ossman]$ openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 131863971 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 34947471 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 8913094 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2192882 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 275734 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Wed Jan 8 07:20:55 UTC 2014
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     703274.51k   745546.05k   760584.02k   748503.72k   752937.64k

~ [ossman@ossman]$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 63098121 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 17559733 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 4531972 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1152076 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 142745 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Wed Jan 8 07:20:55 UTC 2014
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     336523.31k   374607.64k   386728.28k   393241.94k   389789.01k

The magical OPENSSL_ia32cap variable manipulates what OpenSSL thinks the CPU is capable of; here it forces AES-NI to not be used. I got the flags from an example I found with a quick Google search. Their meaning is described here:

http://www.openssl.org/docs/crypto/OPENSSL_ia32cap.html
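To see which capability bits the mask above switches off, the set bits of the value can simply be listed. This is only a minimal sketch: the helper name cleared_bits is made up for the example, and the bit names in KNOWN_BITS are an assumption based on a reading of the OPENSSL_ia32cap documentation linked above, so they should be checked against that page.

```python
# Assumed bit meanings, taken from the OPENSSL_ia32cap documentation;
# verify them against the linked page before relying on them.
KNOWN_BITS = {33: "PCLMULQDQ", 57: "AES-NI"}


def cleared_bits(setting):
    """Return the bit numbers that a '~0x...' OPENSSL_ia32cap value clears."""
    if not setting.startswith("~"):
        raise ValueError("expected a value of the form ~0x...")
    mask = int(setting[1:], 16)
    return [bit for bit in range(64) if (mask >> bit) & 1]


bits = cleared_bits("~0x200000200000000")
for bit in bits:
    print(bit, KNOWN_BITS.get(bit, "unknown"))
```

The value used in the tests has exactly two bits set, 33 and 57, so under the assumption above it disables PCLMULQDQ along with AES-NI.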
OpenSSH also benefits from this, although not quite as much:

~ [ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | time ssh -c aes128-cbc localhost "cat >/dev/null"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.40011 s, 238 MB/s
3.09user 0.68system 0:04.40elapsed 85%CPU (0avgtext+0avgdata 4384maxresident)k
0inputs+0outputs (0major+2086minor)pagefaults 0swaps

~ [ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | OPENSSL_ia32cap="~0x200000200000000" time ssh -c aes128-cbc localhost "cat >/dev/null"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 5.83286 s, 180 MB/s
4.70user 0.66system 0:05.83elapsed 92%CPU (0avgtext+0avgdata 6152maxresident)k
0inputs+0outputs (0major+2590minor)pagefaults 0swaps

In this test sshd is still using acceleration, though, so the gain might be the same factor of 2 if I disabled it there as well.
Works well with our ssh as well:

~ [ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | time /opt/thinlinc/lib/tlclient/ssh -c aes128-cbc localhost "cat >/dev/null"
NEXT AUTHMETHOD: none
AUTH FAILURE
NEXT AUTHMETHOD: publickey
AUTH SUCCESS
CONNECTED
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.02853 s, 260 MB/s
COMMAND_EXITSTATUS: 0
3.05user 0.66system 0:04.02elapsed 92%CPU (0avgtext+0avgdata 2224maxresident)k
0inputs+0outputs (0major+646minor)pagefaults 0swaps

~ [ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | OPENSSL_ia32cap="~0x200000200000000" time /opt/thinlinc/lib/tlclient/ssh -c aes128-cbc localhost "cat >/dev/null"
NEXT AUTHMETHOD: none
AUTH FAILURE
NEXT AUTHMETHOD: publickey
AUTH SUCCESS
CONNECTED
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 5.68193 s, 185 MB/s
COMMAND_EXITSTATUS: 0
4.80user 0.57system 0:05.68elapsed 94%CPU (0avgtext+0avgdata 2228maxresident)k
0inputs+0outputs (0major+648minor)pagefaults 0swaps
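For a side-by-side view, the speedups can be computed from the throughput figures in the runs above. This is just a small sketch over numbers copied from those transcripts; the dictionary labels and the speedup helper are made up for the example.

```python
# Throughput with AES-NI enabled vs. disabled on the client,
# copied from the transcripts above.
runs = {
    "openssl speed, 8192-byte blocks (kB/s)": (752937.64, 389789.01),
    "OpenSSH ssh (MB/s)": (238.0, 180.0),
    "tlclient ssh (MB/s)": (260.0, 185.0),
}


def speedup(with_aesni, without_aesni):
    """Throughput ratio of the accelerated run over the unaccelerated one."""
    return with_aesni / without_aesni


ratios = {name: speedup(*pair) for name, pair in runs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

This gives roughly 1.9x for the raw cipher but only about 1.3x and 1.4x for the two ssh clients. Both ssh figures were measured with sshd still accelerated, and the transfer involves more than just AES.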
It seems AES acceleration does not work correctly on at least our Windows 7 machine in the lab: SSH consumes about 10x the CPU compared to my Linux workstation. A quick openssl run also indicates this:

[astrand@scilla ~]$ openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 94221776 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 24693835 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 6241088 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1564631 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 195628 aes-256-cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Thu Jul 23 19:06:35 UTC 2015
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     502516.14k   526801.81k   532572.84k   534060.71k   534194.86k

H:\tmp>openssl speed -evp aes-256-cbc
WARNING: can't open config file: /usr/i686-pc-mingw32/sys-root/mingw/ssl/openssl.cnf
Doing aes-256-cbc for 3s on 16 size blocks: 26686790 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 64 size blocks: 8316807 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 2197129 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 554808 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 70383 aes-256-cbc's in 3.00s
OpenSSL 1.0.1j 15 Oct 2014
built on: Thu Dec 11 12:21:54 UTC 2014
options:bn(64,32) rc4(8x,mmx) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: i686-pc-mingw32-gcc -D_WINDLL -DOPENSSL_USE_APPLINK -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_MT -DDSO_WIN32 -DOPENSSL_NO_CAPIENG -DL_ENDIAN -DWIN32_LEAN_AND_MEAN -fomit-frame-pointer -O3 -march=i486 -Wall -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     143303.10k   176787.64k   187787.60k   188693.95k   192499.28k

As for compiler flags, the Windows build lacks -DBSAES_ASM.
It is also slow with the 64-bit build on Windows:

H:\tmp64>openssl speed -evp aes-256-cbc
WARNING: can't open config file: /usr/x86_64-w64-mingw32/sys-root/mingw/ssl/openssl.cnf
Doing aes-256-cbc for 3s on 16 size blocks: 27698891 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 8191483 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 2155360 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 548596 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 69058 aes-256-cbc's in 3.01s
OpenSSL 1.0.1j 15 Oct 2014
built on: Thu Dec 11 12:28:41 UTC 2014
options:bn(64,64) rc4(16x,int) des(idx,cisc,2,long) aes(partial) idea(int) blowfish(idx)
compiler: x86_64-w64-mingw32-gcc -D_WINDLL -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_MT -DDSO_WIN32 -DL_ENDIAN -O3 -Wall -DWIN32_LEAN_AND_MEAN -DUNICODE -D_UNICODE -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     147196.56k   175030.57k   184217.62k   187552.99k   187896.74k

BSAES is even present in this build.
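To compare the three aes-256-cbc runs above per block size, the result rows can be parsed and divided. This is a rough sketch only: parse_speed_line is a made-up helper, it assumes the row layout shown in the transcripts (cipher name followed by values with a trailing "k"), and the Linux and Windows runs may well be on different hardware, so the ratios are indicative at best.

```python
BLOCK_SIZES = [16, 64, 256, 1024, 8192]


def parse_speed_line(line):
    """Parse an `openssl speed` result row into (cipher, [kB/s per block size])."""
    fields = line.split()
    cipher, values = fields[0], fields[1:]
    return cipher, [float(value.rstrip("k")) for value in values]


# Result rows copied from the three runs above.
cipher, linux = parse_speed_line(
    "aes-256-cbc 502516.14k 526801.81k 532572.84k 534060.71k 534194.86k")
_, win32 = parse_speed_line(
    "aes-256-cbc 143303.10k 176787.64k 187787.60k 188693.95k 192499.28k")
_, win64 = parse_speed_line(
    "aes-256-cbc 147196.56k 175030.57k 184217.62k 187552.99k 187896.74k")

for size, lin, w32, w64 in zip(BLOCK_SIZES, linux, win32, win64):
    print(f"{size:5d} bytes: Linux/Win32 = {lin / w32:.2f}, "
          f"Linux/Win64 = {lin / w64:.2f}")
```

Both Windows builds come out around 2.8x to 3.5x slower than the Linux run, a gap at least as large as the AES-NI on/off difference measured on the i7-3770. That would be consistent with AES-NI not being used on Windows, but it does not prove it.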