Bug 3002 - accelerated ssh (SIMD, AES-NI, etc.)
Summary: accelerated ssh (SIMD, AES-NI, etc.)
Status: NEW
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Client
Version: pre-1.0
Hardware: PC All
Importance: P2 Enhancement
Target Milestone: MediumPrio
Assignee: Peter Åstrand
URL:
Keywords:
Depends on:
Blocks: 5438 5616
Reported: 2009-01-27 15:56 CET by Pierre Ossman
Modified: 2015-10-08 11:54 CEST

See Also:
Acceptance Criteria:


Attachments

Description Pierre Ossman cendio 2009-01-27 15:56:44 CET
We've noticed on some terminals that SSH's CPU usage is the bottleneck. We should look at using SIMD to improve the performance of our SSH clients.

We could also look at improving the server end and pushing such changes upstream. That is a much bigger and longer-term project, though.
Comment 1 Peter Åstrand cendio 2009-02-03 11:01:55 CET
We should profile a client before deciding whether to implement this.
Comment 2 Patrik Pira 2012-12-12 08:22:48 CET
It may also be worth looking at HPN-SSH: http://www.psc.edu/index.php/hpn-ssh
Comment 3 Pierre Ossman cendio 2014-03-20 11:26:00 CET
Modern AMD and Intel CPUs have an instruction set called AES-NI that accelerates AES operations:

http://en.wikipedia.org/wiki/AES_instruction_set

OpenSSL 1.0.1 and newer support this, so it is available in our build.
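
On Linux, one way to see whether the CPU advertises AES-NI is to look for the "aes" flag in /proc/cpuinfo. The following is a minimal Python sketch under that assumption; the helper name has_aes_flag is made up for illustration, and the check only tells what the kernel reports, not which code path a particular OpenSSL build ends up using.

```python
def has_aes_flag(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at the "flags" field, and match whole words only.
        if sep and key.strip() == "flags" and "aes" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AES-NI advertised:", has_aes_flag(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```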

The performance gain seems to be about a factor of 2 on modern CPUs. This is on my workstation with an i7-3770:

~
[ossman@ossman]$ openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 131863971 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 34947471 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 8913094 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2192882 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 275734 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Wed Jan  8 07:20:55 UTC 2014
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     703274.51k   745546.05k   760584.02k   748503.72k   752937.64k

~
[ossman@ossman]$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 63098121 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 17559733 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 4531972 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1152076 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 142745 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Wed Jan  8 07:20:55 UTC 2014
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     336523.31k   374607.64k   386728.28k   393241.94k   389789.01k
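
The "factor of 2" can be checked directly from the two result rows above. A small Python sketch; the helper name parse_speed_row is hypothetical, and the rows are copied verbatim from the two runs.

```python
def parse_speed_row(row):
    """Parse an 'openssl speed' result row into floats (units of 1000 bytes/s)."""
    fields = row.split()
    # fields[0] is the cipher name; the rest look like "703274.51k".
    return [float(f.rstrip("k")) for f in fields[1:]]


with_aesni = parse_speed_row(
    "aes-128-cbc     703274.51k   745546.05k   760584.02k   748503.72k   752937.64k")
without_aesni = parse_speed_row(
    "aes-128-cbc     336523.31k   374607.64k   386728.28k   393241.94k   389789.01k")

block_sizes = [16, 64, 256, 1024, 8192]
speedups = [a / b for a, b in zip(with_aesni, without_aesni)]
for size, speedup in zip(block_sizes, speedups):
    print(f"{size:5d} bytes: {speedup:.2f}x")
```

For these numbers the ratio stays between roughly 1.9 and 2.1 across all block sizes.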


The OPENSSL_ia32cap environment variable overrides what OpenSSL believes the CPU is capable of; here it is used to force AES-NI off. I got the mask from an example I found with a quick Google search. The meaning of the bits is described here:

http://www.openssl.org/docs/crypto/OPENSSL_ia32cap.html
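
As a rough illustration of what the mask contains, the Python sketch below lists the bits set in 0x200000200000000. The bit names are taken from the OPENSSL_ia32cap documentation as I understand it (the upper 32 bits mirror the CPUID.1:ECX feature bits), so treat the table as an assumption to verify against the page above. The leading "~" in the variable's value means the listed bits are cleared rather than set.

```python
# Assumed from the OPENSSL_ia32cap documentation: mask bits 32..63 mirror
# CPUID.1:ECX, so ECX bit N appears as mask bit 32 + N.
KNOWN_BITS = {
    33: "PCLMULQDQ (carry-less multiply)",
    57: "AES-NI",
}


def set_bits(mask):
    """Return the positions of all bits set in mask, lowest first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


mask = 0x200000200000000
for bit in set_bits(mask):
    print(bit, KNOWN_BITS.get(bit, "unknown"))
```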
Comment 4 Pierre Ossman cendio 2014-03-20 11:30:16 CET
OpenSSH also benefits from this, although not quite as much:

~
[ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | time ssh -c aes128-cbc localhost "cat >/dev/null"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.40011 s, 238 MB/s
3.09user 0.68system 0:04.40elapsed 85%CPU (0avgtext+0avgdata 4384maxresident)k
0inputs+0outputs (0major+2086minor)pagefaults 0swaps

~
[ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | OPENSSL_ia32cap="~0x200000200000000" time ssh -c aes128-cbc localhost "cat >/dev/null"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 5.83286 s, 180 MB/s
4.70user 0.66system 0:05.83elapsed 92%CPU (0avgtext+0avgdata 6152maxresident)k
0inputs+0outputs (0major+2590minor)pagefaults 0swaps

Note that sshd was still using acceleration in this test, so the gain might reach the same factor of 2 if I disabled it there as well.
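
To put a number on "not quite as much", the throughput can be recomputed from the two dd summary lines above. A minimal Python sketch, assuming the GNU dd summary format shown in the transcripts; dd_throughput is a hypothetical helper name.

```python
import re


def dd_throughput(summary_line):
    """Return bytes per second computed from a GNU dd summary line."""
    m = re.match(r"(\d+) bytes .* copied, ([\d.]+) s,", summary_line)
    if m is None:
        raise ValueError("not a dd summary line")
    return int(m.group(1)) / float(m.group(2))


accel = dd_throughput("1048576000 bytes (1.0 GB) copied, 4.40011 s, 238 MB/s")
plain = dd_throughput("1048576000 bytes (1.0 GB) copied, 5.83286 s, 180 MB/s")
print(f"{accel / 1e6:.0f} MB/s vs {plain / 1e6:.0f} MB/s, "
      f"ratio {accel / plain:.2f}")
```

With only the client side restricted, the ratio comes out at about 1.3 rather than 2.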
Comment 5 Pierre Ossman cendio 2014-03-20 11:32:55 CET
It also works well with our ssh:

~
[ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | time /opt/thinlinc/lib/tlclient/ssh -c aes128-cbc localhost "cat >/dev/null"
NEXT AUTHMETHOD: none
AUTH FAILURE
NEXT AUTHMETHOD: publickey
AUTH SUCCESS
CONNECTED
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.02853 s, 260 MB/s

COMMAND_EXITSTATUS: 0
3.05user 0.66system 0:04.02elapsed 92%CPU (0avgtext+0avgdata 2224maxresident)k
0inputs+0outputs (0major+646minor)pagefaults 0swaps

~
[ossman@ossman]$ dd if=/dev/zero count=1000 bs=1M | OPENSSL_ia32cap="~0x200000200000000" time /opt/thinlinc/lib/tlclient/ssh -c aes128-cbc localhost "cat >/dev/null"
NEXT AUTHMETHOD: none
AUTH FAILURE
NEXT AUTHMETHOD: publickey
AUTH SUCCESS
CONNECTED
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 5.68193 s, 185 MB/s

COMMAND_EXITSTATUS: 0
4.80user 0.57system 0:05.68elapsed 94%CPU (0avgtext+0avgdata 2228maxresident)k
0inputs+0outputs (0major+648minor)pagefaults 0swaps
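
Since the original complaint was about CPU usage rather than throughput, it may be more telling to compare the CPU time of the ssh client in the two runs above. A Python sketch, assuming the default GNU time output format seen in the transcripts; cpu_seconds is a hypothetical helper name.

```python
import re


def cpu_seconds(time_line):
    """Return user + system CPU seconds from a default-format GNU time line."""
    m = re.match(r"([\d.]+)user ([\d.]+)system", time_line)
    if m is None:
        raise ValueError("not a GNU time summary line")
    return float(m.group(1)) + float(m.group(2))


accel = cpu_seconds(
    "3.05user 0.66system 0:04.02elapsed 92%CPU (0avgtext+0avgdata 2224maxresident)k")
plain = cpu_seconds(
    "4.80user 0.57system 0:05.68elapsed 94%CPU (0avgtext+0avgdata 2228maxresident)k")

transferred_gb = 1048576000 / 1e9  # byte count reported by dd in both runs
print(f"with AES-NI:    {accel / transferred_gb:.2f} CPU s/GB")
print(f"without AES-NI: {plain / transferred_gb:.2f} CPU s/GB")
print(f"ratio:          {plain / accel:.2f}")
```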
Comment 6 Peter Åstrand cendio 2015-10-08 10:34:27 CEST
It seems AES acceleration does not work correctly on at least our Windows 7 machine in the lab: SSH consumes about 10x as much CPU there as on my Linux workstation. A quick openssl run also indicates this:

[astrand@scilla ~]$ openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 94221776 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 24693835 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 6241088 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1564631 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 195628 aes-256-cbc's in 3.00s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Thu Jul 23 19:06:35 UTC 2015
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     502516.14k   526801.81k   532572.84k   534060.71k   534194.86k



H:\tmp>openssl speed -evp aes-256-cbc
WARNING: can't open config file: /usr/i686-pc-mingw32/sys-root/mingw/ssl/openssl.cnf
Doing aes-256-cbc for 3s on 16 size blocks: 26686790 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 64 size blocks: 8316807 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 2197129 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 554808 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 70383 aes-256-cbc's in 3.00s
OpenSSL 1.0.1j 15 Oct 2014
built on: Thu Dec 11 12:21:54 UTC 2014
options:bn(64,32) rc4(8x,mmx) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: i686-pc-mingw32-gcc -D_WINDLL -DOPENSSL_USE_APPLINK -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_MT -DDSO_WIN32 -DOPENSSL_NO_CAPIENG -DL_ENDIAN -DWIN32_LEAN_AND_MEAN -fomit-frame-pointer -O3 -march=i486 -Wall -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     143303.10k   176787.64k   187787.60k   188693.95k   192499.28k

With regard to compiler flags, the Windows build lacks -DBSAES_ASM.
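
The flag comparison can be done mechanically. The Python sketch below diffs the assembler-related defines from the two "compiler:" lines above, with the console line wrapping of the Windows output undone; asm_flags is a hypothetical helper name. Besides -DBSAES_ASM it also reports a bignum flag that differs, which is presumably unrelated to AES.

```python
def asm_flags(compiler_flags):
    """Return the set of -D...ASM... defines found in a flag string."""
    return {flag for flag in compiler_flags.split() if "_ASM" in flag}


# Assembler-related tail of the Linux "compiler:" line.
linux = asm_flags(
    "-DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 "
    "-DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM "
    "-DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM")

# Assembler-related tail of the 32-bit Windows "compiler:" line.
win32 = asm_flags(
    "-DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT "
    "-DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM "
    "-DRMD160_ASM -DAES_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM")

print("only on Linux:  ", sorted(linux - win32))
print("only on Windows:", sorted(win32 - linux))
```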
Comment 7 Peter Åstrand cendio 2015-10-08 11:54:56 CEST
The 64-bit Windows build is also slow:


H:\tmp64>openssl speed -evp aes-256-cbc
WARNING: can't open config file: /usr/x86_64-w64-mingw32/sys-root/mingw/ssl/openssl.cnf
Doing aes-256-cbc for 3s on 16 size blocks: 27698891 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 8191483 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 2155360 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 548596 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 69058 aes-256-cbc's in 3.01s
OpenSSL 1.0.1j 15 Oct 2014
built on: Thu Dec 11 12:28:41 UTC 2014
options:bn(64,64) rc4(16x,int) des(idx,cisc,2,long) aes(partial) idea(int) blowfish(idx)
compiler: x86_64-w64-mingw32-gcc -D_WINDLL -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_MT -DDSO_WIN32 -DL_ENDIAN -O3 -Wall -DWIN32_LEAN_AND_MEAN -DUNICODE -D_UNICODE -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     147196.56k   175030.57k   184217.62k   187552.99k   187896.74k

This build even has -DBSAES_ASM, so the missing flag does not explain the difference.
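
For reference, the throughput figures can be recomputed from the raw "Doing ..." lines as block size times count divided by time. The Python sketch below does this for the 8192-byte case on the Linux workstation and on 64-bit Windows; throughput_k is a hypothetical helper name. The printed times are rounded to two decimals, so a recomputed value may differ slightly from the one openssl reports.

```python
import re


def throughput_k(doing_line):
    """Return throughput in 1000s of bytes/s from an 'openssl speed' Doing line."""
    m = re.search(r"on (\d+) size blocks: (\d+) \S+ in ([\d.]+)s", doing_line)
    if m is None:
        raise ValueError("not an 'openssl speed' progress line")
    block_size, count, seconds = int(m.group(1)), int(m.group(2)), float(m.group(3))
    return block_size * count / seconds / 1000


linux = throughput_k(
    "Doing aes-256-cbc for 3s on 8192 size blocks: 195628 aes-256-cbc's in 3.00s")
win64 = throughput_k(
    "Doing aes-256-cbc for 3s on 8192 size blocks: 69058 aes-256-cbc's in 3.01s")

print(f"Linux:   {linux:.2f}k")
print(f"Windows: {win64:.2f}k")
print(f"ratio:   {linux / win64:.2f}")
```

The Linux value reproduces the reported 534194.86k, and the Linux workstation comes out a little under 3 times faster than the Windows machine for this block size.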