Commit 64d92d74 authored by Andy Polyakov

x86_64 assembly pack: "optimize" for Knights Landing, add AVX-512 results.



"Optimize" is in quotes because it's rather a "salvage operation"
for now. The idea is to identify processor capability flags that
drive Knights Landing to suboptimal code paths and mask them.
Two flags were identified, XSAVE and ADCX/ADOX. The former affects
the choice of the AES-NI code path specific to Silvermont (Knights Landing
is of Silvermont "ancestry"). And 64-bit ADCX/ADOX instructions are
effectively mishandled at decode time. In both cases we are looking
at ~2x improvement.
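
The masking idea described above can be illustrated with a small sketch.
It is not the code from this commit: the function name and the
is_knights_landing parameter are hypothetical, and how the processor is
actually identified is not shown here. Only the bit positions are taken
from the CPUID definition (XSAVE is CPUID.1:ECX bit 26, ADX is
CPUID.(EAX=7,ECX=0):EBX bit 19).

```python
# Illustrative sketch only; names are hypothetical, not OpenSSL's.
XSAVE_BIT = 1 << 26  # CPUID.1:ECX bit 26
ADX_BIT = 1 << 19    # CPUID.(EAX=7,ECX=0):EBX bit 19 (ADCX/ADOX)
MASK32 = 0xFFFFFFFF


def mask_capabilities(leaf1_ecx, leaf7_ebx, is_knights_landing):
    """Return the capability words with XSAVE and ADX hidden on Knights Landing.

    The caller is assumed to have already decided whether the processor is
    Knights Landing; that detection step is outside this sketch.
    """
    if is_knights_landing:
        leaf1_ecx &= ~XSAVE_BIT & MASK32  # steer away from the Silvermont AES-NI path
        leaf7_ebx &= ~ADX_BIT & MASK32    # steer away from the ADCX/ADOX code path
    return leaf1_ecx, leaf7_ebx
```

Code that later dispatches on these words would then never see the two
flags, so it falls back to the alternative paths without any change to
the dispatch logic itself.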

AVX-512 results cover even Skylake-X :-)

Hardware used for benchmarking courtesy of Atos, experiments run by
Romain Dolbeau <romain.dolbeau@atos.net>. Kudos!

Reviewed-by: Rich Salz <rsalz@openssl.org>
parent bbb4ceb8
+1 −0
@@ -179,6 +179,7 @@
# Haswell	4.44/0.63	0.63	0.73	0.63	0.70
# Skylake	2.62/0.63	0.63	0.63	0.63
# Silvermont	5.75/3.54	3.56	4.12	3.87(*)	4.11
# Knights L	2.54/0.77	0.78	0.85	-	1.50
# Goldmont	3.82/1.26	1.26	1.29	1.29	1.50
# Bulldozer	5.77/0.70	0.72	0.90	0.70	0.95
# Ryzen		2.71/0.35	0.35	0.44	0.38	0.49
+4 −2
@@ -24,7 +24,7 @@
#
# Performance in cycles per byte out of large buffer.
#
#		IALU/gcc 4.8(i)	1xSSSE3/SSE2	4xSSSE3	    8xAVX2
#		IALU/gcc 4.8(i)	1xSSSE3/SSE2	4xSSSE3	    NxAVX(v)
#
# P4		9.48/+99%	-/22.7(ii)	-
# Core2		7.83/+55%	7.90/8.08	4.35
@@ -32,8 +32,9 @@
# Sandy Bridge	8.31/+42%	5.45/6.76	2.72
# Ivy Bridge	6.71/+46%	5.40/6.49	2.41
# Haswell	5.92/+43%	5.20/6.45	2.42	    1.23
# Skylake	5.87/+39%	4.70/-		2.31	    1.19
# Skylake[-X]	5.87/+39%	4.70/-		2.31	    1.19[0.57]
# Silvermont	12.0/+33%	7.75/7.40	7.03(iii)
# Knights L	11.7/-		-		9.60(iii)   0.80
# Goldmont	10.6/+17%	5.10/-		3.28
# Sledgehammer	7.28/+52%	-/14.2(ii)	-
# Bulldozer	9.66/+28%	9.85/11.1	3.06(iv)
@@ -50,6 +51,7 @@
#	limitations, SSE2 can do better, but gain is considered too
#	low to justify the [maintenance] effort;
# (iv)	Bulldozer actually executes 4xXOP code path that delivers 2.20;
# (v)	8xAVX2 or 16xAVX-512, whichever best applicable;

$flavour = shift;
$output  = shift;
+2 −0
@@ -35,6 +35,8 @@
# Applications using the EVP interface will observe a few percent
# worse performance.]
#
# Knights Landing processes 1 byte in 1.25 cycles (measured with EVP).
#
# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest
# [2] http://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf

+3 −0
@@ -74,6 +74,7 @@
# Skylake	0.44(+110%)(if system doesn't support AVX)
# Bulldozer	1.49(+27%)
# Silvermont	2.88(+13%)
# Knights L	2.12(-)    (if system doesn't support AVX)
# Goldmont	1.08(+24%)

# March 2013
@@ -86,6 +87,8 @@
# it performs in 0.41 cycles per byte on Haswell processor, in
# 0.29 on Broadwell, and in 0.36 on Skylake.
#
# Knights Landing achieves 1.09 cpb.
#
# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest

$flavour = shift;
+5 −2
@@ -27,14 +27,15 @@
# Numbers are cycles per processed byte with poly1305_blocks alone,
# measured with rdtsc at fixed clock frequency.
#
#		IALU/gcc-4.8(*)	AVX(**)		AVX2
#		IALU/gcc-4.8(*)	AVX(**)		AVX2	AVX-512
# P4		4.46/+120%	-
# Core 2	2.41/+90%	-
# Westmere	1.88/+120%	-
# Sandy Bridge	1.39/+140%	1.10
# Haswell	1.14/+175%	1.11		0.65
# Skylake	1.13/+120%	0.96		0.51
# Skylake[-X]	1.13/+120%	0.96		0.51	[0.35]
# Silvermont	2.83/+95%	-
# Knights L	3.60/-		1.65		1.10	(***)
# Goldmont	1.70/+180%	-
# VIA Nano	1.82/+150%	-
# Sledgehammer	1.38/+160%	-
@@ -49,6 +50,8 @@
#	Core processors, 50-30%, less newer processor is, but slower on
#	contemporary ones, for example almost 2x slower on Atom, and as
#	former are naturally disappearing, SSE2 is deemed unnecessary;
# (***)	Current AVX-512 code requires BW and VL extensions and can not
#	execute on Knights Landing;

$flavour = shift;
$output  = shift;
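
Footnote (***) in the Poly1305 table above says the AVX-512 code needs
the BW and VL extensions, which Knights Landing lacks. A minimal sketch
of such a capability check follows. The function name is hypothetical
and this is not the dispatch code from the module; the bit positions are
those of CPUID.(EAX=7,ECX=0):EBX.

```python
# Illustrative sketch only; not taken from the OpenSSL sources.
AVX512F = 1 << 16   # foundation
AVX512PF = 1 << 26  # prefetch (Xeon Phi)
AVX512ER = 1 << 27  # exponential/reciprocal (Xeon Phi)
AVX512CD = 1 << 28  # conflict detection
AVX512BW = 1 << 30  # byte and word operations
AVX512VL = 1 << 31  # 128/256-bit vector lengths


def can_run_avx512_bw_vl(leaf7_ebx):
    """True only if F, BW and VL are all reported in the leaf 7 EBX word."""
    required = AVX512F | AVX512BW | AVX512VL
    return (leaf7_ebx & required) == required


# Knights Landing reports F, CD, ER and PF, but neither BW nor VL.
knights_landing_like = AVX512F | AVX512CD | AVX512ER | AVX512PF
# A Skylake-X-like part reports F, CD, BW and VL (among others).
skylake_x_like = AVX512F | AVX512CD | AVX512BW | AVX512VL
```

With these example words, can_run_avx512_bw_vl(knights_landing_like) is
False and can_run_avx512_bw_vl(skylake_x_like) is True, which matches
the table: Knights Landing falls back to the AVX2 column.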