Commit d52d5ad1 authored by Andy Polyakov

modes/asm/ghash-*.pl: switch to [more reproducible] performance results

collected with 'apps/openssl speed ghash'.
parent a3b0c44b
+3 −3
@@ -12,9 +12,9 @@
# The module implements "4-bit" GCM GHASH function and underlying
# single multiplication operation in GF(2^128). "4-bit" means that it
# uses 256 bytes per-key table [+128 bytes shared table]. On PA-7100LC
-# it processes one byte in 19 cycles, which is more than twice as fast
-# as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for 8
-# cycles, but measured performance on PA-8600 system is ~9 cycles per
+# it processes one byte in 19.6 cycles, which is more than twice as
+# fast as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for
+# 8 cycles, but measured performance on PA-8600 system is ~9 cycles per
# processed byte. This is ~2.2x faster than 64-bit code generated by
# vendor compiler (which used to be very hard to beat:-).
#
+2 −2
@@ -17,8 +17,8 @@
#
#		gcc 3.3.x	cc 5.2		this assembler
#
-# 32-bit build	81.0		48.6		11.8	(+586%/+311%)
-# 64-bit build	27.5		20.3		11.8	(+133%/+72%)
+# 32-bit build	81.4		43.3		12.6	(+546%/+244%)
+# 64-bit build	20.2		21.2		12.6	(+60%/+68%)
#
# Here is data collected on UltraSPARC T1 system running Linux:
#
+34 −35
@@ -21,17 +21,18 @@
#
#		gcc 2.95.3(*)	MMX assembler	x86 assembler
#
-# Pentium	100/112(**)	-		50
-# PIII		63 /77		12.2		24
-# P4		96 /122		18.0		84(***)
-# Opteron	50 /71		10.1		30
-# Core2		54 /68		8.6		18
+# Pentium	105/111(**)	-		50
+# PIII		68 /75		12.2		24
+# P4		125/125		17.8		84(***)
+# Opteron	66 /70		10.1		30
+# Core2		54 /67		8.4		18
#
# (*)	gcc 3.4.x was observed to generate few percent slower code,
#	which is one of reasons why 2.95.3 results were chosen,
#	another reason is lack of 3.4.x results for older CPUs;
-#	comparison is not completely fair, because C results are
-#	for vanilla "256B" implementations, not "528B";-)
+#	comparison with MMX results is not completely fair, because C
+#	results are for vanilla "256B" implementation, while
+#	assembler results are for "528B";-)
# (**)	second number is result for code compiled with -fPIC flag,
#	which is actually more relevant, because assembler code is
#	position-independent;
@@ -44,7 +45,7 @@

# May 2010
#
-# Add PCLMULQDQ version performing at 2.13 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
# The question is how close is it to theoretical limit? The pclmulqdq
# instruction latency appears to be 14 cycles and there can't be more
# than 2 of them executing at any given time. This means that single
@@ -60,38 +61,36 @@
# Before we proceed to this implementation let's have closer look at
# the best-performing code suggested by Intel in their white paper.
# By tracing inter-register dependencies Tmod is estimated as ~19
-# cycles and Naggr is 4, resulting in 2.05 cycles per processed byte.
-# As implied, this is quite optimistic estimate, because it does not
-# account for Karatsuba pre- and post-processing, which for a single
-# multiplication is ~5 cycles. Unfortunately Intel does not provide
-# performance data for GHASH alone, only for fused GCM mode. But
-# we can estimate it by subtracting CTR performance result provided
-# in "AES Instruction Set" white paper: 3.54-1.38=2.16 cycles per
-# processed byte or 5% off the estimate. It should be noted though
-# that 3.54 is GCM result for 16KB block size, while 1.38 is CTR for
-# 1KB block size, meaning that real number is likely to be a bit
-# further from estimate.
+# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
+# processed byte. As implied, this is quite optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte of out 16KB buffer. Note that
+# the result accounts even for pre-computing of degrees of the hash
+# key H, but its portion is negligible at 16KB buffer size.
#
# Moving on to the implementation in question. Tmod is estimated as
# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
# 2.16. How is it possible that measured performance is better than
# optimistic theoretical estimate? There is one thing Intel failed
-# to recognize. By fusing GHASH with CTR former's performance is
-# really limited to above (Tmul + Tmod/Naggr) equation. But if GHASH
-# procedure is detached, the modulo-reduction can be interleaved with
-# Naggr-1 multiplications and under ideal conditions even disappear
-# from the equation. So that optimistic theoretical estimate for this
-# implementation is ... 28/16=1.75, and not 2.16. Well, it's probably
-# way too optimistic, at least for such small Naggr. I'd argue that
-# (28+Tproc/Naggr), where Tproc is time required for Karatsuba pre-
-# and post-processing, is more realistic estimate. In this case it
-# gives ... 1.91 cycles per processed byte. Or in other words,
-# depending on how well we can interleave reduction and one of the
-# two multiplications the performance should be betwen 1.91 and 2.16.
-# As already mentioned, this implementation processes one byte [out
-# of 1KB buffer] in 2.13 cycles, while x86_64 counterpart - in 2.07.
-# x86_64 performance is better, because larger register bank allows
-# to interleave reduction and multiplication better.
+# to recognize. By serializing GHASH with CTR in same subroutine
+# former's performance is really limited to above (Tmul + Tmod/Naggr)
+# equation. But if GHASH procedure is detached, the modulo-reduction
+# can be interleaved with Naggr-1 multiplications at instruction level
+# and under ideal conditions even disappear from the equation. So that
+# optimistic theoretical estimate for this implementation is ...
+# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
+# at least for such small Naggr. I'd argue that (28+Tproc/Naggr),
+# where Tproc is time required for Karatsuba pre- and post-processing,
+# is more realistic estimate. In this case it gives ... 1.91 cycles.
+# Or in other words, depending on how well we can interleave reduction
+# and one of the two multiplications the performance should be betwen
+# 1.91 and 2.16. As already mentioned, this implementation processes
+# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
+# - in 2.02. x86_64 performance is better, because larger register
+# bank allows to interleave reduction and multiplication better.
#
# Does it make sense to increase Naggr? To start with it's virtually
# impossible in 32-bit mode, because of limited register bank
+5 −4
@@ -20,17 +20,18 @@
#		gcc 3.4.x(*)	assembler
#
# P4		28.6		14.0		+100%
-# Opteron	18.5		7.7		+140%
-# Core2		17.5		8.1(**)		+115%
+# Opteron	19.3		7.7		+150%
+# Core2		17.8		8.1(**)		+120%
#
# (*)	comparison is not completely fair, because C results are
-#	for vanilla "256B" implementation, not "528B";-)
+#	for vanilla "256B" implementation, while assembler results
+#	are for "528B";-)
# (**)	it's mystery [to me] why Core2 result is not same as for
#	Opteron;

# May 2010
#
-# Add PCLMULQDQ version performing at 2.07 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.02 cycles per processed byte.
# See ghash-x86.pl for background information and details about coding
# techniques.
#
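
The theoretical figures quoted in the ghash-x86.pl commentary (2.05, 2.16, 1.75 and 1.91 cycles per byte) all appear to come from one expression, (Tmul + T/Naggr) divided by the 16 bytes of a GHASH block. The sketch below reproduces that arithmetic only; it is not part of the commit. The value Tmul = 28 is an assumption inferred from the "28/16=1.75" figure, Tproc = 5 is taken from the "~5 cycles" remark, and the function and dictionary names are made up for illustration.

```python
BLOCK_BYTES = 16  # one GHASH multiplication consumes a 128-bit block


def cycles_per_byte(t_mul, t_extra, n_aggr):
    """Return (t_mul + t_extra / n_aggr) cycles per block, spread over 16 bytes."""
    return (t_mul + t_extra / n_aggr) / BLOCK_BYTES


T_MUL = 28  # assumption: inferred from the "28/16=1.75" figure in the comment
T_PROC = 5  # Karatsuba pre- and post-processing, "~5 cycles" per the comment

estimates = {
    # Intel white paper code: Tmod ~19 cycles, Naggr = 4
    "intel_white_paper": cycles_per_byte(T_MUL, 19, 4),
    # this implementation with reduction fully exposed: Tmod ~13, Naggr = 2
    "reduction_exposed": cycles_per_byte(T_MUL, 13, 2),
    # reduction perfectly hidden behind the multiplications
    "reduction_hidden": T_MUL / BLOCK_BYTES,
    # reduction hidden, Karatsuba overhead still paid once per Naggr blocks
    "karatsuba_overhead": cycles_per_byte(T_MUL, T_PROC, 2),
}

for name, value in estimates.items():
    print(f"{name}: {value:.2f} cycles/byte")
```

Under these assumptions the measured 2.10 (x86) and 2.02 (x86_64) cycles per byte both fall inside the 1.91 to 2.16 window the comment predicts.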