aes-586.pl

#!/usr/bin/env perl
#
# ====================================================================
# Written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
# project. Rights for redistribution and usage in source and binary
# forms are granted according to the OpenSSL license.
# ====================================================================
#
# Version 4.3.
#
# You might fail to appreciate this module performance from the first
# try. If compared to "vanilla" linux-ia32-icc target, i.e. considered
# to be *the* best Intel C compiler without -KPIC, performance appears
# to be virtually identical... But try to re-configure with shared
# library support... Aha! Intel compiler "suddenly" lags behind by 30%
# [on P4, more on others]:-) And if compared to position-independent
# code generated by GNU C, this code performs *more* than *twice* as
# fast! Yes, all this buzz about PIC means that unlike other hand-
# coded implementations, this one was explicitly designed to be safe
# to use even in shared library context... This also means that this
# code isn't necessarily absolutely fastest "ever," because in order
# to achieve position independence an extra register has to be
# off-loaded to stack, which affects the benchmark result.
#
# Special note about instruction choice. Do you recall RC4_INT code
# performing poorly on P4? It might be the time to figure out why.
# RC4_INT code implies effective address calculations in base+offset*4
# form. Trouble is that it seems that offset scaling turned to be
# critical path... At least eliminating scaling resulted in 2.8x RC4
# performance improvement [as you might recall]. As AES code is hungry
# for scaling too, I [try to] avoid the latter by favoring off-by-2
# shifts and masking the result with 0xFF<<2 instead of "boring" 0xFF.
#
# As was shown by Dean Gaudet <dean@arctic.org>, the above note turned
# void. Performance improvement with off-by-2 shifts was observed on
# intermediate implementation, which was spilling yet another register
# to stack... Final offset*4 code below runs just a tad faster on P4,
# but exhibits up to 10% improvement on other cores.
#
# Second version is "monolithic" replacement for aes_core.c, which in
# addition to AES_[de|en]crypt implements AES_set_[de|en]cryption_key.
# This made it possible to implement little-endian variant of the
# algorithm without modifying the base C code. Motivating factor for
# the undertaken effort was that it appeared that in tight IA-32
# register window little-endian flavor could achieve slightly higher
# Instruction Level Parallelism, and it indeed resulted in up to 15%
# better performance on most recent µ-archs...
#
# Third version adds AES_cbc_encrypt implementation, which resulted in
# up to 40% performance imrovement of CBC benchmark results. 40% was
# observed on P4 core, where "overall" imrovement coefficient, i.e. if
# compared to PIC generated by GCC and in CBC mode, was observed to be
# as large as 4x:-) CBC performance is virtually identical to ECB now
# and on some platforms even better, e.g. 17.6 "small" cycles/byte on
# Opteron, because certain function prologues and epilogues are
# effectively taken out of the loop...
#
# Version 3.2 implements compressed tables and prefetch of these tables
# in CBC[!] mode. Former means that 3/4 of table references are now
# misaligned, which unfortunately has negative impact on elder IA-32
# implementations, Pentium suffered 30% penalty, PIII - 10%.
#
# Version 3.3 avoids L1 cache aliasing between stack frame and
# S-boxes, and 3.4 - L1 cache aliasing even between key schedule. The
# latter is achieved by copying the key schedule to controlled place in
# stack. This unfortunately has rather strong impact on small block CBC
# performance, ~2x deterioration on 16-byte block if compared to 3.3.
#
# Version 3.5 checks if there is L1 cache aliasing between user-supplied
# key schedule and S-boxes and abstains from copying the former if
# there is no. This allows end-user to consciously retain small block
# performance by aligning key schedule in specific manner.
#
# Version 3.6 compresses Td4 to 256 bytes and prefetches it in ECB.
#
# Current ECB performance numbers for 128-bit key in CPU cycles per
# processed byte [measure commonly used by AES benchmarkers] are:
#
#		small footprint		fully unrolled
# P4		24			22
# AMD K8	20			19
# PIII		25			23
# Pentium	81			78
#
# Version 3.7 reimplements outer rounds as "compact." Meaning that
# first and last rounds reference compact 256 bytes S-box. This means
# that first round consumes a lot more CPU cycles and that encrypt
# and decrypt performance becomes asymmetric. Encrypt performance
# drops by 10-12%, while decrypt - by 20-25%:-( 256 bytes S-box is
# aggressively pre-fetched.
#
# Version 4.0 effectively rolls back to 3.6 and instead implements
# additional set of functions, _[x86|sse]_AES_[en|de]crypt_compact,
# which use exclusively 256 byte S-box. These functions are to be
# called in modes not concealing plain text, such as ECB, or when
# we're asked to process smaller amount of data [or unconditionally
# on hyper-threading CPU]. Currently it's called unconditionally from
# AES_[en|de]crypt, which affects all modes, but CBC. CBC routine
# still needs to be modified to switch between slower and faster
# mode when appropriate... But in either case benchmark landscape
# changes dramatically and below numbers are CPU cycles per processed
# byte for 128-bit key.
#
#		ECB encrypt	ECB decrypt	CBC large chunk
# P4		56[60]		84[100]		23
# AMD K8	48[44]		70[79]		18
# PIII		41[50]		61[91]		24
# Pentium	120		160		77
#
# Version 4.1 switches to compact S-box even in key schedule setup.
#
# Version 4.2 prefetches compact S-box in every SSE round or in other
# words every cache-line is *guaranteed* to be accessed within ~50
# cycles window. Why just SSE? Because it's needed on hyper-threading
# CPU! Which is also why it's prefetched with 64 byte stride. Best
# part is that it has no negative effect on performance:-)  
#
# Version 4.3 implements switch between compact and non-compact block
# functions in AES_cbc_encrypt depending on how much data was asked
# to process in one stroke.
#
# Timing attacks are classified in two classes: synchronous when
# attacker consciously initiates cryptographic operation and collects
# timing data of various character afterwards, and asynchronous when
# malicious code is executed on same CPU simultaneously with AES,
# instruments itself and performs statistical analysis of this data.
#
# As far as synchronous attacks go the root to the AES timing
# vulnerability is twofold. Firstly, of 256 S-box elements at most 160
# are referred to in single 128-bit block operation. Well, in C
# implementation with 4 distinct tables it's actually as little as 40
# references per 256 elements table, but anyway... Secondly, even
# though S-box elements are clustered into smaller amount of cache-
# lines, smaller than 160 and even 40, it turned out that for certain
# plain-text pattern[s] or simply put chosen plain-text and given key
# few cache-lines remain unaccessed during block operation. Now, if
# attacker can figure out this access pattern, he can deduct the key
# [or at least part of it]. The natural way to mitigate this kind of
# attacks is to minimize the amount of cache-lines in S-box and/or
# prefetch them to ensure that every one is accessed for more uniform
# timing. But note that *if* plain-text was concealed in such way that
# input to block function is distributed *uniformly*, then attack
# wouldn't apply. Now note that some encryption modes, most notably
# CBC, do masks the plain-text in this exact way [secure cipher output
# is distributed uniformly]. Yes, one still might find input that
# would reveal the information about given key, but if amount of
# candidate inputs to be tried is larger than amount of possible key
# combinations then attack becomes infeasible. This is why revised
# AES_cbc_encrypt "dares" to switch to larger S-box when larger chunk
# of data is to be processed in one stroke. The current size limit of
# 512 bytes is chosen to provide same [diminishigly low] probability
# for cache-line to remain untouched in large chunk operation with
# large S-box as for single block operation with compact S-box and
# surely needs more careful consideration...
#
# As for asynchronous attacks. There are two flavours: attacker code
# being interleaved with AES on hyper-threading CPU at *instruction*
# level, and two processes time sharing single core. As for latter.
# Two vectors. 1. Given that attacker process has higher priority,
# yield execution to process performing AES just before timer fires
# off the scheduler, immediately regain control of CPU and analyze the
# cache state. For this attack to be efficient attacker would have to
# effectively slow down the operation by several *orders* of magnitute,
# by ratio of time slice to duration of handful of AES rounds, which
# unlikely to remain unnoticed. Not to mention that this also means
# that he would spend correspondigly more time to collect enough
# statistical data to mount the attack. It's probably appropriate to
# say that if adeversary reckons that this attack is beneficial and
# risks to be noticed, you probably have larger problems having him
# mere opportunity. In other words suggested code design expects you
# to preclude/mitigate this attack by overall system security design.
# 2. Attacker manages to make his code interrupt driven. In order for
# this kind of attack to be feasible, interrupt rate has to be high
# enough, again comparable to duration of handful of AES rounds. But
# is there interrupt source of such rate? Hardly, not even 1Gbps NIC
# generates interrupts at such raging rate...
#
# And now back to the former, hyper-threading CPU or more specifically
# Intel P4. Recall that asynchronous attack implies that malicious
# code instruments itself. And naturally instrumentation granularity
# has be noticeably lower than duration of codepath accessing S-box.
# Given that all cache-lines are accessed during that time that is.
# Current implementation accesses *all* cache-lines within ~50 cycles
# window, which is actually *less* than RDTSC latency on Intel P4!

push(@INC,"perlasm","../../perlasm");
require "x86asm.pl";

&asm_init($ARGV[0],"aes-586.pl",$x86only = $ARGV[$#ARGV] eq "386");

$s0="eax";
$s1="ebx";
$s2="ecx";
$s3="edx";
$key="edi";
$acc="esi";
$tbl="ebp";

# stack frame layout in _[x86|sse]_AES_* routines, frame is allocated
# by caller