.explicit
.text
.ident "ia64.S, Version 2.1"
.ident "IA-64 ISA artwork by Andy Polyakov "
//
// ====================================================================
// Written by Andy Polyakov for the OpenSSL
// project.
//
// Rights for redistribution and usage in source and binary forms are
// granted according to the OpenSSL license. Warranty of any kind is
// disclaimed.
// ====================================================================
//
// Version 2.x is Itanium2 re-tune. Few words about how Itanum2 is
// different from Itanium to this module viewpoint. Most notably, is it
// "wider" than Itanium? Can you experience loop scalability as
// discussed in commentary sections? Not really:-( Itanium2 has 6
// integer ALU ports, i.e. it's 2 ports wider, but it's not enough to
// spin twice as fast, as I need 8 IALU ports. Amount of floating point
// ports is the same, i.e. 2, while I need 4. In other words, to this
// module Itanium2 remains effectively as "wide" as Itanium. Yet it's
// essentially different in respect to this module, and a re-tune was
// required. Well, because some intruction latencies has changed. Most
// noticeably those intensively used:
//
// Itanium Itanium2
// ldf8 9 6 L2 hit
// ld8 2 1 L1 hit
// getf 2 5
// xma[->getf] 7[+1] 4[+0]
// add[->st8] 1[+1] 1[+0]
//
// What does it mean? You might ratiocinate that the original code
// should run just faster... Because sum of latencies is smaller...
// Wrong! Note that getf latency increased. This means that if a loop is
// scheduled for lower latency (as they were), then it will suffer from
// stall condition and the code will therefore turn anti-scalable, e.g.
// original bn_mul_words spun at 5*n or 2.5 times slower than expected
// on Itanium2! What to do? Reschedule loops for Itanium2? But then
// Itanium would exhibit anti-scalability. So I've chosen to reschedule
// for worst latency for every instruction aiming for best *all-round*
// performance.
// Q. How much faster does it get?
// A. Here is the output from 'openssl speed rsa dsa' for vanilla
// 0.9.6a compiled with gcc version 2.96 20000731 (Red Hat
// Linux 7.1 2.96-81):
//
// sign verify sign/s verify/s
// rsa 512 bits 0.0036s 0.0003s 275.3 2999.2
// rsa 1024 bits 0.0203s 0.0011s 49.3 894.1
// rsa 2048 bits 0.1331s 0.0040s 7.5 250.9
// rsa 4096 bits 0.9270s 0.0147s 1.1 68.1
// sign verify sign/s verify/s
// dsa 512 bits 0.0035s 0.0043s 288.3 234.8
// dsa 1024 bits 0.0111s 0.0135s 90.0 74.2
//
// And here is similar output but for this assembler
// implementation:-)
//
// sign verify sign/s verify/s
// rsa 512 bits 0.0021s 0.0001s 549.4 9638.5
// rsa 1024 bits 0.0055s 0.0002s 183.8 4481.1
// rsa 2048 bits 0.0244s 0.0006s 41.4 1726.3
// rsa 4096 bits 0.1295s 0.0018s 7.7 561.5
// sign verify sign/s verify/s
// dsa 512 bits 0.0012s 0.0013s 891.9 756.6
// dsa 1024 bits 0.0023s 0.0028s 440.4 376.2
//
// Yes, you may argue that it's not fair comparison as it's
// possible to craft the C implementation with BN_UMULT_HIGH
// inline assembler macro. But of course! Here is the output
// with the macro:
//
// sign verify sign/s verify/s
// rsa 512 bits 0.0020s 0.0002s 495.0 6561.0
// rsa 1024 bits 0.0086s 0.0004s 116.2 2235.7
// rsa 2048 bits 0.0519s 0.0015s 19.3 667.3
// rsa 4096 bits 0.3464s 0.0053s 2.9 187.7
// sign verify sign/s verify/s
// dsa 512 bits 0.0016s 0.0020s 613.1 510.5
// dsa 1024 bits 0.0045s 0.0054s 221.0 183.9
//
// My code is still way faster, huh:-) And I believe that even
// higher performance can be achieved. Note that as keys get
// longer, performance gain is larger. Why? According to the
// profiler there is another player in the field, namely
// BN_from_montgomery consuming larger and larger portion of CPU
// time as keysize decreases. I therefore consider putting effort
// to assembler implementation of the following routine:
//
// void bn_mul_add_mont (BN_ULONG *rp,BN_ULONG *np,int nl,BN_ULONG n0)
// {
// int i,j;
// BN_ULONG v;
//
// for (i=0; i