crypto/sha/asm/sha512-sparcv9.pl +14 −6
@@ -17,7 +17,7 @@
 # Performance is >75% better than 64-bit code generated by Sun C and
 # over 2x than 32-bit code. X[16] resides on stack, but access to it
 # is scheduled for L2 latency and staged through 32 least significant
-# bits of %l0-%l7. The latter is done to achieve 32-/64-bit bit ABI
+# bits of %l0-%l7. The latter is done to achieve 32-/64-bit ABI
 # duality. Nevetheless it's ~40% faster than SHA256, which is pretty
 # good [optimal coefficient is 50%].
 #
@@ -25,14 +25,22 @@
 #
 # It's not any faster than 64-bit code generated by Sun C 5.8. This is
 # because 64-bit code generator has the advantage of using 64-bit
-# loads to access X[16], which I consciously traded for 32-/64-bit ABI
-# duality [as per above]. But it surpasses 32-bit Sun C generated code
-# by 60%, not to mention that it doesn't suffer from severe decay when
-# running 4 times physical cores threads and that it leaves gcc [3.4]
-# behind by over 4x factor! If compared to SHA256, single thread
+# loads(*) to access X[16], which I consciously traded for 32-/64-bit
+# ABI duality [as per above]. But it surpasses 32-bit Sun C generated
+# code by 60%, not to mention that it doesn't suffer from severe decay
+# when running 4 times physical cores threads and that it leaves gcc
+# [3.4] behind by over 4x factor! If compared to SHA256, single thread
 # performance is only 10% better, but overall throughput for maximum
 # amount of threads for given CPU exceeds corresponding one of SHA256
 # by 30% [again, optimal coefficient is 50%].
+#
+# (*) Unlike pre-T1 UltraSPARC loads on T1 are executed strictly
+#     in-order, i.e. load instruction has to complete prior next
+#     instruction in given thread is executed, even if the latter is
+#     not dependent on load result! This means that on T1 two 32-bit
+#     loads are always slower than one 64-bit load. Once again this
+#     is unlike pre-T1 UltraSPARC, where, if scheduled appropriately,
+#     2x32-bit loads can be as fast as 1x64-bit ones.
 
 $bits=32;
 for (@ARGV) { $bits=64 if (/\-m64/ || /\-xarch\=v9/); }
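The trade-off described in the new (*) footnote can be pictured in C. This is only a sketch and not code from sha512-sparcv9.pl: the helper names are hypothetical, and it assumes each 64-bit X[] word is kept as two 32-bit halves with the high half first (big-endian word order, as on SPARC). Reading a word back this way costs two loads plus a shift and an OR, where a 64-bit-only build could use a single load.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: spill a 64-bit word as two 32-bit halves,
 * high half first (assumed layout, see the note above). */
void store64_as_2x32(uint64_t v, uint32_t half[2])
{
    half[0] = (uint32_t)(v >> 32);  /* upper 32 bits */
    half[1] = (uint32_t)v;          /* lower 32 bits */
}

/* Hypothetical helper: rebuild the word with two 32-bit loads.
 * Per the footnote, on T1 the second load cannot start until the
 * first has completed, so this is slower than one 64-bit load. */
uint64_t load64_as_2x32(const uint32_t half[2])
{
    uint64_t hi = half[0];
    uint64_t lo = half[1];
    return (hi << 32) | lo;
}

/* Hypothetical helper: the single 64-bit load that a 64-bit-only
 * code generator is free to use. */
uint64_t load64_as_1x64(const uint64_t *word)
{
    return *word;
}

/* Self-check: both access patterns yield the same value. */
void check_roundtrip(uint64_t v)
{
    uint32_t half[2];
    store64_as_2x32(v, half);
    assert(load64_as_2x32(half) == load64_as_1x64(&v));
}
```

Both paths return the same value; the difference the footnote points at is purely in how many dependent memory operations the thread has to wait for.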