Entry tags:
SHA-1 in 1300 clock cycles. (Warning: Long)
I haven't had time to post anything new in here. (Malware analysis and stuff.) So here's something old I wrote, that few people have seen.
I thought of a way to shave a few (maybe 50) more clock cycles off this. But I haven't had time to update this.
There's a small .c program to demonstrate how to use this, and to perform some unit tests, here: http://www.virtadpt.net/~julia/sha1.zip
I thought of a way to shave a few (maybe 50) more clock cycles off this. But I haven't had time to update this.
There's a small .c program to demonstrate how to use this, and to perform some unit tests, here: http://www.virtadpt.net/~julia/sha1.zip
; SHA-1 transform
; Version: 0.14
; Date: Jan 20, 2004
; Author: Julia Wolf
; Copyright 2004, Distribution in accordence with the GPL (http://www.gnu.org/)
; I wrote the first version (0.1) of this in about a day working only off of
; the FIPS 180-1 standard document itself (and my previous MMX MD-5 routine).
; Then I spend a week or so shaving off about 300 clock cycles.
; It's probabily best to have a copy of FIPS 180-1 sitting in front of you if
; you want to try to follow what the code is doing.
; The bizzare instruction order, the NOPs and seemingly random
; use of the same value in registers and from memory load, are all
; to shave off a few clock cycles here and there.
; In summary, these instructions have been layed out to completely
; saturate all of the pipelines and execution units in the AMD K6-II
; whenever possible, and also make the instruction decoder happy.
; (it really is faster with the NOP's)
; I can probibally shave off a few more clock cycles, but I've pretty much
; hit the flat part of the curve now. It's not going to get any faster than
; it is now by more than a few clocks unless I start removing productive
; code. (Or adding more 64 bit registers to the silicon)
; Statistics on my AMD K6-II (fastest machine I have because I'm poor)
; (Timmed with RDTSC)
; best time avg. ~1295 clocks
; 826 opcodes to final RET, including NOP's
; 777 opcodes without NOP's
; 1295/777 = 1.66 opcodes per clock cycle...
; ...or about 0.60 clocks per opcode.
; total length 2720 bytes including the NOP's after RET filling the cache
; line. Or, exactly 85 cache lines. The four main rounds were rolled back up
; to reduce cache misses on the instruction cache.
; This function assumes that the 512-bit blocks have already been padded
; at the end.
; You can call this like this:
; extern void sha1_mmx(unsigned char* x, unsigned char* y);
; unsigned char a[8];
; unsigned char b[8];
; unsigned char c[8];
; unsigned char d[8];
; unsigned char e[8];
; unsigned char x[64], y[64];
; sha1_mmx(x,y)
; x and y are your 512 bits of message, padded apropreately
; ('1' bit with the size mod 2^64 at the end, see FIPS 180-1)
; a, b, c, d, and e are the two contexts, foreach z in a,b,c,d,e:
; z[0],z[1],z[2],z[3] belong to the first context and
; z[4],z[5],z[6],z[7] belong to the second context.
; I really don't know why I bother with using byte arrays rather than int's
; since this will only ever run on a Pentium MMX and above I don't have to
; worry about endedness issues. (force of habbit I guess)
[SECTION .bss align=8 noexec]
W RESQ 80 ; first 16 elements same as M see FIPS 180-1
[SECTION .rodata align=8 noexec]
K1 dd 0x5A827999, 0x5A827999 ; for T[ 0..19]
K2 dd 0x6ED9EBA1, 0x6ED9EBA1 ; for T[20..39]
K3 dd 0x8F1BBCDC, 0x8F1BBCDC ; for T[40..59]
K4 dd 0xCA62C1D6, 0xCA62C1D6 ; for T[60..79]
; TODO: pass the context as an argument insted of using globals
[SECTION .data align=8 noexec]
extern a
extern b
extern c
extern d
extern e
[SECTION .text allign=8 nowrite]
global sha1_mmx:function
sha1_mmx:
; This is faster than ENTER 0,0 by a few clocks, strangely
; EBP, ESI, and stuff arn't really used here, so
; I could take these ops out, there are no local variables.
PUSH EBP ; 55
MOV EBP, ESP ; 89 e5
PUSH EBX ; 53
PUSH ESI ; 56
PUSH EDX ; 52
%define X EDX
%define Y EBX
MOV Y, [EBP+8] ; 8b 55 08 ; arrrg 2
MOV X, [EBP+12] ; 8b 5d 0c ; arrrg 1
; I think this is the only instruction that makes this
; not compatible with the original Pentium MMX
; (replace with EMMS if there's a problem)
FEMMS ; 0f 0e
PREFETCH [X+32] ; 0f 0d 42 20
PREFETCH [Y+32] ; 0f 0d 43 20
XOR ECX, ECX ; 31 c9 ; shorter than doing MOV ECX, 0 ; b9 00 00 00 00
; Rather than using the [disp32+index] addressing mode
; storing [W] in unused register EAX allows for shorter
; instruction length.
%define index EAX
MOV index, W ; b8 38 a7 04 08 ($0x804a738)
NOP ; 90
NOP ; 90
NOP ; 90
;-- cache line boundry
; This routine takes 64 bits from each block of 512 bits
; and mixes them together into alternating blocks of 32 bits
; for example, x[n] and y[n] are 32 bits each:
; input: x[1], x[2], x[3], x[4], x[5] ...
; input: y[1], y[2], y[3], y[4], y[5] ...
; output: x[1], y[1], x[2], y[2], x[3], y[3], x[4], y[4], ...
; The output is placed into the first 16 QWORDS of [W]
; This is done so that each 32 bits fills the top and bottom
; halves of a 64 bit MMX register. Allowing both to be worked
; on in parallel. (It's the rationale behind this whole SHA-1
; function if you havn't guessed already)
; only need to do this for 8 DWORDs to make 16 QWORDs (hence ECX)
; this is about 20 to 90 clocks.
; These were changed to byte-sized version to save space
; and ECX never goes above 256 so it's ok
; ADD ECX, 8 ; 81 c1 08 00 00 00
; CMP ECX, 64 ; 81 f9 40 00 00 00
multiplex:
MOVQ MM6, [X+ECX] ; 0f 6f 34 0a ; 12345678
MOVQ MM7, [Y+ECX] ; 0f 6f 3c 0b ; abcdefgh
MOVQ MM4, MM6 ; 0f 6f e6
MOVQ MM5, MM7 ; 0f 6f ef
PUNPCKHDQ MM5, MM4 ; 0f 6a ec ; 1234abcd
PUNPCKLDQ MM7, MM6 ; 0f 62 fe ; 5678efgh
MOVQ [index+ECX*2], MM7 ; 0f 7f 3c 48
MOVQ [index+ECX*2+8], MM5 ; 0f 7f 6c 48 08
ADD CL, BYTE 8 ; 80 c1 08
;-- cache line
CMP CL, BYTE 64 ; 80 f9 40
JNE multiplex ; 75 d5
; Pre-compute the words for T=16 through T=79
; It's better to do this now while the MMX registers are
; still free, rather than later when they are used.
; Only needs an extra 64 QWORDS of memory.
; Originally this performed 16 loops of these 4 operations:
; for a total of (80-16)=64 words
; T (W[T-3] XOR W[T-8] XOR W[T-14] XOR W[T-16])<<<1
; one (W[T-2] XOR W[T-7] XOR W[T-13] XOR W[T-15])<<<1
; two (W[T-1] XOR W[T-6] XOR W[T-12] XOR W[T-14])<<<1
; three (W[T ] XOR W[T-5] XOR W[T-11] XOR W[T-13])<<<1
; On the last operation you can see that it is dependent upon
; the first operation compleating, which wasn't much of a problem
; just make that the last XOR performed.
; Can be done in a loop of (4 ops * 16 loops)
; Then I rolled it back up into a loop and only did three at a time
; So I could have done (3*21)+1 but then unrolled it a little bit
; and it now does (6*10)+4 operations.
; This freed some MMX registers for some speed improvements
; doing RISC-like programming, seperate mload and meu operations.
; To avoid contention over the single MMX bit shifter on the K6-II
; left shifts are done by adding the register to itself, since the
; integer unit is free at that point. (The right shift can be done
; in parallel to it)
; If there was a faster way to get the sign bit of each half of
; an MMX reg, then that could be used in place of the PSRLD.
; If there was an MMX equivilant to ROR or ROL then this would
; be much faster, but there isn't so I need to make a copy of
; the register, shift one left, and the other right, and then
; OR them back together.
; And I think copying the MMX register out into one of the general
; E?X registers, doing the rotate there and copying back into MMX
; would take even longer than this.
; the following is about 440 clocks
; so at about 44 clocks per loop, and 6 words per loop
; it's about 7.3 clocks per two 32bit words
ugh:
%define counter ECX
%define bailout EDX
; T=16
MOV CL, BYTE 128 ; b1 80 ; shorter than MOV ECX, 128 ; b9 80 00 00 00
; high half of ECX still 0 from above
; there might be a superset dependancy here,
; but I don't see any speed diference with ECX vs CL
ADD counter, index ; 01 c1
MOV bailout, counter; 89 ca
; end when T=16+60
ADD bailout, 480 ; 81 c2 e0 01 00 00 ; 128+480
; skip the padding, shaves off a few clocks
JMP make_words ; e9 0a 00 00 00
; pad up to the next cache line so that the following loop goes faster
NOP ; 6
NOP ; 7
NOP ; 8
NOP ; 9
NOP ; A
NOP ; B
NOP ; C
NOP ; D
NOP ; E
NOP ; F
;-- cache line
make_words:
; mload T-3, T-2, T-1
MOVQ MM0, [counter-24]; 0f 6f 41 e8 ; T-3
MOVQ MM1, [counter-16]; 0f 6f 49 f0 ; T-2
MOVQ MM2, [counter-8] ; 0f 6f 51 f8 ; T-1
; insted of doing PXOR MM0, [ECX-64] etc. etc.
; doing MOVQ then PXOR shaves off a bunch of clocks here...
; mload T-8, T-7, T-6
MOVQ MM3, [counter-64]; 0f 6f 59 c0 ; T-8
MOVQ MM4, [counter-56]; 0f 6f 61 c8 ; T-7
PXOR MM0, MM3 ; 0f ef c3 ; T-8 XOR T-3
MOVQ MM5, [counter-48]; 0f 6f 69 d0 ; T-6
PXOR MM1, MM4 ; 0f ef cc ; T-7 XOR T-2
;-- cacheline-2 this will be long decoded
PXOR MM2, MM5 ; 0f ef d5 ; T-6 XOR T-1
; ...but doesn't save time when done here:
; mload T-14, T-13, T-12
PXOR MM0, [counter-112]; 0f ef 41 90 ; T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-104]; 0f ef 49 98 ; T-13 XOR T-7 XOR T-2
PXOR MM2, [counter-96] ; 0f ef 51 a0 ; T-12 XOR T-6 XOR T-1
; mload T-16, T-15, T-14
PXOR MM0, [counter-128]; 0f ef 41 80 ; T-16 XOR T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-120]; 0f ef 49 88 ; T-15 XOR T-13 XOR T-7 XOR T-2
; PXOR MM2, [ECX-112] was split up into a MOVQ, PXOR combo on MM3
MOVQ MM3, [counter-112]; 0f 6f 59 90 ; T-14
MOVQ MM4, MM0 ; 0f 6f e0 ; for left shift below
PXOR MM2, MM3 ; 0f ef d3 ; T-14 XOR T-12 XOR T-6 XOR T-1
NOP ; 90
;-- cache line
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>(32-1)
MOVQ MM5, MM1 ; 0f 6f e9 ; for left shift below
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>(32-1)
MOVQ MM6, MM2 ; 0f 6f f2 ; for left shift below
PADDD MM4, MM4 ; 0f fe e4 ; <<1
PSRLD MM2, 31 ; 0f 72 d2 1f ; >>(32-1)
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4 ; MM0<<<1=(MM0>>31)|(MM4<<1)
PADDD MM6, MM6 ; 0f fe f6 ; <<1
NOP ; 90
NOP ; 90
;-- cache line
; mstore T, T+1, T+2
POR MM1, MM5 ; 0f eb cd ; MM1<<<1=(MM1>>31)|(MM5<<1)
MOVQ [counter], MM0 ; 0f 7f 01
POR MM2, MM6 ; 0f eb d6 ; MM2<<<1=(MM2>>31)|(MM6<<1)
MOVQ [counter+8], MM1 ; 0f 7f 49 08
MOVQ [counter+16], MM2; 0f 7f 51 10
; next 3 ================================
; T=T+3
ADD counter, BYTE 24 ; 83 c1 18
; mload T-3, T-2, T-1
MOVQ MM0, [counter-24]; 0f 6f 41 e8 ; T-3
MOVQ MM1, [counter-16]; 0f 6f 49 f0 ; T-2
MOVQ MM2, [counter-8] ; 0f 6f 51 f8 ; T-1
;-- cache line
; mload T-8, T-7, T-6
MOVQ MM3, [counter-64]; 0f 6f 59 c0 ; T-8
MOVQ MM4, [counter-56]; 0f 6f 61 c8 ; T-7
PXOR MM0, MM3 ; 0f ef c3 ; T-8 XOR T-3
MOVQ MM5, [counter-48]; 0f 6f 69 d0 ; T-6
PXOR MM1, MM4 ; 0f ef cc ; T-7 XOR T-2
PXOR MM2, MM5 ; 0f ef d5 ; T-6 XOR T-1
; mload T-14, T-13, T-12
PXOR MM0, [counter-112]; 0f ef 41 90 ; T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-104]; 0f ef 49 98 ; T-13 XOR T-7 XOR T-2
PXOR MM2, [counter-96] ; 0f ef 51 a0 ; T-12 XOR T-6 XOR T-1
;-- cacheline +1
; mload T-16, T-15, T-14
PXOR MM0, [counter-128]; 0f ef 41 80 ; T-16 XOR T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-120]; 0f ef 49 88 ; T-15 XOR T-13 XOR T-7 XOR T-2
MOVQ MM4, MM0 ; 0f 6f e0 ; for left shift below
PXOR MM2, [counter-112]; 0f ef 51 90 ; T-14 XOR T-12 XOR T-6 XOR T-1
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>(32-1)
MOVQ MM5, MM1 ; 0f 6f e9 ; for left shift below
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>(32-1)
MOVQ MM6, MM2 ; 0f 6f f2 ; for left shift below
PADDD MM4, MM4 ; 0f fe e4 ; <<1
;-- cache line +1
PSRLD MM2, 31 ; 0f 72 d2 1f ; >>(32-1)
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4 ; MM0<<<1=(MM0>>31)|(MM4<<1)
PADDD MM6, MM6 ; 0f fe f6 ; <<1
; mstore T, T+1, T+2
POR MM1, MM5 ; 0f eb cd ; MM1<<<1=(MM1>>31)|(MM5<<1)
MOVQ [counter], MM0 ; 0f 7f 01 ; T
POR MM2, MM6 ; 0f eb d6 ; MM2<<<1=(MM2>>31)|(MM6<<1)
MOVQ [counter+8], MM1 ; 0f 7f 49 08 ; T+1
MOVQ [counter+16], MM2; 0f 7f 51 10 ; T+2
NOP ; 90
;-- cache line
; T=T+3
ADD counter, BYTE 24; 83 c1 18
CMP counter, bailout; 39 d1
JNE make_words ; 0f 85 16 ff ff ff
;--------------------------------------
; these four are about 30 clocks
; about 7.5 clocks per two 32bit words
; only doing two words at once here.
last_4:
MOVQ MM0, [ECX-24] ; 0f 6f 41 e8
MOVQ MM1, [ECX-16] ; 0f 6f 49 f0
MOVQ MM3, [ECX-64] ; 0f 6f 59 c0
MOVQ MM4, [ECX-56] ; 0f 6f 61 c8
PXOR MM0, MM3 ; 0f ef c3
NOP ; 90
NOP ; 90
;-- cache line
PXOR MM1, MM4 ; 0f ef cc
PXOR MM0, [ECX-112] ; 0f ef 41 90
PXOR MM1, [ECX-104] ; 0f ef 49 98
PXOR MM0, [ECX-128] ; 0f ef 41 80
PXOR MM1, [ECX-120] ; 0f ef 49 88
MOVQ MM4, MM0 ; 0f 6f e0
MOVQ MM5, MM1 ; 0f 6f e9
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>31
PADDD MM4, MM4 ; 0f fe e4 ; <<1
;-- cache line
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>31
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4
POR MM1, MM5 ; 0f eb cd
MOVQ [ECX], MM0 ; 0f 7f 01
MOVQ [ECX+8], MM1 ; 0f 7f 49 08
ADD ECX, BYTE 16 ; 83 c1 10 ; T=T+2
MOVQ MM3, [ECX-64] ; 0f 6f 59 c0
MOVQ MM0, [ECX-24] ; 0f 6f 41 e8
NOP ; 90
;-- cache line
MOVQ MM1, [ECX-16] ; 0f 6f 49 f0
MOVQ MM4, [ECX-56] ; 0f 6f 61 c8
PXOR MM0, MM3 ; 0f ef c3
PXOR MM1, MM4 ; 0f ef cc
PXOR MM0, [ECX-112] ; 0f ef 41 90
PXOR MM1, [ECX-104] ; 0f ef 49 98
PXOR MM0, [ECX-128] ; 0f ef 41 80
PXOR MM1, [ECX-120] ; 0f ef 49 88
NOP ; 90
NOP ; 90
;-- cache line
MOVQ MM4, MM0 ; 0f 6f e0
MOVQ MM5, MM1 ; 0f 6f e9
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>31
PADDD MM4, MM4 ; 0f fe e4 ; <<1
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>31
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4
POR MM1, MM5 ; 0f eb cd
MOVQ [ECX], MM0 ; 0f 7f 01
MOVQ [ECX+8], MM1 ; 0f 7f 49 08
;-- cache line
; from here to the end of round 19 is about 170 to 220 clocks
; Copy the two SHA-1 states into the first 5 registers.
; A, B, C, D, and E should be self-explanitory
; (see FIPS 180-1 otherwise)
; Each half of each register holds completely seperate
; contexts.
; the NOPs make it go faster, some weird memory allignment thing I
; think. The address to each MOVQ fall on divisible-by-4 addresses.
state_copy:
NOP ; 90
MOVQ MM0, QWORD [a] ; 0f 6f 05 c0 a9 04 08 (0x804a9c0)
NOP ; 90
MOVQ MM1, QWORD [b] ; 0f 6f 0d a8 a9 04 08 (0x804a9a8)
NOP ; 90
MOVQ MM2, QWORD [c] ; 0f 6f 15 b0 a9 04 08 (0x804a9b0)
NOP ; 90
MOVQ MM3, QWORD [d] ; 0f 6f 1d b8 a9 04 08 (0x804a9b8)
;-- cache line
NOP ; 90
MOVQ MM4, QWORD [e] ; 0f 6f 25 c8 a9 04 08 (0x804a9c8)
; This is the core of the SHA-1 transform
; The following is done 80 times (T = 0..79)
; A' = A<<<5 + F(B,C,D) + E + W[t] + K[t]
; B' = A
; C' = B<<<30
; D' = C
; E' = D
; Function F(B,C,D) (op_F) is (B AND C) OR (NOT B AND D) for T 0..19
; Function F(B,C,D) (op_G) is B XOR C XOR D for T 20..39 and 60..79
; Function F(B,C,D) (op_H) is (B AND C) OR (B AND D) OR (C AND D) for T 40..59
; I've named these three functions F, G, and H respectively (like in MD4)
; Note: FIPS 180-2 says (B AND C) XOR (NOT B AND D)
; (B AND C) XOR (B AND D) XOR (C AND D)
; which produces the exact same result.
; again, this would be *alot* faster if there was a MMX ROR/ROL equivalent
; op_H would be alot faster if I had one more 64 bit register.
; The rounds were rolled up into three iterations of a loop of six
; rounds each. This was to try to avoid flooding the instruction caches
; (Reduction of code size by 1/3 at the expense of a 18 loop increments, a compaire and jump.)
; (The old unrolled version used macro-expanded constants for W[T], no index computation neccesary)
; I can get away with keeping the entire SHA-1 state in registers
; and save several MOVs if I just cycle through the registers
; between each round (ie output variables are shifted for input)
; For example, four rounds of E goes like:
; E' = D
; E'' = D' = C
; E''' = D'' = C' = B<<<30
; E'''' = D''' = C'' = B'<<<30 = A (<<<30)
; So I'm just substituting the last set of state variables back in.
; And no you can't skip ahead because every state var is needed
; to compute A every round.
;= (B AND C) OR (NOT B AND D) ===============================================
%macro op_F 8
; These three instructions were moved up to the last few
; instructions of the previous operation (so they're at the end here)
; This was to promote better parallelism on the K6-II
; MOVQ %7, %2 ; A
; MOVQ MM6, %2 ; A
; PSRLD %7, 27 ; A >> 32-5
;-- cache line here first time through
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
MOVQ MM7, %3 ; 0f 6f f9 ; | B
POR %7, MM6 ; 0f eb ee ; A = A<<<5 = (A<<5)|(A>>27) |
MOVQ MM6, %3 ; 0f 6f f1 ; | B |
PADDD %7, %6 ; 0f fe ec ; A<<<5 + E | |
PAND MM6, %4 ; 0f db f2 ; | | B & C |
PADDD %7, [%1]; 0f fe 29 ; A<<<5 + E + W[t] | | |
PANDN MM7, %5 ; 0f df fb ; | | | | | ~B & D
PADDD %7, [%8]; 0f fe 2df8a50408;A<<<5 + E + W[t] + K | | | |
; 0f fe 23 if EBX used | | | | | | |
;--cache line here if K1 used | | | | | |
POR MM6, MM7; 0f eb f7 ; | | | | (B & C)|(~B & D)
;--cache line here if EBX used | | | | | |
MOVQ MM7, %3 ; 0f 6f f9 ; B | | | | | | |
PADDD %7, MM6 ; 0f fe ee ; A<<<5 + E + W[t] + K +((B & C)|(~B & D))
PSLLD %3, 30 ; 0f 72 f1 1e ; B << 30
MOVQ %6, %7 ; 0f 6f e5 ; "TEMP" (next A) into E
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ MM6, %7 ; 0f 6f f5 ; "TEMP" (next A) into E
PSRLD %6, 27 ; 0f 72 d4 1b ; A >> 32-5 (TEMP)
POR %3, MM7 ; 0f eb cf ; B <<<30
ADD %1, BYTE 8; 83 c1 08 ; next word T=T+1
;-- cache line +1 if K1
;-- cache line -2 if EBX
%endmacro
;-- cache line +8
; These were moved up to the bottom of each previous round.
; So this is needed for the first time through.
MOVQ MM5, MM0 ; 0f 6f e8 ; A
MOVQ MM6, MM0 ; 0f 6f f0 ; A
PSRLD MM5, 27 ; 0f 72 d5 1b ; A >> 32-5
MOV ECX, index ; 89 c1
MOV bailout, index ; 89 c2
ADD bailout, 144 ; 81 c2 90 00 00 00 ; faster than ADD DX, WORD 0x90 ; 66 81 c2 90 00
MOV EBX, K1 ; bb f8 a5 04 08
;-- cache line +2
; this is the only loop that doesn't start on a cache line
; when I do allign this to the start of the line (-2 bytes)
; it runs slower. So this is the fast version.
; Alternatively using [K1] and [EBX] is faster than useing only one.
rounds_0_17:
op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1
op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, K1
op_F ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
op_F ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
op_F ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
op_F ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_0_17 ; 0f 85 82 fe ff ff
; these two are 'about' 20-30 clocks depending on caches I guess
op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1 ; 18
op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX ; 19
; = B XOR C XOR D ===========================================================
%macro op_G 8
; See above.. for reference..
; MOVQ %7, %2 ; A
; MOVQ MM6, %2 ; A
; PSRLD %7, 27 ; A >> 32-5
;-- cache line here for first round
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
MOVQ MM7, %3 ; 0f 6f fd ; B
POR %7, MM6 ; 0f eb de ; A = A<<<5 = (A<<5)|(A>>27)
PXOR MM7, %4 ; 0f ef f8 ; | B ^ C
PADDD %7, %6 ; 0f fe da ; A<<<5 + E | |
PXOR MM7, %5 ; 0f ef f9 ; | | B ^ C ^ D
PADDD %7, MM7 ; 0f fe df ; A<<<5 + E + (B ^ C ^ D)
PADDD %7, [%1]; 0f fe 19 ; A<<<5 + E + W[t] + (B ^ C ^ D)
MOVQ MM7, %3 ; 0f 6f fd ; B | | |
;-- cache line (-4) | | |
; A<<<5 + E + W[t] + K + (B ^ C ^ D)
PADDD %7, [%8]; 0f fe 1d 00 a6 04 08 (0x804a600)
; 0f fe 13 if EBX is used
PSLLD %3, 30 ; 0f 72 f5 1e ; B << 30
MOVQ %6, %7 ; 0f 6f d3 ; "TEMP" (next A) into E
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ MM6, %7 ; 0f 6f f3 ; "TEMP" (next A) into E
POR %3, MM7 ; 0f eb ef ; B <<<30
PSRLD %6, 27 ; 0f 72 d2 1b ; A >> 32-5 (TEMP)
ADD ECX, BYTE 8 ; 83 c1 08 ; T=T+1
;-- cache +27 (or -5) if K2
;-- cache +23 (or -9) if EBX
%endmacro
MOV EBX, K2 ; bb 00 a6 04 08 ($0x804a600)
MOV bailout, ECX ; 89 ca
ADD bailout, 144 ; 81 c2 90 00 00 00 ; (8*18)-[(2*8) for 18 and 19]
; It's faster to jump over the NOPs
JMP rounds_20_37 ; e9 12 00 00 00
NOP ;0
NOP ;1
NOP ;2
NOP ;3
NOP ;4
NOP ;5
NOP ;6
NOP ;7
NOP ;8
NOP ;9
NOP ;a
NOP ;b
NOP ;c
NOP ;d
NOP ;e
NOP ;f
;-- cache line
; this is almost exactly 200 clock from here to...
rounds_20_37:
op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2
op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2
op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, EBX
op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_20_37 ; 0f 85 a6 fe ff ff
op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2 ; 38
op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2 ; 39
; ... here so about 1.6 clocks per opcode
; this is the slowest one, mostly due to register crunch
;= (B AND C) OR (B AND D) OR (C AND D) ======================================
%macro op_H 8
; for reference,
; MOVQ %7, %2 ; A
; MOVQ MM6, %2 ; A
; PSRLD %7, 27 ; A >> 32-5
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
MOVQ MM7, %3 ; 0f 6f fb ; B
POR %7, MM6 ; 0f eb ce ; A = A<<<5 = (A<<5)|(A>>27) |
PAND MM7, %5 ; 0f db fd ; | B & D
MOVQ MM6, %3 ; 0f 6f f3 ; | B | |
PAND MM6, %4 ; 0f db f4 ; | B & C | |
PADDD %7, %6 ; 0f fe c8 ; A<<<5 + E | | | |
POR MM6, MM7; 0f eb f7 ; | | (B & C)|(B & D)
PADDD %7, [%1]; 0f fe 09 ; A<<<5 + E + W[t] |
MOVQ MM7, %4 ; 0f 6f fc ; | | | C |
PADDD %7, [%8]; 0f fe 0b (ebx); A<<<5 + E + W[t] + K |
; 0f fe 0508a60408 (0x804a608)| | | |
PAND MM7, %5 ; 0f db fd ; | | | C & D |
POR MM6, MM7; 0f eb f7 ; | | | (C & D)|(B & C)|(B & D)
PADDD %7, MM6 ; 0f fe ce ; A<<<5 + E + W[t] + K +((B&C)|(B&D)|(C&D))
MOVQ MM7, %3 ; 0f 6f fb ; B
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ %6, %7 ; 0f 6f c1 ; "TEMP" (next A) into E
PSLLD %3, 30 ; 0f 72 f3 1e ; B << 30
MOVQ MM6, %7 ; 0f 6f f1 ; "TEMP" (next A) into E
POR %3, MM7 ; 0f eb df ; B <<<30
PSRLD %6, 27 ; 0f 72 d0 1b ; A >> 32-5 (TEMP)
ADD ECX, BYTE 8 ; 83 c1 08 ; T=T+1
%endmacro
MOV EBX, K3 ; bb 08 a6 04 08
MOV bailout, ECX ; 89 ca
ADD bailout, 144 ; 81 c2 90 00 00 00
NOP ;d
NOP ;e
NOP ;f
; This is also about 250 clocks
rounds_40_57:
op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3
op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3
op_H ECX, MM0, MM1, MM2, MM3, MM4, MM5, K3
op_H ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
op_H ECX, MM4, MM5, MM0, MM1, MM2, MM3, K3
op_H ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_40_57 ; 0f 85 47 fe ff ff
op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3 ; 58
op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3 ; 59
; No new macro definition here.. this is the same as rounds 20..39
MOV EBX, K4 ; bb 10 a6 04 08 ($0x804a610)
MOV bailout, ECX ; 89 ca
ADD bailout, 144 ; 81 c2 90 00 00 00
NOP ;d
NOP ;e
NOP ;f
rounds_60_77:
op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4
op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4
op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_60_77 ; 0f 85 a2 fe ff ff
op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4 ; 78
; This last step used to be like this...
; op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4 ; 79
; PADDD MMq, QWORD [z] foreach qz in 4a, 5b, 0c, 1d, 2e
; MOVQ QWORD [z], MMq foreach qz in 4a, 5b, 0c, 1d, 2e
; But I moved the final ADDs and MOVs up into the middle of the
; last round because there were some free execution slots on my K6-II
; that could be used... (It's faster this way basically)
; cache line +21
PADDD MM5, QWORD [b] ; 0f fe 2d a8 a9 04 08 ; add the original values back in
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
;-- cache line
MOVQ MM7, MM0 ; 0f 6f f8 ; B
POR MM4, MM6 ; 0f eb e6 ; A = A<<<5 = (A<<5)|(A>>27)
PXOR MM7, MM1 ; 0f ef f9 ; | B ^ C
PADDD MM4, MM3 ; 0f fe e3 ; A<<<5 + E | |
MOVQ QWORD [b], MM5 ; 0f 7f 2d a8 a9 04 08 ; move final B back out | |
PXOR MM7, MM2 ; 0f ef fa ; | | B ^ C ^ D
PADDD MM1, QWORD [d] ; 0f fe 0d b8 a9 04 08 ; add the original values back in |
PADDD MM4, MM7 ; 0f fe e7 ; A<<<5 + E + (B ^ C ^ D)
;-- cache line | | | | |
PADDD MM2, QWORD [e] ; 0f fe 15 c8 a9 04 08 ; add the original values back in
PADDD MM4, [ECX] ; 0f fe 21 ; A<<<5 + E + W[t] + (B ^ C ^ D)
MOVQ QWORD [d], MM1 ; 0f 7f 0d b8 a9 04 08 ; move final D back out | | |
PADDD MM4, [K4] ; 0f fe 25 10 a6 04 08 ; A<<<5 + E + W[t] + K + (B ^ C ^ D)
; slower: PADDD MM4, [EBX] ; 0f fe 23
MOVQ MM7, MM0 ; 0f 6f f8 ; B
MOVQ QWORD [e], MM2 ; 0f 7f 15 c8 a9 04 08 ; move final E back out
;-- cache line +2
PSLLD MM0, 30 ; 0f 72 f0 1e ; B << 30
PADDD MM4, QWORD [a] ; 0f fe 25 c0 a9 04 08 ; add the original values back in
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ QWORD [a], MM4 ; 0f 7f 25 c0 a9 04 08 ; move final A back out
POR MM0, MM7 ; 0f eb c7 ; B <<<30
PADDD MM0, QWORD [c] ; 0f fe 05 b0 a9 04 08 ; add the original values back in
;-- cache line +2
MOVQ QWORD [c], MM0 ; 0f 7f 05 b0 a9 04 08 ; move final C back out
end:
MOV EAX, 0 ; b8 00 00 00 00 ; return value, probably not needed
POP EDX ; 5a
POP ESI ; 5e
POP EBX ; 5b
; using LEAVE ; c9 goes slightly slower than the following...
MOV ESP, EBP; 89 ec
POP EBP ; 5d
RET ; c3 ; gcc cleans up stack
no subject
*bursts into flames*
no subject
no subject
Sadly this heavy optimization will probably lead to performance drops on K7/K8, and almost definitely on NetBurst.
K7/K8 Performance
If I reorder the instructions for the K7, the same pattern should be good for the K8 too. Since, internally, the two chips are very similar. (Three ALU units, etc.)
Re: K7/K8 Performance
SSE Implementation.
The 'key-setup' stuff that SHA-1 does before the Feistel rounds adds alot of overhead too it (several hundred clock cycles), compared to MD5 which doesn't have that. That's why my earlier MD5 implementation was 1000 clock cycles faster. (Aside from 16 fewer rounds too.)
The other things that would make this alot faster:
(MMX has alot of stupid things about it. AltiVec is so much better.)
The discussion you refer to, seem to mostly be about optimizing that "key-setup" part at the beginning:
Which I got down to about 440 clock cycles... for two independent data blocks. With 128-bit wide registers, and more ALU units, I could probably get it much faster.
Looking at, http://www.arctic.org/~dean/crypto/sha1.html, I think I see what they're doing. They're putting each "column" in one SSE register, and then just XORing everything that way. That is to say:
XMM0 would be:
w[t], w[t-1], w[t-2],w[t-3]XMM1 would be:
w[t-5], w[t-6], w[t-7], w[t-8]XMM2 would be:
w[t-11], w[t-12],etc. etc.Then just
PXOR XMM0, XMM1; PXOR XMM0, XMM2; PXOR XMM0, XMM3;etc.I'm not sure if that'd help me on MMX. Just these sixteen 32-bit values will completely fill all eight 64-bit MMX register. And I'm doing two parallel blocks at once. So I'd still have to do this routine twice in a row.
The...
The (X∧Y)∨(‾X∧Z) ≡ (Y⊕Z)∧X)⊕Z trick I'll have to remember though. I'm not sure if it would have helped at all here though, because I can do (X∧Y) in parallel with (‾X∧Z), but in the other form, I have to do each operation serially. There was another boolean identity that I learned a year ago, that I could apply here. And that's what I was referring to when I said that I though I could shave off another 50-100 clock cycles or so.