I haven't had time to post anything new in here. (Malware analysis and stuff.) So here's something old I wrote, that few people have seen.
I thought of a way to shave a few (maybe 50) more clock cycles off this. But I haven't had time to update this.
There's a small .c program to demonstrate how to use this, and to perform some unit tests, here: http://www.virtadpt.net/~julia/sha1.zip
I thought of a way to shave a few (maybe 50) more clock cycles off this. But I haven't had time to update this.
There's a small .c program to demonstrate how to use this, and to perform some unit tests, here: http://www.virtadpt.net/~julia/sha1.zip
; SHA-1 transform
; Version: 0.14
; Date: Jan 20, 2004
; Author: Julia Wolf
; Copyright 2004, Distribution in accordence with the GPL (http://www.gnu.org/)
; I wrote the first version (0.1) of this in about a day working only off of
; the FIPS 180-1 standard document itself (and my previous MMX MD-5 routine).
; Then I spend a week or so shaving off about 300 clock cycles.
; It's probabily best to have a copy of FIPS 180-1 sitting in front of you if
; you want to try to follow what the code is doing.
; The bizzare instruction order, the NOPs and seemingly random
; use of the same value in registers and from memory load, are all
; to shave off a few clock cycles here and there.
; In summary, these instructions have been layed out to completely
; saturate all of the pipelines and execution units in the AMD K6-II
; whenever possible, and also make the instruction decoder happy.
; (it really is faster with the NOP's)
; I can probibally shave off a few more clock cycles, but I've pretty much
; hit the flat part of the curve now. It's not going to get any faster than
; it is now by more than a few clocks unless I start removing productive
; code. (Or adding more 64 bit registers to the silicon)
; Statistics on my AMD K6-II (fastest machine I have because I'm poor)
; (Timmed with RDTSC)
; best time avg. ~1295 clocks
; 826 opcodes to final RET, including NOP's
; 777 opcodes without NOP's
; 1295/777 = 1.66 opcodes per clock cycle...
; ...or about 0.60 clocks per opcode.
; total length 2720 bytes including the NOP's after RET filling the cache
; line. Or, exactly 85 cache lines. The four main rounds were rolled back up
; to reduce cache misses on the instruction cache.
; This function assumes that the 512-bit blocks have already been padded
; at the end.
; You can call this like this:
; extern void sha1_mmx(unsigned char* x, unsigned char* y);
; unsigned char a[8];
; unsigned char b[8];
; unsigned char c[8];
; unsigned char d[8];
; unsigned char e[8];
; unsigned char x[64], y[64];
; sha1_mmx(x,y)
; x and y are your 512 bits of message, padded apropreately
; ('1' bit with the size mod 2^64 at the end, see FIPS 180-1)
; a, b, c, d, and e are the two contexts, foreach z in a,b,c,d,e:
; z[0],z[1],z[2],z[3] belong to the first context and
; z[4],z[5],z[6],z[7] belong to the second context.
; I really don't know why I bother with using byte arrays rather than int's
; since this will only ever run on a Pentium MMX and above I don't have to
; worry about endedness issues. (force of habbit I guess)
[SECTION .bss align=8 noexec]
W RESQ 80 ; first 16 elements same as M see FIPS 180-1
[SECTION .rodata align=8 noexec]
K1 dd 0x5A827999, 0x5A827999 ; for T[ 0..19]
K2 dd 0x6ED9EBA1, 0x6ED9EBA1 ; for T[20..39]
K3 dd 0x8F1BBCDC, 0x8F1BBCDC ; for T[40..59]
K4 dd 0xCA62C1D6, 0xCA62C1D6 ; for T[60..79]
; TODO: pass the context as an argument insted of using globals
[SECTION .data align=8 noexec]
extern a
extern b
extern c
extern d
extern e
[SECTION .text allign=8 nowrite]
global sha1_mmx:function
sha1_mmx:
; This is faster than ENTER 0,0 by a few clocks, strangely
; EBP, ESI, and stuff arn't really used here, so
; I could take these ops out, there are no local variables.
PUSH EBP ; 55
MOV EBP, ESP ; 89 e5
PUSH EBX ; 53
PUSH ESI ; 56
PUSH EDX ; 52
%define X EDX
%define Y EBX
MOV Y, [EBP+8] ; 8b 55 08 ; arrrg 2
MOV X, [EBP+12] ; 8b 5d 0c ; arrrg 1
; I think this is the only instruction that makes this
; not compatible with the original Pentium MMX
; (replace with EMMS if there's a problem)
FEMMS ; 0f 0e
PREFETCH [X+32] ; 0f 0d 42 20
PREFETCH [Y+32] ; 0f 0d 43 20
XOR ECX, ECX ; 31 c9 ; shorter than doing MOV ECX, 0 ; b9 00 00 00 00
; Rather than using the [disp32+index] addressing mode
; storing [W] in unused register EAX allows for shorter
; instruction length.
%define index EAX
MOV index, W ; b8 38 a7 04 08 ($0x804a738)
NOP ; 90
NOP ; 90
NOP ; 90
;-- cache line boundry
; This routine takes 64 bits from each block of 512 bits
; and mixes them together into alternating blocks of 32 bits
; for example, x[n] and y[n] are 32 bits each:
; input: x[1], x[2], x[3], x[4], x[5] ...
; input: y[1], y[2], y[3], y[4], y[5] ...
; output: x[1], y[1], x[2], y[2], x[3], y[3], x[4], y[4], ...
; The output is placed into the first 16 QWORDS of [W]
; This is done so that each 32 bits fills the top and bottom
; halves of a 64 bit MMX register. Allowing both to be worked
; on in parallel. (It's the rationale behind this whole SHA-1
; function if you havn't guessed already)
; only need to do this for 8 DWORDs to make 16 QWORDs (hence ECX)
; this is about 20 to 90 clocks.
; These were changed to byte-sized version to save space
; and ECX never goes above 256 so it's ok
; ADD ECX, 8 ; 81 c1 08 00 00 00
; CMP ECX, 64 ; 81 f9 40 00 00 00
multiplex:
MOVQ MM6, [X+ECX] ; 0f 6f 34 0a ; 12345678
MOVQ MM7, [Y+ECX] ; 0f 6f 3c 0b ; abcdefgh
MOVQ MM4, MM6 ; 0f 6f e6
MOVQ MM5, MM7 ; 0f 6f ef
PUNPCKHDQ MM5, MM4 ; 0f 6a ec ; 1234abcd
PUNPCKLDQ MM7, MM6 ; 0f 62 fe ; 5678efgh
MOVQ [index+ECX*2], MM7 ; 0f 7f 3c 48
MOVQ [index+ECX*2+8], MM5 ; 0f 7f 6c 48 08
ADD CL, BYTE 8 ; 80 c1 08
;-- cache line
CMP CL, BYTE 64 ; 80 f9 40
JNE multiplex ; 75 d5
; Pre-compute the words for T=16 through T=79
; It's better to do this now while the MMX registers are
; still free, rather than later when they are used.
; Only needs an extra 64 QWORDS of memory.
; Originally this performed 16 loops of these 4 operations:
; for a total of (80-16)=64 words
; T (W[T-3] XOR W[T-8] XOR W[T-14] XOR W[T-16])<<<1
; one (W[T-2] XOR W[T-7] XOR W[T-13] XOR W[T-15])<<<1
; two (W[T-1] XOR W[T-6] XOR W[T-12] XOR W[T-14])<<<1
; three (W[T ] XOR W[T-5] XOR W[T-11] XOR W[T-13])<<<1
; On the last operation you can see that it is dependent upon
; the first operation compleating, which wasn't much of a problem
; just make that the last XOR performed.
; Can be done in a loop of (4 ops * 16 loops)
; Then I rolled it back up into a loop and only did three at a time
; So I could have done (3*21)+1 but then unrolled it a little bit
; and it now does (6*10)+4 operations.
; This freed some MMX registers for some speed improvements
; doing RISC-like programming, seperate mload and meu operations.
; To avoid contention over the single MMX bit shifter on the K6-II
; left shifts are done by adding the register to itself, since the
; integer unit is free at that point. (The right shift can be done
; in parallel to it)
; If there was a faster way to get the sign bit of each half of
; an MMX reg, then that could be used in place of the PSRLD.
; If there was an MMX equivilant to ROR or ROL then this would
; be much faster, but there isn't so I need to make a copy of
; the register, shift one left, and the other right, and then
; OR them back together.
; And I think copying the MMX register out into one of the general
; E?X registers, doing the rotate there and copying back into MMX
; would take even longer than this.
; the following is about 440 clocks
; so at about 44 clocks per loop, and 6 words per loop
; it's about 7.3 clocks per two 32bit words
ugh:
%define counter ECX
%define bailout EDX
; T=16
MOV CL, BYTE 128 ; b1 80 ; shorter than MOV ECX, 128 ; b9 80 00 00 00
; high half of ECX still 0 from above
; there might be a superset dependancy here,
; but I don't see any speed diference with ECX vs CL
ADD counter, index ; 01 c1
MOV bailout, counter; 89 ca
; end when T=16+60
ADD bailout, 480 ; 81 c2 e0 01 00 00 ; 128+480
; skip the padding, shaves off a few clocks
JMP make_words ; e9 0a 00 00 00
; pad up to the next cache line so that the following loop goes faster
NOP ; 6
NOP ; 7
NOP ; 8
NOP ; 9
NOP ; A
NOP ; B
NOP ; C
NOP ; D
NOP ; E
NOP ; F
;-- cache line
make_words:
; mload T-3, T-2, T-1
MOVQ MM0, [counter-24]; 0f 6f 41 e8 ; T-3
MOVQ MM1, [counter-16]; 0f 6f 49 f0 ; T-2
MOVQ MM2, [counter-8] ; 0f 6f 51 f8 ; T-1
; insted of doing PXOR MM0, [ECX-64] etc. etc.
; doing MOVQ then PXOR shaves off a bunch of clocks here...
; mload T-8, T-7, T-6
MOVQ MM3, [counter-64]; 0f 6f 59 c0 ; T-8
MOVQ MM4, [counter-56]; 0f 6f 61 c8 ; T-7
PXOR MM0, MM3 ; 0f ef c3 ; T-8 XOR T-3
MOVQ MM5, [counter-48]; 0f 6f 69 d0 ; T-6
PXOR MM1, MM4 ; 0f ef cc ; T-7 XOR T-2
;-- cacheline-2 this will be long decoded
PXOR MM2, MM5 ; 0f ef d5 ; T-6 XOR T-1
; ...but doesn't save time when done here:
; mload T-14, T-13, T-12
PXOR MM0, [counter-112]; 0f ef 41 90 ; T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-104]; 0f ef 49 98 ; T-13 XOR T-7 XOR T-2
PXOR MM2, [counter-96] ; 0f ef 51 a0 ; T-12 XOR T-6 XOR T-1
; mload T-16, T-15, T-14
PXOR MM0, [counter-128]; 0f ef 41 80 ; T-16 XOR T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-120]; 0f ef 49 88 ; T-15 XOR T-13 XOR T-7 XOR T-2
; PXOR MM2, [ECX-112] was split up into a MOVQ, PXOR combo on MM3
MOVQ MM3, [counter-112]; 0f 6f 59 90 ; T-14
MOVQ MM4, MM0 ; 0f 6f e0 ; for left shift below
PXOR MM2, MM3 ; 0f ef d3 ; T-14 XOR T-12 XOR T-6 XOR T-1
NOP ; 90
;-- cache line
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>(32-1)
MOVQ MM5, MM1 ; 0f 6f e9 ; for left shift below
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>(32-1)
MOVQ MM6, MM2 ; 0f 6f f2 ; for left shift below
PADDD MM4, MM4 ; 0f fe e4 ; <<1
PSRLD MM2, 31 ; 0f 72 d2 1f ; >>(32-1)
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4 ; MM0<<<1=(MM0>>31)|(MM4<<1)
PADDD MM6, MM6 ; 0f fe f6 ; <<1
NOP ; 90
NOP ; 90
;-- cache line
; mstore T, T+1, T+2
POR MM1, MM5 ; 0f eb cd ; MM1<<<1=(MM1>>31)|(MM5<<1)
MOVQ [counter], MM0 ; 0f 7f 01
POR MM2, MM6 ; 0f eb d6 ; MM2<<<1=(MM2>>31)|(MM6<<1)
MOVQ [counter+8], MM1 ; 0f 7f 49 08
MOVQ [counter+16], MM2; 0f 7f 51 10
; next 3 ================================
; T=T+3
ADD counter, BYTE 24 ; 83 c1 18
; mload T-3, T-2, T-1
MOVQ MM0, [counter-24]; 0f 6f 41 e8 ; T-3
MOVQ MM1, [counter-16]; 0f 6f 49 f0 ; T-2
MOVQ MM2, [counter-8] ; 0f 6f 51 f8 ; T-1
;-- cache line
; mload T-8, T-7, T-6
MOVQ MM3, [counter-64]; 0f 6f 59 c0 ; T-8
MOVQ MM4, [counter-56]; 0f 6f 61 c8 ; T-7
PXOR MM0, MM3 ; 0f ef c3 ; T-8 XOR T-3
MOVQ MM5, [counter-48]; 0f 6f 69 d0 ; T-6
PXOR MM1, MM4 ; 0f ef cc ; T-7 XOR T-2
PXOR MM2, MM5 ; 0f ef d5 ; T-6 XOR T-1
; mload T-14, T-13, T-12
PXOR MM0, [counter-112]; 0f ef 41 90 ; T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-104]; 0f ef 49 98 ; T-13 XOR T-7 XOR T-2
PXOR MM2, [counter-96] ; 0f ef 51 a0 ; T-12 XOR T-6 XOR T-1
;-- cacheline +1
; mload T-16, T-15, T-14
PXOR MM0, [counter-128]; 0f ef 41 80 ; T-16 XOR T-14 XOR T-8 XOR T-3
PXOR MM1, [counter-120]; 0f ef 49 88 ; T-15 XOR T-13 XOR T-7 XOR T-2
MOVQ MM4, MM0 ; 0f 6f e0 ; for left shift below
PXOR MM2, [counter-112]; 0f ef 51 90 ; T-14 XOR T-12 XOR T-6 XOR T-1
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>(32-1)
MOVQ MM5, MM1 ; 0f 6f e9 ; for left shift below
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>(32-1)
MOVQ MM6, MM2 ; 0f 6f f2 ; for left shift below
PADDD MM4, MM4 ; 0f fe e4 ; <<1
;-- cache line +1
PSRLD MM2, 31 ; 0f 72 d2 1f ; >>(32-1)
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4 ; MM0<<<1=(MM0>>31)|(MM4<<1)
PADDD MM6, MM6 ; 0f fe f6 ; <<1
; mstore T, T+1, T+2
POR MM1, MM5 ; 0f eb cd ; MM1<<<1=(MM1>>31)|(MM5<<1)
MOVQ [counter], MM0 ; 0f 7f 01 ; T
POR MM2, MM6 ; 0f eb d6 ; MM2<<<1=(MM2>>31)|(MM6<<1)
MOVQ [counter+8], MM1 ; 0f 7f 49 08 ; T+1
MOVQ [counter+16], MM2; 0f 7f 51 10 ; T+2
NOP ; 90
;-- cache line
; T=T+3
ADD counter, BYTE 24; 83 c1 18
CMP counter, bailout; 39 d1
JNE make_words ; 0f 85 16 ff ff ff
;--------------------------------------
; these four are about 30 clocks
; about 7.5 clocks per two 32bit words
; only doing two words at once here.
last_4:
MOVQ MM0, [ECX-24] ; 0f 6f 41 e8
MOVQ MM1, [ECX-16] ; 0f 6f 49 f0
MOVQ MM3, [ECX-64] ; 0f 6f 59 c0
MOVQ MM4, [ECX-56] ; 0f 6f 61 c8
PXOR MM0, MM3 ; 0f ef c3
NOP ; 90
NOP ; 90
;-- cache line
PXOR MM1, MM4 ; 0f ef cc
PXOR MM0, [ECX-112] ; 0f ef 41 90
PXOR MM1, [ECX-104] ; 0f ef 49 98
PXOR MM0, [ECX-128] ; 0f ef 41 80
PXOR MM1, [ECX-120] ; 0f ef 49 88
MOVQ MM4, MM0 ; 0f 6f e0
MOVQ MM5, MM1 ; 0f 6f e9
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>31
PADDD MM4, MM4 ; 0f fe e4 ; <<1
;-- cache line
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>31
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4
POR MM1, MM5 ; 0f eb cd
MOVQ [ECX], MM0 ; 0f 7f 01
MOVQ [ECX+8], MM1 ; 0f 7f 49 08
ADD ECX, BYTE 16 ; 83 c1 10 ; T=T+2
MOVQ MM3, [ECX-64] ; 0f 6f 59 c0
MOVQ MM0, [ECX-24] ; 0f 6f 41 e8
NOP ; 90
;-- cache line
MOVQ MM1, [ECX-16] ; 0f 6f 49 f0
MOVQ MM4, [ECX-56] ; 0f 6f 61 c8
PXOR MM0, MM3 ; 0f ef c3
PXOR MM1, MM4 ; 0f ef cc
PXOR MM0, [ECX-112] ; 0f ef 41 90
PXOR MM1, [ECX-104] ; 0f ef 49 98
PXOR MM0, [ECX-128] ; 0f ef 41 80
PXOR MM1, [ECX-120] ; 0f ef 49 88
NOP ; 90
NOP ; 90
;-- cache line
MOVQ MM4, MM0 ; 0f 6f e0
MOVQ MM5, MM1 ; 0f 6f e9
PSRLD MM0, 31 ; 0f 72 d0 1f ; >>31
PADDD MM4, MM4 ; 0f fe e4 ; <<1
PSRLD MM1, 31 ; 0f 72 d1 1f ; >>31
PADDD MM5, MM5 ; 0f fe ed ; <<1
POR MM0, MM4 ; 0f eb c4
POR MM1, MM5 ; 0f eb cd
MOVQ [ECX], MM0 ; 0f 7f 01
MOVQ [ECX+8], MM1 ; 0f 7f 49 08
;-- cache line
; from here to the end of round 19 is about 170 to 220 clocks
; Copy the two SHA-1 states into the first 5 registers.
; A, B, C, D, and E should be self-explanitory
; (see FIPS 180-1 otherwise)
; Each half of each register holds completely seperate
; contexts.
; the NOPs make it go faster, some weird memory allignment thing I
; think. The address to each MOVQ fall on divisible-by-4 addresses.
state_copy:
NOP ; 90
MOVQ MM0, QWORD [a] ; 0f 6f 05 c0 a9 04 08 (0x804a9c0)
NOP ; 90
MOVQ MM1, QWORD [b] ; 0f 6f 0d a8 a9 04 08 (0x804a9a8)
NOP ; 90
MOVQ MM2, QWORD [c] ; 0f 6f 15 b0 a9 04 08 (0x804a9b0)
NOP ; 90
MOVQ MM3, QWORD [d] ; 0f 6f 1d b8 a9 04 08 (0x804a9b8)
;-- cache line
NOP ; 90
MOVQ MM4, QWORD [e] ; 0f 6f 25 c8 a9 04 08 (0x804a9c8)
; This is the core of the SHA-1 transform
; The following is done 80 times (T = 0..79)
; A' = A<<<5 + F(B,C,D) + E + W[t] + K[t]
; B' = A
; C' = B<<<30
; D' = C
; E' = D
; Function F(B,C,D) (op_F) is (B AND C) OR (NOT B AND D) for T 0..19
; Function F(B,C,D) (op_G) is B XOR C XOR D for T 20..39 and 60..79
; Function F(B,C,D) (op_H) is (B AND C) OR (B AND D) OR (C AND D) for T 40..59
; I've named these three functions F, G, and H respectively (like in MD4)
; Note: FIPS 180-2 says (B AND C) XOR (NOT B AND D)
; (B AND C) XOR (B AND D) XOR (C AND D)
; which produces the exact same result.
; again, this would be *alot* faster if there was a MMX ROR/ROL equivalent
; op_H would be alot faster if I had one more 64 bit register.
; The rounds were rolled up into three iterations of a loop of six
; rounds each. This was to try to avoid flooding the instruction caches
; (Reduction of code size by 1/3 at the expense of a 18 loop increments, a compaire and jump.)
; (The old unrolled version used macro-expanded constants for W[T], no index computation neccesary)
; I can get away with keeping the entire SHA-1 state in registers
; and save several MOVs if I just cycle through the registers
; between each round (ie output variables are shifted for input)
; For example, four rounds of E goes like:
; E' = D
; E'' = D' = C
; E''' = D'' = C' = B<<<30
; E'''' = D''' = C'' = B'<<<30 = A (<<<30)
; So I'm just substituting the last set of state variables back in.
; And no you can't skip ahead because every state var is needed
; to compute A every round.
;= (B AND C) OR (NOT B AND D) ===============================================
%macro op_F 8
; These three instructions were moved up to the last few
; instructions of the previous operation (so they're at the end here)
; This was to promote better parallelism on the K6-II
; MOVQ %7, %2 ; A
; MOVQ MM6, %2 ; A
; PSRLD %7, 27 ; A >> 32-5
;-- cache line here first time through
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
MOVQ MM7, %3 ; 0f 6f f9 ; | B
POR %7, MM6 ; 0f eb ee ; A = A<<<5 = (A<<5)|(A>>27) |
MOVQ MM6, %3 ; 0f 6f f1 ; | B |
PADDD %7, %6 ; 0f fe ec ; A<<<5 + E | |
PAND MM6, %4 ; 0f db f2 ; | | B & C |
PADDD %7, [%1]; 0f fe 29 ; A<<<5 + E + W[t] | | |
PANDN MM7, %5 ; 0f df fb ; | | | | | ~B & D
PADDD %7, [%8]; 0f fe 2df8a50408;A<<<5 + E + W[t] + K | | | |
; 0f fe 23 if EBX used | | | | | | |
;--cache line here if K1 used | | | | | |
POR MM6, MM7; 0f eb f7 ; | | | | (B & C)|(~B & D)
;--cache line here if EBX used | | | | | |
MOVQ MM7, %3 ; 0f 6f f9 ; B | | | | | | |
PADDD %7, MM6 ; 0f fe ee ; A<<<5 + E + W[t] + K +((B & C)|(~B & D))
PSLLD %3, 30 ; 0f 72 f1 1e ; B << 30
MOVQ %6, %7 ; 0f 6f e5 ; "TEMP" (next A) into E
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ MM6, %7 ; 0f 6f f5 ; "TEMP" (next A) into E
PSRLD %6, 27 ; 0f 72 d4 1b ; A >> 32-5 (TEMP)
POR %3, MM7 ; 0f eb cf ; B <<<30
ADD %1, BYTE 8; 83 c1 08 ; next word T=T+1
;-- cache line +1 if K1
;-- cache line -2 if EBX
%endmacro
;-- cache line +8
; These were moved up to the bottom of each previous round.
; So this is needed for the first time through.
MOVQ MM5, MM0 ; 0f 6f e8 ; A
MOVQ MM6, MM0 ; 0f 6f f0 ; A
PSRLD MM5, 27 ; 0f 72 d5 1b ; A >> 32-5
MOV ECX, index ; 89 c1
MOV bailout, index ; 89 c2
ADD bailout, 144 ; 81 c2 90 00 00 00 ; faster than ADD DX, WORD 0x90 ; 66 81 c2 90 00
MOV EBX, K1 ; bb f8 a5 04 08
;-- cache line +2
; this is the only loop that doesn't start on a cache line
; when I do allign this to the start of the line (-2 bytes)
; it runs slower. So this is the fast version.
; Alternatively using [K1] and [EBX] is faster than useing only one.
rounds_0_17:
op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1
op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, K1
op_F ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
op_F ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
op_F ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
op_F ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_0_17 ; 0f 85 82 fe ff ff
; these two are 'about' 20-30 clocks depending on caches I guess
op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1 ; 18
op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX ; 19
; = B XOR C XOR D ===========================================================
%macro op_G 8
; See above.. for reference..
; MOVQ %7, %2 ; A
; MOVQ MM6, %2 ; A
; PSRLD %7, 27 ; A >> 32-5
;-- cache line here for first round
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
MOVQ MM7, %3 ; 0f 6f fd ; B
POR %7, MM6 ; 0f eb de ; A = A<<<5 = (A<<5)|(A>>27)
PXOR MM7, %4 ; 0f ef f8 ; | B ^ C
PADDD %7, %6 ; 0f fe da ; A<<<5 + E | |
PXOR MM7, %5 ; 0f ef f9 ; | | B ^ C ^ D
PADDD %7, MM7 ; 0f fe df ; A<<<5 + E + (B ^ C ^ D)
PADDD %7, [%1]; 0f fe 19 ; A<<<5 + E + W[t] + (B ^ C ^ D)
MOVQ MM7, %3 ; 0f 6f fd ; B | | |
;-- cache line (-4) | | |
; A<<<5 + E + W[t] + K + (B ^ C ^ D)
PADDD %7, [%8]; 0f fe 1d 00 a6 04 08 (0x804a600)
; 0f fe 13 if EBX is used
PSLLD %3, 30 ; 0f 72 f5 1e ; B << 30
MOVQ %6, %7 ; 0f 6f d3 ; "TEMP" (next A) into E
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ MM6, %7 ; 0f 6f f3 ; "TEMP" (next A) into E
POR %3, MM7 ; 0f eb ef ; B <<<30
PSRLD %6, 27 ; 0f 72 d2 1b ; A >> 32-5 (TEMP)
ADD ECX, BYTE 8 ; 83 c1 08 ; T=T+1
;-- cache +27 (or -5) if K2
;-- cache +23 (or -9) if EBX
%endmacro
MOV EBX, K2 ; bb 00 a6 04 08 ($0x804a600)
MOV bailout, ECX ; 89 ca
ADD bailout, 144 ; 81 c2 90 00 00 00 ; (8*18)-[(2*8) for 18 and 19]
; It's faster to jump over the NOPs
JMP rounds_20_37 ; e9 12 00 00 00
NOP ;0
NOP ;1
NOP ;2
NOP ;3
NOP ;4
NOP ;5
NOP ;6
NOP ;7
NOP ;8
NOP ;9
NOP ;a
NOP ;b
NOP ;c
NOP ;d
NOP ;e
NOP ;f
;-- cache line
; this is almost exactly 200 clock from here to...
rounds_20_37:
op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2
op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2
op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, EBX
op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_20_37 ; 0f 85 a6 fe ff ff
op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2 ; 38
op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2 ; 39
; ... here so about 1.6 clocks per opcode
; this is the slowest one, mostly due to register crunch
;= (B AND C) OR (B AND D) OR (C AND D) ======================================
%macro op_H 8
; for reference,
; MOVQ %7, %2 ; A
; MOVQ MM6, %2 ; A
; PSRLD %7, 27 ; A >> 32-5
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
MOVQ MM7, %3 ; 0f 6f fb ; B
POR %7, MM6 ; 0f eb ce ; A = A<<<5 = (A<<5)|(A>>27) |
PAND MM7, %5 ; 0f db fd ; | B & D
MOVQ MM6, %3 ; 0f 6f f3 ; | B | |
PAND MM6, %4 ; 0f db f4 ; | B & C | |
PADDD %7, %6 ; 0f fe c8 ; A<<<5 + E | | | |
POR MM6, MM7; 0f eb f7 ; | | (B & C)|(B & D)
PADDD %7, [%1]; 0f fe 09 ; A<<<5 + E + W[t] |
MOVQ MM7, %4 ; 0f 6f fc ; | | | C |
PADDD %7, [%8]; 0f fe 0b (ebx); A<<<5 + E + W[t] + K |
; 0f fe 0508a60408 (0x804a608)| | | |
PAND MM7, %5 ; 0f db fd ; | | | C & D |
POR MM6, MM7; 0f eb f7 ; | | | (C & D)|(B & C)|(B & D)
PADDD %7, MM6 ; 0f fe ce ; A<<<5 + E + W[t] + K +((B&C)|(B&D)|(C&D))
MOVQ MM7, %3 ; 0f 6f fb ; B
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ %6, %7 ; 0f 6f c1 ; "TEMP" (next A) into E
PSLLD %3, 30 ; 0f 72 f3 1e ; B << 30
MOVQ MM6, %7 ; 0f 6f f1 ; "TEMP" (next A) into E
POR %3, MM7 ; 0f eb df ; B <<<30
PSRLD %6, 27 ; 0f 72 d0 1b ; A >> 32-5 (TEMP)
ADD ECX, BYTE 8 ; 83 c1 08 ; T=T+1
%endmacro
MOV EBX, K3 ; bb 08 a6 04 08
MOV bailout, ECX ; 89 ca
ADD bailout, 144 ; 81 c2 90 00 00 00
NOP ;d
NOP ;e
NOP ;f
; This is also about 250 clocks
rounds_40_57:
op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3
op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3
op_H ECX, MM0, MM1, MM2, MM3, MM4, MM5, K3
op_H ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
op_H ECX, MM4, MM5, MM0, MM1, MM2, MM3, K3
op_H ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_40_57 ; 0f 85 47 fe ff ff
op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3 ; 58
op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3 ; 59
; No new macro definition here.. this is the same as rounds 20..39
MOV EBX, K4 ; bb 10 a6 04 08 ($0x804a610)
MOV bailout, ECX ; 89 ca
ADD bailout, 144 ; 81 c2 90 00 00 00
NOP ;d
NOP ;e
NOP ;f
rounds_60_77:
op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4
op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4
op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP ECX, bailout ; 39 d1
JNE rounds_60_77 ; 0f 85 a2 fe ff ff
op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4 ; 78
; This last step used to be like this...
; op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4 ; 79
; PADDD MMq, QWORD [z] foreach qz in 4a, 5b, 0c, 1d, 2e
; MOVQ QWORD [z], MMq foreach qz in 4a, 5b, 0c, 1d, 2e
; But I moved the final ADDs and MOVs up into the middle of the
; last round because there were some free execution slots on my K6-II
; that could be used... (It's faster this way basically)
; cache line +21
PADDD MM5, QWORD [b] ; 0f fe 2d a8 a9 04 08 ; add the original values back in
PSLLD MM6, 5 ; 0f 72 f6 05 ; A << 5
;-- cache line
MOVQ MM7, MM0 ; 0f 6f f8 ; B
POR MM4, MM6 ; 0f eb e6 ; A = A<<<5 = (A<<5)|(A>>27)
PXOR MM7, MM1 ; 0f ef f9 ; | B ^ C
PADDD MM4, MM3 ; 0f fe e3 ; A<<<5 + E | |
MOVQ QWORD [b], MM5 ; 0f 7f 2d a8 a9 04 08 ; move final B back out | |
PXOR MM7, MM2 ; 0f ef fa ; | | B ^ C ^ D
PADDD MM1, QWORD [d] ; 0f fe 0d b8 a9 04 08 ; add the original values back in |
PADDD MM4, MM7 ; 0f fe e7 ; A<<<5 + E + (B ^ C ^ D)
;-- cache line | | | | |
PADDD MM2, QWORD [e] ; 0f fe 15 c8 a9 04 08 ; add the original values back in
PADDD MM4, [ECX] ; 0f fe 21 ; A<<<5 + E + W[t] + (B ^ C ^ D)
MOVQ QWORD [d], MM1 ; 0f 7f 0d b8 a9 04 08 ; move final D back out | | |
PADDD MM4, [K4] ; 0f fe 25 10 a6 04 08 ; A<<<5 + E + W[t] + K + (B ^ C ^ D)
; slower: PADDD MM4, [EBX] ; 0f fe 23
MOVQ MM7, MM0 ; 0f 6f f8 ; B
MOVQ QWORD [e], MM2 ; 0f 7f 15 c8 a9 04 08 ; move final E back out
;-- cache line +2
PSLLD MM0, 30 ; 0f 72 f0 1e ; B << 30
PADDD MM4, QWORD [a] ; 0f fe 25 c0 a9 04 08 ; add the original values back in
PSRLD MM7, 2 ; 0f 72 d7 02 ; B >> 32-30
MOVQ QWORD [a], MM4 ; 0f 7f 25 c0 a9 04 08 ; move final A back out
POR MM0, MM7 ; 0f eb c7 ; B <<<30
PADDD MM0, QWORD [c] ; 0f fe 05 b0 a9 04 08 ; add the original values back in
;-- cache line +2
MOVQ QWORD [c], MM0 ; 0f 7f 05 b0 a9 04 08 ; move final C back out
end:
MOV EAX, 0 ; b8 00 00 00 00 ; return value, probably not needed
POP EDX ; 5a
POP ESI ; 5e
POP EBX ; 5b
; using LEAVE ; c9 goes slightly slower than the following...
MOV ESP, EBP; 89 ec
POP EBP ; 5d
RET ; c3 ; gcc cleans up stack
no subject
Date: 2006-02-11 06:28 pm (UTC)