foxgrrl | SHA-1 in 1300 clock cycles. (Warning: Long)

I haven't had time to post anything new in here. (Malware analysis and stuff.) So here's something old I wrote, that few people have seen.

I thought of a way to shave a few (maybe 50) more clock cycles off this. But I haven't had time to update this.

There's a small .c program to demonstrate how to use this, and to perform some unit tests, here: http://www.virtadpt.net/~julia/sha1.zip

; SHA-1 transform
; Version: 0.14
; Date: Jan 20, 2004
; Author: Julia Wolf
; Copyright 2004, Distribution in accordence with the GPL (http://www.gnu.org/)

; I wrote the first version (0.1) of this in about a day working only off of
; the FIPS 180-1 standard document itself (and my previous MMX MD-5 routine).
; Then I spend a week or so shaving off about 300 clock cycles.

; It's probabily best to have a copy of FIPS 180-1 sitting in front of you if
; you want to try to follow what the code is doing.

; The bizzare instruction order, the NOPs and seemingly random
; use of the same value in registers and from memory load, are all
; to shave off a few clock cycles here and there.

; In summary, these instructions have been layed out to completely
; saturate all of the pipelines and execution units in the AMD K6-II
; whenever possible, and also make the instruction decoder happy.
; (it really is faster with the NOP's)

; I can probibally shave off a few more clock cycles, but I've pretty much
; hit the flat part of the curve now. It's not going to get any faster than
; it is now by more than a few clocks unless I start removing productive
; code. (Or adding more 64 bit registers to the silicon)

; Statistics on my AMD K6-II (fastest machine I have because I'm poor)
; (Timmed with RDTSC)
; best time avg. ~1295 clocks
; 826 opcodes to final RET, including NOP's
; 777 opcodes without NOP's
; 1295/777 = 1.66 opcodes per clock cycle...
; ...or about 0.60 clocks per opcode.

; total length 2720 bytes including the NOP's after RET filling the cache
; line. Or, exactly 85 cache lines. The four main rounds were rolled back up
; to reduce cache misses on the instruction cache.

; This function assumes that the 512-bit blocks have already been padded
; at the end.

; You can call this like this:
;	extern void sha1_mmx(unsigned char* x, unsigned char* y);
;	unsigned char a[8];
;	unsigned char b[8];
;	unsigned char c[8];
;	unsigned char d[8];
;	unsigned char e[8];
;	unsigned char x[64], y[64];
;	sha1_mmx(x,y)
; x and y are your 512 bits of message, padded apropreately
; ('1' bit with the size mod 2^64 at the end, see FIPS 180-1)
; a, b, c, d, and e are the two contexts, foreach z in a,b,c,d,e:
; z[0],z[1],z[2],z[3] belong to the first context and
; z[4],z[5],z[6],z[7] belong to the second context.

; I really don't know why I bother with using byte arrays rather than int's
; since this will only ever run on a Pentium MMX and above I don't have to
; worry about endedness issues. (force of habbit I guess)

[SECTION .bss align=8 noexec]

W	RESQ 80	; first 16 elements same as M see FIPS 180-1

[SECTION .rodata align=8 noexec]

K1	dd 0x5A827999, 0x5A827999 ; for T[ 0..19]
K2	dd 0x6ED9EBA1, 0x6ED9EBA1 ; for T[20..39]
K3	dd 0x8F1BBCDC, 0x8F1BBCDC ; for T[40..59]
K4	dd 0xCA62C1D6, 0xCA62C1D6 ; for T[60..79]

; TODO: pass the context as an argument insted of using globals

[SECTION .data align=8 noexec]

extern	a
extern	b
extern	c
extern	d
extern  e

[SECTION .text allign=8 nowrite]

global sha1_mmx:function

sha1_mmx:
	; This is faster than ENTER 0,0 by a few clocks, strangely
	; EBP, ESI, and stuff arn't really used here, so
	; I could take these ops out, there are no local variables.
	PUSH	EBP		; 55
	MOV	EBP, ESP	; 89 e5
	PUSH	EBX		; 53
	PUSH	ESI		; 56
	PUSH	EDX		; 52

%define X EDX
%define Y EBX
	MOV	Y, [EBP+8]	; 8b 55 08	; arrrg 2
	MOV	X, [EBP+12]	; 8b 5d 0c	; arrrg 1

	; I think this is the only instruction that makes this
	; not compatible with the original Pentium MMX
	; (replace with EMMS if there's a problem)
	FEMMS			; 0f 0e

	PREFETCH [X+32]		; 0f 0d 42 20
	PREFETCH [Y+32]		; 0f 0d 43 20


	XOR	ECX, ECX	; 31 c9	; shorter than doing MOV ECX, 0	; b9 00 00 00 00

	; Rather than using the [disp32+index] addressing mode
	; storing [W] in unused register EAX allows for shorter
	; instruction length.
%define index EAX
	MOV	index, W	; b8 38 a7 04 08 ($0x804a738)
	NOP			; 90
	NOP			; 90
	NOP			; 90
				;-- cache line boundry

	; This routine takes 64 bits from each block of 512 bits
	; and mixes them together into alternating blocks of 32 bits
	; for example, x[n] and y[n] are 32 bits each:
	; input: x[1], x[2], x[3], x[4], x[5] ...
	; input: y[1], y[2], y[3], y[4], y[5] ...
	; output: x[1], y[1], x[2], y[2], x[3], y[3], x[4], y[4], ...

	; The output is placed into the first 16 QWORDS of [W]

	; This is done so that each 32 bits fills the top and bottom
	; halves of a 64 bit MMX register. Allowing both to be worked
	; on in parallel. (It's the rationale behind this whole SHA-1 
	; function if you havn't guessed already)

	; only need to do this for 8 DWORDs to make 16 QWORDs (hence ECX)
	; this is about 20 to 90 clocks.

	; These were changed to byte-sized version to save space
	; and ECX never goes above 256 so it's ok
	;	ADD	ECX, 8		; 81 c1 08 00 00 00
	;	CMP	ECX, 64		; 81 f9 40 00 00 00

multiplex:
	MOVQ	MM6, [X+ECX]	; 0f 6f 34 0a	; 12345678
	MOVQ	MM7, [Y+ECX]	; 0f 6f 3c 0b	; abcdefgh
	MOVQ	MM4, MM6	; 0f 6f e6
	MOVQ	MM5, MM7	; 0f 6f ef
	PUNPCKHDQ MM5, MM4	; 0f 6a ec	; 1234abcd
	PUNPCKLDQ MM7, MM6	; 0f 62 fe	; 5678efgh
	MOVQ	[index+ECX*2], MM7   ; 0f 7f 3c 48
	MOVQ	[index+ECX*2+8], MM5 ; 0f 7f 6c 48 08
	ADD	CL, BYTE 8	; 80 c1 08
				;-- cache line
	CMP	CL, BYTE 64	; 80 f9 40
	JNE multiplex		; 75 d5

	; Pre-compute the words for T=16 through T=79
	; It's better to do this now while the MMX registers are
	; still free, rather than later when they are used.
	; Only needs an extra 64 QWORDS of memory.

	; Originally this performed 16 loops of these 4 operations:
	; for a total of (80-16)=64 words
	; T     (W[T-3] XOR W[T-8] XOR W[T-14] XOR W[T-16])<<<1
	; one   (W[T-2] XOR W[T-7] XOR W[T-13] XOR W[T-15])<<<1
	; two   (W[T-1] XOR W[T-6] XOR W[T-12] XOR W[T-14])<<<1
	; three (W[T  ] XOR W[T-5] XOR W[T-11] XOR W[T-13])<<<1
	; On the last operation you can see that it is dependent upon
	; the first operation compleating, which wasn't much of a problem
	; just make that the last XOR performed.
	; Can be done in a loop of (4 ops * 16 loops) 

	; Then I rolled it back up into a loop and only did three at a time
	; So I could have done (3*21)+1 but then unrolled it a little bit
	; and it now does (6*10)+4 operations.

	; This freed some MMX registers for some speed improvements
	; doing RISC-like programming, seperate mload and meu operations.

	; To avoid contention over the single MMX bit shifter on the K6-II
	; left shifts are done by adding the register to itself, since the
	; integer unit is free at that point. (The right shift can be done
	; in parallel to it)
	; If there was a faster way to get the sign bit of each half of 
	; an MMX reg, then that could be used in place of the PSRLD.

	; If there was an MMX equivilant to ROR or ROL then this would
	; be much faster, but there isn't so I need to make a copy of
	; the register, shift one left, and the other right, and then
	; OR them back together.
	; And I think copying the MMX register out into one of the general
	; E?X registers, doing the rotate there and copying back into MMX
	; would take even longer than this.

	; the following is about 440 clocks
	; so at about 44 clocks per loop, and 6 words per loop
	; it's about 7.3 clocks per two 32bit words

ugh:
%define counter ECX
%define bailout EDX
		; T=16
	MOV	CL, BYTE 128	; b1 80 ; shorter than MOV ECX, 128 ; b9 80 00 00 00
				; 	  high half of ECX still 0 from above
				; there might be a superset dependancy here,
				; but I don't see any speed diference with ECX vs CL
	ADD	counter, index	; 01 c1
	MOV	bailout, counter; 89 ca
		; end when T=16+60
	ADD	bailout, 480	; 81 c2 e0 01 00 00	; 128+480

	; skip the padding, shaves off a few clocks
	JMP make_words		; e9 0a 00 00 00

	; pad up to the next cache line so that the following loop goes faster
	NOP ; 6
	NOP ; 7
	NOP ; 8
	NOP ; 9
	NOP ; A
	NOP ; B
	NOP ; C
	NOP ; D
	NOP ; E
	NOP ; F
				;-- cache line
make_words:
		; mload T-3, T-2, T-1
	MOVQ	MM0, [counter-24]; 0f 6f 41 e8	; T-3
	MOVQ	MM1, [counter-16]; 0f 6f 49 f0	; T-2
	MOVQ	MM2, [counter-8] ; 0f 6f 51 f8	; T-1

	; insted of doing PXOR	MM0, [ECX-64] etc. etc.
	; doing MOVQ then PXOR shaves off a bunch of clocks here...
		; mload T-8, T-7, T-6
	MOVQ	MM3, [counter-64]; 0f 6f 59 c0	; T-8
	MOVQ	MM4, [counter-56]; 0f 6f 61 c8	; T-7
	PXOR	MM0, MM3	 ; 0f ef c3	; T-8 XOR T-3
	MOVQ	MM5, [counter-48]; 0f 6f 69 d0	; T-6
	PXOR	MM1, MM4	 ; 0f ef cc	; T-7 XOR T-2
				 ;-- cacheline-2 this will be long decoded
	PXOR	MM2, MM5	 ; 0f ef d5	; T-6 XOR T-1

	; ...but doesn't save time when done here:
		; mload T-14, T-13, T-12
	PXOR	MM0, [counter-112]; 0f ef 41 90	; T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-104]; 0f ef 49 98	; T-13 XOR T-7 XOR T-2
	PXOR	MM2, [counter-96] ; 0f ef 51 a0	; T-12 XOR T-6 XOR T-1
		; mload T-16, T-15, T-14
	PXOR	MM0, [counter-128]; 0f ef 41 80	; T-16 XOR T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-120]; 0f ef 49 88	; T-15 XOR T-13 XOR T-7 XOR T-2

	; PXOR MM2, [ECX-112] was split up into a MOVQ, PXOR combo on MM3
	MOVQ	MM3, [counter-112]; 0f 6f 59 90	; T-14
	MOVQ	MM4, MM0	  ; 0f 6f e0	; for left shift below
	PXOR	MM2, MM3	  ; 0f ef d3	; T-14 XOR T-12 XOR T-6 XOR T-1
	NOP			  ; 90
				  ;-- cache line

	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>(32-1)
	MOVQ	MM5, MM1	; 0f 6f e9	; for left shift below

	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>(32-1)
	MOVQ	MM6, MM2	; 0f 6f f2	; for left shift below

	PADDD	MM4, MM4	; 0f fe e4	; <<1
	PSRLD	MM2, 31		; 0f 72 d2 1f	; >>(32-1)
	PADDD	MM5, MM5	; 0f fe ed	; <<1

	POR	MM0, MM4	; 0f eb c4	; MM0<<<1=(MM0>>31)|(MM4<<1)
	PADDD	MM6, MM6	; 0f fe f6	; <<1
	NOP			; 90
	NOP			; 90
				;-- cache line

		; mstore T, T+1, T+2
	POR	MM1, MM5	 ; 0f eb cd	; MM1<<<1=(MM1>>31)|(MM5<<1)
	MOVQ	[counter], MM0	 ; 0f 7f 01
	POR	MM2, MM6	 ; 0f eb d6	; MM2<<<1=(MM2>>31)|(MM6<<1)
 	MOVQ	[counter+8], MM1 ; 0f 7f 49 08
	MOVQ	[counter+16], MM2; 0f 7f 51 10

	; next 3 ================================
		; T=T+3
	ADD	counter, BYTE 24 ; 83 c1 18

		; mload T-3, T-2, T-1
	MOVQ	MM0, [counter-24]; 0f 6f 41 e8	; T-3
	MOVQ	MM1, [counter-16]; 0f 6f 49 f0	; T-2
	MOVQ	MM2, [counter-8] ; 0f 6f 51 f8	; T-1
				 ;-- cache line
		; mload T-8, T-7, T-6
	MOVQ	MM3, [counter-64]; 0f 6f 59 c0	; T-8
	MOVQ	MM4, [counter-56]; 0f 6f 61 c8	; T-7
	PXOR	MM0, MM3	 ; 0f ef c3	; T-8 XOR T-3
	MOVQ	MM5, [counter-48]; 0f 6f 69 d0	; T-6
	PXOR	MM1, MM4	 ; 0f ef cc	; T-7 XOR T-2
	PXOR	MM2, MM5	 ; 0f ef d5	; T-6 XOR T-1
		; mload T-14, T-13, T-12
	PXOR	MM0, [counter-112]; 0f ef 41 90	; T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-104]; 0f ef 49 98	; T-13 XOR T-7 XOR T-2
	PXOR	MM2, [counter-96] ; 0f ef 51 a0	; T-12 XOR T-6 XOR T-1
				  ;-- cacheline +1
		; mload T-16, T-15, T-14
	PXOR	MM0, [counter-128]; 0f ef 41 80	; T-16 XOR T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-120]; 0f ef 49 88	; T-15 XOR T-13 XOR T-7 XOR T-2
	MOVQ	MM4, MM0	  ; 0f 6f e0	; for left shift below

	PXOR	MM2, [counter-112]; 0f ef 51 90	; T-14 XOR T-12 XOR T-6 XOR T-1

	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>(32-1)
	MOVQ	MM5, MM1	; 0f 6f e9	; for left shift below

	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>(32-1)
	MOVQ	MM6, MM2	; 0f 6f f2	; for left shift below
	PADDD	MM4, MM4	; 0f fe e4	; <<1
				;-- cache line +1

	PSRLD	MM2, 31		; 0f 72 d2 1f	; >>(32-1)
	PADDD	MM5, MM5	; 0f fe ed	; <<1

	POR	MM0, MM4	; 0f eb c4	; MM0<<<1=(MM0>>31)|(MM4<<1)
	PADDD	MM6, MM6	; 0f fe f6	; <<1

		; mstore T, T+1, T+2
	POR	MM1, MM5	 ; 0f eb cd	; MM1<<<1=(MM1>>31)|(MM5<<1)
	MOVQ	[counter], MM0	 ; 0f 7f 01	; T
	POR	MM2, MM6	 ; 0f eb d6	; MM2<<<1=(MM2>>31)|(MM6<<1)
 	MOVQ	[counter+8], MM1 ; 0f 7f 49 08	; T+1
	MOVQ	[counter+16], MM2; 0f 7f 51 10	; T+2
	NOP			 ; 90
				 ;-- cache line
		; T=T+3
	ADD	counter, BYTE 24; 83 c1 18
	CMP	counter, bailout; 39 d1
	JNE make_words		; 0f 85 16 ff ff ff

	;--------------------------------------
	; these four are about 30 clocks
	; about 7.5 clocks per two 32bit words

	; only doing two words at once here.
last_4:
	MOVQ	MM0, [ECX-24]	; 0f 6f 41 e8
	MOVQ	MM1, [ECX-16]	; 0f 6f 49 f0
	MOVQ	MM3, [ECX-64]	; 0f 6f 59 c0
	MOVQ	MM4, [ECX-56]	; 0f 6f 61 c8
	PXOR	MM0, MM3	; 0f ef c3
	NOP			; 90
	NOP			; 90
				;-- cache line
	PXOR	MM1, MM4	; 0f ef cc
	PXOR	MM0, [ECX-112]	; 0f ef 41 90
	PXOR	MM1, [ECX-104]	; 0f ef 49 98
	PXOR	MM0, [ECX-128]	; 0f ef 41 80
	PXOR	MM1, [ECX-120]	; 0f ef 49 88
	MOVQ	MM4, MM0	; 0f 6f e0
	MOVQ	MM5, MM1	; 0f 6f e9
	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>31
	PADDD	MM4, MM4	; 0f fe e4	; <<1
				;-- cache line
	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>31
	PADDD	MM5, MM5	; 0f fe ed	; <<1
	POR	MM0, MM4	; 0f eb c4
	POR	MM1, MM5	; 0f eb cd
	MOVQ	[ECX], MM0	; 0f 7f 01
 	MOVQ	[ECX+8], MM1	; 0f 7f 49 08
	ADD	ECX, BYTE 16	; 83 c1 10	; T=T+2
	MOVQ	MM3, [ECX-64]	; 0f 6f 59 c0
	MOVQ	MM0, [ECX-24]	; 0f 6f 41 e8
	NOP			; 90
				;-- cache line
	MOVQ	MM1, [ECX-16]	; 0f 6f 49 f0
	MOVQ	MM4, [ECX-56]	; 0f 6f 61 c8
	PXOR	MM0, MM3	; 0f ef c3
	PXOR	MM1, MM4	; 0f ef cc
	PXOR	MM0, [ECX-112]	; 0f ef 41 90
	PXOR	MM1, [ECX-104]	; 0f ef 49 98
	PXOR	MM0, [ECX-128]	; 0f ef 41 80
	PXOR	MM1, [ECX-120]	; 0f ef 49 88
	NOP			; 90
	NOP 			; 90
				;-- cache line
	MOVQ	MM4, MM0	; 0f 6f e0
	MOVQ	MM5, MM1	; 0f 6f e9
	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>31
	PADDD	MM4, MM4	; 0f fe e4	; <<1
	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>31
	PADDD	MM5, MM5	; 0f fe ed	; <<1
	POR	MM0, MM4	; 0f eb c4
	POR	MM1, MM5	; 0f eb cd
	MOVQ	[ECX], MM0	; 0f 7f 01
 	MOVQ	[ECX+8], MM1	; 0f 7f 49 08
				;-- cache line

	; from here to the end of round 19 is about 170 to 220 clocks

	; Copy the two SHA-1 states into the first 5 registers.
	; A, B, C, D, and E should be self-explanitory 
	; (see FIPS 180-1 otherwise)
	; Each half of each register holds completely seperate
	; contexts.

	; the NOPs make it go faster, some weird memory allignment thing I
	; think. The address to each MOVQ fall on divisible-by-4 addresses.

state_copy:
	NOP			; 90
	MOVQ	MM0, QWORD [a]	; 0f 6f 05 c0 a9 04 08 (0x804a9c0)
	NOP			; 90
	MOVQ	MM1, QWORD [b]	; 0f 6f 0d a8 a9 04 08 (0x804a9a8)
	NOP			; 90
	MOVQ	MM2, QWORD [c]	; 0f 6f 15 b0 a9 04 08 (0x804a9b0)
	NOP			; 90
	MOVQ	MM3, QWORD [d]	; 0f 6f 1d b8 a9 04 08 (0x804a9b8)
				;-- cache line
	NOP			; 90
	MOVQ	MM4, QWORD [e]  ; 0f 6f 25 c8 a9 04 08 (0x804a9c8)

	; This is the core of the SHA-1 transform
	; The following is done 80 times (T = 0..79)

	; A' = A<<<5 + F(B,C,D) + E + W[t] + K[t]
	; B' = A
	; C' = B<<<30
	; D' = C
	; E' = D

	; Function F(B,C,D) (op_F) is (B AND C) OR (NOT B AND D) for T 0..19
	; Function F(B,C,D) (op_G) is B XOR C XOR D for T 20..39 and 60..79
	; Function F(B,C,D) (op_H) is (B AND C) OR (B AND D) OR (C AND D) for T 40..59

	; I've named these three functions F, G, and H respectively (like in MD4)

	; Note: FIPS 180-2 says (B AND C) XOR (NOT B AND D)
	;                       (B AND C) XOR (B AND D) XOR (C AND D)
	; which produces the exact same result.

	; again, this would be *alot* faster if there was a MMX ROR/ROL equivalent
	; op_H would be alot faster if I had one more 64 bit register.

	; The rounds were rolled up into three iterations of a loop of six
	; rounds each. This was to try to avoid flooding the instruction caches
	; (Reduction of code size by 1/3 at the expense of a 18 loop increments, a compaire and jump.)
	; (The old unrolled version used macro-expanded constants for W[T], no index computation neccesary)

	; I can get away with keeping the entire SHA-1 state in registers
	; and save several MOVs if I just cycle through the registers
	; between each round (ie output variables are shifted for input)

	; For example, four rounds of E goes like:
	; E' = D
	; E'' = D' = C
	; E''' = D'' = C' = B<<<30
	; E'''' = D''' = C'' = B'<<<30 = A (<<<30)
	; So I'm just substituting the last set of state variables back in.
	; And no you can't skip ahead because every state var is needed
	; to compute A every round.

;= (B AND C) OR (NOT B AND D) ===============================================
%macro	op_F 8

	; These three instructions were moved up to the last few
	; instructions of the previous operation (so they're at the end here)
	; This was to promote better parallelism on the K6-II
	;	MOVQ	%7, %2		; A
	;	MOVQ	MM6, %2		; A
	;	PSRLD	%7, 27		; A >> 32-5

			;-- cache line here first time through
	PSLLD	MM6, 5	; 0f 72 f6 05	;  A << 5
	MOVQ	MM7, %3	; 0f 6f f9	;  |                               B
	POR	%7, MM6	; 0f eb ee	;  A = A<<<5 = (A<<5)|(A>>27)	   |
	MOVQ	MM6, %3	; 0f 6f f1	;  |                      B	   |
	PADDD	%7, %6	; 0f fe ec	;  A<<<5 + E		  |	   |
	PAND	MM6, %4	; 0f db f2	;  |       |              B & C	   |
	PADDD	%7, [%1]; 0f fe 29	;  A<<<5 + E + W[t]	  |   |	   |
	PANDN	MM7, %5	; 0f df fb	;  |       |   |          |   |   ~B & D
	PADDD	%7, [%8]; 0f fe 2df8a50408;A<<<5 + E + W[t] + K   |   |	   |   |
			; 0f fe 23 if EBX used	   |   |      |	  |   |	   |   |
			;--cache line here if K1 used  |      |	  |   |    |   |
	POR	MM6, MM7; 0f eb f7	;  |       |   |      |  (B & C)|(~B & D)
			;--cache line here if EBX used |      |	  |   |    |   |
	MOVQ	MM7, %3	; 0f 6f f9	; B	   |   |      |	  |   |	   |   |
	PADDD	%7, MM6	; 0f fe ee	;  A<<<5 + E + W[t] + K +((B & C)|(~B & D))
	PSLLD	%3, 30	; 0f 72 f1 1e	; B << 30

	MOVQ	%6, %7	; 0f 6f e5	; "TEMP" (next A) into E
	PSRLD	MM7, 2	; 0f 72 d7 02	; B >> 32-30

	MOVQ	MM6, %7	; 0f 6f f5	; "TEMP" (next A) into E

	PSRLD	%6, 27	; 0f 72 d4 1b	; A >> 32-5 (TEMP)
	POR	%3, MM7	; 0f eb cf	; B <<<30
	ADD	%1, BYTE 8; 83 c1 08	; next word T=T+1
			;-- cache line +1 if K1
			;-- cache line -2 if EBX

%endmacro

				;-- cache line +8
	; These were moved up to the bottom of each previous round. 
	; So this is needed for the first time through.
	MOVQ	MM5, MM0	; 0f 6f e8	; A
	MOVQ	MM6, MM0	; 0f 6f f0	; A
	PSRLD	MM5, 27		; 0f 72 d5 1b	; A >> 32-5

	MOV	ECX, index	; 89 c1
	MOV	bailout, index	; 89 c2
	ADD	bailout, 144	; 81 c2 90 00 00 00 ; faster than ADD DX, WORD 0x90 ; 66 81 c2 90 00
	MOV	EBX, K1		; bb f8 a5 04 08
				;-- cache line +2
	; this is the only loop that doesn't start on a cache line
	; when I do allign this to the start of the line (-2 bytes)
	; it runs slower. So this is the fast version.

	; Alternatively using [K1] and [EBX] is faster than useing only one.

rounds_0_17:
	op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1
	op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, K1
	op_F ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
	op_F ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
	op_F ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
	op_F ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP	ECX, bailout		; 39 d1
JNE rounds_0_17			; 0f 85 82 fe ff ff

	; these two are 'about' 20-30 clocks depending on caches I guess
	op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1	; 18
	op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX	; 19

; = B XOR C XOR D ===========================================================
%macro	op_G 8

	; See above.. for reference..
	;	MOVQ	%7, %2		; A
	;	MOVQ	MM6, %2		; A
	;	PSRLD	%7, 27		; A >> 32-5
			;-- cache line here for first round
	PSLLD	MM6, 5	; 0f 72 f6 05	; A << 5
	MOVQ	MM7, %3	; 0f 6f fd	;                         B
	POR	%7, MM6	; 0f eb de	; A = A<<<5 = (A<<5)|(A>>27)
	PXOR	MM7, %4	; 0f ef f8	; |                       B ^ C
	PADDD	%7, %6	; 0f fe da	; A<<<5 + E		  |   |
	PXOR	MM7, %5	; 0f ef f9	; |       |               B ^ C ^ D
	PADDD	%7, MM7	; 0f fe df	; A<<<5 + E 	       + (B ^ C ^ D)
	PADDD	%7, [%1]; 0f fe 19	; A<<<5 + E + W[t]     + (B ^ C ^ D)
	MOVQ	MM7, %3	; 0f 6f fd	; B	  |   |		  |
			;-- cache line (-4)	  |   |		  |
					; A<<<5 + E + W[t] + K + (B ^ C ^ D)
	PADDD	%7, [%8]; 0f fe 1d 00 a6 04 08 (0x804a600)
			; 0f fe 13 if EBX is used

	PSLLD	%3, 30	; 0f 72 f5 1e	; B << 30
	MOVQ	%6, %7	; 0f 6f d3	; "TEMP" (next A) into E

	PSRLD	MM7, 2	; 0f 72 d7 02	; B >> 32-30
	MOVQ	MM6, %7	; 0f 6f f3	; "TEMP" (next A) into E

	POR	%3, MM7	; 0f eb ef	; B <<<30
	PSRLD	%6, 27	; 0f 72 d2 1b	; A >> 32-5 (TEMP)
	ADD ECX, BYTE 8	; 83 c1 08	; T=T+1
			;-- cache +27 (or -5) if K2  
			;-- cache +23 (or -9) if EBX
%endmacro

	MOV	EBX, K2		; bb 00 a6 04 08 ($0x804a600)
	MOV	bailout, ECX 	; 89 ca
	ADD	bailout, 144 	; 81 c2 90 00 00 00 ; (8*18)-[(2*8) for 18 and 19]
		; It's faster to jump over the NOPs
	JMP rounds_20_37	; e9 12 00 00 00
	NOP ;0
	NOP ;1
	NOP ;2
	NOP ;3
	NOP ;4
	NOP ;5
	NOP ;6
	NOP ;7
	NOP ;8
	NOP ;9
	NOP ;a
	NOP ;b
	NOP ;c
	NOP ;d
	NOP ;e
	NOP ;f
				;-- cache line

	; this is almost exactly 200 clock from here to...
rounds_20_37:
	op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2
	op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2
	op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
	op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
	op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, EBX
	op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
CMP	ECX, bailout	; 39 d1
JNE rounds_20_37	; 0f 85 a6 fe ff ff
	op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2 ; 38
	op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2 ; 39
	; ... here so about 1.6 clocks per opcode

; this is the slowest one, mostly due to register crunch
;= (B AND C) OR (B AND D) OR (C AND D) ======================================
%macro	op_H 8

	; for reference,
	;	MOVQ	%7, %2		; A
	;	MOVQ	MM6, %2		; A
	;	PSRLD	%7, 27		; A >> 32-5

	PSLLD	MM6, 5	; 0f 72 f6 05	; A << 5
	MOVQ	MM7, %3	; 0f 6f fb	;                                 B
	POR	%7, MM6	; 0f eb ce	; A = A<<<5 = (A<<5)|(A>>27)	  |
	PAND	MM7, %5	; 0f db fd	; |                               B & D
	MOVQ	MM6, %3	; 0f 6f f3	; |                       B	  |   |
	PAND	MM6, %4	; 0f db f4	; |                       B & C	  |   |
	PADDD	%7, %6	; 0f fe c8	; A<<<5 + E		  |   |   |   |
	POR	MM6, MM7; 0f eb f7	; |       |              (B & C)|(B & D)
	PADDD	%7, [%1]; 0f fe 09	; A<<<5 + E + W[t]	  |
	MOVQ	MM7, %4	; 0f 6f fc	; |       |   |   C	  |
	PADDD	%7, [%8]; 0f fe 0b (ebx); A<<<5 + E + W[t] + K	  |
			; 0f fe 0508a60408 (0x804a608)|	  |  |	  |
	PAND	MM7, %5	; 0f db fd	; |       |   |   C & D	  |
	POR	MM6, MM7; 0f eb f7	; |       |   |  (C & D)|(B & C)|(B & D)
	PADDD	%7, MM6	; 0f fe ce	; A<<<5 + E + W[t] + K +((B&C)|(B&D)|(C&D))

	MOVQ	MM7, %3	; 0f 6f fb	; B
	PSRLD	MM7, 2	; 0f 72 d7 02	; B >> 32-30
	MOVQ	%6, %7	; 0f 6f c1	; "TEMP" (next A) into E

	PSLLD	%3, 30	; 0f 72 f3 1e	; B << 30
	MOVQ	MM6, %7	; 0f 6f f1	; "TEMP" (next A) into E

	POR	%3, MM7	; 0f eb df	; B <<<30
	PSRLD	%6, 27	; 0f 72 d0 1b	; A >> 32-5 (TEMP)
	ADD ECX, BYTE 8	; 83 c1 08	; T=T+1

%endmacro

	MOV	EBX, K3		; bb 08 a6 04 08
	MOV	bailout, ECX	; 89 ca
	ADD	bailout, 144	; 81 c2 90 00 00 00
	NOP ;d
	NOP ;e
	NOP ;f

	; This is also about 250 clocks
rounds_40_57:
	op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3
	op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3
	op_H ECX, MM0, MM1, MM2, MM3, MM4, MM5, K3
	op_H ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
	op_H ECX, MM4, MM5, MM0, MM1, MM2, MM3, K3
	op_H ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
CMP	ECX, bailout		; 39 d1
JNE rounds_40_57		; 0f 85 47 fe ff ff
	op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3	; 58
	op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3	; 59

	; No new macro definition here.. this is the same as rounds 20..39

	MOV	EBX, K4		; bb 10 a6 04 08 ($0x804a610)
	MOV	bailout, ECX	; 89 ca
	ADD	bailout, 144	; 81 c2 90 00 00 00
	NOP ;d
	NOP ;e
	NOP ;f

rounds_60_77:
	op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4
	op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4
	op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
	op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
	op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
	op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP	ECX, bailout		; 39 d1 
JNE rounds_60_77		; 0f 85 a2 fe ff ff
	op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4 ; 78

	; This last step used to be like this...
	; 	op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4 ; 79
	; 	PADDD	MMq, QWORD [z]  foreach qz in 4a, 5b, 0c, 1d, 2e
	;	MOVQ	QWORD [z], MMq  foreach qz in 4a, 5b, 0c, 1d, 2e
	; But I moved the final ADDs and MOVs up into the middle of the
	; last round because there were some free execution slots on my K6-II
	; that could be used... (It's faster this way basically)

				; cache line +21
	PADDD	MM5, QWORD [b]	; 0f fe 2d a8 a9 04 08	; add the original values back in
	PSLLD	MM6, 5		; 0f 72 f6 05		; A << 5
				;-- cache line
	MOVQ	MM7, MM0	; 0f 6f f8		;                         B
	POR	MM4, MM6	; 0f eb e6		; A = A<<<5 = (A<<5)|(A>>27)
	PXOR	MM7, MM1	; 0f ef f9		; |                       B ^ C
	PADDD	MM4, MM3	; 0f fe e3		; A<<<5 + E		  |   |
	MOVQ	QWORD [b], MM5	; 0f 7f 2d a8 a9 04 08	; move final B back out	  |   |
	PXOR	MM7, MM2	; 0f ef fa		; |       |               B ^ C ^ D
	PADDD	MM1, QWORD [d]	; 0f fe 0d b8 a9 04 08	; add the original values back in |
	PADDD	MM4, MM7	; 0f fe e7		; A<<<5 + E            + (B ^ C ^ D)
				;-- cache line		  |	  |		  |   |   |
	PADDD	MM2, QWORD [e]	; 0f fe 15 c8 a9 04 08	; add the original values back in
	PADDD	MM4, [ECX]	; 0f fe 21		; A<<<5 + E + W[t]     + (B ^ C ^ D)
	MOVQ	QWORD [d], MM1	; 0f 7f 0d b8 a9 04 08	; move final D back out	  |   |   |
	PADDD	MM4, [K4]	; 0f fe 25 10 a6 04 08	; A<<<5 + E + W[t] + K + (B ^ C ^ D)
							; slower: PADDD MM4, [EBX] ; 0f fe 23
	MOVQ	MM7, MM0	; 0f 6f f8		; B
	MOVQ	QWORD [e], MM2	; 0f 7f 15 c8 a9 04 08	; move final E back out
				;-- cache line +2
	PSLLD	MM0, 30		; 0f 72 f0 1e		; B << 30
	PADDD	MM4, QWORD [a]	; 0f fe 25 c0 a9 04 08	; add the original values back in
	PSRLD	MM7, 2		; 0f 72 d7 02		; B >> 32-30
	MOVQ	QWORD [a], MM4	; 0f 7f 25 c0 a9 04 08	; move final A back out

	POR	MM0, MM7	; 0f eb c7		; B <<<30
	PADDD	MM0, QWORD [c]	; 0f fe 05 b0 a9 04 08	; add the original values back in
				;-- cache line +2
	MOVQ	QWORD [c], MM0	; 0f 7f 05 b0 a9 04 08	; move final C back out
end:
	MOV	EAX, 0	; b8 00 00 00 00 ; return value, probably not needed
	POP	EDX	; 5a
	POP	ESI	; 5e
	POP	EBX	; 5b
	; using LEAVE ; c9 goes slightly slower than the following...
	MOV	ESP, EBP; 89 ec
	POP	EBP	; 5d
	RET		; c3 		; gcc cleans up stack