SHA-1 in 1300 clock cycles. (Warning: Long)

I haven't had time to post anything new in here. (Malware analysis and stuff.) So here's something old I wrote, that few people have seen.

I thought of a way to shave a few (maybe 50) more clock cycles off this. But I haven't had time to update this.

There's a small .c program to demonstrate how to use this, and to perform some unit tests, here: http://www.virtadpt.net/~julia/sha1.zip

; SHA-1 transform
; Version: 0.14
; Date: Jan 20, 2004
; Author: Julia Wolf
; Copyright 2004, Distribution in accordence with the GPL (http://www.gnu.org/)

; I wrote the first version (0.1) of this in about a day working only off of
; the FIPS 180-1 standard document itself (and my previous MMX MD-5 routine).
; Then I spend a week or so shaving off about 300 clock cycles.

; It's probabily best to have a copy of FIPS 180-1 sitting in front of you if
; you want to try to follow what the code is doing.

; The bizzare instruction order, the NOPs and seemingly random
; use of the same value in registers and from memory load, are all
; to shave off a few clock cycles here and there.

; In summary, these instructions have been layed out to completely
; saturate all of the pipelines and execution units in the AMD K6-II
; whenever possible, and also make the instruction decoder happy.
; (it really is faster with the NOP's)

; I can probibally shave off a few more clock cycles, but I've pretty much
; hit the flat part of the curve now. It's not going to get any faster than
; it is now by more than a few clocks unless I start removing productive
; code. (Or adding more 64 bit registers to the silicon)

; Statistics on my AMD K6-II (fastest machine I have because I'm poor)
; (Timmed with RDTSC)
; best time avg. ~1295 clocks
; 826 opcodes to final RET, including NOP's
; 777 opcodes without NOP's
; 1295/777 = 1.66 opcodes per clock cycle...
; ...or about 0.60 clocks per opcode.

; total length 2720 bytes including the NOP's after RET filling the cache
; line. Or, exactly 85 cache lines. The four main rounds were rolled back up
; to reduce cache misses on the instruction cache.

; This function assumes that the 512-bit blocks have already been padded
; at the end.

; You can call this like this:
;	extern void sha1_mmx(unsigned char* x, unsigned char* y);
;	unsigned char a[8];
;	unsigned char b[8];
;	unsigned char c[8];
;	unsigned char d[8];
;	unsigned char e[8];
;	unsigned char x[64], y[64];
;	sha1_mmx(x,y)
; x and y are your 512 bits of message, padded apropreately
; ('1' bit with the size mod 2^64 at the end, see FIPS 180-1)
; a, b, c, d, and e are the two contexts, foreach z in a,b,c,d,e:
; z[0],z[1],z[2],z[3] belong to the first context and
; z[4],z[5],z[6],z[7] belong to the second context.

; I really don't know why I bother with using byte arrays rather than int's
; since this will only ever run on a Pentium MMX and above I don't have to
; worry about endedness issues. (force of habbit I guess)

[SECTION .bss align=8 noexec]

W	RESQ 80	; first 16 elements same as M see FIPS 180-1

[SECTION .rodata align=8 noexec]

K1	dd 0x5A827999, 0x5A827999 ; for T[ 0..19]
K2	dd 0x6ED9EBA1, 0x6ED9EBA1 ; for T[20..39]
K3	dd 0x8F1BBCDC, 0x8F1BBCDC ; for T[40..59]
K4	dd 0xCA62C1D6, 0xCA62C1D6 ; for T[60..79]

; TODO: pass the context as an argument insted of using globals

[SECTION .data align=8 noexec]

extern	a
extern	b
extern	c
extern	d
extern  e

[SECTION .text allign=8 nowrite]

global sha1_mmx:function

sha1_mmx:
	; This is faster than ENTER 0,0 by a few clocks, strangely
	; EBP, ESI, and stuff arn't really used here, so
	; I could take these ops out, there are no local variables.
	PUSH	EBP		; 55
	MOV	EBP, ESP	; 89 e5
	PUSH	EBX		; 53
	PUSH	ESI		; 56
	PUSH	EDX		; 52

%define X EDX
%define Y EBX
	MOV	Y, [EBP+8]	; 8b 55 08	; arrrg 2
	MOV	X, [EBP+12]	; 8b 5d 0c	; arrrg 1

	; I think this is the only instruction that makes this
	; not compatible with the original Pentium MMX
	; (replace with EMMS if there's a problem)
	FEMMS			; 0f 0e

	PREFETCH [X+32]		; 0f 0d 42 20
	PREFETCH [Y+32]		; 0f 0d 43 20


	XOR	ECX, ECX	; 31 c9	; shorter than doing MOV ECX, 0	; b9 00 00 00 00

	; Rather than using the [disp32+index] addressing mode
	; storing [W] in unused register EAX allows for shorter
	; instruction length.
%define index EAX
	MOV	index, W	; b8 38 a7 04 08 ($0x804a738)
	NOP			; 90
	NOP			; 90
	NOP			; 90
				;-- cache line boundry

	; This routine takes 64 bits from each block of 512 bits
	; and mixes them together into alternating blocks of 32 bits
	; for example, x[n] and y[n] are 32 bits each:
	; input: x[1], x[2], x[3], x[4], x[5] ...
	; input: y[1], y[2], y[3], y[4], y[5] ...
	; output: x[1], y[1], x[2], y[2], x[3], y[3], x[4], y[4], ...

	; The output is placed into the first 16 QWORDS of [W]

	; This is done so that each 32 bits fills the top and bottom
	; halves of a 64 bit MMX register. Allowing both to be worked
	; on in parallel. (It's the rationale behind this whole SHA-1 
	; function if you havn't guessed already)

	; only need to do this for 8 DWORDs to make 16 QWORDs (hence ECX)
	; this is about 20 to 90 clocks.

	; These were changed to byte-sized version to save space
	; and ECX never goes above 256 so it's ok
	;	ADD	ECX, 8		; 81 c1 08 00 00 00
	;	CMP	ECX, 64		; 81 f9 40 00 00 00

multiplex:
	MOVQ	MM6, [X+ECX]	; 0f 6f 34 0a	; 12345678
	MOVQ	MM7, [Y+ECX]	; 0f 6f 3c 0b	; abcdefgh
	MOVQ	MM4, MM6	; 0f 6f e6
	MOVQ	MM5, MM7	; 0f 6f ef
	PUNPCKHDQ MM5, MM4	; 0f 6a ec	; 1234abcd
	PUNPCKLDQ MM7, MM6	; 0f 62 fe	; 5678efgh
	MOVQ	[index+ECX*2], MM7   ; 0f 7f 3c 48
	MOVQ	[index+ECX*2+8], MM5 ; 0f 7f 6c 48 08
	ADD	CL, BYTE 8	; 80 c1 08
				;-- cache line
	CMP	CL, BYTE 64	; 80 f9 40
	JNE multiplex		; 75 d5

	; Pre-compute the words for T=16 through T=79
	; It's better to do this now while the MMX registers are
	; still free, rather than later when they are used.
	; Only needs an extra 64 QWORDS of memory.

	; Originally this performed 16 loops of these 4 operations:
	; for a total of (80-16)=64 words
	; T     (W[T-3] XOR W[T-8] XOR W[T-14] XOR W[T-16])<<<1
	; one   (W[T-2] XOR W[T-7] XOR W[T-13] XOR W[T-15])<<<1
	; two   (W[T-1] XOR W[T-6] XOR W[T-12] XOR W[T-14])<<<1
	; three (W[T  ] XOR W[T-5] XOR W[T-11] XOR W[T-13])<<<1
	; On the last operation you can see that it is dependent upon
	; the first operation compleating, which wasn't much of a problem
	; just make that the last XOR performed.
	; Can be done in a loop of (4 ops * 16 loops) 

	; Then I rolled it back up into a loop and only did three at a time
	; So I could have done (3*21)+1 but then unrolled it a little bit
	; and it now does (6*10)+4 operations.

	; This freed some MMX registers for some speed improvements
	; doing RISC-like programming, seperate mload and meu operations.

	; To avoid contention over the single MMX bit shifter on the K6-II
	; left shifts are done by adding the register to itself, since the
	; integer unit is free at that point. (The right shift can be done
	; in parallel to it)
	; If there was a faster way to get the sign bit of each half of 
	; an MMX reg, then that could be used in place of the PSRLD.

	; If there was an MMX equivilant to ROR or ROL then this would
	; be much faster, but there isn't so I need to make a copy of
	; the register, shift one left, and the other right, and then
	; OR them back together.
	; And I think copying the MMX register out into one of the general
	; E?X registers, doing the rotate there and copying back into MMX
	; would take even longer than this.

	; the following is about 440 clocks
	; so at about 44 clocks per loop, and 6 words per loop
	; it's about 7.3 clocks per two 32bit words

ugh:
%define counter ECX
%define bailout EDX
		; T=16
	MOV	CL, BYTE 128	; b1 80 ; shorter than MOV ECX, 128 ; b9 80 00 00 00
				; 	  high half of ECX still 0 from above
				; there might be a superset dependancy here,
				; but I don't see any speed diference with ECX vs CL
	ADD	counter, index	; 01 c1
	MOV	bailout, counter; 89 ca
		; end when T=16+60
	ADD	bailout, 480	; 81 c2 e0 01 00 00	; 128+480

	; skip the padding, shaves off a few clocks
	JMP make_words		; e9 0a 00 00 00

	; pad up to the next cache line so that the following loop goes faster
	NOP ; 6
	NOP ; 7
	NOP ; 8
	NOP ; 9
	NOP ; A
	NOP ; B
	NOP ; C
	NOP ; D
	NOP ; E
	NOP ; F
				;-- cache line
make_words:
		; mload T-3, T-2, T-1
	MOVQ	MM0, [counter-24]; 0f 6f 41 e8	; T-3
	MOVQ	MM1, [counter-16]; 0f 6f 49 f0	; T-2
	MOVQ	MM2, [counter-8] ; 0f 6f 51 f8	; T-1

	; insted of doing PXOR	MM0, [ECX-64] etc. etc.
	; doing MOVQ then PXOR shaves off a bunch of clocks here...
		; mload T-8, T-7, T-6
	MOVQ	MM3, [counter-64]; 0f 6f 59 c0	; T-8
	MOVQ	MM4, [counter-56]; 0f 6f 61 c8	; T-7
	PXOR	MM0, MM3	 ; 0f ef c3	; T-8 XOR T-3
	MOVQ	MM5, [counter-48]; 0f 6f 69 d0	; T-6
	PXOR	MM1, MM4	 ; 0f ef cc	; T-7 XOR T-2
				 ;-- cacheline-2 this will be long decoded
	PXOR	MM2, MM5	 ; 0f ef d5	; T-6 XOR T-1

	; ...but doesn't save time when done here:
		; mload T-14, T-13, T-12
	PXOR	MM0, [counter-112]; 0f ef 41 90	; T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-104]; 0f ef 49 98	; T-13 XOR T-7 XOR T-2
	PXOR	MM2, [counter-96] ; 0f ef 51 a0	; T-12 XOR T-6 XOR T-1
		; mload T-16, T-15, T-14
	PXOR	MM0, [counter-128]; 0f ef 41 80	; T-16 XOR T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-120]; 0f ef 49 88	; T-15 XOR T-13 XOR T-7 XOR T-2

	; PXOR MM2, [ECX-112] was split up into a MOVQ, PXOR combo on MM3
	MOVQ	MM3, [counter-112]; 0f 6f 59 90	; T-14
	MOVQ	MM4, MM0	  ; 0f 6f e0	; for left shift below
	PXOR	MM2, MM3	  ; 0f ef d3	; T-14 XOR T-12 XOR T-6 XOR T-1
	NOP			  ; 90
				  ;-- cache line

	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>(32-1)
	MOVQ	MM5, MM1	; 0f 6f e9	; for left shift below

	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>(32-1)
	MOVQ	MM6, MM2	; 0f 6f f2	; for left shift below

	PADDD	MM4, MM4	; 0f fe e4	; <<1
	PSRLD	MM2, 31		; 0f 72 d2 1f	; >>(32-1)
	PADDD	MM5, MM5	; 0f fe ed	; <<1

	POR	MM0, MM4	; 0f eb c4	; MM0<<<1=(MM0>>31)|(MM4<<1)
	PADDD	MM6, MM6	; 0f fe f6	; <<1
	NOP			; 90
	NOP			; 90
				;-- cache line

		; mstore T, T+1, T+2
	POR	MM1, MM5	 ; 0f eb cd	; MM1<<<1=(MM1>>31)|(MM5<<1)
	MOVQ	[counter], MM0	 ; 0f 7f 01
	POR	MM2, MM6	 ; 0f eb d6	; MM2<<<1=(MM2>>31)|(MM6<<1)
 	MOVQ	[counter+8], MM1 ; 0f 7f 49 08
	MOVQ	[counter+16], MM2; 0f 7f 51 10

	; next 3 ================================
		; T=T+3
	ADD	counter, BYTE 24 ; 83 c1 18

		; mload T-3, T-2, T-1
	MOVQ	MM0, [counter-24]; 0f 6f 41 e8	; T-3
	MOVQ	MM1, [counter-16]; 0f 6f 49 f0	; T-2
	MOVQ	MM2, [counter-8] ; 0f 6f 51 f8	; T-1
				 ;-- cache line
		; mload T-8, T-7, T-6
	MOVQ	MM3, [counter-64]; 0f 6f 59 c0	; T-8
	MOVQ	MM4, [counter-56]; 0f 6f 61 c8	; T-7
	PXOR	MM0, MM3	 ; 0f ef c3	; T-8 XOR T-3
	MOVQ	MM5, [counter-48]; 0f 6f 69 d0	; T-6
	PXOR	MM1, MM4	 ; 0f ef cc	; T-7 XOR T-2
	PXOR	MM2, MM5	 ; 0f ef d5	; T-6 XOR T-1
		; mload T-14, T-13, T-12
	PXOR	MM0, [counter-112]; 0f ef 41 90	; T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-104]; 0f ef 49 98	; T-13 XOR T-7 XOR T-2
	PXOR	MM2, [counter-96] ; 0f ef 51 a0	; T-12 XOR T-6 XOR T-1
				  ;-- cacheline +1
		; mload T-16, T-15, T-14
	PXOR	MM0, [counter-128]; 0f ef 41 80	; T-16 XOR T-14 XOR T-8 XOR T-3
	PXOR	MM1, [counter-120]; 0f ef 49 88	; T-15 XOR T-13 XOR T-7 XOR T-2
	MOVQ	MM4, MM0	  ; 0f 6f e0	; for left shift below

	PXOR	MM2, [counter-112]; 0f ef 51 90	; T-14 XOR T-12 XOR T-6 XOR T-1

	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>(32-1)
	MOVQ	MM5, MM1	; 0f 6f e9	; for left shift below

	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>(32-1)
	MOVQ	MM6, MM2	; 0f 6f f2	; for left shift below
	PADDD	MM4, MM4	; 0f fe e4	; <<1
				;-- cache line +1

	PSRLD	MM2, 31		; 0f 72 d2 1f	; >>(32-1)
	PADDD	MM5, MM5	; 0f fe ed	; <<1

	POR	MM0, MM4	; 0f eb c4	; MM0<<<1=(MM0>>31)|(MM4<<1)
	PADDD	MM6, MM6	; 0f fe f6	; <<1

		; mstore T, T+1, T+2
	POR	MM1, MM5	 ; 0f eb cd	; MM1<<<1=(MM1>>31)|(MM5<<1)
	MOVQ	[counter], MM0	 ; 0f 7f 01	; T
	POR	MM2, MM6	 ; 0f eb d6	; MM2<<<1=(MM2>>31)|(MM6<<1)
 	MOVQ	[counter+8], MM1 ; 0f 7f 49 08	; T+1
	MOVQ	[counter+16], MM2; 0f 7f 51 10	; T+2
	NOP			 ; 90
				 ;-- cache line
		; T=T+3
	ADD	counter, BYTE 24; 83 c1 18
	CMP	counter, bailout; 39 d1
	JNE make_words		; 0f 85 16 ff ff ff

	;--------------------------------------
	; these four are about 30 clocks
	; about 7.5 clocks per two 32bit words

	; only doing two words at once here.
last_4:
	MOVQ	MM0, [ECX-24]	; 0f 6f 41 e8
	MOVQ	MM1, [ECX-16]	; 0f 6f 49 f0
	MOVQ	MM3, [ECX-64]	; 0f 6f 59 c0
	MOVQ	MM4, [ECX-56]	; 0f 6f 61 c8
	PXOR	MM0, MM3	; 0f ef c3
	NOP			; 90
	NOP			; 90
				;-- cache line
	PXOR	MM1, MM4	; 0f ef cc
	PXOR	MM0, [ECX-112]	; 0f ef 41 90
	PXOR	MM1, [ECX-104]	; 0f ef 49 98
	PXOR	MM0, [ECX-128]	; 0f ef 41 80
	PXOR	MM1, [ECX-120]	; 0f ef 49 88
	MOVQ	MM4, MM0	; 0f 6f e0
	MOVQ	MM5, MM1	; 0f 6f e9
	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>31
	PADDD	MM4, MM4	; 0f fe e4	; <<1
				;-- cache line
	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>31
	PADDD	MM5, MM5	; 0f fe ed	; <<1
	POR	MM0, MM4	; 0f eb c4
	POR	MM1, MM5	; 0f eb cd
	MOVQ	[ECX], MM0	; 0f 7f 01
 	MOVQ	[ECX+8], MM1	; 0f 7f 49 08
	ADD	ECX, BYTE 16	; 83 c1 10	; T=T+2
	MOVQ	MM3, [ECX-64]	; 0f 6f 59 c0
	MOVQ	MM0, [ECX-24]	; 0f 6f 41 e8
	NOP			; 90
				;-- cache line
	MOVQ	MM1, [ECX-16]	; 0f 6f 49 f0
	MOVQ	MM4, [ECX-56]	; 0f 6f 61 c8
	PXOR	MM0, MM3	; 0f ef c3
	PXOR	MM1, MM4	; 0f ef cc
	PXOR	MM0, [ECX-112]	; 0f ef 41 90
	PXOR	MM1, [ECX-104]	; 0f ef 49 98
	PXOR	MM0, [ECX-128]	; 0f ef 41 80
	PXOR	MM1, [ECX-120]	; 0f ef 49 88
	NOP			; 90
	NOP 			; 90
				;-- cache line
	MOVQ	MM4, MM0	; 0f 6f e0
	MOVQ	MM5, MM1	; 0f 6f e9
	PSRLD	MM0, 31		; 0f 72 d0 1f	; >>31
	PADDD	MM4, MM4	; 0f fe e4	; <<1
	PSRLD	MM1, 31		; 0f 72 d1 1f	; >>31
	PADDD	MM5, MM5	; 0f fe ed	; <<1
	POR	MM0, MM4	; 0f eb c4
	POR	MM1, MM5	; 0f eb cd
	MOVQ	[ECX], MM0	; 0f 7f 01
 	MOVQ	[ECX+8], MM1	; 0f 7f 49 08
				;-- cache line

	; from here to the end of round 19 is about 170 to 220 clocks

	; Copy the two SHA-1 states into the first 5 registers.
	; A, B, C, D, and E should be self-explanitory 
	; (see FIPS 180-1 otherwise)
	; Each half of each register holds completely seperate
	; contexts.

	; the NOPs make it go faster, some weird memory allignment thing I
	; think. The address to each MOVQ fall on divisible-by-4 addresses.

state_copy:
	NOP			; 90
	MOVQ	MM0, QWORD [a]	; 0f 6f 05 c0 a9 04 08 (0x804a9c0)
	NOP			; 90
	MOVQ	MM1, QWORD [b]	; 0f 6f 0d a8 a9 04 08 (0x804a9a8)
	NOP			; 90
	MOVQ	MM2, QWORD [c]	; 0f 6f 15 b0 a9 04 08 (0x804a9b0)
	NOP			; 90
	MOVQ	MM3, QWORD [d]	; 0f 6f 1d b8 a9 04 08 (0x804a9b8)
				;-- cache line
	NOP			; 90
	MOVQ	MM4, QWORD [e]  ; 0f 6f 25 c8 a9 04 08 (0x804a9c8)

	; This is the core of the SHA-1 transform
	; The following is done 80 times (T = 0..79)

	; A' = A<<<5 + F(B,C,D) + E + W[t] + K[t]
	; B' = A
	; C' = B<<<30
	; D' = C
	; E' = D

	; Function F(B,C,D) (op_F) is (B AND C) OR (NOT B AND D) for T 0..19
	; Function F(B,C,D) (op_G) is B XOR C XOR D for T 20..39 and 60..79
	; Function F(B,C,D) (op_H) is (B AND C) OR (B AND D) OR (C AND D) for T 40..59

	; I've named these three functions F, G, and H respectively (like in MD4)

	; Note: FIPS 180-2 says (B AND C) XOR (NOT B AND D)
	;                       (B AND C) XOR (B AND D) XOR (C AND D)
	; which produces the exact same result.

	; again, this would be *alot* faster if there was a MMX ROR/ROL equivalent
	; op_H would be alot faster if I had one more 64 bit register.

	; The rounds were rolled up into three iterations of a loop of six
	; rounds each. This was to try to avoid flooding the instruction caches
	; (Reduction of code size by 1/3 at the expense of a 18 loop increments, a compaire and jump.)
	; (The old unrolled version used macro-expanded constants for W[T], no index computation neccesary)

	; I can get away with keeping the entire SHA-1 state in registers
	; and save several MOVs if I just cycle through the registers
	; between each round (ie output variables are shifted for input)

	; For example, four rounds of E goes like:
	; E' = D
	; E'' = D' = C
	; E''' = D'' = C' = B<<<30
	; E'''' = D''' = C'' = B'<<<30 = A (<<<30)
	; So I'm just substituting the last set of state variables back in.
	; And no you can't skip ahead because every state var is needed
	; to compute A every round.

;= (B AND C) OR (NOT B AND D) ===============================================
%macro	op_F 8

	; These three instructions were moved up to the last few
	; instructions of the previous operation (so they're at the end here)
	; This was to promote better parallelism on the K6-II
	;	MOVQ	%7, %2		; A
	;	MOVQ	MM6, %2		; A
	;	PSRLD	%7, 27		; A >> 32-5

			;-- cache line here first time through
	PSLLD	MM6, 5	; 0f 72 f6 05	;  A << 5
	MOVQ	MM7, %3	; 0f 6f f9	;  |                               B
	POR	%7, MM6	; 0f eb ee	;  A = A<<<5 = (A<<5)|(A>>27)	   |
	MOVQ	MM6, %3	; 0f 6f f1	;  |                      B	   |
	PADDD	%7, %6	; 0f fe ec	;  A<<<5 + E		  |	   |
	PAND	MM6, %4	; 0f db f2	;  |       |              B & C	   |
	PADDD	%7, [%1]; 0f fe 29	;  A<<<5 + E + W[t]	  |   |	   |
	PANDN	MM7, %5	; 0f df fb	;  |       |   |          |   |   ~B & D
	PADDD	%7, [%8]; 0f fe 2df8a50408;A<<<5 + E + W[t] + K   |   |	   |   |
			; 0f fe 23 if EBX used	   |   |      |	  |   |	   |   |
			;--cache line here if K1 used  |      |	  |   |    |   |
	POR	MM6, MM7; 0f eb f7	;  |       |   |      |  (B & C)|(~B & D)
			;--cache line here if EBX used |      |	  |   |    |   |
	MOVQ	MM7, %3	; 0f 6f f9	; B	   |   |      |	  |   |	   |   |
	PADDD	%7, MM6	; 0f fe ee	;  A<<<5 + E + W[t] + K +((B & C)|(~B & D))
	PSLLD	%3, 30	; 0f 72 f1 1e	; B << 30

	MOVQ	%6, %7	; 0f 6f e5	; "TEMP" (next A) into E
	PSRLD	MM7, 2	; 0f 72 d7 02	; B >> 32-30

	MOVQ	MM6, %7	; 0f 6f f5	; "TEMP" (next A) into E

	PSRLD	%6, 27	; 0f 72 d4 1b	; A >> 32-5 (TEMP)
	POR	%3, MM7	; 0f eb cf	; B <<<30
	ADD	%1, BYTE 8; 83 c1 08	; next word T=T+1
			;-- cache line +1 if K1
			;-- cache line -2 if EBX

%endmacro

				;-- cache line +8
	; These were moved up to the bottom of each previous round. 
	; So this is needed for the first time through.
	MOVQ	MM5, MM0	; 0f 6f e8	; A
	MOVQ	MM6, MM0	; 0f 6f f0	; A
	PSRLD	MM5, 27		; 0f 72 d5 1b	; A >> 32-5

	MOV	ECX, index	; 89 c1
	MOV	bailout, index	; 89 c2
	ADD	bailout, 144	; 81 c2 90 00 00 00 ; faster than ADD DX, WORD 0x90 ; 66 81 c2 90 00
	MOV	EBX, K1		; bb f8 a5 04 08
				;-- cache line +2
	; this is the only loop that doesn't start on a cache line
	; when I do allign this to the start of the line (-2 bytes)
	; it runs slower. So this is the fast version.

	; Alternatively using [K1] and [EBX] is faster than useing only one.

rounds_0_17:
	op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1
	op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, K1
	op_F ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
	op_F ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
	op_F ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
	op_F ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP	ECX, bailout		; 39 d1
JNE rounds_0_17			; 0f 85 82 fe ff ff

	; these two are 'about' 20-30 clocks depending on caches I guess
	op_F ECX, MM0, MM1, MM2, MM3, MM4, MM5, K1	; 18
	op_F ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX	; 19

; = B XOR C XOR D ===========================================================
%macro	op_G 8

	; See above.. for reference..
	;	MOVQ	%7, %2		; A
	;	MOVQ	MM6, %2		; A
	;	PSRLD	%7, 27		; A >> 32-5
			;-- cache line here for first round
	PSLLD	MM6, 5	; 0f 72 f6 05	; A << 5
	MOVQ	MM7, %3	; 0f 6f fd	;                         B
	POR	%7, MM6	; 0f eb de	; A = A<<<5 = (A<<5)|(A>>27)
	PXOR	MM7, %4	; 0f ef f8	; |                       B ^ C
	PADDD	%7, %6	; 0f fe da	; A<<<5 + E		  |   |
	PXOR	MM7, %5	; 0f ef f9	; |       |               B ^ C ^ D
	PADDD	%7, MM7	; 0f fe df	; A<<<5 + E 	       + (B ^ C ^ D)
	PADDD	%7, [%1]; 0f fe 19	; A<<<5 + E + W[t]     + (B ^ C ^ D)
	MOVQ	MM7, %3	; 0f 6f fd	; B	  |   |		  |
			;-- cache line (-4)	  |   |		  |
					; A<<<5 + E + W[t] + K + (B ^ C ^ D)
	PADDD	%7, [%8]; 0f fe 1d 00 a6 04 08 (0x804a600)
			; 0f fe 13 if EBX is used

	PSLLD	%3, 30	; 0f 72 f5 1e	; B << 30
	MOVQ	%6, %7	; 0f 6f d3	; "TEMP" (next A) into E

	PSRLD	MM7, 2	; 0f 72 d7 02	; B >> 32-30
	MOVQ	MM6, %7	; 0f 6f f3	; "TEMP" (next A) into E

	POR	%3, MM7	; 0f eb ef	; B <<<30
	PSRLD	%6, 27	; 0f 72 d2 1b	; A >> 32-5 (TEMP)
	ADD ECX, BYTE 8	; 83 c1 08	; T=T+1
			;-- cache +27 (or -5) if K2  
			;-- cache +23 (or -9) if EBX
%endmacro

	MOV	EBX, K2		; bb 00 a6 04 08 ($0x804a600)
	MOV	bailout, ECX 	; 89 ca
	ADD	bailout, 144 	; 81 c2 90 00 00 00 ; (8*18)-[(2*8) for 18 and 19]
		; It's faster to jump over the NOPs
	JMP rounds_20_37	; e9 12 00 00 00
	NOP ;0
	NOP ;1
	NOP ;2
	NOP ;3
	NOP ;4
	NOP ;5
	NOP ;6
	NOP ;7
	NOP ;8
	NOP ;9
	NOP ;a
	NOP ;b
	NOP ;c
	NOP ;d
	NOP ;e
	NOP ;f
				;-- cache line

	; this is almost exactly 200 clock from here to...
rounds_20_37:
	op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2
	op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2
	op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
	op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
	op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, EBX
	op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
CMP	ECX, bailout	; 39 d1
JNE rounds_20_37	; 0f 85 a6 fe ff ff
	op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, K2 ; 38
	op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, K2 ; 39
	; ... here so about 1.6 clocks per opcode

; this is the slowest one, mostly due to register crunch
;= (B AND C) OR (B AND D) OR (C AND D) ======================================
%macro	op_H 8

	; for reference,
	;	MOVQ	%7, %2		; A
	;	MOVQ	MM6, %2		; A
	;	PSRLD	%7, 27		; A >> 32-5

	PSLLD	MM6, 5	; 0f 72 f6 05	; A << 5
	MOVQ	MM7, %3	; 0f 6f fb	;                                 B
	POR	%7, MM6	; 0f eb ce	; A = A<<<5 = (A<<5)|(A>>27)	  |
	PAND	MM7, %5	; 0f db fd	; |                               B & D
	MOVQ	MM6, %3	; 0f 6f f3	; |                       B	  |   |
	PAND	MM6, %4	; 0f db f4	; |                       B & C	  |   |
	PADDD	%7, %6	; 0f fe c8	; A<<<5 + E		  |   |   |   |
	POR	MM6, MM7; 0f eb f7	; |       |              (B & C)|(B & D)
	PADDD	%7, [%1]; 0f fe 09	; A<<<5 + E + W[t]	  |
	MOVQ	MM7, %4	; 0f 6f fc	; |       |   |   C	  |
	PADDD	%7, [%8]; 0f fe 0b (ebx); A<<<5 + E + W[t] + K	  |
			; 0f fe 0508a60408 (0x804a608)|	  |  |	  |
	PAND	MM7, %5	; 0f db fd	; |       |   |   C & D	  |
	POR	MM6, MM7; 0f eb f7	; |       |   |  (C & D)|(B & C)|(B & D)
	PADDD	%7, MM6	; 0f fe ce	; A<<<5 + E + W[t] + K +((B&C)|(B&D)|(C&D))

	MOVQ	MM7, %3	; 0f 6f fb	; B
	PSRLD	MM7, 2	; 0f 72 d7 02	; B >> 32-30
	MOVQ	%6, %7	; 0f 6f c1	; "TEMP" (next A) into E

	PSLLD	%3, 30	; 0f 72 f3 1e	; B << 30
	MOVQ	MM6, %7	; 0f 6f f1	; "TEMP" (next A) into E

	POR	%3, MM7	; 0f eb df	; B <<<30
	PSRLD	%6, 27	; 0f 72 d0 1b	; A >> 32-5 (TEMP)
	ADD ECX, BYTE 8	; 83 c1 08	; T=T+1

%endmacro

	MOV	EBX, K3		; bb 08 a6 04 08
	MOV	bailout, ECX	; 89 ca
	ADD	bailout, 144	; 81 c2 90 00 00 00
	NOP ;d
	NOP ;e
	NOP ;f

	; This is also about 250 clocks
rounds_40_57:
	op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3
	op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3
	op_H ECX, MM0, MM1, MM2, MM3, MM4, MM5, K3
	op_H ECX, MM5, MM0, MM1, MM2, MM3, MM4, EBX
	op_H ECX, MM4, MM5, MM0, MM1, MM2, MM3, K3
	op_H ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
CMP	ECX, bailout		; 39 d1
JNE rounds_40_57		; 0f 85 47 fe ff ff
	op_H ECX, MM2, MM3, MM4, MM5, MM0, MM1, K3	; 58
	op_H ECX, MM1, MM2, MM3, MM4, MM5, MM0, K3	; 59

	; No new macro definition here.. this is the same as rounds 20..39

	MOV	EBX, K4		; bb 10 a6 04 08 ($0x804a610)
	MOV	bailout, ECX	; 89 ca
	ADD	bailout, 144	; 81 c2 90 00 00 00
	NOP ;d
	NOP ;e
	NOP ;f

rounds_60_77:
	op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4
	op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4
	op_G ECX, MM4, MM5, MM0, MM1, MM2, MM3, EBX
	op_G ECX, MM3, MM4, MM5, MM0, MM1, MM2, EBX
	op_G ECX, MM2, MM3, MM4, MM5, MM0, MM1, EBX
	op_G ECX, MM1, MM2, MM3, MM4, MM5, MM0, EBX
CMP	ECX, bailout		; 39 d1 
JNE rounds_60_77		; 0f 85 a2 fe ff ff
	op_G ECX, MM0, MM1, MM2, MM3, MM4, MM5, K4 ; 78

	; This last step used to be like this...
	; 	op_G ECX, MM5, MM0, MM1, MM2, MM3, MM4, K4 ; 79
	; 	PADDD	MMq, QWORD [z]  foreach qz in 4a, 5b, 0c, 1d, 2e
	;	MOVQ	QWORD [z], MMq  foreach qz in 4a, 5b, 0c, 1d, 2e
	; But I moved the final ADDs and MOVs up into the middle of the
	; last round because there were some free execution slots on my K6-II
	; that could be used... (It's faster this way basically)

				; cache line +21
	PADDD	MM5, QWORD [b]	; 0f fe 2d a8 a9 04 08	; add the original values back in
	PSLLD	MM6, 5		; 0f 72 f6 05		; A << 5
				;-- cache line
	MOVQ	MM7, MM0	; 0f 6f f8		;                         B
	POR	MM4, MM6	; 0f eb e6		; A = A<<<5 = (A<<5)|(A>>27)
	PXOR	MM7, MM1	; 0f ef f9		; |                       B ^ C
	PADDD	MM4, MM3	; 0f fe e3		; A<<<5 + E		  |   |
	MOVQ	QWORD [b], MM5	; 0f 7f 2d a8 a9 04 08	; move final B back out	  |   |
	PXOR	MM7, MM2	; 0f ef fa		; |       |               B ^ C ^ D
	PADDD	MM1, QWORD [d]	; 0f fe 0d b8 a9 04 08	; add the original values back in |
	PADDD	MM4, MM7	; 0f fe e7		; A<<<5 + E            + (B ^ C ^ D)
				;-- cache line		  |	  |		  |   |   |
	PADDD	MM2, QWORD [e]	; 0f fe 15 c8 a9 04 08	; add the original values back in
	PADDD	MM4, [ECX]	; 0f fe 21		; A<<<5 + E + W[t]     + (B ^ C ^ D)
	MOVQ	QWORD [d], MM1	; 0f 7f 0d b8 a9 04 08	; move final D back out	  |   |   |
	PADDD	MM4, [K4]	; 0f fe 25 10 a6 04 08	; A<<<5 + E + W[t] + K + (B ^ C ^ D)
							; slower: PADDD MM4, [EBX] ; 0f fe 23
	MOVQ	MM7, MM0	; 0f 6f f8		; B
	MOVQ	QWORD [e], MM2	; 0f 7f 15 c8 a9 04 08	; move final E back out
				;-- cache line +2
	PSLLD	MM0, 30		; 0f 72 f0 1e		; B << 30
	PADDD	MM4, QWORD [a]	; 0f fe 25 c0 a9 04 08	; add the original values back in
	PSRLD	MM7, 2		; 0f 72 d7 02		; B >> 32-30
	MOVQ	QWORD [a], MM4	; 0f 7f 25 c0 a9 04 08	; move final A back out

	POR	MM0, MM7	; 0f eb c7		; B <<<30
	PADDD	MM0, QWORD [c]	; 0f fe 05 b0 a9 04 08	; add the original values back in
				;-- cache line +2
	MOVQ	QWORD [c], MM0	; 0f 7f 05 b0 a9 04 08	; move final C back out
end:
	MOV	EAX, 0	; b8 00 00 00 00 ; return value, probably not needed
	POP	EDX	; 5a
	POP	ESI	; 5e
	POP	EBX	; 5b
	; using LEAVE ; c9 goes slightly slower than the following...
	MOV	ESP, EBP; 89 ec
	POP	EBP	; 5d
	RET		; c3 		; gcc cleans up stack

Threaded | Top-Level Comments Only

From:

soltice.livejournal.com

Oh God! Assembly! NooooOOOooooooo!
*bursts into flames*

weev.livejournal.com

A++++ excellence

hollow-hope.livejournal.com

A very interesting effort. Did you show this to me before? =o

Sadly this heavy optimization will probably lead to performance drops on K7/K8, and almost definitely on NetBurst.

foxgrrl

I actually ran it on.. I can't remember exactly, a K7 I think, and something else, either a K8 or a P4. Anyway, it ran at the original unoptimized speed. But no slower.

If I reorder the instructions for the K7, the same pattern should be good for the K8 too. Since, internally, the two chips are very similar. (Three ALU units, etc.)

A more recent SSE vectorized implementation is discussed here.

I wanted to do an SSE and Altivec implementation of SHA-1 (et al.) once I had access to some hardware. But by the time I got my K7 and G4 (7450), I had a full-time job, and no time.

The 'key-setup' stuff that SHA-1 does before the Feistel rounds adds alot of overhead too it (several hundred clock cycles), compared to MD5 which doesn't have that. That's why my earlier MD5 implementation was 1000 clock cycles faster. (Aside from 16 fewer rounds too.)

The other things that would make this alot faster:

If MMX had a ROR or ROL equivalent. Amazingly it doesn't. So you have to do the MOV,SHL,SHR,OR trick. This is faster than copying the two halves of the MMX register into one of the general registers and doing ROR/ROL there, because you have to store out to memory, and the reload. There's no way to move data between the MMX and the general x86 registers.
(MMX has alot of stupid things about it. AltiVec is so much better.)

If there was at least one more MMX register. The K8 has eight more in Long mode. But the K8 has alot of other stuff too. The register pressure wouldn't be so bad if there was an MMX Rotate, see #1.

According to this paper: http://www.cosic.esat.kuleuven.be/publications/article-306.pdf referenced in http://www.arctic.org/~dean/crypto/sha1.html If the K6 had seven ALU units and 128-bit wide registers, then each round can saturate all seven pipelines at once. But the K6 only has two ALU units. And only one of those does multiply, and the other does bit shifts. The layout I have is the best that I think you can do for that.

The discussion you refer to, seem to mostly be about optimizing that "key-setup" part at the beginning:

W[t ] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1) W[t+1] = ROL(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1) W[t+2] = ROL(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1) W[t+3] = ROL(W[t ] ^ W[t-5] ^ W[t-11] ^ W[t-13], 1)

Which I got down to about 440 clock cycles... for two independent data blocks. With 128-bit wide registers, and more ALU units, I could probably get it much faster.

Looking at, http://www.arctic.org/~dean/crypto/sha1.html, I think I see what they're doing. They're putting each "column" in one SSE register, and then just XORing everything that way. That is to say:
XMM0 would be: w[t], w[t-1], w[t-2],w[t-3]
XMM1 would be: w[t-5], w[t-6], w[t-7], w[t-8]
XMM2 would be: w[t-11], w[t-12], etc. etc.

Then just PXOR XMM0, XMM1; PXOR XMM0, XMM2; PXOR XMM0, XMM3;etc.

I'm not sure if that'd help me on MMX. Just these sixteen 32-bit values will completely fill all eight 64-bit MMX register. And I'm doing two parallel blocks at once. So I'd still have to do this routine twice in a row.

The...

W[t+3] = ROL( 0 ^ W[t-5] ^ W[t-11] ^ W[t-13], 1) W[t+3] = W[t+3] ^ ROL(W[t], 1)

...trick might be useful. I can't remember if I did anything like that, and I don't feel like looking at this moment.

The (X∧Y)∨(‾X∧Z) ≡ (Y⊕Z)∧X)⊕Z trick I'll have to remember though. I'm not sure if it would have helped at all here though, because I can do (X∧Y) in parallel with (‾X∧Z), but in the other form, I have to do each operation serially. There was another boolean identity that I learned a year ago, that I could apply here. And that's what I was referring to when I said that I though I could shave off another 50-100 clock cycles or so.

魔法少女プリティジューリャ

Mahō Shōjo Puriti Jurya

SHA-1 in 1300 clock cycles. (Warning: Long)

SHA-1 in 1300 clock cycles. (Warning: Long)

no subject

no subject

no subject

K7/K8 Performance

Re: K7/K8 Performance

SSE Implementation.

Profile

May 2023

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags