Crack The Code

Assembly Code Optimization Techniques For The AMD64 Architecture

 

by David Phillips 10/24/2008 

 

 

 

 

Problem Relevance/Background

 

The selection of efficient instruction sequences in the implementation of an algorithm can affect the observed performance of the algorithm's execution time. Therefore, modern compilers must consider the intricacies of the target architecture in order to maximize the speed potential of the code that is generated.

 

We chose to pursue an architectural problem from the hardware/software interface perspective.  Specifically, we considered the view of the compiler writer who has been tasked to implement code generation for a specific hardware architecture. 

 

There is rarely “one way” to author code for a single solution.  However, some implementations may utilize the architectural features more advantageously than others and thus achieve greater performance.  Naturally, we always desire the more “efficient” (fastest) solution.

 

At the fundamental level, the AMD64 Athlon and Opteron processors execute primitive micro-operations: integer arithmetic (add, subtract, etc), integer logic (branch, etc), floating point arithmetic (add, subtract, etc), memory load, and memory store. The processors implement the IA32 complex instruction set where each complex instruction is decoded by the processor into a sequence of macro-operations which are fixed-length instructions that contain a single arithmetic operation and a load/store operation. When the macro-operation is issued by the processor, the scheduler reduces it into a sequence of micro-operations where each micro-operation implements one of the primitive CPU functions.

 

Each complex instruction is classified as either DirectPath Single, DirectPath Double, or VectorPath. A DirectPath-Single is a complex instruction that is decoded by the processor into a single macro-operation. A DirectPath-Double is a complex instruction that is decoded by the processor into two macro-operations. A VectorPath is a complex instruction that is decoded into three or more macro operations.

 

The AMD instruction decoder is capable of decoding and dispatching up to three micro-operations per cycle. Therefore, the processor is capable of decoding and dispatching up to three DirectPath-Single IA32 instructions, or one and a half DirectPath-Double instructions per cycle. VectorPath instructions can take more than a single cycle to decode and dispatch which makes them much more expensive to execute as they cause subsequent DirectPath instructions to stall and wait for the VectorPath instructions to finish being decoded.

 

Problem Selection

 

While gathering information about the AMD64 architecture, we researched a subset of relevant optimization techniques and hypothesized code segments that should “in theory” yield greater performance than other competitive approaches.  This paper investigates which code segments benefit the most from our optimization techniques.

 

 

Goals

            The goal of this project is to identify the optimization techniques that yield the most significant performance gain to CPU-intensive applications while verifying our hypothesis that we postulated during the literature review phase of the project.  We also attempt to identify the optimization techniques that yield insignificant performance gains; or even worse, degrade program performance.

 

Hypothesis

We identified the following classes of optimization techniques that we believed could be examined and analyzed within the allotted time frame for this project.  The optimization techniques are classified into one of the following categories:

 

1.) Decode

2.) Memory Access

3.) Arithmetic

4.) Integer Scheduling

5.) Loop Instruction Overhead

6.) Loop Unrolling

7.) Function Inline

 

Decode Optimization Hypothesis

          Pretense:

Decoding IA32 instructions is cumbersome and complex.  Using single complex instructions rather than multiple simpler instructions reduces the number of instructions that must be decoded into micro-operations and executed by the CPU.

 

Optimization Candidate Code Segment:

mov ebx, DWORD PTR[eax]

add edx, ebx

 

Optimized Code Segment:

add edx, DWORD PTR[eax]

 

Expected Results From Applying The Optimization:

1.) Reduced decoder usage (fewer clock cycles consumed).

2.) Reduced register-pressure (allocated registers in the register file).

3.) Reduced data dependent instructions in the pipeline (fewer pipeline stalls).

 

Memory Access Optimizations Hypothesis

Pretense:

The L1 cache is implemented as 8 separate banks.  Each line in the bank is 8 bytes wide.  If two consecutive load instructions try to read from a different line in the same bank then the second instruction has to wait for the previous instruction to finish.  If no bank conflict occurs, then both loads can complete during the same cycle.  Assuming that each array value is 4 bytes in size, a bank conflict can not occur since both loads read from the same cache line.

 

Optimization Candidate Code Segment:

                        mov  ebx, DWORD PTR[edi]     

            imul ebx, ebx

            mov  edx, DWORD PTR[edi + TYPE intarray]

            imul edx, edx

 

Optimized Code Segment:

                        mov  ebx, DWORD PTR[edi]     

            mov  edx, DWORD PTR[edi + TYPE intarray]

            imul ebx, ebx    

            imul edx, edx            

 

Expected Results From Applying The Optimization:

1.) Increased loading overlap (fewer clock cycles consumed).

                        2.) Increased retired instructions per clock cycle.

 

Pretense:

16 bit address loads are implemented as “vector path” (multiple micro-instructions.  32 bit address loads are implemented as “direct path” (single micro-instruction).  Therefore, loading 32 bit addresses should be faster than loading 16 bit addresses.

 

Optimization Candidate Code Segment:

                        lea ax, value    

 

Optimized Code Segment:

                        lea eax, value   

 

Expected Results From Applying The Optimization:

1.) Fewer cycles consumed.

 

 

Arithmetic Optimization Hypothesis

Pretense:

The AMD64 ALU can divide 16 bit integers faster than 32 bit integers.

 

            Optimization Candidate Code Segment:

                        mov  edx, 0

            mov  eax, 65535

            mov  ebx, 5

            idiv ebx

 

Optimized Code Segment:

                        mov  dx, 0       

            mov  ax, 65535

            mov  bx, 5

            idiv bx

 

Expected Results From Applying The Optimization:

                        1.) Fewer clock cycles consumed.

                        2.) Increased retired instructions per clock cycle.

 

Integer Scheduling Optimizations Hypothesis

Pretense:

Pushing data onto the stack directly from memory is faster than loading the value into a register first and then pushing the register onto the stack.

 

Optimization Candidate Code Segment:

                        lea  ebx, [edi]                    

            push ebx

 

Optimized Code Segment:

                        push [edi]

 

Expected Results From Applying The Optimization:

1.) Reduced Register-pressure (allocated registers in the register file).

2.) Reduced Data dependent instructions in the pipeline (fewer pipeline stalls).

 

Pretense:

Two writes to different portions of the same register is slower than writing to different registers.

           

Optimization Candidate Code Segment:

                        mov bl, 0012h             ; Load the constant into lower word of EBX

                                mov bh, 0000h             ; Load the constant into upper word of EBX

; Instruction has a false dependency on the ; 

;completion of previous instruction since BL and

;BH share EBX.

 

Optimized Code Segment:

                        mov bl, 0012h             ; Load the constant into lower word of EBX

                                mov ah, 0000h            ; Load the constant into upper word of EAX

                                                                                ; No false dependency on the completion

; of previous instruction since BL and AH are ;different ;registers.

 

Expected Results From Applying The Optimization:

1.)    Reduced Data dependent instructions in the pipeline (fewer pipeline stalls).

 

Loop Instruction Overhead Optimization Hypothesis

Pretense:

The IA32 “LOOP” instruction has an 8-cycle latency.  It is faster to use other instructions like decrement (DEC) and jump (JNZ).

 

Optimization Candidate Code Segment:

                        mov  ecx, LENGTHOF intarray              

            loop L1

 

Optimized Code Segment:

                        mov  ecx, LENGTHOF intarray

            dec  ecx

            jnz  L1

 

Expected Results From Applying The Optimization:

                        1.) Reduction in loop overhead latency (fewer consumed clock cycles).

 

 

Loop Unrolling Optimization Hypothesis

Pretense:

Unrolling the body of a loop reduces the total number of iterations that need to be performed which eliminates a great deal of loop overhead and overall faster execution.

 

Optimization Candidate Code Segment:

                        lea   ebx, [edi]     ; Get the next element from memory into register

                                push ebx                     ; Push the next element onto the stack

                                pop  ebx

 

Optimized Code Segment:

                        lea  ebx, [edi]        ; Get the next element from memory into register

                                push ebx                           ; Push the next element onto the stack

                                pop  ebx                           ; Pop the element from the stack

               

                                lea  ebx, 2[edi]     ; Get the next element from memory into register

                                push ebx                           ; Push the next element onto the stack

                                pop  ebx                           ; Pop the element from the stack

               

                                lea  ebx, 4[edi]     ; Get the next element from memory into register

                                push ebx                           ; Push the next element onto the stack

                                pop  ebx                           ; Pop the element from the stack

               

                                lea  ebx, 6[edi]     ; Get the next element from memory into register

                                push ebx                           ; Push the next element onto the stack

                                pop  ebx                           ; Pop the element from the stack

               

                                lea  ebx, 8[edi]       ; Get the next element from memory into register

                                push ebx                    ; Push the next element onto the stack

                                pop  ebx                    ; Pop the element from the stack     

 

Function Inline Optimization Hypothesis

Pretense:

The body of small functions can replace the function call site in order to reduce function call overhead.

 

Optimization Candidate Code Segment:

                        Call DoDivision   ; Perform the division

 

Optimized Code Segment:

                        mov  edx,  0      ; Sign extend dividend.

            mov  eax,  65535  ; Load the dividend

            mov  ebx,  5      ; Load the divisor

            idiv ebx          ; Perform the division

 

Expected Results From Applying The Optimization:

1.) Reduction in function-call overhead latency (fewer consumed clock cycles).

 

Experiment Methodology

 

Using the Microsoft Macro Assembler (MASM), we implemented a series of 15 small sample programs where each program isolates a single optimization technique (or lack thereof).

 

After assembling all of the test programs, all were instrumented and profiled on a machine containing a single core AMD64 Athlon processor.

 

The following lists the test program names and their test relevance:

Program Name

Optimization Class

DIVIDE16.exe

Arithmetic

DIVIDE32.exe

Arithmetic

DIVIDE32FuncCall.exe

Function Inline

LEA16.exe

Memory Access

LEA32.exe

Memory Access

IssueNoOp.exe

Integer Scheduling

IssueWithOp.exe

Integer Scheduling

IssueWithNoLOOP.exe

Loop Instruction Overhead

IssueWithLoopUnrolled

Loop Unrolling

LoadExecuteNoOp.exe

Decode

LoadExecuteWithOp.exe

Decode

MemAccessNoOp.exe

Memory Access

MemAccessWithOp.exe

Memory Access

PartialRWNoOp.exe

Integer Scheduling

PartialRWWithOp.exe

Integer Scheduling

 

 

We instrumented the runtime using AMD’s freely downloadable “Code Analyst” suite to profile each of the test-program’s behavior and collect results such as clock events, cache misses, dispatch stalls, and millisecond time to completion.

 

Our strategy was to compare the runtime data of the optimized implementations to the non-optimized implementations in order to determine which optimization techniques yielded the most significant performance gain.

 

We calculated the following metrics based upon the collected data:

 

Instructions Per Cycle which is calculated by dividing the number of retired instruction events by the number of CPU clock events.  This metric tells us how many instructions the CPU was able to commit (finish) during each clock tick of the machine.  A low IPC is typically an indicator of poor performance.

 

Stalls Per Instruction which is calculated by dividing the number of dispatch stall events by the number of retired instruction events.  This metric gives us a rough average of how many stalls each IA32 instruction encountered.  A high SPI is typically an indicator of poor performance.

 

Data Cache Misses Per Instruction which is calculated by dividing the number of data cache miss events by the number of retired instruction events.  This metric gives us a rough average of how many data cache misses each IA32 instruction encountered.  A high DCM is typically an indicator of poor performance.

 

 

The following page tabulates the runtime data that was collected as well as the figures that were calculated to provide the additional metrics for evaluating the impact of the optimizations on each of the code segments.

 

 

Results Analysis Summary

 

The following table shows the relative performance gains of each optimization:

Optimized Program

Non-Optimized Program

Performance Gain

IssueWithNoLoop.exe

IssueWithNoOp.exe

2x

IssueWithLoopUnrolled.exe

IssueWithNoOp.exe

1.93x

DIVIDE32.exe

DIVIDE16.exe

1.57x

LoadExecuteWithOp.exe

LoadExecuteNoOp.exe

1.22x

DIVIDE32FuncCall.exe

DIVIDE32.exe

1.043x

MemAccessWithOp.exe

MemAccessNoOp.exe

0x

PartialRWWithOp.exe

PartialRWNoOp.exe

0x

IssueWithOp.exe

IssueNoOp.exe

-1.05x

 

            This table shows that the loop optimizations yielded the highest performance gains. Several of the optimizations yielded no performance improvement at all.  One optimization actually yielded worse performance than the non-optimized program. 

 

Experiment Results Discussion

 

Decode Optimization Results

 

LoadExecuteNoOp vs. LoadExecuteWithOp

 

This experiment confirmed our expectations.

 

Data cache misses were nearly the same for both programs.

 

The non-optimized program required 273144 / 225221 = 1.21x more cycles than the optimized program.

 

The non-optimized program required 2508666 / 2008493 = 1.25x more instructions than the optimized program.

 

The optimized program introduced .095736654 / .064443493 = 1.49x more stalls than the non-optimized program.

 

The non-optimized program took 130687 / 107365 = 1.22x longer to finish than the optimized program.

 

Even though the optimized program had a higher stall rate, the overall reduction in instructions and cycles created an overall net-performance gain.

 

 

 

 

Memory Access Optimization Results

 

            MemAccessNoOp vs. MemAccessWithOp

 

This experiment did not confirm our expectations.

 

There was no significantly observable difference between either of these programs. 

 

Both executed roughly the same number of cycles in the same period of time with the similar stall and cache miss occurrences. 

 

We are guessing that the same micro-operations get generated even for the optimized program.

 

LEA16 vs. LEA32

 

This experiment did not confirm our expectations.

 

There was no significantly observable difference between either of these programs. 

 

Both executed roughly the same number of cycles in the same period of time with the similar stall and cache miss occurrences.

 

Arithmetic Optimization Results

 

DIVIDE32 vs. DIVIDE16

 

This experiment confirmed our expectations.

 

Data cache misses were nearly the same for both programs.

 

Both programs executed roughly the same number of instructions. 

 

However, the DIVIDE32 took 44186 / 28232 = 1.57x as many cycles as DIVIDE16 to finish.

 

DIVIDE16 finished 21898 / 13979 = 1.57x faster than DIVIDE32.

 

The instructions per cycle decreased by 0.177823038 / 0.113662699 = 1.56x for the DIVIDE32.

 

The stalls per instruction ratio increased by slightly in the DIVIDE32 program .039177269 / .036802582 = 1.0036x.

 

As expected, the 32bit division appeared to run significantly slower than 16 bit division.

 

Integer Scheduling Optimization Results

 

IssueNoOp vs. IssueWithOp

 

This experiment did not confirm our expectations.

 

The optimized program required 271579 / 256887 = 1.06x more cycles than the non-optimized program.

 

The non-optimized program required 2504269 / 2007048 = 1.25x more instructions.

 

The optimized program had 315596 / 941 = 3468.1x more cache misses than the non optimized program.

 

The optimized program achieved 0.974852367 / 0.739029159 = 1.32x fewer instructions per cycle than the non-optimized program.

 

It took the optimized program 131240 / 124485 = 1.05x longer to finish than the non-optimized program.

 

While the optimization did reduce the code density, the cache miss rate increased greatly which diminished any performance returns to the point that the performance actually became worse.

 

PartialRWNoOp vs. PartialRWWithOp

 

This experiment did not confirm our expectations.

 

Both of these programs had very close to the same performance profile.

 

We expected to see the number of stalls reduce in the optimized program since the false dependencies were eliminated.

 

However both had about the same number of measured stalls and both finished execution in about the same amount of time.

 

 

Loop Instruction Optimization Results

 

IssueNoOp vs. IssueWithNoLOOP

 

This experiment confirmed our expectations.

 

Data cache misses were nearly the same for both programs.

 

Replacing the LOOP instruction with dec/jump had the following effects:

 

            Number of cycles reduced by 256887 / 134073 = 1.91x

 

            Instruction count increased by 3005102 / 2504269 = 1.20x

 

            Stalls increased by .041105427 / .015574844 = 2.64x

 

            Total runtime decreased by 124485 / 62244 = 2.0x

           

While the overall instruction count and stall count increased, the total number of cycles needed was reduced by almost half which allowed a great performance gain.

 

Loop Unrolling Optimization Results

 

IssueNoOp vs. IssueWithLoopUnrolled

 

This experiment confirmed our expectations.

 

Unrolling the loop body 5 times had the following effects:

 

                        Number of cycles reduced by 256887 / 137075 = 1.87x

 

                        Number of instructions reduced by 2504269 / 1707411 = 1.47x

 

                        Cache miss rate was slightly less 941 / 696 = 1.35x

 

                        Instructions per cycle increased 1.245603502 / .974852367 = 1.28x

 

                        Stalls increased by .182267773 / .015574844 = 11x

 

                        Total runtime decreased by 124485 / 64344 = 1.93x

 

Even though more stalls were introduced by the optimization, the total number of required cycles decreased significantly.  Much of the loop overhead was removed which yielded an overall net performance increase.

 

Function Inline Optimization Results

 

DIVIDE32 vs. DIVIDE32FuncCall

 

This experiment confirmed our expectations.

 

Data cache misses were nearly the same for both programs.

 

The DIVIDE32FuncCall required 69730 / 50223 = 1.39x more instructions than the in-lined DIVIDE32.

 

The instructions per cycle increased in the DIVIDE32FuncCall, but this is most likely a false positive as the function call overhead introduced more instructions into the pipeline.

 

The inline DIVIDE32 program finished 22858 / 21898 = 1.043x faster than the function call implementation.

 

The number of stalls increased by .043843396 / .039177269 = 1.12x in the function call implementation.

 

The function call implementation required 45925 / 44186 = 1.04x as many clock cycles as the inline version.

 

As expected, the added overhead of the function call made a noticeable impact on performance.  Also note that we did not pass any parameters to the function.  Had parameters been passed, it could be expected that the overhead would increase.

 

Conclusion

 

          The analysis of each test program yielded varying performance-gain results.  All of the test programs were loop intensive in order to profile the runtime performance with the CPU at maximum usage and to collect a sampling that ranged over a “long” time-span.  The test programs were also intentionally kept small enough to isolate the single optimization block or optimizable block so that the relative performance gains could be more accurately compared without having to consider irrelevant computations.

 

Due to the programs being loop biased, it follows that the loop optimization techniques appear to yield the highest performance gain.  However, the loop intensiveness of the programs still provides a similar runtime behavior to most computation-intensive applications which are the most likely candidates to benefit from such optimizations.

 

            If we had more time to continue this experiment, we would like to create two monolithic programs where one runs all of the optimized code blocks and the other runs all of the non-optimized blocks in order to simulate a real intense computation.  Due to conflicting data types and jump labels between each test program, merging all of them together was going to take more time than was available.

 

            We would also like to look deeper into the cause of the spike in cache misses that caused the performance loss on the Integer Scheduling optimization.  We were unable to successfully find any literature in the AMD documentation that explained the result that we were encountering.  Determining the cause will require more research and in-depth analysis.

 

            Another area for more research would include additional tests for other optimization techniques.  Due to our limiting time factor, we had to reduce our set of optimizations to a number that met our time budgeting constraints.  The optimizations considered in this paper are a subset of possibilities.

 

 

References

 

Software Optimization Guide For AMD64 Processors.

www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

 

AMD64 Architecture Programmer's Manual Volume 1: Application Programming Rev 3.14

www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.PDF

 

AMD64 Architecture Programmer's Manual Volume 2: System Programming Rev 3.14

www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.PDF

 

AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and system Instructions Rev 3.14

www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.PDF

 

Athlon 64 vs. Opteron Processors : What's the difference again?

http://developer.amd.com/articlex.jsp?id=49

 

An Introduction to Analysis and Optimization with AMD CodeAnalyst

http://developer.amd.com/article_print.jsp?id=2