Assembly Code
Optimization Techniques For The AMD64 Architecture
by David Phillips
10/24/2008
Problem Relevance/Background
The selection of efficient instruction sequences in the implementation of an algorithm can affect the observed performance of the algorithm's execution time. Therefore, modern compilers must consider the intricacies of the target architecture in order to maximize the speed potential of the code that is generated.
We chose to pursue an architectural problem from the hardware/software interface perspective. Specifically, we considered the view of the compiler writer who has been tasked to implement code generation for a specific hardware architecture.
There is rarely “one way” to author code for a single solution. However, some implementations may utilize the architectural features more advantageously than others and thus achieve greater performance. Naturally, we always desire the more “efficient” (fastest) solution.
At the fundamental level, the AMD64 Athlon and Opteron processors execute primitive micro-operations: integer arithmetic (add, subtract, etc), integer logic (branch, etc), floating point arithmetic (add, subtract, etc), memory load, and memory store. The processors implement the IA32 complex instruction set where each complex instruction is decoded by the processor into a sequence of macro-operations which are fixed-length instructions that contain a single arithmetic operation and a load/store operation. When the macro-operation is issued by the processor, the scheduler reduces it into a sequence of micro-operations where each micro-operation implements one of the primitive CPU functions.
Each complex instruction is classified as either DirectPath Single, DirectPath Double, or VectorPath. A DirectPath-Single is a complex instruction that is decoded by the processor into a single macro-operation. A DirectPath-Double is a complex instruction that is decoded by the processor into two macro-operations. A VectorPath is a complex instruction that is decoded into three or more macro operations.
The AMD instruction decoder is capable of decoding and dispatching up to three micro-operations per cycle. Therefore, the processor is capable of decoding and dispatching up to three DirectPath-Single IA32 instructions, or one and a half DirectPath-Double instructions per cycle. VectorPath instructions can take more than a single cycle to decode and dispatch which makes them much more expensive to execute as they cause subsequent DirectPath instructions to stall and wait for the VectorPath instructions to finish being decoded.
Problem Selection
While gathering information about the AMD64 architecture, we researched a subset of relevant optimization techniques and hypothesized code segments that should “in theory” yield greater performance than other competitive approaches. This paper investigates which code segments benefit the most from our optimization techniques.
Goals
The goal of this project is to identify the optimization techniques that yield the most significant performance gain to CPU-intensive applications while verifying our hypothesis that we postulated during the literature review phase of the project. We also attempt to identify the optimization techniques that yield insignificant performance gains; or even worse, degrade program performance.
Hypothesis
We identified the following classes of optimization techniques that we believed could be examined and analyzed within the allotted time frame for this project. The optimization techniques are classified into one of the following categories:
1.) Decode
2.) Memory Access
3.) Arithmetic
4.) Integer Scheduling
5.)
6.)
7.) Function Inline
Decode Optimization
Hypothesis
Pretense:
Decoding IA32 instructions is cumbersome and complex.
Using single complex instructions rather than multiple simpler instructions
reduces the number of instructions that must be decoded into micro-operations and
executed by the CPU.
Optimization Candidate Code Segment:
mov
ebx, DWORD PTR[eax]
add
edx, ebx
Optimized Code Segment:
add edx, DWORD PTR[eax]
Expected Results From Applying The Optimization:
1.) Reduced decoder usage (fewer clock cycles consumed).
2.) Reduced register-pressure (allocated registers in the register file).
3.) Reduced data dependent instructions in the pipeline (fewer pipeline stalls).
Memory Access Optimizations
Hypothesis
Pretense:
The L1 cache is implemented as 8 separate banks.
Each line in the bank is 8 bytes wide.
If two consecutive load instructions try to read from a different line in
the same bank then the second instruction has to wait for the previous instruction
to finish. If no bank conflict occurs,
then both loads can complete during the same cycle.
Assuming that each array value is 4 bytes in size, a bank conflict can not
occur since both loads read from the same cache line.
Optimization Candidate Code Segment:
mov
ebx, DWORD PTR[edi]
imul
ebx, ebx
mov
edx, DWORD PTR[edi + TYPE intarray]
imul
edx, edx
Optimized Code Segment:
mov
ebx, DWORD PTR[edi]
mov
edx, DWORD PTR[edi + TYPE intarray]
imul
ebx, ebx
imul
edx, edx
Expected Results From Applying The Optimization:
1.) Increased loading overlap (fewer clock cycles consumed).
2.) Increased retired instructions per clock cycle.
Pretense:
16 bit address loads are implemented as “vector path” (multiple micro-instructions. 32 bit address loads are implemented
as “direct path” (single micro-instruction).
Therefore, loading 32 bit addresses should be faster than loading 16 bit addresses.
Optimization Candidate Code Segment:
lea ax, value
Optimized Code Segment:
lea eax, value
Expected Results From Applying The Optimization:
1.) Fewer cycles consumed.
Arithmetic Optimization
Hypothesis
Pretense:
The AMD64 ALU can divide 16 bit integers faster than 32 bit integers.
Optimization Candidate Code Segment:
mov edx, 0
mov
eax, 65535
mov
ebx, 5
idiv ebx
Optimized Code Segment:
mov
dx, 0
mov ax, 65535
mov bx, 5
idiv
bx
Expected Results From Applying The Optimization:
1.) Fewer clock cycles consumed.
2.) Increased retired instructions per clock
cycle.
Integer Scheduling Optimizations
Hypothesis
Pretense:
Pushing data onto the stack directly from memory is faster than loading the value
into a register first and then pushing the register onto the stack.
Optimization Candidate Code Segment:
lea
ebx, [edi]
push
ebx
Optimized Code Segment:
push [edi]
Expected Results From Applying The Optimization:
1.) Reduced Register-pressure (allocated registers in the register file).
2.) Reduced Data dependent instructions in the pipeline (fewer pipeline stalls).
Two writes to different portions of the same register is slower than writing to
different registers.
Optimization Candidate Code Segment:
mov bl, 0012h
; Load the constant into lower word of EBX
mov bh, 0000h
; Load the constant into upper word of EBX
; Instruction has a false dependency on the ;
;completion of previous instruction since BL and
;BH share EBX.
Optimized Code Segment:
mov bl, 0012h
; Load the constant into lower word of EBX
mov ah, 0000h
; Load the constant into upper word of EAX
; No false dependency on the completion
; of previous instruction since BL and AH are ;different
;registers.
Expected Results From Applying The Optimization:
1.) Reduced Data dependent instructions in the pipeline (fewer pipeline stalls).
Loop Instruction Overhead Optimization
Hypothesis
Pretense:
The IA32 “LOOP”
instruction has an 8-cycle latency.
It is faster to use other instructions like decrement (DEC) and jump (JNZ).
Optimization Candidate Code Segment:
mov
ecx, LENGTHOF intarray
loop
L1
Optimized Code Segment:
mov
ecx, LENGTHOF intarray
dec ecx
jnz L1
Expected Results From Applying The Optimization:
1.) Reduction in loop overhead latency (fewer consumed clock cycles).
Pretense:
Unrolling the body of a loop reduces the total number of iterations that need to
be performed which eliminates a great deal of loop overhead and overall faster execution.
Optimization Candidate Code Segment:
lea
ebx, [edi]
; Get the next element from memory into register
push
ebx
; Push the next element
onto the stack
pop ebx
Optimized Code Segment:
lea
ebx, [edi]
; Get the next element from memory into register
push ebx
; Push
the next element onto the stack
pop
ebx
; Pop the element from the stack
lea
ebx, 2[edi]
; Get the next element from memory into register
push ebx
; Push
the next element onto the stack
pop
ebx
; Pop the element from the stack
lea
ebx, 4[edi] ; Get the next element from memory
into register
push ebx
; Push
the next element onto the stack
pop
ebx
; Pop the element from the stack
lea
ebx, 6[edi]
; Get the next element from memory into register
push ebx
; Push
the next element onto the stack
pop
ebx
; Pop the element from the stack
lea
ebx, 8[edi]
; Get the next element from memory into register
push ebx
; Push the next element
onto the stack
pop
ebx
; Pop the element from the stack
Function Inline Optimization
Hypothesis
Pretense:
The body of small functions can replace the function call site in order to reduce function call overhead.
Optimization Candidate Code Segment:
Call
DoDivision ; Perform the division
Optimized Code Segment:
mov edx,
0 ; Sign extend dividend.
mov
eax, 65535
; Load the dividend
mov
ebx, 5
; Load the divisor
idiv ebx
; Perform the division
Expected Results From Applying The Optimization:
1.) Reduction in function-call overhead latency (fewer consumed clock cycles).
Experiment Methodology
Using the Microsoft Macro Assembler (MASM), we implemented a series of 15 small sample programs where each program isolates a single optimization technique (or lack thereof).
After assembling all of the test programs, all were instrumented and profiled on a machine containing a single core AMD64 Athlon processor.
The following lists the test program names and their test relevance:
|
Program Name |
Optimization Class |
|
DIVIDE16.exe |
Arithmetic |
|
DIVIDE32.exe |
Arithmetic |
|
DIVIDE32FuncCall.exe |
Function Inline |
|
LEA16.exe |
Memory Access |
|
LEA32.exe |
Memory Access |
|
IssueNoOp.exe |
Integer Scheduling |
|
IssueWithOp.exe |
Integer Scheduling |
|
IssueWithNoLOOP.exe |
|
|
IssueWithLoopUnrolled |
Loop Unrolling |
|
LoadExecuteNoOp.exe |
Decode |
|
LoadExecuteWithOp.exe |
Decode |
|
MemAccessNoOp.exe |
Memory Access |
|
MemAccessWithOp.exe |
Memory Access |
|
PartialRWNoOp.exe |
Integer Scheduling |
|
PartialRWWithOp.exe |
Integer Scheduling |
We instrumented the runtime using AMD’s freely downloadable “Code Analyst” suite
to profile each of the test-program’s behavior and collect results such as clock
events, cache misses, dispatch stalls, and millisecond time to completion.
Our strategy was to compare the runtime data of the optimized implementations to the non-optimized implementations in order to determine which optimization techniques yielded the most significant performance gain.
We calculated the following metrics based upon the collected data:
Instructions Per Cycle which is calculated by dividing the number of retired instruction events by the number of CPU clock events. This metric tells us how many instructions the CPU was able to commit (finish) during each clock tick of the machine. A low IPC is typically an indicator of poor performance.
Stalls Per Instruction which is calculated by dividing the number of dispatch stall events by the number of retired instruction events. This metric gives us a rough average of how many stalls each IA32 instruction encountered. A high SPI is typically an indicator of poor performance.
Data Cache Misses Per Instruction which is calculated by dividing the number of data cache miss events by the number of retired instruction events. This metric gives us a rough average of how many data cache misses each IA32 instruction encountered. A high DCM is typically an indicator of poor performance.
The following page tabulates the runtime data that was collected as well as the
figures that were calculated to provide the additional metrics for evaluating the
impact of the optimizations on each of the code segments.
Results Analysis
Summary
The following table shows the relative performance gains of each optimization:
|
Optimized Program |
Non-Optimized Program |
Performance Gain |
|
IssueWithNoLoop.exe |
IssueWithNoOp.exe |
2x |
|
IssueWithLoopUnrolled.exe |
IssueWithNoOp.exe |
1.93x |
|
DIVIDE32.exe |
DIVIDE16.exe |
1.57x |
|
LoadExecuteWithOp.exe |
LoadExecuteNoOp.exe |
1.22x |
|
DIVIDE32FuncCall.exe |
DIVIDE32.exe |
1.043x |
|
MemAccessWithOp.exe |
MemAccessNoOp.exe |
0x |
|
PartialRWWithOp.exe |
PartialRWNoOp.exe |
0x |
|
IssueWithOp.exe |
IssueNoOp.exe |
-1.05x |
This table shows that the loop optimizations yielded the highest performance gains. Several of the optimizations yielded no performance improvement at all. One optimization actually yielded worse performance than the non-optimized program.
Experiment Results
Discussion
Decode Optimization Results
LoadExecuteNoOp vs. LoadExecuteWithOp
This experiment confirmed our expectations.
Data cache misses were nearly the same for both programs.
The non-optimized program required 273144 / 225221 = 1.21x more cycles than the
optimized program.
The non-optimized program required 2508666 / 2008493 = 1.25x more instructions than
the optimized program.
The optimized program introduced .095736654 / .064443493 = 1.49x more stalls than
the non-optimized program.
The non-optimized program took 130687 / 107365 = 1.22x longer to finish than the
optimized program.
Even though the optimized program had a higher stall rate, the overall reduction
in instructions and cycles created an overall net-performance gain.
Memory Access Optimization Results
MemAccessNoOp vs. MemAccessWithOp
This experiment did not confirm our expectations.
There was no significantly observable difference
between either of these programs.
Both executed roughly the same number of cycles
in the same period of time with the similar stall and cache miss occurrences.
We are guessing that the same micro-operations
get generated even for the optimized program.
LEA16 vs. LEA32
This experiment did not confirm our expectations.
There was no significantly observable difference
between either of these programs.
Both executed roughly the same number of cycles
in the same period of time with the similar stall and cache miss occurrences.
Arithmetic Optimization Results
DIVIDE32 vs. DIVIDE16
This experiment confirmed our expectations.
Data cache misses were nearly the same for
both programs.
Both programs executed roughly the same number
of instructions.
However, the DIVIDE32 took 44186 / 28232 =
1.57x as many cycles as DIVIDE16 to finish.
DIVIDE16 finished 21898 / 13979 = 1.57x faster
than DIVIDE32.
The instructions per cycle decreased by 0.177823038
/ 0.113662699 = 1.56x for the DIVIDE32.
The stalls per instruction ratio increased
by slightly in the DIVIDE32 program .039177269 / .036802582 = 1.0036x.
As expected, the 32bit division appeared to
run significantly slower than 16 bit division.
Integer Scheduling Optimization Results
IssueNoOp vs. IssueWithOp
This experiment did not confirm our expectations.
The optimized program required 271579 / 256887 = 1.06x more cycles than the non-optimized
program.
The non-optimized program required 2504269 / 2007048 = 1.25x more instructions.
The optimized program had 315596 / 941 = 3468.1x more cache misses than the non
optimized program.
The optimized program achieved 0.974852367 / 0.739029159 = 1.32x fewer instructions
per cycle than the non-optimized program.
It took the optimized program 131240 / 124485 = 1.05x longer to finish than the
non-optimized program.
While the optimization did reduce the code density, the cache miss rate increased greatly which diminished any performance returns to the point that the performance actually became worse.
PartialRWNoOp vs. PartialRWWithOp
This experiment did not confirm our expectations.
Both of these programs had very close to the same performance profile.
We expected to see the number of stalls reduce in the optimized program since the
false dependencies were eliminated.
However both had about the same number of measured stalls and both finished execution in about the same amount of time.
IssueNoOp vs. IssueWithNoLOOP
This experiment confirmed our expectations.
Data cache misses were nearly the same for both programs.
Replacing the
Number of cycles reduced by 256887 / 134073 = 1.91x
Instruction count increased by 3005102 / 2504269 = 1.20x
Stalls increased by .041105427 / .015574844 = 2.64x
Total runtime decreased by 124485 / 62244 = 2.0x
While the overall instruction count and stall count increased, the total number
of cycles needed was reduced by almost half which allowed a great performance gain.
IssueNoOp vs. IssueWithLoopUnrolled
This experiment confirmed our expectations.
Unrolling the loop body 5 times had the following
effects:
Number of cycles reduced by 256887 / 137075 = 1.87x
Number of instructions reduced by 2504269 / 1707411
= 1.47x
Cache miss rate was slightly less 941 / 696 = 1.35x
Instructions per cycle increased 1.245603502 / .974852367
= 1.28x
Stalls increased by .182267773 / .015574844 = 11x
Total runtime decreased by 124485 / 64344 = 1.93x
Even though more stalls were introduced by
the optimization, the total number of required cycles decreased significantly. Much of the loop overhead was removed
which yielded an overall net performance increase.
Function Inline Optimization Results
DIVIDE32 vs. DIVIDE32FuncCall
This experiment confirmed our expectations.
Data cache misses were nearly the same for both programs.
The DIVIDE32FuncCall required 69730 / 50223 = 1.39x more instructions than the in-lined
DIVIDE32.
The instructions per cycle increased in the DIVIDE32FuncCall, but this is most likely
a false positive as the function call overhead introduced more instructions into
the pipeline.
The inline DIVIDE32 program finished 22858 / 21898 = 1.043x faster than the function
call implementation.
The number of stalls increased by .043843396 / .039177269 = 1.12x in the function
call implementation.
The function call implementation required 45925 / 44186 = 1.04x as many clock cycles
as the inline version.
As expected, the added overhead of the function call made a noticeable impact on performance. Also note that we did not pass any parameters to the function. Had parameters been passed, it could be expected that the overhead would increase.
Conclusion
The analysis of each test program yielded varying performance-gain results. All of the test programs were loop intensive in order to profile the runtime performance with the CPU at maximum usage and to collect a sampling that ranged over a “long” time-span. The test programs were also intentionally kept small enough to isolate the single optimization block or optimizable block so that the relative performance gains could be more accurately compared without having to consider irrelevant computations.
Due to the programs being loop biased, it follows that the loop optimization techniques appear to yield the highest performance gain. However, the loop intensiveness of the programs still provides a similar runtime behavior to most computation-intensive applications which are the most likely candidates to benefit from such optimizations.
If we had more time to continue this experiment, we would like to create two monolithic programs where one runs all of the optimized code blocks and the other runs all of the non-optimized blocks in order to simulate a real intense computation. Due to conflicting data types and jump labels between each test program, merging all of them together was going to take more time than was available.
We would also like to look deeper into the cause of the spike in cache misses that caused the performance loss on the Integer Scheduling optimization. We were unable to successfully find any literature in the AMD documentation that explained the result that we were encountering. Determining the cause will require more research and in-depth analysis.
Another area for more research would include additional tests for other optimization techniques. Due to our limiting time factor, we had to reduce our set of optimizations to a number that met our time budgeting constraints. The optimizations considered in this paper are a subset of possibilities.
References
Software Optimization Guide For AMD64 Processors.
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
AMD64 Architecture Programmer's Manual Volume 1:
Application Programming Rev 3.14
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.PDF
AMD64 Architecture Programmer's Manual Volume 2:
System Programming Rev 3.14
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.PDF
AMD64 Architecture Programmer's Manual Volume 3:
General-Purpose and system Instructions Rev 3.14
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.PDF
Athlon 64 vs. Opteron Processors : What's the difference
again?
http://developer.amd.com/articlex.jsp?id=49
An Introduction to Analysis and Optimization with
AMD CodeAnalyst