Performance computing can be an interesting task to undertake, as it lets us get under the hood of a complex abstraction (C++) and talk to the machine directly.
In the next two posts we’ll take a look at the Fibonacci algorithm used in a few other posts and see just how fast we can make it by looking at loop implementation, data type usage, and assembly code optimizations. In this first post we’ll take a look at some basic timing code, what gcc will do for us using the -O2 flag, then in the second post, see what we can do (if anything) to improve upon the generated code.
The Algorithm
The traditional use of the Fibonacci sequence is to show off the use of recursive function calls. The only problem with this is for every recursive call we dump the result of the previous call back on the functions call stack. For a few recursions this isn’t that bad, but in the case of a Fibonacci sequence it can lead to millions of calls and loads of memory–the most common end result then is too many calls will simply stall or crash the application.
The solution then is to use an iterative approach to solving the sequence.
Recall that the fib sequence is simply the sum of the previous two numbers in the sequence:
1+1=2, 1+2=3, 2+3=5, 3+5=8, 5+8=13, 8+13=21 [...]
For the sake of my implementation, we need to tell the loop how many times to run, and to shift the variables values down by 1 for each loop.
So on the first turn we set x and y to be 1 and 1 respectively. We then add them together, placing the value into a third variable.
We then swap the values of x and y down “1″, with x getting y’s old value and y getting z.
-
int iterations = 10;
-
int i;
-
long values[iterations];
-
long x,y,z;
-
-
x = 1;
-
y = 1;
-
-
// standard loop
-
for(i = 0; i < iterations; i++){
-
z = x + y;
-
x = y;
-
y = z;
-
values[i] = z;
-
}
So how can we make this run as fast as possible? The first step is to figure out how fast it currently is. With modern CPU’s being as fast as they are seconds and milliseconds simply won’t do. Thus, we need to look at the finest grain counter an x86 CPU offers: CPU clocks.
To count clocks we’ll implement a CPU clock counting benchmark routine valid on gcc and x86-64 hardware:
-
/*
-
* File: main.cpp
-
* Author: M. Grdinic
-
*
-
* Created on July 19, 2010, 3:30 PM
-
*/
-
-
#include <stdlib.h>
-
#include <iostream>
-
-
using namespace std;
-
-
typedef long long __int64;
-
-
inline __int64 rdtsc() {
-
__int64 hi, lo;
-
__asm__ __volatile__(
-
"xorl %%eax, %%eax;\n\t"
-
"push %%rbx;"
-
"cpuid\n\t"
-
::
-
:"%rax", "%rbx", "%rcx", "%rdx");
-
__asm__ __volatile__(
-
"rdtsc;"
-
: "=a" (lo), "=d" (hi)
-
::);
-
__asm__ __volatile__(
-
"xorl %%eax, %%eax; cpuid;"
-
"pop %%rbx;"
-
::
-
:"%rax", "%rbx", "%rcx", "%rdx");
-
-
return (__int64)hi << 32 | lo;
-
}
-
-
#define COUNT 20
-
-
void tester(__int64 overhead[], __int64 clocks[])
-
{
-
-
__int64 s, f, t1, t2;
-
int x_i; int x_it = 0;
-
-
// set locals
-
int iterations = 20;
-
int i;
-
long values[iterations];
-
long x,y,z;
-
-
x = 1;
-
y = 1;
-
-
// LOOP CONTROL CODE
-
for(x_it = 0; x_it < COUNT; x_it++){
-
-
for(x_i = 0; x_i < 3; x_i++){
-
t1 = rdtsc();
-
t2 = rdtsc();
-
overhead[x_it] = t2 - t1;
-
}
-
-
s = rdtsc();
-
-
// CODE START
-
-
// standard loop
-
asm("nop");
-
for(i = 0; i < iterations; i++){
-
z = x + y;
-
x = y;
-
y = z;
-
values[i] = z;
-
}
-
asm("nop");
-
-
// CODE END
-
-
f = rdtsc();
-
-
// reset locals values
-
x = 1;
-
y = 1;
-
-
clocks[x_it] = (f - s) - overhead[x_it];
-
-
}
-
-
for(i = 0; i < iterations; i++){
-
cout << values[i] << endl;
-
}
-
-
}
-
-
-
int main() {
-
-
__int64 overhead[COUNT], clocks[COUNT];
-
-
// run test
-
tester(overhead, clocks);
-
-
// print results
-
for(int i = 0; i < COUNT; i++){
-
cout << "Overhead:" << overhead[i] << "\tClocks: " << clocks[i] << endl;
-
}
-
-
return (EXIT_SUCCESS);
-
}
The basic idea is we define a number for how many tests to perform, then loop over the code this many times. On each iteration we call our inline rdtsc() routine which is written in assembly. It’s important to note that with modern processors we need to serialize the instruction stream (via CPUID), as well as issue clobbers for the effected registers from the CPUID call. We’ll also measure the overhead of calling CPUID and subtract this from the final result. In the end this routine allow us to get a good feel for how many CPU clocks our code consumed.
Finally, as a general point of interest, context switches present in all modern operating systems can and will effect the clock count of very large functions. Thus, in our case we make sure to only benchmark small sections of code.
With our counting routine ready for action the first thing I wanted to look at was the base performance. Running through the first 8 iterations shows:
Overhead:440 Clocks: 363
Overhead:440 Clocks: 286
Overhead:440 Clocks: 407
Overhead:440 Clocks: 308
Overhead:440 Clocks: 286
Overhead:440 Clocks: 286
Overhead:440 Clocks: 286
Overhead:440 Clocks: 286
In other words, after a cache warm up period we stabilize at 286 cycles.
We’ll then add the -O2 optimization flag to the compiler directives (in NetBeans, open your projects options page and select the C++ Compiler options > Additional Options) and see what we get:
Overhead:429 Clocks: 121
Overhead:429 Clocks: 99
Overhead:429 Clocks: 66
Overhead:429 Clocks: 55
Overhead:429 Clocks: 66
Overhead:429 Clocks: 55
Overhead:429 Clocks: 66
Overhead:429 Clocks: 66
To make sense of this we’ll check out the assembly for each:
First, the unoptimized version:
-
movl $0, -40(%rbp)
-
jmp .L7
-
.L8:
-
movq -88(%rbp), %rax
-
movq -80(%rbp), %rdx
-
leaq (%rdx,%rax), %rax
-
movq %rax, -96(%rbp)
-
movq -88(%rbp), %rax
-
movq %rax, -80(%rbp)
-
movq -96(%rbp), %rax
-
movq %rax, -88(%rbp)
-
movl -40(%rbp), %edx
-
movq -112(%rbp), %rax
-
movslq %edx,%rdx
-
movq -96(%rbp), %rcx
-
movq %rcx, (%rax,%rdx,8)
-
addl $1, -40(%rbp)
-
.L7:
-
movl -40(%rbp), %eax
-
cmpl -36(%rbp), %eax
-
setl %al
-
testb %al, %al
-
jne .L8
And the -O2 optimized version:
-
xorl %eax, %eax
-
movl $1, %edx
-
movl $1, %ebx
-
.p2align 4,,10
-
.p2align 3
-
.L6:
-
leaq (%rdx,%rbx), %rcx
-
movq %rdx, %rbx
-
movq %rcx, (%r12,%rax)
-
addq $8, %rax
-
movq %rcx, %rdx
-
cmpq $80, %rax
-
jne .L6
The first thing that jumps out is the loop counting–whereas in the unoptimized version we count using the 16-bit registers against main memory, the optimized version adds ‘logic’ to the process, using cmpq to check an immediate value from a register (%rax).
This explains some of the speedup, but not all. The get to bottom of why the optimized code is so much smaller (faster) we can take a huge hint from the number of raw instructions, of which the optimized code of course has far fewer.
Part of this is a side-effect of the 64-bit architecture we’re compiling too, as %rdx and %rbx are leaq’d to %rcx to produce code which stores loop values in registers at all times, as opposed to the unoptimized code which continually pushes values to the call stack. This eliminates almost all of the movq’s which certainly helps, but far more importantly, we no longer access main memory.
All that said, it still doesn’t explain everything in terms of raw performance.
The big hint comes from recompiling the application to use only 4 Fibonacci loops. It’s a bit hard to see, so I’ve added an intermediate asm(“nop”) in the loop so it becomes clear:
-
#NO_APP
-
movq $2, (%r12)
-
#APP
-
# 72 "main.cpp" 1
-
nop
-
# 0 "" 2
-
#NO_APP
-
movq $3, 8(%r12)
-
#APP
-
# 72 "main.cpp" 1
-
nop
-
# 0 "" 2
-
#NO_APP
-
movq $5, 16(%r12)
-
#APP
-
# 72 "main.cpp" 1
-
nop
-
# 0 "" 2
-
#NO_APP
-
movq $8, 24(%r12)
-
#APP
gcc has unrolled the loop for us. Of course we only have so many registers, so one more loop to make 5 means we lose the unroll, and we’re back to standard loop code.
This is related to another important rule -O2 brings to the mix, which is if we remove the call to save our values in values[n], -O2 literally removes our code from the final assembly file:
-
for(i = 0; i < iterations; i++){
-
z = x + y;
-
x = y;
-
y = z;
-
//values[i] = z;
-
}
Generates:
-
# 67 "main.cpp" 1
-
nop
-
# 0 "" 2
-
# 74 "main.cpp" 1
-
nop
If however, we add a: cout << y; call after our loop, we generate:
-
xorl %eax, %eax
-
.p2align 4,,10
-
.p2align 3
-
.L6:
-
leaq (%rsi,%rdi), %rdx
-
addl $1, %eax
-
movq %rsi, %rdi
-
cmpl $10, %eax
-
movq %rdx, %rsi
-
jne .L6
In other words, the secret to the speed-up is in -O2 mode gcc adds just enough code to get the correct result, but nothing more. We’ll see this time and again when using optimizations, gcc stripping away the fat and making intelligent decisions based on your actual code, as opposed to blindly creating assembly that fits the most general scenario.
That said, nothings stopping us from rewriting the code using hand-crafted assembly to match the -O2 version, though with -O2 being 3 characters away, one could ask why bother…though we will anyway in the next post.