We can talk directly to our machines in C++ by inserting assembly code with the asm keyword (spelled __asm__ in gcc), in either its basic form or the extended form that takes operands.
Doing so lets us exploit potential performance gains and learn more about our machines, and it is in all cases a fascinating exercise in the lowest level of computer programming.
In this post I want to share a few tips and tricks I’ve learned when using AT&T-style syntax on Linux/gcc, and also explain how basic assembly coding works. We’ll start by creating a simple function and checking out the assembly code gcc generates for it. We’ll then take this apart and step through the code to understand how it works.
The first thing to do is look at what a basic function call looks like to gcc in assembly. The function we’ll use is as simple as it gets:
    void add(int a, int b)
    {
        int z;
        z = a + b;
    }

    int main() {
        add(1,2);
        return 0;
    }
In NetBeans or your favorite IDE, set a breakpoint at add(1,2). Now run the code in debugging mode and open the disassembly view (Window > Debugging > Disassembly).
We should see something similar to:
    ! add(1,2);
    main()
    main+33: mov $0x2,%esi
    main+38: mov $0x1,%edi
    main+43: callq 0x40076f <_Z3addii>
+33 and +38 move our parameters into the esi and edi registers (note that the second argument is loaded first). +43 executes the callq instruction, which pushes the return address onto the stack and jumps to the function. When we single-step a few times, the disassembly window will change to that function’s code and show:
    !{
    _Z3addii()
    _Z3addii+0: push %rbp
    _Z3addii+1: mov %rsp,%rbp
    _Z3addii+4: mov %edi,-0x14(%rbp)
    _Z3addii+7: mov %esi,-0x18(%rbp)
    ! int z;
    ! z = a + b;
    _Z3addii+10: mov -0x18(%rbp),%eax
    _Z3addii+13: mov -0x14(%rbp),%edx
    _Z3addii+16: lea (%rdx,%rax,1),%eax
    _Z3addii+19: mov %eax,-0x4(%rbp)
    !}
The code reads as follows: in +0 and +1 we save the caller’s base pointer and set up our own stack frame. In +4 and +7 we take the two values placed into the edi and esi registers in the main() function and store them on the stack at fixed offsets from the base pointer (rbp), -0x14 and -0x18. As these are ints, they are 4 bytes apart.
This is where things get a bit interesting. As we are dealing with x86-64, we have to remember that rax and eax are related: eax is the lower 32 bits of rax (in x86-64, a register named Rxx is 64-bit and Exx is its 32-bit lower half). Thus, you’ll notice that in +16 the lea performs the addition using two 64-bit registers, rdx (which also serves as the third argument register) and rax.
How did values get into rax and rdx? This is a valid question, as in +10 and +13 we load our variables into eax and edx, not rax and rdx. The answer is that they are the same physical registers: writing a 32-bit register such as eax also clears the upper 32 bits of rax, so after +10 and +13 rax and rdx hold our two values.
It also helps to know that a) this is non-optimized code, which is why the parameters take a round trip through the stack, and b) there is a very specific convention for how parameters should be passed, and reading it helps explain why things work the way they do (see p. 17). We go into greater detail on this topic in the next post.
At this point we’re ready to perform our addition. However, gcc does so using a lightly optimized form, the lea instruction (Load Effective Address). The assembly we’ll write in the next post will be different, but it’s helpful to understand how this lea instruction works.
Just as in C++ we can use the address-of operator (&) to grab the address of a variable, we use the lea instruction to do the same in assembly.
This bit of code demonstrates:
    #include <stdlib.h>
    #include <iostream>

    using namespace std;

    int main() {
        int m = 5;
        int *g;

        // lea computes the address of the memory operand m (%1)
        // and writes it to the output register (%0).
        __asm__ __volatile__(
            "lea %1, %0; \n"
            :"=r" (g)    // output: any general purpose register
            :"m" (m)     // input: m as a memory operand
            :"memory"
        );

        cout << "Address of m: \t" << &m << endl;
        cout << "Result of lea: \t" << g << endl;

        return 0;
    }
The output from this program is:
    Address of m:   0x7fff5c9f5344
    Result of lea:  0x7fff5c9f5344
In other words, lea loads the address of m and stores it in g. A quick check of &m shows the values match, as we would expect.
So how does this relate to adding? The purpose of lea is to compute memory addresses, and as part of that duty it can add a base, a scaled index, and an offset. We exploit this by putting plain values, rather than addresses, into the registers. lea then sums the registers and places the result into a destination register without touching memory. As this happens in one instruction, we get the addition and the move together, a saving over the mov + add combination. In short, it’s a simple and elegant optimization gcc performs for us.
The following code shows this technique deliberately used:
    #include <iostream>

    using namespace std;

    int main() {
        int m = 5;
        int mx = 12;
        int g;

        // %0 = %2 + %1 * 1, i.e. g = mx + m
        __asm__ __volatile__(
            "leal (%2, %1, 1), %0; \n"
            :"=r" (g)
            :"r" (m), "r" (mx)
            :"memory"
        );

        cout << "Addition=" << g << endl;

        return 0;
    }
We could have also replaced the second “r” input (mx) with a direct value, written as the displacement of the address expression, for example 2(%1), which would add 5+2.
Further, as you may suspect, gcc typically implements the C++ address-of operator (&) with lea when it is applied to a local variable. Consider:
    int test = 10;
    int *p_test = &test;
Which in assembly appears as:
    ! int test = 10;
    main+105: movl $0xa,-0x28(%rbp)
    ! int *p_test = &test;
    main+112: lea -0x28(%rbp),%rax
    main+116: mov %rax,-0x68(%rbp)
Here we see lea in a more ‘natural’ state, computing the effective address of our variable and stashing it into %rax. It’s important to note that in +112, the -0x28 is an offset from %rbp, and the parentheses denote a memory operand; lea only computes that operand’s address and never reads the memory itself.
** It should be said at this point that while we instinctively know that C++ and raw assembly are intertwined, it’s fun and instructive to see how concepts we’ve learned in C++ manifest themselves in assembly. **
Returning to our add function: with the addition done, +19 stores the result in our variable z at -0x4(%rbp). The function then executes leave and ret to tear down the stack frame and return to the calling function (main).
That’s it for now. In the next post we’ll implement our own add function, and instead of throwing assembly code out rapid style, actually explain it in greater detail.