<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>NIX/WIN/WEB &#187; 64-bit assembly</title>
	<atom:link href="http://www.formboss.net/blog/tag/64-bit-assembly/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.formboss.net/blog</link>
	<description>Modern Web Application Development</description>
	<lastBuildDate>Thu, 02 Feb 2012 18:43:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>SSE And Inline Assembly Example</title>
		<link>http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/</link>
		<comments>http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/#comments</comments>
		<pubDate>Mon, 04 Apr 2011 04:42:53 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[Qt]]></category>
		<category><![CDATA[64-bit assembly]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[Performance Computing]]></category>
		<category><![CDATA[SSE]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=1111</guid>
		<description><![CDATA[In previous posts we&#8217;ve covered Inline Assembly and SSE Intrinsics coding. In this post we&#8217;ll merge these concepts by creating a version of the CMYK to RGB conversion code strictly in raw SSE and assembly. The upshot is you&#8217;ll see how we can take existing, real-world C++ code and use GCC&#8217;s Extended Assembly syntax to [...]]]></description>
			<content:encoded><![CDATA[<p>In previous posts we&#8217;ve covered <a title="Inline Assembly" href="http://www.formboss.net/blog/2010/10/gcc-inline-assembly-loop-structures/" target="_blank">Inline Assembly</a> and <a title="SSE intrinsics" href="http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/" target="_blank">SSE Intrinsics coding</a>.</p>
<p>In this post we&#8217;ll merge these concepts by creating a version of the CMYK to RGB conversion code strictly in <strong>raw SSE </strong>and <strong>assembly</strong>. The upshot is you&#8217;ll see how we can take existing, real-world C++ code and use GCC&#8217;s <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html">Extended Assembly</a> syntax to interweave raw assembly code for potential performance gains.</p>
<p>This tutorial is therefore not just about extended assembly or SSE coding; it&#8217;s about using both in a real-world application. We&#8217;ll cover data retrieval, loop processing, SSE processor instructions, floating point number representation, and much more!</p>
<p><span id="more-1111"></span></p>
<p>Let&#8217;s start our tutorial by taking a quick look at the core algorithm logic we&#8217;ll be implementing (for a more in-depth refresher this is covered in some detail <a href="http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/">here </a>):</p>
<pre class="brush: plain; title: ; notranslate">
c = 1.0 - (bits / 255f)
m = 1.0 - (bits / 255f)
y = 1.0 - (bits / 255f)

k = the min of c/m/y

if k != 1
c = (c - k) / (1 - k)
m = (m - k) / (1 - k)
y = (y - k) / (1 - k)

c = c * 255
m = m * 255
y = y * 255
k = k * 255
</pre>
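<p>For reference, the pseudocode above could be written as plain scalar C++ along the following lines. This is only a sketch: the <strong>Cmyk</strong> struct and the <strong>convert_pixel</strong> name are made up for illustration, and the float results are truncated to bytes in the same way the assembly version does later on.</p>

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical container for one converted pixel (not part of the original code).
struct Cmyk {
    std::uint8_t c, m, y, k;
};

// Scalar version of the pseudocode: three source bytes in, four bytes out.
inline Cmyk convert_pixel(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    float c = 1.0f - b0 / 255.0f;
    float m = 1.0f - b1 / 255.0f;
    float y = 1.0f - b2 / 255.0f;

    // k is the minimum of the three components
    const float k = std::min(c, std::min(m, y));

    if (k != 1.0f) {  // skip the division when 1 - k would be zero
        const float scale = 1.0f - k;
        c = (c - k) / scale;
        m = (m - k) / scale;
        y = (y - k) / scale;
    }

    // scale back to 0..255 and truncate toward zero
    return Cmyk{static_cast<std::uint8_t>(c * 255.0f),
                static_cast<std::uint8_t>(m * 255.0f),
                static_cast<std::uint8_t>(y * 255.0f),
                static_cast<std::uint8_t>(k * 255.0f)};
}
```

<p>Calling <strong>convert_pixel(255, 0, 0)</strong> walks through every branch of the pseudocode for a single pixel, which makes it a handy baseline to compare the SSE output against.</p>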
<p>The data source (the incoming image data) is a call to Qt&#8217;s <a title="Qt Bits() Function Call" href="http://doc.qt.nokia.com/4.7/qimage.html#bits" target="_blank">QImage::bits()</a> function, which returns a pointer to a uchar array containing the raw image data-stream.</p>
<p>The destination, that is, the RGB converted data, is a heap-based uchar array created via:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
uchar *cmyk_temp = new uchar[(wt * ht) * 4];
</pre>
<p>In other words, we already have a source and a destination set in C++. (To put it simply, the point of this tutorial is to write assembly to manage the data in-between these two places.) This means the first big task in creating an assembly version is knowing how we &#8216;hook&#8217; into those existing data structures.</p>
<p>A key point in this regard is that while we&#8217;ll manage loops and data access in our assembly code, we do <em>not</em> want to think about managing the stack. Thus, we can think of these two arrays as places we&#8217;ll get the address to via pointers, but never create and manage in assembly on our own.</p>
<p>So how do we access this data from GCC Extended Assembly?</p>
<p>The answer is we, in extended assembly syntax, use these pointer variables as <strong>input </strong>and <strong>output </strong>parameters. (please see <a href="http://www.formboss.net/blog/2010/06/c-64bit-inline-assembly-primer-part-1/">here </a>for a brief primer on GCC inline assembly)</p>
<p>For example, let&#8217;s say we have the following C++:</p>
<pre class="brush: plain; title: ; notranslate">
int a = 1;

__asm__ __volatile__(

&quot;mov %0, %%ebx \n\t&quot;

: /* output parameters */

: &quot;m&quot;(a)

: &quot;ebx&quot; /* clobbers (each one must be a quoted string) */

);
</pre>
<p>This code says that we&#8217;ll take the <strong>int a</strong> variable and move it into <strong>ebx</strong>. The key is that behind the scenes GCC will actually rewrite this mov instruction to something like:</p>
<pre class="brush: plain; title: ; notranslate">
mov -0x20(%rbp), %ebx
</pre>
<p>In other words, we let GCC manage the stack. Using Extended Assembly like this means we don&#8217;t care where on the stack int a comes from, we just want to make sure we have access to it. The extended assembly syntax allows for this easy manipulation of data.</p>
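<p>To make the input/output idea concrete, here is a minimal sketch with one memory input and one memory output. The <strong>add_one</strong> function is invented for this illustration and assumes GCC (or Clang) targeting x86_64; note that every register the block touches is named in the clobber list.</p>

```cpp
// Hypothetical helper: returns a + 1, computed inside an extended asm block.
inline int add_one(int a) {
    int result;

    __asm__ __volatile__(
        "mov %1, %%eax \n\t"   // load the input operand (a) from wherever GCC placed it
        "add $1, %%eax \n\t"   // do the work in a register
        "mov %%eax, %0 \n\t"   // store to the output operand (result)

        : "=m"(result)         /* output parameters */
        : "m"(a)               /* input parameters */
        : "eax", "cc"          /* clobbers: eax and the flags are modified */
    );

    return result;
}
```

<p>Neither <strong>%0</strong> nor <strong>%1</strong> says anything about stack offsets; GCC substitutes the right addresses when it emits the final assembly.</p>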
<p>For our code, this means the start of our implementation will actually be a series of C++ variable declarations that we&#8217;ll end up passing into the __asm__ call.</p>
<p>One item of note here is that along with simple array pointers and ints, we also declare two <strong>__m128</strong> objects for easy double quad-word storage of the constants we&#8217;ll need for our calculations, those being vectors of 1.0f and 255.0f.</p>
<p>&#8220;Talking&#8221; to these __m128 items is accomplished in much the same way as our example above, only now we use the <strong>movaps</strong> mnemonic, as in:</p>
<pre class="brush: plain; title: ; notranslate">
&quot;movaps %5, %%xmm14 \n\t&quot; // 255.0
</pre>
<p>In other words, bytes, chars, floats, __m128s: we can create whatever we need and pass it to the assembly routine, which means we don&#8217;t need to worry about the stack.</p>
<h3>Register Pressure</h3>
<p>This takes us to one of the main goals of this exercise: <strong>use as many CPU registers as possible during the conversion process</strong>. This means one of the explicit assumptions of this code is it only runs on 64-bit machines. That is, it&#8217;s hard-coded to use the full register set available to x86_64.</p>
<p>This means we have an extended range of 64-bit (quad-word) general purpose registers (r8-r15), as well as the full set of 128-bit (double quad-word) SSE registers, xmm0-xmm15.</p>
<p>Obviously the assumption here is the fewer trips to main memory we can make, the faster our code will be.</p>
<p>And so, the first &#8216;preparatory&#8217; part of our assembly code is spent mapping various constants to known registers so we can refer back to them as often as needed without making expensive trips to main memory.</p>
<p>The interesting bit here is that not all of our movs serve the same purpose. Some moves, as above, store 128-bit values to an SSE register. Others set up bit masks, and still others set up loop counters.</p>
<p>Of course, the key is that this logic happens outside of our main conversion loop. Once we enter the loop we do as <strong>much</strong> as we can to avoid reads and writes to main memory.</p>
<p>With that said, let&#8217;s jump straight into the code and see how it works!</p>
<h3>Part 1 &#8211; Initialization.</h3>
<p>In this section we perform the essential task of initializing the variables we&#8217;ll pass to the extended asm routine. </p>
<p>Note the <strong>*bits</strong> array, this is the source of the values being used. This is an example of a standard link to a variable, though we also create a few float arrays and zero out their initial values for easier debugging.</p>
<p>The most important bit in this block is the <strong>_pack_lookup_table</strong> float array. The creation of this item (a classic <a href="http://en.wikipedia.org/wiki/Lookup_table">lookup table</a>) is to reduce the possible overhead of the uchar to float conversions we must make.</p>
<p>The idea is simple: because we only have 256 possible input values, instead of converting each in turn inside the conversion loop, let&#8217;s just create a lookup table that maps each input value to its IEEE floating point representation. The table is filled once by the loop below, before any pixels are processed, so the conversion cost is paid up front and <strong>not</strong> per pixel. This has a huge potential benefit as each pixel will require 3 conversions, and when you&#8217;re dealing with 8 million pixels per image these int to float conversions can really add up.</p>
<p>To be clear though, converting ints to floats on modern hardware isn&#8217;t a huge deal (around 8-16 cycles on most CPUs), but this is still a handy way to learn about assembly coding. We will see in Part 4, however, that our lookup table may not be the best possible solution because of the extra memory trips we end up making. </p>
<p>Still, this is also a good excuse to learn a bit more about assembly level addressing modes, which we&#8217;ll see a good deal of in many different places.</p>
<p>As such, the code below is just standard C++.</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
int t = 1;
int l = (char)t;

uchar *bits = imin.bits();

int count = (wt * ht) * 4; // link with loop counter byte size

int k_min_value;

float * _pack_loop_array = new float[4];
// zero out array for easier debugging
_pack_loop_array[0] = 0;
_pack_loop_array[1] = 0;
_pack_loop_array[2] = 0;
_pack_loop_array[3] = 0;

float * _mask_array = new float[4];
// zero out array for easier debugging
_mask_array[0] = 0;
_mask_array[1] = 0;
_mask_array[2] = 0;
_mask_array[3] = 0;

float * _pack_lookup_table = new float[256];
for(int i = 0; i &lt; 256; i++){ // valid indices are 0..255
    _pack_lookup_table[i] = (float)i;
}

// SSE constants - use __m128 to ensure aligned data
__m128 m_1 = _mm_set_ps1(1.0f);
__m128 m_255 = _mm_set_ps1(255.0f);
</pre>
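<p>As a small aside on the lookup table itself, the same idea can be sketched with <strong>std::array</strong>, which keeps the 256-entry bound in one place. The function names here are placeholders and are not used elsewhere in this post.</p>

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build the 256-entry table once: entry i holds i as a float.
inline std::array<float, 256> make_lookup_table() {
    std::array<float, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<float>(i);
    }
    return table;
}

// A uint8_t index can only be 0..255, so it can never run past the table.
inline float byte_to_float(const std::array<float, 256>& table, std::uint8_t value) {
    return table[value];
}
```

<p>Because the index type is a single byte, the lookup needs no bounds check, which is the same property the assembly version relies on after masking to the LO byte.</p>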
<h3>Part 2 &#8211; Initial __asm__ and Constant Creation</h3>
<p>In this section we dive into the heart of our routine&#8217;s assembly code. </p>
<p>A good deal of space is for initialising loop constants and mapping values to registers&#8211;pretty standard stuff.</p>
<p>We do have one interesting bit though, which is the creation of the 128-bit sign-flip mask.</p>
<p>We&#8217;ve done this because in IEEE single-precision floating point representation, the HO bit (bit 31) is <em>always</em> the sign bit. When the bit&#8217;s 0 the value is positive, when it&#8217;s a 1 negative.</p>
<p>We can exploit this fact to easily flip the sign of a float, or in our case, the packed floats. Problem is, in order to create a 32-bit string in the form of the proper sign-flip mask (0x80000000) we would have to resort to all sorts of trickery in C++. That is, how do you create the bit pattern of our mask without the compiler trying to turn it into something else, such as a float, int, or char? It sounds like an easy problem to address, but it isn&#8217;t. </p>
<p>Thus, as the value needs to be created anyway, we&#8217;ll just do so in assembly, where defining and populating memory locations with arbitrary bit-strings is easy.</p>
<p>In the end we push this mask value to xmm4 where it remains constant throughout the life of the conversion process.</p>
<pre class="brush: plain; title: ; notranslate">
__asm__ __volatile__(

    // set sse constants

    &quot;movaps %5, %%xmm14 \n\t&quot; // 255.0
    &quot;movaps %4, %%xmm15 \n\t&quot; // 1.0

    // init SSE sign-flip mask
    &quot;xorps %%xmm4, %%xmm4 \n\t&quot;

    &quot;xor %%rax, %%rax \n\t&quot;
    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;mov $0x80000000, %%r11d \n\t&quot; // 32-bit mask value
    &quot;mov %9, %%rbx \n\t&quot;

    // populate mask values (32-bit stores, so the 4th write stays inside the 16-byte array)

    &quot;mov %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;mov %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;mov %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;mov %%r11d,(%%rbx, %%rax, 1) \n\t&quot;

    // populate sse reg
    &quot;movaps (%%rbx), %%xmm4 \n\t&quot;

    // init loop constants

    &quot;xor %%rax, %%rax \n\t&quot; /* init i array counter */

    &quot;xor %%rdx, %%rdx \n\t&quot; /* set array counter upper bounds */
    &quot;mov %2, %%edx \n\t&quot;

    &quot;xor %%rcx, %%rcx \n\t&quot; /* get base address of source array */
    &quot;mov %1, %%rcx \n\t&quot;

    &quot;xor %%rbx, %%rbx \n\t&quot; /* get base address of dest array */
    &quot;mov %3, %%rbx \n\t&quot;

    &quot;xor %%r14, %%r14 \n\t&quot; /* get base address of _pack_lookup_table array */
    &quot;mov %7, %%r14 \n\t&quot;

    &quot;xor %%r9, %%r9 \n\t&quot; /* L_PACK_LOOP Bounds Check ($0x10) */
    &quot;mov $0x10, %%r9 \n\t&quot; // init with decimal 16 -  4 bytes x 4 loops

    &quot;xor %%r10, %%r10 \n\t&quot; /* get base address of _pack_loop_array */
    &quot;mov %6, %%r10 \n\t&quot;

    // init logic constants

    &quot;xor %%r13, %%r13 \n\t&quot;
    &quot;mov $0x1, %%r13 \n\t&quot; // 1 value for k_min compare
</pre>
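<p>For comparison, here is one way the same sign-flip mask could be produced without hand-written assembly, assuming the SSE2 intrinsics in <strong>emmintrin.h</strong> are available. The helper names are invented for this sketch; it builds the mask as packed 32-bit integers and then reinterprets the bits as floats, so the compiler never gets a chance to convert the value.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Four lanes of 0x80000000: only the sign bit (bit 31) of each float is set.
inline __m128 make_sign_mask() {
    return _mm_castsi128_ps(_mm_set1_epi32(INT32_MIN));
}

// Flip the sign of a single float by XOR-ing it with the mask.
inline float flip_sign(float value) {
    const __m128 flipped = _mm_xor_ps(_mm_set_ss(value), make_sign_mask());
    return _mm_cvtss_f32(flipped);  // read lane 0 back
}
```

<p>The cast intrinsics generate no instructions; they only change how the compiler views the 128 bits.</p>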
<h3>Part 3 &#8211; Initial Loop Logic</h3>
<p>The following block of code mainly just resets r8 and r15 for loop counting and addressing purposes.</p>
<p>Again, the point of this exercise is to use as many registers as possible. Part of this means we also need to carefully manage how and where registers are used.</p>
<p>Moving (copying of course) rax to r15 for example, means r15 points to important locations we need to index from, but can be incremented without fear of &#8220;breaking&#8221; the main index register, which for our code is rax. We&#8217;ll see this type of activity in a few other places as well.</p>
<p>The point being of course that we&#8217;re not hitting main memory, just registers. </p>
<pre class="brush: plain; title: ; notranslate">
&quot;.L_MAIN_LOOP:&quot; // convert each value to float, push to xmm0

    // subroutine - create packed data from single loop

    &quot;xor %%r8, %%r8 \n\t&quot; // inner loop counter
    &quot;mov $0x0, %%r8 \n\t&quot;

    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;xor %%r12, %%r12 \n\t&quot;

    &quot;xor %%r15, %%r15 \n\t&quot;
    &quot;mov %%rax, %%r15 \n\t&quot; // PACK_INDEX
</pre>
<h3>Part 4 &#8211; Converting Ints To Floats &#8211; Two methods</h3>
<p>With the main loop initialisation done we&#8217;re now free to start processing data. We start by grabbing the current CMYK value from the source array and placing it into r11, then masking it to leave only the LO byte.</p>
<p>This sets us up for the next step, which is converting the int values into proper floats for the xmm registers. Let&#8217;s look at two ways of accomplishing this step:</p>
<p><strong>Method 1 &#8211; Memory Dependent And Slightly Slower &#8211; A Lookup Table In Action</strong><br />
Per above, the first step in both methods is to move and mask the raw int value to r11. Crucially this number will always be between 0-255. We exploit this fact in the <strong>mov</strong> call where r14 is the base address of the <strong>_pack_lookup_table</strong> array we created in C++, and r11, (also always between 0 and 255), acts as the index.</p>
<p>This is a common optimization known as a lookup table. In essence, instead of converting ints to floats (at a cost of around 8 cycles per conversion on my machine), we instead use simple memory moves to place a pre-computed floating point bit-string value created via the <strong>_pack_lookup_table</strong> array.  </p>
<p>This is an intellectually pleasing way to handle the conversion step but it comes with an unfortunate side-effect. The problem is that the table values arrive in general purpose registers, and this code gets them into an xmm register by way of memory (SSE2&#8217;s <strong>movd</strong> can copy a single r32 value into the low lane of an xmm register, but filling all four lanes that way would need extra shuffles). This presents another problem: in order for our lookup table to work we access main memory twice, once for the lookup table value, then once again to store the floating point bit string in an aligned 16-byte memory location (which is later used as the argument to <strong>movaps</strong>). Such repeated memory access can be devastatingly slow, and when compared to just doing a relatively speedy direct conversion, becomes tough to justify.</p>
<p>All told, in my tests the lookup table code <em>would</em> be a touch faster than GCC&#8217;s output if it were not for these <strong>mov</strong> instructions. Without those instructions though, we&#8217;d have no lookup table!</p>
<pre class="brush: plain; title: ; notranslate">
&quot;.L_PACK_LOOP:&quot;

    &quot;nop \n\t&quot;

    &quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

    &quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

    // use lookup table to push pre-converted float val
    &quot;mov (%%r14, %%r11, 4), %%r12d \n\t&quot; // indexed value (32 bits of float data)

    // this is slow!
    &quot;mov %%r12d, (%%r10, %%r8, 1) \n\t&quot; // push to _pack_loop_array (32-bit store stays in bounds)

    &quot;add $0x1, %%r15 \n\t&quot; // 1 pixel per loop iteration

    &quot;add $0x4, %%r8 \n\t&quot;

    &quot;cmp %%r8, %%r9 \n\t&quot;

    &quot;jne .L_PACK_LOOP \n\t&quot;

&quot;xor %%r8, %%r8 \n\t&quot; // clear r8 index for push to xmm

&quot;movaps (%%r10, %%r8, 1), %%xmm0 \n\t&quot; /* move packed values into xmm0 */
</pre>
<p>You&#8217;ll notice the last instruction sets our 128-bit xmm0 register with the value of the pack_loop_array.</p>
<p>All told this works, and again, is somewhat intellectually pleasing. The question is can this be made faster? The answer is yes it can, and the secret is to avoid memory access at all costs. Unfortunately, this means we must rid ourselves of the lookup table, as described in Method 2&#8230;</p>
<p><strong>Method 2 &#8211; Using Hardware Conversion &#038; Avoiding Memory Access</strong><br />
The second attempt at this logic turns out to be better in terms of performance, even though we ditch the lookup table. </p>
<p>The basic idea is instead of worrying about the conversion costs, we embrace them and use an unrolled conversion block where each value is simply run through <strong>CVTSI2SS</strong>, then shuffled to make room for the next value.<em> It should be said that the unroll logic provided no discernible speed-up&#8211;the improvements come from fewer costly memory access steps</em>. </p>
<pre class="brush: plain; title: ; notranslate">
// c

&quot;add $0x2, %%r15 \n\t&quot; // index for c value

&quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

&quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

&quot;CVTSI2SS %%r11, %%xmm0 \n\t&quot;

&quot;shufps $0xC4, %%xmm0, %%xmm0 \n\t&quot; // copy lane 0 (c) into lane 2 - 11000100

// m

&quot;sub $0x1, %%r15 \n\t&quot; // index for m value

&quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

&quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

&quot;CVTSI2SS %%r11, %%xmm0 \n\t&quot;

&quot;shufps $0xE1, %%xmm0, %%xmm0 \n\t&quot; // flip last two to get m 11100001

// y

&quot;sub $0x1, %%r15 \n\t&quot; // index for y value

&quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

&quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

&quot;CVTSI2SS %%r11, %%xmm0 \n\t&quot;
</pre>
<p>The big difference here is memory access is kept to a bare minimum. With this simple change my implementation runs around the same speed as GCC&#8217;s -O2.</p>
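<p>As a point of comparison only, the byte-to-float step can also be sketched with SSE2 intrinsics, widening four bytes at once instead of converting them one at a time. This is not the method used in this post; the <strong>bytes_to_floats</strong> helper is hypothetical, and unlike the assembly above it keeps the bytes in their source order rather than rearranging them.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>
#include <cstring>

// Convert four consecutive bytes at src into four floats written to out.
inline void bytes_to_floats(const std::uint8_t* src, float* out) {
    assert(src != nullptr && out != nullptr);

    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);      // read exactly four bytes

    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);         // bytes sit in the low 32 bits
    v = _mm_unpacklo_epi8(v, zero);                // zero-extend 8-bit -> 16-bit
    v = _mm_unpacklo_epi16(v, zero);               // zero-extend 16-bit -> 32-bit

    _mm_storeu_ps(out, _mm_cvtepi32_ps(v));        // packed int -> packed float
}
```

<p>The two unpack steps interleave the data with zeros, which is the packed equivalent of the <strong>and $0xff</strong> masking done per value above.</p>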
<h3>Part 5 &#8211; CMY Conversion Step 1</h3>
<p>Now that we have a series of 4 floating point numbers in xmm0 (our packed data) we can begin the conversion process. </p>
<p>The code starts with what has to be one of the more satisfying parts of SSE coding. As we now have a 128-bit packed string we can divide and subtract our constant values from the working value in two easy calls to <strong>divps</strong> and <strong>subps</strong>. It&#8217;s getting a lot of work done in two quick instructions, which is always a good feeling!</p>
<p>Things get a bit more interesting after that though. At this point we most likely have a series of negative numbers, something that doesn&#8217;t play well with the next step of the algorithm. </p>
<p>This is because we need to find the lowest value of the bunch to create our k value, but when we have the possibility of negative numbers this task becomes impossible.</p>
<p>Thus, before we can advance we need to flip the 31st bit of each packed float to 0 so the value comparison is valid and we find the &#8216;true&#8217; lowest value. Think of this as the assembly equivalent of an <strong>abs()</strong> call. </p>
<p>This is where our mask comes in. The idea of masking is simple enough, though unfortunately it becomes more of a chore when dealing with packed values. </p>
<p>The basic problem is there is no &#8216;default&#8217; mask we can create that will safely mask only the 31st bit of the 4 packed values without also knowing <em>which</em> values in the packed string are already positive. This is because <strong>XOR</strong> will flip negative to positive, but also positive to negative. We don&#8217;t want that of course, we only want the former.</p>
<p>Thus, we first need to modify our mask (known as a selection mask), to ignore positive values so they aren&#8217;t accidentally converted to negative values. </p>
<p>To do so means we must copy our mask and working xmm0 value to new registers. We then call <strong>psrad</strong> against the xmm0 copy to create a selection mask where values that have the 31st bit set (meaning it&#8217;s negative), become all 1&#8217;s ($0xffffffff). We then <strong>logically AND</strong> this value with the default selection mask copy to turn the double-words with positive values back to 0. </p>
<p>This new, modified mask can now be safely <strong>xor</strong>&#8216;d with the original value to only flip negatives to positives.</p>
<pre class="brush: plain; title: ; notranslate">
// perform cmy conversion

// divide by 255
&quot;divps %%xmm14, %%xmm0 \n\t&quot;

// subtract 1
&quot;subps %%xmm15, %%xmm0 \n\t&quot;

// mask back to positive values for min processing

// copy mask for modifier change into xmm5 (xmm4 is mask constant)
&quot;movaps %%xmm4, %%xmm5 \n\t&quot;

// copy raw value for min to xmm6
&quot;movaps %%xmm0, %%xmm6 \n\t&quot;

// create selection mask modifier
&quot;psrad $0x1F, %%xmm6 \n\t&quot; // shift right by 31: for every negative value, makes item all $0xffffffff

// modify mask copy (xmm5) to mask only negative values
&quot;andps %%xmm6, %%xmm5 \n\t&quot;

// now mask sign bits to 0 with modified mask
&quot;xorps %%xmm5, %%xmm0 \n\t&quot;
</pre>
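<p>The masking steps above translate fairly directly to intrinsics. The following sketch assumes SSE2 and uses a made-up <strong>abs4</strong> helper; it follows the same three steps (arithmetic shift, AND with the sign mask, XOR with the value) rather than trying to be the shortest possible version.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Write the absolute values of four floats from in to out.
inline void abs4(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);

    const __m128 value = _mm_loadu_ps(in);
    const __m128i sign_mask = _mm_set1_epi32(INT32_MIN);  // 0x80000000 per lane

    // psrad equivalent: negative lanes become 0xffffffff, positive lanes 0
    const __m128i negative = _mm_srai_epi32(_mm_castps_si128(value), 31);

    // keep the sign bit only in lanes that are negative
    const __m128i selected = _mm_and_si128(negative, sign_mask);

    // flip just those sign bits
    _mm_storeu_ps(out, _mm_xor_ps(value, _mm_castsi128_ps(selected)));
}
```

<p>For what it is worth, <strong>_mm_andnot_ps(mask, value)</strong> computes <em>(NOT mask) AND value</em>, so with the plain sign mask it clears all four sign bits in a single step without needing a selection mask.</p>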
<h3>Part 6 &#8211; CMY Conversion &#8211; Find Min</h3>
<p>Finding the minimum value in assembly is very similar in practise to the last section where we manipulate masks, only now we&#8217;re also shuffling values around too.</p>
<p>Granted we also had to shuffle values around in the SSE intrinsics version as well, but here we need to think a bit harder about our bit-masks, as we no longer have the benefit of using Intel&#8217;s handy <strong>_MM_SHUFFLE</strong> helper. This type of coding is, I can safely say, a bit more complex. </p>
<p>After we find the minimum value we call <strong>cmpeqps</strong> and then store the LO result in <strong>k_min_value</strong>, which as we can see from the first code block is a simple int we pass into the asm block as a memory value. This may seem like a wasted step, but <strong>movss</strong> cannot write to a general purpose register; it can only write to memory or to another xmm register. </p>
<p>We then use this value in a cmp block and if &#8216;true&#8217;, perform a bit of extra processing on our image data.</p>
<pre class="brush: plain; title: ; notranslate">
// find min value
&quot;movaps %%xmm0, %%xmm1 \n\t&quot;
&quot;shufps $0x4E, %%xmm0, %%xmm1 \n\t&quot; // reorder values (mask first)

&quot;minps %%xmm0, %%xmm1 \n\t&quot; // find first min, put in xmm1

&quot;movaps %%xmm1, %%xmm2 \n\t&quot;
&quot;shufps $0xB1, %%xmm1, %%xmm2 \n\t&quot;

&quot;minps %%xmm1, %%xmm2 \n\t&quot; // min (k) in xmm2

// process min logic (is min value == 1? - true = cmpeqps creates mask of all 1's)
&quot;movaps %%xmm15, %%xmm12 \n\t&quot; // move 1's to xmm12
&quot;cmpeqps %%xmm2, %%xmm12 \n\t&quot; // mask now in xmm12

&quot;movss %%xmm12, %8 \n\t&quot; // move mask to memory value for cmp

&quot;cmpl $0x0, %8 \n\t&quot; // mask is 0 when k != 1, all 1's when k == 1

&quot;jne .L_MULTIPLY_ALL \n\t&quot; // non-zero mask means k == 1, skip this block

// save 1 - k value
&quot;movaps %%xmm15, %%xmm13 \n\t&quot; // xmm13 is temp storage for 1.0f
&quot;subps %%xmm2, %%xmm13 \n\t&quot; // 1-k in xmm13

// subtract k from all
&quot;subps %%xmm2, %%xmm0 \n\t&quot;

// c-k / 1-k
&quot;divps %%xmm13, %%xmm0 \n\t&quot;

&quot;.L_MULTIPLY_ALL:&quot;

&quot;mulps %%xmm14, %%xmm0 \n\t&quot; // cmy * 255
&quot;mulps %%xmm14, %%xmm2 \n\t&quot; // k * 255
</pre>
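<p>The shuffle-and-min pattern reads a little more easily as intrinsics. The sketch below mirrors the two <strong>shufps</strong>/<strong>minps</strong> pairs above using the same immediates (0x4E and 0xB1); <strong>horizontal_min</strong> is a name chosen for this example.</p>

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Return the smallest of the four packed floats in v.
inline float horizontal_min(__m128 v) {
    // 0x4E swaps the high and low halves: [a b c d] -> [c d a b]
    __m128 shuffled = _mm_shuffle_ps(v, v, 0x4E);
    __m128 mins = _mm_min_ps(v, shuffled);          // min(a,c) min(b,d) ...

    // 0xB1 swaps neighbouring lanes: [a b c d] -> [b a d c]
    shuffled = _mm_shuffle_ps(mins, mins, 0xB1);
    mins = _mm_min_ps(mins, shuffled);              // every lane now holds the min

    return _mm_cvtss_f32(mins);                     // read lane 0
}
```

<p>After the second min every lane holds the same value, which is why the assembly version can treat xmm2 as a packed k without any further broadcast.</p>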
<h3>Part 7 &#8211; Export Values</h3>
<p>With the conversion process done we can now write our values back to the destination buffer.</p>
<p>At this point the question becomes: how do we move these packed floats back into an unsigned char array?</p>
<p>This is a surprisingly serious question, as there is (strangely) no native machine instruction for doing so.</p>
<p>Thus, we need to be a bit creative. To that end, the general solution is we extract the LO double-word from the source vector converting it to a <strong>truncated int</strong> in the process. We then grab the LO byte of <em>that </em>value and store it in an indexed position in the destination buffer. Next we shuffle the vector to place a new value in the LO double-word and repeat the process. In the end we <em>store each value back in turn</em> as opposed to sending all back to main memory in one shot (which would be far better in terms of performance). </p>
<p>As for the exact implementation in our code: to start, the first instruction saves a register we are about to reuse, because when we call <strong>CVTTSS2SI </strong>to convert our float values to truncated ints, they come back as double-words, but we only want bytes. There is no conversion instruction that produces bytes directly, so there must be <em>some</em> other way to get byte values back. Sure enough, once <strong>CVTTSS2SI </strong>has written its result, the LO byte of that result can be read through the destination register&#8217;s 8 bit &#8216;shadow&#8217; register (which is to say, just the LO single byte of the full rcx register; this is not a &#8216;different&#8217; register by any means).</p>
<p>Thus, in our code we copy our index register <strong>rcx </strong>to <strong>r11</strong> so it can be restored, because if we use <strong>rcx </strong>as the destination register for <strong>CVTTSS2SI</strong>, the LO byte value of the result is available in the &#8216;shadow register&#8217; <em>cl</em>, which is exactly the value we need (an unsigned int between 0 and 255). We do this in rcx because cl is simple to address (r11 also has an 8 bit form, r11b, so this is a convenience rather than a requirement).  </p>
<p>The next instruction is to push the main loop counter, <strong>rax</strong>, to <strong>r15</strong>. Again, it&#8217;s all about saving registers, which means because <strong>rax </strong>contains the same offset address as our other arrays, we can use that along with <strong>rbx </strong>to create our indexed address for the output array population step.</p>
<p>Of course this means we need to manipulate this register with the proper byte offset our values need to be pushed to, which is the reason for the <strong>add</strong> and <strong>sub</strong> calls.  </p>
<p>We do this because the LittleCMS engine, which our output buffer is created and filled for, expects CMYK byte order, but our xmm0 values are currently in YMCK. Instead of issuing shuffles we simply increment and decrement our <strong>r15 </strong>register with the appropriate byte offsets depending on which color we&#8217;re writing to the destination array. This is how we rearrange YMCK into CMYK, which is simple and effective.</p>
<pre class="brush: plain; title: ; notranslate">
// conversion done, export values

&quot;mov %%rcx, %%r11 \n\t&quot; // save rcx so we can convert dwords to bytes using cl

&quot;mov %%rax, %%r15 \n\t&quot; // pointer for insert values (r15)

// y - truncate and export values to array
&quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot; // y

&quot;add $0x2, %%r15 \n\t&quot; // add 2 for y value insert
&quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;

// m
&quot;shufps $0xE1, %%xmm0, %%xmm0 \n\t&quot; // flip last two to get m
&quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot; // m

&quot;sub $0x1, %%r15 \n\t&quot; // subtract 1 for m value insert
&quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;

// c
&quot;shufps $0x4E, %%xmm0, %%xmm0 \n\t&quot; // swap high and low halves to bring c into the LO lane
&quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot; // c

&quot;mov %%cl, (%%rbx, %%rax, 1) \n\t&quot; // rax is base address, no need to sub

// k
&quot;CVTTSS2SI %%xmm2, %%rcx \n\t&quot; // k

&quot;add $0x2, %%r15 \n\t&quot; // r15 is at offset 1, add 2 to reach offset 3 for k value insert
&quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;

&quot;mov %%r11, %%rcx \n\t&quot; // restore rcx
</pre>
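<p>For the &#8220;one shot&#8221; write-back mentioned above, one possible approach (a sketch only, assuming SSE2) is to convert all four floats to ints at once and then narrow them with the saturating pack instructions. The <strong>floats_to_bytes</strong> helper is hypothetical, and note one difference from the code above: the packs <em>saturate</em> out-of-range values to 0 or 255 rather than keeping the LO byte, which gives the same result for inputs already in the 0 to 255 range.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>
#include <cstring>

// Truncate four packed floats to bytes and write them to dst in lane order.
inline void floats_to_bytes(__m128 v, std::uint8_t* dst) {
    assert(dst != nullptr);

    const __m128i i32 = _mm_cvttps_epi32(v);        // truncate toward zero, all 4 lanes
    const __m128i i16 = _mm_packs_epi32(i32, i32);  // 32-bit -> 16-bit, signed saturation
    const __m128i u8 = _mm_packus_epi16(i16, i16);  // 16-bit -> 8-bit, unsigned saturation

    const std::int32_t packed = _mm_cvtsi128_si32(u8);  // low 4 bytes hold the result
    std::memcpy(dst, &packed, sizeof packed);           // a single 4-byte store
}
```

<p>This writes the lanes in the order they sit in the register, so a YMCK vector would still need one shuffle beforehand to come out as CMYK.</p>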
<h3>Part 8 &#8211; Closing The Loop and __asm__ Logic</h3>
<p>The closing assembly bits are simply to increment our loop counter.</p>
<p>The last part is of course the extended assembly closing bit where we set input, output, and clobbers.</p>
<pre class="brush: plain; title: ; notranslate">
    &quot;addq $0x4, %%rax \n\t&quot; // increment main loop counter
    &quot;cmpq %%rdx, %%rax \n\t&quot;

    &quot;jne .L_MAIN_LOOP \n\t&quot;

    : &quot;=m&quot; (cmyk_temp)  /* destination */
    : &quot;m&quot; (bits), &quot;m&quot; (count), &quot;m&quot; (cmyk_temp), &quot;m&quot; (m_1), &quot;m&quot; (m_255), &quot;m&quot; (_pack_loop_array), &quot;m&quot; (_pack_lookup_table), &quot;m&quot; (k_min_value), &quot;m&quot;(_mask_array) /* source */
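    : &quot;rax&quot;, &quot;rbx&quot;, &quot;rcx&quot;, &quot;rdx&quot;, &quot;r8&quot;, &quot;r9&quot;, &quot;r10&quot;, &quot;r11&quot;, &quot;r12&quot;, &quot;r13&quot;, &quot;r14&quot;, &quot;r15&quot;, &quot;xmm0&quot;, &quot;xmm1&quot;, &quot;xmm2&quot;, &quot;xmm4&quot;, &quot;xmm5&quot;, &quot;xmm6&quot;, &quot;xmm12&quot;, &quot;xmm13&quot;, &quot;xmm14&quot;, &quot;xmm15&quot;, &quot;cc&quot;, &quot;memory&quot; /* clobbers: every register the block writes to */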

); // asm

delete[] _pack_lookup_table; // arrays from new[] must be released with delete[]
delete[] _pack_loop_array;
delete[] _mask_array;
</pre>
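<p>The overall shape of the routine (set up registers, loop with an index register, compare against an upper bound, then declare inputs, outputs and clobbers) can be seen in miniature in the sketch below. It is not part of the conversion code: <strong>sum_bytes</strong> is a made-up example that assumes GCC or Clang on x86_64, and it uses a numeric local label (<strong>1:</strong> / <strong>1b</strong>) so the block can be inlined more than once without duplicate label errors.</p>

```cpp
#include <cassert>

// Add up n bytes starting at data, using an indexed loop in extended asm.
inline unsigned long sum_bytes(const unsigned char* data, unsigned long n) {
    assert(data != nullptr);
    if (n == 0) {
        return 0;  // the loop below always runs at least once
    }

    unsigned long total = 0;

    __asm__ __volatile__(
        "xor %%rax, %%rax \n\t"               // rax = index
        "xor %%r8, %%r8 \n\t"                 // r8 = running total
        "1: \n\t"
        "movzbq (%1, %%rax, 1), %%r9 \n\t"    // load one byte, zero-extended
        "add %%r9, %%r8 \n\t"
        "add $0x1, %%rax \n\t"                // increment loop counter
        "cmp %2, %%rax \n\t"                  // reached the upper bound?
        "jne 1b \n\t"
        "mov %%r8, %0 \n\t"                   // hand the result back

        : "=r"(total)                         /* destination */
        : "r"(data), "r"(n)                   /* source */
        : "rax", "r8", "r9", "cc", "memory"   /* clobbers */
    );

    return total;
}
```

<p>Listing every register the block writes to in the clobbers is what stops the compiler from keeping one of its own values there across the asm statement.</p>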
<h3>Performance Notes</h3>
<p>One of the main reasons I was interested in pursuing this task was to see if I could write code that had some parity with GCC&#8217;s -O2 output. In all I&#8217;m relatively happy, with performance generally being only a few hundred milliseconds slower over the course of 120 images:</p>
<pre class="brush: plain; title: ; notranslate">
== || Total Process Time Elapsed (milliseconds): 21228 SSE 2 (unrolled intrinsics)
== || Total Process Time Elapsed (milliseconds): 23252 Raw Assembly
== || Total Process Time Elapsed (milliseconds): 21344 Raw Assembly
== || Total Process Time Elapsed (milliseconds): 21125 SSE 1 (intrinsics)
== || Total Process Time Elapsed (milliseconds): 20747 SSE 2 (unrolled intrinsics)
</pre>
<p>As fun as creating this code was, the lesson here is: if you can use intrinsics, do it. There is simply no reason to slave over this type of coding if you can possibly avoid it, as in the end you&#8217;ll have a very hard time besting your compiler anyway. Yes, you may get lucky every now and then, but my guess would be most of the time, not so much. Intrinsics rock, use em!</p>
<p>The one thing I would say though is there <em>are</em> some holes in the current intrinsic line-up which make raw assembly attractive. One glaring example is the inability to deal with packed values in a horizontal fashion. That is, it should be very simple to find the min of four packed floats using one intrinsic&#8211;but no such intrinsic exists. There are <em>plans </em>for one, but nothing as of yet.</p>
<p>It would also be nice to have a native machine instruction to perform packed sign flips. If we did the above code could be a touch shorter. </p>
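<p>Until something like that exists, a sign flip boils down to an XOR against the sign bit of each lane. A rough sketch in intrinsics, assuming SSE is available; the helper names <strong>flip_signs</strong> and <strong>clear_signs</strong> are hypothetical:</p>

```cpp
#include <xmmintrin.h>

// Hypothetical helper: negates all four packed floats.
static inline __m128 flip_signs(__m128 v) {
    // -0.0f has only the sign bit (bit 31) set, so XOR toggles the sign.
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(v, sign_mask);
}

// Hypothetical helper: packed absolute value.
static inline __m128 clear_signs(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    // andnot computes (~sign_mask) & v, which clears the sign bit of each lane.
    return _mm_andnot_ps(sign_mask, v);
}
```

<p>Both are a constant load plus one bitwise instruction, so the cost is small, but the mask still has to live somewhere, which is the extra bookkeeping a dedicated instruction would remove.</p>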
<p>Finally, it has to be said this algorithm is really not the best for SSE optimizations. This is because at it&#8217;s heart the vectors we create are non-uniform which means we end up treating each vector as 4 separate values instead of a single block. Actually, this <em>severely limits</em> us in terms of performance gains, which is why we&#8217;ll see non-SSE versions of GCC&#8217;s -O2 come very close if not best the SSE versions. </p>
<p>The good news is the techniques covered in this tutorial are still valid. Just keep in mind not all data structures are well suited for SSE. </p>
<h3>General hints</h3>
<p>Along the way I tried to keep note of the various idiosyncrasies of this type of coding. </p>
<p>Of course the best piece of advice is to just see what your own assembler is doing in a debug session.</p>
<p>The problem with this approach, however, is that unlike Intel syntax, GCC&#8217;s inline Gas syntax adds many rules and formatting requirements that are not present in raw assembly output. For example, you&#8217;ll notice that we end each instruction with \n\t, because the whole block is handed to Gas as a single string and the instructions need a separator.</p>
<p>With that said, here are a few hints:</p>
<p><strong>CGG Reversed Syntax</strong><br />
Ha ha. But seriously, GCC (AT&#038;T syntax) reverses the order of operands in specific ways, meaning many pieces of documentation you&#8217;ll read won&#8217;t work unless we first reverse them. It&#8217;s important to note it&#8217;s not just the operands in two-operand instructions that get reversed, but those in three-operand ones too.</p>
<p>A good example is this guy from my code:</p>
<pre class="brush: plain; title: ; notranslate">shufps $0xe4,%xmm1,%xmm0
</pre>
<p>In Intel documentation it looks like:</p>
<pre class="brush: plain; title: ; notranslate">
SHUFPS xmm1, xmm2/m128, imm8
</pre>
<p>Thus, not only are the operands reversed (source/destination), <strong>the immediate value moves too</strong>, from the last position to the first.</p>
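<p>To see the ordering in something that compiles, here is a small sketch assuming GCC on x86-64. The function name <strong>swap_halves</strong> is made up for illustration, and the &quot;x&quot; constraint asks GCC for an SSE register:</p>

```cpp
#include <xmmintrin.h>

// Hypothetical helper: swaps the low and high pairs of floats in v.
static inline __m128 swap_halves(__m128 v) {
    __m128 dst = v;
    // AT&T order: immediate first, then source (%1), then destination (%0).
    // The Intel manual would write this as: SHUFPS dst, src, 0x4E
    __asm__("shufps $0x4E, %1, %0"
            : "+x"(dst)   // read and written
            : "x"(v));    // read only
    return dst;
}
```

<p>Feeding it the lanes (1, 2, 3, 4) gives back (3, 4, 1, 2).</p>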
<p><strong>Using 64-bit? Use 64-bit Register Names</strong><br />
This one sounds obvious, but it can be easy to forget that this won&#8217;t work:</p>
<pre class="brush: plain; title: ; notranslate">&quot;mov %%r15, %%eax \n\t&quot;</pre>
<p>Whereas this will:</p>
<pre class="brush: plain; title: ; notranslate">&quot;mov %%r15, %%rax \n\t&quot;</pre>
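<p>The first form fails because mov wants both operands to be the same width: r15 is 64 bits wide while eax is only 32. A small sketch, assuming GCC on x86-64 (both helper names are hypothetical); r15d is the 32-bit view of r15, for the cases where you really do want only the low half:</p>

```cpp
#include <cstdint>

// Hypothetical helper: round-trips a 64-bit value through r15.
static inline std::uint64_t copy_via_r15(std::uint64_t value) {
    std::uint64_t result;
    __asm__("mov %1, %%r15 \n\t"
            "mov %%r15, %0 \n\t"    // 64-bit source, 64-bit destination
            : "=r"(result)
            : "r"(value)
            : "r15");
    return result;
}

// Hypothetical helper: keeps only the low 32 bits, using the r15d sub-register.
static inline std::uint32_t low_half_via_r15(std::uint64_t value) {
    std::uint32_t result;
    __asm__("mov %1, %%r15 \n\t"
            "mov %%r15d, %0 \n\t"   // 32-bit source, 32-bit destination
            : "=r"(result)
            : "r"(value)
            : "r15");
    return result;
}
```

<p>Because the C++ variables have fixed widths, GCC picks a register name of the matching size for %0 and %1, so the only sizes we have to get right by hand are the hard-coded ones.</p>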
<h3>Conclusion</h3>
<p>And that&#8217;s it, hopefully you&#8217;ve learned a bit more about assembly coding : )</p>
<p>If you have any questions, please post &#8216;em below!</p>
<h3>Full Code Listing</h3>
<pre class="brush: plain; title: ; notranslate">
__asm__ __volatile__(
    &quot;movaps %5, %%xmm14 \n\t&quot;
    &quot;movaps %4, %%xmm15 \n\t&quot;
    &quot;xorps %%xmm4, %%xmm4 \n\t&quot;
    &quot;xor %%rax, %%rax \n\t&quot;
    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;mov $0x80000000, %%r11 \n\t&quot;
    &quot;mov %9, %%rbx \n\t&quot;
    &quot;mov %%r11,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;mov %%r11,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;mov %%r11,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;mov %%r11,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;movaps (%%rbx), %%xmm4 \n\t&quot;
    &quot;xor %%rax, %%rax \n\t&quot;
    &quot;xor %%rdx, %%rdx \n\t&quot;
    &quot;mov %2, %%edx \n\t&quot;
    &quot;xor %%rcx, %%rcx \n\t&quot;
    &quot;mov %1, %%rcx \n\t&quot;
    &quot;xor %%rbx, %%rbx \n\t&quot;
    &quot;mov %3, %%rbx \n\t&quot;
    &quot;xor %%r14, %%r14 \n\t&quot;
    &quot;mov %7, %%r14 \n\t&quot;
    &quot;xor %%r9, %%r9 \n\t&quot;
    &quot;mov $0x10, %%r9 \n\t&quot;
    &quot;xor %%r10, %%r10 \n\t&quot;
    &quot;mov %6, %%r10 \n\t&quot;
    &quot;xor %%r13, %%r13 \n\t&quot;
    &quot;mov $0x1, %%r13 \n\t&quot;
    &quot;.L_MAIN_LOOP:&quot;
    &quot;xor %%r8, %%r8 \n\t&quot;
    &quot;mov $0x0, %%r8 \n\t&quot;
    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;xor %%r12, %%r12 \n\t&quot;
    &quot;xor %%r15, %%r15 \n\t&quot;
    &quot;mov %%rax, %%r15 \n\t&quot;
    &quot;.L_PACK_LOOP:&quot;
    &quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot;
    &quot;and $0x00000000000000ff, %%r11 \n\t&quot;
    &quot;mov (%%r14, %%r11, 4), %%r12 \n\t&quot;
    &quot;mov %%r12, (%%r10, %%r8, 1) \n\t&quot;
    &quot;add $0x1, %%r15 \n\t&quot;
    &quot;add $0x4, %%r8 \n\t&quot;
    &quot;cmp %%r8, %%r9 \n\t&quot;
    &quot;jne .L_PACK_LOOP \n\t&quot;
    &quot;xor %%r8, %%r8 \n\t&quot;
    &quot;movaps (%%r10, %%r8, 1), %%xmm0 \n\t&quot;
    &quot;divps %%xmm14, %%xmm0 \n\t&quot;
    &quot;subps %%xmm15, %%xmm0 \n\t&quot;
    &quot;movaps %%xmm4, %%xmm5 \n\t&quot;
    &quot;movaps %%xmm0, %%xmm6 \n\t&quot;
    &quot;psrad $0x31, %%xmm6 \n\t&quot;
    &quot;andps %%xmm6, %%xmm5 \n\t&quot;
    &quot;xorps %%xmm5, %%xmm0 \n\t&quot;
    &quot;movaps %%xmm0, %%xmm1 \n\t&quot;
    &quot;shufps $0x4E, %%xmm0, %%xmm1 \n\t&quot;
    &quot;minps %%xmm0, %%xmm1 \n\t&quot;
    &quot;movups %%xmm1, %%xmm2 \n\t&quot;
    &quot;shufps $0xB1, %%xmm1, %%xmm2 \n\t&quot;
    &quot;minps %%xmm1, %%xmm2 \n\t&quot;
    &quot;movaps %%xmm15, %%xmm12 \n\t&quot;
    &quot;cmpeqps %%xmm2, %%xmm12 \n\t&quot;
    &quot;movss %%xmm12, %8 \n\t&quot;
    &quot;cmp %8, %%r13 \n\t&quot;
    &quot;jne .L_MULTIPLY_ALL \n\t&quot;
    &quot;movaps %%xmm15, %%xmm13 \n\t&quot;
    &quot;subps %%xmm2, %%xmm13 \n\t&quot;
    &quot;subps %%xmm2, %%xmm0 \n\t&quot;
    &quot;divps %%xmm13, %%xmm0 \n\t&quot;
    &quot;.L_MULTIPLY_ALL:&quot;
    &quot;mulps %%xmm14, %%xmm0 \n\t&quot;
    &quot;mulps %%xmm14, %%xmm2 \n\t&quot;
    &quot;mov %%rcx, %%r11 \n\t&quot;
    &quot;mov %%rax, %%r15 \n\t&quot;
    &quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot;
    &quot;add $0x2, %%r15 \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;
    &quot;shufps $0xE1, %%xmm0, %%xmm0 \n\t&quot;
    &quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot;
    &quot;sub $0x1, %%r15 \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;
    &quot;shufps $0x4E, %%xmm0, %%xmm0 \n\t&quot;
    &quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%rax, 1) \n\t&quot;
    &quot;CVTTSS2SI %%xmm2, %%rcx \n\t&quot;
    &quot;add $0x2, %%r15 \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;
    &quot;mov %%r11, %%rcx \n\t&quot;
    &quot;addq $0x4, %%rax \n\t&quot; // increment main loop counter
    &quot;cmpq %%rdx, %%rax \n\t&quot;
    &quot;jne .L_MAIN_LOOP \n\t&quot;

    : &quot;=m&quot; (cmyk_temp)  /* destination */
    : &quot;m&quot; (bits), &quot;m&quot; (count), &quot;m&quot; (cmyk_temp), &quot;m&quot; (m_1), &quot;m&quot; (m_255), &quot;m&quot; (_pack_loop_array), &quot;m&quot; (_pack_lookup_table), &quot;m&quot; (k_min_value), &quot;m&quot;(_mask_array) /* source */
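    : &quot;rax&quot;, &quot;rbx&quot;, &quot;rcx&quot;, &quot;rdx&quot;, &quot;r8&quot;, &quot;r9&quot;, &quot;r10&quot;, &quot;r11&quot;, &quot;r12&quot;, &quot;r13&quot;, &quot;r14&quot;, &quot;r15&quot;, &quot;xmm0&quot;, &quot;xmm1&quot;, &quot;xmm2&quot;, &quot;xmm4&quot;, &quot;xmm5&quot;, &quot;xmm6&quot;, &quot;xmm12&quot;, &quot;xmm13&quot;, &quot;xmm14&quot;, &quot;xmm15&quot;, &quot;cc&quot;, &quot;memory&quot; /* clobbers: every register the block writes, plus flags */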

); // asm
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>C++ 64-bit Inline Assembly Primer – Part 2</title>
		<link>http://www.formboss.net/blog/2010/07/c-64-bit-inline-assembly-primer-%e2%80%93-part-2/</link>
		<comments>http://www.formboss.net/blog/2010/07/c-64-bit-inline-assembly-primer-%e2%80%93-part-2/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 04:25:43 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[64-bit assembly]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[c++]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=541</guid>
		<description><![CDATA[In this series we examine the relationship and implementations of C++ and raw assembly code. In this post we create our own add function in gcc extended assembly. In the previous post we wrote a short C++ program that added two numbers. Despite its simplicity in C++, we saw how the assembly version was comprised [...]]]></description>
			<content:encoded><![CDATA[<p>In this series we examine the relationship and implementations of C++ and raw assembly code. In this post we create our own add function in gcc extended assembly.</p>
<p>In the previous post we wrote a short C++ program that added two numbers. Despite its simplicity in C++, we saw how the assembly version consisted of several dozen individual instructions in a rather cryptic format. Much of this complexity stems from the fact that in our sample program we called a function to perform our addition. Calling functions means dealing with a stack, base pointers, and the setup and maintenance of that stack. It means dealing with memory offsets, relative positions, and several other factors. The good news is that at this point we can safely ignore these details. In fact, we will do well to ignore them and focus on just the core competencies of function implementation code. In other words, we&#8217;ll let gcc create the function shells, calls, and stack management, while we focus on the core logic.</p>
<p><span id="more-541"></span></p>
<p>The general idea is that we want to write assembly that adds two numbers together. There is a direct assembly instruction for doing so, <strong>add</strong>, and in this post we&#8217;ll implement it.</p>
<p>One of the key things to remember about writing assembly is that we&#8217;re dealing with very few abstractions. In C++:</p>
<div class="geshi no cpp">
<ol>
<li class="li1">
<div class="de1"><span class="kw4">int</span> t <span class="sy1">=</span> <span class="nu0">10</span>;</div>
</li>
</ol>
</div>
<p>Has a very specific meaning, but its <em>true</em> meaning to the machine it runs on is hidden. It&#8217;s only after gcc compiles the code that int, t, and 10 mean anything to the processor. In assembly we free ourselves from these abstractions and deal with the very direct process of moving bits around.</p>
<p>Thus, our first step in writing assembly to add two numbers is to realize that the add instruction we&#8217;ll be using expects two operands, and they cannot both be memory locations. They can, however, both be registers. Thus, our first steps in code must be to load two registers with the values we want to add.</p>
<p>To load a value into a register we use the <strong>mov</strong> instruction, which for us in AT&#038;T syntax takes the form of:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1"><span class="kw1">mov</span> $0x2, %%rax<span class="co1">;</span></div>
</li>
</ol>
</div>
<p>This moves the immediate value (2) into the %rax register. Keep in mind that %rax is a 64-bit register; on 32-bit systems it would be %eax.</p>
<p>We then load our second number:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1"><span class="kw1">mov</span> $0x1, %%rcx<span class="co1">;</span></div>
</li>
</ol>
</div>
<p>And finally, call the add instruction, passing in our two registers:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1"><span class="kw1">add</span> %%rcx, %%rax<span class="co1">;</span></div>
</li>
</ol>
</div>
<p>The whole function in extended assembly looks like:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __asm__<span class="br0">&#40;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;mov $0x2, %%rax; &nbsp; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;mov $0x1, %%rcx; &nbsp; &nbsp; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;add %%rcx, %%rax; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :<span class="st0">&quot;rax&quot;</span>, <span class="st0">&quot;rcx&quot;</span><span class="br0">&#41;</span><span class="co1">;</span></div>
</li>
</ol>
</div>
<p>Should we run this in our IDE, we would watch the rax and rcx registers end up with 0x3 and 0x1, respectively.</p>
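<p>If you&#8217;d rather check the sum without stepping through a debugger, one option is to copy rax into a C++ variable at the end of the block. A minimal sketch, assuming GCC on x86-64; <strong>add_immediates</strong> is a hypothetical wrapper name:</p>

```cpp
// Hypothetical wrapper: runs the three instructions and hands rax back to C++.
static inline long add_immediates() {
    long result;
    __asm__("mov $0x2, %%rax \n\t"
            "mov $0x1, %%rcx \n\t"
            "add %%rcx, %%rax \n\t"
            "mov %%rax, %0 \n\t"     // copy the sum out to a C++ variable
            : "=r"(result)
            :
            : "rax", "rcx", "cc");   // registers and flags the block modifies
    return result;
}
```

<p>Since rax and rcx are listed as clobbers, gcc will pick some other register for %0, and add_immediates() returns 3.</p>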
<p>Thing is, this is not a very useful construct, as using immediate values means we&#8217;ve hard-coded the result. It&#8217;s much more realistic to accept parameters.</p>
<p>To do so, we take advantage of gcc extended syntax like so:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">long</span> rax = <span class="nu0">2</span><span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">long</span> rcx = <span class="nu0">1</span><span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; cout &lt;&lt; <span class="st0">&quot;(rax) before: &quot;</span> &lt;&lt; rax &lt;&lt; endl<span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __asm__<span class="br0">&#40;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;mov %1, %%rax; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;mov %2, %%rcx; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;add %%rcx, %0; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : <span class="st0">&quot;=m&quot;</span> <span class="br0">&#40;</span>rax<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : <span class="st0">&quot;m&quot;</span> <span class="br0">&#40;</span>rax<span class="br0">&#41;</span>, <span class="st0">&quot;m&quot;</span> <span class="br0">&#40;</span>rcx<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :<span class="st0">&quot;rax&quot;</span>, <span class="st0">&quot;rcx&quot;</span><span class="br0">&#41;</span><span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; cout &lt;&lt; <span class="st0">&quot;(rax) after: &quot;</span> &lt;&lt; rax &lt;&lt; endl<span class="co1">;</span></div>
</li>
</ol>
</div>
<p>Will output:</p>
<div class="geshi no ini">
<ol>
<li class="li1">
<div class="de1"><span class="br0">&#40;</span>rax<span class="br0">&#41;</span> before: <span class="nu0">2</span></div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#40;</span>rax<span class="br0">&#41;</span> after: <span class="nu0">3</span></div>
</li>
<li class="li1">
<div class="de1">Press <span class="re0"><span class="br0">&#91;</span>Enter<span class="br0">&#93;</span></span> to close the terminal &#8230;</div>
</li>
</ol>
</div>
<p>While we now have plenty of control over implementation, it&#8217;s still a bit heavy on syntax. Thus, we can further optimize by letting gcc have full control over register assignment:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">int</span> rax = <span class="nu0">2</span><span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">int</span> rcx = <span class="nu0">1</span><span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; cout &lt;&lt; <span class="st0">&quot;(rax) before: &quot;</span> &lt;&lt; rax &lt;&lt; endl<span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __asm__<span class="br0">&#40;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;add %2, %0; &nbsp;\n\t&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : <span class="st0">&quot;=r&quot;</span> <span class="br0">&#40;</span>rax<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : <span class="st0">&quot;0&quot;</span> <span class="br0">&#40;</span>rax<span class="br0">&#41;</span>, <span class="st0">&quot;r&quot;</span> <span class="br0">&#40;</span>rcx<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :<span class="br0">&#41;</span><span class="co1">;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; cout &lt;&lt; <span class="st0">&quot;(rax) after: &quot;</span> &lt;&lt; rax &lt;&lt; endl<span class="co1">;</span></div>
</li>
</ol>
</div>
<p>Here we let gcc decide which registers to use, which generally gives it room to produce slightly faster code, but we may lose the exacting control we need.</p>
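<p>One thing worth knowing here: when an operand is both read and written, gcc also accepts a &quot;+r&quot; constraint, which says so in a single operand. A minimal sketch, assuming GCC on x86-64; <strong>add_in_place</strong> is a hypothetical name:</p>

```cpp
// Hypothetical helper: a += b, with gcc choosing both registers.
static inline long add_in_place(long a, long b) {
    __asm__("add %1, %0"
            : "+r"(a)    // %0: read and written by the add
            : "r"(b)     // %1: read only
            : "cc");     // add updates the flags
    return a;
}
```

<p>With this form the operand numbering shifts (there is one fewer operand), so the source becomes %1 rather than %2. add_in_place(2, 1) returns 3.</p>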
<p>At this point we now have a functioning addition routine. Of course we already know it&#8217;s not the most efficient, as we could/should be using lea. This is because while our actual hand-written assembly is shorter, we still need to place values in registers and use the two-step addition instruction.</p>
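<p>For comparison, here is roughly what the lea route could look like. This is only a sketch, assuming GCC on x86-64, and <strong>add_with_lea</strong> is a made-up name. lea computes base + index * scale and writes it to a third register, so neither input is overwritten:</p>

```cpp
// Hypothetical helper: computes a + b with a single lea.
static inline long add_with_lea(long a, long b) {
    long sum;
    // (%1, %2, 1) means: base %1 plus index %2 times scale 1.
    __asm__("lea (%1, %2, 1), %0"
            : "=r"(sum)
            : "r"(a), "r"(b));   // lea leaves the flags alone, so no "cc"
    return sum;
}
```

<p>add_with_lea(2, 1) returns 3, and the two source registers still hold 2 and 1 afterwards.</p>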
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2010/07/c-64-bit-inline-assembly-primer-%e2%80%93-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

