<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>NIX/WIN/WEB &#187; C++</title>
	<atom:link href="http://www.formboss.net/blog/category/programming-in-c-plus-plus/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.formboss.net/blog</link>
	<description>Modern Web Application Development</description>
	<lastBuildDate>Thu, 02 Feb 2012 18:43:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>GCC Vs. LLVM &#8211; Simple Test Code Optimizations</title>
		<link>http://www.formboss.net/blog/2011/09/gcc-vs-llvm-simple-low-level-optimizations/</link>
		<comments>http://www.formboss.net/blog/2011/09/gcc-vs-llvm-simple-low-level-optimizations/#comments</comments>
		<pubDate>Thu, 29 Sep 2011 05:04:13 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[Clang]]></category>
		<category><![CDATA[GCC]]></category>
		<category><![CDATA[LLVM]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=1412</guid>
		<description><![CDATA[As a companion piece to this test suite on Phoronix I took the following *very simple* code and ran it through GCC 4.6 (Fedora 15) and the Clang/LLVM 2.8 Suite to check performance and various optimizations being performed: This code is purpose-built to thwart basic compiler optimizations. In particular the printf() call at the [...]]]></description>
			<content:encoded><![CDATA[<p>As a companion piece to this test suite on <a href="http://www.phoronix.com/scan.php?page=article&amp;item=gcc_46_llvm29&amp;num=1">Phoronix</a> I took the following *very simple* code and ran it through GCC 4.6 (Fedora 15) and the Clang/LLVM 2.8 Suite to check performance and various optimizations being performed:</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int test_noRef(int value)
{
	return value + 1;
}

int test_ref(int &amp;value){
	return value += 1;
}

int main(int argc, char *argv[]){

	int t = 1;

	for(int i = 0; i &lt; 100000000; i++){
		//t = test_noRef(t);
		test_ref(t);
	}

	// final result = n loops + 1
	printf(&quot;%d\n&quot;, t);

}
</pre>
<p><span id="more-1412"></span></p>
<p>This code is purpose-built to thwart basic compiler optimizations. In particular, the printf() call at the end uses the loop&#8217;s result, which prevents the compilers from simply discarding the entire function call block as dead code.</p>
<p>While simplistic, the basic idea is to see what optimizations each compiler performs and at what optimization level. Note that the LLVM/Clang suite can easily be installed on Fedora 15 by searching for LLVM. You&#8217;ll want to install LLVM <em>and </em>Clang:</p>
<p><a href="http://www.formboss.net/blog/2011/09/gcc-vs-llvm-simple-low-level-optimizations/llvm-clang-install/" rel="attachment wp-att-1414"><img class="alignnone size-full wp-image-1414" title="llvm-clang-install" src="http://www.formboss.net/blog/wp-content/uploads/2011/09/llvm-clang-install.png" alt="" width="675" height="558" /></a></p>
<h2>No Optimizations</h2>
<p>We start with a basic compile with no optimizations; in each case we&#8217;ll focus on test_ref() and test_noRef(). Please note I&#8217;ve removed the stack management code from these listings as they differ very little by compiler.</p>
<pre class="brush: plain; title: ; notranslate">
g++ main.cpp -S
clang main.cpp -S
</pre>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
_Z10test_noRefi:
.LFB0:
	movl	%edi, -4(%rbp)
	movl	-4(%rbp), %eax
	addl	$1, %eax
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
</pre>
<p>The noRef call in GCC moves our parameter to eax, adds 1, and cleans up the stack.</p>
<p>[LLVM]</p>
<pre class="brush: plain; title: ; notranslate">
_Z10test_noRefi:
.Leh_func_begin0:
	pushq	%rbp
.Ltmp0:
	movq	%rsp, %rbp
.Ltmp1:
	movl	%edi, -4(%rbp)
	movl	-4(%rbp), %edi
	addl	$1, %edi
	movl	%edi, %eax
	popq	%rbp
	ret
</pre>
<p>The LLVM code for the noRef() function is quite similar. The main difference is LLVM uses edi for the addition step, then moves the result to eax before returning.</p>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
_Z8test_refRi:
.LFB1:
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rax
	movl	(%rax), %eax
	leal	1(%rax), %edx
	movq	-8(%rbp), %rax
	movl	%edx, (%rax)
	movq	-8(%rbp), %rax
	movl	(%rax), %eax
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
</pre>
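<p>The reference-using version performs the same basic logic, but instead of movl we issue movq for the parameter, because a reference is passed as a 64-bit address that must be dereferenced before the int can be read. The unoptimized code also reloads that address from the stack before each access. That inefficiency aside, we do use an optimization for adding the 1 to our input argument via the leal instruction.</p>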
<p>[LLVM]</p>
<pre class="brush: plain; title: ; notranslate">
_Z8test_refRi:
.Leh_func_begin1:
	pushq	%rbp
.Ltmp3:
	movq	%rsp, %rbp
.Ltmp4:
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rdi
	movl	(%rdi), %eax
	addl	$1, %eax
	movl	%eax, (%rdi)
	popq	%rbp
	ret
</pre>
<p>LLVM&#8217;s code is a bit cleaner, focusing on pulling the address from the stack, performing the addl, storing the sum back, and returning the result.</p>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
main:
.L5:
	leaq	-8(%rbp), %rax
	movq	%rax, %rdi
	call	_Z8test_refRi
	addl	$1, -4(%rbp)
.L4:
	cmpl	$99999999, -4(%rbp)
	setle	%al
	testb	%al, %al
	jne	.L5
	movl	-8(%rbp), %eax
	movl	%eax, %esi
	movl	$.LC0, %edi
	movl	$0, %eax
	call	printf
	movl	$0, %eax
	leave
</pre>
<p>As with the function calls I&#8217;ve removed the activation record code. LLVM&#8217;s main function definition is essentially the same, so we won&#8217;t cover it here.</p>
<p>A quick benchmark between the two shows what we would expect from the extra fluff in GCC&#8217;s implementation:</p>
<p><strong>GCC</strong><br />
real 0m0.407s<br />
user 0m0.405s<br />
sys 0m0.002s</p>
<p><strong>LLVM</strong><br />
real 0m0.396s<br />
user 0m0.394s<br />
sys 0m0.000s</p>
<h2>Level 1 Optimizations</h2>
<p>It&#8217;s a bit unfair to draw comparisons without some optimizations turned on, so let&#8217;s do so now:</p>
<pre class="brush: plain; title: ; notranslate">
g++ main.cpp -O1 -S
clang main.cpp -O1 -S
</pre>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
_Z10test_noRefi:
.LFB19:
	leal	1(%rdi), %eax
	ret
</pre>
<p>Already we can see a huge difference in GCC&#8217;s output, which whittles this call down to a simple Load Effective Address and return.</p>
<p>[LLVM]</p>
<pre class="brush: plain; title: ; notranslate">
_Z10test_noRefi:
	leal	1(%rdi), %eax
	popq	%rbp
	ret
</pre>
<p>As expected LLVM performs the same lea optimization as GCC.</p>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
_Z8test_refRi:
.LFB20:
	movl	(%rdi), %eax
	addl	$1, %eax
	movl	%eax, (%rdi)
	ret
</pre>
<p>The reference-using version still has to contend with the indirection via movl (%rdi), %eax, but it&#8217;s still a good improvement.</p>
<p>[LLVM]</p>
<pre class="brush: plain; title: ; notranslate">
.Ltmp4:
	movl	(%rdi), %eax
	incl	%eax
	movl	%eax, (%rdi)
	popq	%rbp
	ret
</pre>
<p>LLVM uses incl to perform our addition, while GCC uses addl. A quick check on Google shows Intel <a href="http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2009-January/000547.html">recommends the GCC way</a>.</p>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
	movl	$1, 12(%rsp)
	movl	$100000000, %ebx
.L4:
	leaq	12(%rsp), %rdi
	call	_Z8test_refRi
	subl	$1, %ebx
	jne	.L4
</pre>
<p>The main() function has been reduced to the basics, counting down from our loop initialization value and calling the test_ref() function.</p>
<p>[LLVM]</p>
<pre class="brush: plain; title: ; notranslate">
.Ltmp8:
	movl	$1, -20(%rbp)
	movl	$100000000, %ebx
	leaq	-20(%rbp), %r14
	.align	16, 0x90
.LBB2_1:
	movq	%r14, %rdi
	callq	_Z8test_refRi
	decl	%ebx
	jne	.LBB2_1
</pre>
<p>LLVM&#8217;s main function loop has also been optimized, though again we&#8217;re using decl instead of subl.</p>
<p>Although the two compilers perform similar optimizations here, let&#8217;s examine how LLVM has structured our reference value for the loop.</p>
<p>Right before the main loop it calls:</p>
<pre class="brush: plain; title: ; notranslate">leaq    -20(%rbp), %r14</pre>
<p>Which loads the address of that base-pointer-relative stack slot into r14. In essence it&#8217;s created an alias. Why it&#8217;s done this makes sense when we see the rest of the loop.</p>
<p>The first call of every iteration is:</p>
<pre class="brush: plain; title: ; notranslate">movq	%r14, %rdi</pre>
<p>So we pass the address in %r14 to rdi and in the function call we have:</p>
<pre class="brush: plain; title: ; notranslate">movl    (%rdi), %eax</pre>
<p>Which essentially means dereference the value in rdi and push to eax for the incl call. That gets us our added value but it&#8217;s the next call where the magic happens, where we tie the whole thing together:</p>
<pre class="brush: plain; title: ; notranslate">movl    %eax, (%rdi)</pre>
<p>Now we&#8217;ve moved the value of eax to the <em>address location</em> in rdi. Since that address and the one in %r14 are one and the same, the next call in the main loop to:</p>
<pre class="brush: plain; title: ; notranslate">movq    %r14, %rdi</pre>
<p>Means that this freshly increased value from the function call, still pointed to by %r14, is once again passed via rdi, which in turn is dereferenced in the function call and incremented.</p>
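<p>To make the aliasing concrete, here is a rough C++ sketch of what the -O1 loop is doing once the reference is lowered to a pointer. The names test_ref_ptr(), run_countdown(), and alias are made up for illustration; they mirror the roles of test_ref(), the loop in main(), and %r14 in the listing above. This is a hand translation under those assumptions, not compiler output:</p>

```cpp
// Hypothetical pointer form of test_ref(): load, add 1, store back.
int test_ref_ptr(int *value)
{
	int tmp = *value;   // movl (%rdi), %eax
	tmp += 1;           // incl %eax (addl $1, %eax under GCC)
	*value = tmp;       // movl %eax, (%rdi)
	return tmp;
}

// Hypothetical stand-in for the loop in main().
int run_countdown(int iterations)
{
	int t = 1;
	int *alias = &t;    // leaq -20(%rbp), %r14: computed once, outside the loop

	// Both compilers count down to zero instead of up to the limit.
	for (int counter = iterations; counter != 0; counter--) {
		test_ref_ptr(alias);   // movq %r14, %rdi followed by the call
	}
	return t;
}
```

<p>Because alias never changes, every iteration hands the callee the same address, and each store through it is visible to the next call.</p>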
<p>That&#8217;s a pretty neat bit of optimization so let&#8217;s see how it pans out in terms of performance:</p>
<p><strong>GCC</strong><br />
real 0m0.165s<br />
user 0m0.163s<br />
sys 0m0.001s</p>
<p><strong>LLVM</strong><br />
real 0m0.163s<br />
user 0m0.161s<br />
sys 0m0.001s</p>
<p>It&#8217;s much the same, as GCC performs a similar trick using %rdi and the stack.</p>
<h2>Level 2 Optimizations</h2>
<p>Finally, let&#8217;s enable a higher level of optimization:</p>
<pre class="brush: plain; title: ; notranslate">
g++ main.cpp -O2 -S
clang main.cpp -O2 -S
</pre>
<p>[GCC]</p>
<pre class="brush: plain; title: ; notranslate">
main:
.LFB21:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$100000001, %esi
	movl	$.LC0, %edi
	xorl	%eax, %eax
	call	printf
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
</pre>
<p>[LLVM]</p>
<pre class="brush: plain; title: ; notranslate">
.Ltmp7:
	movl	$.L.str, %edi
	movl	$100000001, %esi
	xorb	%al, %al
	callq	printf
	xorl	%eax, %eax
	popq	%rbp
	ret
</pre>
<p>No need to show anything other than main() because it doesn&#8217;t matter.</p>
<p>Both LLVM and GCC have done what we&#8217;d expect with such a silly bit of code. Both compilers analyzed the code and came to the same conclusion: the final output value to printf() can be calculated directly within the compiler, so no need to call anything. Just move 100000001 to %esi for printf() and call it a day : )</p>
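<p>In C++ terms, the -O2 result amounts to something like the sketch below. run_loop() and folded_result() are illustrative names that are not part of the original test, and the closed form is simply what the constant 100000001 in the listings implies, not a description of the exact passes either compiler ran:</p>

```cpp
int test_ref(int &value){
	return value += 1;
}

// What the source asks for: n calls through a reference, starting from 1.
int run_loop(int n){
	int t = 1;
	for(int i = 0; i < n; i++){
		test_ref(t);
	}
	return t;
}

// What the optimizer can prove: each call adds exactly 1, so the whole loop
// collapses to one addition (a constant when n is known at compile time).
int folded_result(int n){
	return 1 + n;
}
```

<p>Calling run_loop(100000000) and folded_result(100000000) both give 100000001, the value loaded into %esi above.</p>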
<h3>In Conclusion</h3>
<p>While the code tested was simple, it&#8217;s interesting to see how the actual assembly differs. In this simple test LLVM produced slightly faster binaries at the first two optimization levels, though not by much: roughly 11ms with no optimizations and 2ms at -O1.</p>
<p>The other important bit is at least on Fedora, installing and using GCC and LLVM couldn&#8217;t be simpler.</p>
<h3>Links</h3>
<p><a href="http://llvm.org/docs/GettingStarted.html#tutorial4">Installing and Using LLVM</a></p>
<p><a href="http://llvm.org/demo/index.cgi">LLVM Online Compiler Demo</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/09/gcc-vs-llvm-simple-low-level-optimizations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Benchmarks: JavaScript vs. PHP vs. HPHP vs C++</title>
		<link>http://www.formboss.net/blog/2011/07/benchmarks-javascript-vs-php-vs-hphp-vs-c/</link>
		<comments>http://www.formboss.net/blog/2011/07/benchmarks-javascript-vs-php-vs-hphp-vs-c/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 20:04:01 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Linux/Web Servers]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=1228</guid>
		<description><![CDATA[Just a quick spattering of some benchmarks I ran while testing a program I&#8217;m developing internally. The first three are of the fannkuch benchmark found on the sunspider test site, only ported over to C++ and PHP in addition to the JS version. The browser used for all tests was Firefox 5. The HPHP listing [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick spattering of some benchmarks I ran while testing a program I&#8217;m developing internally.</p>
<p>The first three are of the <a title="Sunspider Benchmark" href="http://www.webkit.org/perf/sunspider-0.9/access-fannkuch.html" target="_blank">fannkuch benchmark found on the sunspider test site</a>, only ported over to C++ and PHP in addition to the JS version. The browser used for all tests was Firefox 5.</p>
<p>The HPHP listing is a compiled <a title="Hip Hop PHP" href="https://github.com/facebook/hiphop-php/wiki/" target="_blank">Hip Hop PHP</a> version, which is interesting as it shows the relative difference between it and vanilla PHP.</p>
<p>The last item, Benchmark.php, is the same benchmark file you&#8217;ll find in a <a title="PHP Source" href="http://www.php.net/downloads.php" target="_blank">PHP source download</a>.</p>
<p>See the full benchmarks after the jump!</p>
<p><span id="more-1228"></span></p>
<p><strong>access-fannkuch &#8211; n=9</strong></p>
<p>JavaScript: 0.204</p>
<p>PHP: 1.624</p>
<p>HPHP: 0.538</p>
<p>C++: 0.20</p>
<p><strong>access-fannkuch &#8211; n=10</strong></p>
<p>JavaScript: 3.062</p>
<p>PHP: 19.058</p>
<p>HPHP: 6.395</p>
<p>C++: 0.189</p>
<p><strong>access-fannkuch &#8211; n=11</strong></p>
<p>JavaScript: 40.297</p>
<p>PHP: ~4 minutes</p>
<p>HPHP: 23.713</p>
<p>C++: 2.265</p>
<p><strong>Benchmark.php</strong></p>
<p>PHP: 2.542</p>
<p>HPHP: 0.547</p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/07/benchmarks-javascript-vs-php-vs-hphp-vs-c/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Qt SDK 1.1.2 - Install The QODBC Driver</title>
		<link>http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/</link>
		<comments>http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/#comments</comments>
		<pubDate>Sun, 03 Jul 2011 19:17:58 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[Qt]]></category>
		<category><![CDATA[QODBC]]></category>
		<category><![CDATA[QODBC Driver]]></category>
		<category><![CDATA[Qt SDK]]></category>
		<category><![CDATA[SQL Server Qt]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=1271</guid>
		<description><![CDATA[In the most recent versions of the Qt SDK it appears as if default support for all databases except SQLite has been removed. The Qt documentation provides a short write-up on how to re-enable this driver, but it leaves out some important details. To enable support for the QODBC driver follow these steps: 1. From [...]]]></description>
			<content:encoded><![CDATA[<p>In the most recent versions of the Qt SDK it appears as if default support for all databases <em>except </em>SQLite has been removed.</p>
<p>The Qt documentation provides a short write-up <a title="Qt Documentation" href="http://doc.qt.nokia.com/4.7/sql-driver.html#how-to-build-the-odbc-plugin-on-windows" target="_blank">on how to re-enable this driver</a>, but it leaves out some important details.</p>
<p>To enable support for the QODBC driver follow these steps:</p>
<p>1. From the Qt Creator Application Launch <strong>Start Updater</strong>:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/start-updater/" rel="attachment wp-att-1272"><img class="alignnone size-full wp-image-1272" title="start-updater" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/start-updater.png" alt="" width="459" height="228" /></a></p>
<p><em>Continued after the jump&#8230;</em></p>
<p><span id="more-1271"></span></p>
<p>2. Expand Qt SDK &gt; Miscellaneous &gt; Qt Sources &gt; Qt 4.7.3 Sources and click <strong>Next </strong>to install the Qt Sources:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/sources/" rel="attachment wp-att-1273"><img class="alignnone size-full wp-image-1273" title="sources" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/sources.png" alt="" width="590" height="469" /></a></p>
<p>3. When that finishes from the <strong>Start Menu</strong> Launch:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/start-menu-item/" rel="attachment wp-att-1274"><img class="alignnone size-full wp-image-1274" title="start-menu-item" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/start-menu-item.png" alt="" width="261" height="154" /></a></p>
<p>4. In the new <strong>command prompt window</strong> navigate (using cd) to:</p>
<pre>C:\QtSDK\QtSources\4.7.3\src\plugins\sqldrivers\odbc</pre>
<p>Of course your path may be different, and in my case, I have used the offline installer. That said, the key is the online installer should be very similar, and in either case we want to hit the <strong>QtSources </strong>directory, and then within that, the <em>plugins</em>, <em>sqldrivers</em>, <em>odbc</em> folder.</p>
<p>Please note there is a similar set of files in:</p>
<pre>C:\QtSDK\QtSources\4.7.3\src\sql\drivers\odbc</pre>
<p>We do not want that. I say this as in an unfamiliar source tree it&#8217;s easy to get turned around if we&#8217;re not careful.</p>
<p>We should now have the following in the command prompt window:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/command-1/" rel="attachment wp-att-1275"><img class="alignnone size-full wp-image-1275" title="command-1" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/command-1.png" alt="" width="707" height="362" /></a></p>
<p>5. Issue the command:</p>
<pre>qmake</pre>
<p>Nothing will change in the command-line window, but if you look at the folder we&#8217;re in, several files have been created:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/new-files/" rel="attachment wp-att-1276"><img class="alignnone size-full wp-image-1276" title="new-files" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/new-files.png" alt="" width="772" height="462" /></a></p>
<p>This happened because the folder, at the start, contained a .pro file, which when run against <strong>qmake</strong> has created platform specific build files we&#8217;ll use in the next step.</p>
<p>6. Issue the commands:</p>
<pre>nmake release</pre>
<p>Followed by:</p>
<pre>nmake install release</pre>
<p>If we check our mingw/plugins/sqldrivers folder we now have the newly created driver:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/new-sql-driver/" rel="attachment wp-att-1283"><img class="alignnone size-full wp-image-1283" title="new-sql-driver" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/new-sql-driver.png" alt="" width="559" height="241" /></a></p>
<p>Now when we launch Qt Creator and build a project, SQL Server/ODBC support should be ready for use.</p>
<p>Of course, it should be added that in order to <em>load</em> the driver for a project we must add:</p>
<pre>sql-plugins += odbc</pre>
<p>&#8230;to that project&#8217;s .pro file.</p>
<h3>Caveats</h3>
<p>Of course this instruction set assumes we have nmake available to us which may, to be honest, only be the case if we also have Visual Studio or a Windows SDK such as <a title="Windows SDK" href="http://www.microsoft.com/download/en/details.aspx?displaylang=en&amp;id=3138">Windows 7 SDK</a> installed and with a correct system PATH defined to nmake.</p>
<p>I say this as in theory what <em>should </em>happen when we launch the Qt command line shortcut is we <em>should </em>be able to just run <strong>make</strong>. I cannot get this to work however, which leads me to believe I must be missing something.</p>
<p>Thus, a good first check when running through this tutorial will be to check to make sure you can run <strong>nmake </strong>and get output other than a message saying nmake is not a recognized command. For example, I get:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/where-nmake/" rel="attachment wp-att-1286"><img class="alignnone size-full wp-image-1286" title="where-nmake" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/where-nmake.png" alt="" width="611" height="39" /></a></p>
<p>I get this as again, I have the Windows 7 SDK and Visual Studio 2010 Express installed on my system, and also have added the PATH to nmake in my System Variables:</p>
<p><a href="http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/system-path/" rel="attachment wp-att-1287"><img class="alignnone size-full wp-image-1287" title="system-path" src="http://www.formboss.net/blog/wp-content/uploads/2011/07/system-path.png" alt="" width="445" height="510" /></a></p>
<p>No matter, another popular target for extension management will no doubt be MySQL, to which <a title="MySQL Qt Driver Install" href="http://www.pikopong.com/blog/2010/04/11/how-to-enable-mysql-support-in-qt-sdk-for-windows/" target="_blank">this tutorial</a> may be useful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/07/qt-sdk-1-1-2-install-the-qodbc-driver/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Sunspider Benchmark in C++</title>
		<link>http://www.formboss.net/blog/2011/04/sunspider-benchmark-in-c/</link>
		<comments>http://www.formboss.net/blog/2011/04/sunspider-benchmark-in-c/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 20:27:54 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[benchmark]]></category>
		<category><![CDATA[fannkuch]]></category>
		<category><![CDATA[sunspider]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=1168</guid>
		<description><![CDATA[The Sunspider JavaScript Benchmark is a popular test of Web Browser performance. As great as the newest browsers are I thought it would be interesting to take a random test and port it to C++ for the sake of comparison. I decided the rather short fannkuch test was a good candidate. The results: Not surprisingly [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.webkit.org/perf/sunspider-0.9.1/sunspider-0.9.1/driver.html">Sunspider JavaScript Benchmark</a> is a <a href="http://ie.microsoft.com/testdrive/benchmarks/sunspider/default.html">popular test</a> of Web Browser performance.</p>
<p>As great as the newest browsers are, I thought it would be interesting to take a random test and port it to C++ for the sake of comparison.</p>
<p>I decided the rather short <a href="http://www.webkit.org/perf/sunspider-0.9/access-fannkuch.html">fannkuch </a>test was a good candidate. The results:</p>
<p><a rel="attachment wp-att-1169" href="http://www.formboss.net/blog/2011/04/sunspider-benchmark-in-c/fannkuch-benchmark/"><img class="alignnone size-full wp-image-1169" title="fannkuch-benchmark" src="http://www.formboss.net/blog/wp-content/uploads/2011/04/fannkuch-benchmark.png" alt="" width="680" height="422" /></a></p>
<p>Not surprisingly the C++ version is 18x faster than Firefox and 8x faster than Chrome. For the curious, the ported C++ code is after the jump.</p>
<p>I should mention one thing changed from the stock Sunspider site (i.e., the link above): the number of iterations was upped from 8 to 10. At 8 iterations the C++ version was sub-millisecond, meaning I&#8217;d have had to roll a custom assembly timer to get a benchmark value. That said, when we lower the iterations down to 9, the stack-heavy algorithm starts to benefit the browsers more, with Chrome coming in only 3 times slower and Firefox 11 times slower. </p>
<p>Lower the iteration count to the &#8220;stock&#8221; 8 and Firefox actually catches up to Chrome, with both browsers reporting ~21ms. Interesting. </p>
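<p>As an aside, a sub-millisecond run doesn&#8217;t strictly require an assembly timer. Below is a minimal sketch, assuming a C++11 compiler with &lt;chrono&gt; available. This is not how the numbers above were measured, and time_microseconds() and busy_work() are made-up names, with busy_work() merely standing in for a call such as Fannkuch(8):</p>

```cpp
#include <chrono>

// Hypothetical workload standing in for the function being benchmarked.
long long busy_work(int n)
{
    volatile long long sum = 0;   // volatile keeps the loop from being optimized away
    for (int i = 0; i < n; i++) sum = sum + i;
    return sum;
}

// Run a callable once and return the elapsed wall time in microseconds.
template <typename Func>
double time_microseconds(Func work)
{
    typedef std::chrono::steady_clock clock_type;
    const clock_type::time_point start = clock_type::now();
    work();
    const clock_type::time_point stop = clock_type::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

<p>A call such as time_microseconds([] { busy_work(1000000); }) returns the elapsed time as a double. For very short runs it is usually more reliable to time many repetitions and divide.</p>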
<p><span id="more-1168"></span></p>
<p>The C++ Code</p>
<pre class="brush: cpp; title: ; notranslate">
int MainWindow::Fannkuch(int n)
{

    int check = 0;
    int perm[n];
    int perm1[n];
    int count[n];
    int maxPerm[n];
    int maxFlipsCount = 0;
    int m = n - 1;

    for (int i = 0; i &lt; n; i++){
        perm1[i] = i;
    }

    int r = n;

    while (true) {

        // write-out the first 30 permutations
        if (check &lt; 30){
            int s = 0;
            for(int i=0; i&lt;n; i++) s += (perm1[i]+1);
            check++;
        }

        while (r != 1) { count[r - 1] = r; r--; }

        if (!(perm1[0] == 0 || perm1[m] == m)) {
            for (int i = 0; i &lt; n; i++) perm[i] = perm1[i];
            int flipsCount = 0;
            int k;

            while (!((k = perm[0]) == 0)) {
                int k2 = (k + 1) &gt;&gt; 1;
                for (int i = 0; i &lt; k2; i++) {
                    int temp = perm[i]; perm[i] = perm[k - i]; perm[k - i] = temp;
                }
                flipsCount++;
            }

            if (flipsCount &gt; maxFlipsCount) {
                maxFlipsCount = flipsCount;
                for (int i = 0; i &lt; n; i++) maxPerm[i] = perm1[i];
            }
        }

        while (true) {
            if (r == n) return maxFlipsCount;

            int perm0 = perm1[0];
            int i = 0;

            while (i &lt; r) {
                int j = i + 1;
                perm1[i] = perm1[j];
                i = j;
            }

            perm1[r] = perm0;
            count[r] = count[r] - 1;

            if (count[r] &gt; 0) break;
            r++;
        }
    }

}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/04/sunspider-benchmark-in-c/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SSE And Inline Assembly Example</title>
		<link>http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/</link>
		<comments>http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/#comments</comments>
		<pubDate>Mon, 04 Apr 2011 04:42:53 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[Qt]]></category>
		<category><![CDATA[64-bit assembly]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[Performance Computing]]></category>
		<category><![CDATA[SSE]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=1111</guid>
		<description><![CDATA[In previous posts we&#8217;ve covered Inline Assembly and SSE Intrinsics coding. In this post we&#8217;ll merge these concepts by creating a version of the CMYK to RGB conversion code strictly in raw SSE and assembly. The upshot is you&#8217;ll see how we can take existing, real-world C++ code and use GCC&#8217;s Extended Assembly syntax to [...]]]></description>
			<content:encoded><![CDATA[<p>In previous posts we&#8217;ve covered <a title="Inline Assembly" href="http://www.formboss.net/blog/2010/10/gcc-inline-assembly-loop-structures/" target="_blank">Inline Assembly</a> and <a title="SSE intrinsics" href="http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/" target="_blank">SSE Intrinsics coding</a>.</p>
<p>In this post we&#8217;ll merge these concepts by creating a version of the CMYK to RGB conversion code strictly in <strong>raw SSE </strong>and <strong>assembly</strong>. The upshot is you&#8217;ll see how we can take existing, real-world C++ code and use GCC&#8217;s <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html">Extended Assembly</a> syntax to interweave raw assembly code for potential performance gains.</p>
<p>This means this tutorial is not just about extended assembly or SSE coding, it&#8217;s about using both in a real-world application. We&#8217;ll learn many concepts including data retrieval, loop processing, SSE processor instructions, floating point number representation, and much more!</p>
<p><span id="more-1111"></span></p>
<p>Let&#8217;s start our tutorial by taking a quick look at the core algorithm logic we&#8217;ll be implementing (for a more in-depth refresher this is covered in some detail <a href="http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/">here </a>):</p>
<pre class="brush: plain; title: ; notranslate">
c = 1.0 - (bits / 255f)
m = 1.0 - (bits / 255f)
y = 1.0 - (bits / 255f)

k = the min of c/m/y

if k != 1
c = (c - k) / (1 - k)
m = (m - k) / (1 - k)
y = (y - k) / (1 - k)

c = c * 255
m = m * 255
y = y * 255
k = k * 255
</pre>
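<p>For reference, the same logic for a single pixel in plain scalar C++ might look like the sketch below. The Cmyk struct, the convert_pixel() name, the r/g/b parameter order, and the rounding on the final cast are assumptions made for illustration; the pseudocode above only says &#8220;bits&#8221; for each channel:</p>

```cpp
#include <algorithm>

typedef unsigned char uchar;

// Hypothetical container for one converted pixel.
struct Cmyk { uchar c, m, y, k; };

Cmyk convert_pixel(uchar r, uchar g, uchar b)
{
    // Normalize each 8-bit channel to 0..1 and invert it.
    float c = 1.0f - (r / 255.0f);
    float m = 1.0f - (g / 255.0f);
    float y = 1.0f - (b / 255.0f);

    // k is the smallest of the three.
    float k = std::min(c, std::min(m, y));

    // Skip the division when k == 1, which would otherwise divide by zero.
    if (k != 1.0f) {
        c = (c - k) / (1.0f - k);
        m = (m - k) / (1.0f - k);
        y = (y - k) / (1.0f - k);
    }

    // Scale back to 0..255; adding 0.5f rounds to nearest (an assumption).
    Cmyk out;
    out.c = static_cast<uchar>(c * 255.0f + 0.5f);
    out.m = static_cast<uchar>(m * 255.0f + 0.5f);
    out.y = static_cast<uchar>(y * 255.0f + 0.5f);
    out.k = static_cast<uchar>(k * 255.0f + 0.5f);
    return out;
}
```

<p>Having a scalar version like this on hand is a convenient way to sanity-check the output of the SSE routine on a handful of pixels.</p>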
<p>The data source (the incoming image data) is a call to Qt&#8217;s <a title="Qt Bits() Function Call" href="http://doc.qt.nokia.com/4.7/qimage.html#bits" target="_blank">QImage::bits()</a> function, which returns a pointer to a uchar array containing the raw image data-stream.</p>
<p>The destination, that is, the RGB converted data, is a heap-based uchar array created via:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
uchar *cmyk_temp = new uchar[(wt * ht) * 4];
</pre>
<p>In other words we already have a source and a destination set in C++. (To put it simply, the point of this tutorial is to write assembly to manage the data in-between these two places.) This means the first big task in creating an assembly version is knowing how we &#8216;hook&#8217; into those existing data structures.</p>
<p>A key point in this regard is that while we&#8217;ll manage loops and data access in our assembly code, we do <em>not</em> want to think about managing the stack. Thus, we can think of these two arrays as places we&#8217;ll get the address to via pointers, but never create and manage in assembly on our own.</p>
<p>So how do we access this data from GCC Extended Assembly?</p>
<p>The answer is we, in extended assembly syntax, use these pointer variables as <strong>input </strong>and <strong>output </strong>parameters. (please see <a href="http://www.formboss.net/blog/2010/06/c-64bit-inline-assembly-primer-part-1/">here </a>for a brief primer on GCC inline assembly)</p>
<p>For example, let&#8217;s say we have the following C++:</p>
<pre class="brush: plain; title: ; notranslate">
int a = 1;

__asm__ __volatile__(

&quot;mov %0, %%ebx \n\t&quot;

: /* output parameters */

: &quot;m&quot;(a)

: &quot;ebx&quot; /* clobbers */

);
</pre>
<p>This code says that we&#8217;ll take the <strong>int a</strong> variable and move it to <strong>ebx</strong>. The key is that behind the scenes GCC will actually rewrite this mov instruction to something like:</p>
<pre class="brush: plain; title: ; notranslate">
mov -0x20(%rbp), %ebx
</pre>
<p>In other words, we let GCC manage the stack. Using Extended Assembly like this means we don&#8217;t care where on the stack int a comes from, we just want to make sure we have access to it. The extended assembly syntax allows for this easy manipulation of data.</p>
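<p>Here is a small, self-contained variation on that idea that can be compiled and run as-is. It is only a sketch: it assumes GCC or Clang on x86/x86-64 with the default AT&amp;T syntax, and load_via_asm() is a made-up name. Unlike the snippet above it also declares an output operand, so the value can be handed back to C++:</p>

```cpp
// Hypothetical helper: round-trips an int through ecx using extended asm.
int load_via_asm(int a)
{
    int result;

    __asm__ __volatile__(
        "movl %1, %%ecx \n\t"   // the compiler replaces %1 with wherever 'a' lives in memory
        "movl %%ecx, %0 \n\t"   // copy ecx into the register chosen for 'result'

        : "=r"(result)          /* output parameters */
        : "m"(a)                /* input parameters */
        : "ecx"                 /* clobbers: we overwrite ecx ourselves */
    );

    return result;
}
```

<p>Listing ecx as a clobber matters here: it stops the compiler from choosing ecx for %0 and from assuming ecx still holds whatever it held before the block.</p>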
<p>For our code, this means the start of our implementation will actually be a series of C++ variable declarations that we&#8217;ll end up passing into the __asm__ call.</p>
<p>One item of note here is that along with simple array pointers and ints, we also declare two <strong>__m128</strong> objects for easy double quad-word storage of constants we&#8217;ll need for our calculations, those being vectors of 1.0f and 255.0f.</p>
<p>&#8220;Talking&#8221; to these __m128 items is accomplished in much the same way as our example above, only now we use the <strong>movaps </strong>mnemonic as in:</p>
<pre class="brush: plain; title: ; notranslate">
&quot;movaps %5, %%xmm14 \n\t&quot; // 255.0
</pre>
<p>In other words, bytes, chars, floats, __m128s: we can create whatever we need and pass it to the assembly routine, which means we don&#8217;t need to worry about the stack.</p>
<h3>Register Pressure</h3>
<p>This takes us to one of the main goals of this exercise: <strong>use as many CPU registers as possible during the conversion process</strong>. This means one of the explicit assumptions of this code is it only runs on 64-bit machines. That is, it&#8217;s hard-coded to use the full register set available to x86_64.</p>
<p>This means we have an extended range of 64-bit (quad-word) general purpose registers (r8-r15), as well as the full set of 128-bit (double quad-word) SSE registers, xmm0-xmm15.</p>
<p>Obviously the assumption here is the fewer trips to main memory we can make, the faster our code will be.</p>
<p>And so, the first &#8216;preparatory&#8217; part of our assembly code is spent mapping various constants to known registers so we can refer back to them as often as needed without making expensive trips to main memory.</p>
<p>The interesting bit here is that not all of our mov&#8217;s are for the same purpose. Some moves, as above, store 128-bit values to an SSE register. Others set up bit masks, and still others set up loop counters.</p>
<p>The key, of course, is that this logic happens outside of our main conversion loop. Once we enter the loop we do as <strong>much </strong>as we can to avoid reads and writes to main memory.</p>
<p>With that said, let&#8217;s jump straight into the code and see how it works!</p>
<h3>Part 1 &#8211; Initialization.</h3>
<p>In this section we perform the essential task of initializing the variables we&#8217;ll pass to the extended asm routine. </p>
<p>Note the <strong>*bits</strong> array: this is the source of the values being used. It&#8217;s an example of a standard link to a variable, though we also create a few float arrays and zero out their initial values for easier debugging.</p>
<p>The most important bit in this block is the <strong>_pack_lookup_table</strong> float array. The creation of this item (a classic <a href="http://en.wikipedia.org/wiki/Lookup_table">lookup table</a>) is to reduce the possible overhead of the uchar to float conversions we must make.</p>
<p>The idea is simple: because we only have 256 possible input values, instead of converting each in turn inside the conversion loop, let&#8217;s just create a lookup table and map the input values to the IEEE floating point representations created in the initialization loop (whose values are calculated once up front, <strong>not </strong>once per pixel). This has a huge potential benefit as each pixel will require 3 conversions, and when you&#8217;re dealing with 8 million pixels per image these int to float conversions can really add up.</p>
<p>To be clear though, converting ints to floats on modern hardware isn&#8217;t a huge deal (around 8-16 cycles on most CPUs), but this is still a handy way to learn about assembly coding. We will see in Part 4, however, that our lookup table may not be the best possible solution because of the extra memory trips we end up making. </p>
<p>Still, this is also a good excuse to learn a bit more about assembly level addressing modes, which we&#8217;ll see a good deal of in many different places.</p>
<p>As such, the code below is just standard C++.</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
int t = 1;
int l = (char)t;

uchar *bits = imin.bits();

int count = (wt * ht) * 4; // link with loop counter byte size

int k_min_value;

float * _pack_loop_array = new float[4];
// zero out array for easier debugging
_pack_loop_array[0] = 0;
_pack_loop_array[1] = 0;
_pack_loop_array[2] = 0;
_pack_loop_array[3] = 0;

float * _mask_array = new float[4];
// zero out array for easier debugging
_mask_array[0] = 0;
_mask_array[1] = 0;
_mask_array[2] = 0;
_mask_array[3] = 0;

float * _pack_lookup_table = new float[256];
// fill all 256 entries (indices 0-255) with their float equivalents
for(int i = 0; i &lt; 256; i++){
    _pack_lookup_table[i] = (float)i;
}

// sse constants - use m128 to ensure aligned data
__m128 m_1 = _mm_set_ps1(1.0f);
__m128 m_255 = _mm_set_ps1(255.0f);
</pre>
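<p>Stripped of everything else, the lookup table idea can be sketched as below. This is a stand-alone illustration rather than the code used in the routine: <strong>make_pack_lookup_table</strong> and <strong>channel_to_float</strong> are hypothetical names, and I use std::array instead of new[] so there is nothing to free.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: builds the uchar -> float table once, up front.
std::array<float, 256> make_pack_lookup_table() {
    std::array<float, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<float>(i); // the only int-to-float conversions we pay for
    }
    return table;
}

// Hypothetical helper: "converts" one channel byte with a table read.
// An unsigned char can only hold 0-255, so the index is always in range.
float channel_to_float(const std::array<float, 256>& table, unsigned char value) {
    return table[value];
}
```

<p>Build the table once, then call <strong>channel_to_float</strong> for each channel byte. Whether the table read beats a hardware conversion is exactly the question Part 4 looks at.</p>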
<h3>Part 2 &#8211; Initial __asm__ and Constant Creation</h3>
<p>In this section we dive into the heart of our routine&#8217;s assembly code. </p>
<p>A good deal of space is for initialising loop constants and mapping values to registers&#8211;pretty standard stuff.</p>
<p>We do have one interesting bit though, which is the creation of the 128-bit sign-flip mask.</p>
<p>We&#8217;ve done this because in IEEE single-precision floating point representation, the HO bit (bit 31) is <em>always</em> the sign bit. When the bit&#8217;s 0 the value is positive, when it&#8217;s 1 the value is negative.</p>
<p>We can exploit this fact to easily flip the sign of a float, or in our case, the packed floats. Problem is, in order to create a 32-bit string in the form of the proper sign-flip mask (0x80000000) we would have to resort to all sorts of trickery in C++. That is, how do you create the bit pattern of our mask without the compiler trying to turn it into something else, such as a float, int, or char? It sounds like an easy problem to address, but it isn&#8217;t. </p>
<p>Thus, as the value needs to be created anyway, we&#8217;ll just do so in assembly, where defining and populating memory locations with arbitrary bit-strings is easy.</p>
<p>In the end we push this mask value to xmm4 where it remains constant throughout the life of the conversion process.</p>
<pre class="brush: plain; title: ; notranslate">
__asm__ __volatile__(

    // set sse constants

    &quot;movaps %5, %%xmm14 \n\t&quot; // 255.0
    &quot;movaps %4, %%xmm15 \n\t&quot; // 1.0

    // init SSE sign-flip mask
    &quot;xorps %%xmm4, %%xmm4 \n\t&quot;

    &quot;xor %%rax, %%rax \n\t&quot;
    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;mov $0x80000000, %%r11 \n\t&quot;
    &quot;mov %9, %%rbx \n\t&quot;

    // populate mask values
    // (32-bit stores: one dword per lane, so we stay inside the 16-byte array)

    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;

    // populate sse reg
    &quot;movaps (%%rbx), %%xmm4 \n\t&quot;

    // init loop constants

    &quot;xor %%rax, %%rax \n\t&quot; /* init i array counter */

    &quot;xor %%rdx, %%rdx \n\t&quot; /* set array counter upper bounds */
    &quot;mov %2, %%edx \n\t&quot;

    &quot;xor %%rcx, %%rcx \n\t&quot; /* get base address of source array */
    &quot;mov %1, %%rcx \n\t&quot;

    &quot;xor %%rbx, %%rbx \n\t&quot; /* get base address of dest array */
    &quot;mov %3, %%rbx \n\t&quot;

    &quot;xor %%r14, %%r14 \n\t&quot; /* get base address of _pack_lookup_table array */
    &quot;mov %7, %%r14 \n\t&quot;

    &quot;xor %%r9, %%r9 \n\t&quot; /* L_PACK_LOOP Bounds Check ($0x10) */
    &quot;mov $0x10, %%r9 \n\t&quot; // init with decimal 16 -  4 bytes x 4 loops

    &quot;xor %%r10, %%r10 \n\t&quot; /* get base address of _pack_loop_array */
    &quot;mov %6, %%r10 \n\t&quot;

    // init logic constants

    &quot;xor %%r13, %%r13 \n\t&quot;
    &quot;mov $0x1, %%r13 \n\t&quot; // 1 value for k_min compare
</pre>
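<p>If you want to convince yourself that bit 31 really is the sign bit, a single float can be checked from C++ by copying its bits into an integer and back. This is only a sketch for one value, not the packed mask used in the routine. <strong>flip_sign</strong> is a hypothetical name, and the code assumes a 32-bit IEEE float:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Hypothetical helper: flips the sign of one float by toggling bit 31.
float flip_sign(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits); // reinterpret the float as raw bits
    bits ^= 0x80000000u;                     // toggle only the sign bit
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

<p>Here 1.5f comes back as -1.5f and -2.0f comes back as 2.0f. The memcpy dance is the &#8220;trickery&#8221; needed to stop the compiler treating the pattern as a number.</p>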
<h3>Part 3 &#8211; Initial Loop Logic</h3>
<p>The following block of code mainly just resets r8 and r15 for loop counting and addressing purposes.</p>
<p>Again, the point of this exercise is to use as many registers as possible. Part of this means we also need to carefully manage how and where registers are used.</p>
<p>Moving (copying of course) rax to r15 for example, means r15 points to important locations we need to index from, but can be incremented without fear of &#8220;breaking&#8221; the main index register, which for our code is rax. We&#8217;ll see this type of activity in a few other places as well.</p>
<p>The point being of course that we&#8217;re not hitting main memory, just registers. </p>
<pre class="brush: plain; title: ; notranslate">
&quot;.L_MAIN_LOOP:&quot; // convert each value to float, push to xmm0

    // subroutine - created packed data from single loop

    &quot;xor %%r8, %%r8 \n\t&quot; // inner loop counter
    &quot;mov $0x0, %%r8 \n\t&quot;

    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;xor %%r12, %%r12 \n\t&quot;

    &quot;xor %%r15, %%r15 \n\t&quot;
    &quot;mov %%rax, %%r15 \n\t&quot; // PACK_INDEX
</pre>
<h3>Part 4 &#8211; Converting Ints To Floats &#8211; Two methods</h3>
<p>With the main loop initialisation done we&#8217;re now free to start processing data. We start by grabbing the current CMYK value from the source array and placing it into r11, then masking it to leave only the LO byte.</p>
<p>This sets us up for the next step, which is converting the int values into proper floats for the xmm registers. Let&#8217;s look at two ways of accomplishing this step:</p>
<p><strong>Method 1 &#8211; Memory Dependent And Slightly Slower &#8211; A Lookup Table In Action</strong><br />
Per above, the first step in both methods is to move and mask the raw int value to r11. Crucially this number will always be between 0-255. We exploit this fact in the <strong>mov</strong> call where r14 is the base address of the <strong>_pack_lookup_table</strong> array we created in C++, and r11, (also always between 0 and 255), acts as the index.</p>
<p>This is a common optimization known as a lookup table. In essence, instead of converting ints to floats (at a cost of around 8 cycles per conversion on my machine), we use simple memory moves to fetch a pre-computed floating point bit-string from the <strong>_pack_lookup_table</strong> array.</p>
<p>This is an intellectually pleasing way to handle the conversion step, but it comes with an unfortunate side effect. The problem is that our loop never moves a general-purpose register straight into an xmm register (SSE2&#8217;s <strong>movd</strong> can do that, but it only fills the LO element, so the four values would still need shuffling into place). We can, however, move memory values to xmm, only this presents another problem: for our lookup table to work we access main memory twice, once for the lookup table value, then once again to store the floating point bit string in an aligned 16-byte memory location (which is later used as the argument to <strong>movaps</strong>). Such repeated memory access can be devastatingly slow, and when compared to just doing a relatively speedy direct conversion, becomes tough to justify.</p>
<p>All told, in my tests the lookup table code <em>would</em> be a touch faster than GCC&#8217;s output if it were not for these <strong>mov</strong> instructions. Without those instructions though, we&#8217;d have no lookup table!</p>
<pre class="brush: plain; title: ; notranslate">
&quot;.L_PACK_LOOP:&quot;

    &quot;nop \n\t&quot;

    &quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

    &quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

    // use lookup table to push pre-converted float val
    &quot;mov (%%r14, %%r11, 4), %%r12d \n\t&quot; // indexed value (one 32-bit float)

    // this is slow!
    &quot;mov %%r12d, (%%r10, %%r8, 1) \n\t&quot; // push 4 bytes to _pack_loop_array

    &quot;add $0x1, %%r15 \n\t&quot; // 1 pixel per loop iteration

    &quot;add $0x4, %%r8 \n\t&quot;

    &quot;cmp %%r8, %%r9 \n\t&quot;

    &quot;jne .L_PACK_LOOP \n\t&quot;

&quot;xor %%r8, %%r8 \n\t&quot; // clear r8 index for push to xmm

&quot;movaps (%%r10, %%r8, 1), %%xmm0 \n\t&quot; /* move packed values into xmm0 */
</pre>
<p>You&#8217;ll notice the last instruction sets our 128-bit xmm0 register with the value of the pack_loop_array.</p>
<p>All told this works, and again, is somewhat intellectually pleasing. The question is can this be made faster? The answer is yes it can, and the secret is to avoid memory access at all costs. Unfortunately, this means we must rid ourselves of the lookup table, as described in Method 2&#8230;</p>
<p><strong>Method 2 &#8211; Using Hardware Conversion &#038; Avoiding Memory Access</strong><br />
The second attempt at this logic turns out to be better in terms of performance, even though we ditch the lookup table. </p>
<p>The basic idea is instead of worrying about the conversion costs, we embrace them and use an unrolled conversion block where each value is simply run through <strong>CVTSI2SS</strong>, then shuffled to make room for the next value.<em> It should be said that the unroll logic provided no discernible speed-up&#8211;the improvements come from fewer costly memory access steps</em>. </p>
<pre class="brush: plain; title: ; notranslate">
// c

&quot;add $0x2, %%r15 \n\t&quot; // index for c value

&quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

&quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

&quot;CVTSI2SS %%r11, %%xmm0 \n\t&quot;

&quot;shufps $0xC4, %%xmm0, %%xmm0 \n\t&quot; // 0xC4 = 11000100: copy element 0 (c) into element 2

// m

&quot;sub $0x1, %%r15 \n\t&quot; // index for m value

&quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

&quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

&quot;CVTSI2SS %%r11, %%xmm0 \n\t&quot;

&quot;shufps $0xE1, %%xmm0, %%xmm0 \n\t&quot; // flip last two to get m 11100001

// y

&quot;sub $0x1, %%r15 \n\t&quot; // index for y value

&quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot; // get source array value

&quot;and $0x00000000000000ff, %%r11 \n\t&quot; // mask unwanted bits

&quot;CVTSI2SS %%r11, %%xmm0 \n\t&quot;
</pre>
<p>The big difference here is memory access is kept to a bare minimum. With this simple change my implementation runs around the same speed as GCC&#8217;s -O2.</p>
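<p>For comparison, here is how a hardware conversion of one pixel might look with SSE2 intrinsics. To be clear, this is a different technique from the CVTSI2SS unroll above: it widens all four bytes and converts them in one packed step. <strong>bytes_to_floats</strong> is a hypothetical helper, and the scalar branch is only there so the sketch builds where SSE2 is not available:</p>

```cpp
#include <cassert>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: converts four consecutive channel bytes to four floats.
void bytes_to_floats(const unsigned char* src, float* dst) {
#if defined(__SSE2__)
    int packed;
    std::memcpy(&packed, src, sizeof packed);  // one 32-bit load of the pixel
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);     // 4 bytes in the LO double-word
    v = _mm_unpacklo_epi8(v, zero);            // widen 8-bit -> 16-bit
    v = _mm_unpacklo_epi16(v, zero);           // widen 16-bit -> 32-bit
    _mm_storeu_ps(dst, _mm_cvtepi32_ps(v));    // packed int -> float conversion
#else
    for (int i = 0; i < 4; ++i) {
        dst[i] = static_cast<float>(src[i]);   // scalar fallback
    }
#endif
}
```

<p>The output order matches the input order, so any channel reordering would still have to happen afterwards.</p>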
<h3>Part 5 &#8211; CMY Conversion Step 1</h3>
<p>Now that we have a series of 4 floating point numbers in xmm0 (our packed data) we can begin the conversion process. </p>
<p>The code starts with what has to be one of the more satisfying parts of SSE coding. As we now have a 128-bit packed string we can divide and subtract our constant values from the working value in two easy calls to <strong>divps</strong> and <strong>subps</strong>. It&#8217;s getting a lot of work done in two quick instructions, which is always a good feeling!</p>
<p>Things get a bit more interesting after that though. At this point we most likely have a series of negative numbers, something that doesn&#8217;t play well with the next step of the algorithm. </p>
<p>This is because we need to find the lowest value of the bunch to create our k value, but with negative numbers in play the comparison picks out the wrong one.</p>
<p>Thus, before we can advance we need to clear bit 31 of each packed float to 0 so the value comparison is valid and we find the &#8216;true&#8217; lowest value. Think of this as the assembly equivalent of an <strong>abs()</strong> call. </p>
<p>This is where our mask comes in. The idea of masking is simple enough, though unfortunately it becomes more of a chore when dealing with packed values. </p>
<p>The basic problem is there is no &#8216;default&#8217; mask we can <strong>XOR</strong> in that will safely clear only bit 31 of the 4 packed values without also knowing <em>which</em> values in the packed string are already positive. This is because XOR will flip negative to positive, but also positive to negative. We don&#8217;t want that of course, we only want the former.</p>
<p>Thus, we first need to modify our mask (known as a selection mask), to ignore positive values so they aren&#8217;t accidentally converted to negative values. </p>
<p>To do so means we must copy our mask and working xmm0 value to new registers. We then call <strong>psrad</strong> against the xmm0 copy to create a selection mask where values that have bit 31 set (meaning they&#8217;re negative) become all 1s ($0xffffffff). We then <strong>logically AND</strong> this value with the sign-flip mask copy to turn the double-words with positive values back to 0. </p>
<p>This new, modified mask can now be safely <strong>xor</strong>&#8217;d with the original value to only flip negatives to positives.</p>
<pre class="brush: plain; title: ; notranslate">
// perform cmy conversion

// divide by 255
&quot;divps %%xmm14, %%xmm0 \n\t&quot;

// subtract 1
&quot;subps %%xmm15, %%xmm0 \n\t&quot;

// mask back to positive values for min processing

// copy mask for modifier change into xmm5 (xmm4 is mask constant)
&quot;movaps %%xmm4, %%xmm5 \n\t&quot;

// copy raw value for min to xmm6
&quot;movaps %%xmm0, %%xmm6 \n\t&quot;

// create selection mask modifier
&quot;psrad $0x1f, %%xmm6 \n\t&quot; // shift by 31: every negative value becomes all $0xffffffff

// modify mask copy (xmm5) to mask only negative values
&quot;andps %%xmm6, %%xmm5 \n\t&quot;

// now mask sign bits to 0 with modified mask
&quot;xorps %%xmm5, %%xmm0 \n\t&quot;
</pre>
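<p>The mask juggling is easier to follow one lane at a time. The sketch below models a single 32-bit lane of the psrad / andps / xorps sequence in plain C++, then shows that an unconditional AND-NOT with the same sign mask (what a single <strong>andnps</strong> would do per lane) lands on the same answer. All function names are hypothetical and the code assumes 32-bit IEEE floats:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

std::uint32_t to_bits(float f) {
    std::uint32_t b;
    std::memcpy(&b, &f, sizeof b);
    return b;
}

float from_bits(std::uint32_t b) {
    float f;
    std::memcpy(&f, &b, sizeof f);
    return f;
}

// One lane of the selection-mask sequence described above.
float abs_lane_selection_mask(float value) {
    const std::uint32_t sign_mask = 0x80000000u;                  // one lane of xmm4
    const std::uint32_t bits = to_bits(value);                    // one lane of xmm0
    const std::uint32_t select = (bits >> 31) ? 0xFFFFFFFFu : 0u; // psrad by 31
    const std::uint32_t flip = sign_mask & select;                // andps
    return from_bits(bits ^ flip);                                // xorps
}

// Same result without a selection mask: clear bit 31 unconditionally.
float abs_lane_andnot(float value) {
    return from_bits(to_bits(value) & ~0x80000000u);
}
```

<p>Both helpers turn -0.25f into 0.25f and leave 0.75f alone, so the shorter AND-NOT form may be worth trying if you want to trim an instruction or two.</p>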
<h3>Part 6 &#8211; CMY Conversion &#8211; Find Min</h3>
<p>Finding the minimum value in assembly is very similar in practise to the last section where we manipulate masks, only now we&#8217;re also shuffling values around too.</p>
<p>Granted we also had to shuffle values around in the SSE intrinsics version as well, but here we need to think a bit harder about our bit-masks, as we no longer have the benefit of using Intel&#8217;s handy <strong>_MM_SHUFFLE</strong> helper. This type of coding is, I can safely say, a bit more complex. </p>
<p>After we find the minimum value we call <strong>cmpeqps</strong> and then store the LO result in <strong>k_min_value</strong>, which as we can see from the first code block is a simple int we pass into the asm block as a memory value. This may seem like a wasted step, but <strong>movss</strong> cannot write to a general-purpose register: its destination must be a memory location or another xmm register. </p>
<p>We then use this value in a cmp block and if &#8216;true&#8217;, perform a bit of extra processing on our image data.</p>
<pre class="brush: plain; title: ; notranslate">
// find min value
&quot;movaps %%xmm0, %%xmm1 \n\t&quot;
&quot;shufps $0x4E, %%xmm0, %%xmm1 \n\t&quot; // reorder values (mask first)

&quot;minps %%xmm0, %%xmm1 \n\t&quot; // find first min, put in xmm1

&quot;movaps %%xmm1, %%xmm2 \n\t&quot;
&quot;shufps $0xB1, %%xmm1, %%xmm2 \n\t&quot;

&quot;minps %%xmm1, %%xmm2 \n\t&quot; // min (k) in xmm2

// process min logic (is min value == 1? - true = cmpeqps creates mask of all 1's)
&quot;movaps %%xmm15, %%xmm12 \n\t&quot; // move 1's to xmm12
&quot;cmpeqps %%xmm2, %%xmm12 \n\t&quot; // mask now in xmm12

&quot;movss %%xmm12, %8 \n\t&quot; // move mask to memory value for cmp

&quot;cmp %8, %%r13 \n\t&quot; // is k == 1?

&quot;jne .L_MULTIPLY_ALL \n\t&quot; // anything but 1, skip this block

// save 1 - k value
&quot;movaps %%xmm15, %%xmm13 \n\t&quot; // xmm13 is temp storage for 1.0f
&quot;subps %%xmm2, %%xmm13 \n\t&quot; // 1-k in xmm13

// subtract k from all
&quot;subps %%xmm2, %%xmm0 \n\t&quot;

// c-k / 1-k
&quot;divps %%xmm13, %%xmm0 \n\t&quot;

&quot;.L_MULTIPLY_ALL:&quot;

&quot;mulps %%xmm14, %%xmm0 \n\t&quot; // cmy * 255
&quot;mulps %%xmm14, %%xmm2 \n\t&quot; // k * 255
</pre>
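<p>Since the shuffle immediates are the hard part to read, here is a scalar model of the two shuffle-and-min rounds. It is only a sketch: <strong>shuffle_imm</strong> repeats the arithmetic of Intel&#8217;s _MM_SHUFFLE macro, while <strong>shufps_model</strong>, <strong>minps_model</strong> and <strong>horizontal_min</strong> are hypothetical stand-ins for the instructions, working on a plain array of four floats:</p>

```cpp
#include <algorithm>
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// Same arithmetic as the _MM_SHUFFLE(z, y, x, w) macro.
constexpr unsigned shuffle_imm(unsigned z, unsigned y, unsigned x, unsigned w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Scalar model of "shufps imm, src, dst": the two LO lanes are picked
// from dst, the two HO lanes from src.
Vec4 shufps_model(unsigned imm, const Vec4& src, const Vec4& dst) {
    return { dst[imm & 3], dst[(imm >> 2) & 3],
             src[(imm >> 4) & 3], src[(imm >> 6) & 3] };
}

// Scalar model of minps: lane-by-lane minimum.
Vec4 minps_model(const Vec4& a, const Vec4& b) {
    return { std::min(a[0], b[0]), std::min(a[1], b[1]),
             std::min(a[2], b[2]), std::min(a[3], b[3]) };
}

// Two shuffle + min rounds leave the overall minimum in every lane.
float horizontal_min(const Vec4& v) {
    const Vec4 halves_swapped = shufps_model(0x4E, v, v);        // [v2, v3, v0, v1]
    const Vec4 first_min = minps_model(v, halves_swapped);
    const Vec4 pairs_swapped = shufps_model(0xB1, first_min, first_min);
    return minps_model(first_min, pairs_swapped)[0];
}
```

<p>With this model, 0x4E works out to shuffle_imm(1, 0, 3, 2) and 0xB1 to shuffle_imm(2, 3, 0, 1), which is how I would have written them with the macro.</p>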
<h3>Part 7 &#8211; Export Values</h3>
<p>With the conversion process done we can now write our values back to the destination buffer.</p>
<p>At this point the question becomes: how do we move these packed floats back into an unsigned char array?</p>
<p>This is a surprisingly serious question, as there is (strangely) no single native machine instruction for doing so.</p>
<p>Thus, we need to be a bit creative. To that end, the general solution is we extract the LO double-word from the source vector, converting it to a <strong>truncated int</strong> in the process. We then grab the LO byte of <em>that </em>value and store it in an indexed position in the destination buffer. Next we shuffle the vector to place a new value in the LO double-word and repeat the process. In the end we <em>store each value back in turn</em> as opposed to sending all back to main memory in one shot (which would be far better in terms of performance). </p>
<p>As far as the exact implementation in our code: the first instruction saves a register we are about to overwrite. When we call <strong>CVTTSS2SI </strong>to convert our float values to truncated ints, they come back as double-words, but we only want bytes. As there is no instruction for this type of conversion, there must be <em>some</em> way to get byte values back. Sure enough, the LO byte of the conversion result can be read through the destination register&#8217;s 8-bit sub-register (which is to say, just the LO single byte of the full rcx register; this is not a &#8216;different&#8217; register by any means).</p>
<p>Thus, in our code we copy our source-array register <strong>rcx </strong>to <strong>r11</strong> so it can be restored, because if we use <strong>rcx </strong>as the destination register for <strong>CVTTSS2SI</strong>, the LO byte of the result can be read from <em>cl</em>, which is exactly the value we need (an unsigned int between 0 and 255). We use rcx here because cl is the familiar name for its LO byte (r11 has a byte form too, <em>r11b</em>, so this is a choice rather than a requirement).  </p>
<p>The next instruction is to push the main loop counter, <strong>rax</strong>, to <strong>r15</strong>. Again, it&#8217;s all about saving registers, which means because <strong>rax </strong>contains the same offset address as our other arrays, we can use that along with <strong>rbx </strong>to create our indexed address for the output array population step.</p>
<p>Of course this means we need to manipulate this register with the proper byte offset our values need to be pushed to, which is the reason for the <strong>add</strong> and <strong>sub</strong> calls.  </p>
<p>We do this because the LittleCMS engine, which our output buffer is being filled for, expects CMYK byte order, but our xmm0 values are currently in YMCK. Instead of issuing shuffles we simply increment and decrement our <strong>r15 </strong>register with the appropriate byte offsets depending on which color we&#8217;re writing to the destination array. This is how we rearrange YMCK into CMYK, which is simple and effective.</p>
<pre class="brush: plain; title: ; notranslate">
// conversion done, export values

&quot;mov %%rcx, %%r11 \n\t&quot; // save rcx so we can convert dwords to bytes using cl

&quot;mov %%rax, %%r15 \n\t&quot; // pointer for insert values (r15)

// y - truncate and export values to array
&quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot; // y

&quot;add $0x2, %%r15 \n\t&quot; // add 2 for y value insert
&quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;

// m
&quot;shufps $0xE1, %%xmm0, %%xmm0 \n\t&quot; // flip last two to get m
&quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot; // m

&quot;sub $0x1, %%r15 \n\t&quot; // subtract 1 for m value insert
&quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;

// c
&quot;shufps $0x4E, %%xmm0, %%xmm0 \n\t&quot; // swap LO and HO halves to bring c to the LO double-word
&quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot; // c

&quot;mov %%cl, (%%rbx, %%rax, 1) \n\t&quot; // rax is the pixel's base offset, no need to sub

// k
&quot;CVTTSS2SI %%xmm2, %%rcx \n\t&quot; // k

&quot;add $0x2, %%r15 \n\t&quot; // r15 is at offset 1, add 2 to reach offset 3 for k value insert
&quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;

&quot;mov %%r11, %%rcx \n\t&quot; // restore rcx
</pre>
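<p>To keep the big picture in view, the arithmetic from Parts 5 through 7 can be sketched for one pixel in plain scalar C++. This is a reference under my own assumptions rather than a transcription of the assembly: it uses the usual naive RGB to CMYK formula, takes the channels as separate R, G, B arguments, and zeroes C, M and Y when k is 1 so we never divide by 1 - k = 0. <strong>rgb_to_cmyk_reference</strong> is a hypothetical name:</p>

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical scalar reference: writes four bytes in C, M, Y, K order.
void rgb_to_cmyk_reference(unsigned char r, unsigned char g, unsigned char b,
                           unsigned char out[4]) {
    // Part 5: 1 - value/255 (the assembly computes value/255 - 1, then clears the sign)
    float c = 1.0f - r / 255.0f;
    float m = 1.0f - g / 255.0f;
    float y = 1.0f - b / 255.0f;

    // Part 6: k is the smallest of the three
    const float k = std::min(c, std::min(m, y));

    if (k == 1.0f) {
        c = m = y = 0.0f;          // pure black, skip the division entirely
    } else {
        c = (c - k) / (1.0f - k);
        m = (m - k) / (1.0f - k);
        y = (y - k) / (1.0f - k);
    }

    // Part 7: scale back to 0-255 and truncate, as CVTTSS2SI does
    out[0] = static_cast<unsigned char>(c * 255.0f);
    out[1] = static_cast<unsigned char>(m * 255.0f);
    out[2] = static_cast<unsigned char>(y * 255.0f);
    out[3] = static_cast<unsigned char>(k * 255.0f);
}
```

<p>A reference like this is handy for spot-checking the assembly output on a handful of known pixels, such as pure white, pure black and pure red.</p>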
<h3>Part 8 &#8211; Closing The Loop and __asm__ Logic</h3>
<p>The closing assembly bits are simply to increment our loop counter.</p>
<p>The last part is of course the extended assembly closing bit where we set input, output, and clobbers.</p>
<pre class="brush: plain; title: ; notranslate">
    &quot;addq $0x4, %%rax \n\t&quot; // increment main loop counter
    &quot;cmpq %%rdx, %%rax \n\t&quot;

    &quot;jne .L_MAIN_LOOP \n\t&quot;

    : &quot;=m&quot; (cmyk_temp)  /* destination */
    : &quot;m&quot; (bits), &quot;m&quot; (count), &quot;m&quot; (cmyk_temp), &quot;m&quot; (m_1), &quot;m&quot; (m_255), &quot;m&quot; (_pack_loop_array), &quot;m&quot; (_pack_lookup_table), &quot;m&quot; (k_min_value), &quot;m&quot;(_mask_array) /* source */
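    : &quot;rax&quot;, &quot;rbx&quot;, &quot;rcx&quot;, &quot;rdx&quot;, &quot;r8&quot;, &quot;r9&quot;, &quot;r10&quot;, &quot;r11&quot;, &quot;r12&quot;, &quot;r13&quot;, &quot;r14&quot;, &quot;r15&quot;, &quot;xmm0&quot;, &quot;xmm1&quot;, &quot;xmm2&quot;, &quot;xmm4&quot;, &quot;xmm5&quot;, &quot;xmm6&quot;, &quot;xmm12&quot;, &quot;xmm13&quot;, &quot;xmm14&quot;, &quot;xmm15&quot;, &quot;cc&quot;, &quot;memory&quot; /* clobbers: every register the block writes */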

); // asm

// arrays allocated with new[] must be released with delete[]
delete[] _pack_lookup_table;
delete[] _pack_loop_array;
delete[] _mask_array;
</pre>
<h3>Performance Notes</h3>
<p>One of the main reasons I was interested in pursuing this task was to see if I could write code that had some parity with GCC&#8217;s -O2 output. In all I&#8217;m relatively happy, with performance generally being only a few hundred milliseconds slower over the course of 120 images:</p>
<pre class="brush: plain; title: ; notranslate">
== || Total Process Time Elapsed (milliseconds): 21228 SSE 2 (unrolled intrinsics)
== || Total Process Time Elapsed (milliseconds): 23252 Raw Assembly
== || Total Process Time Elapsed (milliseconds): 21344 Raw Assembly
== || Total Process Time Elapsed (milliseconds): 21125 SSE 1 (intrinsics)
== || Total Process Time Elapsed (milliseconds): 20747 SSE 2 (unrolled intrinsics)
</pre>
<p>As fun as creating this code was, the lesson here is: if you can use intrinsics, do it. There is simply no reason to slave over this type of coding if you can possibly avoid it, as in the end you&#8217;ll have a very hard time besting your compiler anyway. Yes, you may get lucky every now and then, but my guess would be most of the time, not so much. Intrinsics rock, use em!</p>
<p>The one thing I would say though is there <em>are</em> some holes in the current intrinsic line-up which make raw assembly attractive. One glaring example is the inability to deal with packed values in a horizontal fashion. That is, it should be very simple to find the min of four packed floats using one intrinsic&#8211;but no such intrinsic exists. There are <em>plans </em>for one, but nothing as of yet.</p>
<p>It would also be nice to have a native machine instruction to perform packed sign flips. If we did, the above code could be a touch shorter. </p>
<p>Finally, it has to be said this algorithm is really not the best for SSE optimizations. This is because at its heart the vectors we create are non-uniform, which means we end up treating each vector as 4 separate values instead of a single block. Actually, this <em>severely limits</em> us in terms of performance gains, which is why we&#8217;ll see non-SSE versions of GCC&#8217;s -O2 come very close to, if not best, the SSE versions. </p>
<p>The good news is the techniques covered in this tutorial are still valid. Just keep in mind not all data structures are well suited for SSE. </p>
<h3>General hints</h3>
<p>Along the way I tried to keep note of the various idiosyncrasies of this type of coding. </p>
<p>Of course the best piece of advice is to just see what your own assembler is doing in a debug session.</p>
<p>The problem with this approach, however, is that unlike Intel syntax, GCC&#8217;s inline Gas syntax adds many rules and formatting requirements that are not present in raw assembly output. For example, you&#8217;ll notice that each instruction string needs to end with \n\t so the assembler sees one instruction per line.</p>
<p>With that said, here are a few hints:</p>
<p><strong>CGG Reversed Syntax</strong><br />
Ha ha. But seriously, GCC (AT&#038;T syntax) reverses the order of operands in specific ways, meaning many pieces of documentation you&#8217;ll read won&#8217;t work unless we first reverse them. It&#8217;s important to note it&#8217;s not just the operands in two operand instructions, but also 3 operand ones too.</p>
<p>A good example is this guy from my code:</p>
<pre class="brush: plain; title: ; notranslate">shufps $0xe4,%xmm1,%xmm0
</pre>
<p>In Intel documentation it looks like:</p>
<pre class="brush: plain; title: ; notranslate">
SHUFPS xmm1, xmm2/m128, imm8
</pre>
<p>Thus, not only are operands reversed (source/destination), <strong>so too is the immediate value</strong>. </p>
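<p>A two-operand instruction where order matters makes the reversal easy to see. Below is a minimal sketch using a subtraction. <strong>subtract_att</strong> is a hypothetical helper, and the fallback branch only exists so the snippet builds on non-x86 targets:</p>

```cpp
#include <cassert>

// Hypothetical helper: computes a - b with a single AT&T-syntax instruction.
int subtract_att(int a, int b) {
#if defined(__x86_64__) || defined(__i386__)
    // AT&T order: "subl source, destination" means destination -= source.
    // Intel documentation lists the same instruction as: SUB destination, source
    __asm__("subl %1, %0"
            : "+r"(a) // %0: read and written, so it is the destination
            : "r"(b)  // %1: the source
            : "cc");  // sub updates the flags
#else
    a -= b; // plain C++ fallback
#endif
    return a;
}
```

<p>If the operands were swapped in the asm string, subtract_att(10, 3) would no longer come out as 7, which is a quick way to check you have the order right.</p>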
<p><strong>Using 64-bit? Use 64-bit Register Names</strong><br />
This one sounds obvious, but it can be easy to forget that this won&#8217;t work:</p>
<pre class="brush: plain; title: ; notranslate">&quot;mov %%r15, %%eax \n\t&quot;</pre>
<p>Whereas this will:</p>
<pre class="brush: plain; title: ; notranslate">&quot;mov %%r15, %%rax \n\t&quot;</pre>
<h3>Conclusion</h3>
<p>And that&#8217;s it, hopefully you&#8217;ve learned a bit more about assembly coding : )</p>
<p>If you have any questions, please post &#8216;em below!</p>
<h3>Full Code Listing</h3>
<pre class="brush: plain; title: ; notranslate">
__asm__ __volatile__(
    &quot;movaps %5, %%xmm14 \n\t&quot;
    &quot;movaps %4, %%xmm15 \n\t&quot;
    &quot;xorps %%xmm4, %%xmm4 \n\t&quot;
    &quot;xor %%rax, %%rax \n\t&quot;
    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;mov $0x80000000, %%r11 \n\t&quot;
    &quot;mov %9, %%rbx \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;add $0x4, %%rax \n\t&quot;
    &quot;movl %%r11d,(%%rbx, %%rax, 1) \n\t&quot;
    &quot;movaps (%%rbx), %%xmm4 \n\t&quot;
    &quot;xor %%rax, %%rax \n\t&quot;
    &quot;xor %%rdx, %%rdx \n\t&quot;
    &quot;mov %2, %%edx \n\t&quot;
    &quot;xor %%rcx, %%rcx \n\t&quot;
    &quot;mov %1, %%rcx \n\t&quot;
    &quot;xor %%rbx, %%rbx \n\t&quot;
    &quot;mov %3, %%rbx \n\t&quot;
    &quot;xor %%r14, %%r14 \n\t&quot;
    &quot;mov %7, %%r14 \n\t&quot;
    &quot;xor %%r9, %%r9 \n\t&quot;
    &quot;mov $0x10, %%r9 \n\t&quot;
    &quot;xor %%r10, %%r10 \n\t&quot;
    &quot;mov %6, %%r10 \n\t&quot;
    &quot;xor %%r13, %%r13 \n\t&quot;
    &quot;mov $0x1, %%r13 \n\t&quot;
    &quot;.L_MAIN_LOOP:&quot;
    &quot;xor %%r8, %%r8 \n\t&quot;
    &quot;mov $0x0, %%r8 \n\t&quot;
    &quot;xor %%r11, %%r11 \n\t&quot;
    &quot;xor %%r12, %%r12 \n\t&quot;
    &quot;xor %%r15, %%r15 \n\t&quot;
    &quot;mov %%rax, %%r15 \n\t&quot;
    &quot;.L_PACK_LOOP:&quot;
    &quot;mov (%%rcx, %%r15, 1), %%r11 \n\t&quot;
    &quot;and $0x00000000000000ff, %%r11 \n\t&quot;
    &quot;mov (%%r14, %%r11, 4), %%r12d \n\t&quot;
    &quot;mov %%r12d, (%%r10, %%r8, 1) \n\t&quot;
    &quot;add $0x1, %%r15 \n\t&quot;
    &quot;add $0x4, %%r8 \n\t&quot;
    &quot;cmp %%r8, %%r9 \n\t&quot;
    &quot;jne .L_PACK_LOOP \n\t&quot;
    &quot;xor %%r8, %%r8 \n\t&quot;
    &quot;movaps (%%r10, %%r8, 1), %%xmm0 \n\t&quot;
    &quot;divps %%xmm14, %%xmm0 \n\t&quot;
    &quot;subps %%xmm15, %%xmm0 \n\t&quot;
    &quot;movaps %%xmm4, %%xmm5 \n\t&quot;
    &quot;movaps %%xmm0, %%xmm6 \n\t&quot;
    &quot;psrad $0x1f, %%xmm6 \n\t&quot;
    &quot;andps %%xmm6, %%xmm5 \n\t&quot;
    &quot;xorps %%xmm5, %%xmm0 \n\t&quot;
    &quot;movaps %%xmm0, %%xmm1 \n\t&quot;
    &quot;shufps $0x4E, %%xmm0, %%xmm1 \n\t&quot;
    &quot;minps %%xmm0, %%xmm1 \n\t&quot;
    &quot;movups %%xmm1, %%xmm2 \n\t&quot;
    &quot;shufps $0xB1, %%xmm1, %%xmm2 \n\t&quot;
    &quot;minps %%xmm1, %%xmm2 \n\t&quot;
    &quot;movaps %%xmm15, %%xmm12 \n\t&quot;
    &quot;cmpeqps %%xmm2, %%xmm12 \n\t&quot;
    &quot;movss %%xmm12, %8 \n\t&quot;
    &quot;cmp %8, %%r13 \n\t&quot;
    &quot;jne .L_MULTIPLY_ALL \n\t&quot;
    &quot;movaps %%xmm15, %%xmm13 \n\t&quot;
    &quot;subps %%xmm2, %%xmm13 \n\t&quot;
    &quot;subps %%xmm2, %%xmm0 \n\t&quot;
    &quot;divps %%xmm13, %%xmm0 \n\t&quot;
    &quot;.L_MULTIPLY_ALL:&quot;
    &quot;mulps %%xmm14, %%xmm0 \n\t&quot;
    &quot;mulps %%xmm14, %%xmm2 \n\t&quot;
    &quot;mov %%rcx, %%r11 \n\t&quot;
    &quot;mov %%rax, %%r15 \n\t&quot;
    &quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot;
    &quot;add $0x2, %%r15 \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;
    &quot;shufps $0xE1, %%xmm0, %%xmm0 \n\t&quot;
    &quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot;
    &quot;sub $0x1, %%r15 \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;
    &quot;shufps $0x4E, %%xmm0, %%xmm0 \n\t&quot;
    &quot;CVTTSS2SI %%xmm0, %%rcx \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%rax, 1) \n\t&quot;
    &quot;CVTTSS2SI %%xmm2, %%rcx \n\t&quot;
    &quot;add $0x2, %%r15 \n\t&quot;
    &quot;mov %%cl, (%%rbx, %%r15, 1) \n\t&quot;
    &quot;mov %%r11, %%rcx \n\t&quot;
    &quot;addq $0x4, %%rax \n\t&quot; // increment main loop counter
    &quot;cmpq %%rdx, %%rax \n\t&quot;
    &quot;jne .L_MAIN_LOOP \n\t&quot;

    : &quot;=m&quot; (cmyk_temp)  /* destination */
    : &quot;m&quot; (bits), &quot;m&quot; (count), &quot;m&quot; (cmyk_temp), &quot;m&quot; (m_1), &quot;m&quot; (m_255), &quot;m&quot; (_pack_loop_array), &quot;m&quot; (_pack_lookup_table), &quot;m&quot; (k_min_value), &quot;m&quot;(_mask_array) /* source */
    : &quot;rax&quot;, &quot;rbx&quot;, &quot;rcx&quot;, &quot;rdx&quot;, &quot;r9&quot;, &quot;r10&quot;, &quot;r11&quot;, &quot;r12&quot;, &quot;r13&quot;, &quot;r14&quot;, &quot;r15&quot;, &quot;memory&quot; /* clobbers */

); // asm
</pre>
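<p>The <strong>shufps</strong>/<strong>minps</strong> pairs in the listing above reduce four packed floats to their minimum. For readers who would rather experiment from C++ than inline assembly, here is a minimal sketch of the same reduction using SSE intrinsics. The function names are my own for illustration and are not part of the original routine, and the code assumes an x86 or x86-64 target:</p>

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Returns the smallest of the four packed floats in v.
// Mirrors the shufps/minps pairs in the listing: 0x4E swaps the two
// 64-bit halves, 0xB1 swaps the neighbours inside each half.
float horizontal_min(__m128 v)
{
    __m128 swapped_halves = _mm_shuffle_ps(v, v, 0x4E);
    __m128 pair_min       = _mm_min_ps(v, swapped_halves);
    __m128 swapped_pairs  = _mm_shuffle_ps(pair_min, pair_min, 0xB1);
    __m128 all_min        = _mm_min_ps(pair_min, swapped_pairs);
    return _mm_cvtss_f32(all_min);  // every lane now holds the minimum
}

// Hypothetical usage helper for this sketch.
void horizontal_min_demo()
{
    __m128 v = _mm_set_ps(4.0f, 1.5f, 9.0f, 3.0f);
    assert(horizontal_min(v) == 1.5f);
}
```

<p>After the second <strong>minps</strong> all four lanes hold the same value, which is why the assembly can read the result from any lane.</p>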
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Aesop &#8211; A Hip Hop PHP UI</title>
		<link>http://www.formboss.net/blog/2011/03/aesop-a-hip-hop-php-ui/</link>
		<comments>http://www.formboss.net/blog/2011/03/aesop-a-hip-hop-php-ui/#comments</comments>
		<pubDate>Mon, 14 Mar 2011 08:01:26 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[Linux/Web Servers]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Qt]]></category>
		<category><![CDATA[Hip Hop GUI]]></category>
		<category><![CDATA[Hip Hop UI]]></category>
		<category><![CDATA[HipHOP PHP Front End]]></category>
		<category><![CDATA[HipHop PHP UI]]></category>
		<category><![CDATA[Hop Hop PHP UI]]></category>
		<category><![CDATA[HPHP]]></category>
		<category><![CDATA[HPHP GUI]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=992</guid>
		<description><![CDATA[Download the Complete (and free, as in open source free) Aesop + HPHP files from right here. PHP is my favorite language, bar none. It&#8217;s simple, elegant, and fun to use. Problem is, for highly trafficked sites it&#8217;s a touch slow and can be quite memory hungry. If you&#8217;re Facebook this can lead to problems, [...]]]></description>
			<content:encoded><![CDATA[<div style="background-color: #ffffdd; border: 1px dotted #cccccc; padding: 10px; margin-bottom: 10px;"><strong>Download the Complete (and free, as in open source free) Aesop + HPHP files from <a title="Open Source Files" href="http://www.formboss.net/open-source" target="_blank">right here.</a></strong></div>
<p>PHP is my favorite language, bar none. It&#8217;s simple, elegant, and fun to use. Problem is, for highly trafficked sites it&#8217;s a touch slow and can be quite memory hungry. If you&#8217;re Facebook this can lead to problems, which is why they invented <strong>Hip Hop PHP</strong> (<strong>HPHP</strong>), a collection of tools and technology that turns our slow and hungry PHP code into lean and mean C++.</p>
<h3>Ok, So Just How Fast Is It?</h3>
<p>As a quick  comparison I created a simple FormBoss form and ran Apache Bench (<strong>ab</strong> from the command line), to get a sense  of the speed difference between Apache 2.2 and HPHP.</p>
<p>The top two tests are when running our simple .php files, the bottom test is when serving a simple 62 byte .xml file with 100 concurrent users:</p>
<p><a rel="attachment wp-att-1064" href="http://www.formboss.net/blog/2011/03/aesop-a-hip-hop-php-ui/apache-v-hphp/"><img class="alignnone size-full wp-image-1064" title="apache-v-hphp" src="http://www.formboss.net/blog/wp-content/uploads/2011/03/apache-v-hphp.png" alt="" width="672" height="328" /></a></p>
<p><span style="font-size: 12px; font-style: italic; color: #aaaaaa;">**It&#8217;s important to note these numbers will be lower when running through a network and calling a database. Also, while other servers like Cherokee can be twice as fast as Apache, HPHP is still nearly twice as fast again.</span></p>
<p>So yes, qualifications aside, HPHP is very fast indeed.</p>
<p>Sure these numbers are fantastic, but using HPHP means compiling the source from scratch and then using a series of command-line switches to run and manage the compiled PHP code.</p>
<p>No longer: in my spare time I&#8217;ve created and released an open-source front-end UI to HPHP.</p>
<p>Read on to learn more, or just <a title="HPHP UI - Aesop" href="http://www.formboss.net/open-source">download the files!</a></p>
<p><span id="more-992"></span></p>
<h3>A quick bit of history and why we need Aesop</h3>
<p>Back in early 2010 Facebook announced a wonderful new technology called <a title="Hip Hop PHP Announced" href="http://developers.facebook.com/blog/post/358/" target="_blank">Hip Hop PHP (HPHP)</a>. The idea was simple: PHP is a fantastic scripting language, but its loosely typed roots and interpreted nature make it a good deal slower and more memory hungry than other web technologies. As Facebook was already deeply invested in PHP they didn&#8217;t want to give it up without a fight. Thus, why not write a series of tools and applications that turn PHP into native C++?</p>
<p>That&#8217;s what they did of course, and apparently for an entity like Facebook the savings are <a title="Hip Hop PHP Results" href="https://github.com/facebook/hiphop-php/wiki/" target="_blank">well worth it</a>. That doesn&#8217;t mean that we can&#8217;t enjoy it as well, because along with the announcement that a large percentage of Facebook&#8217;s traffic was already being served via HPHP, they were also releasing the whole mess as open-source software.</p>
<p>Bravo of course, what a fantastically nice thing to do. The only issue was that in order to get Hip Hop to run we need to compile it from source, then issue a series of command-line instructions to compile and run our applications. This isn&#8217;t the end of the world of course, but often times when trying to compile existing PHP source we&#8217;ll run into a bevy of errors and issues &#8212; not HPHP&#8217;s fault, but the existing PHP sources&#8217;. Thus, it&#8217;s a load of trial and error at first, which considerably diminishes the fun this type of work should be.</p>
<p>Nonsense, I said, we can fix that. Let&#8217;s create an application that removes the burden of working with the command line so we can instead focus on our code.</p>
<h3>The Aesop UI</h3>
<p>I was <a title="Hip Hop PHP" href="http://www.formboss.net/blog/2010/03/hiphop-php-benchmark/" target="_blank">rightly excited</a> when the HPHP project was first released, and so even though it took me longer than I had wanted to start <em>this </em>project, I finally present a UI to HPHP.</p>
<p>The idea of course is to make the process of managing and using HPHP simple and possibly even fun, but above all useful.</p>
<p>To that end Aesop consists of three main functions and goals:</p>
<p><strong>Pre-compile and package HPHP -</strong> The idea here is simple. To get us up and running as quickly as possible I&#8217;ve packaged the HPHP binaries in the full download version of Aesop. This means we no longer have to worry about compiling HPHP, which saves time and possible headache. We needn&#8217;t use this version though, we can still download a minimal, Aesop only version complete with full source code.</p>
<p><strong>Compile</strong> &#8211; Compiling HPHP applications from the command line means well, using a command line, setting environmental variables, and so on. Thus, we remove this burden and instead focus on making sure our code-base is acceptable to HPHP.</p>
<p>Aesop builds the file lists HPHP requires for code-compilation, sets environment variables, displays errors, and also manages the results of HPHP for the other main function of Aesop&#8230;</p>
<p><strong>Manage Servers</strong> &#8211; Compiling code is only half the story. Once compiled, HPHP has literally created a server-in-a-box for us. It&#8217;s a single executable that contains our source converted to C++, a web server, and the full PHP run-time. Aesop manages these executables for us in a nice list format, allowing us to start, stop and delete them at will. Of course, setting their properties using GUI controls or a more advanced interface to a so-called HDF file is included as well.</p>
<h3>Who is this good for?</h3>
<p>I would imagine the first group will be the tinker-types like me who are curious to see what all the fuss is about. My true hope however, is that after a while we can turn this into a true production-level appliance. This means anyone who needs the highest performance PHP stack possible.</p>
<p>To that end there are a few issues still on the table as of the first release:</p>
<ul>
<li>Even easier installation (e.g. packages for DEB and RPM)</li>
<li>Static linking of the Qt libraries.</li>
<li>More robust process management for the compiler fork.</li>
<li>A full command line interface for running headless.</li>
<li>Support for HPHi</li>
<li>Better support (more GUI options) for ./hphp</li>
<li>Better support for Fedora</li>
</ul>
<p>So yes, I (we) have work to do, but I think even as is the application provides an improved experience over the command line.</p>
<p>So what are the requirements, how do we get it, run it, and provide feedback?</p>
<h3>Requirements</h3>
<p>a) <strong>64-bit Linux.</strong> HPHP requires a 64-bit OS, which in turn means Aesop was built for 64-bit as well. Also, HPHP requires several dependencies which are shown below in <em>apt-get</em> format. Fedora/Red Hat users will need to change the command to match their distro&#8217;s naming conventions and package names (e.g. <em>yum install [...]</em>).</p>
<p>b) <strong>The Qt 4 Libraries</strong>. Instructions for getting them are below.</p>
<h3>Getting It</h3>
<p>Download the files from the FormBoss Open Source page <a title="Open Source Files" href="http://www.formboss.net/open-source" target="_blank">right here</a>.</p>
<p>Please note we have two versions: The small one is just the Aesop source code, the large one is the Aesop source and a pre-compiled version of HPHP. Most users will want the full (large) version, as everything is packaged up and ready to be run. If however, you already have a compiled version of HPHP you may just want the smaller, source only download.</p>
<h3>Running It</h3>
<p>In the main download directory we&#8217;ll find the &#8216;<strong>Docs/Instructions.txt</strong>&#8216; file with full details. The short version, however:</p>
<p>a) For Ubuntu we&#8217;ll run the following command to install HPHP&#8217;s dependencies:</p>
<p><span style="color: #888888;"><code>sudo apt-get install git-core cmake g++ libboost-dev  libmysqlclient-dev libxml2-dev libmcrypt-dev libicu-dev openssl  binutils-dev libcap-dev libgd2-xpm-dev zlib1g-dev libtbb-dev libonig-dev  libpcre3-dev autoconf libtool libcurl4-openssl-dev libboost-system-dev  libboost-program-options-dev libboost-filesystem-dev wget memcached  libreadline-dev libncurses-dev libmemcached-dev libicu-dev libbz2-dev  libc-client2007e-dev php5-mcrypt php5-imagick</code></span></p>
<p>Fedora users need to install the same packages, but will need to use <em>yum install</em> and modify the names of some packages to match their distribution&#8217;s naming conventions. <a title="Fedora 12 and HPHP" href="http://www.ioncannon.net/programming/918/building-hiphop-php-for-fedora-12-on-64-bit-and-32-bit-systems/" target="_blank">This link</a> has some solid tips on doing so.</p>
<p><span style="color: #888888;"><span style="color: #000000;">b) Open the <em>Ubuntu Software Center</em> and search for and install <strong>Qt Creator</strong>. This installs the Qt Libraries we need to run Aesop. Word is the Qt libs will be installed <a title="Qt Libraries" href="http://www.desktoplinux.com/news/NS7179748543.html" target="_blank">by default</a> in Ubuntu 11.04. In any case, this also installs the most excellent Qt Creator and SDK, which means you may be that much more likely to contribute to this project in some way&#8230;</span></span></p>
<p><span style="color: #888888;"><span style="color: #000000;">Fedora users will need to download Qt directly from <a title="Qt Downloads" href="http://qt.nokia.com/products/">Nokia&#8217;s site</a>.<br />
</span></span></p>
<p>c) Extract the program files:</p>
<p>&gt; For the full download (Aesop.tar.gz), simply extract the whole archive to your <strong>/Documents</strong> directory.<br />
&gt; <strong>Extract and place One-level up</strong> from where we&#8217;ve installed HPHP if you&#8217;ve downloaded the Aesop only archive (<em>Aesop+Source.tar.gz</em>).</p>
<p>d) Make sure <strong>Run-Aesop.sh</strong> is executable. Right click and under <em>Permissions</em> &gt; <em>Allow executing of file as program.</em></p>
<p>e) Double-click Run-Aesop.sh. The application should start, provided we&#8217;ve installed the Qt 4 Libraries.</p>
<p>f) We can also run Aesop in root mode by double clicking <strong>Run-Aesop-As-Root.sh</strong> and then selecting &#8216;<em>Run in Terminal</em>&#8216;. This is for sites that require port 80 to run.</p>
<p>g) We can test out a sample straight away under the second tab: <em>Run Compiled Code</em> &gt; <em>Sample</em></p>
<h3>Providing Feedback</h3>
<p>This is what I&#8217;m most interested in : ) Drop a comment on this post and I&#8217;ll be happy to provide any help and feedback I can.</p>
<h3>Code Limitations</h3>
<p>HPHP is quite versatile, though we do need to be aware of a few HPHP limitations:</p>
<ol>
<li>The MySQLi and MSSQL extensions are not supported. We must use PDO or the MySQL extension. That said, compiled code <em>can </em>have calls that refer to those drivers (such as in a database class file); just don&#8217;t call any functions of those extensions and we&#8217;re fine.</li>
<li>FTP and a few other extensions are not supported. See <a title="HPHP Extensions" href="https://github.com/facebook/hiphop-php/wiki/Unimplemented-Functions" target="_blank">here</a> for more.</li>
<li>PHP 5.3 is not currently supported.</li>
<li>Any code with eval() is not supported. This means, for example, that <strong>phpBB 3</strong> is not going to work. It compiles&#8230;it just doesn&#8217;t run.</li>
</ol>
<h3>Finally&#8230;</h3>
<p>I created this app in my spare time, quite literally during unit testing for the FormBoss Build 700 release. Aesop is hearty but by no means bug-free. I would love to see this situation resolved by anyone looking to work their own magic with the source!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/03/aesop-a-hip-hop-php-ui/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Connecting to an MSSQL Database in Qt</title>
		<link>http://www.formboss.net/blog/2011/01/connecting-to-an-mssql-database-in-qt/</link>
		<comments>http://www.formboss.net/blog/2011/01/connecting-to-an-mssql-database-in-qt/#comments</comments>
		<pubDate>Mon, 17 Jan 2011 02:33:48 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[Qt]]></category>
		<category><![CDATA[Qt and Microsoft SQL Server]]></category>
		<category><![CDATA[Qt and MSSQL]]></category>
		<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=923</guid>
		<description><![CDATA[In this post we&#8217;ll review code for connecting a 32-bit Qt application to an SQL Server 2008 R2 instance running on 64-bit Windows 7. In order to create the connection between our SQL Server instance and Qt we&#8217;ll use an ODBC DSN. Create The DSN While creating the DSN is relatively straightforward using: Start [...]]]></description>
			<content:encoded><![CDATA[<p>In this post we&#8217;ll review code for connecting a 32-bit Qt application to an SQL Server 2008 R2 instance running on 64-bit Windows 7.</p>
<p><span id="more-923"></span></p>
<p>In order to create the connection between our SQL Server instance and Qt we&#8217;ll use an ODBC DSN.</p>
<h2>Create The DSN</h2>
<p>While creating the DSN is relatively straightforward using:</p>
<p>Start menu &gt; Administrative Tools &gt; Data Sources (ODBC)</p>
<p>&#8230;we need to be aware that even though we may be on a 64-bit OS, Qt is most likely building 32-bit applications (especially if we just used the default Qt SDK installer). We can easily check this by running our application (or creating a quick test one) and opening the Task Manager:</p>
<p><img class="alignnone size-full wp-image-924" title="32-bit-software" src="http://www.formboss.net/blog/wp-content/uploads/2011/01/32-bit-software.png" alt="" width="564" height="283" /></p>
<p>This fact is important because the Data Sources (ODBC) application we normally run from the Administrative Tools area is the 64-bit version, which in turn creates 64-bit DSN objects. Unfortunately these are not compatible with Qt&#8217;s 32-bit applications.</p>
<p>Thus, the first tip is that if running a 64-bit version of Windows and building 32-bit Qt apps, our DSN needs to be created <a title="MSDN Blog Page" href="http://blogs.msdn.com/b/farukcelik/archive/2008/10/17/why-my-32-bit-applications-cannot-see-the-odbc-dsns-that-i-created-on-my-64-bit-machine.aspx" target="_blank">via a different version</a> of the DSN manager. We can locate this program at:</p>
<pre class="brush: plain; title: ; notranslate">C:\Windows\SysWOW64\odbcad32.exe</pre>
<p>When we launch this, the application looks the same as its 64-bit counterpart, though you will notice any existing 64-bit DSN items will not be shown. In fact, if this is the first time you&#8217;ve launched the app it will not contain <em>any </em>user created DSN objects. This is normal, as remember we are only viewing 32-bit DSN objects, and up until this point we have been creating 64-bit ones.</p>
<p>Go ahead and create your System or User DSN now.</p>
<h2>The Qt Code</h2>
<p>With the DSN created we can hop into our Qt authoring environment and connect to the database. For the purposes of this demonstration we will simply create a default Qt GUI application using the new project wizard in Qt Creator.</p>
<p>When the wizard finishes creating the basic application files, open the mainwindow.cpp class and modify the code to be the following:</p>
<pre class="brush: plain; title: ; notranslate">
#include &quot;mainwindow.h&quot;
#include &quot;ui_mainwindow.h&quot;

#include &lt;QDebug&gt;

#include &quot;datasource.h&quot;

MainWindow::MainWindow(QWidget *parent) :
    QMainWindow(parent),
    ui(new Ui::MainWindow)
{
    ui-&gt;setupUi(this);

    // a stack-allocated datasource avoids leaking a heap object
    datasource ds;
    bool testConnect = ds.connect();

    if(testConnect){
        // attempt to populate data
        ds.executeQuery();
    }

}

MainWindow::~MainWindow()
{
    delete ui;
}
</pre>
<p>Again, this code is from a basic Qt GUI application with the only modification being the call to our datasource class. The header file for this class is defined as:</p>
<pre class="brush: plain; title: ; notranslate">
#ifndef DATASOURCE_H
#define DATASOURCE_H

#include &lt;QSqlDatabase&gt;

class datasource
{
public:
    datasource();

    bool connect();
    void executeQuery();
};

#endif // DATASOURCE_H
</pre>
<p>The implementation file is as follows: </p>
<pre class="brush: plain; title: ; notranslate">
#include &quot;datasource.h&quot;

#include &lt;QDebug&gt;

#include &lt;QSqlError&gt;
#include &lt;QSqlQuery&gt;

#include &lt;QVariant&gt;

#include &lt;QMessageBox&gt;

datasource::datasource()
{
}

bool datasource::connect()
{
    //define the database driver as QODBC...
    QSqlDatabase db = QSqlDatabase::addDatabase(&quot;QODBC&quot;);
    // and then to connect just pass the DSN name
    db.setDatabaseName(&quot;BankForms&quot;);
    if(!db.open()){
        QMessageBox::critical(0, QObject::tr(&quot;Database Error&quot;),
                              db.lastError().text());
        return false;
    } else {
        return true;
    }
}

void datasource::executeQuery()
{
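    // execute a query on the default connection
    QSqlQuery query;
    if(!query.exec(&quot;SELECT * FROM department_names&quot;)){
        // report the driver error instead of silently returning no rows
        qDebug() &lt;&lt; query.lastError().text();
        return;
    }

    while(query.next()){
        // column index 4 is specific to this example table
        QString dept = query.value(4).toString();
        qDebug() &lt;&lt; dept;
    }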
}
</pre>
<p>As you can see the code is deceptively simple. As we&#8217;ve created the proper connection details in the DSN including user name and password, our application only needs to know the DSN name via the call to: setDatabaseName().</p>
<p>With this code the application should, so long as we&#8217;ve created the DSN using the proper architecture, query the table as we define and return the results back in the while() block.</p>
<p>The key here then is that with a properly constructed DSN (which is what the Data Sources app will of course create for us), we needn&#8217;t bother piecing together our own DSN string, nor do we need to have separate calls to QSqlDatabase::setHostName etc. Just pass a DSN name and away we go : )</p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2011/01/connecting-to-an-mssql-database-in-qt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>C++ Understanding Pointers In Assembly</title>
		<link>http://www.formboss.net/blog/2010/12/understanding-pointers-in-assembly/</link>
		<comments>http://www.formboss.net/blog/2010/12/understanding-pointers-in-assembly/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 18:22:27 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[C++ Pointers]]></category>
		<category><![CDATA[Dereference]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Null Pointer]]></category>
		<category><![CDATA[Pointers]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=857</guid>
		<description><![CDATA[One of the joys of working with C++ is the ability to get &#8216;down to the metal&#8217; and talk to the hardware directly. Not only can this direct-access improve performance, as a pedagogical tool it allows us to peek under the hood at the raw assembly that high-level code produces. This can have the effect of demystifying [...]]]></description>
			<content:encoded><![CDATA[<p>One of the joys of working with C++ is the ability to get &#8216;down to the metal&#8217; and talk to the hardware directly. Not only can this direct-access improve performance, as a pedagogical tool it allows us to peek under the hood at the raw assembly that high-level code produces. This can have the effect of demystifying many aspects of programming, as while high-level languages make coding more efficient, they hide much of how things <em>actually</em> work.</p>
<p>One such example of this is with pointers. In C++ we learn that pointers are objects that hold a reference to some other object&#8217;s memory location. We&#8217;re taught to dereference and pass pointers to functions, how to avoid stray and dangling pointers, and if we&#8217;re former Java developers, what all those <strong>Null Pointer</strong> errors actually meant.</p>
<p>Of course despite this deeper understanding even C++ hides what pointers truly are and how they&#8217;re represented and manipulated in hardware.</p>
<p>In this post then I want to quickly look at pointers from the standpoint of assembly language so next time you use one, you&#8217;ll have a better idea of what one actually is at the most basic level. This may even make learning what pointers are easier for newcomers.</p>
<p>I also, of course, want to talk performance.</p>
<p><span id="more-857"></span></p>
<h2>A Simple Example</h2>
<p>Let&#8217;s take a look at a basic stack-based pointer implementation.</p>
<pre class="brush: cpp; title: ; notranslate">

int a = 10;
int b = 20;
int c = 0;

int *ap;
ap = &amp;a;

int *bp;
bp = &amp;b;

c = *ap + *bp;
</pre>
<p>We create three int variables, two for holding the start values for an addition and one to hold the result. We then create two pointers and assign each the memory address of its respective start variable.</p>
<p>Simple enough, and when we dereference each to obtain the value for <strong>c</strong> everything works as it should.</p>
<p>So what&#8217;s really happening behind the scenes?</p>
<h2>Pointers In Assembly</h2>
<p>In this example pointers are actually similar to normal stack variables. The similarity comes from the fact that just like <strong>a</strong> and <strong>b</strong>, <strong>ap</strong> and <strong>bp</strong> are placed on the stack when created. The difference is the values placed on the stack come from <strong>lea</strong> assembly instructions, which for all intents and purposes can be considered an extra step the compiler has to perform.</p>
<p>Further, when we want to <em>act</em> on pointers via a C++ dereference, we may incur a penalty of having to fetch the pointer&#8217;s value via indirection. In other words, instead of grabbing an immediate value, we first need to grab the pointer to find out what it points to, then get that pointed-to value for our operation.</p>
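<p>Before we get to the assembly, the idea can be checked from C++ itself. The sketch below is my own illustration (the function names are hypothetical, not from the example program): it confirms that a pointer&#8217;s stored value is nothing more than its target&#8217;s address, and that a write through the pointer lands in the target.</p>

```cpp
#include <cassert>
#include <cstdint>

// Adds two stack ints through pointers, as in the example above.
int add_via_pointers()
{
    int a = 10;
    int b = 20;

    int *ap = &a;   // ap holds the address of a
    int *bp = &b;   // bp holds the address of b

    return *ap + *bp;   // two dereferences, then the add
}

// Checks that a pointer's stored value is simply its target's address,
// and that a write through the pointer changes the target.
bool pointer_is_just_an_address()
{
    int a = 10;
    int *ap = &a;

    auto stored  = reinterpret_cast<std::uintptr_t>(ap);
    auto address = reinterpret_cast<std::uintptr_t>(&a);

    *ap = 11;   // indirect store: changes a itself

    return stored == address && a == 11;
}

// Hypothetical usage helper for this sketch.
void pointer_demo()
{
    assert(add_via_pointers() == 30);
    assert(pointer_is_just_an_address());
}
```

<p>Nothing here is specific to one compiler; it only relies on standard C++11.</p>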
<h2>The Raw Pointer Assembly</h2>
<p>Let&#8217;s have a look at the assembly (in AT&#038;T syntax) Apple&#8217;s GCC has created for the code above:</p>
<pre class="brush: plain; title: ; notranslate">
0x0000000100000b1b  &lt;+0118&gt;  movl   $0xa,-0x1c(%rbp)
0x0000000100000b22  &lt;+0125&gt;  movl   $0x14,-0x20(%rbp)
0x0000000100000b29  &lt;+0132&gt;  movl   $0x0,-0x24(%rbp)
0x0000000100000b30  &lt;+0139&gt;  lea    -0x1c(%rbp),%rax
0x0000000100000b34  &lt;+0143&gt;  mov    %rax,-0x50(%rbp)
0x0000000100000b38  &lt;+0147&gt;  lea    -0x20(%rbp),%rax
0x0000000100000b3c  &lt;+0151&gt;  mov    %rax,-0x58(%rbp)
0x0000000100000b40  &lt;+0155&gt;  mov    -0x50(%rbp),%rax
0x0000000100000b44  &lt;+0159&gt;  mov    (%rax),%edx
0x0000000100000b46  &lt;+0161&gt;  mov    -0x58(%rbp),%rax
0x0000000100000b4a  &lt;+0165&gt;  mov    (%rax),%eax
0x0000000100000b4c  &lt;+0167&gt;  lea    (%rdx,%rax,1),%eax
0x0000000100000b4f  &lt;+0170&gt;  mov    %eax,-0x24(%rbp)
</pre>
<p>Let&#8217;s remove the addresses and offsets, add comments, and overlay the C++ code:</p>
<pre class="brush: plain; title: ; notranslate">
int a = 10;
movl   $0xa,-0x1c(%rbp)		; store 10 in a's stack slot

int b = 20;
movl   $0x14,-0x20(%rbp)	; store 20 in b's stack slot

int c = 0;
movl   $0x0,-0x24(%rbp)		; store 0 in c's stack slot

int *ap;
ap = &amp;a;
lea    -0x1c(%rbp),%rax		; compute address of a into %rax
mov    %rax,-0x50(%rbp)		; store that address in ap's stack slot

int *bp;
bp = &amp;b;
lea    -0x20(%rbp),%rax		; compute address of b into %rax
mov    %rax,-0x58(%rbp)		; store that address in bp's stack slot

c = *ap + *bp;
mov    -0x50(%rbp),%rax		; load ap (an address) into %rax
mov    (%rax),%edx			; load value at that address into %edx (a dereference)
mov    -0x58(%rbp),%rax		; load bp (an address) into %rax
mov    (%rax),%eax			; load value at that address into %eax (a second dereference)
lea    (%rdx,%rax,1),%eax	; add 'dereferenced' values, place in %eax
mov    %eax,-0x24(%rbp)		; store result in c
</pre>
<p>The interesting bits here are:</p>
<p>a) How we use <strong>lea</strong> to load the addresses of <strong>a</strong> and <strong>b</strong>, among other things. As you can see, the <strong>lea</strong> instruction is the key to how a compiler transforms an address-of expression such as <strong>&amp;a</strong> into working assembly. The good news is <strong>lea</strong> is a very fast and common instruction, and knowing why it&#8217;s used in this case makes reading assembly a whole lot easier.</p>
<p>b) We see how the dereference step (indirection) works. On the pointer creation side we use <strong>lea</strong> to grab the memory location of a stack offset, placing these values into -0x50 and -0x58.</p>
<p>When we want to <em>use</em> (dereference) these values we perform two steps. First we load the memory location into a general purpose register, then use an assembly language construct of:</p>
<p>(register containing memory address), % destination register (or memory location) for &#8216;real&#8217; value</p>
<p>Which says: &#8220;grab the <em>value</em> from that memory address and place it into the specified register&#8221;.</p>
<p>Hence, <em>indirection</em>. We grab a value, but can only do so <em>indirectly</em> as the first instruction only contains the memory address, not the raw value.</p>
<p>c) Finally, note how the addition step is carried out. <strong>lea</strong>, as well as being used to grab memory addresses, can also be used to add two registers together in a single instruction. In this example we do just that, and yes, this is an automatic compiler optimisation even when no optimisations are turned on.</p>
<h2>Performance Considerations</h2>
<p>Pointers do incur a performance penalty in some cases, as this simple rewrite to not use pointers shows (where I&#8217;ve mixed the C++ code with assembly):</p>
<pre class="brush: plain; title: ; notranslate">

int a = 10;
movl   $0xa,-0x1c(%rbp)

int b = 20;
movl   $0x14,-0x20(%rbp)

int c = 0;
movl   $0x0,-0x24(%rbp)

c = a + b;
mov   -0x20(%rbp),%eax
add   -0x1c(%rbp),%eax
mov   %eax,-0x24(%rbp)
</pre>
<p>All of the indirection steps add up, as a general rule doubling each access in terms of instructions. This is obvious just from a simple instruction count, with the addition being performed in 3 instructions in the non-pointer version vs. 6 in the pointer version.</p>
<h2>The Price of Something New</h2>
<p>Using pointers in the manner shown above is questionable, because pointers are more traditionally used to access and manage heap memory. The method above, as we have seen, can just as easily be implemented without pointers, and since pointers carry a penalty of indirection there is little reason to use them in this scenario.</p>
<p>When we add the C++ new operator things get more realistic, and of course interesting. </p>
<p>To see why, consider a more traditional pointer implementation using the C++ new operator:</p>
<pre class="brush: cpp; title: ; notranslate">
int *a = new int;
int *b = new int;
int *c = new int;

*a = 10;
*b = 20;
*c = 0;

*c = *a + *b;
</pre>
<p>In the stack-based example the overhead of indirection was minimal, as the extra instructions are already very fast. Using new changes the game considerably.</p>
<p>Each invocation of new is a call into the C++ runtime&#8217;s allocator (the <strong>_Znwm</strong> symbol in the listing below is the mangled name of <em>operator new</em>), which carries with it all the associated overhead of a standard function call plus the allocator&#8217;s own bookkeeping. When the allocator has to request more memory from the operating system we also pay for a specialized form of overhead called a <em>mode switch</em>. That is, we switch from user mode to kernel mode, as only the operating system can directly allocate memory when running from a non-privileged process.</p>
<p>To put this in perspective, the stack example runs in 16 cycles, the <em>new</em> version in 432. </p>
<p>We can see these allocator calls in the assembly via the callq instructions:</p>
<pre class="brush: plain; title: ; notranslate">
0x0000000100000ae7  &lt;+0118&gt;  mov    $0x4,%edi
0x0000000100000aec  &lt;+0123&gt;  callq  0x100000d3c &lt;dyld_stub__Znwm&gt;
0x0000000100000af1  &lt;+0128&gt;  mov    %rax,-0x40(%rbp)
0x0000000100000af5  &lt;+0132&gt;  mov    $0x4,%edi
0x0000000100000afa  &lt;+0137&gt;  callq  0x100000d3c &lt;dyld_stub__Znwm&gt;
0x0000000100000aff  &lt;+0142&gt;  mov    %rax,-0x48(%rbp)
0x0000000100000b03  &lt;+0146&gt;  mov    $0x4,%edi
0x0000000100000b08  &lt;+0151&gt;  callq  0x100000d3c &lt;dyld_stub__Znwm&gt;
0x0000000100000b0d  &lt;+0156&gt;  mov    %rax,-0x50(%rbp)
0x0000000100000b11  &lt;+0160&gt;  mov    -0x40(%rbp),%rax
0x0000000100000b15  &lt;+0164&gt;  movl   $0xa,(%rax)
0x0000000100000b1b  &lt;+0170&gt;  mov    -0x48(%rbp),%rax
0x0000000100000b1f  &lt;+0174&gt;  movl   $0x14,(%rax)
0x0000000100000b25  &lt;+0180&gt;  mov    -0x50(%rbp),%rax
0x0000000100000b29  &lt;+0184&gt;  movl   $0x0,(%rax)
0x0000000100000b2f  &lt;+0190&gt;  mov    -0x40(%rbp),%rax
0x0000000100000b33  &lt;+0194&gt;  mov    (%rax),%edx
0x0000000100000b35  &lt;+0196&gt;  mov    -0x48(%rbp),%rax
0x0000000100000b39  &lt;+0200&gt;  mov    (%rax),%eax
0x0000000100000b3b  &lt;+0202&gt;  add    %eax,%edx
0x0000000100000b3d  &lt;+0204&gt;  mov    -0x50(%rbp),%rax
0x0000000100000b41  &lt;+0208&gt;  mov    %edx,(%rax)
</pre>
<p>One very important point is to note how we&#8217;re still primarily dealing with the stack. This can change from a performance standpoint when we start dealing with larger heap offsets. </p>
<h2>Locality</h2>
<p>Heap-based pointers can incur a locality cost, specifically, spatial locality. That is to say, when we keep values on the stack the compiler can issue shorter, faster instructions to access those memory locations. With larger offsets of thousands or hundreds of thousands of bytes, longer (and thus slower) instructions have to be used to access our elements. </p>
<p>To see this in action consider the following code:</p>
<pre class="brush: cpp; title: ; notranslate">
int * int_array = new int[1024*1024*1];
int_array[0] = 10;
int_array[1] = 20;

int c = int_array[0] + int_array[1];
</pre>
<p>We allocate heap storage for 1,048,576 ints (4 megabytes with 4-byte ints), which we then access for our addition step using the array subscript operator []. As you may already know, int_array[<em>n</em>] is just shorthand for *(int_array + <em>n</em>): the element that sits <em>n</em> elements past our object&#8217;s base address. Thus, using [<em>n</em>] is simply another form of indirection, one where we issue offsets from a base address to retrieve some real value in memory.</p>
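<p>As a small sketch of that equivalence (the helper names index_matches_pointer_arithmetic and byte_offset are made up for illustration), the following checks that subscripting and explicit pointer arithmetic land on the same address, and shows how an index turns into a byte offset:</p>

```cpp
#include <cstddef>

// Returns true when indexing and explicit pointer arithmetic agree
// for every element of the array.
bool index_matches_pointer_arithmetic(const int *base, std::size_t count)
{
    for (std::size_t n = 0; n < count; ++n) {
        // base[n] is defined as *(base + n); the compiler scales n by
        // sizeof(int) to get the byte offset from the base address.
        if (&base[n] != base + n || base[n] != *(base + n)) {
            return false;
        }
    }
    return true;
}

// Byte distance between element n and the base address.
std::size_t byte_offset(const int *base, std::size_t n)
{
    return static_cast<std::size_t>(
        reinterpret_cast<const char *>(base + n) -
        reinterpret_cast<const char *>(base));
}
```

<p>With 4-byte ints, byte_offset(int_array, 3) is 12, which is the same scaling the compiler performs when it turns an index into an address.</p>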
<p>The cost of this indirection <em>would</em> be similar to our first example, only now that we&#8217;re using the heap the offsets to these objects can be very far away compared to the stack access we had been enjoying. While a cache will mitigate many of these penalties, accessing many non-contiguous memory locations will always be slower than contiguous loads if cache misses occur.</p>
<p>The good news is modern processors, operating systems, and memory architectures are so sophisticated that heap access is still very fast, which means the real penalty generally comes from the invocation of new and its associated <a href="http://en.wikipedia.org/wiki/Context_switch">user mode switch</a>. </p>
<p>To that end, the cost of the new invocation will vary based on the size of the request and the operating system, but the general rule I&#8217;ve formulated is that anything over 512k on OS X 10.6 will carry with it a penalty of over 24,000 cycles for smaller values (under a few megabytes), and more for larger requests (35,000 cycles for 100 megabytes, for example). I would imagine other modern operating systems show similar results.</p>
<p>Thus, a good rule of thumb to follow is use new sparingly, and <em>never</em> do so in a loop if you can avoid it.</p>
<p>If you must allocate memory in a performance-critical loop you&#8217;ll probably want to create your <a href="http://www.scribd.com/doc/3499563/Building-your-own-memory-manager-for-C-C-projects">own memory manager</a>.</p>
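<p>To give a rough idea of what such a memory manager can look like, here is a minimal sketch of a bump-pointer arena. It is not taken from the linked article; the class name Arena and its methods are hypothetical, and it only hands out ints to keep the example short. The point is that the expensive allocation happens once, outside the loop, and each request afterwards is just an offset calculation:</p>

```cpp
#include <cstddef>
#include <new>

// Hypothetical fixed-size arena: one up-front allocation, then each
// request is served by bumping an offset. Individual frees are not supported.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buffer_(static_cast<char *>(::operator new(bytes))),
          capacity_(bytes), used_(0) {}

    ~Arena() { ::operator delete(buffer_); }

    Arena(const Arena &) = delete;
    Arena &operator=(const Arena &) = delete;

    // Returns storage for one int, or nullptr when the arena is full.
    int *allocate_int()
    {
        // Round the offset up to the alignment an int needs.
        const std::size_t align = alignof(int);
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start + sizeof(int) > capacity_) {
            return nullptr;
        }
        used_ = start + sizeof(int);
        return new (buffer_ + start) int(0);  // placement new, no allocator call
    }

    void reset() { used_ = 0; }  // reuse the whole block
    std::size_t used() const { return used_; }

private:
    char *buffer_;
    std::size_t capacity_;
    std::size_t used_;
};
```

<p>Inside a hot loop we would call allocate_int() instead of new, and call reset() once the batch of values is no longer needed.</p>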
<h2>Function Pointers</h2>
<p>The core logic of loading an effective address (the lea instruction) to create our pointer is just as valid for a specialized version of the pointer, the <em>function pointer</em>. </p>
<p>Consider the following code:</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;iostream&gt;

using namespace std;

void under_30(int);
void over_30(int);

int main () {

	void (*fp)(int);

	int t = 10;

	fp = (t &lt; 30) ? under_30 : over_30;

	fp(t);

    return 0;
}

void under_30(int n)
{
	cout &lt;&lt; &quot;Under 30&quot;;
}

void over_30(int n)
{
	cout &lt;&lt; &quot;Over 30&quot;;
}
</pre>
<p>The relevant assembly is as follows:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0014</span>&gt; &nbsp;movl &nbsp; $0xa,-0x4<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span> <span class="co1">; store 10 (t) on the stack</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0021</span>&gt; &nbsp;cmpl &nbsp; $0x1d,-0x4<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span> <span class="co1">; compare t with 29 (0x1d)</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0025</span>&gt; &nbsp;<span class="kw1">jg</span> &nbsp; &nbsp; 0x<span class="re1">100000b</span>6c &lt;main<span class="nu0">+40</span>&gt; <span class="co1">; if greater jump to main+40 </span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; if compare falls through lea under_30 to %rax and push to the stack.</span></div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; (the address of under_30 is found relative to the instruction pointer) </span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0027</span>&gt; &nbsp;<span class="kw1">lea</span> &nbsp; &nbsp;0x16c<span class="br0">&#40;</span>%rip<span class="br0">&#41;</span>,%rax &nbsp; &nbsp; &nbsp; &nbsp;# 0x100000cd2 &lt;_Z8under_30i&gt;</div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0034</span>&gt; &nbsp;<span class="kw1">mov</span> &nbsp; &nbsp;%rax,-0x18<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0038</span>&gt; &nbsp;<span class="kw1">jmp</span> &nbsp; &nbsp;0x<span class="re1">100000b</span><span class="nu0">77</span> &lt;main<span class="nu0">+51</span>&gt; <span class="co1">; jmp to the function call</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; if we jumped here, load the effective address of over_30</span></div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; using an offset from the instruction pointer</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0040</span>&gt; &nbsp;<span class="kw1">lea</span> &nbsp; &nbsp;0x139<span class="br0">&#40;</span>%rip<span class="br0">&#41;</span>,%rax &nbsp; &nbsp; &nbsp; &nbsp;# 0x100000cac &lt;_Z7over_30i&gt;</div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; we then store this address at -0x18(%rbp)</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0047</span>&gt; &nbsp;<span class="kw1">mov</span> &nbsp; &nbsp;%rax,-0x18<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; with one of the two function addresses now in -0x18(%rbp),</span></div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; load the address into %rax for the function call</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0051</span>&gt; &nbsp;<span class="kw1">mov</span> &nbsp; &nbsp;-0x18<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>,%rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; prep the stack</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0055</span>&gt; &nbsp;<span class="kw1">mov</span> &nbsp; &nbsp;%rax,-0x10<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0059</span>&gt; &nbsp;<span class="kw1">mov</span> &nbsp; &nbsp;-0x4<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>,%<span class="kw3">edi</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0062</span>&gt; &nbsp;<span class="kw1">mov</span> &nbsp; &nbsp;-0x10<span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>,%rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co1">; now call the function via its address (another form of indirection)</span></div>
</li>
<li class="li1">
<div class="de1">&lt;<span class="nu0">+0066</span>&gt; &nbsp;callq &nbsp;*%rax</div>
</li>
</ol>
</div>
<p>While somewhat more complicated than the earlier examples, the point is to show that once again we use indirection to work with addresses and the values (addresses) they contain instead of direct values.</p>
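<p>The same idea can be written in a form that is easier to test. This is only a sketch: the handlers here return their label instead of printing it, and the names handler_fn, select_handler and dispatch are invented for the example. The selection step produces nothing more than an address, and the call through it is the indirect call we saw in the assembly:</p>

```cpp
#include <string>

// Hypothetical handlers that return their label instead of printing it.
std::string under_30(int) { return "Under 30"; }
std::string over_30(int) { return "Over 30"; }

// Pointer-to-function type shared by both handlers.
typedef std::string (*handler_fn)(int);

// Pick a handler at run time; the result is just an address.
handler_fn select_handler(int t)
{
    return (t < 30) ? under_30 : over_30;
}

// Call through the pointer: this is the indirect call (callq *%rax).
std::string dispatch(int t)
{
    handler_fn fp = select_handler(t);
    return fp(t);
}
```

<p>Calling dispatch(10) goes through under_30, while dispatch(45) goes through over_30, without the call site knowing which function it will reach.</p>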
<h2>Conclusion</h2>
<p>To sum up then, in hardware pointers are similar to standard variables, only they carry with them the extra cost of indirection and, when they point at heap memory, the cost of allocation, which for large requests can run to several tens of thousands of CPU cycles.</p>
<p>Of course in an object-oriented environment these potential drawbacks are almost always offset by the performance gains of passing the address of an object rather than the entire object. Pointer access is one of the main reasons why the C family of languages is the preferred choice for performance computing.</p>
<p>Thus, while hopefully you have a better understanding of pointers, you&#8217;re also aware that we should not overuse them if possible. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2010/12/understanding-pointers-in-assembly/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SSE Intrinsics Tutorial</title>
		<link>http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/</link>
		<comments>http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[Intrinsics]]></category>
		<category><![CDATA[SIMD Performance]]></category>
		<category><![CDATA[SSE]]></category>
		<category><![CDATA[Vectorizing Code]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=743</guid>
		<description><![CDATA[UPDATE: For those interested, I&#8217;ve created a full-on assembly/SSE version here. SSE SIMD Programming is a fascinating subject, but also one that can be a bit difficult to approach. In this post I&#8217;m going to create a SIMD version on my RGB-&#62;CMYK algorithm, and in the process, show a bunch of handy tricks for working [...]]]></description>
			<content:encoded><![CDATA[<div style="background-color: rgb(255, 255, 221); border: 1px dotted rgb(204, 204, 204); padding: 10px; margin-bottom: 10px;"><strong>UPDATE: For those interested, I&#8217;ve created a full-on assembly/SSE version <a href="http://www.formboss.net/blog/2011/04/sse-and-inline-assembly-example/">here</a>.</strong></div>
<p>SSE SIMD programming is a fascinating subject, but also one that can be a bit difficult to approach. In this post I&#8217;m going to create a SIMD version of my RGB-&gt;CMYK algorithm, and in the process, show a bunch of handy tricks for working with SIMD.</p>
<p>This post deals with some of the problems and challenges we face when implementing SIMD code, paying close attention to intrinsics, basic SIMD code setup, and buffer type conversion.</p>
<p><span id="more-743"></span></p>
<p>The first aspect to consider when creating SIMD code is how you want to handle the actual coding process. We have two main options. The first is to code in raw assembly. This is a solid approach, but means we, as the coder, have to deal with stack variables, stack management, loop code, and several other advanced topics.</p>
<p>As an alternative we can use Intrinsics, which are pre-configured assembly routines that allow us to stay in the native C++ or C coding environment.</p>
<p>For this post we&#8217;ll use Intrinsics.</p>
<p>The good news is using Intrinsics means simply including a header file in your code, as Intrinsics are not a library, but rather a function of the compiler. Which header file we&#8217;ll include depends on the SSE version we need, which in turn, depends on the target <a title="Intel Compiler Options" href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/" target="_blank">architecture/processor</a>. In general, include the proper header as your code needs demand:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">

#ifdef __MMX__
#include &lt;mmintrin.h&gt;
#endif

#ifdef __SSE__
#include &lt;xmmintrin.h&gt;
#endif

#ifdef __SSE2__
#include &lt;emmintrin.h&gt;
#endif

#ifdef __SSE3__
#include &lt;pmmintrin.h&gt;
#endif

#ifdef __SSSE3__
#include &lt;tmmintrin.h&gt;
#endif

#if defined (__SSE4_2__) || defined (__SSE4_1__)
#include &lt;smmintrin.h&gt;
#endif
</pre>
<p>On the most basic level, to use SSE 2 intrinsics all we need to do is include the following line in your code file:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">

#include &lt;emmintrin.h&gt;
</pre>
<p>You can read more about basic Intel Intrinsics <a title="Intel Intrinsics" href="http://software.intel.com/en-us/articles/how-to-vectorize-code-using-intrinsics-on-32-bit-intel-architecture/" target="_blank">here</a>.</p>
<p>One final point for Qt developers: to use these intrinsics we also need to pass the matching compiler flag by adding this line to the .pro file:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
QMAKE_CXXFLAGS += -msse2
</pre>
<h2>The Instructions We&#8217;ll Use</h2>
<p>The basic workings of SIMD computing are actually quite simple to describe. Instead of working on 1 value at a time, we group values together and apply the same calculation over all elements at once. How many elements depends on the variable type, but the values generally range from 2 to 16. This means in theory your code could see a 16x performance improvement if working with the smallest variable size.</p>
<p>In reality real performance gains may be more modest, but still very respectable. In our case we have the potential for a 4x speedup, as we&#8217;re working with CMYK data. This is because as part of our calculation of converting RGB to CMYK we need to convert our input items into single-precision floats, which means we&#8217;re limited to working on 4 values at a time.</p>
<h2>A First Example</h2>
<p>To get started, let&#8217;s take a look at the following C++ code which multiplies two arrays of 8 single-precision float values element by element:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">

// Raw C++ Method 1 - 144/152 clocks
float z1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
 
float z3[8];
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
for (int i = 0; i &lt; 8; i++) {
    z3[i] = z1[i] * z2[i];
}
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
for(int i = 0; i &lt; 8; i++){
    cout &lt;&lt; z3[i] &lt;&lt; endl;
}
</pre>
<p>Now let&#8217;s compare that with an SSE Intrinsics Version:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
// SIMD Method 1 - Using Pointers : 56/64 clocks
float a1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float a2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
 
float a3[8];
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
// The casts below assume a1, a2 and a3 are 16-byte aligned.
__m128 *v_a1 = (__m128*)a1;
__m128 *v_a2 = (__m128*)a2;
__m128 *v_a3 = (__m128*)a3;
 
for (int i = 0; i &lt; 2; i++) {
    *v_a3 = _mm_mul_ps(*v_a1, *v_a2);
    v_a1++;
    v_a2++;
    v_a3++;
}
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
for(int i = 0; i &lt; 8; i++){
    cout &lt;&lt; a3[i] &lt;&lt; endl;
}
</pre>
<p>There are a couple of things of note here. First is how we tie our C variables to the __m128 data items. In short, we do not operate on __m128 values like we do normal C++ variables. Instead we use them as <em>holders</em> of values, which we pass as operands to the various intrinsic functions rather than assigning to and reading from them directly.</p>
<p>This seems limiting until you see how we perform the actual &#8216;tying&#8217; operation:</p>
<pre>__m128 *v_a1 = (__m128*)a1;</pre>
<p>In other words, we create a pointer to __m128, to which we assign the memory address of the &#8216;normal&#8217; a1 float array. The net effect is that both names refer to the same bytes: dereferencing v_a1 reads four consecutive floats of a1 as one vector, and the compiler generates the proper SSE loads and stores behind the scenes.</p>
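<p>One caveat with the pointer method is that reading through an __m128 pointer expects the address to sit on a 16-byte boundary, which a plain float array is not guaranteed to do. The sketch below is one way to make that explicit, assuming a C++11 compiler for alignas; the names AlignedFloats8, is_aligned_16 and multiply8 are hypothetical and not part of the original benchmark:</p>

```cpp
#include <xmmintrin.h>
#include <cstdint>

// One __m128 covers exactly four floats, which is why ++ on an
// __m128 pointer advances the address by 16 bytes.
static_assert(sizeof(__m128) == 4 * sizeof(float), "__m128 holds four floats");

// 16-byte aligned storage, so viewing it as __m128 is safe.
struct AlignedFloats8 {
    alignas(16) float v[8];
};

// True when p sits on a 16-byte boundary.
bool is_aligned_16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Element-wise product of two 8-float arrays, 4 floats per step,
// using the pointer method.
void multiply8(const AlignedFloats8 &a, const AlignedFloats8 &b,
               AlignedFloats8 &out)
{
    const __m128 *va = reinterpret_cast<const __m128 *>(a.v);
    const __m128 *vb = reinterpret_cast<const __m128 *>(b.v);
    __m128 *vo = reinterpret_cast<__m128 *>(out.v);

    for (int i = 0; i < 2; ++i) {
        vo[i] = _mm_mul_ps(va[i], vb[i]);
    }
}
```

<p>If the alignment cannot be guaranteed, the unaligned load and store intrinsics (_mm_loadu_ps and _mm_storeu_ps) are the safer choice.</p>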
<p>We can do this another way though, that is, without using pointers, but rather explicit load instructions:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
// SIMD Method 2 - Using Loads/Stores : 88/96 clocks
float b1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float b2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
 
float b3[8];
 
__m128 v_b1, v_b2, v_b3;
 
int j = 0;
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
for (int i = 0; i &lt; 2; i++) {
    v_b1 = _mm_load_ps(b1 + j);
    v_b2 = _mm_load_ps(b2 + j);
    v_b3 = _mm_mul_ps(v_b1, v_b2);
    _mm_store_ps(b3 + j, v_b3);
    j+=4;
}
 
__asm__ __volatile__(&quot;nop&quot; :::);
</pre>
<p>The big difference is how we load and save our buffers. In the first example we use pointers to alias the arrays as vectors; in the second we use explicit loads and stores by way of the <strong>_mm_load_ps</strong> and <strong>_mm_store_ps</strong> intrinsics.</p>
<p>Before we talk about the performance trade-offs of each method, note that in both examples it&#8217;s key to understand how we increment the vectors to point to the next four elements.</p>
<p>In the pointer example we simply use the ++ operator. In assembly, the following is performed:</p>
<pre>0x0000000100000b12  &lt;+0389&gt;  addq   $0x10,-0x48(%rbp)</pre>
<p>That is, we add 16 bytes to the address held in the pointer so that the next loop iteration grabs the next four elements instead of the previous ones.</p>
<p>The load/store example instead keeps a single offset variable (j), which is advanced by 4 elements with j += 4:</p>
<pre>0x0000000100000b19  &lt;+0500&gt;  addl   $0x4,-0x1c(%rbp)</pre>
<p>We then use j as an offset to the next batch of items from our source array:</p>
<pre>v_b1 = _mm_load_ps(b1 + j);</pre>
<p>Which produces the following (unoptimized) assembly (<em>please note all assembly in this post uses AT&amp;T syntax, where operands are &#8216;reversed&#8217; compared to Intel syntax</em>):</p>
<pre>0x0000000100000a6b  &lt;+0326&gt;  mov    -0x1c(%rbp),%eax ; load j into %eax
0x0000000100000a6e  &lt;+0329&gt;  cltq   ; sign-extend %eax (double word) to %rax (quad word)
0x0000000100000a70  &lt;+0331&gt;  shl    $0x2,%rax ; left shift by 2: j * sizeof(float)
0x0000000100000a74  &lt;+0335&gt;  mov    %rax,%rdx ; %rdx now contains the byte offset for j
0x0000000100000a77  &lt;+0338&gt;  lea    -0xe0(%rbp),%rax ; base address of the array
0x0000000100000a7e  &lt;+0345&gt;  add    %rdx,%rax ; add the offset to the base address
0x0000000100000a81  &lt;+0348&gt;  mov    %rax,-0x50(%rbp) ; spill the address to the stack
0x0000000100000a85  &lt;+0352&gt;  mov    -0x50(%rbp),%rax ; and reload it (redundant without optimization)
0x0000000100000a89  &lt;+0356&gt;  movaps (%rax),%xmm0 ; finally, move (aligned) floats to xmm0</pre>
<p>That&#8217;s quite a few instructions, which leads us to:</p>
<h3>Which method is better?</h3>
<p>Setting aside the Raw C++ method for a moment, as you may see from the comments at top, a simple benchmark shows the pointer version to be faster by a good amount, both in terms of lowest average clock/highest average clock.</p>
<p>The main reason for this is the second method, in assembly, employs several extra conversion, address calculation, and store/reload operations for the &#8216;separate/extra&#8217; j variable. This will be an important constraint moving forward, as our example RGB-&gt;CMYK algorithm in this post will suffer from not being able to directly use the faster pointer method. More on that later though.</p>
<p>At this point we should mention something quite important&#8211;what about unrolling our raw C++ code?</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
// Raw C++ Method 2 - 1 x Unrolled : 80/88 clocks
float z1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
 
float z3[8];
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
for (int i = 0; i &lt; 8; i+=4) {
    z3[i] = z1[i] * z2[i];
    z3[i+1] = z1[i+1] * z2[i+1];
    z3[i+2] = z1[i+2] * z2[i+2];
    z3[i+3] = z1[i+3] * z2[i+3];
}
 
__asm__ __volatile__(&quot;nop&quot; :::);
 
for(int i = 0; i &lt; 8; i++){
    cout &lt;&lt; z3[i] &lt;&lt; endl;
}
</pre>
<p>Note the clock times of 80/88. Our unrolled version is just as fast as the Load/Store SIMD one!</p>
<p>Why is this so? Simple: in unrolled mode, GCC, even with no optimizations turned on (debugging output), is able to create fast code using simple floating point <strong>movss</strong> and <strong>mulss</strong> instructions. This takes us to&#8230;</p>
<h3>SIMD Lesson 1</h3>
<p>SIMD can in fact be <em>detrimental</em> for smaller data sets, or for operations where we&#8217;re bound by the memory sub-system. Yes our pointer version is still a good deal faster, but that&#8217;s in part because it&#8217;s a very simple example.</p>
<p>Why is &#8216;being simple&#8217; pertinent? One of the challenges of SIMD programming is that we should treat each vector as a single unit, not as autonomous agents. As soon as we violate this rule by extracting individual values performance degrades quickly. This means we must be very clever in how we devise our algorithms, something we&#8217;ll see first hand when we get into the RGB-&gt;CMYK conversion later in this post.</p>
<p>Of course if the algorithm is already complex, SIMD may make it more so. The bottom line is that SIMD is only really useful if you have an algorithm that allows several elements to have the same operation applied to the entire vector at once. There are of course many areas where this does apply, and in those cases SIMD is a real firecracker&#8211;but not always.  At very minimum, in order to gain the advantages SIMD offers we need to pack 2 or more separate values into an xmm register, and this packing process is not free.</p>
<p>When you toss in the fact that vectorized code can be difficult to manage and create, it becomes obvious that SIMD-izing your code should not always be the first choice for most applications&#8211;choosing the right algorithm should get that distinction.</p>
<p>As a last point on code size and speed, consider that a fully unrolled version of our C++ code takes 32/40 cycles, or 16/24 if the array definitions are moved outside the benchmarked area. The best performance I could achieve with unrolled SIMD is 48 cycles, or 24/32 if, again, the __m128 and array items are moved outside of the benchmark section.</p>
<h2>Packed Values</h2>
<p>It may be obvious by now, but the basic concept we need to master in SIMD code is the art of packing several scalar values together into a single xmm register and then applying our logic to this &#8216;packed&#8217; bit string.</p>
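<p>As a minimal sketch of what packing looks like in code (the function name square4 is made up for this example), the following takes four separate scalars, packs them into one vector with _mm_setr_ps, and squares all four with a single multiply:</p>

```cpp
#include <xmmintrin.h>

// Packs four separate scalars into one vector, squares all four lanes
// with a single multiply, and writes the lanes back out in order.
void square4(float a, float b, float c, float d, float *out)
{
    // _mm_setr_ps takes its arguments in memory order: a lands in lane 0.
    // (_mm_set_ps takes them in the opposite order.)
    __m128 v = _mm_setr_ps(a, b, c, d);
    _mm_storeu_ps(out, _mm_mul_ps(v, v));
}
```

<p>The unaligned store is used here so that out does not have to be 16-byte aligned; the packing step itself is the part that costs extra instructions when the inputs start out as separate scalars.</p>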
<p>We are by no means limited to <em>only </em>dealing with the entire packed vector; one thing you&#8217;ll notice when working with intrinsics is a multitude of related instructions for dealing with packed and non-packed values.</p>
<p>Generally speaking, we&#8217;ll see intrinsics instructions taking the form of:</p>
<pre class="brush: cpp; title: ; notranslate">

_mm_rsqrt_ps

_mm_rsqrt_ss
</pre>
<p>Which differ only in the last two letters. The difference is the first item operates on <em><span style="text-decoration: underline;">P</span>acked <span style="text-decoration: underline;">S</span>ingle-precision</em> values, the second on a <em><span style="text-decoration: underline;">S</span>calar <span style="text-decoration: underline;">S</span>ingle-precision</em> value.</p>
<p>As the names suggest, the <em>_ps</em> variant of an intrinsic operates on all four values of the xmm register at once, the <em>_ss</em> variant only on the lowest value.</p>
<p>When we combine this knowledge with the shuffle intrinsics such as <a title="Shuffle Intrinsics" href="http://msdn.microsoft.com/en-us/library/5f0858x0.aspx" target="_blank"><strong>_mm_shuffle_ps</strong></a>, we now have a way of ordering and selecting individual values from our vectors.</p>
<p>This is another way of saying this is how we introduce logic into the hammer like approach of SSE processing. Sure we&#8217;re most efficient when clobbering out several operations at once, but shuffles and _ss instructions allow us to use a more fine grained way of dealing with individual values.</p>
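<p>To make the difference concrete, here is a small sketch comparing a packed add, a scalar add, and a shuffle. The wrapper names add_packed, add_scalar and reverse_lanes are hypothetical; the intrinsics themselves come from &lt;xmmintrin.h&gt;:</p>

```cpp
#include <xmmintrin.h>

// Adds all four lanes: out[i] = a[i] + b[i].
void add_packed(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Adds only lane 0; lanes 1..3 are copied from a unchanged.
void add_scalar(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Reverses the lane order with a shuffle.
void reverse_lanes(const float *a, float *out)
{
    __m128 v = _mm_loadu_ps(a);
    // _MM_SHUFFLE(z, y, x, w) picks the source lanes for result
    // lanes 3, 2, 1 and 0 respectively.
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

<p>With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed add gives {11, 22, 33, 44}, the scalar add gives {11, 2, 3, 4}, and reversing a gives {4, 3, 2, 1}.</p>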
<h2>Real World SIMD Applications</h2>
<p>Now that we have a good idea of the very basics of how to link C++ and intrinsic data types together, we can look at a more complex example.</p>
<p>Of course there are plenty of examples of SIMD on the Internet, from simple tutorials by hobbyists to <a title="Intel Tutorial" href="http://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4/" target="_blank">professional write-ups from Intel</a>.  The point of this post is to show you the process I&#8217;ve taken to  implement a SIMD solution for an already high-performance C++ code  block, with an emphasis on discovery and basic concepts.</p>
<p>That said, we must keep in mind that there are important design decisions that must be made, as well as potential pitfalls and trade-offs that could very well make writing SIMD code a poor choice for our optimization targets.</p>
<p>As a perfect case-in-point, consider one of the most common performance issues, the conversion of data from one type to another. To see <em>why</em> this is an issue, let&#8217;s take a look at the algorithm we want to vectorize:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">uchar *bits = imin.bits();
int j = 0; // offset of the current pixel, shared by bits and cmyk_temp
 
qreal c_c, c_m, c_y, c_k;
 
for(int i = 0; i &lt; wt * ht; i++){
 
    c_c = 1.0 - (bits[j+2] / qreal(255));
    c_m = 1.0 - (bits[j+1] / qreal(255));
    c_y = 1.0 - (bits[j] / qreal(255));
 
    c_k = qMin(c_c, qMin(c_m, c_y));
 
    if (!qFuzzyCompare(c_k,1)) {
        c_c = (c_c - c_k) / (1.0 - c_k);
        c_m = (c_m - c_k) / (1.0 - c_k);
        c_y = (c_y - c_k) / (1.0 - c_k);
    }
 
    cmyk_temp[j] = c_c * 255;
    cmyk_temp[j + 1] = c_m * 255;
    cmyk_temp[j + 2] = c_y * 255;
    cmyk_temp[j + 3] = c_k * 255;
 
    j += 4;
 
}
</pre>
<p>The code works as follows:</p>
<p>We read from an input buffer consisting of blocks of four unsigned chars, each of which will always be between 0 and 255. We then take the first 3 values from the buffer block to create the base cyan, magenta, and yellow color channels by dividing each value by 255 and subtracting the result from 1.</p>
<p>From the standpoint of the color values being created, this has the effect of constraining our values to between 0 and 1. For example, for a pixel containing essentially 100% cyan and no other color, the JPEG file may encode this as:</p>
<p>255, 255, 1, 255 (the last 255 is the alpha channel, which is always 255 for our images and is ignored)</p>
<p>This means the lower numbers become quite small, and the higher numbers less than or equal to 1:</p>
<pre>c = 1 / 255 = 0.003921568627451
m = 255 / 255 = 1
y = 255 / 255 = 1</pre>
<p>Now we subtract each value from 1, but it&#8217;s this step which causes a big problem. Notice the format of our potential result:</p>
<pre>1 - 0.003921568627451 = 0.996078431372549</pre>
<p>We start with unsigned chars (small ints, really), but our conversion calculation creates intermediate<em> fractional floating point values</em>. This is the reason we&#8217;re limited to 4 values at a time. In theory we could do 4 pixels (16 values) at a time if everything stayed as 8-bit unsigned ints.</p>
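<p>For reference, the per-pixel arithmetic can be written as a small scalar function. This is a sketch that follows the loop above rather than the exact Qt code: the Cmyk struct and the rgb_to_cmyk name are invented here, and a plain epsilon check stands in for qFuzzyCompare:</p>

```cpp
#include <algorithm>
#include <cmath>

struct Cmyk {
    double c, m, y, k;
};

// Scalar reference for one pixel, following the loop in the post.
// r, g, b are the 8-bit channel values (0..255).
Cmyk rgb_to_cmyk(unsigned char r, unsigned char g, unsigned char b)
{
    Cmyk out;
    out.c = 1.0 - r / 255.0;
    out.m = 1.0 - g / 255.0;
    out.y = 1.0 - b / 255.0;
    out.k = std::min(out.c, std::min(out.m, out.y));

    // Pure black would divide by zero, so leave c, m and y as they are.
    // (The post uses qFuzzyCompare; a small epsilon stands in for it here.)
    if (std::fabs(1.0 - out.k) > 1e-12) {
        out.c = (out.c - out.k) / (1.0 - out.k);
        out.m = (out.m - out.k) / (1.0 - out.k);
        out.y = (out.y - out.k) / (1.0 - out.k);
    }
    return out;
}
```

<p>For the cyan pixel used in the example (red = 1, green = 255, blue = 255) this yields c of roughly 0.996 with m, y and k all 0. Every intermediate value is a fraction between 0 and 1, which is exactly why the integer types cannot be kept.</p>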
<p>This just isn&#8217;t possible though, so not only are we limited to doing 1 pixel at a time, we&#8217;re further penalized by having to convert everything into floats in the first place. This is not our choice, but a fact of life, as our image routines (Qt&#8217;s) only return unsigned chars.</p>
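<p>To give an idea of what that conversion involves, here is one possible way to widen the four bytes of a pixel into four floats using only SSE2 intrinsics. It is a sketch, not the routine used later in this post; the names uchars_to_floats and invert_pixel are made up, and it assumes a little-endian x86 target with 4-byte ints:</p>

```cpp
#include <emmintrin.h>  // SSE2
#include <cstring>

// Loads four consecutive unsigned chars and returns them as four floats.
__m128 uchars_to_floats(const unsigned char *src)
{
    // Copy exactly 4 bytes into the low 32 bits of an integer vector.
    int packed;
    std::memcpy(&packed, src, sizeof(packed));
    const __m128i bytes = _mm_cvtsi32_si128(packed);

    // Widen 8 -> 16 -> 32 bits by interleaving with zero bytes.
    const __m128i zero = _mm_setzero_si128();
    const __m128i words = _mm_unpacklo_epi8(bytes, zero);
    const __m128i dwords = _mm_unpacklo_epi16(words, zero);

    // Convert the four 32-bit integers to single-precision floats.
    return _mm_cvtepi32_ps(dwords);
}

// out[i] = 1 - src[i] / 255 for the four bytes of one pixel.
void invert_pixel(const unsigned char *src, float *out)
{
    const __m128 scaled =
        _mm_div_ps(uchars_to_floats(src), _mm_set1_ps(255.0f));
    _mm_storeu_ps(out, _mm_sub_ps(_mm_set1_ps(1.0f), scaled));
}
```

<p>Three extra instructions per pixel are spent just getting the data into a usable format, before any of the real color math has started.</p>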
<p>As we saw above, using the pointer method to link a source buffer and an __m128 vector produces the fastest code. In essence, we use intrinsics to say, &#8220;<em>this value exists as an array and an __m128, go ahead and use whichever interpretation you need to create our final output buffer values</em>&#8221;.</p>
<p>This is convenient because as high-level coders we do not even need to think about any conversion process; the two views simply share the same bytes. The problem is this only works when our source and intermediate values are of the same type. A float is 4 bytes while an unsigned char is 1 byte, and even an integer that is 4 bytes wide uses a different bit pattern, so we cannot treat unsigned chars as floats and expect to get accurate results back.</p>
<p>This fact gets at a very important aspect of type conversions, which is that unless we know how they work, we can easily produce faulty code.</p>
<p>Consider:</p>
<pre>unsigned int a = 8;
unsigned int b = 9;
unsigned int c = a / b;

cout &lt;&lt; c &lt;&lt; endl;</pre>
<p>The output from this code: 0.</p>
<p>The reason becomes apparent when we see the assembly code:</p>
<pre>0x0000000100000b37  &lt;+0118&gt;  movl   $0x8,-0x1c(%rbp) ; move a
0x0000000100000b3e  &lt;+0125&gt;  movl   $0x9,-0x20(%rbp) ; move b
0x0000000100000b45  &lt;+0132&gt;  mov    -0x1c(%rbp),%eax ; move a to eax
0x0000000100000b48  &lt;+0135&gt;  mov    $0x0,%edx ; zero edx (the high half of the dividend)
0x0000000100000b4d  &lt;+0140&gt;  divl   -0x20(%rbp) ; divide eax with b
0x0000000100000b50  &lt;+0143&gt;  mov    %eax,-0x24(%rbp) ; push result to stack</pre>
<p>8 is moved into %eax, with 9 being referenced as divl&#8217;s operand via the stack at -0x20(%rbp). The quotient is then placed into %eax and stored back onto the stack at -0x24(%rbp).</p>
<p>The divl instruction returns any remainder in the %edx register, but as we can see, only %eax is stored back to memory. The end result then is that if our division result is a whole integer we&#8217;ll get a correct value back. If it isn&#8217;t though, we&#8217;ll get at best a truncated value back. Of course this is a result that&#8217;s totally unacceptable to our algorithm.</p>
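<p>In C++ terms, the two halves of divl&#8217;s result correspond to the / and % operators. The following sketch (DivResult, divide and recombines are illustrative names only) shows both values and how they relate to the original operands:</p>

```cpp
struct DivResult {
    unsigned int quotient;   // what divl leaves in %eax
    unsigned int remainder;  // what divl leaves in %edx
};

DivResult divide(unsigned int a, unsigned int b)
{
    DivResult r;
    r.quotient = a / b;   // any fractional part is discarded
    r.remainder = a % b;  // the part of a that b did not divide
    return r;
}

// The two results always recombine into the original dividend.
bool recombines(unsigned int a, unsigned int b)
{
    const DivResult r = divide(a, b);
    return r.quotient * b + r.remainder == a;
}
```

<p>For 8 / 9 the quotient is 0 and the remainder is 8, so all of the information we care about ends up in the half of the result that our code throws away.</p>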
<p>The fact is floats are represented completely differently from ints by our CPUs, which in turn means if we expect an accurate result back from a calculation that produces a fractional value, we must use floating point values or casts for <em>at least one of the input operands.</em></p>
<p>This means we could not, for example, change just the <em>destination</em> parameter to a float and expect a proper result:</p>
<pre>unsigned int a = 10;
unsigned int b = 7;
float c = a / b;

cout &lt;&lt; c &lt;&lt; endl;</pre>
<p>Despite using a float as the destination, the value we receive is still 1.</p>
<p>The pertinent question of course is: why?</p>
<p>Before we answer that question, just know that this is the reason we cannot just use pointers to our unsigned int buffer and the __m128 containers as preferred. We need to employ the explicit load method to correctly convert our unsigned ints to floats and back. The downside of course is we already know this conversion will produce slower code, sometimes significantly so.</p>
<p>So why does at least one input variable need to be a float? To help answer this, let&#8217;s take a look at the GCC assembly this code produces:</p>
<pre>unsigned int a = 10;
unsigned int b = 3;
float c = a / b;

cout &lt;&lt; c &lt;&lt; endl;</pre>
<pre>0x0000000100000af3  &lt;+0118&gt;  movl   $0xa,-0x1c(%rbp) ; a var
0x0000000100000afa  &lt;+0125&gt;  movl   $0x3,-0x20(%rbp) ; b var
0x0000000100000b01  &lt;+0132&gt;  mov    -0x1c(%rbp),%eax ; move a into eax
0x0000000100000b04  &lt;+0135&gt;  mov    $0x0,%edx        ; place 0 into edx (clears the high half of the dividend)
0x0000000100000b09  &lt;+0140&gt;  divl   -0x20(%rbp)      ; divide eax with b
0x0000000100000b0c  &lt;+0143&gt;  mov    %eax,%eax        ; zero-extend the quotient into rax</pre>
<p>The key is that at no point during our division task are we using floating point registers, which means integer division, with its separate quotient and remainder, is all we have. In short, we will <em>never</em> get a fractional result back, as the two source values going into c are not floating point numbers. This means we use a form of division that doesn&#8217;t produce a proper floating point result.</p>
<p>However, you&#8217;ll notice <em>c</em> is still defined as a floating point variable, which is then used in the <em>cout</em> call.</p>
<p>Here&#8217;s the kicker: because <em>c</em> is a float, the compiler, <em>without being asked,</em> converts the integer quotient to a float with the following assembly code, which is found below our division code:</p>
<pre>0x0000000100000b37  &lt;+0186&gt;  cvtsi2ssq %rax,%xmm0
0x0000000100000b3c  &lt;+0191&gt;  movaps %xmm0,%xmm1
0x0000000100000b3f  &lt;+0194&gt;  addss  %xmm0,%xmm1
0x0000000100000b43  &lt;+0198&gt;  movss  %xmm1,-0x6c(%rbp)
0x0000000100000b48  &lt;+0203&gt;  movss  -0x6c(%rbp),%xmm0
0x0000000100000b4d  &lt;+0208&gt;  movss  %xmm0,-0x24(%rbp)
0x0000000100000b52  &lt;+0213&gt;  movss  -0x24(%rbp),%xmm0
0x0000000100000b57  &lt;+0218&gt;  mov    0x4d2(%rip),%rdi        # 0x100001030
0x0000000100000b5e  &lt;+0225&gt;  callq  0x100000d28 &lt;dyld_stub__ZNSolsEf&gt;</pre>
<p>Just above this code block we switch from using %eax to %rax, which as you may recall, is the full 64 bits of the ax register. The point of this switch is to, as we can see in the first line of assembly, convert (via cvtsi2ss) and push this value into the %xmm0 register, which now means we&#8217;re in floating point mode. Of course, as the original division operation produced an int result, converting to a float now cannot bring the fraction back: the input to this code block is always a whole number.</p>
<p>This conversion step is key, as it takes us to the heart of why casting can be an expensive operation to undertake.</p>
<p>Just to drive this point home from the low-level perspective, let&#8217;s say we update our code to use float casts:</p>
<pre>unsigned int a = 10;
unsigned int b = 3;
float c = (float)a / (float)b;

cout &lt;&lt; c &lt;&lt; endl;</pre>
<p>Here&#8217;s what the commented assembly looks like:</p>
<pre>0x0000000100000abb  &lt;+0118&gt;  movl   $0xa,-0x1c(%rbp) ; a
0x0000000100000ac2  &lt;+0125&gt;  movl   $0x3,-0x20(%rbp) ; b
0x0000000100000ac9  &lt;+0132&gt;  mov    -0x1c(%rbp),%eax
0x0000000100000acc  &lt;+0135&gt;  mov    %rax,-0x70(%rbp)
0x0000000100000ad0  &lt;+0139&gt;  cmpq   $0x0,-0x70(%rbp)
0x0000000100000ad5  &lt;+0144&gt;  js     0x100000ae4 &lt;_Z6testerPxS_+159&gt;
0x0000000100000ad7  &lt;+0146&gt;  cvtsi2ssq -0x70(%rbp),%xmm0 ; convert a to a float
0x0000000100000add  &lt;+0152&gt;  movss  %xmm0,-0x68(%rbp)
0x0000000100000ae2  &lt;+0157&gt;  jmp    0x100000b06 &lt;_Z6testerPxS_+193&gt;
0x0000000100000ae4  &lt;+0159&gt;  mov    -0x70(%rbp),%rax
0x0000000100000ae8  &lt;+0163&gt;  shr    %rax
0x0000000100000aeb  &lt;+0166&gt;  mov    -0x70(%rbp),%rdx
0x0000000100000aef  &lt;+0170&gt;  and    $0x1,%edx
0x0000000100000af2  &lt;+0173&gt;  or     %rdx,%rax
0x0000000100000af5  &lt;+0176&gt;  cvtsi2ssq %rax,%xmm0
0x0000000100000afa  &lt;+0181&gt;  movaps %xmm0,%xmm1
0x0000000100000afd  &lt;+0184&gt;  addss  %xmm0,%xmm1
0x0000000100000b01  &lt;+0188&gt;  movss  %xmm1,-0x68(%rbp)
0x0000000100000b06  &lt;+0193&gt;  mov    -0x20(%rbp),%eax ; move b to eax
0x0000000100000b09  &lt;+0196&gt;  mov    %rax,-0x78(%rbp) ; move 64 bits to stack
0x0000000100000b0d  &lt;+0200&gt;  cmpq   $0x0,-0x78(%rbp) ; compare with 0 to set the sign flag
0x0000000100000b12  &lt;+0205&gt;  js     0x100000b21 &lt;_Z6testerPxS_+220&gt; ; jump if the sign bit is set
0x0000000100000b14  &lt;+0207&gt;  cvtsi2ssq -0x78(%rbp),%xmm0 ; sign bit clear: convert b to float
0x0000000100000b1a  &lt;+0213&gt;  movss  %xmm0,-0x64(%rbp)
0x0000000100000b1f  &lt;+0218&gt;  jmp    0x100000b43 &lt;_Z6testerPxS_+254&gt;
0x0000000100000b21  &lt;+0220&gt;  mov    -0x78(%rbp),%rax
0x0000000100000b25  &lt;+0224&gt;  shr    %rax
0x0000000100000b28  &lt;+0227&gt;  mov    -0x78(%rbp),%rdx
0x0000000100000b2c  &lt;+0231&gt;  and    $0x1,%edx
0x0000000100000b2f  &lt;+0234&gt;  or     %rdx,%rax
0x0000000100000b32  &lt;+0237&gt;  cvtsi2ssq %rax,%xmm0
0x0000000100000b37  &lt;+0242&gt;  movaps %xmm0,%xmm1
0x0000000100000b3a  &lt;+0245&gt;  addss  %xmm0,%xmm1
0x0000000100000b3e  &lt;+0249&gt;  movss  %xmm1,-0x64(%rbp)
0x0000000100000b43  &lt;+0254&gt;  movss  -0x68(%rbp),%xmm0
0x0000000100000b48  &lt;+0259&gt;  divss  -0x64(%rbp),%xmm0 ; divide a and b
0x0000000100000b4d  &lt;+0264&gt;  movss  %xmm0,-0x24(%rbp)
0x0000000100000b52  &lt;+0269&gt;  movss  -0x24(%rbp),%xmm0
0x0000000100000b57  &lt;+0274&gt;  mov    0x4d2(%rip),%rdi</pre>
<p>The end result of having to convert our operands to floats is a clock cycle count of 40 with the conversions, versus 24 without. Clearly conversions have their cost, but in our case, what choice do we have? For our CMYK to RGB conversion process it is an unfortunate necessity. We get an array of uchars but the calculations produce floats. A conversion is simply something we have to deal with.</p>
<p>The answer to our original question of why a single float operand allows for a correct result is simply that <em>GCC cannot divide a float by an int.</em> It must first convert non-floats to floats, as seen below when the following C++ code is fed to GCC:</p>
<pre>float a = 10;
unsigned int b = 3;
float c = a / b;</pre>
<p>Commented assembly code:</p>
<pre>0x0000000100000af7  &lt;+0118&gt;  mov    $0x41200000,%eax ; IEEE floating point representation of 10
0x0000000100000afc  &lt;+0123&gt;  mov    %eax,-0x1c(%rbp)
0x0000000100000aff  &lt;+0126&gt;  movl   $0x3,-0x20(%rbp) ; move b to stack
0x0000000100000b06  &lt;+0133&gt;  mov    -0x20(%rbp),%eax ; move to eax for conversion
0x0000000100000b09  &lt;+0136&gt;  mov    %rax,-0x70(%rbp) ; push full 64 bit value to stack
0x0000000100000b0d  &lt;+0140&gt;  cmpq   $0x0,-0x70(%rbp) ; compare with 0 to set the sign flag
0x0000000100000b12  &lt;+0145&gt;  js     0x100000b21 &lt;_Z6testerPxS_+160&gt; ; jump if the sign bit is set
0x0000000100000b14  &lt;+0147&gt;  cvtsi2ssq -0x70(%rbp),%xmm0 ; convert b to float
0x0000000100000b1a  &lt;+0153&gt;  movss  %xmm0,-0x64(%rbp) ; move new float value to stack
0x0000000100000b1f  &lt;+0158&gt;  jmp    0x100000b43 &lt;_Z6testerPxS_+194&gt;
0x0000000100000b21  &lt;+0160&gt;  mov    -0x70(%rbp),%rax
0x0000000100000b25  &lt;+0164&gt;  shr    %rax
0x0000000100000b28  &lt;+0167&gt;  mov    -0x70(%rbp),%rdx
0x0000000100000b2c  &lt;+0171&gt;  and    $0x1,%edx
0x0000000100000b2f  &lt;+0174&gt;  or     %rdx,%rax
0x0000000100000b32  &lt;+0177&gt;  cvtsi2ssq %rax,%xmm0
0x0000000100000b37  &lt;+0182&gt;  movaps %xmm0,%xmm1
0x0000000100000b3a  &lt;+0185&gt;  addss  %xmm0,%xmm1
0x0000000100000b3e  &lt;+0189&gt;  movss  %xmm1,-0x64(%rbp)
0x0000000100000b43  &lt;+0194&gt;  movss  -0x1c(%rbp),%xmm0 ; move a into xmm0
0x0000000100000b48  &lt;+0199&gt;  divss  -0x64(%rbp),%xmm0 ; divide by b
0x0000000100000b4d  &lt;+0204&gt;  movss  %xmm0,-0x24(%rbp) ; c</pre>
<p><em>As a quick aside, please see <a title="IEEE Floating Point Calculator" href="http://babbage.cs.qc.edu/IEEE-754/Decimal.html" target="_blank">this link</a> for a handy calculator of decimal to IEEE floating point values, which helps to explain the first assembly instruction.</em></p>
<p>As you can see, when one parameter is a float <em>the entire chain of parameters must also be floats, and so GCC will, without being asked, convert the non-float parameters to be so</em>. This conversion process is not free of course, and each conversion instance adds around 16 clock cycles on my machine.</p>
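<p>This implicit conversion is a rule of the language rather than a quirk of one compiler, so it can be checked at compile time. The sketch below is only an illustration; <em>mixed_divide</em> is a hypothetical helper that mirrors the snippet above.</p>

```cpp
#include <type_traits>

// The type of a mixed expression is decided at compile time: when one
// operand is a float, the unsigned int operand is converted to float.
static_assert(std::is_same<decltype(1.0f / 3u), float>::value,
              "float / unsigned int is carried out as float / float");
static_assert(std::is_same<decltype(10u / 3u), unsigned int>::value,
              "unsigned int / unsigned int stays an integer division");

// Hypothetical helper mirroring the snippet above.
float mixed_divide(float a, unsigned int b) {
    return a / b;  // b is converted to float before the division
}
```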
<p>As a final cost-of-conversion example, this C++ code takes two floats, divides them, and stores the result in an unsigned int. Note how the assembly created uses an SSE register to perform the division, but then issues a cvttss2si (Scalar Single-FP to Signed INT32 Conversion, with truncation) instruction to convert the float to an int. Also note how the conversion is implicit&#8211;we do not cast the result, the cast simply happens as it&#8217;s logically required for the program to work. The cost of this conversion is 8 cycles on my machine.</p>
<pre>float a = 10;
float b = 3;
unsigned int c = a / b;</pre>
<pre>0x0000000100000b2f  &lt;+0118&gt;  mov    $0x41200000,%eax ; push a to eax
0x0000000100000b34  &lt;+0123&gt;  mov    %eax,-0x1c(%rbp) ; move a to the stack
0x0000000100000b37  &lt;+0126&gt;  mov    $0x40400000,%eax ; push b to eax
0x0000000100000b3c  &lt;+0131&gt;  mov    %eax,-0x20(%rbp) ; move b to the stack
0x0000000100000b3f  &lt;+0134&gt;  movss  -0x1c(%rbp),%xmm0 ; move a to xmm0
0x0000000100000b44  &lt;+0139&gt;  divss  -0x20(%rbp),%xmm0 ; divide
0x0000000100000b49  &lt;+0144&gt;  cvttss2siq %xmm0,%rax ; convert to int
0x0000000100000b4e  &lt;+0149&gt;  mov    %eax,-0x24(%rbp) ; move lower 32 bits to stack</pre>
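<p>One consequence worth keeping in mind is that this implicit conversion truncates toward zero rather than rounding. A small sketch, assuming non-negative results and using two hypothetical helpers:</p>

```cpp
#include <cmath>

// Implicit float -> unsigned int conversion drops the fraction,
// which matches the truncating behaviour of cvttss2si.
unsigned int divide_truncating(float a, float b) {
    unsigned int c = a / b;
    return c;
}

// If rounding to nearest is wanted instead, it has to be requested.
unsigned int divide_rounding(float a, float b) {
    return static_cast<unsigned int>(std::lround(a / b));
}
```

<p>For 11 / 3 the first helper yields 3 and the second yields 4.</p>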
<p>The final point to make here is that just as we cannot feed non-float operands into a calculation and expect a floating point result, we cannot&#8211;and this is important&#8211;link together <strong>__m128</strong> and <strong>unsigned int</strong> vectors. This means we must first convert each uchar value to a float before it can be used.</p>
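<p>For reference, SSE2 can perform this widening for four bytes at a time inside the registers. The sketch below is an alternative, and not the approach used in this post&#8217;s loops, which convert each value individually. The function name is hypothetical, and whether it is actually faster here has not been measured.</p>

```cpp
#include <emmintrin.h>  // SSE2
#include <cstring>

// Widen four 8-bit channel values to four floats entirely in registers.
void u8x4_to_floats(const unsigned char* src, float* dst) {
    int packed;
    std::memcpy(&packed, src, sizeof packed);   // read 4 bytes without alignment concerns
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);      // the 4 bytes sit in the low 32 bits
    v = _mm_unpacklo_epi8(v, zero);             // 8-bit  -> 16-bit lanes
    v = _mm_unpacklo_epi16(v, zero);            // 16-bit -> 32-bit lanes
    _mm_storeu_ps(dst, _mm_cvtepi32_ps(v));     // int32  -> float, unaligned store
}
```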
<h2>Implementing Our Code &#8211; A Few More Challenges</h2>
<p>This conversion discussion brings us to the crux of the task: no matter how I structure the algorithm, we must use floating point numbers for the core conversion calculation. This means we&#8217;ll be able to do at most one pixel per sub-iteration (though this doesn&#8217;t mean we can&#8217;t unroll the loop), as SSE allows for at most 4 single precision floating point values per instruction. This is fine, but may undermine any potential gains using SSE brings to the table. As our example above shows, we may well be better served by simply unrolling our scalar loop.</p>
<p>Yet another issue is that per the source data, while we only start with 3 data elements we must add the fourth, the k channel, based on the <em>min</em> value of the first calculation block. As stated above, one of the major no-no&#8217;s of SIMD is accessing individual elements in our packed vector. This means that in order to add this new value we&#8217;ll need to use the shuffle instructions, or at some point grab and manipulate individual values out of the vector. Both of these options introduce performance penalties.</p>
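<p>As a sketch of the shuffle option, the minimum of the first three lanes can be computed and broadcast without leaving the SSE registers. <em>broadcast_min3</em> is a hypothetical helper, and the three shuffles it needs are exactly the kind of extra cost described above.</p>

```cpp
#include <xmmintrin.h>  // SSE

// Return a vector whose four lanes all hold min(v[0], v[1], v[2]).
// Lane 3 of the input is ignored.
__m128 broadcast_min3(__m128 v) {
    __m128 c = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  // lane 0 in every lane
    __m128 m = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  // lane 1 in every lane
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));  // lane 2 in every lane
    return _mm_min_ps(c, _mm_min_ps(m, y));
}
```

<p>Because the result is already broadcast to all four lanes, it could be used directly as the operand of a later <em>_mm_sub_ps</em>.</p>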
<p>Finally, we should note that this concept also ties into the idea of <a title="Data Dependencies" href="http://en.wikipedia.org/wiki/Data_dependency" target="_blank">data dependencies</a>, which is the notion that some values defined in a computation may depend on other values set at some earlier stage. In our case the algorithm is filled with such dependencies: The k value depends on the first computation step (a <em>true dependency</em>), which in turn creates a <em>control dependency</em> for the next computational step (the if statement), and so on.</p>
<p>The end result is these data dependencies may limit our ability to reshuffle instructions around to eliminate cache misses, unroll the loop, and so on.</p>
<h3>To sum up then, we have four general challenges:</h3>
<ul>
<li>We cannot directly link our source buffer and the __m128 containers, and will also need to use conversions.</li>
<li>We are limited to 4 values at a time, but can unroll if needed.</li>
<li>We would like to dynamically access and set vector elements without resorting to non-SIMD instructions.</li>
<li>Various data dependencies mean we&#8217;re limited in how we can actually optimize the algorithm.</li>
</ul>
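<p>To make the dependency chain concrete, here is a scalar sketch of the per-pixel steps. The struct and function names are hypothetical, the channel order is simplified, and a plain comparison stands in for Qt&#8217;s <em>qFuzzyCompare</em>.</p>

```cpp
#include <algorithm>

struct Cmyk { float c, m, y, k; };  // hypothetical result type, values in 0..255

Cmyk convert_pixel(unsigned char r, unsigned char g, unsigned char b) {
    // Step 1: scale to 0..1 and invert.
    float c = 1.0f - r / 255.0f;
    float m = 1.0f - g / 255.0f;
    float y = 1.0f - b / 255.0f;

    // Step 2: k depends on all three results above (a true dependency).
    float k = std::min(c, std::min(m, y));

    // Step 3: the branch depends on k (a control dependency).
    // Assumption: k < 1 replaces the fuzzy comparison used in the post.
    if (k < 1.0f) {
        float scale = 1.0f - k;
        c = (c - k) / scale;
        m = (m - k) / scale;
        y = (y - k) / scale;
    }

    // Step 4: scale back to 0..255.
    return {c * 255.0f, m * 255.0f, y * 255.0f, k * 255.0f};
}
```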
<h2>The First Benchmark</h2>
<p>As a first look at an SSE implementation that offers minor performance improvements, consider this loop, which processes 35 CMYK JPEG images of 2,500 x 1,800px in 1.5 seconds. The non-SSE version performs the same task in 2 seconds:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
// 7 seconds (core loop time around 1.5 seconds)
 
uchar *bits = imin.bits(); // Qt grabs all pixel data and places in uchar array
 
__m128 *v1 = (__m128*)bits;
__m128 *v2 = (__m128*)cmyk_temp;
 
__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);
 
for(int i = 0; i &lt; wt * ht; i++){
 
    // holder
    float tmp[4] = {bits[j+2], bits[j+1], bits[j], 0};
 
    // vector containing 4 values
    __m128 *v_tmp = (__m128*)&amp;tmp;
 
    // divide by 255
    *v_tmp = _mm_div_ps(*v_tmp, m_255);
 
    // subtract 1
    *v_tmp = _mm_sub_ps(m_1, *v_tmp);
 
    // find min
    tmp[3] = qMin(tmp[0], qMin(tmp[1], tmp[2]));
 
    float tmp_3 = tmp[3]; // cache original value
 
    if (!qFuzzyCompare(tmp[3],1)) {
 
        // subtract original c_k
        *v_tmp = _mm_sub_ps(*v_tmp, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));
 
        // subtract 1 from c_k
        float bk_min = 1.0 - tmp[3];
 
        // divide: (cmy - c_k) / (1.0 - c_k)
        *v_tmp = _mm_div_ps(*v_tmp, _mm_set_ps(bk_min, bk_min, bk_min, bk_min));
 
        // reset black item
        tmp[3] = tmp_3;
 
    }
 
    // create final values
    *v_tmp = _mm_mul_ps(*v_tmp, m_255);
 
    // store values back into buffer
    cmyk_temp[j] = fabs(tmp[0]); // 0
    cmyk_temp[j+1] = fabs(tmp[1]); // 1
    cmyk_temp[j+2] = fabs(tmp[2]); // 2
    cmyk_temp[j+3] = fabs(tmp[3]); // 3
 
    // increment counter
    j+=4;
}
</pre>
<p>While this code works, the problem is that we rely far too heavily on temporary values and repeated extraction of single float values from the vectors to get the job done.</p>
<p>We also issue our own conversion instructions via <em>fabs()</em> which works, but is a bit heavy handed. It would be better to let the compiler choose its own conversion strategy.</p>
<p>A second attempt addresses some of these issues, shaving off half a second, and looks like:</p>
<pre class="brush: cpp; first-line: 1; pad-line-numbers: false; title: ; notranslate">
// 6.5 seconds (core loop time around 1 second)
 
j = 0;
 
uchar *bits = imin.bits();
 
__m128 v2;
 
__m128 m_255 = _mm_set_ps1(255.0f);
 
__m128 m_1 = _mm_set_ps1(1.0f);
 
float tmp[4];
 
float tmp_3;
 
for(int i = 0; i &lt; wt * ht; i++){
 
    v2 = _mm_set_ps(1, bits[j], bits[j+1], bits[j+2]); // rev order!
 
    // divide by 255
    v2 = _mm_div_ps(v2, m_255);
 
    // subtract 1
    v2 = _mm_sub_ps(m_1, v2);
 
    // store to find min
    _mm_store_ps(tmp, v2);
 
    // find min (_mm_min_ps)
    tmp_3 = qMin(tmp[0], qMin(tmp[1], tmp[2]));
 
    if (!qFuzzyCompare(tmp_3,1)) { // test the min (k), not the unused fourth lane
 
        // subtract original k
        v2 = _mm_sub_ps(v2, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));
 
        // subtract 1 from base k
        float bk_min = 1.0 - tmp_3;
 
        // divide: (cmy - k) / (1.0 - k)
        v2 = _mm_div_ps(v2, _mm_set_ps(bk_min, bk_min, bk_min, bk_min));
 
    }
 
    v2 = _mm_mul_ps(v2, m_255); // multiply all by 255
 
    _mm_store_ps(tmp, v2); // save back to tmp buffer
 
    cmyk_temp[j] = tmp[0]; // c
    cmyk_temp[j+1] = tmp[1]; // m
    cmyk_temp[j+2] = tmp[2]; // y
    cmyk_temp[j+3] = tmp_3 * 255; // k
 
    j+=4;
 
}
</pre>
<p>This code runs slightly faster, with improvements coming from several locations:</p>
<p>First, we&#8217;ve moved unneeded code outside of the loop. This won&#8217;t create huge gains, but it does make the inner loop a bit cleaner.</p>
<p>The next big change is we&#8217;ve eliminated the pointer logic, and instead rely on a temporary array to link the __m128 and final buffer values together. We already know this is slower, but again, we have no choice as we cannot link unsigned int and __m128 items together.</p>
<p>All told, and despite the less-than-perfect code, the loop&#8217;s SSE logic runs quite fast. While the entire process takes around 7 seconds for 35 large images, only 1 second of that total time is spent on our core loop. The majority of time is spent in the other libraries such as Little CMS and libjpeg.</p>
<p>Still though, the biggest hit we&#8217;re taking now is the final part of the loop where we need to write to our output buffer. As we now know, this means we need to convert our floats back to ints, a costly operation. All told, these 4 lines account for half of the loop&#8217;s total execution time.</p>
<pre>cmyk_temp[j] = tmp[0]; // c
cmyk_temp[j+1] = tmp[1]; // m
cmyk_temp[j+2] = tmp[2]; // y
cmyk_temp[j+3] = tmp_3 * 255; // k</pre>
<p>As discussed previously, the conversion process takes 8 cycles per float to int, and all these extra instructions really add up. Combine that with the expensive memory access needed to write to the buffer in the first place and it&#8217;s not hard to see why this is slow.</p>
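<p>One way these four scalar conversions might be folded into packed instructions is sketched below, assuming SSE2 is available. This is not the code that was benchmarked for this post, the function name is hypothetical, and any gain would need to be measured.</p>

```cpp
#include <emmintrin.h>  // SSE2
#include <cstring>

// Convert four floats to four bytes with one truncating conversion and
// two saturating packs, then write them out as a single 32-bit value.
void floats_to_u8x4(__m128 v, unsigned char* dst) {
    __m128i i32 = _mm_cvttps_epi32(v);         // 4 x float -> 4 x int32, truncating
    __m128i i16 = _mm_packs_epi32(i32, i32);   // int32 -> int16, signed saturation
    __m128i u8  = _mm_packus_epi16(i16, i16);  // int16 -> uint8, unsigned saturation
    int packed  = _mm_cvtsi128_si32(u8);       // the low 4 bytes hold our result
    std::memcpy(dst, &packed, sizeof packed);
}
```

<p>The saturating packs also clamp out-of-range values to 0..255, which the scalar assignments do not do.</p>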
<h2>How Fast Will It Go?</h2>
<p>Of course it&#8217;s not all bad news. When we enable -O2 optimizations we&#8217;ll see our execution speed increase for all versions, and of course we can also unroll our loop to provide better utilization of instruction pairing. Finally, SSE provides instructions for bypassing the cache mechanisms of the processor, as well as others for pre-fetching the next set of values if using the cache.</p>
<p>Thus, to see how fast we can make this code we&#8217;ll make use of loop unrolling and pre-fetching in this final example.</p>
<p><em>_mm_prefetch()</em> is issued within the loop with the intention of starting our memory access ahead of time so that when the next line of data is needed it&#8217;s accessed from a cache line rather than main memory. This has a measurable and consistent speedup of around 50 ms.</p>
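<p>In isolation, the prefetch pattern looks roughly like the sketch below. The function is hypothetical, and both the look-ahead distance and the 64-byte line size are assumptions rather than tuned values.</p>

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Sum a byte buffer while hinting the cache about data needed shortly.
unsigned long long sum_with_prefetch(const unsigned char* data, std::size_t n) {
    const std::size_t kAhead = 256;  // hypothetical look-ahead distance in bytes
    unsigned long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // One hint per assumed 64-byte line, and never past the end of the buffer.
        if (i % 64 == 0 && i + kAhead < n) {
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kAhead), _MM_HINT_T0);
        }
        total += data[i];
    }
    return total;
}
```

<p>The prefetch is only a hint, so the function returns the same sum with or without it.</p>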
<p>It is worth noting however, that when this code is profiled for L2 Cache misses we have some definite saturation going on. Optimizing cache access can be a tricky affair, requiring that we manually check for hot-spots in our code to see where we might run out of cache lines.</p>
<p>The general rule is the working set of our loop should fit within the cache. The cache size, in turn, depends on the processor. As our loop is consuming huge amounts of data compared to the cache size it&#8217;s important that we make sure we don&#8217;t unroll too far, as:</p>
<p>a) we will run out of registers to hold intermediate values, causing more cache reads (and potential misses)</p>
<p>b) we can overfill all cache lines with the raw data if we try to make the working set too large.</p>
<p>Good tools for viewing cache performance are <a title="Shark" href="http://developer.apple.com/tools/shark_optimize.html" target="_blank">Shark</a> on OSX, <a title="PAPI" href="http://icl.cs.utk.edu/papi/faq/index.html" target="_blank">PAPI</a> or <a title="Valgrind" href="http://valgrind.org/" target="_blank">Valgrind</a> for Linux, and <a title="Visual Studio" href="http://msdn.microsoft.com/en-us/library/bb385772.aspx" target="_blank">Visual Studio</a> for Windows.</p>
<p>Finally, we also use the <em>_mm_shuffle_ps</em> instruction to insert the k value into each vector. This eliminates some of the overhead of member-wise access, but is still not perfect. Again though, as the source and destination buffers are not floats this is a price we have to pay.</p>
<pre class="brush: cpp; title: ; notranslate">
j = 0;
 
uchar *bits = imin.bits();
 
__m128 v1, v2, v3, v4;
__m128 k1, k2, k3, k4;
 
__m128 m_255 = _mm_set_ps1(255.0f);
 
__m128 m_1 = _mm_set_ps1(1.0f);
 
float tmp1[4], tmp2[4], tmp3[4], tmp4[4];
 
float tmp_1, tmp_2, tmp_3, tmp_4;
 
for(int i = 0; i &lt; wt * ht; i+=4){
 
    v1 = _mm_set_ps(1, bits[j], bits[j+1], bits[j+2]); // rev order!
    v2 = _mm_set_ps(1, bits[j+4], bits[j+5], bits[j+6]); // rev order!
    v3 = _mm_set_ps(1, bits[j+8], bits[j+9], bits[j+10]); // rev order!
    v4 = _mm_set_ps(1, bits[j+12], bits[j+13], bits[j+14]); // rev order!
 
    // fetch next data values
    _mm_prefetch(&amp;bits[j+16], _MM_HINT_T0);
 
    // divide by 255
    v1 = _mm_div_ps(v1, m_255);
    v2 = _mm_div_ps(v2, m_255);
    v3 = _mm_div_ps(v3, m_255);
    v4 = _mm_div_ps(v4, m_255);
 
    // subtract 1
    v1 = _mm_sub_ps(m_1, v1);
    v2 = _mm_sub_ps(m_1, v2);
    v3 = _mm_sub_ps(m_1, v3);
    v4 = _mm_sub_ps(m_1, v4);
 
    // store to find min
    _mm_store_ps(tmp1, v1);
    _mm_store_ps(tmp2, v2);
    _mm_store_ps(tmp3, v3);
    _mm_store_ps(tmp4, v4);
 
    // find min (_mm_min_ps)
    tmp_1 = qMin(tmp1[0], qMin(tmp1[1], tmp1[2]));
    tmp_2 = qMin(tmp2[0], qMin(tmp2[1], tmp2[2]));
    tmp_3 = qMin(tmp3[0], qMin(tmp3[1], tmp3[2]));
    tmp_4 = qMin(tmp4[0], qMin(tmp4[1], tmp4[2]));
 
    if (!qFuzzyCompare(tmp_1,1)) { // test the min (k), not the unused fourth lane
 
        // subtract original k
        v1 = _mm_sub_ps(v1, _mm_set_ps(tmp_1, tmp_1, tmp_1, tmp_1));
 
        // subtract 1 from base k
        float bk_min1 = 1.0 - tmp_1;
 
        // divide: (cmy - k) / (1.0 - k)
        v1 = _mm_div_ps(v1, _mm_set_ps(bk_min1, bk_min1, bk_min1, bk_min1));
 
    }
    if (!qFuzzyCompare(tmp_2,1)) { // test the min (k), not the unused fourth lane
 
        // subtract original k
        v2 = _mm_sub_ps(v2, _mm_set_ps(tmp_2, tmp_2, tmp_2, tmp_2));
 
        // subtract 1 from base k
        float bk_min2 = 1.0 - tmp_2;
 
        // divide: (cmy - k) / (1.0 - k)
        v2 = _mm_div_ps(v2, _mm_set_ps(bk_min2, bk_min2, bk_min2, bk_min2));
 
    }
    if (!qFuzzyCompare(tmp_3,1)) { // test the min (k), not the unused fourth lane
 
        // subtract original k
        v3 = _mm_sub_ps(v3, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));
 
        // subtract 1 from base k
        float bk_min3 = 1.0 - tmp_3;
 
        // divide: (cmy - k) / (1.0 - k)
        v3 = _mm_div_ps(v3, _mm_set_ps(bk_min3, bk_min3, bk_min3, bk_min3));
 
    }
    if (!qFuzzyCompare(tmp_4,1)) { // test the min (k), not the unused fourth lane
 
        // subtract original k
        v4 = _mm_sub_ps(v4, _mm_set_ps(tmp_4, tmp_4, tmp_4, tmp_4));
 
        // subtract 1 from base k
        float bk_min4 = 1.0 - tmp_4;
 
        // divide: (cmy - k) / (1.0 - k)
        v4 = _mm_div_ps(v4, _mm_set_ps(bk_min4, bk_min4, bk_min4, bk_min4));
 
    }
 
    // store to extract (prevents green cast)
    _mm_store_ps(tmp1, v1);
    _mm_store_ps(tmp2, v2);
    _mm_store_ps(tmp3, v3);
    _mm_store_ps(tmp4, v4);
 
    k1 = _mm_set_ps(tmp_1, tmp1[2], 0, 0); // reverse
    k2 = _mm_set_ps(tmp_2, tmp2[2], 0, 0); // reverse
    k3 = _mm_set_ps(tmp_3, tmp3[2], 0, 0); // reverse
    k4 = _mm_set_ps(tmp_4, tmp4[2], 0, 0); // reverse
 
    // add fourth vector back in (k)
    v1 = _mm_shuffle_ps(v1, k1, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v2 = _mm_shuffle_ps(v2, k2, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v3 = _mm_shuffle_ps(v3, k3, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v4 = _mm_shuffle_ps(v4, k4, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
 
    // multiply all by 255
    v1 = _mm_mul_ps(v1, m_255);
    v2 = _mm_mul_ps(v2, m_255);
    v3 = _mm_mul_ps(v3, m_255);
    v4 = _mm_mul_ps(v4, m_255);
 
    // standard store
    _mm_store_ps(tmp1, v1); // save back to tmp buffer
    _mm_store_ps(tmp2, v2); // save back to tmp buffer
    _mm_store_ps(tmp3, v3); // save back to tmp buffer
    _mm_store_ps(tmp4, v4); // save back to tmp buffer
 
    // mem write
    cmyk_temp[j] = tmp1[0]; // c
    cmyk_temp[j+1] = tmp1[1]; // m
    cmyk_temp[j+2] = tmp1[2]; // y
    cmyk_temp[j+3] = tmp1[3]; // k
 
    cmyk_temp[j+4] = tmp2[0]; // c
    cmyk_temp[j+5] = tmp2[1]; // m
    cmyk_temp[j+6] = tmp2[2]; // y
    cmyk_temp[j+7] = tmp2[3]; // k
 
    cmyk_temp[j+8] = tmp3[0]; // c
    cmyk_temp[j+9] = tmp3[1]; // m
    cmyk_temp[j+10] = tmp3[2]; // y
    cmyk_temp[j+11] = tmp3[3]; // k
 
    cmyk_temp[j+12] = tmp4[0]; // c
    cmyk_temp[j+13] = tmp4[1]; // m
    cmyk_temp[j+14] = tmp4[2]; // y
    cmyk_temp[j+15] = tmp4[3]; // k
 
    j+=16;
 
}
</pre>
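<p>The listing above still stores each vector to a temporary array before building the k vector. As a sketch of how that round trip might be avoided, two shuffles can place k in the fourth lane directly. <em>insert_k</em> is a hypothetical helper, not part of the benchmarked code.</p>

```cpp
#include <xmmintrin.h>  // SSE

// Replace lane 3 of v with k while keeping lanes 0..2, using two shuffles
// and no round trip through a float array.
__m128 insert_k(__m128 v, float k) {
    __m128 kk = _mm_set1_ps(k);                                  // [k,  k,  k,  k]
    __m128 yk = _mm_shuffle_ps(v, kk, _MM_SHUFFLE(0, 0, 2, 2));  // [v2, v2, k,  k]
    return _mm_shuffle_ps(v, yk, _MM_SHUFFLE(2, 0, 1, 0));       // [v0, v1, v2, k]
}
```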
<p>As a baseline of performance, our previous SSE version from the last section runs in 2505 milliseconds, or 1384 with -O2 optimizations enabled. This is a noticeable improvement* over the non-SSE version, which runs at 2857 ms unoptimized, and 2670 using -O2.</p>
<p><em>* These time values are now per core as opposed to all 4 cores the last quoted values specified.</em></p>
<p>Our new unrolled and prefetched version performs even better, clocking in at 842 ms. Compared to our original non-SSE version this represents around a 3x speedup. Not bad!</p>
<h2>The Final Frontier</h2>
<p>The last big area of contention now appears to be non-user code in the form of the Little CMS and libjpeg libraries. Specifically, as we can see from the <a title="Apple Instruments" href="http://developer.apple.com/technologies/tools/" target="_blank">Instruments</a> output below, the <em>Eval4Inputs</em> and Qt&#8217;s call to <em>jpeg_idct_islow</em> are now the biggest performance bottlenecks:</p>
<p><img class="alignnone size-full wp-image-822" title="call-stack" src="http://www.formboss.net/blog/wp-content/uploads/2010/10/call-stack.png" alt="" width="729" height="289" /></p>
<p><em>As you can also see, our SSE optimized routine in the form of the yellow highlight is now at 878 ms, faster than </em><em>Eval4Inputs and </em><em>jpeg_idct_islow. This was of course not the situation before the SSE optimizations.<br />
</em></p>
<p>What can we do about these slower library calls? For one, there is a faster, freely available JPEG library called <a title="libjpeg-turbo" href="http://libjpeg-turbo.virtualgl.org/" target="_blank">libjpeg-turbo</a> which uses SSE to significantly speed up the decoding of JPEG files. The best part is the API is the same as the older libjpeg V6 and V8 versions, so using this faster library is literally a matter of changing our include and library paths.</p>
<p>In the case of a Qt project, all we need to do is update our INCLUDEPATH and LIBS directives in our .pro file. For my Mac, which had a convenient installer provided by libjpeg-turbo, the lib and code files were located at:</p>
<pre>INCLUDEPATH += \
    /Users/grdinic/Documents/libs/boost_1_44_0 \
    #/user/local/include/ \
    /opt/libjpeg-turbo/include/

LIBS += -L/usr/local/lib -ltiff \
        -L/usr/local/lib -llcms2 \
        #-L/usr/local/lib -ljpeg \
        -L/opt/libjpeg-turbo/lib64 -lturbojpeg \
</pre>
<p>As you can see, I have commented out the &#8216;older&#8217; version of the library so the new one is used.</p>
<p>Unfortunately, despite our own user-code now using libjpeg-turbo, it appears that Qt is still calling its own version of the jpeg libraries, as the call stack shows when expanded:</p>
<p><img class="alignnone size-full wp-image-823" title="call-stack-expanded" src="http://www.formboss.net/blog/wp-content/uploads/2010/10/call-stack-expanded.png" alt="" width="729" height="289" /></p>
<p>This of course makes it very difficult to realize the speed gains of libjpeg-turbo without having to rewrite the image reading routines of Qt.</p>
<p>The calls to <em>Eval4Inputs</em> are trickier still, as now we&#8217;re getting into rewriting Little CMS. This is not something we&#8217;ll do in this post, though we may visit it at some point in the future. If nothing else that name, Eval<strong>4</strong>Inputs is rather evocative.</p>
<h2>Conclusion</h2>
<p>All told, as I had feared the limitations in SSE coding, namely the conversion steps and cost of accessing single elements of a vector (member-wise access), meant that the first iteration of the non-optimized SSE code was not appreciably faster than the non-SSE version.</p>
<p>However, as soon as we re-wrote the code to make better use of -O2 optimizations, instruction pairing, cache pre-fetching, and loop unrolling, we saw a very solid gain of 3x. Even the non-unrolled version saw an average gain of 1.5-2x.</p>
<p>The bottom line is with -O2 optimizations on it would be hard to write SSE code that&#8217;s slower than non-SSE code, provided we pay close attention to the basic rules of being aware of conversion costs and limiting member-wise access.</p>
<p>When you further consider that just about every desktop CPU has at minimum SSE 2 support, the speed and efficiency gained by SSE coding outweighs just about any potential down-side.</p>
<h2>Useful Links</h2>
<p><a href="http://easycalculation.com/binary-converter.php">http://easycalculation.com/binary-converter.php</a></p>
<p><a href="http://msdn.microsoft.com/en-us/library/4d3eabky.aspx">http://msdn.microsoft.com/en-us/library/4d3eabky.aspx</a></p>
<p><a href="http://en.wikipedia.org/wiki/SSE4">http://en.wikipedia.org/wiki/SSE4</a></p>
<p><a href="http://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4/">http://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4/</a></p>
<p><a href="http://www.intel.com/products/processor/manuals/index.htm">http://www.intel.com/products/processor/manuals/index.htm</a></p>
<p><a href="http://software.intel.com/en-us/articles/avx-emulation-header-file/">http://software.intel.com/en-us/articles/avx-emulation-header-file/</a></p>
<p><a href="http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00145.html">http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00145.html</a></p>
<p><a href="http://www.x86-64.org/documentation/assembly.html">http://www.x86-64.org/documentation/assembly.html</a></p>
<p><a href="http://docs.sun.com/app/docs/doc/817-5477/enmzx?a=view">http://docs.sun.com/app/docs/doc/817-5477/enmzx?a=view</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2010/10/sse-intrinsics-tutorial/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GCC Inline Assembly Loop Structures</title>
		<link>http://www.formboss.net/blog/2010/10/gcc-inline-assembly-loop-structures/</link>
		<comments>http://www.formboss.net/blog/2010/10/gcc-inline-assembly-loop-structures/#comments</comments>
		<pubDate>Tue, 19 Oct 2010 20:42:06 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[At&T Assembly Syntax]]></category>
		<category><![CDATA[GC Assembly]]></category>
		<category><![CDATA[Inline Assembly]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=718</guid>
		<description><![CDATA[While I&#8217;ve covered basic assembly before, I wanted to quick touch on the idea of implementing simple control flow techniques. GCC of course allows for this in the extended syntax mode, but finding good documentation with examples is hard. The point of this post then is to show a few simple examples of writing control [...]]]></description>
			<content:encoded><![CDATA[<p>While I&#8217;ve covered basic assembly before, I wanted to quickly touch on the idea of implementing simple control flow techniques. GCC of course allows for this in the <em>extended syntax</em> mode, but finding good documentation with examples is hard. The point of this post then is to show a few simple examples of writing control flow code, as well as point out a few of the more subtle points.</p>
<p><span id="more-718"></span></p>
<p>Assuming a C++ file with the following declarations:</p>
<pre class="brush: cpp; title: ; notranslate">
int a = 1;
int b = 2;
int c = 0;
int ct = 2;
</pre>
<h2>Basic Extended Syntax Code</h2>
<p>As a refresher for our <a href="http://www.formboss.net/blog/2010/06/c-64bit-inline-assembly-primer-part-1/">earlier post on the subject</a>, in this first example we&#8217;ll add a to b and place the result in c. The key is that inline assembly <em>glues</em> our C or C++ code to raw assembly by way of <em>constraints</em>, which are mappings of memory, registers, and other locations to the variables being used.</p>
<p>In this first example we define which registers to use for the add step, but leave it up to the compiler to handle the details of our parameters&#8217; offsets and stack management.</p>
<pre class="brush: cpp; title: ; notranslate">
	// add a+b=c
	__asm__ __volatile__ (
		&quot;movl %1, %%eax \n\t&quot;
		&quot;movl %2, %%ebx \n\t&quot;
		&quot;addl %%eax, %%ebx \n\t&quot;
		&quot;movl %%ebx, %0 \n\t&quot;
	  : &quot;=m&quot; (c)
	  :	&quot;m&quot; (a), &quot;m&quot; (b)
	  : &quot;eax&quot;, &quot;ebx&quot;, &quot;cc&quot; /* clobbers: the registers and flags we modify */

	);</pre>
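<p>Wrapped in a function, the same idea might look like the sketch below. It assumes an x86 or x86-64 target and a compiler that accepts GCC&#8217;s extended syntax. The function name is hypothetical, and the sketch uses a single scratch register rather than two.</p>

```cpp
// Add two ints with GCC extended inline assembly (AT&T syntax).
int add_asm(int a, int b) {
    int c = 0;
    __asm__ __volatile__ (
        "movl %1, %%eax \n\t"   // eax = a
        "addl %2, %%eax \n\t"   // eax += b
        "movl %%eax, %0 \n\t"   // c = eax
      : "=m" (c)                // %0: output operand, in memory
      : "m" (a), "m" (b)        // %1 and %2: input operands, in memory
      : "eax", "cc"             // clobbers: the register and flags we change
    );
    return c;
}
```

<p>Listing the scratch register and the flags as clobbers tells the compiler not to keep live values in them across the block.</p>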
<h2>Adding a Simple Loop</h2>
<p>In this block we employ a label, along with a counting register (ecx), to loop through the code as many times as defined by ct.</p>
<pre class="brush: cpp; title: ; notranslate">
	// add a+b=c
	__asm__ __volatile__ (
		&quot;movl %3, %%ecx \n\t&quot; // loop counter
		&quot;.LOOP1_START: \n\t&quot;
		&quot;movl %1, %%eax \n\t&quot;
		&quot;movl %2, %%ebx \n\t&quot;
		&quot;addl %%eax, %%ebx \n\t&quot;
		&quot;movl %%ebx, %0 \n\t&quot;
		&quot;dec %%ecx \n\t&quot;
		&quot;jnz .LOOP1_START&quot;
	  : &quot;=m&quot; (c)
	  :	&quot;m&quot; (a), &quot;m&quot; (b), &quot;m&quot; (ct)
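	  : &quot;eax&quot;, &quot;ebx&quot;, &quot;ecx&quot;, &quot;cc&quot; // clobbers: registers and flags the asm overwrites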
	);</pre>
<p>To effectively create loops and other flow control statements in assembly we must learn the <strong>Label</strong> and <strong>Symbol</strong> syntax.</p>
<p>At the top of the hierarchy is the general idea of a Symbol, with <strong>Labels</strong> and <strong>Local Symbols</strong> being specific <em>types</em> of symbols. This distinction is important because when writing assembly code we will use both <strong>Labels</strong> and <strong>Local Symbols</strong> to create flow control.</p>
<p>Labels can be thought of as a way to create blocks of code that perform a related function. Within these blocks we can further use <strong>Local Symbols</strong> to create jump points in our code.</p>
<p>Naming conventions are important, and are defined as follows:</p>
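<p>A Label definition ends with a colon. By convention its name starts with .L, which on ELF targets tells the GNU assembler to treat the symbol as local and keep it out of the object file&#8217;s symbol table; the prefix is optional. For example:</p>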
<pre>
.L_ADD_LOOP:</pre>
<p>To refer to a label, for example, using the <em>jmp</em> op-code, we would use:</p>
<pre>
jmp .L_ADD_LOOP
</pre>
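<p>One caveat with named labels in inline assembly: if the compiler inlines or duplicates the function, the same label is emitted twice and the assembler complains that the symbol is already defined. Here is a sketch of one way around this, assuming GCC&#8217;s <em>%=</em> template token, which expands to a number unique to each instance of the asm statement. The name multiply_by_adding is hypothetical, and ct is assumed to be at least 1:</p>

```cpp
// Sketch: a * ct by repeated addition, with a label made unique via %=.
// multiply_by_adding is a hypothetical name; assumes ct >= 1.
static inline int multiply_by_adding(int a, int ct) {
    int result = 0;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ (
        ".L_mul_loop_%=: \n\t"        // %= becomes a per-instance number
        "addl %2, %0 \n\t"            // result += a
        "decl %1 \n\t"                // --ct, sets the zero flag at 0
        "jnz .L_mul_loop_%= \n\t"     // loop again while ct != 0
      : "+&r" (result), "+&r" (ct)    // & keeps these out of a's register
      : "r" (a)
      : "cc"
    );
#else
    for (int i = ct; i > 0; --i) result += a;   // non-x86 fallback
#endif
    return result;
}
```

<p>With this in place the function can be called several times from the same caller, and each inlined copy of the asm gets its own label.</p>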
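<p>Local Symbols are used to create jump points within a Label block, and follow the form:</p>
<p><em>N:</em></p>
<p>Here <em>N</em> is any non-negative integer. The number does not have to be unique: the same <em>N</em> can be defined several times, and each reference picks the nearest definition in the direction it names. (The GNU assembler manual calls this numeric form a <em>local label</em>.)</p>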
<p>For example, we could have:</p>
<pre>
LReverseShort:
	movl	%ecx,%edx		// copy length
	shrl	$2,%ecx			// #words
	jz	3f
1:
	subl	$4,%esi
	movl	(%esi),%eax
	subl	$4,%edi
	movl	%eax,(%edi)
	dec	%ecx
	jnz	1b
3:
	andl	$3,%edx			// bytes?
	jz	5f
4:
	dec	%esi
	movb	(%esi),%al
	dec	%edi
	movb	%al,(%edi)
	dec	%edx
	jnz	4b
5:
	movl	8(%ebp),%eax		// get return value (dst ptr) for memcpy/memmove
	popl	%edi
	popl	%esi
	popl	%ebp
	ret</pre>
<p>As you can see, this lets the developer use various jump instructions to navigate the code block.</p>
<p>The important part here is to note how each reference to a Local Symbol ends in an <em>f</em> or a <em>b</em>. For example, we have:</p>
<p><strong>jnz 4b </strong>and <strong>jz 5f</strong>.</p>
<p>Why? We follow the convention of writing the number of the local symbol, followed by either an <em>f</em> or a <em>b</em>, for <em>forward</em> or <em>backward</em>: <em>4b</em> means the nearest <em>4:</em> defined before the jump instruction, and <em>5f</em> the nearest <em>5:</em> defined after it.</p>
<p>Thus, we have two main ways to refer to our <strong>Labels</strong> and <strong>Local Symbols</strong>:</p>
<p>We refer to a Label by its full name, and to a Local Symbol by its number followed by the direction (<em>b</em> or <em>f</em>) in which its definition lies relative to the jump.</p>
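<p>To see both directions in one place, here is a small sketch that skips the loop with a forward reference and repeats it with a backward one. The function sum_to is our own invention for illustration, and it assumes n is not negative:</p>

```cpp
// Sketch: n + (n-1) + ... + 1 using numeric local symbols.
// sum_to is a hypothetical name; assumes n >= 0.
int sum_to(int n) {
    int total = 0;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ (
        "testl %1, %1 \n\t"     // is n zero?
        "jz 2f \n\t"            // forward: the next 2: after this jump
        "1: \n\t"
        "addl %1, %0 \n\t"      // total += n
        "decl %1 \n\t"          // --n
        "jnz 1b \n\t"           // backward: the closest 1: before this jump
        "2: \n\t"
      : "+r" (total), "+r" (n)  // both are read and written by the asm
      :
      : "cc"
    );
#else
    for (int i = n; i > 0; --i) total += i;   // non-x86 fallback
#endif
    return total;
}
```

<p>Because numeric symbols may be redefined freely, this block can be inlined or pasted several times without any name clashes.</p>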
<p>To help drive this point home, see if you can follow the flow of this code segment:</p>
<pre class="brush: cpp; title: ; notranslate">
	int a = 1;
	int b = 2;
	int c = 0;
	int ct = 1;

	__asm__ __volatile__ (
	  &quot;movl %3, %%ecx \n\t&quot; // loop counter

	  &quot;movl %1, %%eax \n\t&quot; // holds 1
	  &quot;movl %2, %%ebx \n\t&quot; // holds 2

	  &quot;jmp .Multiply \n\t&quot;

	  &quot;.L_ADD_LOOP:&quot;
	  &quot;addl %%eax, %%ebx \n\t&quot;

	  &quot;cmp $0x1, %%ecx \n\t&quot;
	  &quot;je 1f \n\t&quot;

	  &quot;dec %%ecx \n\t&quot;
	  &quot;jnz .L_ADD_LOOP \n\t&quot;

	  &quot;jmp .EXIT \n\t&quot;	

	  &quot;1:&quot;
	  &quot;addl $0x10, %%ebx \n\t&quot;
	  &quot;dec %%ecx \n\t&quot;
	  &quot;cmp $0x0, %%ecx \n\t&quot;
	  &quot;je .EXIT \n\t&quot;
	  &quot;jmp .L_ADD_LOOP \n\t&quot;

	  &quot;.Multiply:&quot;
	  &quot;movl $0x2, %%eax \n\t&quot;
	  &quot;mull %%ebx \n\t&quot;
	  &quot;cmp $0x0, %%ecx \n\t&quot;
	  &quot;jne .L_ADD_LOOP \n\t&quot;

	  &quot;.EXIT:&quot;
	  &quot;movl %%ebx, %0 \n\t&quot;

	  : &quot;=m&quot; (c)
	  :	&quot;m&quot; (a), &quot;m&quot; (b), &quot;m&quot; (ct)
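	  : &quot;eax&quot;, &quot;ebx&quot;, &quot;ecx&quot;, &quot;edx&quot;, &quot;cc&quot; // clobbers: mull also overwrites edx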

	  );

	cout &lt;&lt; c &lt;&lt; endl;
</pre>
<p>An interesting item of note in this example is the eax register. The <strong>mull</strong> instruction takes one of its factors implicitly from eax and writes the 64-bit product to edx:eax, so after .Multiply eax no longer holds <em>a</em>. The addl in .L_ADD_LOOP then adds the product to ebx instead of the original value, which corrupts our final calculation.</p>
<p>We could avoid this by keeping the addend in a register that mull does not touch, say esi (edx would not do, since mull overwrites it with the high half of the product), or more appropriately, by using a different control flow. No matter, the important point is that if we&#8217;re not careful such subtle bugs can creep in. Such is the nature of assembly coding.</p>
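<p>Another option, sketched here under the assumption that we only need the multiply itself, is to tell the compiler about mull&#8217;s implicit registers through constraints rather than hard-coding them. The <em>a</em> and <em>d</em> constraints pin operands to eax and edx; wide_multiply is a hypothetical helper name:</p>

```cpp
// Sketch: 32x32 -> 64-bit multiply with mull's implicit registers declared
// as operands. wide_multiply is a hypothetical helper name.
unsigned long long wide_multiply(unsigned int x, unsigned int y) {
#if defined(__i386__) || defined(__x86_64__)
    unsigned int lo, hi;
    __asm__ (
        "mull %3"           // edx:eax = eax * operand 3
      : "=a" (lo),          // low half of the product comes back in eax
        "=d" (hi)           // high half comes back in edx
      : "a" (x),            // mull always reads one factor from eax
        "r" (y)             // the other factor, in any general register
      : "cc"                // mull modifies the flags
    );
    return (static_cast<unsigned long long>(hi) << 32) | lo;
#else
    return static_cast<unsigned long long>(x) * y;   // non-x86 fallback
#endif
}
```

<p>With the registers spelled out as operands, the compiler knows eax and edx are overwritten and will not keep any other live value in them across the asm.</p>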
<p>For good measure, here is a C++ version of the same control flow:</p>
<pre class="brush: cpp; title: ; notranslate">
	// c++ version
	int a1 = 1;
	int b1 = 2;
	int c1 = 0;
	int ct1 = 1;

	c1 = 2 * b1;
	for(int i = ct1; i != 0; --i){
		c1 = c1 + b1;
		if(i == 1){
			c1 = 16 + c1;
		}
	}
</pre>
<p>For a good look at these items in real-world code, see <a href="http://www.opensource.apple.com/source/xnu/xnu-1504.7.4/osfmk/i386/commpage/bcopy_sse42.s" target="_blank">this bcopy routine in Apple&#8217;s XNU kernel source</a>.</p>
<h2>Useful Links</h2>
<p><a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html">http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html</a></p>
<h4>Local Labels and Symbols</h4>
<p><a href="http://tigcc.ticalc.org/doc/gnuasm.html#SEC18">http://tigcc.ticalc.org/doc/gnuasm.html#SEC18</a></p>
<p><a href="http://tigcc.ticalc.org/doc/gnuasm.html#SEC48">http://tigcc.ticalc.org/doc/gnuasm.html#SEC48</a></p>
<h4>Basic Math</h4>
<p><a href="http://en.wikibooks.org/wiki/X86_Assembly/Arithmetic">http://en.wikibooks.org/wiki/X86_Assembly/Arithmetic</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2010/10/gcc-inline-assembly-loop-structures/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

