<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>NIX/WIN/WEB &#187; optimizing code</title>
	<atom:link href="http://www.formboss.net/blog/tag/optimizing-code/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.formboss.net/blog</link>
	<description>Modern Web Application Development</description>
	<lastBuildDate>Thu, 02 Feb 2012 18:43:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>C++ Performance &#8211; Counting Clocks</title>
		<link>http://www.formboss.net/blog/2010/07/c-performance-counting-clocks/</link>
		<comments>http://www.formboss.net/blog/2010/07/c-performance-counting-clocks/#comments</comments>
		<pubDate>Mon, 19 Jul 2010 01:03:04 +0000</pubDate>
		<dc:creator>grdinic</dc:creator>
				<category><![CDATA[C++]]></category>
		<category><![CDATA[assembly code]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[optimizing code]]></category>

		<guid isPermaLink="false">http://www.formboss.net/blog/?p=503</guid>
		<description><![CDATA[Performance computing can be an interesting task to undertake, as it lets us get under the hood of a complex abstraction (C++) and talk to the machine directly. In the next two posts we&#8217;ll take a look at the Fibonacci algorithm used in a few other posts and see just how fast we can make [...]]]></description>
			<content:encoded><![CDATA[<p>Performance computing can be an interesting task to undertake, as it lets us get under the hood of a complex  abstraction (C++) and talk to the machine directly.</p>
<p>In the next two posts we&#8217;ll take a look at the Fibonacci algorithm used in a few other posts and see just how fast we can make it by looking at loop implementation, data type usage, and assembly code optimizations. In this first post we&#8217;ll look at some basic timing code and at what gcc will do for us using the -O2 flag; in the second post we&#8217;ll see what we can do (if anything) to improve upon the generated code.</p>
<p><span id="more-503"></span></p>
<p>The Algorithm<br />
The traditional use of the Fibonacci sequence is to show off recursive function calls. The problem is that every recursive call pushes another frame onto the function&#8217;s call stack, and the naive recursion recomputes the same values over and over. For a few recursions this isn&#8217;t that bad, but for a Fibonacci sequence it can lead to millions of calls, and the most common end result is that the application simply stalls or crashes.</p>
<p>The solution then is to use an iterative approach to solving the sequence.</p>
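<p>To put a number on the difference, here is a minimal sketch that counts the calls made by the naive recursive version next to an iterative one. The names fib_recursive and fib_iterative and the calls counter are hypothetical and exist only for this illustration; it also uses the fib(0) = 0, fib(1) = 1 convention, which differs slightly from the 1, 1 starting values used in the rest of the post.</p>

```cpp
#include <cassert>
#include <cstdint>

// Naive recursive Fibonacci (fib(0) = 0, fib(1) = 1). `calls` is a
// hypothetical counter added only to show how many calls are made.
std::uint64_t fib_recursive(int n, std::uint64_t& calls) {
    assert(n >= 0);
    ++calls;
    if (n < 2) {
        return static_cast<std::uint64_t>(n);
    }
    return fib_recursive(n - 1, calls) + fib_recursive(n - 2, calls);
}

// Iterative version: a single pass over the sequence, no recursion.
std::uint64_t fib_iterative(int n) {
    assert(n >= 0);
    std::uint64_t x = 0, y = 1;  // fib(0), fib(1)
    for (int i = 0; i < n; ++i) {
        const std::uint64_t z = x + y;
        x = y;  // shift the pair down by one
        y = z;
    }
    return x;
}
```

<p>With this sketch, fib_recursive(10, calls) makes 177 calls to produce 55, while the iterative version gets there in 10 passes of its loop. At n = 30 the recursive version is already at roughly 2.7 million calls.</p>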
<p>Recall that the fib sequence is simply the sum of the previous two numbers in the sequence:</p>
<p>1+1=2, 1+2=3, 2+3=5, 3+5=8, 5+8=13, 8+13=21 [...]</p>
<p>For the sake of my implementation, we need to tell the loop how many times to run, and to shift the variables&#8217; values down by 1 on each pass.</p>
<p>So on the first turn we set x and y to be 1 and 1 respectively. We then add them together, placing the value into a third variable.</p>
<p>We then shift the values of x and y down &#8220;1&#8221;, with x getting y&#8217;s old value and y getting z.</p>
<div class="geshi no cpp">
<ol>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">int</span> iterations <span class="sy1">=</span> <span class="nu0">10</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">int</span> i;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">long</span> values<span class="br0">&#91;</span>iterations<span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">long</span> x,y,z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; x <span class="sy1">=</span> <span class="nu0">1</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; y <span class="sy1">=</span> <span class="nu0">1</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="co1">// standard loop</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">for</span><span class="br0">&#40;</span>i <span class="sy1">=</span> <span class="nu0">0</span>; i <span class="sy1">&lt;</span> iterations; i<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; z <span class="sy1">=</span> x <span class="sy2">+</span> y;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; x <span class="sy1">=</span> y;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; y <span class="sy1">=</span> z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; values<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy1">=</span> z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
</ol>
</div>
<p>So how can we make this run as fast as possible? The first step is to figure out how fast it currently is. With modern CPUs being as fast as they are, seconds and milliseconds simply won&#8217;t do. Thus, we need to look at the finest-grained counter an x86 CPU offers: CPU clocks.</p>
<p>To count clocks we&#8217;ll implement a CPU clock counting benchmark routine valid on gcc and x86-64 hardware:</p>
<div class="geshi no cpp">
<ol>
<li class="li1">
<div class="de1"><span class="coMULTI">/* </span></div>
</li>
<li class="li1">
<div class="de1"><span class="coMULTI">&nbsp;* File: &nbsp; main.cpp</span></div>
</li>
<li class="li1">
<div class="de1"><span class="coMULTI">&nbsp;* Author: M. Grdinic</span></div>
</li>
<li class="li1">
<div class="de1"><span class="coMULTI">&nbsp;*</span></div>
</li>
<li class="li1">
<div class="de1"><span class="coMULTI">&nbsp;* Created on July 19, 2010, 3:30 PM</span></div>
</li>
<li class="li1">
<div class="de1"><span class="coMULTI">&nbsp;*/</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co2">#include &lt;stdlib.h&gt;</span></div>
</li>
<li class="li1">
<div class="de1"><span class="co2">#include &lt;iostream&gt;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw2">using</span> <span class="kw2">namespace</span> std;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw4">typedef</span> <span class="kw4">long</span> <span class="kw4">long</span> __int64;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw2">inline</span> __int64 rdtsc<span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __int64 hi, lo;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __asm__ __volatile__<span class="br0">&#40;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;xorl %%eax, %%eax;<span class="es0">\n</span><span class="es0">\t</span>&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;cpuid<span class="es0">\n</span><span class="es0">\t</span>&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="sy4">::</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="sy4">:</span><span class="st0">&quot;%rax&quot;</span>, <span class="st0">&quot;%rbx&quot;</span>, <span class="st0">&quot;%rcx&quot;</span>, <span class="st0">&quot;%rdx&quot;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __asm__ __volatile__<span class="br0">&#40;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;rdtsc;&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="sy4">:</span> <span class="st0">&quot;=a&quot;</span> <span class="br0">&#40;</span>lo<span class="br0">&#41;</span>, &nbsp;<span class="st0">&quot;=d&quot;</span> <span class="br0">&#40;</span>hi<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="sy4">::</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __asm__ __volatile__<span class="br0">&#40;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;xorl %%eax, %%eax; cpuid;&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="sy4">::</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="sy4">:</span><span class="st0">&quot;%rax&quot;</span>, <span class="st0">&quot;%rbx&quot;</span>, <span class="st0">&quot;%rcx&quot;</span>, <span class="st0">&quot;%rdx&quot;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; </div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">return</span> <span class="br0">&#40;</span>__int64<span class="br0">&#41;</span>hi <span class="sy1">&lt;&lt;</span> <span class="nu0">32</span> <span class="sy3">|</span> lo;</div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="co2">#define COUNT 20</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw4">void</span> tester<span class="br0">&#40;</span>__int64 overhead<span class="br0">&#91;</span><span class="br0">&#93;</span>, __int64 clocks<span class="br0">&#91;</span><span class="br0">&#93;</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __int64 s, f, t1, t2;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">int</span> x_i; <span class="kw4">int</span> x_it <span class="sy1">=</span> <span class="nu0">0</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="co1">// set locals</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">int</span> iterations <span class="sy1">=</span> <span class="nu0">20</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">int</span> i;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">long</span> values<span class="br0">&#91;</span>iterations<span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw4">long</span> x,y,z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; x <span class="sy1">=</span> <span class="nu0">1</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; y <span class="sy1">=</span> <span class="nu0">1</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="co1">// LOOP CONTROL CODE</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">for</span><span class="br0">&#40;</span>x_it <span class="sy1">=</span> <span class="nu0">0</span>; x_it <span class="sy1">&lt;</span> COUNT; x_it<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span><span class="br0">&#40;</span>x_i <span class="sy1">=</span> <span class="nu0">0</span>; x_i <span class="sy1">&lt;</span> <span class="nu0">3</span>; x_i<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; t1 <span class="sy1">=</span> rdtsc<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; t2 <span class="sy1">=</span> rdtsc<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; overhead<span class="br0">&#91;</span>x_it<span class="br0">&#93;</span> <span class="sy1">=</span> t2 <span class="sy2">-</span> t1;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; s <span class="sy1">=</span> rdtsc<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// CODE START</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// standard loop</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; asm<span class="br0">&#40;</span><span class="st0">&quot;nop&quot;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span><span class="br0">&#40;</span>i <span class="sy1">=</span> <span class="nu0">0</span>; i <span class="sy1">&lt;</span> iterations; i<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; z <span class="sy1">=</span> x <span class="sy2">+</span> y;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; x <span class="sy1">=</span> y;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; y <span class="sy1">=</span> z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; values<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy1">=</span> z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; asm<span class="br0">&#40;</span><span class="st0">&quot;nop&quot;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// CODE END</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; f <span class="sy1">=</span> rdtsc<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// reset locals values</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; x <span class="sy1">=</span> <span class="nu0">1</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; y <span class="sy1">=</span> <span class="nu0">1</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; clocks<span class="br0">&#91;</span>x_it<span class="br0">&#93;</span> <span class="sy1">=</span> <span class="br0">&#40;</span>f <span class="sy2">-</span> s<span class="br0">&#41;</span> <span class="sy2">-</span> overhead<span class="br0">&#91;</span>x_it<span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">for</span><span class="br0">&#40;</span>i <span class="sy1">=</span> <span class="nu0">0</span>; i <span class="sy1">&lt;</span> iterations; i<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">cout</span> <span class="sy1">&lt;&lt;</span> values<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy1">&lt;&lt;</span> endl;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw4">int</span> main<span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="br0">&#123;</span> &nbsp; &nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; __int64 overhead<span class="br0">&#91;</span>COUNT<span class="br0">&#93;</span>, clocks<span class="br0">&#91;</span>COUNT<span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="co1">// run test</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; tester<span class="br0">&#40;</span>overhead, clocks<span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; </div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="co1">// print results</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">for</span><span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy1">=</span> <span class="nu0">0</span>; i <span class="sy1">&lt;</span> COUNT; i<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">cout</span> <span class="sy1">&lt;&lt;</span> <span class="st0">&quot;Overhead:&quot;</span> <span class="sy1">&lt;&lt;</span> overhead<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy1">&lt;&lt;</span> <span class="st0">&quot;<span class="es0">\t</span>Clocks: &quot;</span> <span class="sy1">&lt;&lt;</span> clocks<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy1">&lt;&lt;</span> endl;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; <span class="kw1">return</span> <span class="br0">&#40;</span><span class="kw2">EXIT_SUCCESS</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#125;</span></div>
</li>
</ol>
</div>
<p>The basic idea is that we define how many tests to perform, then loop over the code that many times. On each iteration we call our inline rdtsc() routine, which is written in assembly. It&#8217;s important to note that with modern out-of-order processors we need to serialize the instruction stream (via CPUID), as well as issue clobbers for the registers affected by the CPUID call. We also measure the overhead of two back-to-back rdtsc() calls, CPUID included, and subtract it from the final result. In the end this routine allows us to get a good feel for how many CPU clocks our code consumed.</p>
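<p>The same measure-and-subtract idea can also be sketched without hand-written assembly. The version below is an assumption-laden alternative, not the routine from this post: it uses the __rdtsc() and _mm_lfence() intrinsics that gcc and clang provide in &lt;x86intrin.h&gt;, fencing with LFENCE (a commonly used lighter-weight substitute for CPUID), and falls back to std::chrono ticks on other targets. The helper names read_ticks and count_ticks are hypothetical.</p>

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__)
#include <x86intrin.h>  // __rdtsc, _mm_lfence (gcc/clang on x86-64)
#endif

// Read a tick counter. On x86-64 this is the time-stamp counter, fenced
// with LFENCE rather than CPUID (an assumption of this sketch). On other
// targets it falls back to steady_clock ticks, which are not CPU clocks.
inline std::uint64_t read_ticks() {
#if defined(__x86_64__)
    _mm_lfence();
    const std::uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Time `work` `runs` times. Each sample has the cost of two back-to-back
// reads subtracted, clamped at zero so noise cannot make it wrap around.
template <typename Work>
std::vector<std::uint64_t> count_ticks(Work work, int runs) {
    assert(runs > 0);
    std::vector<std::uint64_t> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r < runs; ++r) {
        const std::uint64_t t1 = read_ticks();
        const std::uint64_t t2 = read_ticks();
        const std::uint64_t overhead = t2 - t1;

        const std::uint64_t start = read_ticks();
        work();
        const std::uint64_t finish = read_ticks();

        const std::uint64_t elapsed = finish - start;
        samples.push_back(elapsed > overhead ? elapsed - overhead : 0);
    }
    return samples;
}
```

<p>Passing the Fibonacci loop as a lambda, for example count_ticks(fib_body, 8), returns eight samples that can be printed the same way as the Clocks column in this post. The absolute numbers will differ from the CPUID-based routine, so the two should not be compared directly.</p>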
<p>Finally, as a general point of interest, context switches, present in all modern operating systems, can and will affect the clock count of long-running code. Thus, in our case we make sure to benchmark only small sections of code.</p>
<p>With our counting routine ready for action the first thing I wanted to look at was the base performance. Running through the first 8 iterations shows:</p>
<p>Overhead:440	Clocks: 363<br />
Overhead:440	Clocks: 286<br />
Overhead:440	Clocks: 407<br />
Overhead:440	Clocks: 308<br />
Overhead:440	Clocks: 286<br />
Overhead:440	Clocks: 286<br />
Overhead:440	Clocks: 286<br />
Overhead:440	Clocks: 286</p>
<p>In other words, after a cache warm-up period we stabilize at 286 cycles.</p>
<p>We&#8217;ll then add the -O2 optimization flag to the compiler options (in NetBeans, open your project&#8217;s options page and select the C++ Compiler options &gt; Additional Options) and see what we get:</p>
<p>Overhead:429	Clocks: 121<br />
Overhead:429	Clocks: 99<br />
Overhead:429	Clocks: 66<br />
Overhead:429	Clocks: 55<br />
Overhead:429	Clocks: 66<br />
Overhead:429	Clocks: 55<br />
Overhead:429	Clocks: 66<br />
Overhead:429	Clocks: 66</p>
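<p>Since the samples jitter from run to run, it helps to boil each run down to a single number before comparing them. The sketch below takes the median of the clock counts listed above; the function name median_clocks and the two constant names are hypothetical, and the median is just one reasonable choice (the minimum is another).</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Median of a set of clock samples. Takes a copy so it can sort freely.
std::uint64_t median_clocks(std::vector<std::uint64_t> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    if (samples.size() % 2 == 1) {
        return samples[mid];
    }
    // Even number of samples: average the two middle values.
    return (samples[mid - 1] + samples[mid]) / 2;
}

// Clock counts copied from the two runs listed in this post.
const std::vector<std::uint64_t> kUnoptimized = {363, 286, 407, 308, 286, 286, 286, 286};
const std::vector<std::uint64_t> kOptimized = {121, 99, 66, 55, 66, 55, 66, 66};
```

<p>For these samples the medians are 286 clocks without optimization and 66 clocks with -O2, a speedup of roughly 4.3x.</p>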
<p>To make sense of this we&#8217;ll check out the assembly for each:</p>
<p>First, the unoptimized version:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&nbsp;movl $<span class="nu0">0</span>, <span class="nu0">-40</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">jmp</span> .L7</div>
</li>
<li class="li1">
<div class="de1">.L8:</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq <span class="nu0">-88</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq <span class="nu0">-80</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %rdx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;leaq <span class="br0">&#40;</span>%rdx,%rax<span class="br0">&#41;</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rax, <span class="nu0">-96</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq <span class="nu0">-88</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rax, <span class="nu0">-80</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq <span class="nu0">-96</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rax, <span class="nu0">-88</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movl <span class="nu0">-40</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %<span class="kw3">edx</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq <span class="nu0">-112</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movslq %<span class="kw3">edx</span>,%rdx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq <span class="nu0">-96</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %rcx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rcx, <span class="br0">&#40;</span>%rax,%rdx,<span class="nu0">8</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;addl $<span class="nu0">1</span>, <span class="nu0">-40</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">.L7:</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movl <span class="nu0">-40</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %<span class="kw3">eax</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;cmpl <span class="nu0">-36</span><span class="br0">&#40;</span>%rbp<span class="br0">&#41;</span>, %<span class="kw3">eax</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">setl</span> %<span class="kw3">al</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;testb %<span class="kw3">al</span>, %<span class="kw3">al</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">jne</span> .L8</div>
</li>
</ol>
</div>
<p>And the -O2 optimized version:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&nbsp;xorl %<span class="kw3">eax</span>, %<span class="kw3">eax</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movl $<span class="nu0">1</span>, %<span class="kw3">edx</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movl $<span class="nu0">1</span>, %<span class="kw3">ebx</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;.p2align <span class="nu0">4</span>,,<span class="nu0">10</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;.p2align <span class="nu0">3</span></div>
</li>
<li class="li1">
<div class="de1">.L6:</div>
</li>
<li class="li1">
<div class="de1">&nbsp;leaq <span class="br0">&#40;</span>%rdx,%rbx<span class="br0">&#41;</span>, %rcx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rdx, %rbx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rcx, <span class="br0">&#40;</span>%r12,%rax<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;addq $<span class="nu0">8</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rcx, %rdx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;cmpq $<span class="nu0">80</span>, %rax</div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">jne</span> .L6</div>
</li>
</ol>
</div>
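<p>The first thing that jumps out is the loop counting. In the unoptimized version the counter lives on the stack: each pass loads it into a 32-bit register (%eax), compares it against the limit in memory, and then goes through an extra setl/testb pair before the jump. The optimized version keeps the counter in a register (%rax) and uses a single cmpq against an immediate value. That register doubles as the byte offset into values, which is why it counts up to 80: 10 values of 8 bytes each.</p>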
<p>This explains some of the speedup, but not all. To get to the bottom of why the optimized code is so much faster, we can take a huge hint from the raw instruction count: the optimized loop runs 7 instructions per iteration where the unoptimized one runs 19.</p>
<p>Part of this is a side effect of the 64-bit architecture we&#8217;re compiling to, which has registers to spare: %rdx and %rbx are added with leaq into %rcx, so the loop values stay in registers at all times, as opposed to the unoptimized code, which reloads and stores every value on the stack in each pass. This eliminates almost all of the movq&#8217;s, which certainly helps, but far more importantly, the only memory access left in the loop is the store into values.</p>
<p>All that said, it <em>still </em>doesn&#8217;t explain everything in terms of raw performance. </p>
<p>The big hint comes from recompiling the application to use only 4 Fibonacci iterations. It&#8217;s a bit hard to see, so I&#8217;ve added an intermediate asm(&quot;nop&quot;) in the loop so it becomes clear:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">#NO_APP</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq $<span class="nu0">2</span>, <span class="br0">&#40;</span>%r12<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">#APP</div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">72</span> <span class="st0">&quot;main.cpp&quot;</span> <span class="nu0">1</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">nop</span></div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">0</span> <span class="st0">&quot;&quot;</span> <span class="nu0">2</span></div>
</li>
<li class="li1">
<div class="de1">#NO_APP</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq $<span class="nu0">3</span>, <span class="nu0">8</span><span class="br0">&#40;</span>%r12<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">#APP</div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">72</span> <span class="st0">&quot;main.cpp&quot;</span> <span class="nu0">1</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">nop</span></div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">0</span> <span class="st0">&quot;&quot;</span> <span class="nu0">2</span></div>
</li>
<li class="li1">
<div class="de1">#NO_APP</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq $<span class="nu0">5</span>, <span class="nu0">16</span><span class="br0">&#40;</span>%r12<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">#APP</div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">72</span> <span class="st0">&quot;main.cpp&quot;</span> <span class="nu0">1</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">nop</span></div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">0</span> <span class="st0">&quot;&quot;</span> <span class="nu0">2</span></div>
</li>
<li class="li1">
<div class="de1">#NO_APP</div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq $<span class="nu0">8</span>, <span class="nu0">24</span><span class="br0">&#40;</span>%r12<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">#APP</div>
</li>
</ol>
</div>
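<p>gcc has unrolled the loop for us, and in doing so it has computed every value at compile time: the stores are plain immediates (2, 3, 5, 8), with no additions left at all. Of course gcc will only unroll so far, so one more iteration to make 5 means we lose the unroll, and we&#8217;re back to standard loop code.</p>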
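<p>For comparison, the same end result can be requested explicitly in the source. The sketch below is an analogy rather than what gcc does internally at -O2: it marks the 4-iteration loop constexpr (C++14 or later) so that the table of values has to be computed at compile time. FibTable, make_fib_table, kFib and fib_table_at are hypothetical names used only here.</p>

```cpp
#include <cassert>

// Holder for the four values the unrolled assembly stores: 2, 3, 5, 8.
struct FibTable {
    long v[4];
};

// The same loop as in the post, written as a constexpr function.
constexpr FibTable make_fib_table() {
    FibTable t{};
    long x = 1, y = 1;
    for (int i = 0; i < 4; ++i) {
        const long z = x + y;
        x = y;
        y = z;
        t.v[i] = z;
    }
    return t;
}

// Initializing a constexpr variable forces compile-time evaluation.
constexpr FibTable kFib = make_fib_table();
static_assert(kFib.v[0] == 2 && kFib.v[1] == 3, "first two values");
static_assert(kFib.v[2] == 5 && kFib.v[3] == 8, "last two values");

// Runtime accessor with a bounds check, so the table can be read by index.
inline long fib_table_at(int i) {
    assert(i >= 0 && i < 4);
    return kFib.v[i];
}
```

<p>The difference is one of guarantees: with constexpr the compile-time evaluation is part of the program, whereas the unrolling above is a decision the optimizer is free to make or not.</p>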
<p>This is related to another important rule -O2 brings to the mix: if we remove the assignment that saves our values in <em>values[n]</em>, -O2 literally removes our loop from the final assembly file:</p>
<div class="geshi no cpp">
<ol>
<li class="li1">
<div class="de1"><span class="kw1">for</span><span class="br0">&#40;</span>i <span class="sy1">=</span> <span class="nu0">0</span>; i <span class="sy1">&lt;</span> iterations; i<span class="sy2">++</span><span class="br0">&#41;</span><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; z <span class="sy1">=</span> x <span class="sy2">+</span> y;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; x <span class="sy1">=</span> y;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; y <span class="sy1">=</span> z;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">//values[i] = z;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div>
</li>
</ol>
</div>
<p>Generates:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1"># <span class="nu0">67</span> <span class="st0">&quot;main.cpp&quot;</span> <span class="nu0">1</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">nop</span></div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">0</span> <span class="st0">&quot;&quot;</span> <span class="nu0">2</span></div>
</li>
<li class="li1">
<div class="de1"># <span class="nu0">74</span> <span class="st0">&quot;main.cpp&quot;</span> <span class="nu0">1</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">nop</span></div>
</li>
</ol>
</div>
<p>If, however, we add a cout &lt;&lt; y; call after our loop, we generate:</p>
<div class="geshi no asm">
<ol>
<li class="li1">
<div class="de1">&nbsp;xorl %<span class="kw3">eax</span>, %<span class="kw3">eax</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;.p2align <span class="nu0">4</span>,,<span class="nu0">10</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;.p2align <span class="nu0">3</span></div>
</li>
<li class="li1">
<div class="de1">.L6:</div>
</li>
<li class="li1">
<div class="de1">&nbsp;leaq <span class="br0">&#40;</span>%rsi,%rdi<span class="br0">&#41;</span>, %rdx</div>
</li>
<li class="li1">
<div class="de1">&nbsp;addl $<span class="nu0">1</span>, %<span class="kw3">eax</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rsi, %rdi</div>
</li>
<li class="li1">
<div class="de1">&nbsp;cmpl $<span class="nu0">10</span>, %<span class="kw3">eax</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;movq %rdx, %rsi</div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">jne</span> .L6</div>
</li>
</ol>
</div>
<p>In other words, the secret to the speed-up is that in -O2 mode gcc emits just enough code to get the correct result, and nothing more. We&#8217;ll see this time and again when using optimizations: gcc strips away the fat and makes decisions based on your actual code, as opposed to blindly creating assembly that fits the most general scenario.</p>
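<p>This matters for benchmarking, because a loop whose result is never used may be deleted before it can be timed. Printing the result works, as shown above, but it also drags I/O into the picture. A lighter option, sketched below under the assumption of gcc or clang, is an empty inline asm statement that takes the value as an input. The names keep_alive and fib_loop are hypothetical.</p>

```cpp
#include <cassert>

// Tell the optimizer that `value` is used, without emitting instructions.
// Extended asm with an empty template is a gcc/clang extension.
inline void keep_alive(long value) {
#if defined(__GNUC__)
    __asm__ __volatile__("" : : "r"(value) : "memory");
#else
    volatile long sink = value;  // fallback for other compilers: a real store
    (void)sink;
#endif
}

// The loop from the post, starting from x = 1, y = 1.
long fib_loop(int iterations) {
    assert(iterations >= 0);
    long x = 1, y = 1;
    for (int i = 0; i < iterations; ++i) {
        const long z = x + y;
        x = y;
        y = z;
    }
    // Stands in for the cout call: y now counts as used, so the loop
    // has to be computed even if the caller ignores the return value.
    keep_alive(y);
    return y;
}
```

<p>In a timing harness the return value would typically be ignored, which is exactly the situation where keep_alive(y) earns its place.</p>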
<p>That said, nothing&#8217;s stopping us from rewriting the code using hand-crafted assembly to match the -O2 version, though with -O2 being 3 characters away, one could ask why bother&#8230; though we will anyway in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.formboss.net/blog/2010/07/c-performance-counting-clocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

