Recently I was analyzing some very computation-intensive code. The code was operating on an image and was fiddling with each pixel. For an image of size 4000×4000 my function was called ~5 billion times. Since most of the computation depends only on each pixel's own attributes, the problem lent itself very well to data-parallel computation, but I was more interested in eliminating bottlenecks in the serial code first than in running un-optimized code in parallel.
I started on Mac, compiled with Clang 3.5, and observed that the computation on 20 images took around ~8 seconds on my 3-year-old MacBook Pro. But the moment I compiled with Visual Studio 2013 and ran the same code on a Windows 8.1 desktop with a shiny new Haswell processor, it took a whopping ~24 seconds to run! I was astonished and wanted to find out what had gone wrong. How could Visual Studio 2013 generate such slow code?
To tackle this head on I compiled the same code on Windows with both clang-cl and Visual Studio 2013 and dumped the generated assembly (the /Fa compiler option). Looking at the assembly, it was quite clear what caused the slowdown. The extra checking Visual Studio adds in debug builds with Run-Time Error Checks did add quite a bit of code, but above all it was thrashing the cache badly. Later on I compiled my code with /RTC left at its default and saw timings comparable to what Clang generated.
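For reference, the invocations involved look roughly like this. The source file name here is hypothetical; /Fa writes an assembly listing, /Zi emits debug info, and /RTC1 is the run-time-checks switch that Visual Studio enables by default in its debug configuration:

```bat
rem Debug build with Run-Time Error Checks and an assembly listing:
cl /Zi /RTC1 /Fa pixels.cpp
rem Same build with /RTC left at its default (off):
cl /Zi /Fa pixels.cpp
rem The clang-cl build used for comparison:
clang-cl /Zi /Fa pixels.cpp
```

Diffing the two .asm listings makes the extra instrumentation that /RTC1 inserts around each function call easy to spot.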
It was quite an interesting exercise investigating and analyzing various CPU performance counters on Mac and Windows; perhaps I will write more about it some time later.