5. February 2014 03:20
by nKast

CPU Skinning: ARM-NEON


 One of the nice things about Windows Phone 7 was its experimental support for ARM-NEON instructions: the runtime generated NEON instructions for XNA's built-in vector classes. This greatly improved performance for things like physics, particles and geometry generation. I used it from the very beginning, on The Juggler to speed up Farseer physics and later on Dr. Pickaxe to speed up both physics and CPU skinning. In this forum thread, @Moblunatic describes how you can modify the CPU Skinning sample to get a ~40% improvement on WP7 devices.

 As we moved our next project to WP8/MonoGame, I decided to take some measurements again to see whether there was anything I could optimize.

Framework  Device         Original   NEON
XNA        HD7 (WP7.5)    28,412ms   18,463ms
XNA        L 620 (WP8)     8,750ms   14,159ms
MonoGame   L 620 (WP8)    11,769ms   25,639ms

 The first thing we notice is that on WP8 we no longer get the benefit of ARM-NEON. Even for old XNA projects, the OS no longer uses these instructions. So if you want maximum performance you need to detect WP8, probably via reflection, and fall back to the original skinning code there.
 The same is true for MonoGame: revert to the original code if you need maximum performance.
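 A minimal sketch of that fallback decision. It swaps the reflection check for a plain OS version comparison, assuming Environment.OSVersion.Version reports a major version of 8 or higher on WP8. SkinningPathSelector and ShouldUseNeonPath are hypothetical names, not part of the sample.

```csharp
using System;

// Hypothetical helper, not part of the CPU Skinning sample.
public static class SkinningPathSelector
{
    // Assumption: the NEON-oriented code only pays off below OS major version 8,
    // which matches the measurements in this post (WP7.5 vs WP8).
    public static bool ShouldUseNeonPath(Version osVersion)
    {
        return osVersion.Major < 8;
    }
}
```

 At startup you would call it once, for example with SkinningPathSelector.ShouldUseNeonPath(Environment.OSVersion.Version), and cache the result. A MonoGame build would simply always take the original path.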

Assume nothing

One thing I noticed about the code was that it makes a method call inside the loop, which in turn calls a second method.

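// skin all of the vertices
for (int i = 0; i < vertexCount; i++)
{
    // The call line is missing from the listing above; Skinning.SkinVertex and
    // bones are assumed here from the XNA CPU Skinning sample.
    Skinning.SkinVertex(
        bones,
        ref cpuVertices[i].Position,
        ref cpuVertices[i].Normal,
        ref cpuVertices[i].BlendIndices,
        ref cpuVertices[i].BlendWeights,
        out gpuVertices[i].Position,
        out gpuVertices[i].Normal);
}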

 I assumed that I could speed up the code significantly by moving the actual code inside the loop and removing the overhead of those calls. This turned out to work for the NEON version, but I also got some weird artifacts on the HD7, so I couldn't use it.
 For the original code, the one I use for MonoGame, it made things worse! It turns out the code was not written this way for simplicity; there are some very clever optimizations going on.
 Notice the use of ref and out? This is like taking the address of, let's say, cpuVertices[i].Position and passing it down to the next method, instead of copying the struct to a local variable or repeatedly accessing it through cpuVertices[i]. Since some platforms don't allow pointers/unsafe code, the use of ref/out is a nice trick!
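 To illustrate the ref/out trick in isolation, here is a small self-contained sketch. Vec3, Vertex, Scale and ScaleAll are hypothetical stand-ins, not the sample's types, and the "work" is just a scale instead of real skinning. The point is only that a field of an array element can be passed by reference, so the callee reads and writes the array storage directly.

```csharp
using System;

// Hypothetical stand-in for an XNA-style vector struct.
public struct Vec3
{
    public float X, Y, Z;
    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }
}

// Public fields (not properties), so they can be passed with ref/out.
public struct Vertex
{
    public Vec3 Position;
    public Vec3 Normal;
}

public static class RefOutDemo
{
    // Reads 'input' through a reference and writes the result straight
    // into 'output'; no Vec3 is copied on the way in or out.
    public static void Scale(ref Vec3 input, float factor, out Vec3 output)
    {
        output.X = input.X * factor;
        output.Y = input.Y * factor;
        output.Z = input.Z * factor;
    }

    public static Vertex[] ScaleAll(Vertex[] source, float factor)
    {
        var result = new Vertex[source.Length];
        for (int i = 0; i < source.Length; i++)
        {
            // source[i].Position is a variable inside the array, so 'ref'
            // refers to that storage instead of a temporary copy.
            Scale(ref source[i].Position, factor, out result[i].Position);
            Scale(ref source[i].Normal, factor, out result[i].Normal);
        }
        return result;
    }
}
```

 Note that this only works with arrays and public fields. A List<T> indexer or a property returns a copy of the struct, so the compiler will not let you pass it by ref.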

Framework  Device         Original   NEON       Original (inlined)  NEON (inlined)
XNA        HD7 (WP7.5)    28,412ms   18,463ms   34,210ms            14,099ms
XNA        L 620 (WP8)     8,750ms   14,159ms    9,421ms            9,7598ms
MonoGame   L 620 (WP8)    11,769ms   25,639ms   12,758ms            21,035ms



 If you still support WP7 with XNA, always use CPU skinning (the GPUs were really weak), enable NEON (EnableFPIntrinsicsUsingSIMD inside AssemblyInfo.cs), and use the NEON version of CPUSkinning.

 On WP8 the GPU is fast enough to do skinning, but you can always keep that extra headroom for rich shading, post-processing, etc. Skinning is one of the few things that you can move to a second thread, so it comes for free if you do so. Use the original skinning code, which is better optimized and performs better in the absence of NEON/XNA.
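 One way to sketch the second-thread idea, assuming Task.Run is available on your target. BackgroundSkinner, Begin and End are hypothetical names, and the skinning loop itself is passed in as a delegate. It is an outline of the hand-off, not a tuned implementation.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper: start skinning early in Update, collect it before Draw.
public sealed class BackgroundSkinner
{
    private Task pending;

    // Kicks off the skinning work on a thread-pool thread.
    public void Begin(Action skinAllVertices)
    {
        End(); // never run two skinning passes over the same buffers at once
        pending = Task.Run(skinAllVertices);
    }

    // Blocks until the skinned vertices are ready to be uploaded/drawn.
    public void End()
    {
        if (pending != null)
        {
            pending.Wait();
            pending = null;
        }
    }
}
```

 The idea is to call Begin as soon as the bone transforms for the frame are known, do the rest of the update on the main thread, and call End right before the vertex buffer is needed. The bone and vertex arrays must not be touched by the main thread between the two calls.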


CPUSkinning - 02 - Neon.zip (10.99 mb)



