pixel perf-ect

29. November 2023 16:19
by nKast

What's new in KNI v3.10


  KNI 3.10.9001 implements many previously missing features on the BlazorGL platform and brings improvements to the Effect Processor, bug fixes, and performance improvements. A more detailed list of the changes is available in the changelog.

Download KNI v3.10 today.

.net6 Templates and nuget

  The SDK now includes new VS2022 templates targeting .net6. Both the .net6 templates and the former templates for net4.0 framework, Xamarin and uap10, reference the nuget packages of the framework. The MAUI platforms (Android, iOS) and their templates are upgraded to target .net8.



  The BlazorGL platform received several improvements in this version. The following are now implemented:

  • Texture2D.GetData(…)
  • SongReader & MediaPlayer
  • VideoPlayer
  • Game.IsActive
  • Depth24Stencil8 usage for RenderTarget2D and the backbuffer PreferredDepthStencilFormat.
  • PreferMultiSampling and RenderTargetUsage for the backbuffer.


  A common complaint about the content builder is that it takes a couple of seconds to load a project and check for changed assets. The db files in the IntermediateDir folder are now stored in a new binary format, replacing the XmlSerializer along with its slow reflection and parsing. Project loading now uses a dictionary with O(1) lookups to check for duplicates. The tool no longer outputs ‘Skipping…’ lines. As a result, the content builder is now lightning fast, and you will hardly notice it when rebuilding a project. Additionally, an issue with /OutputDir and /IntermediateDir has been fixed, and the tool stores its output in the same folder whether you build from the editor, the command line, or the .csproj.
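The duplicate check described above can be pictured as a hash-set lookup. The following is a minimal C++ sketch of that idea, not the content builder's actual code; `findDuplicates` and the asset paths in the usage note are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns every asset path that was already seen earlier
// in the list. Each hash-set insert is O(1) on average, so the whole pass is
// O(n) instead of the O(n^2) cost of comparing every pair of entries.
std::vector<std::string> findDuplicates(const std::vector<std::string>& assetPaths)
{
    std::unordered_set<std::string> seen;
    std::vector<std::string> duplicates;
    for (const std::string& path : assetPaths)
    {
        // insert() reports false in .second when the key is already present.
        if (!seen.insert(path).second)
            duplicates.push_back(path);
    }
    return duplicates;
}
```

Calling it with a list such as `ball.png`, `sprite.fx`, `ball.png` would report `ball.png` once, while a list without repeats yields an empty result.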

    The EffectProcessor takes 10% less time than the previous version, after some refactoring and the removal of unnecessary data copies.

  Macros are no longer necessary when writing effects. Whether you are targeting WindowsDX (DX11) or an OpenGL platform, you write a single shader. You can use the new HLSL4.0 syntax to define Samplers and Textures, or the old HLSL2.0 syntax (compatibility mode).


Texture2D SpriteTexture : register(t0);
sampler2D SpriteTextureSampler : register(s0) = sampler_state
{
    Texture = <SpriteTexture>;
};
struct VertexShaderOutput
{
	float4 Position : SV_POSITION;
	float2 TextureCoordinates : TEXCOORD0;
};
float4 MainPS(VertexShaderOutput input) : COLOR
{
	return tex2D(SpriteTextureSampler, input.TextureCoordinates);
}
technique SpriteDrawing
{
	pass pass0 { PixelShader = compile ps_4_0_level_9_1 MainPS(); }
}


Texture2D SpriteTexture : register(t0);
sampler SpriteTextureSampler : register(s0);
struct VertexShaderOutput
{
	float4 Position : SV_POSITION;
	float2 TextureCoordinates : TEXCOORD0;
};
float4 MainPS(VertexShaderOutput input) : COLOR
{
	return SpriteTexture.Sample(SpriteTextureSampler, input.TextureCoordinates);
}
technique SpriteDrawing
{
	pass pass0 { PixelShader = compile ps_4_0_level_9_1 MainPS(); }
}



  Rendering performance has been improved, particularly for OpenGL.



The performance tests can be found in the KniBenchmarks project on GitHub.
Framework versions: KNI 3.10.9001, MonoGame, FNA 23.07, XNA 4.0
All tests were performed on the following system:

CPU: AMD Ryzen 3 2200U
GPU: AMD Radeon Vega 3 Mobile Gfx
Storage: WD Blue SN550 NVMe SSD



While KNI is free and open-source, maintaining and expanding the framework requires ongoing effort and resources. We rely on the support of our community to continue delivering top-notch updates, features, and support.
By becoming a Sponsor, you can directly contribute to the growth and sustainability of the KNI Game Framework.

8. August 2023 18:19
by nKast

What's new in KNI v3.9


  KNI 3.9.9001 brings performance improvements and bug fixes, and implements missing features from XNA. A more detailed list of the changes is available in the changelog.

Download KNI v3.9 today.


  The most significant change in this version has to do with the library structure. All basic math types have been moved from MonoGame.dll to Xna.Framework.dll, and the Vector converters to Xna.Framework.Design.dll. The latter is needed only for game editors, and you can ignore it in a typical game project. MonoGame.dll still contains the majority of namespaces, like Graphics, Audio, Input, etc.
  The core of the Content.Pipeline types can be found in Xna.Framework.Content.Pipeline.dll, while the Importers, Processors and Content items have been moved to their own libraries. MonoGameContent.Pipeline.dll has been removed.

  The effort to remove preprocessor directives and partial classes from the code, and to split all major namespaces into their own modules, will continue. The benefits are a cleanly structured codebase, no more need for bait-and-switch trickery, seamless support for nuget packages and signed 3rd party libraries, and easier adoption of new platforms.
  It is necessary to update your projects and libraries with the new targets. Add a reference to Xna.Framework.dll and use the Ref platform as the bait-and-switch target for MonoGame.Framework.dll.


  After being in the backlog for years, it’s finally here. This version of KNI brings a functional implementation of VideoPlayer to WindowsDX, UAP and Android platforms.


  The Font importer has been steadily improved over the last versions, starting with a bug in the Spacing element of the FontDescription. You can see below how the font is rendered after the fix.

 Another old issue is the font baseline. Any font family is now rendered with correct spacing above and below the font.

  FontDescriptionProcessor has a new property that lets you choose the hinting algorithm.
This property is specific to SharpFont.

Pipeline Content Editor

The font size of the Content editor GUI has been increased for improved readability.


The performance of Font and Texture processor has been improved. Building content during development can be distracting, especially in a big project.

Loading times are also important, not only during development but also to your target audience.

Fast content loading is essential on mobile marketplaces that have strict requirements on apps.

Rendering performance can give you not only smoother animations, but also more headroom for your update logic and physics.

The performance tests can be found in the KniBenchmarks project on GitHub.
Framework versions: KNI 3.9.9001, MonoGame, FNA 23.07, XNA 4.0
All tests were performed on the following system:

CPU: AMD Ryzen 3 2200U
GPU: AMD Radeon Vega 3 Mobile Gfx
Storage: WD Blue SN550 NVMe SSD





14. April 2014 03:19
by nKast

CPU Skinning: Go Native


One of the cool things about WP8 was the ability to write native code, something that was missing from the previous platform. Skinning was the perfect testbed to see what native code could do.


The way to add native code to WP8 is C++/CX, a language extension that replaced the managed C++/CLI as a means of mixing C++ with C#. One drawback of C++/CX is that you can't pass pointers across the boundary, so you have to copy structs defined in MonoGame, like Vector3 and Matrix, to equivalent C++ structs. For example, here is a Matrix struct in C++/CX.

namespace NativeHelper
{
    namespace Data
    {
        public value struct MatrixData
        {
            float M11, M12, M13, M14;
            float M21, M22, M23, M24;
            float M31, M32, M33, M34;
            float M41, M42, M43, M44;
        };
    }
}


Another drawback was that any parameter or array you pass to C++/CX is copied/marshaled. That meant the native code had to be much faster than C# to offset the slowdown from all that data being copied around. Extra care was taken to keep copying to a minimum.

First, I populated the native object with the cpuVertices on initialization. That way I only had to pass the new bones on every frame.
Second, the returned skinned vertices can be used directly to update the dynamic vertex buffer. The fact that the vertex struct is defined in native code is irrelevant, since public C++/CX structs are valid C# structs and VertexBuffer.SetData() accepts either an IVertexType or a plain struct.

Another important detail was the type of the parameters in C++/CX. The bones were declared as const Array<MatrixData>^, which means there is no need to copy the content back when the function returns. The skinned vertices were declared as WriteOnlyArray<VertexPositionNormalTextureData>^, which means there is no need to copy its content when you call the native code; the content is only copied/marshaled back when the function returns.

void Skin(const Array<MatrixData>^ bones, WriteOnlyArray<VertexPositionNormalTextureData>^ vertices);

There are some more tricks to get the most out of C++: disable all kinds of runtime checks in the project, select Maximize Speed and Favor Fast Code (over size), and enable the Fast Floating Point Model.

Finally, I found that accessing the Data pointer of the arrays directly was a bit faster than accessing them through the [] operator.

void NativeHelper::SkinnedModel::Skin(const Array<Matrix3x4Data>^ bones, Platform::WriteOnlyArray<VertexPositionNormalTextureData>^ vertices)
{
    // copy data locally
    int bonesLength = bones->Length;
    Matrix3x4Data* locbones = bones->Data;
    VertexPositionNormalTextureData* vout = vertices->Data;
    // skin all of the vertices
    int icount = _verticesLength;
    for (int i = 0; i < icount; i++)
    {
        int b0 = _skinVertices[i].BlendIndices.X;
        int b1 = _skinVertices[i].BlendIndices.Y;
        int b2 = _skinVertices[i].BlendIndices.Z;
        int b3 = _skinVertices[i].BlendIndices.W;
        Matrix3x4Data* m1 = &locbones[b0];
        Matrix3x4Data* m2 = &locbones[b1];
        Matrix3x4Data* m3 = &locbones[b2];
        Matrix3x4Data* m4 = &locbones[b3];
        float w1 = _skinVertices[i].BlendWeights.X;
        float w2 = _skinVertices[i].BlendWeights.Y;
        float w3 = _skinVertices[i].BlendWeights.Z;
        float w4 = _skinVertices[i].BlendWeights.W;
        Matrix3x4Data skinnedTransformSum;
        skinnedTransformSum.M11 = (m1->M11 * w1) + (m2->M11 * w2) + (m3->M11 * w3) + (m4->M11 * w4);
        skinnedTransformSum.M12 = (m1->M12 * w1) + (m2->M12 * w2) + (m3->M12 * w3) + (m4->M12 * w4);
        skinnedTransformSum.M13 = (m1->M13 * w1) + (m2->M13 * w2) + (m3->M13 * w3) + (m4->M13 * w4);
        skinnedTransformSum.M21 = (m1->M21 * w1) + (m2->M21 * w2) + (m3->M21 * w3) + (m4->M21 * w4);
        skinnedTransformSum.M22 = (m1->M22 * w1) + (m2->M22 * w2) + (m3->M22 * w3) + (m4->M22 * w4);
        skinnedTransformSum.M23 = (m1->M23 * w1) + (m2->M23 * w2) + (m3->M23 * w3) + (m4->M23 * w4);
        skinnedTransformSum.M31 = (m1->M31 * w1) + (m2->M31 * w2) + (m3->M31 * w3) + (m4->M31 * w4);
        skinnedTransformSum.M32 = (m1->M32 * w1) + (m2->M32 * w2) + (m3->M32 * w3) + (m4->M32 * w4);
        skinnedTransformSum.M33 = (m1->M33 * w1) + (m2->M33 * w2) + (m3->M33 * w3) + (m4->M33 * w4);
        skinnedTransformSum.M41 = (m1->M41 * w1) + (m2->M41 * w2) + (m3->M41 * w3) + (m4->M41 * w4);
        skinnedTransformSum.M42 = (m1->M42 * w1) + (m2->M42 * w2) + (m3->M42 * w3) + (m4->M42 * w4);
        skinnedTransformSum.M43 = (m1->M43 * w1) + (m2->M43 * w2) + (m3->M43 * w3) + (m4->M43 * w4);
        // Support the 4 Bone Influences - Position then Normal
        Vector3Data position = _skinVertices[i].Position;
        vout[i].Position.X = position.X * skinnedTransformSum.M11 + position.Y * skinnedTransformSum.M21 + position.Z * skinnedTransformSum.M31 + skinnedTransformSum.M41;
        vout[i].Position.Y = position.X * skinnedTransformSum.M12 + position.Y * skinnedTransformSum.M22 + position.Z * skinnedTransformSum.M32 + skinnedTransformSum.M42;
        vout[i].Position.Z = position.X * skinnedTransformSum.M13 + position.Y * skinnedTransformSum.M23 + position.Z * skinnedTransformSum.M33 + skinnedTransformSum.M43;
        Vector3Data normal = _skinVertices[i].Normal;
        vout[i].Normal.X = normal.X * skinnedTransformSum.M11 + normal.Y * skinnedTransformSum.M21 + normal.Z * skinnedTransformSum.M31;
        vout[i].Normal.Y = normal.X * skinnedTransformSum.M12 + normal.Y * skinnedTransformSum.M22 + normal.Z * skinnedTransformSum.M32;
        vout[i].Normal.Z = normal.X * skinnedTransformSum.M13 + normal.Y * skinnedTransformSum.M23 + normal.Z * skinnedTransformSum.M33;
        vout[i].TextureCoordinate = _skinVertices[i].TextureCoordinate;
    }
}
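For readers without a C++/CX toolchain, the same per-vertex math can be tried in plain standard C++. This is only a sketch under assumptions: `Vec3`, `Mat43`, `blendBones`, `transformPosition` and `transformNormal` are hypothetical stand-ins for the types and inline code above, and the bone blending is written as a loop rather than unrolled.

```cpp
#include <cassert>

struct Vec3 { float X, Y, Z; };

// Hypothetical stand-in for Matrix3x4Data: M11..M33 hold rotation/scale,
// M41..M43 hold the translation (the fourth column is implied).
struct Mat43
{
    float M11, M12, M13;
    float M21, M22, M23;
    float M31, M32, M33;
    float M41, M42, M43;
};

// Weighted sum of four bone matrices, the "skinnedTransformSum" of the listing.
Mat43 blendBones(const Mat43* bones, const int indices[4], const float weights[4])
{
    assert(bones != nullptr);
    Mat43 sum = {}; // all fields start at zero
    for (int k = 0; k < 4; ++k)
    {
        const Mat43& m = bones[indices[k]];
        const float w = weights[k];
        sum.M11 += m.M11 * w; sum.M12 += m.M12 * w; sum.M13 += m.M13 * w;
        sum.M21 += m.M21 * w; sum.M22 += m.M22 * w; sum.M23 += m.M23 * w;
        sum.M31 += m.M31 * w; sum.M32 += m.M32 * w; sum.M33 += m.M33 * w;
        sum.M41 += m.M41 * w; sum.M42 += m.M42 * w; sum.M43 += m.M43 * w;
    }
    return sum;
}

// Positions use the translation row; normals ignore it.
Vec3 transformPosition(const Vec3& p, const Mat43& m)
{
    return { p.X * m.M11 + p.Y * m.M21 + p.Z * m.M31 + m.M41,
             p.X * m.M12 + p.Y * m.M22 + p.Z * m.M32 + m.M42,
             p.X * m.M13 + p.Y * m.M23 + p.Z * m.M33 + m.M43 };
}

Vec3 transformNormal(const Vec3& n, const Mat43& m)
{
    return { n.X * m.M11 + n.Y * m.M21 + n.Z * m.M31,
             n.X * m.M12 + n.Y * m.M22 + n.Z * m.M32,
             n.X * m.M13 + n.Y * m.M23 + n.Z * m.M33 };
}
```

Blending an identity bone and a bone translated by (2, 4, 6) with weights 0.5 each moves a vertex by half of that translation, which is a quick sanity check for the weighting.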


Overall, here are the results: 

Device   Original     Native      Native (Parallelization)
L 620    11.769 ms    5.875 ms    4.475 ms


The time drops from 11.77 ms to 5.87 ms just by moving the code to C++, and that includes the extra copy, the cost of crossing the ABI from managed to native, etc. That is about 50% of the original time, or twice as fast!
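The ratios can be checked from the table values. The helpers below are a small sketch with hypothetical names (`speedup`, `remainingFraction`); they only divide the two timings.

```cpp
#include <cassert>

// How many times faster the new timing is compared to the baseline.
double speedup(double baselineMs, double newMs)
{
    assert(newMs > 0.0);
    return baselineMs / newMs;
}

// Fraction of the original frame cost that remains (1.0 means no change).
double remainingFraction(double baselineMs, double newMs)
{
    assert(baselineMs > 0.0);
    return newMs / baselineMs;
}
```

With the Lumia 620 numbers, `speedup(11.769, 5.875)` comes out at roughly 2.0 and `speedup(11.769, 4.475)` at roughly 2.63.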





Another cool feature of the VC++ compiler is Auto-Parallelization and Auto-Vectorization. Vectorization uses SIMD instructions when possible. It works only with basic value types like floats. It didn't like structs or pointers, but to be fair I didn't spend much time on it, nor did I try it on VS2015. All of these are tests I did more than a year ago on VS2013.
Parallelization, on the other hand, was relatively easy to achieve. It uses multiple cores to run a loop in parallel; in the case of the Lumia 620, it uses 2 cores/threads. With native code and Parallelization enabled, the time drops to 4.47 ms. Unfortunately it is very unstable: every few seconds the time can spike up to 300 ms, which makes it unsuitable for games.
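To make the idea concrete, here is a hand-written equivalent of what a parallelized loop does: the index range is cut into contiguous chunks and each chunk runs on its own thread. This is not the compiler's Auto-Parallelization, just a sketch using `std::thread`; `scaleParallel` is a hypothetical function and the per-element work is a simple multiply standing in for skinning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Multiplies every element by `scale`, splitting the index range into
// `threadCount` contiguous chunks that run concurrently. Each thread touches
// a disjoint range, so no locking is needed.
void scaleParallel(std::vector<float>& data, float scale, unsigned threadCount)
{
    assert(threadCount > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + threadCount - 1) / threadCount; // ceil(n / threadCount)

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end)
            break; // no elements left for further threads
        workers.emplace_back([&data, scale, begin, end]
        {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= scale;
        });
    }
    for (std::thread& w : workers)
        w.join(); // wait for every chunk before returning
}
```

Creating and joining threads on every call has a cost of its own, which is one plausible source of frame-time spikes; a real game would more likely keep a persistent worker thread around.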



 CPUSkinning - 03 - GoNative.zip (7.17 mb)


5. February 2014 03:20
by nKast

CPU Skinning: ARM-NEON


 One of the nice things about Windows Phone 7 was the experimental support for ARM-NEON instructions. It generated NEON instructions for XNA's built-in vector classes, which greatly improved performance for things like physics, particles, geometry generation, etc. I used this from the very beginning on The Juggler to improve Farseer physics and later on Dr. Pickaxe to improve both physics and CPU skinning. In this forum @Moblunatic describes how you can modify the CPU Skinning sample to get a ~40% improvement on WP7 devices.

As we moved our next project to WP8/MonoGame, I decided to take some measurements again to see whether any optimization was possible.


Platform   Device         Original     NEON
XNA        HD7 (WP7.5)    28.412 ms    18.463 ms
XNA        L 620 (WP8)     8.750 ms    14.159 ms
MonoGame   L 620 (WP8)    11.769 ms    25.639 ms


The first thing we notice is that on WP8 we no longer get the benefit of ARM-NEON. Even for old XNA projects the OS no longer uses these instructions. So you need to detect WP8, probably by checking for it via reflection, and use the original skinning code if you want maximum performance.
 The same is true for MonoGame. You should revert to the original code if you need maximum performance.

Assume nothing

One of the things I noticed about the code was that it makes a method call inside the loop, which in turn calls a second method.

// skin all of the vertices
for (int i = 0; i < vertexCount; i++)
{
    Skinning.SkinVertex(
            bones,
            ref cpuVertices[i].Position,
            ref cpuVertices[i].Normal,
            ref cpuVertices[i].BlendIndices,
            ref cpuVertices[i].BlendWeights,
            out gpuVertices[i].Position,
            out gpuVertices[i].Normal);
}


 I assumed that I could speed up the code significantly by removing the overhead of those calls and bringing the actual code inside the loop. This turned out to work for the NEON version, but I also got some weird artifacts on the HD7, so I couldn't use it.
 For the original code, the one I use for MonoGame, it made things worse! It turns out the code is not written this way for simplicity; there are some very clever optimizations going on.
 Notice the use of ref & out? This is like taking the address of, let's say, cpuVertices[i].Position and passing it down to the next method instead of copying the struct to a local variable or repeatedly accessing it through the cpuVertices[i] array. Since some platforms don't allow pointers/unsafe code, the use of ref/out is a nice trick!
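The difference between copying a struct and passing its address can be sketched in C++, where references play a role loosely similar to C# ref/out. The types and functions below (`Vec3`, `Vertex`, `offsetByValue`, `offsetByRef`) are hypothetical and only illustrate the calling convention, not the sample's skinning math.

```cpp
#include <cassert>

struct Vec3 { float X, Y, Z; };
struct Vertex { Vec3 Position; Vec3 Normal; };

// By value: the struct is copied into the parameter and the result is
// copied again when it is returned and assigned.
Vec3 offsetByValue(Vec3 p, float dy)
{
    p.Y += dy;
    return p;
}

// By reference: the input is read in place and the result is written
// straight into the destination slot, with no intermediate struct copies.
void offsetByRef(const Vec3& in, float dy, Vec3& out)
{
    out.X = in.X;
    out.Y = in.Y + dy;
    out.Z = in.Z;
}
```

A call like `offsetByRef(cpu.Position, 0.5f, gpu.Position)` writes directly into the destination vertex, which mirrors how the sample passes `ref cpuVertices[i].Position` and `out gpuVertices[i].Position`.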

Platform   Device         Original     NEON         Original (flat)   NEON (flat)
XNA        HD7 (WP7.5)    28.412 ms    18.463 ms    34.210 ms         14.099 ms
XNA        L 620 (WP8)     8.750 ms    14.159 ms     9.421 ms          9.7598 ms
MonoGame   L 620 (WP8)    11.769 ms    25.639 ms    12.758 ms         21.035 ms




 If you still support WP7 with XNA, always use CPU skinning (the GPUs were really weak), enable NEON (EnableFPIntrinsicsUsingSIMD inside AssemblyInfo.cs), and use the NEON version of CPUSkinning.

 For WP8 the GPU is fast enough to do skinning, but you can always use that extra headroom for rich shading, post-processing, etc. Skinning is one of the few things that you can move to a second thread, so it comes for free if you do so. Use the original skinning code, which is better optimized and performs better in the absence of NEON/XNA.


CPUSkinning - 02 - Neon.zip (10.99 mb)


2. February 2014 01:13
by nKast

CPU Skinning: Better Loading times


One of the issues I had to resolve during the development of our next game was slow loading times on WP8. After some investigation I figured out that about half of the time was spent loading models with skinning information.

I use the code from the CPU Skinning sample. The sample demonstrates how to do animations efficiently on mobile devices, which means all other aspects are kept as simple as possible so you can easily adapt it to your needs. So it comes as no surprise that the code depends on automatic serialization (reflection), which is not very efficient. Since we are going to talk about content loading in XNA / MonoGame, this post applies to traditional GPU skinning as well.

Most of the CPU cycles were wasted on serializing the list of Keyframes in AnimationClip. To resolve this we can write our own serializer. If you think this isn't worth doing, take a look at the numbers below...

Platform   Reader                        Loading time
XNA        automatic serialization        3.826 s
XNA        custom AnimationClipReader     1.970 s
MonoGame   automatic serialization       14.263 s
MonoGame   custom AnimationClipReader     7.284 s

All times were measured on a Lumia 620. You can clearly see a drop of ~50% (twice as fast!).
The produced .xnb files are also a bit smaller.
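The gain comes from writing each field in a fixed order and reading it back the same way, with no reflection involved. The sketch below shows that round trip in standard C++ over a binary stream. It is not the XNB format: `Keyframe`, `writeClip` and `readClip` are hypothetical names, and the field order (duration ticks, count, then bone, time ticks and a 16-float transform per keyframe) is an assumption modeled on the reader shown later in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <vector>

struct Keyframe
{
    std::int32_t bone;
    std::int64_t timeTicks;
    float transform[16];
};

template <typename T>
void writeRaw(std::ostream& out, const T& value)
{
    out.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

template <typename T>
T readRaw(std::istream& in)
{
    T value{};
    in.read(reinterpret_cast<char*>(&value), sizeof(T));
    return value;
}

// Writer: duration, keyframe count, then every keyframe field by field.
void writeClip(std::ostream& out, std::int64_t durationTicks,
               const std::vector<Keyframe>& keyframes)
{
    writeRaw(out, durationTicks);
    writeRaw(out, static_cast<std::int32_t>(keyframes.size()));
    for (const Keyframe& k : keyframes)
    {
        writeRaw(out, k.bone);
        writeRaw(out, k.timeTicks);
        out.write(reinterpret_cast<const char*>(k.transform), sizeof(k.transform));
    }
}

// Reader: must consume the fields in exactly the order the writer produced them.
std::vector<Keyframe> readClip(std::istream& in, std::int64_t& durationTicks)
{
    durationTicks = readRaw<std::int64_t>(in);
    const std::int32_t count = readRaw<std::int32_t>(in);
    std::vector<Keyframe> keyframes;
    keyframes.reserve(static_cast<std::size_t>(count)); // size is known up front
    for (std::int32_t i = 0; i < count; ++i)
    {
        Keyframe k{};
        k.bone = readRaw<std::int32_t>(in);
        k.timeTicks = readRaw<std::int64_t>(in);
        in.read(reinterpret_cast<char*>(k.transform), sizeof(k.transform));
        keyframes.push_back(k);
    }
    return keyframes;
}
```

Each keyframe takes exactly 76 bytes in this layout, with no type names or member metadata, which is consistent with the smaller files noted above.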

The first step is to write a new ContentTypeWriter. Open the CpuSkinningPipelineExtensions project and add a new file named AnimationClipWriter.cs. Copy-paste the following code.


using CpuSkinningDataTypes;
using Microsoft.Xna.Framework.Content.Pipeline;
using Microsoft.Xna.Framework.Content.Pipeline.Serialization.Compiler;
using System;
using System.Collections.Generic;
namespace CpuSkinningPipelineExtensions
{
    /// <summary>
    /// Writes out an AnimationClip object to an XNB file to be read in by
    /// AnimationClipReader.
    /// </summary>
    [ContentTypeWriter]
    class AnimationClipWriter : ContentTypeWriter<AnimationClip>
    {
        protected override void Write(ContentWriter output, AnimationClip value)
        {
            // write duration, then the keyframes (same order the reader expects)
            WriteDuration(output, value.Duration);
            WriteKeyframes(output, value.Keyframes);
        }
        private void WriteDuration(ContentWriter output, TimeSpan duration)
        {
            output.Write(duration.Ticks);
        }
        private void WriteKeyframes(ContentWriter output, IList<Keyframe> keyframes)
        {
            Int32 count = keyframes.Count;
            output.Write(count);
            for (int i = 0; i < count; i++)
            {
                Keyframe keyframe = keyframes[i];
                output.Write(keyframe.Bone);
                output.Write(keyframe.Time.Ticks);
                output.Write(keyframe.Transform);
            }
        }
        public override string GetRuntimeType(TargetPlatform targetPlatform)
        {
            return "CpuSkinningDataTypes.AnimationClip, CpuSkinningDataTypes";
        }
        public override string GetRuntimeReader(TargetPlatform targetPlatform)
        {
            return "CpuSkinningDataTypes.AnimationClipReader, CpuSkinningDataTypes";
        }
    }
}


At this point you should rebuild the Content to get the new .XNB.

Next, open the CpuSkinningDataTypes project and add a new file named AnimationClipReader.cs. Copy-paste the following code.


using System.Collections.Generic;
using System.Collections.ObjectModel;
using Microsoft.Xna.Framework.Content;
using Microsoft.Xna.Framework.Graphics;
using Microsoft.Xna.Framework;
using System;
namespace CpuSkinningDataTypes
{
    /// <summary>
    /// A custom reader to read AnimationClip.
    /// </summary>
    public class AnimationClipReader : ContentTypeReader<AnimationClip>
    {
        protected override AnimationClip Read(ContentReader input, AnimationClip existingInstance)
        {
            AnimationClip animationClip = existingInstance;
            if (existingInstance == null)
            {
                TimeSpan duration = ReadDuration(input);
                List<Keyframe> keyframes = ReadKeyframes(input, null);
                animationClip = new AnimationClip(duration, keyframes);
            }
            else
            {
                animationClip.Duration = ReadDuration(input);
                ReadKeyframes(input, animationClip.Keyframes);
            }
            return animationClip;
        }
        private TimeSpan ReadDuration(ContentReader input)
        {
            return new TimeSpan(input.ReadInt64());
        }
        private List<Keyframe> ReadKeyframes(ContentReader input, List<Keyframe> existingInstance)
        {
            List<Keyframe> keyframes = existingInstance;
            int count = input.ReadInt32();
            if (keyframes == null)
                keyframes = new List<Keyframe>(count);
            for (int i = 0; i < count; i++)
            {
                Keyframe keyframe = new Keyframe();
                keyframe.Bone = input.ReadInt32();
                keyframe.Time = new TimeSpan(input.ReadInt64());
                keyframe.Transform = input.ReadMatrix();
                // a new list is empty, so append; an existing list is overwritten in place
                if (existingInstance == null)
                    keyframes.Add(keyframe);
                else
                    keyframes[i] = keyframe;
            }
            return keyframes;
        }
    }
}



At this point you must make a few minor changes to AnimationClip & Keyframe classes.
Open AnimationClip.cs and change the access modifier of Duration to internal protected.

 public TimeSpan Duration { get; internal protected set; }


Now, open Keyframe.cs and replace all private modifiers with internal.

public class Keyframe
{
    public int Bone { get; internal set; }
    public TimeSpan Time { get; internal set; }
    public Matrix Transform { get; internal set; }
    internal Keyframe() {}
}

That's it!


If you want to know more about how content serialization works, 
see: XNA custom content writer/reader part 1: Introduction.

The .zip file below has some extra changes to correctly reload the model after resuming under WP8/MonoGame. If you need these changes, make sure to copy both CpuSkinnedModelWriter.cs and CpuSkinnedModelReader.cs to your project and then rebuild your content.



 CPUSkinning - 01 - Loader.zip (7.18 mb)


24. November 2012 07:21
by nKast

Stencil Ball Shadows for XNA / WP7


 In the early versions of The Juggler, the first pinball game for WP7, the ball had no shadow. Later we added a shadow by using a very simple trick: a transparent texture was drawn on the table right below the ball. I have seen this in many games; it's not heavy on the GPU and gives good results. It adds volume to the ball by separating it from the table graphics.

A transparent texture is used to simulate the ball shadow

 During the development of The World of Dr. Pickaxe I reused a lot of our existing codebase, including the shadow trick. We also made many additions, such as ramps, rails and multiple levels on the table, while we built a powerful table editor.
  Soon, problems with the ball shadow were spotted. Should the shadow be cast on the table when the ball was rolling on rails above? What about when the ball was on the second level or down a ramp? There were many ideas, but none was simple or efficient. Should we accept a shadow that breaks the illusion every now and then, or remove the effect altogether?

Left: No shadows, Center: decal shadows still look good, Right: the reason you are here...

Left: No shadows, Center: decal shadows. Half the texture falls inside the ramp, 
Right: Stencil shadows. Perfect !

Left: No shadows, Center: decal shadows. Notice how the wrong shadow creates the illusion that the rail is bent down by the ball's weight. The ball also seems bigger because it is perceived as lower and farther back. Right: Stencil shadows. The ball casts a shadow on the rail *and* the table below. Notice the curved shadow on the wall and the drop target!

 There are basically two methods to generate shadows: shadow mapping (aka shadow buffering) and stencil shadows (aka shadow volumes). Let's reject the first one because it requires a custom shader, something we can't do on WP7.


 Screenshot from Doom 3. Stencil shadows (also known as shadow volumes) produce hard shadows with sharp edges. They are best suited to simulating strong sunlight or spotlights in dark environments (low ambient light).



 So, we are left with one choice: stencil shadows. That option didn't look good either, mainly because of performance. The typical steps, one by one, are:

  1. Render the scene with ambient light only.
  2. For each light source
    1. Extract the silhouette of each object casting a shadow.
    2. Extend the silhouette away from the light to generate a shadow volume 
    3. Render the shadow volume (not really) using some stencil rules.
    4. Render the shadow volume, using some other stencil rules, reverse culling this time.
    5. Render the scene again using lighting and the stencil buffer to mask covered areas. 


Hmm, let's see.

Steps 1 & 2.5 are too expensive. The Adreno 200, the GPU in first-generation Windows Phones, has a really low fill rate (around 1.5), so rendering the full scene twice is impossible. What we normally do is render the scene once with full lighting and shadows burned into the texture.

Step 2: let's limit ourselves to one static light source for now.

Steps 2.1 & 2.2 are too much work for the CPU, but we don't have to do them on every frame for static objects. If we could somehow render the scene as we already do, with full lighting and shadows, and get away with it... we might not need to bother at all. But we still need to perform those steps for the ball, don't we?

Steps 2.3 & 2.4 are fast. No texture is involved and no lighting is performed, only Z-buffer tests and writes to the stencil buffer. As a bonus, XNA can combine those steps into one!

Step 2.5: I already made it clear that we don't have the luxury of drawing the scene again. But then how can we make the shadow appear in an already lit environment? We need users to see a dark spot there... where the scene intersects with the shadow volume...


A generic ball shadow volume. 


A picture, they say, is like... well, you get the idea! If it didn't already strike you by looking at the picture above, let me elaborate. From any point you look at it, a ball's silhouette is always a circle, and its shadow volume is always a cylinder. Actually, the far end should expand the closer the ball gets to the light source, but assuming the ball doesn't fly high above the ground you can ignore that.
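Because the volume is a plain cylinder, orienting it only requires the direction from the light through the ball. The sketch below computes that axis in standard C++; `Vec3`, `shadowAxis` and `pointAlongAxis` are hypothetical names, and a single point light at a known position is assumed.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float X, Y, Z; };

// Unit vector pointing from the light through the ball. The generic cylinder
// is rotated so that its axis lines up with this direction.
Vec3 shadowAxis(const Vec3& light, const Vec3& ball)
{
    const Vec3 d = { ball.X - light.X, ball.Y - light.Y, ball.Z - light.Z };
    const float len = std::sqrt(d.X * d.X + d.Y * d.Y + d.Z * d.Z);
    assert(len > 0.0f); // light and ball must not coincide
    return { d.X / len, d.Y / len, d.Z / len };
}

// Point on the axis at `distance` from the ball centre, for example the
// centre of the cylinder's far cap.
Vec3 pointAlongAxis(const Vec3& ball, const Vec3& axis, float distance)
{
    return { ball.X + axis.X * distance,
             ball.Y + axis.Y * distance,
             ball.Z + axis.Z * distance };
}
```

With the light straight above the ball the axis points straight down, so the cylinder extends through the table below the ball.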

Here is how the ball stencil shadow actually works in The World of Dr. Pickaxe.

  1. Render the scene fully lit (shadows and lights burned into the texture, lighting disabled in the effect).
  2. Rotate the generic shadow volume to face away from the light source.
  3. Render the generic shadow volume using stencilShadowPass1State (back-face and front-face rules combined in one step).
  4. Render the generic shadow volume again using stencilShadowPass2State, with alpha blending and a dark color. This dims the pixels where the scene and the shadow volume intersect. The result is the same as rendering the full scene again, but much more efficient.
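What steps 3 and 4 do to a single pixel can be simulated on the CPU. The following standard C++ sketch assumes depths grow away from the camera, a less-or-equal depth test, and that the front face of the volume is the one that increments; `stencilAfterPass1` and `shadePixel` are hypothetical names, not XNA API.

```cpp
#include <cassert>
#include <cstdint>

// Pass 1 for one pixel: a face of the shadow volume only touches the stencil
// value when it passes the depth test, i.e. it is not hidden behind the scene.
// The front face increments and the back face decrements (8-bit, wrapping).
std::uint8_t stencilAfterPass1(float sceneDepth, float volumeFrontDepth, float volumeBackDepth)
{
    std::uint8_t stencil = 0;
    if (volumeFrontDepth <= sceneDepth)
        stencil = static_cast<std::uint8_t>(stencil + 1); // front face visible
    if (volumeBackDepth <= sceneDepth)
        stencil = static_cast<std::uint8_t>(stencil - 1); // back face visible
    return stencil;
}

// Pass 2 for one pixel: darken it only where the stencil value equals the
// reference value 1, blending the lit color toward black by shadowAlpha.
float shadePixel(float litColor, std::uint8_t stencil, float shadowAlpha)
{
    const std::uint8_t referenceStencil = 1;
    if (stencil != referenceStencil)
        return litColor;
    return litColor * (1.0f - shadowAlpha);
}
```

A scene surface lying between the front and back faces ends up with a stencil value of 1 and is dimmed. A surface entirely in front of the volume, or entirely behind it, ends up at 0 and keeps its lit color.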


Left: No shadows, Center: Stencil shadows. Right: Shadow Volume


DepthStencilState stencilShadowPass1State;
DepthStencilState stencilShadowPass2State;

stencilShadowPass1State = new DepthStencilState()
{
  DepthBufferWriteEnable = false,
  StencilEnable = true,
  TwoSidedStencilMode = true,
  StencilPass = StencilOperation.Increment,
  CounterClockwiseStencilPass = StencilOperation.Decrement
};
stencilShadowPass2State = new DepthStencilState()
{
  DepthBufferWriteEnable = false,
  StencilEnable = true,
  StencilFunction = CompareFunction.Equal,
  ReferenceStencil = 1
};


This method is a solid replacement for our previous shadowing method. Dare I say, it could even be faster! We replaced one decal draw (textured) with two draws of a lightweight model (no normals, no vertex colors, no texture).



As with the previous method, only the ball casts shadows on other objects; the scene does not cast shadows on the ball. If you want that, a possible solution is to pre-calculate shadow volumes for each object and check with bounding boxes/Farseer before performing the stencil shadow. For moving objects you still need to implement silhouette extraction and volume construction on every frame. Again, check against a worst-case bounding box before performing any calculations.

Another problem with this method is that shadows pass through other objects. Place the light source higher, or limit the rotation of the shadow volume to avoid such artifacts. 



The sample is based on Marble Maze tutorial
MarbleMazeStencil.zip (7.26 mb)