tag:blogger.com,1999:blog-25005388238984791422024-03-14T09:06:50.337-05:00GPU ExperimentsA blog which discusses various GPU applications including visualization, GPGPU and games.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.comBlogger29125tag:blogger.com,1999:blog-2500538823898479142.post-74736780494373852432011-09-18T20:21:00.002-05:002011-09-18T20:52:18.301-05:00Direct3D 11.1I just heard about Direct3D 11.1 and would like to share some of the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/hh404562(v=vs.85).aspx">new features</a>.<div><br /></div><div>Some of the major improvements:</div><div><ul><li>Shader tracing. The documentation is lacking at this time, but it appears as if it will be possible to retrieve information about registers.</li><li>Logical operations on a render target. Direct3D has been lacking this functionality for quite a while (glLogicOp in OpenGL), so it is nice to see it arrive to the API.</li><li>More powerful resource copy routines. It will now be possible to copy from one subresource into itself, even if the regions overlap.</li><li>Up to 64 UAVs can now be bound to the pipeline. Even more exciting, <b>UAVs can now be used from any pipeline stage</b>. I am already thinking of what could be done with a UAV in the hull/domain shaders...</li></ul></div>Seth Hofferthttp://www.blogger.com/profile/01021306722081666679noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-36127235386205335942010-02-19T23:01:00.009-06:002018-03-04T14:26:29.690-06:00Tessellation exampleNow that I have explained a bit about tessellation, it's time for an actual example. We'll start off with a basic cubic Bézier spline renderer.<br />
<br />
Let's start by looking at the <a href="http://en.wikipedia.org/wiki/B%C3%A9zier_curve#Cubic_B.C3.A9zier_curves">parametric function</a> used to compute a cubic Bézier curve. The control points are represented by P0, P1, P2 and P3.<br />
<br />
<b>Vertex shader</b><br />
<br />
Recall that the vertex shader is run once per control point. For this example, we just pass the control points through to the next stage.<br />
<pre><code>
struct IA_OUTPUT
{
    float3 cpoint : CPOINT;
};

struct VS_OUTPUT
{
    float3 cpoint : CPOINT;
};

VS_OUTPUT VS(IA_OUTPUT input)
{
    VS_OUTPUT output;
    output.cpoint = input.cpoint;
    return output;
}
</code></pre><br />
<b>Hull shader</b><br />
<br />
The patch constant function (HSConst below) is executed once per patch (a cubic curve in our case). Recall that the patch constant function must at least output tessellation factors. The control point function (HS below) is executed once per output control point. In our case, we just pass the control points through unmodified.<br />
<pre><code>
struct VS_OUTPUT
{
    float3 cpoint : CPOINT;
};

struct HS_CONSTANT_OUTPUT
{
    float edges[2] : SV_TessFactor;
};

struct HS_OUTPUT
{
    float3 cpoint : CPOINT;
};

HS_CONSTANT_OUTPUT HSConst()
{
    HS_CONSTANT_OUTPUT output;
    output.edges[0] = 1.0f; // Detail factor (see below for explanation)
    output.edges[1] = 8.0f; // Density factor
    return output;
}

[domain("isoline")]
[partitioning("integer")]
[outputtopology("line")]
[outputcontrolpoints(4)]
[patchconstantfunc("HSConst")]
HS_OUTPUT HS(InputPatch<VS_OUTPUT, 4> ip, uint id : SV_OutputControlPointID)
{
    HS_OUTPUT output;
    output.cpoint = ip[id].cpoint;
    return output;
}
</code></pre><br />
<b>Tessellator</b><br />
<br />
The tessellator itself is not programmable with HLSL, but it is worth noting that the tessellation takes place between the hull shader and the domain shader. The tessellation factors and the compile-time settings (domain, partitioning, output topology, etc.) influence the tessellator.<br />
<br />
<b>Domain shader</b><br />
<br />
Note that up until now, we have not used the cubic Bézier curve parametric function. The domain shader is where we use this function to compute the final position of the tessellated vertices.<br />
<pre><code>
struct HS_CONSTANT_OUTPUT
{
    float edges[2] : SV_TessFactor;
};

struct HS_OUTPUT
{
    float3 cpoint : CPOINT;
};

struct DS_OUTPUT
{
    float4 position : SV_Position;
};

[domain("isoline")]
DS_OUTPUT DS(HS_CONSTANT_OUTPUT input, OutputPatch<HS_OUTPUT, 4> op, float2 uv : SV_DomainLocation)
{
    DS_OUTPUT output;
    float t = uv.x;
    float3 pos = pow(1.0f - t, 3.0f) * op[0].cpoint
               + 3.0f * pow(1.0f - t, 2.0f) * t * op[1].cpoint
               + 3.0f * (1.0f - t) * pow(t, 2.0f) * op[2].cpoint
               + pow(t, 3.0f) * op[3].cpoint;
    output.position = float4(pos, 1.0f);
    return output;
}
</code></pre>Because this is an example, I omitted optimizations to maintain clarity.<br />
<br />
<b>Pixel shader</b><br />
<br />
This is a simple pixel shader that produces black lines.<br />
<pre><code>
struct DS_OUTPUT
{
    float4 position : SV_Position;
};

float4 PS(DS_OUTPUT input) : SV_Target0
{
    return float4(0.0f, 0.0f, 0.0f, 1.0f);
}
</code></pre><br />
<b>API setup</b><br />
<br />
Control points are treated the same way as vertices.<br />
<br />
Input assembler signature:<br />
<pre><code>
D3D11_INPUT_ELEMENT_DESC desc[] =
{
    {"CPOINT", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0}
};
</code></pre>Input assembler binding code:<br />
<pre><code>
UINT strides[] = {3 * sizeof(float)}; // 3 dimensions per control point (x,y,z)
UINT offsets[] = {0};
g_pd3dDC->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_4_CONTROL_POINT_PATCHLIST); // 4 control points per primitive
g_pd3dDC->IASetInputLayout(layout);
g_pd3dDC->IASetVertexBuffers(0, 1, &controlpoints, strides, offsets);
// Bind the shaders
// ...
// Render 4 control points (1 patch in this example, since we're using 4-control-point primitives).
// Rendering 8 control points simply means we're processing two 4-control-point primitives, and so forth.
// Instancing and indexed rendering works as expected.
g_pd3dDC->Draw(4, 0);
</code></pre><br />
Now that the shaders are out of the way, it is a good time to explain the purpose of two tessellation factors for isolines rather than just one. Recall that a single tessellation factor can be no greater than 64. When dealing with isolines, this number is rather small; it is desirable to render a single isoline patch with a high degree of tessellation. To alleviate this problem, D3D11 allows us to specify two isoline tessellation factors: a detail factor and a density factor.<br />
<br />
To understand what these factors mean, visualize a square. Now imagine that the detail factor describes how much to divide up the y axis, while the density factor describes how much to divide up the x axis. Now imagine connecting the dots along the x axis to form lines.<br />
<br />
Another way to think about this: the density factor describes how much to tessellate a line, while the detail factor describes how many times to instance the tessellated line. We can find the location within a tessellated line by using <code>SV_DomainLocation.x</code> and we can find which line we're evaluating by using <code>SV_DomainLocation.y</code>. This effectively lets us chain the lines together into one, ultra-tessellated line. Darn good use of parallelism if you ask me.<br />
<br />
Back to the example at hand: let's run some control points through this shader and see what we end up with.<br />
<br />
Consider the following control points:<br />
<pre><code>
P0 = [-1, -0.8, 0]
P1 = [ 4, -1, 0]
P2 = [-4, 1, 0]
P3 = [ 1, 0.8, 0]
</code></pre><br />
<img src="https://seth.rocks/tutorials/tessellator/cubic1.png"><br />
<br />
Keep in mind that we're using a hard-coded density tessellation factor of 8 here, which is why the result looks low-resolution. Let's up the factor to 64 and see what we get.<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/cubic2.png"><br />
<br />
Much better.<br />
<br />
There are a number of things we could do to improve upon this example. For example, to obtain more than 64 divisions per patch, we can use the detail factor to "instance" the line up to 64 times, and piece together the instanced, divided lines in the domain shader. Another thing we could do is create a geometry shader which transforms lines into triangles. We could procedurally perturb the control points in the vertex shader for animation effects. We could compute the tessellation factors as a function of the control points.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com2tag:blogger.com,1999:blog-2500538823898479142.post-12342828841597642402010-02-12T22:33:00.001-06:002010-02-12T22:35:33.969-06:00Patches and geometry shadersYou might be wondering how the geometry shader interacts with the new shader stages and the new patch primitive types.<br />
<br />
Consider a pipeline with a vertex shader, geometry shader and pixel shader. The vertex shader runs per-vertex, and once one primitive's worth of vertices have been processed, the geometry shader runs. The geometry shader runs per-primitive and outputs vertices of a potentially different primitive type.<br />
<br />
Now add a hull and domain shader to the mix. Say the input assembler primitive type is a patch with n control points. The vertex shader runs n times per patch, then the hull shader runs m times per patch (once per output control point), producing a total of m control points. The tessellator feeds the domain shader each tessellated vertex, and the domain shader outputs the processed vertex. From here, we head to the geometry shader.<br />
<br />
Recall that the tessellator can produce lines or triangles. This determines the incoming primitive type to our geometry shader. For the sake of this example, assume the tessellator is configured to output a triangular topology. Say we're emitting points from the geometry shader. This means the signature to the geometry shader looks something like this:<br />
<code><br />
void GS(triangle DOMAIN_SHADER_OUTPUT input[3], inout PointStream<GEOMETRY_SHADER_OUTPUT> stream);<br />
</code><br />
Had the tessellator been configured to deal with isolines, then we would be using <code>line DOMAIN_SHADER_OUTPUT input[2]</code> instead.<br />
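Fleshing that signature out into a minimal body (a sketch only; the struct contents are assumed, since the post specifies just the signature):

```hlsl
struct DOMAIN_SHADER_OUTPUT
{
    float4 position : SV_Position;
};

struct GEOMETRY_SHADER_OUTPUT
{
    float4 position : SV_Position;
};

[maxvertexcount(3)]
void GS(triangle DOMAIN_SHADER_OUTPUT input[3],
        inout PointStream<GEOMETRY_SHADER_OUTPUT> stream)
{
    // Emit one point per corner of the tessellated triangle.
    for (int i = 0; i < 3; ++i)
    {
        GEOMETRY_SHADER_OUTPUT output;
        output.position = input[i].position;
        stream.Append(output);
    }
}
```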
<br />
So, there you have it. The geometry shader integrates seamlessly into the tessellation model. Recall that the domain shader runs per tessellated vertex, and so we have little control over each individual triangle in that stage. A geometry shader can be used to break a patch up into individual primitives, which can be independently transformed, culled, duplicated, etc. Not to mention that geometry shaders can be instanced now... that's for a future post. :)<br />
<br />
What happens if we configure the input assembler to use a patch primitive type, but do not bind hull and domain shaders to the pipeline? Remember that the geometry shader operates on primitives, and that the new patch types are primitives... therefore, the geometry shader can operate on patch primitives!<br />
<br />
Here are some example geometry shader signatures.<br />
<br />
Input primitive type of point:<br />
<code>void GS(point VERTEX_SHADER_OUTPUT pt[1], ...);</code><br />
<br />
Input primitive type of line:<br />
<code>void GS(line VERTEX_SHADER_OUTPUT pt[2], ...);</code><br />
<br />
Input primitive type of triangle:<br />
<code>void GS(triangle VERTEX_SHADER_OUTPUT pt[3], ...);</code><br />
<br />
Input primitive type of a 25-point patch:<br />
<code>void GS(InputPatch<VERTEX_SHADER_OUTPUT, 25> pt, ...);</code><br />
<br />
Excited yet? You should be! What this essentially means is that you can make up your own primitive types (with anywhere from 1 point to 32 points) that the geometry shader can operate on. Wish you had a quad primitive type? Use a 4-point patch with a geometry shader and emit two triangles!Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com2tag:blogger.com,1999:blog-2500538823898479142.post-9637084299961213142010-02-06T14:39:00.024-06:002018-03-04T14:17:11.744-06:00D3D11 tessellation, in depthConsider the typical flow of data through the programmable pipeline:<br />
<br />
Input assembler -> Vertex shader -> Pixel shader<br />
<br />
Buffers containing per-vertex data are bound to the input assembler. The vertex shader is executed once per vertex, and each execution is given one vertex worth of data from the input assembler.<br />
<br />
Say, however, that we wish to process control points and patches instead. The vertex shader by itself isn't particularly well-suited for handling the manipulation of patches; we could store the control points in a buffer and index with SV_VertexID, but this is not very efficient, especially when dealing with 16+ control points per patch.<br />
<br />
To solve this problem, D3D11 adds two new programmable stages: the hull shader and the domain shader. Consider the following pipeline.<br />
<br />
Input assembler -> Vertex shader -> Hull shader -> Tessellator -> Domain shader -> Pixel shader<br />
<br />
Normally, the input assembler can be configured to handle points, lines, line strips, triangles and triangle strips. It turns out that it is quite elegant to add new primitive types for <span style="font-weight: bold;">patches</span>. D3D11 adds 32 new primitive types: each represents a patch with a different number of control points. That is, it's possible to describe a patch with anywhere from 1 to 32 control points.<br />
<br />
For the purpose of this example, say we've configured the input assembler to handle patches with 16 control points, and also that we're only rendering one patch. We will use a triangular patch domain.<br />
<br />
Since we're rendering one patch, we'll need a buffer with 16 points in it -- in this context, these points are control points. This buffer is bound to the input assembler as usual. The vertex shader is executed once per control point, and each execution is given one control point worth of data from the input assembler. Similar to the non-patch primitive types, the vertex shader can only see one control point at a time; it cannot see all 16 of the control points on the patch.<br />
<br />
When not using tessellation, the next shader stage is executed once the vertex shader has operated on all of the vertices of a single primitive. For example, when using the triangle primitive type, the next stage is run once for every three executions of the vertex shader. The same principle holds when using tessellation: the next stage isn't executed until all 16 control points have been transformed by 16 executions of the vertex shader.<br />
<br />
Once all 16 control points have been transformed, the hull shader executes. The hull shader consists of two parts: a patch constant function and the hull program. The patch constant function is responsible for computing data that remains constant over the entire patch. The hull program is run per control point, but unlike the vertex shader, it can see all of the control points for the entire patch.<br />
<br />
You might be wondering what the point of the hull program is. After all, we did already transform the control points in the vertex shader. The important part is that the hull program can take into account all of the control points when computing the further transformed output control points. D3D11 allows us to output a different number of control points from the hull program than we took in. This means we can perform basis transformations -- for example, using a little math we could transform 32 control points into 16 control points, which saves us some processing time later on down the pipeline. At this point, further clarification is helpful: the hull program runs once per <span style="font-weight: bold;">output</span> control point. So, if we've configured the hull program to output 4 control points, it will run 4 times total per patch. It will not run 16 times, even though we have 16 input control points.<br />
<br />
The next stage is the tessellator unit itself. This stage is not programmable with HLSL, but has a number of properties that can be set. The tessellator is responsible for producing a tessellated mesh and nothing more; it does not care at all about any user-defined data or any of our control points. The one thing it does care about, however, is the tessellation factors -- or, how much to tessellate regions of the patch. You may be wondering where we actually output these values. Since the tessellation factors are determined once per patch, we compute these in the patch constant function. Thus, the only thing given to the tessellator is the tessellation factors from the patch constant function.<br />
<br />
The topologies produced by the tessellator vary depending on how it is set up. For this example, using a triangular domain means that the tessellator will produce a tessellated triangle topology described by 3D barycentric coordinates. How cool is that?<br />
<br />
So, by this point we've transformed each control point in the vertex shader, performed a possible basis transformation of the control points in the hull program, and have determined the tessellation factors for this patch in the patch constant function, along with any other user-defined data. The tessellation factors have been run through the tessellation hardware, which has created a shiny new tessellated mesh: in this case, a tessellated triangle described with barycentric coordinates. I would like to emphasize once again that the tessellator does not care at all about anything besides the tessellation factors and a small number of configuration properties set at shader compile-time. This is what makes the D3D11 implementation so beautiful: it is very general and very powerful.<br />
<br />
You're probably wishing we could transform the tessellated mesh in arbitrary ways, and, well... we can! The next stop is the domain shader. The domain shader can be thought of as a post-tessellation vertex shader; it is run once per tessellated vertex. It is handed all of our output control points, our patch constant data, as well as a special system value which describes the barycentric coordinate of the tessellated vertex we're operating on. Barycentric coordinates are very handy when working in triangular domains, since they allow us to interpolate data quite easily over the triangle.<br />
<br />
At this point, the flow of data is familiar: the output from the domain shader is handed to the pixel shader. It is important to note that in general, 32 float4s can be passed between every shader stage. We can pass 32 float4s from the vertex shader to the hull shader, 32 float4s from the patch constant function to the domain shader, 32 float4s from the hull program to the domain shader, and 32 float4s from the domain shader to the pixel shader. In other words, a lot of data can be passed using interstage registers, not to mention we can also bind shader resource views to the vertex, hull, domain, geometry and pixel shader stages.<br />
<br />
I have left the geometry shader out of this explanation to simplify things, but it is very possible to throw a geometry shader into the mix to do some very interesting things -- one example that comes to mind is eliminating portions of a patch, or breaking it up into individual triangles to form new topologies. It is also possible to use stream-out with tessellation.<br />
<br />
Due to the general nature of the pipeline, we can even use tessellation without binding any actual control point data to the pipeline at all. Consider that the vertex shader is able to see the vertex ID (control point ID in this case) and instance ID. The hull and domain shaders can see the primitive ID (which is basically a patch ID). Using this information alone, very interesting and useful things can be accomplished: a good example is producing a large mesh consisting of many individual patches. The patches can be placed appropriately by using the primitive ID.<br />
<br />
Earlier I touched on the tessellation stages having compile-time settings. These settings are specified with the hull program. Here is an example declaration of settings.<br />
<code><br />
[domain("tri")]<br />
[partitioning("integer")]<br />
[outputtopology("triangle_cw")]<br />
[outputcontrolpoints(16)]<br />
[patchconstantfunc("HSConst")]<br />
</code><br />
<span style="font-weight: bold;">domain(x)</span> - This attribute specifies which domain we're using for our patches. In this example, I specified a triangle domain, but it's also possible to specify a quadrilateral or isoline domain.<br />
<br />
<span style="font-weight: bold;">partitioning(x)</span> - This attribute tells the tessellator how it is to interpret our tessellation factors. Integer partitioning means the tessellation factors are interpreted as integral values; there are no "in-between" tessellated vertices. The other partitioning schemes are fractional_even, fractional_odd and pow2.<br />
<br />
<span style="font-weight: bold;">outputtopology(x)</span> - This attribute tells the tessellator what kind of primitives we want to deal with after tessellation. In this case, triangle_cw means clockwise-wound triangles. Other possibilities are triangle_ccw and line.<br />
<br />
<span style="font-weight: bold;">outputcontrolpoints(x)</span> - This attribute describes how many control points we will be outputting from the hull program. We can choose to output anywhere from 0 to 32 control points which are then fed into the domain shader.<br />
<br />
<span style="font-weight: bold;">patchconstantfunc(x)</span> - This attribute specifies the name of the patch constant function, which is executed once per patch.<br />
<br />
Each stage is given different data. To illustrate this, I will show one possible function signature for each stage.<br />
<code><br />
VS_OUTPUT VS(IA_OUTPUT input, uint vertid : SV_VertexID, uint instid : SV_InstanceID);<br />
<br />
HS_CONSTANT_OUTPUT HSConst(InputPatch<VS_OUTPUT, n> ip, OutputPatch<HS_OUTPUT, m> op, uint pid : SV_PrimitiveID);<br />
HS_OUTPUT HS(InputPatch<VS_OUTPUT, n> ip, uint cpid : SV_OutputControlPointID, uint pid : SV_PrimitiveID);<br />
<br />
DS_OUTPUT DomainShader(HS_CONSTANT_OUTPUT constdata, OutputPatch<HS_OUTPUT, m> op, uint pid : SV_PrimitiveID, float3 coord : SV_DomainLocation);<br />
</code><br />
<code>SV_DomainLocation</code>'s type depends on the chosen patch domain. For the triangular domain, <code>SV_DomainLocation</code> is a <code>float3</code>. For the quad domain, it is a <code>float2</code>. For the isoline domain, it is a <code>float2</code> (for reasons which I will touch on in a future post). <code>n</code> stands for the number of input control points and <code>m</code> stands for the number of output control points.<br />
<br />
As stated earlier, the patch constant function (HSConst in this case) is required to output at least the tessellation factors. The number of tessellation factors depends on the patch domain. For the triangular domain, there are 4 factors (3 sides, 1 inner). For the quadrilateral domain, there are 6 factors (4 sides, 2 inner). For the isoline domain, there are 2 factors (detail and density).<br />
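Expressed as HLSL patch constant output structs, those factor counts look like this (the struct names are mine; the quadrilateral version matches the one used in the example further down):

```hlsl
// Triangle domain: 3 edge factors, 1 inside factor.
struct HS_CONSTANT_TRI
{
    float edges[3]  : SV_TessFactor;
    float inside    : SV_InsideTessFactor;
};

// Quadrilateral domain: 4 edge factors, 2 inside factors.
struct HS_CONSTANT_QUAD
{
    float edges[4]  : SV_TessFactor;
    float inside[2] : SV_InsideTessFactor;
};

// Isoline domain: detail and density factors only.
struct HS_CONSTANT_ISOLINE
{
    float edges[2] : SV_TessFactor;
};
```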
<br />
Let's take a look at the topology produced by the tessellator by using the wireframe rasterization mode, a quadrilateral domain, and integer partitioning.<br />
<br />
In the following patch constant function, I have chosen to use hard-coded tessellation factors. In practice, the tessellation factors are computed dynamically. The tessellation factors are <b>not</b> required to be hard-coded constants!<br />
<pre><code>
struct HS_CONSTANT_OUTPUT
{
    float edges[4] : SV_TessFactor;
    float inside[2] : SV_InsideTessFactor;
};

HS_CONSTANT_OUTPUT HSConst()
{
    HS_CONSTANT_OUTPUT output;
    output.edges[0] = 1.0f;
    output.edges[1] = 1.0f;
    output.edges[2] = 1.0f;
    output.edges[3] = 1.0f;
    output.inside[0] = 1.0f;
    output.inside[1] = 1.0f;
    return output;
}
</code></pre>
<br />
<br />
The edge factors are held constant at 1, 1, 1, 1 and the inside factors at 1, 1. The tessellator produces the following mesh:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/1.png" /><br />
<br />
What about edge factors of 3, 1, 1, 1 and inside factors of 1, 1?<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/2.png" /><br />
<br />
Edge factors of 5, 5, 5, 5 and inside factors of 1, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/3.png" /><br />
<br />
Edge factors of 1, 1, 1, 1 and inside factors of 2, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/4.png" /><br />
<br />
Edge factors of 1, 1, 1, 1 and inside factors of 4, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/5.png" /><br />
<br />
Edge factors of 1, 1, 1, 1 and inside factors of 4, 4:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/6.png" /><br />
<br />
Edge factors of 4, 4, 4, 4 and inside factors of 4, 4:<br />
(Same as edge factors of 3.5, 3.8, 3.9, 4.0 and inside factors of 3.1, 3.22!)<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/7.png" /><br />
<br />
Edge factors of 4, 4, 4, 1 and inside factors of 4, 4:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/8.png" /><br />
<br />
It should be noted that when using integer partitioning, the implementation is essentially using the ceiling of the written tessellation factors. Let's take a look at the output from the fractional_even partitioning scheme.<br />
<br />
Edge factors of 2, 1, 1, 1 and inside factors of 1, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/fraceven1.png" /><br />
<br />
Edge factors of 2.1, 1, 1, 1 and inside factors of 1, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/fraceven2.png" /><br />
<br />
Edge factors of 2.2, 1, 1, 1 and inside factors of 1, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/fraceven3.png" /><br />
<br />
Edge factors of 2.5, 1, 1, 1 and inside factors of 1, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/fraceven4.png" /><br />
<br />
Edge factors of 3, 1, 1, 1 and inside factors of 1, 1:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/fraceven5.png" /><br />
<br />
Here's a funky one with edge factors of 3, 3, 3, 3 and inside factors of 4, 6, using the fractional_odd partitioning scheme:<br />
<br />
<img src="https://seth.rocks/tutorials/tessellator/fracodd1.png" /><br />
<br />
Obviously hard-coded tessellation factors are only so useful. The real usefulness of tessellation comes into play when computing the tessellation factors dynamically, per patch, in realtime based on factors such as level of detail in a height map, camera distance, or model detail.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com4tag:blogger.com,1999:blog-2500538823898479142.post-13405349992197828892010-02-04T19:01:00.003-06:002010-02-04T19:16:16.007-06:00D3D11 Resource LimitationsEver wondered how many resources you can bind to a shader at one time? Or how many slices you can store in a 2D texture array? All of these limits can be found in the D3D header. For D3D11, this is D3D11.h. In this post, I'd like to point out notable improvements in D3D11.<br /><br />Input assembler binding points (D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT): 32, up from 16<br /><br />Maximum number of layers in a 1D or 2D texture array (D3D11_REQ_TEXTURE2D_ARRAY_AXIS_DIMENSION): 2048, up from 512<br /><br />Maximum number of interstage float4s: 32 between every stage, up from 16<br /><br />Maximum size 1D texture (D3D11_REQ_TEXTURE1D_U_DIMENSION): 16384 texels, up from 8192<br /><br />Maximum size 2D texture (D3D11_REQ_TEXTURE2D_U_OR_V_DIMENSION): 16384x16384 texels, up from 8192x8192<br /><br />Maximum number of unordered buffers that can be bound to a pixel or compute shader (D3D11_PS_CS_UAV_REGISTER_COUNT): 8, up from 0 ;)Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com1tag:blogger.com,1999:blog-2500538823898479142.post-4626442008508929062010-01-10T21:40:00.004-06:002018-03-04T14:24:26.510-06:00More tessellator funThis is just a quick update post on projects I am currently working on.<br />
<br />
1. Adding support for the tessellator to my realtime parametric surface plotting program. The original version of this program simply rendered an equally distributed mesh and offset the vertices in the vertex shader. Now I am tessellating patches in realtime to determine how many triangles should be output.<br />
<br />
The initial results are promising. Shown here are three images, each of the same plot but with a differing camera position.<br />
<br />
<img src="https://seth.rocks/projects/Uncategorized/blobtess1.png" /><br />
<img src="https://seth.rocks/projects/Uncategorized/blobtess2.png" /><br />
<img src="https://seth.rocks/projects/Uncategorized/blobtess3.png" /><br />
<br />
2. Music visualization program using wavelets. I plan on bringing the tessellator into this application once I go 3D.<br />
<br />
<img src="https://seth.rocks/projects/Uncategorized/waveletvis.png" />Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-91485598180688989452010-01-03T10:58:00.002-06:002010-01-03T11:03:27.269-06:00Integer divisionBeware integer division on the GPU.<br /><br />In a certain compute shader I am working on, this code<br /><PRE><CODE><br />if (tcoord.x < 0) tcoord.x += gridsize.x;<br />if (tcoord.y < 0) tcoord.y += gridsize.y;<br />if (tcoord.z < 0) tcoord.z += gridsize.z;<br /><br />if (tcoord.x >= gridsize.x) tcoord.x -= gridsize.x;<br />if (tcoord.y >= gridsize.y) tcoord.y -= gridsize.y;<br />if (tcoord.z >= gridsize.z) tcoord.z -= gridsize.z;<br /></CODE></PRE><br />runs much faster than using<br /><PRE><CODE><br />tcoord = (tcoord + gridsize) % gridsize;<br /></CODE></PRE>Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-42432706858720879902009-11-15T08:31:00.006-06:002009-11-15T09:06:09.572-06:00C++0x random headerC++0x includes a number of new headers, one of which includes new <a href="http://msdn.microsoft.com/en-us/library/bb982398%28VS.100%29.aspx">random</a> number facilities. The functionality is ingeniously separated into two major parts: engines and distributions.<br /><br />Engines are responsible for generating uniformly distributed random numbers. One such provided engine is the <a href="http://msdn.microsoft.com/en-us/library/ee462318%28VS.100%29.aspx">Mersenne Twister</a> engine. Distributions use the output of engines to mold the numbers to a specific distribution.<br /><br />Consider the following example.<br /><pre style="overflow:auto;"><code><br />mt19937 engine(static_cast<unsigned long>(time(NULL)));<br />exponential_distribution<double> dist;<br />cout << dist(engine) << endl;<br /></code></pre><br />The output is a single random number, following the exponential distribution. 
Now let's say we want to use this in a call to <code><a href="http://msdn.microsoft.com/en-us/library/a6acakce%28VS.100%29.aspx">generate_n</a></code> with a <a href="http://msdn.microsoft.com/en-us/library/dd293608%28VS.100%29.aspx">lambda</a>:<br /><pre style="overflow:auto;"><code><br />mt19937 engine(static_cast<unsigned long>(time(NULL)));<br />exponential_distribution<double> dist;<br />generate_n(ostream_iterator<double>(cout, "\n"), 20, [&dist,&engine]() -> double { return dist(engine); });<br /></code></pre><br />You're probably thinking that a simple <code>for</code> loop would be much cleaner here, and I don't disagree. However, there is one other thing we can do:<br /><pre style="overflow:auto;"><code><br />mt19937 engine(static_cast<unsigned long>(time(NULL)));<br />exponential_distribution<double> dist;<br />variate_generator<mt19937, exponential_distribution<double>> gen(engine, dist);<br />generate_n(ostream_iterator<double>(cout, "\n"), 20, gen);<br /></code></pre><br />That's right -- <code><a href="http://msdn.microsoft.com/en-us/library/bb982404%28VS.100%29.aspx">variate_generator</a></code> is provided to us so that we can encapsulate an engine along with a distribution. That way, a simple <code>gen()</code> gets us a random number using the desired engine and distribution.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-79535882721576768512009-09-19T10:23:00.003-05:002018-03-04T14:28:04.571-06:00TessellatorI finally decided to dive into the new tessellation shaders, and I am quite pleased. Going into it I thought it would be very specific to gaming applications, but as I've found out it is surprisingly general.<br />
<br />
New primitive topologies have been added for tessellation. Since the basic units for the new shaders are patches and control points, the new types allow you to render anywhere from 1 to 32 control points per patch.<br />
<br />
Say you're using 16-control-point patches. The vertex shader is run per control point, and the output from this stage is passed into the hull shader. The hull shader is really described in two parts: a patch-constant function and the hull program.<br />
<br />
The patch-constant function computes user-defined data and is run once per patch. This allows you to compute things that remain constant across the entire patch. The required outputs from the patch-constant function are the tessellation factors: these tell the tessellation hardware how much to tessellate the patch. The hull program is run once per output control point, and both the patch-constant function and hull program can see all control points.<br />
<br />
The next step is the actual tessellation, which is performed in a fixed-function, yet configurable stage. The tessellator ONLY looks at the tessellation factors output from your patch-constant function. The user-defined outputs from the patch-constant function and hull program are provided to the domain shader, which is run after tessellation.<br />
<br />
The domain shader is run once per tessellated vertex and is provided with the tessellated vertex's location on the patch. To me, the domain shader can be seen as a post-tessellation vertex shader; this is where you transform the tessellated vertices. The output from the domain shader is provided to the geometry shader (or the pixel shader, if not using a geometry shader).<br />
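Putting the hull-shader pieces described above into code, here is a minimal HLSL sketch for a 16-control-point quad patch. The <code>ControlPoint</code> struct and the constant factor of 8 are illustrative assumptions; a real shader would compute the factors per patch.

```hlsl
// Patch-constant function: run once per patch; must output tessellation factors.
struct PatchConstants
{
    float edges[4]  : SV_TessFactor;        // one factor per patch edge
    float inside[2] : SV_InsideTessFactor;  // interior tessellation
};

PatchConstants PatchConstantFunc(InputPatch<ControlPoint, 16> patch)
{
    PatchConstants pc;
    // Illustrative: a uniform factor; a real shader might derive this
    // from view distance or screen-space edge length.
    pc.edges[0] = pc.edges[1] = pc.edges[2] = pc.edges[3] = 8.0f;
    pc.inside[0] = pc.inside[1] = 8.0f;
    return pc;
}

// Hull program: run once per output control point.
[domain("quad")]
[partitioning("integer")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(16)]
[patchconstantfunc("PatchConstantFunc")]
ControlPoint HS(InputPatch<ControlPoint, 16> patch, uint i : SV_OutputControlPointID)
{
    return patch[i];  // pass control points through unmodified
}
```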
<br />
Here are some results from my initial experiments with the new stages:<br />
<br />
<img src="https://seth.rocks/projects/Uncategorized/tessellator3.png" /><br />
<br />
<img src="https://seth.rocks/projects/Uncategorized/tessellator2.png" /><br />
<br />
The toughest part to me is computing the per-patch tessellation factors. But since this is completely programmable, it's a fun problem.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-21497167975031281692009-08-29T10:15:00.006-05:002009-09-01T19:19:04.904-05:00D3D11 TypesI did some more fiddling with D3D11's new types and have learned new things about them.<br /><br />First, it seems that it is not possible to read from multi-component RWTexture* and RWBuffer objects due to a hardware limitation. However, it is possible to read-write the 32-bit RGBA type thanks to the way D3D handles views.<br /><br />Create the texture with a <code>DXGI_FORMAT_R8G8B8A8_TYPELESS</code> format, then for the unordered access view cast it to <code>DXGI_FORMAT_R32_UINT</code>. This allows for a common texture format to be read/written without ping ponging, which is great for in-place transformations.<br /><br />There is another reason why this is not a major limitation. Consider applications whose requirements are reading and writing a texture, but also use shared memory to reduce texture fetching. This most likely means that there is overlapped texture fetching going on (e.g., for an image convolution), and so ping ponging two textures is necessary here anyway to prevent clobbering of data between shaders.<br /><br />I have found the new structured buffer types to be much more flexible since they are independent of the texture subsystem. It is possible to read/write any structure and any element of a <code>RWStructuredBuffer</code>. Any shader can read from structured buffers, and the compute and pixel shaders can write to them. 
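As an illustration of that flexibility, a compute shader reading and writing a <code>RWStructuredBuffer</code> might look roughly like this (the <code>Particle</code> struct and the fixed timestep are made up for the sketch):

```hlsl
struct Particle
{
    float3 position;
    float3 velocity;
};

// Read-write view, bound to the pipeline through a UAV slot
RWStructuredBuffer<Particle> particles : register(u0);

[numthreads(64, 1, 1)]
void CS(uint3 id : SV_DispatchThreadID)
{
    // Any element and any field can be read and written in place
    Particle p = particles[id.x];
    p.position += p.velocity * 0.016f;  // illustrative fixed timestep
    particles[id.x] = p;
}
```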
According to John Rapp on the DX11 forum, this type also has beneficial performance characteristics.<br /><br />It should be noted that a structured buffer cannot be bound to the input assembler (not that you'd want to since you can just read from it in a random access manner), and cannot be the output of a stream-out operation. I consider these limitations minimal, since really, the input assembler probably should be going away sometime soon. As for stream out, one can just stream out to a regular <code>Buffer</code> and read from that.<br /><br />The March 2009 SDK Direct3D 11 documentation mentions that the <code>AppendStructuredBuffer</code> and <code>ConsumeStructuredBuffer</code> types operate similar to a stack in that items are appended and consumed from the end of the buffer. If this is true, this is a very nice property to have. This means it is possible to append to a structured buffer in one pass, and bind it as a plain old <code>StructuredBuffer</code> in another pass (for example, indexed by <code>SV_InstanceID</code> in the vertex shader). Or, filling up a <code>RWStructuredBuffer</code> in one pass, then consuming from it in another pass.<br /><br />I haven't played around too much with the <code>ByteAddressBuffer</code> type. From my experiments, it seems that <code>StructuredBuffer</code> is the way to go for most things. 
It seems that these replace <code>Buffer</code> for me in most of my applications.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-1882729196266424462009-08-10T17:38:00.003-05:002009-08-10T17:45:20.228-05:00New DirectWrite SamplesI am excited to see that there are <a href="http://msdn.microsoft.com/en-us/library/dd941711%28VS.85%29.aspx">new sample applications</a> that show off the capabilities of DirectWrite.<br /><br />For anyone who is thinking about porting over older GDI code to D2D/DWrite, have a look at <a href="http://msdn.microsoft.com/en-us/library/dd941792%28VS.85%29.aspx">this sample</a>. It's a word processor that renders using DWrite. This should give you a better understanding of what all DWrite can do. :)Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-49589388597937070242009-07-30T07:13:00.004-05:002009-07-30T07:20:27.337-05:00C++0x autoConsider the following iterator example:<br /><pre><code><br />map<string, pair<int, float>> data;<br /><br />for (map<string, pair<int, float>>::iterator i = data.begin(); i != data.end(); i++)<br />{<br /> // ...<br />}<br /></code></pre><br />We can clean this up a bit by using the <code>auto</code> keyword:<br /><pre><code><br />map<string, pair<int, float>> data;<br /><br />for (auto i = data.begin(); i != data.end(); i++)<br />{<br /> // ...<br />}<br /></code></pre>Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-50558655408414120232009-07-26T19:21:00.004-05:002009-07-26T19:39:17.124-05:00LimitationsHaving used Direct2D, DirectWrite and the Direct3D 11 previews, I would like to discuss some of the limitations I have run into.<br /><br />Direct2D has the ability to render into Direct3D textures. 
However, D2D does not deal with resource views directly; it uses DXGI's facilities to access surfaces. The problem comes when trying to obtain the DXGI surface representation of a 2D texture that has more than one mip in its mipmap chain and/or more than one layer. Unless I am missing something, this is simply not possible. This means that it is not possible to use D2D to render directly into a Direct3D multilayer texture (or mipmapped texture).<br /><br />Admittedly, I have not found myself needing to do this very often. Indeed, the most useful application of D2D/D3D interop to me has proved to be rendering to the backbuffer, which is neither mipmapped nor multilayered. In one scenario, however, I needed to render some numbers into a texture array. I had to create a temporary texture without any mipmap/layers, render to that using D2D, then perform an on-device copy to get it into my texture array.<br /><br />This copy could be eliminated in two ways. One way involves adding a D3D dependency to D2D, which is not the best route. The second way involves a modification to DXGI to enable the casting of multilayer/mipmapped 2D textures to surfaces; it would be nice to be able to pass in a subresource number and get a surface representing a particular subresource of a 2D texture.<br /><br />The second limitation I have run into is in the compute shader. I dislike how the number of threads per group is declared in the shader, and cannot be changed during runtime without a shader recompile. I really do not see the need for this limitation, as both OpenCL and CUDA allow the number of threads per group to be specified at runtime. That aside, I still prefer Microsoft's approach to computation on the GPU. I like that it is integrated into the Direct3D API and uses a language similar to the other shaders.<br /><br />Aside from these minor limitations, my expectations are definitely surpassed with regard to Direct2D and DirectWrite. 
I think these APIs fill in a large gap in the Windows graphics API collection.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-77496904972865021622009-06-03T21:21:00.005-05:002018-03-04T14:28:44.654-06:00UnicodeI like to make my C++ applications Unicode-aware. What do I mean by this? I use UTF-16 where I can, and convert between UTF-16 and UTF-8 if necessary.<br />
<br />
C++ has a wide character type, wchar_t, which is used for storing wide characters. The problem with wchar_t is that its size is platform-specific; on Windows, wchar_t is 2 bytes, while on most *nix-based machines it's 4 bytes. In other words, on Windows it would be used for storing UTF-16, and on *nix it'd be used for storing UTF-32.<br />
<br />
This complication is reason enough for many libraries to avoid wchar_t altogether and simply use UTF-8. However, I prefer UTF-16 as I find it to be a nice trade-off between UTF-8 and UTF-32: efficient and more compact than UTF-32 in most cases.<br />
<br />
Luckily, C++0x adds two new character types: char16_t for storing UTF-16 characters and char32_t for storing UTF-32 characters. With these new types, it will be possible to write cleaner, portable, Unicode-aware C++0x code.<br />
<br />
As an aside, Windows charmap cannot display characters past 0xFFFF, which I find to be annoying. So, I've begun writing my own Unicode character viewer using Direct2D and DirectWrite.<br />
<br />
<a href="https://seth.rocks/projects/Uncategorized/ucharmap2.png"><img src="https://seth.rocks/projects/Uncategorized/ucharmap2_small.png" /></a><br />
<br />
<a href="https://seth.rocks/projects/Uncategorized/ucharmap.png"><img src="https://seth.rocks/projects/Uncategorized/ucharmap_small.png" /></a>Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-44758163004161517712009-05-22T18:51:00.005-05:002009-05-22T19:16:33.255-05:00C++0x LambdasI recently installed the VS2010 beta so I could experiment with some of the new C++0x features. In this post, I would like to cover a few simple lambda examples.<br /><br />Let's start off simple - without lambdas. Suppose we have a vector allocated for 100 floats, and we want to fill it up with random numbers in [0,1). The most obvious way is to use a looping construct of some kind.<br /><pre style="overflow:auto;"><code><br />vector<float> vec(100);<br /><br />for (unsigned int i = 0; i < 100; i++)<br />{<br /> vec[i] = static_cast<float>(rand()) / (static_cast<float>(RAND_MAX) + 1.0f);<br />}<br /></code></pre><br />Okay, I realize I didn't need that many casts, but I like being safe. ;) While this code does the job, we can also use std::generate() to avoid the explicit loop construct:<br /><pre style="overflow:auto;"><code><br />float randval()<br />{<br /> return rand() / (RAND_MAX + 1.0f);<br />}<br /><br />vector<float> vec(100);<br /><br />generate(vec.begin(), vec.end(), randval);<br /></code></pre><br />Looks fine, right? What can we possibly do differently? We can use a lambda:<br /><pre style="overflow:auto;"><code><br />vector<float> vec(100);<br /><br />generate(vec.begin(), vec.end(), []() -> float<br />{<br /> return rand() / (RAND_MAX + 1.0f);<br />});<br /></code></pre><br />Now consider the slightly more complicated example where we have two vectors and want to produce a third vector from them. 
We can use a variant of std::transform() to do this.<br /><br />We have vectors invec1 and invec2, and want to produce outvec which is simply the elementwise product of invec1 and invec2.<br /><br />First try:<br /><pre style="overflow:auto;"><code><br />float product(float x, float y)<br />{<br /> return x * y;<br />}<br /><br />transform(invec1.begin(), invec1.end(), invec2.begin(), outvec.begin(), product);<br /></code></pre><br />Second try:<br /><pre style="overflow:auto;"><code><br />transform(invec1.begin(), invec1.end(), invec2.begin(), outvec.begin(), [](float x, float y) -> float<br />{<br /> return x * y;<br />});<br /></code></pre><br />Another application for lambdas is to use one as a sort predicate:<br /><pre style="overflow:auto;"><code><br />sort(vec.begin(), vec.end(), [](float x, float y) -> bool { return x > y; });<br /></code></pre><br />Don't forget, we can always assign a lambda to a variable to clean things up a little:<br /><pre style="overflow:auto;"><code><br />void somefunc()<br />{<br /> auto f = [](float x, float y) -> bool { return x > y; };<br /> sort(vec.begin(), vec.end(), f);<br />}<br /></code></pre><br />Now for some code to put it all together:<br /><pre style="overflow:auto;"><code><br />vector<float> vec(10);<br /><br />// Fill up the vector with random numbers in [0,1)<br />generate(vec.begin(), vec.end(), []() -> float { return rand() / (RAND_MAX + 1.0f); });<br /><br />// Sort the vector in descending order<br />sort(vec.begin(), vec.end(), [](float x, float y) -> bool { return x > y; });<br /><br />// Print the vector<br />copy(vec.begin(), vec.end(), ostream_iterator<float>(cout, " "));<br />cout << endl;<br /></code></pre>Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-78312121405776874462009-05-04T18:08:00.004-05:002018-03-04T14:33:58.360-06:00Windows 7 APIsWindows 7 introduces a number of new APIs. 
In this post I would like to focus on ITaskbarList3. With this interface, it is possible to turn your application's taskbar button into a progress bar, as well as control what shows up as the thumbnail preview. It is even possible to add buttons to the preview window, as depicted below.<br />
<br />
<img src="https://seth.rocks/projects/rd/taskbarlist.png" /><br />
<br />
The code to add the buttons is quite simple:<br />
<br />
<pre style="overflow: auto;"><code>
// Each button gets a tooltip and an image-list bitmap index
DWORD dwMask = THB_TOOLTIP | THB_BITMAP;
THUMBBUTTON tbhButtons[2] = {};
wstring btn1 = L"Button 1";
wstring btn2 = L"Button 2";
tbhButtons[0].dwMask = dwMask;
tbhButtons[0].iId = 0;
tbhButtons[0].iBitmap = 0;
btn1.copy(tbhButtons[0].szTip, btn1.length());
tbhButtons[0].szTip[btn1.length()] = L'\0';
tbhButtons[1].dwMask = dwMask;
tbhButtons[1].iId = 1;
tbhButtons[1].iBitmap = 1;
btn2.copy(tbhButtons[1].szTip, btn2.length());
tbhButtons[1].szTip[btn2.length()] = L'\0';
// The taskbar exposes its functionality through a COM object
ITaskbarList3 *ptbl;
CoCreateInstance(CLSID_TaskbarList, NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&ptbl));
ptbl->ThumbBarAddButtons(g_hWnd, ARRAYSIZE(tbhButtons), tbhButtons);
// 16x16 button images, loaded from a bitmap strip
HIMAGELIST imglist = ImageList_LoadImage(NULL, L"btns.bmp", 16, 0, CLR_NONE, IMAGE_BITMAP, LR_LOADFROMFILE | LR_CREATEDIBSECTION);
ptbl->ThumbBarSetImageList(g_hWnd, imglist);</code></pre>
Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-1102864449016895242009-04-28T00:24:00.003-05:002018-03-04T14:34:05.124-06:00More Reaction-Diffusion!I am adding stylized shaders to the reaction-diffusion program:<br />
<br />
<img src="https://seth.rocks/projects/rd/schalky2.png" /><br />
<br />
<img src="https://seth.rocks/projects/rd/schalky.png" />Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-50100217191535172932009-04-24T20:06:00.004-05:002009-04-24T20:16:41.889-05:00Reaction-DiffusionI have recently gotten into partial differential equation (PDE) visualization. In particular, I am focusing on a set of PDEs known as reaction-diffusion systems. These systems have two important terms: a reaction term, and a diffusion term (Laplacian).<br /><br />One such reaction-diffusion model is the <a href="http://groups.csail.mit.edu/mac/projects/amorphous/GrayScott/">Gray-Scott</a> model. Below is a screenshot from my visualization program, which is applying a palette on the GPU.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHakceKgwqrvR84aQFkxrnCRt2s2UQ4XAxa-pXbvHUMIlypJGWEM102VLwunY46kYipZBq_9lIYWpvZKsvAkfeU1V_D3RkJxdipTqSPW4GYrilWIjrWokY2rNRTPNVxHH8ORObtw2gKeiC/s1600-h/grayscott.png"><img style="cursor:pointer; cursor:hand;width: 320px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHakceKgwqrvR84aQFkxrnCRt2s2UQ4XAxa-pXbvHUMIlypJGWEM102VLwunY46kYipZBq_9lIYWpvZKsvAkfeU1V_D3RkJxdipTqSPW4GYrilWIjrWokY2rNRTPNVxHH8ORObtw2gKeiC/s320/grayscott.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5328430571120722450" /></a><br /><br />The Gray-Scott model can be estimated by trivial numerical methods such as the finite difference method. Because of this, it is very easy to parallelize, which means implementing it in a compute shader, OpenCL or CUDA is a simple task. I have a compute shader solver written, which will be accelerated by D3D11 hardware when it is available. 
I will share more details in an upcoming post.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com2tag:blogger.com,1999:blog-2500538823898479142.post-30889996671679829282009-03-09T22:30:00.002-05:002009-03-09T22:44:44.295-05:00D3D11 Stream TypesI have been wanting to cover more of the specifics on the new stream data types in upcoming Direct3D 11. Essentially what these types enable you to do is emit data without having to worry about order. That is, these are unordered data types; order is not preserved.<br /><br />One application of the structured buffer stream types is emitting pixel data in a structure from the pixel shader. In this scenario, it is necessary to determine how many structures are emitted - luckily, this can be done without ever reading back from the GPU. D3D11 provides the <code>CopyStructureCount</code> method to copy the number of written items into a buffer. That buffer can then be used with any of the draw indirect methods.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-15917773377039112292009-02-21T09:37:00.008-06:002009-02-26T08:26:27.013-06:00Image ConvolutionOne of the most commonly performed image post-processing effects is the image convolution. A number of tricks are employed to make convolutions more efficient on the GPU, such as using separable convolutions, upscaling a smaller image to fake a blur convolution, etc.<br /><br />The problem with using the pixel shader to perform convolutions is the redundant texture fetching. Imagine the convolution window being slid to the right by one pixel: each time, there is a large overlap in texture fetches. Ideally, we should be able to fetch the information from the texture once, and store it into a cache. 
This is where compute shaders come in.<br /><br />Compute shaders allow access to "groupshared" memory: in other words, memory that is shared amongst all of the threads in a group. Essentially what we can do is fill up a group's shared memory with a chunk of the texture, synchronize the threads, and then continue with the convolution. Only this time, we reference the shared memory instead of the texture.<br /><br />In a future post, I will provide a more complete example. But for now, I will outline the two methods:<br /><br />Method A: Pixel shader<br /><pre style="overflow:auto"><code><br />Texture2D<float> img;<br /><br />float result = 0.0f;<br /><br />int w2 = (w - 1) / 2;<br />int h2 = (h - 1) / 2;<br /><br />for (int j = -h2; j <= h2; j++)<br />{<br /> for (int i = -w2; i <= w2; i++)<br /> {<br /> result += img[int2(x + i, y + j)] * kernel[w * (j + h2) + (i + w2)];<br /> }<br />}<br /><br />return result;<br /></code></pre><br />Above, x and y represent the position of the pixel being processed, while w and h are the width and height of the convolution kernel.<br /><br />Method B: Compute shader<br /><pre style="overflow:auto"><code><br />Texture2D<float> img;<br />RWTexture2D<float> outimg;<br /><br />groupshared float smem[(BLOCKDIM + 2) * (BLOCKDIM + 2)];<br /><br />// Read texture data into smem for this group<br /><br />// Synchronize the threads<br />GroupMemoryBarrierWithGroupSync();<br /><br />float result = 0.0f;<br /><br />for (int j = 0; j < h; j++)<br />{<br /> for (int i = 0; i < w; i++)<br /> {<br /> result += smem[offset + (BLOCKDIM + 2) * j + i] * kernel[w * j + i];<br /> }<br />}<br /><br />outimg[int2(x, y)] = result;<br /></code></pre><br />Here, BLOCKDIM is the width (and height) of threads in a group, and offset is an offset into shared memory, which is a function of the thread ID within a group.<br /><br />The compute shader method substantially reduces the number of redundant fetches necessary compared to the pixel shader method, 
especially when using a non-separable kernel.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-12701264666051259202009-02-06T20:08:00.003-06:002009-02-06T20:38:18.696-06:00DirectWrite Text Layouts, Part 2In my previous post, I briefly covered DirectWrite text layouts. In this post, I would like to go into greater depth.<br /><br />A text layout essentially enables you to describe many aspects of the contents of a string - text size, text style, text weight, custom drawing effects, inline objects, etc. The methods provided by a text layout enable you to apply specific formatting to specific ranges of text.<br /><br />To backtrack a little bit, there are multiple ways of rendering text with Direct2D and DirectWrite. The first way, which I consider to be at the highest level of abstraction, is the DrawText method provided by a Direct2D render target. This method can be used to draw simple text that requires no extensive formatting. This method does not take a text layout object at all, but instead a simpler text format object.<br /><br />The second way, which I consider to be mid-level, is the DrawTextLayout method (again provided by a Direct2D render target). This method takes a text layout object and renders it.<br /><br />The third way, which I consider to be the lowest level, is the Draw method provided by a DirectWrite text layout object. This Draw method takes a custom class (a class which implements the IDWriteTextRenderer interface) and uses its callbacks to render. This may seem complex, but it is actually trivial to write a class which acts just like DrawTextLayout does in Direct2D.<br /><br />I would first like to focus on the lowest level method, since it excites me the most. Using this method, the text rendering possibilities are truly endless. The IDWriteTextRenderer interface defines six functions. 
I am going to focus on the DrawGlyphRun method in this post.<br /><br />When the Draw method is called on the text layout object with a custom class, it will call the class's DrawGlyphRun method for contiguous sets of glyphs that have similar formatting. You may be wondering how you are supposed to write a pass-through function that simply renders the glyph run it receives - simple! Direct2D provides a render target method also called "DrawGlyphRun" which is the absolute lowest level glyph rendering function that handles ClearType.<br /><br />Obviously, this is not a very interesting thing to do; this is basically what Direct2D's DrawText and DrawTextLayout use. <a href="http://msdn.microsoft.com/en-us/library/dd368048(VS.85).aspx">An example on MSDN</a> illustrates a more interesting use of a custom renderer. What they have essentially done is retrieved the glyphs' geometric information, and used Direct2D's draw/fill geometry methods.<br /><br />This brings me to another interesting use case: writing a custom rendering class to "suck out" the geometry from glyphs. This can be done to create vertex buffers for Direct3D for extruded text, as done in <a href="http://msdn.microsoft.com/en-us/library/dd372263(VS.85).aspx">this example</a>.<br /><br />As I mentioned earlier, it is possible to apply custom drawing effects to ranges of text. The way this works is simple: an application-specific object and text range are provided to the SetDrawingEffect method of the text layout object. The object provided is passed to the application-defined DrawGlyphRun method. The data can then be used in any way imaginable. Think brushes, stroke styles, transformation matrix effects, etc.<br /><br />You may be thinking: is it necessary to write a pass-through custom renderer just to use different brushes as drawing effects? 
The answer is no - the implementation provided by Direct2D's DrawTextLayout interprets drawing effects as brushes!Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com1tag:blogger.com,1999:blog-2500538823898479142.post-52928361189835548692009-02-01T22:07:00.007-06:002009-02-01T22:23:03.149-06:00DirectWrite Text LayoutsIn experimenting with DirectWrite, I discovered how to apply specific formatting to substrings: DirectWrite text layout objects.<br /><br />Consider the following code.<br /><pre style="overflow:scroll;"><code><br />m_spBackBufferRT->BeginDraw();<br /><br />wstring text = L"This is a test of the text rendering services provided by Direct2D and DirectWrite. I am testing the quality and performance of these new APIs. So far, they are proving to be quite nice.";<br />m_spBackBufferRT->DrawText(text.c_str(), text.length(), m_spTextFormat, D2D1::RectF(0.0f, 0.0f, static_cast<float>(width), static_cast<float>(height)), m_spTextBrush, D2D1_DRAW_TEXT_OPTIONS_NO_CLIP);<br /><br />m_spBackBufferRT->EndDraw();<br /></code></pre><br />The results are as expected.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCG4JpdJdQCX5gOpypj-fgqGu8ZBbzdOTtHzc6cfGBAlgTe1WtxCQHSLZosfn4DbuR2VvKAytuDZp3zXcg-Hb3LCvGGcJzczE03i7hZPVnkUEA8L4FKVy2TPNPaKoCI2KbEuzmS0-ov1_o/s1600-h/d2dtexttest1.png"><img style="cursor:pointer; cursor:hand;width: 320px; height: 227px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCG4JpdJdQCX5gOpypj-fgqGu8ZBbzdOTtHzc6cfGBAlgTe1WtxCQHSLZosfn4DbuR2VvKAytuDZp3zXcg-Hb3LCvGGcJzczE03i7hZPVnkUEA8L4FKVy2TPNPaKoCI2KbEuzmS0-ov1_o/s320/d2dtexttest1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5298048131412232418" /></a><br /><br />Now, say I want to render the substrings "Direct2D" and "DirectWrite" in bold. 
One way would be to use the font metric methods of DirectWrite and render the paragraph in multiple pieces, but this feels a bit too tedious for what I want to do. A better approach would be to use a text <i>layout</i> object.<br /><br />The following code does the trick.<br /><pre style="overflow:scroll;"><code><br />m_spBackBufferRT->BeginDraw();<br /><br />wstring text = L"This is a test of the text rendering services provided by Direct2D and DirectWrite. I am testing the quality and performance of these new APIs. So far, they are proving to be quite nice.";<br /><br />IDWriteTextLayout *m_spTextLayout;<br />m_spDWriteFactory->CreateTextLayout(text.c_str(), text.length(), m_spTextFormat, width, height, &m_spTextLayout);<br /><br />{<br /> DWRITE_TEXT_RANGE dtr = {58, 8};<br /> m_spTextLayout->SetFontWeight(DWRITE_FONT_WEIGHT_BOLD, dtr);<br />}<br /><br />{<br /> DWRITE_TEXT_RANGE dtr = {71, 11};<br /> m_spTextLayout->SetFontWeight(DWRITE_FONT_WEIGHT_BOLD, dtr);<br />}<br /><br />m_spBackBufferRT->DrawTextLayout(D2D1::Point2F(0.0f, 0.0f), m_spTextLayout, m_spTextBrush);<br /><br />m_spBackBufferRT->EndDraw();<br />m_spTextLayout->Release();<br /></code></pre><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha5ezhcA7Ql7hOJIyHj6z-Pn45EW2M3sgTd6YwMfTZRHCrTkuJ-32uvo1qr2KqhuXEgIfCaBAd7ZBgR3QMLceYN7AS9oQBMmdb2L0YW2bB5_gPJMaI1Lcq8ciy1tg5zxPlbB_Ph587aQ9j/s1600-h/d2dtexttest2.png"><img style="cursor:pointer; cursor:hand;width: 320px; height: 227px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha5ezhcA7Ql7hOJIyHj6z-Pn45EW2M3sgTd6YwMfTZRHCrTkuJ-32uvo1qr2KqhuXEgIfCaBAd7ZBgR3QMLceYN7AS9oQBMmdb2L0YW2bB5_gPJMaI1Lcq8ciy1tg5zxPlbB_Ph587aQ9j/s320/d2dtexttest2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5298049635264697442" /></a><br /><br />It cannot get much simpler than that. 
All that is needed is to supply a substring range and method-specific argument and we are set.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-52644197778659881552009-01-31T21:42:00.013-06:002009-02-01T22:25:12.693-06:00Direct2D and DirectWrite: An exampleI heavily use Direct3D in my Windows graphics programs. So why would I want to use an API meant for 2D graphics? The answer is simple: text rendering.<br /><br />Text rendering has always been a massive pain in 3D APIs, but rightfully so. Why should a low-level GPU API care about text? One solution to this is to write one's own text rendering class. I would much rather use a standard library, though. That's where Direct2D and DirectWrite come into play.<br /><br />As I mentioned in a previous post, Direct2D is actually independent from Direct3D. You can write an application that only uses Direct2D and never actually touch Direct3D in your code. (Direct2D, of course, uses Direct3D internally). This may sound inflexible when wanting to mix it with Direct3D, but the situation is quite the opposite. Thanks to DXGI, it is possible to obtain the DXGI surface representation of a Direct3D texture and hand it off to Direct2D.<br /><br />I demonstrate a simple example. In this program (a 3D vector field plotter, as a matter of fact), I am interested in displaying the time it takes to render a frame, as well as a simple performance log graph. 
I was able to eliminate a good chunk of D3D code and replace it with a small section of D2D code.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhp1nSG-3lH2Ae3YpWdjson76PTiDe14bBRpgK0xmNHpHcaqpXLvtPP18Keblauc01GnhLWuBF5PERIeKnaJ2Z1gf8JT0xo45F_1_icLuBDdzaQqNUfG2hfFJrnrOVFdvF_5blQcc-dr0z-/s1600-h/d2dhud.png"><img style="cursor:pointer; cursor:hand;width: 260px; height: 70px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhp1nSG-3lH2Ae3YpWdjson76PTiDe14bBRpgK0xmNHpHcaqpXLvtPP18Keblauc01GnhLWuBF5PERIeKnaJ2Z1gf8JT0xo45F_1_icLuBDdzaQqNUfG2hfFJrnrOVFdvF_5blQcc-dr0z-/s320/d2dhud.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5297671959554732050" /></a><br /><br />Big deal, right? Check this out.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG5cTz_uPK6MqbQ-NhPVVXmozwZz0S2RZvn8K7ILR50pUvUjjljavRDfy96uYyTeC-zAeqeF0qvCsE1m0uLhwmwfv-Gc1hhf5kR2LUqKOzH51wBC2c8HBlKndtViqD5LxwCDizusDlUqMi/s1600-h/d2dgabriola.png"><img style="cursor:pointer; cursor:hand;width: 111px; height: 83px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG5cTz_uPK6MqbQ-NhPVVXmozwZz0S2RZvn8K7ILR50pUvUjjljavRDfy96uYyTeC-zAeqeF0qvCsE1m0uLhwmwfv-Gc1hhf5kR2LUqKOzH51wBC2c8HBlKndtViqD5LxwCDizusDlUqMi/s320/d2dgabriola.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5297673334882404114" /></a><br /><br />Would you want to try rendering Gabriola by hand in a 3D graphics API? 
:)<br /><br />How about a thicker line in the performance graph, and with a dashed stroke style?<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYzygpr3-iwM0zmuMIAD4hDrkR2FJvapbyARVuqEcE13SFqh-SHrHs-rQPkunlVj1QSCKX1CzCMvFibYnN8-C1ABzXvJk0oISM8B-wMTexabbvau1iFnhEuMrIOPQ5yZZ-nCBMBSUkM1pY/s1600-h/dashedstroke.png"><img style="cursor:pointer; cursor:hand;width: 320px; height: 38px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYzygpr3-iwM0zmuMIAD4hDrkR2FJvapbyARVuqEcE13SFqh-SHrHs-rQPkunlVj1QSCKX1CzCMvFibYnN8-C1ABzXvJk0oISM8B-wMTexabbvau1iFnhEuMrIOPQ5yZZ-nCBMBSUkM1pY/s320/dashedstroke.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5297692688401250690" /></a><br /><br />I plan on posting more complete code snippets later, but for now I will get right down to the fundamental code.<br /><pre style="overflow:scroll;"><code><br />if (FAILED(DWriteCreateFactory(DWRITE_FACTORY_TYPE_SHARED, __uuidof(IDWriteFactory), reinterpret_cast<IUnknown **>(&m_spDWriteFactory)))) exit(EXIT_FAILURE);<br />m_spDWriteFactory->CreateTextFormat(L"Gabriola", NULL, DWRITE_FONT_WEIGHT_NORMAL, DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL, 30.0f, L"", &m_spTextFormat);<br /></code></pre><br />The first eye-pleasing line creates a DirectWrite factory object. We then use the factory to create a new text format. A text format encapsulates basic information such as the font family, weight, style and size.<br /><br />We then use Direct2D to draw the string.<br /><pre style="overflow:scroll;"><code><br />wstring mystring = L"Hello, world!";<br />m_spBackBufferRT->DrawText(mystring.c_str(), static_cast<UINT32>(mystring.length()), m_spTextFormat, D2D1::RectF(0.0f, 0.0f, 150.0f, 50.0f), m_spTextBrush, D2D1_DRAW_TEXT_OPTIONS_NO_CLIP);<br /></code></pre><br />As can be seen above, one of the arguments to the DrawText function is the DirectWrite text format we created earlier. 
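<br /><br />As for the dashed graph line, Direct2D handles that with a stroke style. A hedged sketch, assuming a Direct2D factory <code>m_spD2DFactory</code> and a brush <code>m_spGraphBrush</code> (both names of my own choosing; <code>x0</code> through <code>y1</code> stand in for real graph coordinates):<br /><pre style="overflow:scroll;"><code><br />// Create a reusable dashed stroke style from the factory.<br />ID2D1StrokeStyle *pStrokeStyle = NULL;<br />m_spD2DFactory->CreateStrokeStyle(D2D1::StrokeStyleProperties(D2D1_CAP_STYLE_FLAT, D2D1_CAP_STYLE_FLAT, D2D1_CAP_STYLE_FLAT, D2D1_LINE_JOIN_MITER, 10.0f, D2D1_DASH_STYLE_DASH, 0.0f), NULL, 0, &pStrokeStyle);<br /><br />// Draw one graph segment three pixels thick with the dashed style.<br />m_spBackBufferRT->DrawLine(D2D1::Point2F(x0, y0), D2D1::Point2F(x1, y1), m_spGraphBrush, 3.0f, pStrokeStyle);<br /></code></pre><br />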
In a future post I will cover in greater detail how I obtained <code>m_spBackBufferRT</code>.<br /><br />I have not forgotten about OpenGL: in such situations I highly recommend the QuesoGLC text renderer. I expect to write up QuesoGLC examples as well in the future.Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com4tag:blogger.com,1999:blog-2500538823898479142.post-42831469726504538322009-01-26T20:46:00.002-06:002009-01-26T20:52:50.893-06:00Direct2D and DirectWriteI recently installed the <a href="http://www.microsoft.com/downloadS/details.aspx?familyid=A91DC12A-FC94-4027-B67E-46BAB7C5226C&displaylang=en">Windows 7 SDK</a> so I could experiment with <a href="http://msdn.microsoft.com/en-us/library/dd370990(VS.85).aspx">Direct2D</a> and <a href="http://msdn.microsoft.com/en-us/library/dd368038(VS.85).aspx">DirectWrite</a>. I am very pleased with the APIs; they definitely simplify the text handling in my Direct3D applications.<br /><br />One of the great design points of Direct2D is that it can inter-operate with Direct3D textures through DXGI.<br /><br />Expect code snippets and screenshots soon!Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0tag:blogger.com,1999:blog-2500538823898479142.post-49600548850485286592009-01-05T14:52:00.004-06:002009-01-05T15:26:18.432-06:00D3D11 TypesDirect3D 11 introduces a number of new datatypes to HLSL.<br /><br />New read-only types:<br /><ul><li><code>ByteAddressBuffer</code></li><li><code>StructuredBuffer</code></li></ul>New read-write types:<br /><ul><li><code>RWByteAddressBuffer</code></li><li><code>RWStructuredBuffer</code></li><li><code>RWBuffer</code></li><li><code>RWTexture1D/RWTexture1DArray</code></li><li><code>RWTexture2D/RWTexture2DArray</code></li><li><code>RWTexture3D</code></li></ul>New stream types:<br /><ul><li><code>AppendByteAddressBuffer/ConsumeByteAddressBuffer</code></li><li><code>AppendStructuredBuffer/ConsumeStructuredBuffer</code></li></ul>The 
<code>(RW)ByteAddressBuffer</code> type is a byte-addressable buffer whose offsets must be DWORD (4-byte) aligned. What this means is that I can pack an arbitrary mix of scalar and struct data into a buffer, and then pull it back out with <code>Load</code> calls and <code>asfloat</code>/<code>asuint</code> casts.<br /><br />The <code>(RW)StructuredBuffer</code> type extends the <code>Buffer</code> type by allowing arbitrary structures to be stored. For example, we might wish to store per-instance data in a structure for cleaner code:<br /><pre><code><br />struct Vert<br />{<br /> float3 color1, color2;<br /> float mixamount;<br /> float3 deform;<br />};<br /><br />StructuredBuffer<Vert> data;<br /><br />PS_INPUT VS(VS_INPUT input)<br />{<br /> Vert v = data[input.instanceid];<br /> // Use v to compute vertex properties<br />}<br /></code></pre><br />The <code>RWBuffer</code> type simply extends the <code>Buffer</code> type by allowing reading and writing in pixel and compute shaders.<br /><br />Next, we have the read-write texture types. These types open up exciting possibilities and will eliminate the need to ping-pong between two textures in some cases. They are addressed per texel.<br /><br />Finally, we have the stream data types. The stream types are meant for applications that deal with variable amounts of data and do not need to preserve the ordering of records. For example, say we want to emit per-fragment data from the pixel shader, but not into a texture. 
We can define a structure that describes a fragment, and then emit those structures.<br /><pre><code><br />struct Fragment<br />{<br /> float3 color;<br /> float depth;<br /> uint2 location;<br />};<br /><br />AppendStructuredBuffer<Fragment> data;<br /><br />void PS(...)<br />{<br /> Fragment f;<br /> f.color = ...;<br /> f.depth = ...;<br /> f.location = ...;<br /> data.Append(f);<br />}<br /></code></pre><br />Now say we would like to process each fragment in a compute shader.<br /><pre><code><br />struct Fragment<br />{<br /> float3 color;<br /> float depth;<br /> uint2 location;<br />};<br /><br />ConsumeStructuredBuffer<Fragment> data;<br />RWTexture2D<float4> frame;<br /><br />void CS(...)<br />{<br /> Fragment f = data.Consume();<br /> // Compute result and write to texture<br /> frame[f.location] = ...;<br />}<br /></code></pre>Hexadecimalhttp://www.blogger.com/profile/00150505769820452075noreply@blogger.com0
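As a footnote to the <code>(RW)ByteAddressBuffer</code> description above, here is a hedged sketch of pulling mixed data back out of one with <code>Load</code> and a cast. The 16-byte record layout is an invented example:<br /><pre><code><br />// Each record packs a float3 position (bytes 0-11) and a uint id (bytes 12-15).<br />ByteAddressBuffer raw;<br /><br />[numthreads(64, 1, 1)]<br />void CS(uint3 tid : SV_DispatchThreadID)<br />{<br /> uint offset = tid.x * 16;<br /> float3 position = asfloat(raw.Load3(offset));<br /> uint id = raw.Load(offset + 12);<br /> // Use position and id here<br />}<br /></code></pre>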