GPU Experiments

Sunday, September 18, 2011

Direct3D 11.1

I just heard about Direct3D 11.1 and would like to share some of the new features.

Some of the major improvements:

Shader tracing. The documentation is lacking at this time, but it appears as if it will be possible to retrieve information about registers.
Logical operations on a render target. Direct3D has been lacking this functionality for quite a while (glLogicOp in OpenGL), so it is nice to see it arrive to the API.
More powerful resource copy routines. It will now be possible to copy from one subresource into itself, even if the regions overlap.
Up to 64 UAVs can now be bound to the pipeline. Even more exciting, UAVs can now be used from any pipeline stage. I am already thinking of what could be done with a UAV in the hull/domain shaders...

Friday, February 19, 2010

Now that I have explained a bit about tessellation, it's time for an actual example. We'll start off with a basic cubic Bézier spline renderer.

Let's start by looking at the parametric function used to compute a cubic Bézier curve. The control points are represented by P0, P1, P2 and P3.

Vertex shader

Recall that the vertex shader is run once per control point. For this example, we just pass the control points through to the next stage.


struct IA_OUTPUT
{
    float3 cpoint : CPOINT;
};

struct VS_OUTPUT
{
    float3 cpoint : CPOINT;
};

VS_OUTPUT VS(IA_OUTPUT input)
{
    VS_OUTPUT output;
    output.cpoint = input.cpoint;
    return output;
}

Hull shader

The patch constant function (HSConst below) is executed once per patch (a cubic curve in our case). Recall that the patch constant function must at least output tessellation factors. The control point function (HS below) is executed once per output control point. In our case, we just pass the control points through unmodified.


struct VS_OUTPUT
{
    float3 cpoint : CPOINT;
};

struct HS_CONSTANT_OUTPUT
{
    float edges[2] : SV_TessFactor;
};

struct HS_OUTPUT
{
    float3 cpoint : CPOINT;
};

HS_CONSTANT_OUTPUT HSConst()
{
    HS_CONSTANT_OUTPUT output;

    output.edges[0] = 1.0f; // Detail factor (see below for explanation)
    output.edges[1] = 8.0f; // Density factor

    return output;
}

[domain("isoline")]
[partitioning("integer")]
[outputtopology("line")]
[outputcontrolpoints(4)]
[patchconstantfunc("HSConst")]
HS_OUTPUT HS(InputPatch<VS_OUTPUT, 4> ip, uint id : SV_OutputControlPointID)
{
    HS_OUTPUT output;
    output.cpoint = ip[id].cpoint;
    return output;
}

Tessellator

The actual tessellator is not programmable with HLSL, but it is worth noting that the actual tessellation takes place between the hull shader and the domain shader. The tessellation factors and compile-time settings (domain, partitioning, output topology, etc.) influence the tessellator.

Domain shader

Note that up until now, we have not used the cubic Bézier curve parametric function. The domain shader is where we use this function to compute the final position of the tessellated vertices.


struct HS_CONSTANT_OUTPUT
{
    float edges[2] : SV_TessFactor;
};

struct HS_OUTPUT
{
    float3 cpoint : CPOINT;
};

struct DS_OUTPUT
{
    float4 position : SV_Position;
};

[domain("isoline")]
DS_OUTPUT DS(HS_CONSTANT_OUTPUT input, OutputPatch<HS_OUTPUT, 4> op, float2 uv : SV_DomainLocation)
{
    DS_OUTPUT output;

    float t = uv.x;

    float3 pos = pow(1.0f - t, 3.0f) * op[0].cpoint + 3.0f * pow(1.0f - t, 2.0f) * t * op[1].cpoint + 3.0f * (1.0f - t) * pow(t, 2.0f) * op[2].cpoint + pow(t, 3.0f) * op[3].cpoint;

    output.position = float4(pos, 1.0f);

    return output;
}

Because this is an example, I omitted optimizations to maintain clarity.

Pixel shader

This is a simple pixel shader that produces black lines.


struct DS_OUTPUT
{
    float4 position : SV_Position;
};

float4 PS(DS_OUTPUT input) : SV_Target0
{
    return float4(0.0f, 0.0f, 0.0f, 1.0f);
}

API setup

Control points are treated the same way as vertices.

Input assembler signature:


D3D11_INPUT_ELEMENT_DESC desc[] =
{
    {"CPOINT", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0}
};

Input assembler binding code:


UINT strides[] = {3 * sizeof(float)}; // 3 dimensions per control point (x,y,z)
UINT offsets[] = {0};
g_pd3dDC->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_4_CONTROL_POINT_PATCHLIST); // 4 control points per primitive
g_pd3dDC->IASetInputLayout(layout);
g_pd3dDC->IASetVertexBuffers(0, 1, &controlpoints, strides, offsets);

// Bind the shaders
// ...

// Render 4 control points (1 patch in this example, since we're using 4-control-point primitives).
// Rendering 8 control points simply means we're processing two 4-control-point primitives, and so forth.
// Instancing and indexed rendering works as expected.
g_pd3dDC->Draw(4, 0);

Now that the shaders are out of the way, it is a good time to explain the purpose of two tessellation factors for isolines rather than just one. Recall that a single tessellation factor can be no greater than 64. When dealing with isolines, this number is rather small; it is desirable to render a single isoline patch with a high degree of tessellation. To alleviate this problem, D3D11 allows us to specify two isoline tessellation factors: a detail factor and a density factor.

To understand what these factors mean, visualize a square. Now imagine that the detail factor describes how much to divide up the y axis, while the density factor describes how much to divide up the x axis. Now imagine connecting the dots along the x axis to form lines.

Another way to think about this: the density factor describes how much to tessellate a line, while the detail factor describes how many times to instance the tessellated line. We can find the location within a tessellated line by using SV_DomainLocation.x and we can find which line we're evaluating by using SV_DomainLocation.y. This effectively lets us chain the lines together into one, ultra-tessellated line. Darn good use of parallelism if you ask me.

Back to the example at hand: let's run some control points through this shader and see what we end up with.

Consider the following control points:


P0 = [-1, -0.8, 0]
P1 = [ 4, -1,   0]
P2 = [-4,  1,   0]
P3 = [ 1,  0.8, 0]

Keep in mind that we're using a hard-coded density tessellation factor of 8 here, which is why the result looks low-resolution. Let's up the factor to 64 and see what we get.

Much better.

There are a number of things we could do to improve upon this example. For example, to obtain more than 64 divisions per patch, we can use the detail factor to "instance" the line up to 64 times, and piece together the instanced, divided lines in the domain shader. Another thing we could do is create a geometry shader which transforms lines into triangles. We could procedurally perturb the control points in the vertex shader for animation effects. We could compute the tessellation factors as a function of the control points.

Friday, February 12, 2010

Patches and geometry shaders

You might be wondering how the geometry shader interacts with the new shader stages and the new patch primitive types.

Consider a pipeline with a vertex shader, geometry shader and pixel shader. The vertex shader runs per-vertex, and once one primitive's worth of vertices have been processed, the geometry shader runs. The geometry shader runs per-primitive and outputs vertices of a potentially different primitive type.

Now add a hull and domain shader to the mix. Say the input assembler primitive type is a patch with n control points. The vertex shader runs n times per patch, then the hull shader runs another n times per patch, processing a total of m control points. The tessellator feeds the domain shader each tessellated vertex, and the domain shader outputs the processed vertex. From here, we head to the geometry shader.

Recall that the tessellator can produce lines or triangles. This determines the incoming primitive type to our geometry shader. For the sake of this example, assume the tessellator is configured to output a triangular topology. Say we're emitting points from the geometry shader. This means the signature to the geometry shader looks something like this:



void GS(triangle DOMAIN_SHADER_OUTPUT input[3], inout PointStream<GEOMETRY_SHADER_OUTPUT> stream);

Had the tessellator been configured to deal with isolines, then we would be using line DOMAIN_SHADER_OUTPUT input[2] instead.

So, there you have it. The geometry shader integrates seamlessly into the tessellation model. Recall that the domain shader runs per tessellated vertex, and so we have little control over each individual triangle in that stage. A geometry shader can be used to break a patch up into individual primitives, which can be independently transformed, culled, duplicated, etc. Not to mention that geometry shaders can be instanced now... that's for a future post. :)

What happens if we configure the input assembler to use a patch primitive type, but do not bind hull and domain shaders to the pipeline? Remember that the geometry shader operates on primitives, and that the new patch types are primitives... therefore, the geometry shader can operate on patch primitives!

Here are some example geometry shader signatures.

Input primitive type of point:
void GS(point VERTEX_SHADER_OUTPUT pt[1], ...);

Input primitive type of line:
void GS(line VERTEX_SHADER_OUTPUT pt[2], ...);

Input primitive type of triangle:
void GS(triangle VERTEX_SHADER_OUTPUT pt[3], ...);

Input primitive type of a 25-point patch:
void GS(InputPatch<VERTEX_SHADER_OUTPUT, 25> pt, ...);

Excited yet? You should be! What this essentially means is that you can make up your own primitive types (with anywhere from 1 point to 32 points) that the geometry shader can operate on. Wish you had a quad primitive type? Use a 4-point patch with a geometry shader and emit two triangles!

Saturday, February 6, 2010

D3D11 tessellation, in depth

Consider the typical flow of data through the programmable pipeline:

Input assembler -> Vertex shader -> Pixel shader

Buffers containing per-vertex data are bound to the input assembler. The vertex shader is executed once per vertex, and each execution is given one vertex worth of data from the input assembler.

Say, however, that we wish to process control points and patches instead. The vertex shader by itself isn't particularly well-suited for handling the manipulation of patches; we could store the control points in a buffer and index with SV_VertexID, but this is not very efficient especially when dealing with 16+ control points per patch.

To solve this problem, D3D11 adds two new programmable stages: the hull shader and the domain shader. Consider the following pipeline.

Input assembler -> Vertex shader -> Hull shader -> Tessellator -> Domain shader -> Pixel shader

Normally, the input assembler can be configured to handle points, lines, line strips, triangles and triangle strips. It turns out that it is quite elegant to add new primitive types for patches. D3D11 adds 32 new primitive types: each represents a patch with a different number of control points. That is, it's possible to describe a patch with anywhere from 1 to 32 control points.

For the purpose of this example, say we've configured the input assembler to handle patches with 16 control points, and also that we're only rendering one patch. We will use a triangular patch domain.

Since we're rendering one patch, we'll need a buffer with 16 points in it -- in this context, these points are control points. This buffer is bound to the input assembler as usual. The vertex shader is executed once per control point, and each execution is given one control point worth of data from the input assembler. Similar to the non-patch primitive types, the vertex shader can only see one control point at a time; it cannot see all 16 of the control points on the patch.

When not using tessellation, the next shader stage is executed once the vertex shader has operated on all of the vertices of a single primitive. For example, when using the triangle primitive type, the next stage is run once per every three executions of the vertex shader. The same principle holds when using tessellation: the next stage isn't executed until all 16 control points have been transformed by 16 executions of the vertex shader.

Once all 16 control points have been transformed, the hull shader executes. The hull shader consists of two parts: a patch constant function and the hull program. The patch constant function is responsible for computing data that remains constant over the entire patch. The hull program is run per control point, but unlike the vertex shader, it can see all of the control points for the entire patch.

You might be wondering what the point of the hull program is. After all, we did already transform the control points in the vertex shader. The important part is that the hull program can take into account all of the control points when computing the further transformed output control points. D3D11 allows us to output a different number of control points from the hull program than we took in. This means we can perform basis transformations -- for example, using a little math we could transform 32 control points into 16 control points, which saves us some processing time later on down the pipeline. At this point, further clarification is helpful: the hull program runs once per output control point. So, if we've configured the hull program to output 4 control points, it will run 4 times total per patch. It will not run 16 times, even though we have 16 input control points.

The next stage is the tessellator unit itself. This stage is not programmable with HLSL, but has a number of properties that can be set. The tessellator is responsible for producing a tessellated mesh and nothing more; it does not care at all about any user-defined data or any of our control points. The one thing it does care about, however, are tessellation factors -- or, how much to tessellate regions of the patch. You may be wondering where we actually output these values. Since the tessellation factors are determined once per patch, we compute these in the patch constant function. Thus, the only thing given to the tessellator is the tessellation factors from the patch constant function.

The topologies produced by the tessellator vary depending on how it is setup. For this example, using a triangular domain means that the tessellator will produce a tessellated triangle topology described by 3D barycentric coordinates. How cool is that?

So, by this point we've transformed each control point in the vertex shader, performed a possible basis transformation of the control points in the hull program, and have determined the tessellation factors for this patch in the patch constant function, along with any other user-defined data. The tessellation factors have been run through the tessellation hardware, which has created a shiny new tessellated mesh: in this case, a tessellated triangle described with barycentric coordinates. I would like to emphasize once again that the tessellator does not care at all about anything besides the tessellation factors and a small number of configuration properties set at shader compile-time. This is what makes the D3D11 implementation so beautiful: it is very general and very powerful.

You're probably wishing we could transform the tessellated mesh in arbitrary ways, and, well... we can! The next stop is the domain shader. The domain shader can be thought of as a post-tessellation vertex shader; it is run once per tessellated vertex. It is handed all of our output control points, our patch constant data, as well as a special system value which describes the barycentric coordinate of the tessellated vertex we're operating on. Barycentric coordinates are very handy when working in triangular domains, since they allow us to interpolate data quite easily over the triangle.

At this point, the flow of data is familiar: the output from the domain shader is handed to the pixel shader. It is important to note that in general, 32 float4s can be passed between every shader stage. We can pass 32 float4s from the vertex shader to the hull shader, 32 float4s from the patch constant function to the domain shader, 32 float4s from the hull program to the domain shader, and 32 float4s from the domain shader to the pixel shader. In other words, a lot of data can be passed using interstage registers, not to mention we can also bind shader resource views to the vertex, hull, domain, geometry and pixel shader stages.

I have left the geometry shader out of this explanation to simplify things, but it is very possible to throw a geometry shader into the mix to do some very interesting things -- one example that comes to mind is eliminating portions of a patch, or breaking it up into individual triangles to form new topologies. It is also possible to use stream-out with tessellation.

Due to the general nature of the pipeline, we can even use tessellation without binding any actual control point data to the pipeline at all. Consider that the vertex shader is able to see the vertex ID (control point ID in this case) and instance ID. The hull and domain shaders can see the primitive ID (which is basically a patch ID). Using this information alone, very interesting and useful things can be accomplished: a good example is producing a large mesh consisting of many individual patches. The patches can be placed appropriately by using the primitive ID.

Earlier I touched on the tessellation stages having compile-time settings. These settings are specified with the hull program. Here is an example declaration of settings.



[domain("tri")]

[partitioning("integer")]

[outputtopology("triangle_cw")]

[outputcontrolpoints(16)]

[patchconstantfunc("HSConst")]

domain(x) - This attribute specifies which domain we're using for our patches. In this example, I specified a triangle domain, but it's also possible to specify a quadrilateral or isoline domain.

partitioning(x) - This attribute tells the tessellator how it is to interpret our tessellation factors. Integer partitioning means the tessellation factors are interpreted as integral values; there are no "in-between" tessellated vertices. The other partitioning schemes are fractional_even, fractional_odd and pow2.

outputtopology(x) - This attribute tells the tessellator what kind of primitives we want to deal with after tessellation. In this case, triangle_cw means clockwise-wound triangles. Other possibilities are triangle_ccw and line.

outputcontrolpoints(x) - This attribute describes how many control points we will be outputting from the hull program. We can choose to output anywhere from 0 to 32 control points which are then fed into the domain shader.

patchconstantfunc(x) - This attribute specifies the name of the patch constant function, which is executed once per patch.

Each stage is given different data. To illustrate this, I will show one possible function signature for each stage.



VS_OUTPUT VS(IA_OUTPUT input, uint vertid : SV_VertexID, uint instid : SV_InstanceID);



HS_CONSTANT_OUTPUT HSConst(InputPatch<VS_OUTPUT, n> ip, OutputPatch<HS_OUTPUT, m> op, uint pid : SV_PrimitiveID);

HS_OUTPUT HS(InputPatch<VS_OUTPUT, n> ip, uint cpid : SV_OutputControlPointID, uint pid : SV_PrimitiveID);



DS_OUTPUT DomainShader(HS_CONSTANT_OUTPUT constdata, OutputPatch<HS_OUTPUT, m> op, uint pid : SV_PrimitiveID, float3 coord : SV_DomainLocation);

SV_DomainLocation's type depends on the chosen patch domain. For the triangular domain, SV_DomainLocation is a float3. For the quad domain, it is a float2. For the isoline domain, it is a float2 (for reasons which I will touch on in a future post). n stands for the number of input control points and m stands for the number of output control points.

As stated earlier, the patch constant function (HSConst in this case) is required to output at least the tessellation factors. The number of tessellation factors depends on the patch domain. For the triangular domain, there are 4 factors (3 sides, 1 inner). For the quadrilateral domain, there are 6 factors (4 sides, 2 inner). For the isoline domain, there are 2 factors (detail and density).

Let's take a look at the topology produced by the tessellator by using the wireframe rasterization mode, a quadrilateral domain, and integer partitioning.

In the following patch constant function, I have chosen to use hard-coded tessellation factors. In practice, the tessellation factors are computed dynamically. The tessellation factors are not required to be hard-coded constants!


struct HS_CONSTANT_OUTPUT
{
float edges[4] : SV_TessFactor;
float inside[2] : SV_InsideTessFactor;
};

HS_CONSTANT_OUTPUT HSConst()
{
HS_CONSTANT_OUTPUT output;

output.edges[0] = 1.0f;
output.edges[1] = 1.0f;
output.edges[2] = 1.0f;
output.edges[3] = 1.0f;

output.inside[0] = 1.0f;
output.inside[1] = 1.0f;

return output;
}

The edge factors are held constant at 1, 1, 1, 1 and the inside factors at 1, 1. The tessellator produces the following mesh:

What about edge factors of 3, 1, 1, 1 and inside factors of 1, 1?

Edge factors of 5, 5, 5, 5 and inside factors of 1, 1:

Edge factors of 1, 1, 1, 1 and inside factors of 2, 1:

Edge factors of 1, 1, 1, 1 and inside factors of 4, 1:

Edge factors of 1, 1, 1, 1 and inside factors of 4, 4:

Edge factors of 4, 4, 4, 4 and inside factors of 4, 4:
(Same as edge factors of 3.5, 3.8, 3.9, 4.0 and inside factors of 3.1, 3.22!)

Edge factors of 4, 4, 4, 1 and inside factors of 4, 4:

It should be noted that when using integer partitioning, the implementation is essentially using the ceiling of the written tessellation factors. Let's take a look at the output from the fractional_even partitioning scheme.

Edge factors of 2, 1, 1, 1 and inside factors of 1, 1:

Edge factors of 2.1, 1, 1, 1 and inside factors of 1, 1:

Edge factors of 2.2, 1, 1, 1 and inside factors of 1, 1:

Edge factors of 2.5, 1, 1, 1 and inside factors of 1, 1:

Edge factors of 3, 1, 1, 1 and inside factors of 1, 1:

Here's a funky one with edge factors of 3, 3, 3, 3 and inside factors of 4, 6, using the fractional_odd partitioning scheme:

Obviously hard-coded tessellation factors are only so useful. The real usefulness of tessellation comes into play when computing the tessellation factors dynamically, per patch, in realtime based on factors such as level of detail in a height map, camera distance, or model detail.

Thursday, February 4, 2010

D3D11 Resource Limitations

Ever wondered how many resources you can bind to a shader at one time? Or how many slices you can store in a 2D texture array? All of these limits can be found in the D3D header. For D3D11, this is D3D11.h. In this post, I'd like to point out notable improvements in D3D11.

Input assembler binding points (D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT): 32, up from 16

Maximum number of layers in a 1D or 2D texture array (D3D11_REQ_TEXTURE2D_ARRAY_AXIS_DIMENSION): 2048, up from 512

Maximum number of interstage float4s: 32 between every stage, up from 16

Maximum size 1D texture (D3D11_REQ_TEXTURE1D_U_DIMENSION): 16384 texels, up from 8192

Maximum size 2D texture (D3D11_REQ_TEXTURE2D_U_OR_V_DIMENSION): 16384x16384 texels, up from 8192x8192

Maximum number of unordered buffers that can be bound to a pixel or compute shader (D3D11_PS_CS_UAV_REGISTER_COUNT): 8, up from 0 ;)

Sunday, January 10, 2010

More tessellator fun

This is just a quick update post on projects I am currently working on.

1. Adding support for the tessellator into my realtime parametric surface plotting program. The original version of this program simply rendered an equally distributed mesh and offset the vertices in the vertex shader. Now I am tessellating patches in realtime to determine how many triangles should be output.

The initial results are promising. Shown here are three images, each of the same plot but with a differing camera position.

2. Music visualization program using wavelets. I plan on bringing the tessellator into this application once I go 3D.

Sunday, January 3, 2010

Integer division

Beware integer division on the GPU.

In a certain compute shader I am working on, this code


if (tcoord.x < 0) tcoord.x += gridsize.x;
if (tcoord.y < 0) tcoord.y += gridsize.y;
if (tcoord.z < 0) tcoord.z += gridsize.z;

if (tcoord.x >= gridsize.x) tcoord.x -= gridsize.x;
if (tcoord.y >= gridsize.y) tcoord.y -= gridsize.y;
if (tcoord.z >= gridsize.z) tcoord.z -= gridsize.z;

runs much faster than using


tcoord = (tcoord + gridsize) % gridsize;