A blog which discusses various GPU applications including visualization, GPGPU and games.

Sunday, November 15, 2009

C++0x random header

C++0x includes a number of new headers, one of which provides new random number facilities. The functionality is ingeniously separated into two major parts: engines and distributions.

Engines are responsible for generating uniformly distributed random numbers. One such provided engine is the Mersenne Twister engine. Distributions use the output of engines to mold the numbers to a specific distribution.

Consider the following example.

mt19937 engine(static_cast<unsigned long>(time(NULL)));
exponential_distribution<double> dist;
cout << dist(engine) << endl;

The output is a single random number, following the exponential distribution. Now let's say we want to use this in a call to generate_n with a lambda:

mt19937 engine(static_cast<unsigned long>(time(NULL)));
exponential_distribution<double> dist;
generate_n(ostream_iterator<double>(cout, "\n"), 20, [&dist,&engine]() -> double { return dist(engine); });

You're probably thinking that a simple for loop would be much cleaner here, and I don't disagree. However, there is one other thing we can do:

mt19937 engine(static_cast<unsigned long>(time(NULL)));
exponential_distribution<double> dist;
variate_generator<mt19937, exponential_distribution<double>> gen(engine, dist);
generate_n(ostream_iterator<double>(cout, "\n"), 20, gen);

That's right -- variate_generator is provided to us so that we can encapsulate an engine along with a distribution. That way, a simple gen() gets us a random number using the desired engine and distribution.

Saturday, September 19, 2009

Tessellator

I finally decided to dive into the new tessellation shaders, and I am quite pleased. Going into it I thought it would be very specific to gaming applications, but as I've found out it is surprisingly general.

New primitive topologies have been added for tessellation. Since the basic units for the new shaders are patches and control points, the new topologies allow you to render anywhere from 1 to 32 control points per patch.

Say you're using 16-control-point patches. The vertex shader is run per control point, and its output is passed into the hull shader. The hull shader is really described in two parts: a patch-constant function and the hull program.

The patch-constant function computes user-defined data and is run once per patch. This allows you to compute things that remain constant across the entire patch. The required outputs from the patch-constant function are the tessellation factors: these tell the tessellation hardware how much to tessellate the patch. The hull program is run once per output control point, and both the patch-constant function and the hull program can see all control points.

The next step is the actual tessellation, which is performed in a fixed-function, yet configurable stage. The tessellator ONLY looks at the tessellation factors output from your patch-constant function. The user-defined output from the patch-constant function and hull program are provided to the domain shader, which is run after tessellation.

The domain shader is run per-tessellated-vertex and is provided with the tessellated vertex location on the patch. To me it seems that the domain shader can be seen as a post-tessellation vertex shader; this is where you transform the tessellated vertices. The output from the domain shader is provided to the geometry shader (or pixel shader, if not using a geometry shader).

Here are some results from my initial experiments with the new stages:





The toughest part to me is computing the per-patch tessellation factors. But since this is completely programmable, it's a fun problem.

Saturday, August 29, 2009

D3D11 Types

I did some more fiddling with D3D11's new types and have learned new things about them.

First, it seems that it is not possible to read from multi-component RWTexture* and RWBuffer objects due to a hardware limitation. However, it is possible to read and write a 32-bit RGBA format thanks to the way D3D handles views.

Create the texture with a DXGI_FORMAT_R8G8B8A8_TYPELESS format, then cast it to DXGI_FORMAT_R32_UINT for the unordered access view. This allows a common texture format to be read/written without ping-ponging, which is great for in-place transformations.

There is another reason why this is not a major limitation. Consider applications that need to read and write a texture but also use shared memory to reduce texture fetching. This most likely means that there is overlapped texture fetching going on (e.g., for an image convolution), so ping-ponging two textures is necessary here anyway to prevent clobbering of data between shaders.

I have found the new structured buffer types to be much more flexible since they are independent of the texture subsystem. It is possible to read/write any structure and any element of a RWStructuredBuffer. Any shader can read from structured buffers, and the compute and pixel shaders can write to them. According to John Rapp on the DX11 forum, this type also has beneficial performance characteristics.

It should be noted that a structured buffer cannot be bound to the input assembler (not that you'd want to since you can just read from it in a random access manner), and cannot be the output of a stream-out operation. I consider these limitations minimal, since really, the input assembler probably should be going away sometime soon. As for stream out, one can just stream out to a regular Buffer and read from that.

The March 2009 SDK Direct3D 11 documentation mentions that the AppendStructuredBuffer and ConsumeStructuredBuffer types operate much like a stack, in that items are appended to and consumed from the end of the buffer. If this is true, it is a very nice property to have. It means it is possible to append to a structured buffer in one pass, and bind it as a plain old StructuredBuffer in another pass (for example, indexed by SV_InstanceID in the vertex shader). Or, fill up a RWStructuredBuffer in one pass, then consume from it in another pass.

I haven't played around too much with the ByteAddressBuffer type. From my experiments, StructuredBuffer seems to be the way to go for most things, and it replaces Buffer in most of my applications.

Monday, August 10, 2009

New DirectWrite Samples

I am excited to see that there are new sample applications that show off the capabilities of DirectWrite.

For anyone who is thinking about porting older GDI code over to D2D/DWrite, have a look at this sample. It's a word processor that renders using DWrite. This should give you a better understanding of everything DWrite can do. :)

Thursday, July 30, 2009

C++0x auto

Consider the following iterator example:

map<string, pair<int, float>> data;

for (map<string, pair<int, float>>::iterator i = data.begin(); i != data.end(); i++)
{
// ...
}

We can clean this up a bit by using the auto keyword:

map<string, pair<int, float>> data;

for (auto i = data.begin(); i != data.end(); i++)
{
// ...
}

Sunday, July 26, 2009

Limitations

Having used Direct2D, DirectWrite and the Direct3D 11 previews, I would like to discuss some of the limitations I have run into.

Direct2D has the ability to render into Direct3D textures. However, D2D does not deal with resource views directly; it uses DXGI's facilities to access surfaces. The problem comes when trying to obtain the DXGI surface representation of a 2D texture that has more than one mip in its mipmap chain and/or more than one layer. Unless I am missing something, this is simply not possible. This means that it is not possible to use D2D to render directly into a Direct3D multilayer texture (or mipmapped texture).

Admittedly, I have not found myself needing to do this very often. Indeed, the most useful application of D2D/D3D interop to me has proved to be rendering to the backbuffer, which is neither mipmapped nor multilayered. In one scenario, however, I needed to render some numbers into a texture array. I had to create a temporary texture without any mipmap/layers, render to that using D2D, then perform an on-device copy to get it into my texture array.

This copy could be eliminated in two ways. One way involves adding a D3D dependency to D2D, which is not the best route. The second way involves a modification to DXGI to enable the casting of multilayer/mipmapped 2D textures to surfaces; it would be nice to be able to pass in a subresource number and get a surface representing a particular subresource of a 2D texture.

The second limitation I have run into is in the compute shader. I dislike how the number of threads per group is declared in the shader, and cannot be changed during runtime without a shader recompile. I really do not see the need for this limitation, as both OpenCL and CUDA allow the number of threads per group to be specified at runtime. That aside, I still prefer Microsoft's approach to computation on the GPU. I like that it is integrated into the Direct3D API and uses a language similar to the other shaders.

Aside from these minor limitations, Direct2D and DirectWrite have definitely surpassed my expectations. I think these APIs fill a large gap in the Windows graphics API collection.

Wednesday, June 3, 2009

Unicode

I like to make my C++ applications Unicode-aware. What do I mean by this? I use UTF-16 where I can, and convert between UTF-16 and UTF-8 if necessary.

C++ has a wide character type, wchar_t, which is used for storing wide characters. The problem with wchar_t is that its size is platform-specific; on Windows, wchar_t is 2 bytes, while on most *nix-based machines it's 4 bytes. In other words, on Windows it would be used for storing UTF-16, and on *nix it'd be used for storing UTF-32.

This complication is reason enough for many libraries to avoid wchar_t altogether and simply use UTF-8. However, I prefer UTF-16 as I find it to be a nice trade-off between UTF-8 and UTF-32: efficient and more compact than UTF-32 in most cases.

Luckily, C++0x adds two new character types: char16_t for storing UTF-16 characters and char32_t for storing UTF-32 characters. With these new types, it will be possible to write cleaner, portable, Unicode-aware C++0x code.

As an aside, Windows charmap cannot display characters past 0xFFFF, which I find to be annoying. So, I've begun writing my own Unicode character viewer using Direct2D and DirectWrite.



Friday, May 22, 2009

C++0x Lambdas

I recently installed the VS2010 beta so I could experiment with some of the new C++0x features. In this post, I would like to cover a few simple lambda examples.

Let's start off simple - without lambdas. Suppose we have a vector allocated for 100 floats, and we want to fill it up with random numbers in [0,1). The most obvious way is to use a looping construct of some kind.

vector<float> vec(100);

for (unsigned int i = 0; i < 100; i++)
{
vec[i] = static_cast<float>(rand()) / (static_cast<float>(RAND_MAX) + 1.0f);
}

Okay, I realize I didn't need that many casts, but I like being safe. ;) While this code does the job, we can also use std::generate() to avoid the explicit loop construct:

float randval()
{
return rand() / (RAND_MAX + 1.0f);
}

vector<float> vec(100);

generate(vec.begin(), vec.end(), randval);

Looks fine, right? What can we possibly do differently? We can use a lambda:

vector<float> vec(100);

generate(vec.begin(), vec.end(), []() -> float
{
return rand() / (RAND_MAX + 1.0f);
});

Now consider the slightly more complicated example where we have two vectors and want to produce a third vector from them. We can use a variant of std::transform() to do this.

We have vectors invec1 and invec2, and want to produce outvec which is simply the elementwise product of invec1 and invec2.

First try:

float product(float x, float y)
{
return x * y;
}

transform(invec1.begin(), invec1.end(), invec2.begin(), outvec.begin(), product);

Second try:

transform(invec1.begin(), invec1.end(), invec2.begin(), outvec.begin(), [](float x, float y) -> float
{
return x * y;
});

Another application for lambdas is to use one as a sort predicate:

sort(vec.begin(), vec.end(), [](float x, float y) -> bool { return x > y; });

Don't forget, we can always assign a lambda to a variable to clean things up a little:

void somefunc(vector<float>& vec)
{
auto f = [](float x, float y) -> bool { return x > y; };
sort(vec.begin(), vec.end(), f);
}

Now for some code to put it all together:

vector<float> vec(10);

// Fill up the vector with random numbers in [0,1)
generate(vec.begin(), vec.end(), []() -> float { return rand() / (RAND_MAX + 1.0f); });

// Sort the vector in descending order
sort(vec.begin(), vec.end(), [](float x, float y) -> bool { return x > y; });

// Print the vector
copy(vec.begin(), vec.end(), ostream_iterator<float>(cout, " "));
cout << endl;

Monday, May 4, 2009

Windows 7 APIs

Windows 7 introduces a number of new APIs. In this post I would like to focus on ITaskbarList3. With this interface, it is possible to turn your application's taskbar button into a progress bar, as well as control what shows up as the thumbnail preview. It is even possible to add buttons to the preview window, as depicted below.



The code to add the buttons is quite simple:


DWORD dwMask = THB_TOOLTIP | THB_BITMAP;
THUMBBUTTON tbhButtons[2] = {};
wstring btn1 = L"Button 1";
wstring btn2 = L"Button 2";

tbhButtons[0].dwMask = dwMask;
tbhButtons[0].iId = 0;
tbhButtons[0].iBitmap = 0;
btn1.copy(tbhButtons[0].szTip, btn1.length());
tbhButtons[0].szTip[btn1.length()] = L'\0';

tbhButtons[1].dwMask = dwMask;
tbhButtons[1].iId = 1;
tbhButtons[1].iBitmap = 1;
btn2.copy(tbhButtons[1].szTip, btn2.length());
tbhButtons[1].szTip[btn2.length()] = L'\0';

ITaskbarList3 *ptbl;
CoCreateInstance(CLSID_TaskbarList, NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&ptbl));

ptbl->ThumbBarAddButtons(g_hWnd, ARRAYSIZE(tbhButtons), tbhButtons);

HIMAGELIST imglist = ImageList_LoadImage(NULL, L"btns.bmp", 16, 0, CLR_NONE, IMAGE_BITMAP, LR_LOADFROMFILE | LR_CREATEDIBSECTION);
ptbl->ThumbBarSetImageList(g_hWnd, imglist);

Tuesday, April 28, 2009

More Reaction-Diffusion!

I am adding stylized shaders to the reaction-diffusion program:



Friday, April 24, 2009

Reaction-Diffusion

I have recently gotten into partial differential equation (PDE) visualization. In particular, I am focusing on a set of PDEs known as reaction-diffusion systems. These systems have two important terms: a reaction term, and a diffusion term (Laplacian).

One such reaction-diffusion model is the Gray-Scott model. Below is a screenshot from my visualization program, which is applying a palette on the GPU.



The Gray-Scott model can be estimated by trivial numerical methods such as the finite difference method. Because of this, it is very easy to parallelize, which means implementing it in a compute shader, OpenCL or CUDA is a simple task. I have a compute shader solver written, which will be accelerated by D3D11 hardware when it is available. I will share more details in an upcoming post.

Monday, March 9, 2009

D3D11 Stream Types

I have been wanting to cover more of the specifics of the new stream data types in the upcoming Direct3D 11. Essentially, these types enable you to emit data without having to worry about order. That is, they are unordered data types; order is not preserved.

One application of the structured buffer stream types is emitting pixel data in a structure, from the pixel shader. In this scenario, it is necessary to determine how many structures are emitted - luckily, this can be manipulated without ever reading back from the GPU. D3D11 provides the CopyStructureCount method, to copy the number of written items into a buffer. That buffer can then be used with any of the draw indirect methods.

Saturday, February 21, 2009

Image Convolution

One of the most commonly performed image post-processing effects is the image convolution. A number of tricks are employed to make convolutions more efficient on the GPU, such as using separable convolutions, upscaling a smaller image to fake a blur convolution, etc.

The problem with using the pixel shader to perform convolutions is the redundant texture fetching. Imagine the convolution window being slid to the right by one pixel: each time, there is a large overlap in texture fetches. Ideally, we should be able to fetch the information from the texture once, and store it into a cache. This is where compute shaders come in.

Compute shaders allow access to "groupshared" memory: in other words, memory that is shared amongst all of the threads in a group. Essentially what we can do is fill up a group's shared memory with a chunk of the texture, synchronize the threads, and then continue with the convolution. Only this time, we reference the shared memory instead of the texture.

In a future post, I will provide a more complete example. But for now, I will outline the two methods:

Method A: Pixel shader

Texture2D<float> img;

float result = 0.0f;

int w2 = (w - 1) / 2;
int h2 = (h - 1) / 2;

for (int j = -h2; j <= h2; j++)
{
for (int i = -w2; i <= w2; i++)
{
result += img[int2(x + i, y + j)] * kernel[w * (j + h2) + (i + w2)];
}
}

return result;

Above, x and y represent the position of the pixel being processed, while w and h are the width and height of the convolution kernel.

Method B: Compute shader

Texture2D<float> img;
RWTexture2D<float> outimg;

groupshared float smem[(BLOCKDIM + 2) * (BLOCKDIM + 2)];

// Read texture data into smem for this group

// Synchronize the threads
GroupMemoryBarrierWithGroupSync();

float result = 0.0f;

for (int j = 0; j < h; j++)
{
for (int i = 0; i < w; i++)
{
result += smem[offset + (BLOCKDIM + 2) * j + i] * kernel[w * j + i];
}
}

outimg[int2(x, y)] = result;

Here, BLOCKDIM is the width (and height) of threads in a group, and offset is an offset into shared memory, which is a function of the thread ID within a group.

The compute shader method substantially reduces the number of redundant fetches compared to the pixel shader method, especially when using a non-separable kernel.

Friday, February 6, 2009

DirectWrite Text Layouts, Part 2

In my previous post, I briefly covered DirectWrite text layouts. In this post, I would like to go into greater depth.

A text layout essentially enables you to describe many aspects of the contents of a string - text size, text style, text weight, custom drawing effects, inline objects, etc. The methods provided by a text layout enable you to apply specific formatting to specific ranges of text.

To backtrack a little bit, there are multiple ways of rendering text with Direct2D and DirectWrite. The first way, a way which I consider to be at the highest level of abstraction, is the DrawText method provided by a Direct2D render target. This method can be used to draw simple text that requires no extensive formatting. This method does not take a text layout object at all, but instead a simpler text format object.

The second way, which I consider to be mid-level, is the DrawTextLayout method (again provided by a Direct2D render target). This method takes a text layout object and renders it.

The third way, which I consider to be the lowest level, is the Draw method provided by a DirectWrite text layout object. This Draw method takes a custom class (one which implements the IDWriteTextRenderer interface) and uses its callbacks to render. This may seem complex, but it is actually trivial to write a class which acts just like DrawTextLayout does in Direct2D.

I would first like to focus on the lowest level method, since it excites me the most. Using this method, the text rendering possibilities are truly endless. The IDWriteTextRenderer interface defines six functions. I am going to focus on the DrawGlyphRun method in this post.

When the Draw method is called on the text layout object with a custom class, it will call the class's DrawGlyphRun method for contiguous sets of glyphs that have similar formatting. You may be wondering how you are supposed to write a pass-through function that simply renders the glyph run it receives - simple! Direct2D provides a render target method also called "DrawGlyphRun" which is the absolute lowest level glyph rendering function that handles ClearType.

Obviously, this is not a very interesting thing to do; this is basically what Direct2D's DrawText and DrawTextLayout use. An example on MSDN illustrates a more interesting use of a custom renderer. What they have essentially done is retrieved the glyphs' geometric information, and used Direct2D's draw/fill geometry methods.

This brings me to another interesting use case: writing a custom rendering class to "suck out" the geometry from glyphs. This can be done to create vertex buffers for Direct3D for extruded text, as done in this example.

As I mentioned earlier, it is possible to apply custom drawing effects to ranges of text. The way this works is simple: an application-specific object and text range are provided to the SetDrawingEffect method of the text layout object. The object provided is passed to the application-defined DrawGlyphRun method. The data can then be used in any way imaginable. Think brushes, stroke styles, transformation matrix effects, etc.

You may be thinking: is it necessary to write a pass-through custom renderer just to use different brushes as drawing effects? The answer is no - the implementation provided by Direct2D's DrawTextLayout interprets drawing effects as brushes!

Sunday, February 1, 2009

DirectWrite Text Layouts

In experimenting with DirectWrite, I discovered how to apply specific formatting to substrings: DirectWrite text layout objects.

Consider the following code.

m_spBackBufferRT->BeginDraw();

wstring text = L"This is a test of the text rendering services provided by Direct2D and DirectWrite. I am testing the quality and performance of these new APIs. So far, they are proving to be quite nice.";
m_spBackBufferRT->DrawText(text.c_str(), text.length(), m_spTextFormat, D2D1::RectF(0.0f, 0.0f, static_cast<float>(width), static_cast<float>(height)), m_spTextBrush, D2D1_DRAW_TEXT_OPTIONS_NO_CLIP);

m_spBackBufferRT->EndDraw();

The results are as expected.



Now, say I want to render the substrings "Direct2D" and "DirectWrite" in bold. One way would be to use the font metric methods of DirectWrite and render the paragraph in multiple pieces, but this feels a bit too tedious for what I want to do. A better approach would be to use a text layout object.

The following code does the trick.

m_spBackBufferRT->BeginDraw();

wstring text = L"This is a test of the text rendering services provided by Direct2D and DirectWrite. I am testing the quality and performance of these new APIs. So far, they are proving to be quite nice.";

IDWriteTextLayout *m_spTextLayout;
m_spDWriteFactory->CreateTextLayout(text.c_str(), text.length(), m_spTextFormat, width, height, &m_spTextLayout);

{
DWRITE_TEXT_RANGE dtr = {58, 8};
m_spTextLayout->SetFontWeight(DWRITE_FONT_WEIGHT_BOLD, dtr);
}

{
DWRITE_TEXT_RANGE dtr = {71, 11};
m_spTextLayout->SetFontWeight(DWRITE_FONT_WEIGHT_BOLD, dtr);
}

m_spBackBufferRT->DrawTextLayout(D2D1::Point2F(0.0f, 0.0f), m_spTextLayout, m_spTextBrush);

m_spBackBufferRT->EndDraw();
m_spTextLayout->Release();



It cannot get much simpler than that. All that is needed is to supply a substring range and a method-specific argument, and we are set.

Saturday, January 31, 2009

Direct2D and DirectWrite: An example

I heavily use Direct3D in my Windows graphics programs. So why would I want to use an API meant for 2D graphics? The answer is simple: text rendering.

Text rendering has always been a massive pain in 3D APIs, but rightfully so. Why should a low-level GPU API care about text? One solution to this is to write one's own text rendering class. I would much rather use a standard library, though. That's where Direct2D and DirectWrite come into play.

As I mentioned in a previous post, Direct2D is actually independent from Direct3D. You can write an application that only uses Direct2D and never actually touch Direct3D in your code. (Direct2D, of course, uses Direct3D internally). This may sound inflexible when wanting to mix it with Direct3D, but the situation is quite the opposite. Thanks to DXGI, it is possible to obtain the DXGI surface representation of a Direct3D texture and hand it off to Direct2D.

Let me demonstrate with a simple example. In this program (a 3D vector field plotter, as a matter of fact), I am interested in displaying the time it takes to render a frame, as well as a simple performance log graph. I was able to eliminate a good chunk of D3D code and replace it with a small section of D2D code.



Big deal, right? Check this out.



Would you want to try rendering Gabriola by hand in a 3D graphics API? :)

How about a thicker line in the performance graph, and with a dashed stroke style?



I plan on posting more complete code snippets later, but for now I will get right down to the fundamental code.

if (FAILED(DWriteCreateFactory(DWRITE_FACTORY_TYPE_SHARED, __uuidof(IDWriteFactory), reinterpret_cast<IUnknown **>(&m_spDWriteFactory)))) exit(EXIT_FAILURE);
m_spDWriteFactory->CreateTextFormat(L"Gabriola", NULL, DWRITE_FONT_WEIGHT_NORMAL, DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL, 30, L"", &m_spTextFormat);

The first eye-pleasing line creates a DirectWrite factory object. We then use the factory to create a new text format. A text format encapsulates basic information such as the font, font weight, font style, font size, etc.

We then use Direct2D to draw the string.

wstring mystring = L"Hello, world!";
m_spBackBufferRT->DrawText(mystring.c_str(), mystring.length(), m_spTextFormat, D2D1::RectF(0.0f, 0.0f, 150.0f, 50.0f), m_spTextBrush, D2D1_DRAW_TEXT_OPTIONS_NO_CLIP);

As can be seen above, one of the arguments to the DrawText function is the DirectWrite text format we created earlier. In a future post I will cover in greater detail how I obtained m_spBackBufferRT.

I have not forgotten about OpenGL: in such situations I highly recommend the QuesoGLC text renderer. I expect to write up QuesoGLC examples as well in the future.

Monday, January 26, 2009

Direct2D and DirectWrite

I recently installed the Windows 7 SDK so I could experiment with Direct2D and DirectWrite. I am very pleased with the APIs; they definitely simplify the text handling in my Direct3D applications.

One of the great design points of Direct2D is that it can inter-operate with Direct3D textures through DXGI.

Expect code snippets and screenshots soon!

Monday, January 5, 2009

D3D11 Types

Direct3D 11 introduces a number of new datatypes to HLSL.

New read-only types:
  • ByteAddressBuffer
  • StructuredBuffer
New read-write types:
  • RWByteAddressBuffer
  • RWStructuredBuffer
  • RWBuffer
  • RWTexture1D/RWTexture1DArray
  • RWTexture2D/RWTexture2DArray
  • RWTexture3D
New stream types:
  • AppendByteAddressBuffer/ConsumeByteAddressBuffer
  • AppendStructuredBuffer/ConsumeStructuredBuffer
The (RW)ByteAddressBuffer type is a byte-addressable buffer with DWORD (4-byte) alignment. What this means is that I can pack an arbitrary mix of scalar and struct types into a buffer, and then pull the data out with a cast.

The (RW)StructuredBuffer type is an extension to the Buffer type in that it allows arbitrary structures to be stored. For example, we might wish to store per-instance data in a structure for cleaner code:

struct Vert
{
float3 color1, color2;
float mixamount;
float3 deform;
};

StructuredBuffer<Vert> data;

PS_INPUT VS(VS_INPUT input)
{
Vert v = data[input.instanceid];
// Use v to compute vertex properties
}

The RWBuffer type is simply an extension to the Buffer type in that it allows reading and writing in pixel and compute shaders.

Next, we have the read-write texture types. These new types open up exciting new possibilities and will eliminate the need to ping-pong two textures in some cases. These types are pixel-addressable.

Finally, we have the stream data types. The stream types are meant for applications that deal with variable amounts of data that need not preserve ordering of records. For example, say we want to emit per-fragment data from the pixel shader, but not into a texture. We can define a structure that describes a fragment, and then we can emit the structures.

struct Fragment
{
float3 color;
float depth;
uint2 location;
};

AppendStructuredBuffer<Fragment> data;

void PS(...)
{
Fragment f;
f.color = ...;
f.depth = ...;
f.location = ...;
data.Append(f);
}

Now say we would like to process each fragment in a compute shader.

struct Fragment
{
float3 color;
float depth;
uint2 location;
};

ConsumeStructuredBuffer<Fragment> data;
RWTexture2D<float4> frame;

void CS(...)
{
Fragment f = data.Consume();
// Compute result and write to texture
frame[f.location] = ...;
}

Friday, January 2, 2009

2D Imposters

A 2D imposter is a simple representation of a geometric shape. Why would we care about these? Imagine rendering millions of circles. Not only would the vertex shader be a bottleneck, but the quality would not be very good due to multisampling. Instead, we can use imposters.

In this example, I am rendering each circle as a simple quadrilateral. The real magic happens in the pixel shader.



Let's take a look at the shader code.

cbuffer shapes
{
float2 square[4] =
{
float2(-1.0f, -1.0f),
float2(-1.0f, 1.0f),
float2(1.0f, -1.0f),
float2(1.0f, 1.0f)
};
};

PS_INPUT VS(VS_INPUT input)
{
PS_INPUT output;

float3 vpos = float3(0.4f * square[input.vertid], 0.0f) + 2.6f * input.position;

output.position = mul(float4(vpos, 1.0f), mul(World, Projection));
output.color = input.color;
output.qpos = 1.1f * square[input.vertid];

return output;
}

float4 PS(PS_INPUT input) : SV_Target
{
float dist = length(input.qpos);
float fw = 0.8f * fwidth(dist);
float circ = smoothstep(fw, -fw, dist - 1.0f);
return float4(input.color, circ);
}

The utility of imposter shapes shows through when looking at the vertex shader. The vertex shader need only run 4 times per circle. As you can see, I am not only transforming the square's vertices, I am also passing the raw vertices over to the pixel shader. The pixel shader uses this to measure the distance between the center of the square and each fragment in order to determine which fragments should be visible and which should be blended away.

Why use imposters? Consider a visualization application which needs to render thousands of circles very quickly. This method allows for the rendering of efficient, high-quality circles. In a future post, I will show how this method can be adapted to rendering spheres as quadrilaterals.
