I've been looking for references on texture swizzling, but found very little about how it is done.
Does anyone know of any good sources (books, online, whatever) on this?
I'm writing a simple software renderer and wanted to try swizzled textures to potentially increase the performance of the bilinear filtering through texel locality.
I don't know much of the specifics, but it's my understanding that it uses a space-filling curve as mapping from addresses to (x,y) texel locations. This article shows some examples of such curves, though I don't know if any of those is the specific one used for swizzled textures.
What's the CPU architecture you're working on? Modern CPU's have really advanced caches with automatic prefetching optimized for accessing large 'matrices' in complex ways. In my experience swizzling sometimes helps a bit and sometimes hurts a bit. Add to this that it greatly complicates texture management, and you really want to think twice before using it. Michael Abrash came to the same conclusion. Good mipmapping is a lot more crucial.
Anyway, the implementation of texture swizzling is pretty simple. Just swap some of the U and V addressing bits.
For simplicity, assume we have a 16x16 texture with 8-bit color. For regular addressing, we compute the address as U + 16 * V. This gives us the following layout of the addressing bits (MSB first so 'read' from right to left):
Note that incrementing V by one takes us 16 bytes further. To prevent this, we can rearrange the bits like this (for example):
Now a small change in V isn't so drastic. Note that computing the address like this requires some masking and shifting operations though (which is the reason why swizzling is sometimes slower). An even better swizzling layout is this:
Every bit is interleaved, and small changes in either U or V correspond to the smallest possible changes in the address. This swizzling is even harder to compute thougt.
If you can't really visualize the impact of swapping U and V bits, draw a 16x16 matrix on paper and give every element an address according to one of the above layouts. Then connect the numbers. :happy: The first layout groups the texels in blocks of 4x4 x 4x4. The second creates groups of 2x2 x 2x2 x 2x2 x 2x2. :ninja:
What nick talks about is also known as Morton-Order, and it creates the z-order space filling curve. A neat advantage of this order is, that you can implement affine texture mapping with nearly zero overhead.
The usual per pixel task for texture mapping is to extract the texture address and then increment or decrement the uv coordinates. Doing the same in morton order can be done somewhat like this:
(I use 4.4 formated fixed point which is way to imprecise to be practical, but it's ok to show the idea behind it)
That's how the UV coordinates look before converting them:
( w = whole bits, f = fraction bits)
u := wwwwffff
v := wwwwffff
Now interleave zero bits into the uv coordinates:
u := w0w0w0wffff
v := w0w0w0wffff
Convert your deltas for U and V for as well, but interleave with ones:
dudx = w1w1w1wffff
dvdx = w1w1w1wffff
And here comes the trick: The deltas can just be added onto the coordinates if you mask out
the zero bit gaps after addition. This makes the addition work directly in morton order.
for (int x = left; x < right; x++)
// get texture address in morton order:
int addr = (u>>4)|((v>>4)<<1)
putpixel (x,y,texture[addr], yadda yadda)
// increment uv coordinates per pixel.
u = u + dudx;
v = v + dvdx;
// mask out the gap-bits
u &= 010101011111b;
v &= 010101011111b;
The netto cost per pixel are two AND operations. That's near to nothing if you consider that your cache coherency will skyrock. Oh, and you'll loose some bits due to the interleaving with zeros, but that shouldn't be a problem if you put everything in mmx registers or limit the texture size and subtexel precision a bit.
The only expensive part is to interleave the whole bits with ones and zeros. Unfortunately there is no clever way to do this fast unless your cpu offers a special instruction for such stuff, but you only have to do this once per polygon or once per perspective correction, so it's not that bad really.
Hope that sheds some light on how easy it is to implement texture swizzeling effectively.
A neat advantage of this order is, that you can implement affine texture mapping with nearly zero overhead.
That trick also works with other address bit interleaving orders.
The netto cost per pixel are two AND operations. That's near to nothing if you consider that your cache coherency will skyrock. ... The only expensive part is to interleave the whole bits with ones and zeros.
There's no point in 'skyrocketing' cache coherency beyond the cache line size. So it's possible to have an interleaving like UVvu that is reasonably fast to compute and should have optimal behavior.
...once per perspective correction, so it's not that bad really.
Unless you need per-pixel perspective correction. :unsure:
Still a nice bit-hacking trick though...
The funny things is that while perfect interleaving with Morton numbers seems ideally the most efficient, I am not sure that pratically it is the best in term of performance.(I ignore the cost of encoding the numbers here for software implementation)
As an example, let's say that your texture is drawn most of the time with a rotation close to -10, +10 deg, you prefer to have a swizzling scheme that make a favor for "mostly horizontal textures" thus having less misses when you walk closely to the horizontal in the texture.
Another example is that if you draw your image perfectly horizontally, the natural encoding of a bitmap is the most efficient. This prooves that a swizzling scheme really depends on the "average" orientation of textures inside a scene.(in nearest to simplify the talk)
Moreover, the size of a cache line is something to consider also.
(ex. 64 bytes/cache line on ARM cpu)
What could be interesting is to write a small piece of code that simulate the following :
- Time to fill a line of cache.
- Number of cache miss.
- Various cache size.
- Different swizzling scheme.
- Rendering at different angles.
And find what would be the most efficient for a given "architecture".
I would bet that people at NVidia or ATI have done that already,
as swizzling itself in hardware is most likely to cost zero (swap bit = route the wire differently)
By doing that, even in software you can probably find a swizzling scheme that
is quite efficient (= minimize the encoding cost also, maximize cache usage)
It could be funny to have your computer run the simulation for days and have 42 as answer