(Originally posted on reedbeta.com)
The current state of the art in GPU-supported texture compression is a set of seven formats called BC1 through BC7. These formats are used by almost all realistic 3D games to reduce the memory footprint of their texture maps. In this article, I’m going to get under the hood of each of the seven BCn formats, explain how they work, and show how their features make them effective for compressing different kinds of images.
Why Texture Compression?
Given that computer memories have gotten bigger, faster, and cheaper over the years, you might wonder whether there is still a need for textures to be compressed in memory at all. (Compression on disk is another matter, but I’m not talking about that in this article.) But at least for large-scale, realistic 3D games, aesthetic expectations have of course grown in step with the hardware improvements. There’s a continuing push for more textures with higher resolutions, to increase visual detail and reduce repetition; and as game shading models become more sophisticated, each material requires a greater number of textures: diffuse color, normal map, specular color, gloss, emissive glow, and others.
Moreover, despite the vast advancements in hardware capabilities over the last couple of decades, texture sampling in a shader is becoming relatively more expensive, not less! Processors have gotten faster and faster over the years, and so has memory—but memory is falling behind, and the gap between memory and computation has grown wider. Hardware designers are fighting an uphill battle to provide enough memory bandwidth to keep processors supplied with data to chew on. Given this state of affairs, texture compression is not just about cramming more pixels into the game; it’s a crucial performance optimization, since it reduces by a large factor the memory bandwidth required to get those pixels into the GPU’s shader cores.
It’s clear that GPU-supported texture compression is not going to go away. In fact, it is more vital than ever to building a game with high-quality graphics—and new, more sophisticated compression formats are a continuing area of research.
BC stands for “block compression”, and the BCn formats all operate in terms of 4×4 blocks of pixels. All images are sliced up into these small blocks, and each block is self-contained: the data to decode it is all in one contiguous chunk in memory. Moreover, the size of each compressed block is fixed at either 8 or 16 bytes, depending on which BCn format is being used. This represents an 8:1 or 4:1 compression ratio, respectively, if the source image is in 8-bit RGBA format (64 bytes per uncompressed 4×4 block).
This standard layout is designed to make it easy for the GPU to use these formats for rendering. GPUs need to be able to quickly access any part of a texture; they don’t read the entire image sequentially, so streaming compression algorithms and those that have variable compression ratios are poorly suited for this application. The fixed block size makes it easy to locate the BCn block containing any given pixel, and the self-contained block data means that the GPU can decompress just the part of the image it needs to access.
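To make the addressing argument concrete, here is a minimal sketch of how a block could be located. It assumes the blocks are stored row by row with no padding, which is a simplification for illustration (real GPU memory layouts are often tiled), and the function names are my own.

```python
def block_offset(x, y, width, bytes_per_block):
    """Byte offset of the 4x4 block containing pixel (x, y).

    Assumes blocks are laid out row by row with no padding.
    """
    blocks_per_row = (width + 3) // 4          # round up to whole blocks
    block_index = (y // 4) * blocks_per_row + (x // 4)
    return block_index * bytes_per_block


def compressed_size(width, height, bytes_per_block):
    """Total size in bytes of a block-compressed image."""
    return ((width + 3) // 4) * ((height + 3) // 4) * bytes_per_block


# A 256x256 texture with 8-byte blocks, versus 262144 bytes as 8-bit RGBA:
print(compressed_size(256, 256, 8))   # 32768
# Pixel (10, 5) lives in block column 2, block row 1:
print(block_offset(10, 5, 256, 8))    # 528
```

No search or sequential decoding is involved: the offset is pure arithmetic on the pixel coordinates, which is exactly what random access needs.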
Moreover, texture samples often exhibit locality in both dimensions—if one pixel from a texture is accessed, it’s likely that other pixels nearby in 2D will be accessed too. The 4×4 block structure supports this well, since it’s convenient to decompress all 16 pixels at once and then store them in the texture cache, where they can be efficiently reused for additional sampling operations.
The first three BCn formats are better known as DXT1, DXT3, and DXT5, respectively, but BCn (for n between 1 and 7) are the more up-to-date names for these formats.
Endpoints And Indices
All of the BCn formats are based around a simple idea: in many images of interest, within any small area (such as a 4×4 block) there tends to be limited variation in the set of colors present. For example, take a look at this brick texture (from CgTextures). An extract from the texture is shown here, together with several magnified 4×4 blocks from it:
If you look at these blocks, you can see what’s meant by “limited color variation”. Within a block, we often have just darker and lighter shades of a single color, or (at corners and edges) a gradient or blend between two contrasting colors. The BCn formats are designed to exploit this redundancy by separating the definition of the colors in a block from their spatial distribution.
Each block in a BCn image has a very small color palette, with just a few colors to choose from. The pixels are then coded as indices into this palette, which requires only a handful of bits per pixel, since the palette is so small. The palette is further compressed by assuming that all its colors are laid out evenly spaced along a line segment in RGB space. Then the file only has to store the endpoints of that line, and all the other palette entries are reconstructed by blending the two endpoints in different proportions.
All BCn blocks contain two main pieces of data: the endpoints of these color-space line segments, and the per-pixel palette indices that say how far along the line each pixel is.
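As a rough sketch of the endpoints-and-indices idea, independent of any particular format, the decoding step might look like the following. The function names and the simple linear palette ordering are assumptions for illustration; the real formats order their palette entries differently (BC1, for instance, puts both endpoints first).

```python
def make_palette(endpoint0, endpoint1, size):
    """Return `size` RGB colors evenly spaced from endpoint0 to endpoint1."""
    palette = []
    for i in range(size):
        t = i / (size - 1)                     # 0.0 at endpoint0, 1.0 at endpoint1
        palette.append(tuple(round((1 - t) * a + t * b)
                             for a, b in zip(endpoint0, endpoint1)))
    return palette


def decode_pixels(endpoint0, endpoint1, indices, size=4):
    """Look up each pixel's index in the palette rebuilt from the endpoints."""
    palette = make_palette(endpoint0, endpoint1, size)
    return [palette[i] for i in indices]


dark, light = (90, 40, 30), (210, 160, 150)
print(make_palette(dark, light, 4))
# [(90, 40, 30), (130, 80, 70), (170, 120, 110), (210, 160, 150)]
```

Only `dark`, `light`, and the 16 small indices would need to be stored; the two middle colors are never written to memory.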
Even without delving into the details of specific BCn formats, we can already get an idea of the kinds of situations in which they will fail. Most of the BCn formats will give poor-quality results anywhere that three very different colors are present in a single block. For example, a block containing a mix of red, green and blue pixels cannot be represented using the simpler BCn formats, because red, green, and blue do not lie along a straight line in RGB space. This presents a problem for normal maps, which often have exactly this situation. On the other hand, color textures like the brick shown above can often survive BCn compression with very little visible degradation.
As an example, here are the four blocks from above, compressed using BC1 format:
Here you can clearly see how each block has collapsed down to no more than four distinct colors (BC1 uses a four-color palette), and subtle hue variations have vanished (because the four colors must lie on a line in RGB space). The effects are obvious at the level of individual blocks, but at the level of the entire image they are much less apparent:
If you look carefully, you can see definite differences between these two images—particularly along the edges between shadowed and bright areas—but on the whole, the compressed image looks very faithful to the original.
Now that we’ve gone over the basics, let’s jump into the details of specific BCn formats.
BC1 stores RGB data. It technically supports an alpha channel, but the alpha is only 1-bit (that is, it must be either 0 or 255). It uses 8 bytes to store each 4×4 block, giving it an average data rate of 0.5 bytes per pixel. Each block consists of two color endpoints, which are stored in 2 bytes each, using RGB 5:6:5 format. The palette contains four entries generated from those endpoints, so the indices require 2 bits per pixel, making up the other 4 bytes of the block.
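Here is a small sketch of pulling those pieces out of an 8-byte block. The details beyond what is described above (little-endian storage, pixel 0 in the lowest two index bits, and expanding 5:6:5 to 8 bits by bit replication) follow my understanding of the Direct3D documentation and should be treated as assumptions.

```python
import struct


def expand_565(c):
    """Expand a 16-bit 5:6:5 color to 8 bits per channel by bit replication."""
    r5, g6, b5 = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


def unpack_bc1_block(block):
    """Split an 8-byte BC1 block into two raw endpoints and 16 2-bit indices."""
    color0, color1, index_bits = struct.unpack("<HHI", block)
    indices = [(index_bits >> (2 * i)) & 0x3 for i in range(16)]
    return color0, color1, indices


# Pure red and pure blue endpoints; the first four pixels use indices 0, 1, 2, 3.
block = struct.pack("<HHI", 0xF800, 0x001F, 0x000000E4)
color0, color1, indices = unpack_bc1_block(block)
print(expand_565(color0), expand_565(color1))   # (255, 0, 0) (0, 0, 255)
print(indices[:4])                              # [0, 1, 2, 3]
```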
BC1 is a good choice for most standard-issue color maps, unless there’s a specific reason to use one of the other formats. One such reason could be that the image requires smooth gradients. Due to the use of 5:6:5 colors, BC1 cannot represent smooth gradients well, as illustrated here:
The top gradient should look relatively smooth (although on many monitors you will still see some banding—with today’s contrast ratios eight bits aren’t enough for perfectly smooth gradients). However, the bottom gradient is much more bandy than the top one. Some bands are even visibly green, since the 5:6:5 encoding gives us colors like (57, 60, 57).
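The green tint can be reproduced with a few lines of arithmetic. This sketch assumes round-to-nearest quantization and bit-replication expansion; an actual compressor or GPU may round slightly differently.

```python
def quantize_565(rgb):
    """Round an 8-bit RGB color to 5:6:5 and expand it back to 8 bits."""
    r, g, b = rgb
    r5 = (r * 31 + 127) // 255     # nearest 5-bit level
    g6 = (g * 63 + 127) // 255     # nearest 6-bit level
    b5 = (b * 31 + 127) // 255
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


# A neutral gray picks up a green cast because green has one more bit
# than red and blue, so its levels fall at different values.
print(quantize_565((59, 59, 59)))   # (57, 60, 57)
```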
This isn’t an issue with fitting pixels into a four-color palette. The gradients above are 512 pixels wide, which means that each 4×4 block contains exactly two colors—no issue for a four-color palette to handle! The problem is that the endpoints aren’t stored with enough precision. However, this is only an issue for certain kinds of textures that involve very smooth gradients or very subtle color variations, such as skies and human skin. For these kinds of images, BC7 (described later) may be a better choice. The majority of game textures undergo little visible degradation in BC1, like the brick texture in the previous section.
Degeneracy, and Breaking It
I mentioned earlier that BC1 supports a 1-bit alpha channel. How does this fit in? After the 4 bytes of endpoints and 4 bytes of indices described above, there’s apparently no more space in the block for additional data. But there is a loophole that can be exploited to pack even more information into the same space: degeneracy! The system described above is degenerate, meaning that there are multiple ways to encode the same image. In this case, the degeneracy originates in the symmetry of the two color endpoints. Nothing about the BC1 format thus far singles out in what order the endpoints should be stored: if you swap the two, and invert the indices to compensate, you end up with exactly the same 4×4 block of pixels as before. So there is a twofold degeneracy: there are two equally good ways to store the same image.
BC1 cleverly exploits this by breaking the degeneracy: it defines an alternative mode that is triggered for a given block by the order of the endpoints. Although the endpoints are colors in 5:6:5 format, they can also be interpreted as 16-bit unsigned integers. If the first endpoint is numerically greater than the second, the above description of the format holds: the palette contains four colors spaced evenly from one endpoint to the other. But if the first endpoint is less than or equal to the second, the palette is modified: its first three entries are three colors spaced evenly from one endpoint to the other (that is, the two endpoints and their average), and the fourth entry is transparent: black with zero alpha.
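The mode switch can be sketched as follows. The palette ordering (both endpoints first, then the interpolated entries) and the truncating integer division are assumptions based on my reading of the Direct3D documentation; hardware decoders may round the interpolated colors a little differently.

```python
def expand_565(c):
    """Expand a 16-bit 5:6:5 color to 8 bits per channel by bit replication."""
    r5, g6, b5 = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


def bc1_palette(color0, color1):
    """Build the four RGBA palette entries from the raw 16-bit endpoints."""
    c0, c1 = expand_565(color0), expand_565(color1)
    if color0 > color1:
        # Four-color mode: two interpolated colors at 1/3 and 2/3.
        c2 = tuple((2 * a + b) // 3 for a, b in zip(c0, c1))
        c3 = tuple((a + 2 * b) // 3 for a, b in zip(c0, c1))
        return [c0 + (255,), c1 + (255,), c2 + (255,), c3 + (255,)]
    # Three-color mode: the average, plus transparent black.
    mid = tuple((a + b) // 2 for a, b in zip(c0, c1))
    return [c0 + (255,), c1 + (255,), mid + (255,), (0, 0, 0, 0)]


print(bc1_palette(0xFFFF, 0x0000))
# [(255, 255, 255, 255), (0, 0, 0, 255), (170, 170, 170, 255), (85, 85, 85, 255)]
print(bc1_palette(0x0000, 0xFFFF))
# [(0, 0, 0, 255), (255, 255, 255, 255), (127, 127, 127, 255), (0, 0, 0, 0)]
```

The same two endpoint values produce two different palettes depending only on which one is stored first.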
In this way, BC1 can support a 1-bit alpha by switching into this second mode for specific blocks that contain transparent pixels. However, some precision must be sacrificed for the non-transparent values in these blocks because only three distinct colors remain, rather than four. BC1 also cannot store any color information in the transparent areas. This makes it suitable for storing texture maps for “cutout” materials, such as grates, fences, and vegetation, where alpha testing is used to discard the transparent parts of the image. Care must be taken, however, when using bilinear filtering with BC1 cutout textures. Since the color component of the transparent pixels is always black, a dark fringe will form around the transparent areas where bilinear filtering blends the colored pixels with the transparent ones. This can be avoided by setting the alpha threshold high, so that the dark areas are culled away, or by dividing the interpolated color by the interpolated alpha in the shader, which will cancel out the darkening.
Here is an example of an image with 1-bit alpha, upscaled with bilinear filtering, with an alpha test threshold of 128. On the left, only plain bilinear filtering has been used, producing a dark fringe where the color of the texture is interpolated 50% toward the black of the transparent area. On the right, the interpolated alpha has been divided out.
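The divide-by-alpha fix is easy to check numerically. This is a one-dimensional sketch in Python rather than shader code, with colors as floats in the 0 to 1 range; the function names are mine.

```python
def bilinear_mix(texel_a, texel_b, t):
    """Linearly blend two RGBA texels, as the filter does between neighbors."""
    return tuple((1 - t) * a + t * b for a, b in zip(texel_a, texel_b))


def unfringe(rgba):
    """Divide the interpolated color by the interpolated alpha."""
    r, g, b, a = rgba
    if a == 0:
        return (0.0, 0.0, 0.0, 0.0)     # fully transparent: nothing to recover
    return (r / a, g / a, b / a, a)


leaf = (0.2, 0.6, 0.1, 1.0)             # opaque green texel
hole = (0.0, 0.0, 0.0, 0.0)             # BC1 transparent texel: always black
mixed = bilinear_mix(leaf, hole, 0.5)
print(mixed)             # (0.1, 0.3, 0.05, 0.5): half as bright, the dark fringe
print(unfringe(mixed))   # (0.2, 0.6, 0.1, 0.5): original color restored
```

This works because the transparent texel contributes zero to both the color and the alpha, so the two are scaled by the same factor.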
Rather than going through these formats in numerical order (which corresponds to the chronological order in which they were introduced, in successive generations of GPU hardware), I’m going to go in order from the simplest to most complicated. After BC1, BC4 is the next logical step.
BC4 stores a grayscale image—no RGB, just a single color channel—and uses 8 bytes per block. Its endpoints are one byte each, and it uses an eight-element palette, so it has 3 bits of indices per pixel.
BC4 is the same size as BC1, but it gives much better quality than BC1 when storing a grayscale image, due to both the expanded palette (eight elements instead of four) and the extended endpoint precision (8 bits instead of 5–6). This makes BC4 an excellent choice for height maps, gloss maps, and any other kind of grayscale texture. Compare the quality of the gradient here with that of BC1, above:
There is little or no visible difference between the BC4-compressed gradient and the uncompressed original.
Like BC1, BC4 makes use of degeneracy breaking by defining an alternative mode triggered based on the order of the endpoints. If the first endpoint is numerically greater than the second, the palette consists of eight values evenly spaced from one endpoint to the other. Otherwise, the first six entries in the palette are evenly spaced from one endpoint to the other, and the last two are, respectively, 0 and 255—black and white. BC4 already has excellent quality due to its full 8-bit endpoints and large palette, but allowing this alternative mode can make it even better in certain cases, such as at sharp edges between black and white areas in the map.
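Both BC4 modes fit in a short sketch. As with BC1, the palette ordering (endpoints first), the little-endian packing of the 3-bit indices, and the truncating division are assumptions drawn from my understanding of the Direct3D documentation.

```python
def bc4_palette(e0, e1):
    """Build the eight-entry palette from the two 8-bit endpoints."""
    if e0 > e1:
        # Eight values evenly spaced from e0 to e1.
        interp = [((7 - i) * e0 + i * e1) // 7 for i in range(1, 7)]
        return [e0, e1] + interp
    # Six evenly spaced values, plus explicit black and white.
    interp = [((5 - i) * e0 + i * e1) // 5 for i in range(1, 5)]
    return [e0, e1] + interp + [0, 255]


def decode_bc4_block(block):
    """Decode an 8-byte BC4 block into 16 grayscale values."""
    palette = bc4_palette(block[0], block[1])
    index_bits = int.from_bytes(block[2:8], "little")   # 16 pixels x 3 bits
    return [palette[(index_bits >> (3 * i)) & 0x7] for i in range(16)]


print(bc4_palette(255, 0))    # [255, 0, 218, 182, 145, 109, 72, 36]
print(bc4_palette(50, 200))   # [50, 200, 80, 110, 140, 170, 0, 255]
```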
BC2, BC3, and BC5
These formats are simply combinations of the previous two. BC3 stores RGBA data, using BC1 for the RGB part and BC4 for the alpha part, for a total block size of 16 bytes, or an average of 1 byte per pixel. It’s the most common format for textures that require a full alpha channel, and can also be used for packing a color texture together with any grayscale image, such as a height map or gloss map. Since the alpha is stored separately from the color, BC3 does not use the BC1 1-bit alpha mode in the color part.
BC5 is a two-channel format in which each block is just two BC4 blocks. This is very useful for tangent-space normal maps, if the X and Y components are stored and the Z component is reconstructed in the pixel shader. Since each channel has its own endpoints and indices, normal maps (in which the X and Y components are often “doing different things”, so to speak) retain quite a bit more fidelity in BC5 than in BC1. The downside is that BC5 requires twice as much memory, at 16 bytes per block; this can also make it slower for shaders to access because more memory bandwidth is needed to get the texture to the shader cores. But this may be a price worth paying for the substantial increase in quality.
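The Z reconstruction relies on the normal having unit length. Here is a sketch of the arithmetic in Python rather than shader code, assuming the two channels are stored as unsigned bytes that map to the range -1 to 1 (one common convention, not the only one).

```python
import math


def reconstruct_normal(x_byte, y_byte):
    """Rebuild a unit tangent-space normal from its stored X and Y bytes."""
    x = x_byte / 255.0 * 2.0 - 1.0
    y = y_byte / 255.0 * 2.0 - 1.0
    # Compression error can push x*x + y*y slightly above 1, so clamp.
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)


print(reconstruct_normal(128, 128))   # nearly straight up: Z close to 1
```

Z is always taken as non-negative, which is valid for tangent-space normals because they point away from the surface.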
BC2 is a bit of an odd duck, and frankly is never used nowadays. It stores RGBA data, using BC1 for the RGB part, and a straight 4 bits per pixel for the alpha channel. The alpha part doesn’t use any endpoints-and-indices scheme, just stores explicit pixel values. But since each alpha value is just 4 bits, there are only 16 distinct levels of alpha, which causes extreme banding and makes it impossible to represent a smooth gradient or edge even approximately. Like BC3, it totals 16 bytes per block. As far as I can think of, there’s no reason ever to use this format, since BC3 can do a better job in the same amount of memory. I include it here just for historical reasons.
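For completeness, a sketch of what decoding the BC2 alpha part amounts to. The nibble order (pixel 0 in the lowest four bits) and the expansion to 8 bits by multiplying by 17 are assumptions on my part.

```python
def bc2_alpha(alpha_bytes):
    """Expand the 8 bytes of explicit 4-bit alpha into 16 8-bit values."""
    bits = int.from_bytes(alpha_bytes, "little")
    # 0xF * 17 == 255, so the 16 levels span the full 8-bit range.
    return [((bits >> (4 * i)) & 0xF) * 17 for i in range(16)]


print(bc2_alpha(bytes([0xF0, 0x08, 0, 0, 0, 0, 0, 0]))[:4])   # [0, 255, 136, 0]
```

Every decoded alpha is a multiple of 17, so neighboring levels are 17 apart on the 0 to 255 scale, which is where the banding comes from.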
BC6 and BC7
The final two formats were introduced very recently, just within the last couple of years, and are only supported by D3D11-level graphics hardware. They’re also vastly more complex than any of the other formats we’ve discussed. As a result of their newness, complexity, and hardware requirements, they’re not yet well supported in texture compression tools and libraries, and aren’t yet well-known or widely used.
Both of them consume 16 bytes per block, the same as BC3 and BC5. BC7 targets 8-bit RGB or RGBA data, and BC6 targets RGB half-precision floating-point data. BC6 is therefore the only BCn format that can natively store HDR images, and is an excellent replacement for RGBM and other HDR encodings that rendering programmers have heretofore used to shoehorn HDR data into compressed textures.
The reason BC6 and BC7 are so complicated is that they allow a variety of different modes that change the details of the format, such as the palette size and the way the endpoints are stored. Modes are specified by the first few bits of each block, so each block can effectively have a different format! This makes BC6 and BC7 very adaptable to the image contents, as they can choose the best mode for each individual 4×4 block. But the downside is that compressing to BC6 or BC7 is much more difficult and slow, since the compressor has many more options to try to achieve the best-quality representation of each block.
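As a sketch of how the mode selection might be read for BC7: my understanding of the spec is that the mode number is encoded as a count of zero bits before the first set bit, starting from the least-significant bit of the first byte. Treat that encoding as an assumption here; BC6 uses a different scheme that this sketch does not cover.

```python
def bc7_mode(block):
    """Return the BC7 mode (0-7) of a 16-byte block, or None if invalid."""
    first_byte = block[0]
    for mode in range(8):
        if first_byte & (1 << mode):   # first set bit, counting from the LSB
            return mode
    return None                        # all eight bits zero: no valid mode


print(bc7_mode(bytes([0b00000001] + [0] * 15)))   # 0
print(bc7_mode(bytes([0b01000000] + [0] * 15)))   # 6
```

A decoder would then branch on this value to decide how to interpret the remaining bits of the block.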
The different modes essentially trade off various features of the format. For example, they trade off endpoint precision versus index precision: some modes have larger palettes, but store the endpoints with fewer bits per component; other modes have higher-precision endpoints, but smaller palettes.
Another enhancement that BC6 and BC7 feature is the ability to have more than one line segment in each block. Now, some of the formats described previously—namely BC3 and BC5—have two line segments per block, but they use them for different channels—in BC3, color and alpha, or in BC5, the two grayscale channels. BC6 and BC7 introduce the concept of partitioning, which allows different line segments to be used for different pixels in the block. There are a variety of spatial partitioning patterns available (the ones for BC6 can be seen here). These predefined patterns assign each pixel in the block to one line segment or another; the indices then control which color out of that line segment’s palette the pixel gets. Partitioning can improve quality in cases where the colors in a block don’t fall very neatly along a single line in RGB space; with multiple line segments, the original range of colors in the block can be reproduced more faithfully.
The number of line segments is controlled by the per-block mode setting. Modes with more line segments have more endpoints to store, so naturally, these modes also tend to have lower-precision endpoints and smaller palettes—everything still has to fit in 16 bytes per block. When applicable, the chosen partition pattern is also stored by a few more bits in the block.
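Partitioning can be sketched as one extra lookup per pixel. The pattern below (left half versus right half) is hypothetical and chosen for readability; the real formats pick from fixed tables of predefined patterns, and the palette sizes vary by mode.

```python
def make_palette(e0, e1, size):
    """Colors evenly spaced from e0 to e1 (truncating integer blend)."""
    return [tuple(((size - 1 - i) * a + i * b) // (size - 1)
                  for a, b in zip(e0, e1))
            for i in range(size)]


# Hypothetical two-subset pattern: 0 = left half of the block, 1 = right half.
PATTERN = [0, 0, 1, 1] * 4


def decode_partitioned(endpoint_pairs, pattern, indices, size=4):
    """Each pixel picks its line segment from the pattern, then its color."""
    palettes = [make_palette(e0, e1, size) for e0, e1 in endpoint_pairs]
    return [palettes[pattern[i]][indices[i]] for i in range(16)]


grays = ((0, 0, 0), (90, 90, 90))        # line segment for subset 0
reds = ((200, 0, 0), (230, 30, 0))       # line segment for subset 1
pixels = decode_partitioned([grays, reds], PATTERN, [3] * 16)
print(pixels[0], pixels[3])              # (90, 90, 90) (230, 30, 0)
```

A block with a gray region next to a red region would need a single line stretching from gray to red otherwise, wasting palette entries on in-between colors that never occur.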
As an example of the higher image quality enabled by the sophisticated BC7 features, here are the four blocks extracted from the brick texture earlier in this article, compressed with BC7:
If you compare this with the earlier BC1 version of this image, there’s a massive difference. The second block from the left still appears visibly degraded relative to the uncompressed version, but it’s much better than in BC1, and the other three blocks hardly have any visible differences with their uncompressed versions.
At the zoomed-out level, there is no perceptible difference between the uncompressed and BC7 versions of the image, even if you examine them quite closely:
Finally, BC7 also works extremely well on the gradient example, where BC1 failed miserably:
The slightly unfortunate thing about BC6 and BC7 is that because they have all these different per-block options, they have to use up some of the precious 16 bytes just to say which options were picked, leaving less space for the actual contents—the endpoints and indices. However, the smart people who invented these formats (presumably some engineers at NVIDIA and AMD…will we ever know who they are?) found a whole bag of tricks for squeezing more data into a block.
For example, recall that BC1 and BC4 both exploit degeneracy breaking to effectively gain one more bit of data, by using the order of the two endpoints in a block as a signal to switch between two modes. BC6 and BC7 also use degeneracy breaking, but not to switch modes; they take advantage of it to eliminate one bit of the indices. When you swap the two endpoints, you have to bitwise-invert all the indices to compensate: indices 00, 01, 10, and 11 become 11, 10, 01, and 00, respectively. But you can also turn this around: you can always invert the indices by swapping the endpoints. So you can declare, for instance, that the most-significant bit of the upper-left pixel’s index will always be 0: if it’s not, swap the endpoints and then it will be! Then you don’t have to actually store the 0 bit; that bit can be used for something else. In partitioned modes, one bit can be saved for each line segment by swapping that segment’s endpoints.
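Here is a sketch of that trick from the compressor's side, using a single grayscale channel and 2-bit indices to keep it short. The function names are mine and the simple linear palette ordering is an assumption for illustration.

```python
def make_palette(e0, e1, size=4):
    """Values evenly spaced from e0 to e1 (truncating integer blend)."""
    return [((size - 1 - i) * e0 + i * e1) // (size - 1) for i in range(size)]


def canonicalize(e0, e1, indices, index_bits=2):
    """Swap endpoints if needed so the first index has a zero top bit."""
    top_bit = 1 << (index_bits - 1)
    if indices[0] & top_bit:
        mask = (1 << index_bits) - 1
        e0, e1 = e1, e0
        indices = [i ^ mask for i in indices]   # 00<->11, 01<->10
    return e0, e1, indices


def decode(e0, e1, indices):
    palette = make_palette(e0, e1)
    return [palette[i] for i in indices]


indices = [3, 0, 1, 2]
before = decode(10, 250, indices)
e0, e1, fixed = canonicalize(10, 250, indices)
print(e0, e1, fixed)                     # 250 10 [0, 3, 2, 1]
print(before == decode(e0, e1, fixed))   # True: same pixels, top bit now 0
```

Since the first index is now guaranteed to be 00 or 01, its top bit carries no information and does not need to be stored.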
Another space-saving trick, which BC6 uses in most of its modes, is delta compression for endpoints. Rather than storing all the endpoints at a high level of precision, these modes store one endpoint at relatively high precision, then represent the other endpoints as lower-precision delta vectors relative to that base endpoint. (Unfortunately, the spec refers to this feature by the uninformative phrase “transformed endpoints”.) This is an interesting approach because it restricts the possible lengths and orientations of the line segments in RGB space, but still allows their absolute position to be set precisely. Quantized lengths and orientations for the line segments will naturally introduce error in color reproduction, but precise absolute positioning allows the error to be distributed more evenly over all the pixels, as opposed to concentrating in a few.
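The delta scheme can be sketched as follows. The bit widths (a 10-bit base with 5-bit signed deltas) are illustrative only, since they vary from mode to mode, and the wrap-around masking reflects my reading of the spec rather than something covered above.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_delta_endpoints(base, raw_deltas, base_bits=10, delta_bits=5):
    """Rebuild all endpoints from one precise base plus low-precision deltas."""
    mask = (1 << base_bits) - 1
    endpoints = [base]
    for raw in raw_deltas:
        endpoints.append(tuple((b + sign_extend(d, delta_bits)) & mask
                               for b, d in zip(base, raw)))
    return endpoints


base = (512, 300, 100)
raw_delta = (0b11110, 0b00011, 0b10000)   # encodes -2, +3, -16
print(decode_delta_endpoints(base, [raw_delta]))
# [(512, 300, 100), (510, 303, 84)]
```

With 5-bit deltas the second endpoint can never be more than 16 steps away from the base in any channel, which is the restriction on segment length and orientation mentioned above.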
Incidentally, although BC6 deals in floating-point values, it nevertheless treats them as 16-bit integers throughout almost all stages of the decoding and interpolation process! If you’re wondering how that can even work at all, take a look at this article. The key point is that magnitude order—from lower to higher values—is the same for floats and their integer representations. So interpolating floats as if they were ints actually does something reasonable, although it does not always produce linear interpolation—it effectively interpolates along a piecewise-linear approximation of a logarithmic curve! BC6 does involve some special-casing to handle negative numbers, NaNs and infinities, but it mostly treats its values as ints, relying on this fact about the IEEE float representation to make everything work.
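The ordering property is easy to see with Python's `struct` module, which supports half-precision floats through the `"e"` format code. This sketch only illustrates the float-versus-integer relationship for positive values; it is not the actual BC6 interpolation procedure.

```python
import struct


def half_to_bits(x):
    """Bit pattern of a half-precision float, as a 16-bit unsigned integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bits_to_half(b):
    """Half-precision float whose bit pattern is the given 16-bit integer."""
    return struct.unpack("<e", struct.pack("<H", b))[0]


lo, hi = half_to_bits(1.0), half_to_bits(4.0)
print(hex(lo), hex(hi))          # 0x3c00 0x4400
mid = (lo + hi) // 2             # halfway between the two bit patterns
print(bits_to_half(mid))         # 2.0, not the linear midpoint 2.5
```

The integer midpoint lands on 2.0, the geometric mean of 1.0 and 4.0, which is the logarithmic behavior described above.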
BC7 does not use delta compression for its endpoints, but has a similar mechanism, referred to as “P-bits”. A P-bit is a shared least-significant bit that gets tacked onto the end of all the RGB color values of the endpoints. This is a bit of a head-scratcher, but the end result of it is very like that of BC6’s delta compression: the possible lengths and orientations of the RGB line segments are more coarsely quantized, but the whole line segment can be positioned more precisely, allowing error to be distributed more evenly over the pixels in the block.
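A sketch of how a P-bit might be applied when decoding an endpoint. The 7-bit stored width is illustrative (the width depends on the mode), and the bit-replication expansion to 8 bits is an assumption.

```python
def apply_p_bit(stored, p_bit, stored_bits):
    """Append the shared P-bit as the new LSB, then expand to 8 bits."""
    total_bits = stored_bits + 1
    out = []
    for c in stored:
        v = (c << 1) | p_bit                  # same P-bit for every channel
        # Shift up to 8 bits and refill the low bits from the top bits.
        v = (v << (8 - total_bits)) | (v >> (2 * total_bits - 8))
        out.append(v & 0xFF)
    return tuple(out)


# One 7-bit endpoint, decoded with each setting of its shared P-bit:
print(apply_p_bit((100, 50, 25), 0, 7))   # (200, 100, 50)
print(apply_p_bit((100, 50, 25), 1, 7))   # (201, 101, 51)
```

One extra stored bit moves all three channels together by one step, so the endpoint gains finer positioning along the gray diagonal at a fraction of the cost of a full extra bit per channel.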
Finally, since BC7 can store both color and alpha channels, it’s equipped with modes that offer a few choices for combining the two. Some modes have one set of indices for all four channels, which (roughly speaking) requires color and alpha to have the same spatial distribution within the block. Other modes include two distinct sets of indices, one for color and one for alpha, allowing the two to be relatively independent. Moreover, the separate-index modes also include channel swapping flags: alpha can be swapped with red, green, or blue, or left in place. This effectively allows any of the four channels to use distinct indices from the rest.
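The channel swap is applied as a final step after decoding. In this sketch the numeric encoding of the flag (0 for no swap, then 1, 2, 3 for red, green, blue) is my recollection of the spec and should be treated as an assumption.

```python
def apply_rotation(rgba, rotation):
    """Swap alpha with one color channel, as selected by the rotation flag."""
    r, g, b, a = rgba
    if rotation == 1:
        return (a, g, b, r)    # alpha <-> red
    if rotation == 2:
        return (r, a, b, g)    # alpha <-> green
    if rotation == 3:
        return (r, g, a, b)    # alpha <-> blue
    return rgba                # 0: channels left in place


# The block stored green in its separately indexed "alpha" slot:
print(apply_rotation((10, 255, 30, 200), 2))   # (10, 200, 30, 255)
```

A compressor would pick the rotation so that whichever channel varies most independently of the others gets the dedicated set of indices.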
To sum it all up, here’s a table listing the major differences between the seven BCn formats:
In this article, I described the data representation of the BCn formats, but not how to write compressors for them. Like many compression techniques, the BCn formats are designed to be simple and fast to decompress—but that often comes at the cost of making compression difficult! Writing high-quality BCn compressors is a big enough topic for its own article, and is a subject of ongoing research. In particular, the BC6 and BC7 formats have a much greater “search space” because they offer so many more options for encoding each block of an image, and high-quality BC6 and BC7 compression involves a lot of brute-force searching for the best (lowest-error) combination of options, making these compressors quite a bit slower than those for BC1–5.
For compressing and decompressing BC1–5, the open-source NVIDIA Texture Tools are probably your best bet. NVTT includes both command-line utilities and a set of libraries that can be linked into your own projects.
It is still surprisingly difficult to find any publicly available tools that support BC6–7 at all. The only publicly available tools I’ve found that can compress to these formats with high image quality are NVIDIA’s internal development compressors, which have been open-sourced and are hosted at the NVTT Google Code repository here. These tools are not the user-friendliest! They run very slowly—taking several minutes to encode a 256×256 image—and they don’t save to a standard file format like DDS, but simply dump the compressed blocks into a raw binary file. Their license status is also unknown, so it may not be safe from a legal perspective to incorporate their source into your own projects. However, they do give very high-quality results, and I used them for all the BC7 comparison images in this article.
Compression and decompression of all the BCn formats is also supported by the Direct3D 11 SDK in the form of the D3DX11CreateTextureFromFile function. However, my experience is that the compressed image quality is not very good with this API, so I would not advise using it for compression (decompression should be fine).
For the full, bit-for-bit specifications of each of the BCn formats, see these references: