JXL Art

JXL Art is the practice of using JPEG XL’s prediction tree to generate art. If you have questions, you can join the #jxl-art channel on the JPEG XL Discord.

The oversimplified summary is that JPEG XL has a modular mode that divides the image it encodes into squares called groups, up to 1024x1024 each. JPEG XL uses a prediction tree to make a prediction what the value of each pixel is in such a square, based on neighboring pixels and their gradients. As a result only the difference (or error) between the actual image and the prediction needs to be encoded. The better the predictions, the smaller the error, the more compressible the data, the smaller the file size. Profit.

In the context of JXL art, however, the error is always assumed to be zero, which makes the image consist of only the prediction tree. That means the predictions effectively generate the image. The process of creating JXL art is writing that prediction tree. A prediction tree is a tree of if-else statements, branching on different properties and selecting a predictor for each leaf of the tree. The flexibility of these prediction trees is intentionally limited because they need to be small and execution needs to be fast, as it is part of the image decoding process. The prediction tree is run for every channel (red, green and blue) and for every pixel. Some predictors incorporate the values of the current pixel’s neighbors, specifically to the left, top-left, top and top-right, which dictates in which order the pixels have to be predicted: Row-by-row, top-to-bottom, left-to-right — just like Westerners read text.

A program starts with an optional header that specifies image properties and transformations to apply. The default header looks like this (everything is optional):

Width 1024: Width of the image (frame)
Height 1024: Height of the image (frame)
RCT 0: Reversible Color Transform.
- 0 is no transform, i.e. RGB.
- 6 is YCoCg.
- Higher numbers can be used (up to 42 excluded).
Orientation 0: Image rotation/flip as specified by EXIF.
1. = 0 degrees
2. = 0 degrees, mirrored
3. = 180 degrees
4. = 180 degrees, mirrored
5. = 90 degrees
6. = 90 degrees, mirrored
7. = 270 degrees
8. = 270 degrees, mirrored
XYB: Use XYB color space instead of RGB (not enabled by default).
CbYCr: Use YCbCr color space. Channel 0 becomes Cb, channel 1 is Y, channel 2 is Cr (not enabled by default).
GroupShift 3: Set the group size to 128 << GroupShift. Values 0-3 are valid.
Bitdepth 8: Self-explanatory. Other bit depths can be used, from 1 to 31.
FloatExpBits 3: Numbers are interpreted as IEEE floats with this many bits for the exponent (not enabled by default).
Alpha: Add a fourth channel (c == 3) for Alpha (not enabled by default).
Squeeze: Apply the Squeeze transform (not enabled by default). Weird things will happen and the image gets many channels; keep predictor values low to avoid blocky images.
FramePos 0 0: The frame position is set to this (x0, y0) position. The image canvas size also gets adjusted so the bottom right corner of the frame remains in the bottom right corner of the image. This is mostly useful with negative values, e.g. FramePos -100 -200, which has the effect of hiding the first 100 columns and the first 200 rows.
NotLast: This is not the last frame/layer (not enabled by default). This flag can be used to do multi-layer images. After encoding this layer, another layer will be encoded, which gets alpha-blended over the first layer and which can itself also have the NotLast option (there is no limit on the number of layers you can create this way). Every layer gets its own tree, so when this flag is used, you should specify not just one tree, but (at least) two. You can change the RCT between layers (they are local transforms), but not the Bitdepth, XYB, Orientation or presence of Alpha, since those are file/global properties/transforms. (without Alpha it is not very useful at the moment to do layering, since only alpha-blending can currently be done, though this will change when we add support for other blend modes and/or animation)
Spline [4 * 32 coefficients] x0 y0 x1 y1 ... xn yn EndSpline: Draws a spline that goes through the points (x0,y0), (x1,y1), ..., (xn,yn) and that has a color and thickness given by 4 times 32 numbers. These 32 numbers are 1D-DCT32 coefficients and floating point numbers, where the first number is the DC (i.e. the average value) and the next numbers correspond to increasing frequencies. The first series of 32 numbers corresponds to the color of the first channel (e.g. Red), where 1.0 is the maximum value. The second series corresponds to the second channel (e.g. Green), the third to the third channel (e.g. Blue). The final series of 32 numbers defines the 'thickness' in pixels (note that it's more like a blur radius than a thickness of a solid line). The spline gets 'added' to the frame, so negative numbers (both in colors and in thickness) correspond to darkening while positive numbers correspond to brightening.

It is then followed by a tree description, which starts with a decision node — an if-else-like statement. Technically you can also just give a single predictor, but that’s rarely interesting. A decision node looks like this:

if [property] > [value:int]
   (THEN BRANCH)
   (ELSE BRANCH)

Both the THEN branch and the ELSE branch can either be another decision node or a leaf node.

The following properties can be used in a decision node:

c: the channel number, where 0=R, 1=G, 2=B, 3=A (if Alpha was enabled in the header)
g: the group number (useful in case the image is larger than one group). Modular group numbers usually start with 21.
x, y: coordinates
N: value of pixel above (north)
W: value of pixel to the left (west)
|N|: absolute value of pixel above (north)
|W|: absolute value of pixel to the left (west)
W-WW-NW+NWW: basically the error of the gradient predictor for the pixel on the left
W+N-NW: value of gradient predictor (before clamping)
W-NW: left minus topleft, i.e. error of the N predictor for the pixel on the left
NW-N: topleft minus top, i.e. error of W predictor for the pixel above
N-NE: top minus topright, i.e. error of W predictor for pixel on top right
N-NN: top minus toptop, i.e. error of N predictor for pixel above
W-WW: left minus leftleft, i.e. error of W predictor for the pixel on the left
WGH: signed max-absval-error of the weighted predictor
Prev: the pixel value in this position in the previous channel
PPrev: the pixel value in this position in the channel before the previous channel
PrevErr: the difference between pixel value and the Gradient-predicted value in this position in the previous channel
PPrevErr: like PrevErr, but for the channel before that
PrevAbs, PPrevAbs, PrevAbsErr, PPrevAbsErr: same as the above, but the absolute value (not the signed value)

Leaf nodes are of the following form:

  - [predictor] +-[offset:int]

The following predictors are supported in leaf nodes:

Set: always predicts zero, so effectively sets the pixel value to [offset]
W: value of pixel on the left
N: value of pixel above
NW: value of topleft pixel
NE: value of topright pixel
WW: value of pixel to the left of the pixel on the left
Select: predictor from lossless WebP
Gradient: W+N-NW, clamped to min(W,N)..max(W,N)
Weighted: weighted sum of 4 self-correcting subpredictors based on their past performance (warning: not clamped so can get out of range)
AvgW+N,AvgW+NW,AvgN+NW,AvgN+NE: average of two pixel values
AvgAll: weighted sum of various pixels: (6 * top - 2 * toptop + 7 * left + 1 * leftleft + 1 * toprightright + 3 * topright + 8) / 16

Edge cases:

If x=y=0, W is set to 0. Otherwise, if x=0, W is set to N.
If y=0, N is set to W.
If x=0 or y=0, NW is set to W.
Similarly, NE and NN fall back to N and WW falls back to W.