OpenAI's 4o Image Generation capability launched two days ago on March 25th, 2025. The quality of the images it creates is the best among all contemporary models by a large margin. However, what is most interesting is the way it seems to generate the image: first it quickly creates a blurry version of the final image, then sharpens it from top to bottom, as if loading an image over a dial-up connection. This shouldn't be possible with a standard latent diffusion model, which generates the whole image at once. What could they be doing differently? And why is it better?
Before we get into it, we have to acknowledge that this post is speculative and that we are assuming that the generation animation has something to do with the generation process itself. It's possible that OpenAI is simply generating the full image at once, using something like beefed-up latent diffusion, and displaying the animation for fun. In that case, this post has nothing to do with 4o, but has still led me down an interesting direction in modern generative model research.
What happens when images are generated?
Here's a rough timeline of what happens:
- 0 seconds: We see a blank image with a loading animation
- 9 seconds: The model has decided on the dimensions and displays a very blurry image whose color palette is similar to that of the final image
- 22 seconds: The sharp version of the image starts being generated top to bottom in thin strips, or perhaps pixel-by-pixel as in a raster scan. The blurry portion of the image near the raster line changes as well and becomes slightly more detailed to match the portion which is already generated.
- 77 seconds: At the point where the image generation is 3/4 done, the top-to-bottom animation is cut short and the full sharp image is displayed.
Is it simple Autoregression?
When we see this animation, the immediate thought is that this is an autoregressive model, similar to GPT-4o itself, which predicts the next pixel or row of the raster scan based on the ones already generated. If so, it is trained on either continuous or discrete tokens.
Continuous Autoregression
If the model is trained with an MSE loss, then in the simplest case it will output the mean of the posterior distribution over the next token, conditional on the image generated so far. This leads to a very blurry final image, and in many cases the predictions will diverge completely, since the partially generated image falls outside the training distribution.
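The mean-collapse problem can be seen in a toy setting (this is an illustration of the statistical point, not anything from 4o): if the next pixel is black or white with equal probability, the MSE-optimal prediction is gray, a value that never occurs in the data.

```python
import numpy as np

# Toy illustration: the next pixel is either black (0.0) or white (1.0)
# with equal probability given the context. A regressor trained with MSE
# converges to the posterior mean, which is gray -- a value the data
# never actually contains, hence blurry samples.
rng = np.random.default_rng(0)
samples = rng.choice([0.0, 1.0], size=100_000)

# The MSE-minimizing constant prediction is the sample mean.
candidates = np.linspace(0.0, 1.0, 101)
mse = [np.mean((samples - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]

print(best)  # ~0.5: a blurry gray, not a valid sample from the data
```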
The MAR paper gets around this by predicting a vector which conditions a lightweight diffusion model that samples the next token. This has the unfortunate effect of requiring ~100 sequential diffusion steps for each token in the image. Fortunately, incorporating techniques from MAGE allows multiple future tokens to be predicted at a time, improving generation speed.
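Structurally, the inference loop looks something like the sketch below (hypothetical shapes and stand-in functions, not MAR's actual code): the backbone emits one conditioning vector per token, and the small diffusion head then runs its ~100 denoising steps to sample that token, so head calls scale as tokens times steps.

```python
import numpy as np

# Structural sketch of MAR-style inference with stand-in networks.
D, STEPS = 16, 100
rng = np.random.default_rng(0)

def backbone(tokens):
    # stand-in for the transformer: emits a conditioning vector z
    return rng.standard_normal(D)

def denoise_step(x, z, t):
    # stand-in for the lightweight MLP head: one reverse-diffusion step
    return 0.9 * x + 0.1 * z

tokens, head_calls = [], 0
for _ in range(4):                  # generate 4 tokens sequentially
    z = backbone(tokens)
    x = rng.standard_normal(D)      # start the token from pure noise
    for t in range(STEPS):          # ~100 sequential head steps per token
        x = denoise_step(x, z, t)
        head_calls += 1
    tokens.append(x)

print(head_calls)  # 400 = 4 tokens x 100 steps each
```

Predicting several tokens per backbone call, as in MAGE-style masked generation, cuts the number of backbone passes but leaves the per-token head cost intact.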
Discrete Autoregression
Generative autoregressive models on discrete tokens don't carry the same risk of divergence. If the model outputs a probability distribution over every possible color (16,777,216 of them), there is no fundamental limitation to generating images this way. One can factorize the output distribution, but then we are only approximating the conditional posterior of the next token.
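The size of that tradeoff is easy to quantify. A full softmax over every 24-bit RGB color needs 2^24 logits per position; factorizing into three per-channel softmaxes needs only 3 x 256, at the cost of assuming the channels are conditionally independent:

```python
# Output-head size for the two choices discussed above.
full_vocab = 2 ** 24     # one logit per 24-bit color: 16,777,216
factorized = 3 * 256     # three per-channel softmaxes: 768
print(full_vocab, factorized, full_vocab // factorized)
```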
Many image models which try to operate in discrete space instead transform patches into a discrete vocabulary of latent tokens. The Fluid paper (which examines the scaling of the MAR model) notes that while continuous and discrete autoregressive models both exhibit good scaling laws in validation loss (the absolute validation losses cannot be compared), it is the continuous model which performs best at large scale on measures of qualitative image fidelity such as FID and GenEval score. However, if we are true believers in the Scaling Hypothesis, we should only be concerned with minimizing negative log-likelihood and trust that all other metrics will fall out naturally.
What is so good about autoregressive image models?
Inference
In diffusion models, inference cost is quadratic in the number of patches and linear in the number of diffusion steps. In the MAR/Fluid model, while inference is still quadratic in the number of patches (though smaller by a factor of 2), the diffusion head may be lightweight enough to be negligible, since each patch only has to attend to itself rather than the whole image.
In addition, if the inference is done in raster order, as it appears to be according to the animation, it is possible to implement such a model using a GPT-Like decoder-only transformer, which takes advantage of key-value caching, something diffusion models cannot do. This greatly improves inference speed.
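A toy operation count makes the KV-caching advantage concrete (this is a back-of-the-envelope model, not a real transformer): with a cache, step i processes only the newest token, attending to i cached keys; without one, every step would reprocess the whole prefix.

```python
# Attention-op count for raster-order decoding, with and without a
# KV cache. N is the tokens per image in the Fluid setup (256).
N = 256

cached = sum(i for i in range(1, N + 1))        # one query per step
uncached = sum(i * i for i in range(1, N + 1))  # i queries x i keys per step

print(cached, uncached, uncached // cached)  # 32896 5625216 171
```

Even at this small sequence length the cache saves roughly two orders of magnitude in attention work, which is the speedup a GPT-like decoder gets essentially for free.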
Performance
But inference speed isn't why we are here. We are impressed by the quality of its generated images.
Raster vs Random order inference
In addition to discrete/continuous tokens, the Fluid paper compares two generation orders: raster (sequential, with GPT-like decoder-only transformers) and random (with BERT-like encoder-only transformers), and finds that random order with continuous tokens performs best at scale on FID and GenEval score out of all four configurations.
However, the raster order model exhibits much lower validation loss and better scaling, dropping from 0.255 to 0.235 when scaling from 0.1 to 3B parameters whereas the random order model drops only from 0.306 to 0.294 over the same parameter range.
My guess is that this has to do with model capacity: a raster order model needs to learn to sample from the posterior distribution over a single patch given the preceding prefix, and the number of such distributions it must learn is linear in the number of patches. A random order model must learn to sample each patch it was not conditioned on, and it must be able to do this for every subset of patches (exponential in the number of patches).
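The gap between those two counts is enormous even at Fluid's modest sequence length (a back-of-the-envelope count for the capacity argument, under the simplifying assumption that each distinct conditioning set is a separate distribution to learn):

```python
# Raster order: exactly one context per position -- N in total.
# Random order: the model may be conditioned on any subset of patches.
N = 256  # patches per image in the Fluid setup

raster_contexts = N
random_contexts = 2 ** N

print(raster_contexts)            # 256
print(len(str(random_contexts)))  # 2^256 has 78 decimal digits
```

In practice parameter sharing means the random-order model does not literally memorize 2^N distributions, but the function it must represent is still vastly more complex.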
The largest model tested in the paper is only 3B parameters. The images are tokenized such that each image enters the transformer as a sequence of only 256 tokens with 16 channels. Granted, the Fluid paper used only 256x256 resolution images, but compared to the text inputs GPT-4o deals with, this is extremely small. Considering OpenAI's blazing-fast inference infrastructure and the notably long time it takes 4o to generate an image, 4o's image generation model must represent a further scaling over the Fluid paper by many orders of magnitude. At that scale, the raster model's significantly better validation-loss scaling would have completely outstripped that of the random order model and (again, if we are followers of the scaling hypothesis) begun to show up in FID and GenEval score as well.
Autoregressive Raster vs. Diffusion Models
Diffusion models have to allocate capacity in their parameters to reconstructing every possible noised version of an image. In an autoregressive model, there is only a single way to generate each possible image (ignoring the lightweight diffusion step, which is handled by a separate model), which most likely vastly increases its effective capacity. Diffusion models can be thought of as performing extremely good data augmentation in a clever way, but perhaps this is another example of the Bitter Lesson: it is possible that at such large dataset sizes, clever data augmentation does not matter nearly as much as capacity.
Loose ends
What about the blurry initial picture? My guess is that it has nothing to do with the model. Most likely there is an extremely lightweight model which directly predicts the blurry image and its dimensions from the prompt which gives us something to look at while we wait for the image to load.
What about the blurry part changing as the image loads? What about the sudden generation of the full image once the 3/4 point is reached? My guess is that these two are connected: The sharpening of the blurry part of the image is on a 10 or so second delay, allowing for another model to monitor the content before it is shown. We still see a heavily blurred version of the generated image in real time. Once the blurred portion has reached the 3/4 mark, that is when the full image has been generated on the server side. At this point, it can then be checked by the content moderation model and the full image can be shown to the user immediately.
Conclusion
Especially considering how good the validation loss scaling of the continuous raster model is in the Fluid paper, it seems quite likely that 4o Image Gen is a further scaled variation of the MAR/MAGE architecture. If true, what's most striking is that it achieves state-of-the-art results by modelling images simply as a sequence of patches, with no additional inductive priors (other than perhaps positional encoding). This is certainly not the most natural representation for such data, which leads us to believe that the performance gap for similar models in true forecasting domains should be even wider. This seems to be the basic structure of DeepMind's GenCast weather prediction model, which, assuming our guess about what 4o is doing is correct, shouldn't be too surprising.
Lately, I have been working on Diffusion Forcing models for forecasting of chaotic PDEs. Unfortunately, for sequential prediction alone, it seems to fall prey to many of the same presumed drawbacks as diffusion and random order autoregressive models. One unique feature it might still have is the option to condition the generation of a sequence based on partially denoised tokens far into the future, possibly being a novel avenue for planning in an RL setting.