Diffusion Transformer
Diffusion Transformer, or DiT, is an adaptation of the Vision Transformer that serves as the denoising network in a diffusion model: at each step of the generative process, it predicts the noise to be removed. In the original Latent Diffusion Model, for example, this role is played by a U-Net, which is a convolutional architecture. The Diffusion Transformer authors replaced that U-Net with their adaptation of the Vision Transformer, which operates on a sequence of patches cut from the latent representation. They also compared a few different approaches to conditioning the generation process on extra inputs such as the diffusion timestep and a class label.
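To make the Vision Transformer adaptation more concrete, the following is a minimal sketch of the "patchify" step that turns a spatial latent into a token sequence, together with its inverse, which maps output tokens back to a spatial noise prediction. It uses plain nested lists so that it runs without any dependencies. The function names `patchify` and `unpatchify`, the toy latent shape (4 channels, 8x8), and the patch size of 2 are assumptions chosen for illustration, not values taken from the text above. The transformer blocks that would sit between the two functions are omitted.

```python
def patchify(latent, patch_size):
    """Split a latent of shape [C][H][W] (nested lists) into a token sequence.

    Returns (H / p) * (W / p) tokens, each a flat list of p * p * C values.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")

    tokens = []
    # Walk over the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = [
                latent[c][top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
                for c in range(channels)
            ]
            tokens.append(token)
    return tokens


def unpatchify(tokens, channels, height, width, patch_size):
    """Inverse of patchify: rebuild a [C][H][W] nested list from tokens."""
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    patches_per_row = width // patch_size
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch_size
        left = (index % patches_per_row) * patch_size
        values = iter(token)
        # Same (dy, dx, c) ordering as in patchify.
        for dy in range(patch_size):
            for dx in range(patch_size):
                for c in range(channels):
                    out[c][top + dy][left + dx] = next(values)
    return out


# Toy latent (hypothetical sizes): 4 channels on an 8x8 grid, distinct values.
C, H, W, P = 4, 8, 8, 2
latent = [
    [[float(c * H * W + y * W + x) for x in range(W)] for y in range(H)]
    for c in range(C)
]

tokens = patchify(latent, P)                 # sequence fed to the transformer
restored = unpatchify(tokens, C, H, W, P)    # back to the latent's layout
```

With these toy sizes the sequence has (8 / 2) * (8 / 2) = 16 tokens of 2 * 2 * 4 = 16 values each. In a real model, each token would additionally pass through a learned linear embedding and receive a positional encoding before entering the transformer blocks. A smaller patch size yields a longer sequence and therefore more computation per denoising step.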