Latent Diffusion Model

A Latent Diffusion Model is a generative model that runs the diffusion process in a learned latent space rather than in the raw-data space (for example, pixel space in the case of images). Such models have three main components: (1) an auto-encoder, trained separately, which maps data into the latent space and back; (2) the diffusion model itself, which learns to iteratively generate latent representations from noise and conditioning information; (3) a conditioning model, which maps conditioning information (such as a text prompt) into an embedding that is fed into the diffusion model to control the output. At inference time, the model starts from a noise sample in the latent space, iteratively denoises it with the diffusion model under the guidance of the prompt embedding, and then passes the resulting latent through the decoder to obtain the final output.
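The inference procedure can be illustrated with a short, self-contained sketch in PyTorch. All names here (Decoder, TextConditioner, LatentDenoiser, sample, latent_hw, and so on) are hypothetical placeholders, not any library's API: the auto-encoder decoder and the denoiser are toy stand-ins for the real convolutional decoder and U-Net/transformer, while the sampling loop uses a standard DDPM-style ancestral update as one possible reverse-diffusion rule.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder half of the pretrained auto-encoder: latent space -> pixel space."""
    def __init__(self, latent_ch=4, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=8, mode="nearest"),   # latents are spatially downsampled
            nn.Conv2d(64, out_ch, 3, padding=1),
        )
    def forward(self, z):
        return self.net(z)

class TextConditioner(nn.Module):
    """Maps tokenized prompt IDs to embeddings that condition the denoiser."""
    def __init__(self, vocab_size=49408, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
    def forward(self, token_ids):
        return self.embed(token_ids)                       # (batch, tokens, dim)

class LatentDenoiser(nn.Module):
    """Diffusion model operating on latents; predicts the noise to remove at step t."""
    def __init__(self, latent_ch=4, cond_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(latent_ch, latent_ch, 3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, latent_ch)
    def forward(self, z_t, t, cond):
        # A real denoiser is a U-Net or transformer with cross-attention on cond;
        # this placeholder only reproduces the interface and shapes.
        bias = self.cond_proj(cond.mean(dim=1))[:, :, None, None]
        return self.conv(z_t) + bias

@torch.no_grad()
def sample(denoiser, decoder, conditioner, token_ids, num_steps=50, latent_hw=32):
    """Start from Gaussian noise in latent space, denoise iteratively, then decode."""
    cond = conditioner(token_ids)
    betas = torch.linspace(1e-4, 0.02, num_steps)          # noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(token_ids.shape[0], 4, latent_hw, latent_hw)
    for t in reversed(range(num_steps)):
        eps = denoiser(z, torch.full((z.shape[0],), t), cond)
        # DDPM ancestral update: subtract the predicted noise, re-inject noise except at t=0.
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return decoder(z)                                      # map the final latent to pixels

# Usage with untrained placeholder modules, purely to show the data flow:
images = sample(LatentDenoiser(), Decoder(), TextConditioner(),
                token_ids=torch.randint(0, 49408, (1, 8)))
```

Note that the auto-encoder's encoder is only needed during training (to produce latent targets); at inference time only the decoder is used, which is why it alone appears in the sketch.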
Related concepts:
Diffusion Model, Diffusion Transformer
External reference:
https://arxiv.org/abs/2112.10752