Masked Autoencoder

Masked Autoencoder is a deep learning architecture used to build image representations for downstream recognition tasks. Training involves the following steps: (1) mask a random subset of image patches, (2) run the remaining visible patches through an encoder; (3) introduce mask tokens in the appropriate positions; (4) run these tokens plus the encoded patches through a decoder. Once the model is trained, an image representation is obtained by applying the encoder on all the patches, without masking.