Oversimplified notes on some image models
I've just started working on a research project focused on generative image models.
There are a lot more concepts than GANs that I needed to know about, so I've put them here for future reference (and also in case some poor soul happens to find this useful).
Mostly just a quick reference for some of the newest and best stuff being done with deep learning on images.
(at the time of writing, of course)
Also not going to be too mathematical, hopefully.
GLIDE
GLIDE is a text-to-image generative model based on diffusion models. The text is encoded by a Transformer, and the text features are fed into a modified ADM diffusion model alongside the noisy image input. There is also some guidance involved to push the model towards generating an image that actually corresponds to the text prompt. Of the two guidance methods tried, classifier-free guidance worked better than guiding with a CLIP model (described below). GLIDE also has a separate upsampling diffusion model to increase the output resolution from 64x64 to 256x256.
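The guidance trick is easy to show in code. Below is a minimal sketch of the classifier-free guidance rule, assuming a hypothetical `model(x_t, t, cond=...)` noise-prediction network and a `guidance_scale` knob (not GLIDE's actual API):

```python
def guided_noise_prediction(model, x_t, t, text_emb, guidance_scale=3.0):
    """Noise prediction for one denoising step, with classifier-free guidance.

    `model`, `x_t`, `t`, and `text_emb` are hypothetical stand-ins for the
    diffusion U-Net, the noisy image, the timestep, and the encoded prompt.
    """
    eps_cond = model(x_t, t, cond=text_emb)  # conditioned on the text
    eps_uncond = model(x_t, t, cond=None)    # "empty" condition (text dropped during training)
    # Extrapolate away from the unconditional prediction, towards the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```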
Compared to DALL-E, the images GLIDE generates are much more detailed, especially in the texture of fur, hair, and other surfaces. DALL-E's output is muddier and smoother overall.
GLaM
GLaM is based on the Transformer architecture, with a Mixture of Experts (MoE) layer replacing the feedforward layer in every other Transformer layer. Each expert is just an MLP like the feedforward layer it replaces, but different experts end up handling different inputs. The MoE layer has a gating network that decides which experts (2 out of 64) are best suited to process each incoming token. Despite having far more total parameters than GPT-3, it is cheaper to train and needs much less compute at inference, because the experts are only sparsely activated.
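To make the routing concrete, here's a rough sketch of a top-2-of-64 MoE feedforward layer in PyTorch. Only the 64-expert / top-2 numbers come from the note above; the layer sizes, names, and the simple loop-based dispatch are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sketch of a Mixture-of-Experts feedforward layer with top-2 gating."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=64, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # gating network scores every expert
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep the best 2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, k] == e                 # tokens sent to expert e in slot k
                if routed.any():
                    out[routed] += weights[routed, k:k+1] * expert(x[routed])
        return out
```

Only 2 of the 64 expert MLPs actually run for any given token, which is where the inference savings come from.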
CLIP
CLIP does zero-shot image labelling, meaning it's been pretrained on 400 million image-text pairs from the internet and can then label images from datasets it has never seen before.
Images are encoded with a ViT and text with a Transformer (the best-performing combination).
For training, the model is asked to match each image with its correct caption within the current batch (with a massive batch size to approximate the global data).
This is different from the traditional convention of assigning each image a label from a pre-defined set.
It lets the model score the correct pairing highly while pushing down the wrong ones.
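Concretely, the training objective is a symmetric contrastive loss over the batch, something like the sketch below (the encoder outputs and temperature value are placeholders, not CLIP's actual code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    `image_emb` and `text_emb` are (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same image-text pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(len(logits))                # correct pairs sit on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # match each image to its caption
    loss_texts = F.cross_entropy(logits.t(), targets)  # and each caption to its image
    return (loss_images + loss_texts) / 2
```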
Ultimately, it performs about as well on ImageNet as ResNet models, but generalizes much better to new, wackier images (e.g. a photo of a banana vs. a sketch of a banana).
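And zero-shot labelling at inference time is just scoring the image against a text prompt for each candidate class; another rough sketch with hypothetical `image_encoder`, `text_encoder`, and `tokenize` stand-ins:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_label(image, class_names, image_encoder, text_encoder, tokenize):
    """Pick the class whose text prompt best matches the image."""
    prompts = tokenize([f"a photo of a {name}" for name in class_names])
    img = F.normalize(image_encoder(image), dim=-1)   # (1, dim)
    txt = F.normalize(text_encoder(prompts), dim=-1)  # (n_classes, dim)
    sims = (img @ txt.t()).squeeze(0)                 # cosine similarity per class
    return class_names[sims.argmax().item()]
```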
DALL-E
DALL-E does image generation from text as well, but with a different architecture:
- A discrete VAE (essentially a VQ-VAE) compresses each 256x256 image down to a 32x32 grid of tokens
- An autoregressive Transformer takes up to 256 text tokens and 1024 image tokens during training
- The codebook has 8192 possible image tokens (the image vocab size)
At inference, the image tokens generated by the Transformer are passed through the dVAE decoder to produce the final image. DALL-E can do fabric textures and multiple viewpoints, but is not very good with fine details.
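Inference would look roughly like the sketch below: sample 1024 image tokens autoregressively after the text tokens, reshape them into a 32x32 grid, and decode. The `transformer` and `dvae_decoder` objects, and the assumption that image tokens occupy the first 8192 slots of the vocabulary, are all hypothetical:

```python
import torch

@torch.no_grad()
def generate_image(text_tokens, transformer, dvae_decoder,
                   n_image_tokens=1024, image_vocab=8192):
    """Sketch of DALL-E-style inference with made-up components.

    `transformer(seq)` is assumed to return next-token logits of shape
    (batch, seq_len, vocab); `dvae_decoder` maps a 32x32 grid of discrete
    codes back to a 256x256 image.
    """
    seq = text_tokens                                   # (1, <=256) text tokens
    image_tokens = []
    for _ in range(n_image_tokens):                     # 1024 = 32 * 32 codes
        logits = transformer(seq)[:, -1, :image_vocab]  # distribution over image codes
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        image_tokens.append(next_tok)
        seq = torch.cat([seq, next_tok], dim=1)
    codes = torch.cat(image_tokens, dim=1).view(1, 32, 32)
    return dvae_decoder(codes)                          # (1, 3, 256, 256) image
```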