Aesthetic predictor models

These experiments in aesthetic predictor architecture aim to build more expressive aesthetic predictions, which in turn support better generative models through aesthetic quality filtering of training data.

All experiments build on the LAION Aesthetic Predictor.

Baseline architecture

The baseline continues from the LAION Aesthetic Predictor's linear architecture, adding different activation functions to test performance.
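A minimal sketch of such a baseline, assuming (as in the LAION predictor) a 768-dimensional CLIP ViT-L/14 image embedding as input; the layer widths and the set of activations compared are illustrative choices, not values from the original experiments:

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Linear-style head over CLIP embeddings with a swappable activation."""

    def __init__(self, embed_dim: int = 768, activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            activation,
            nn.Linear(1024, 128),
            activation,
            nn.Linear(128, 1),  # scalar aesthetic score
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(clip_embedding)

# Compare activation functions on a dummy batch of CLIP embeddings.
for act in (nn.ReLU(), nn.GELU(), nn.SiLU()):
    model = AestheticHead(activation=act)
    scores = model(torch.randn(4, 768))
    assert scores.shape == (4, 1)
```

In practice each activation variant would be trained on the same (embedding, human rating) pairs and compared on held-out regression error.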

UNet architecture

We use a CNN to build a representation directly from the image pixels. This could yield a more expressive model that exploits fine details in the image to predict aesthetics.

Images are resized to 224×224 and passed through a UNet-style convolutional network that predicts a score between 0 and 10.
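A rough sketch of the pixel-level approach, assuming the UNet's features are eventually pooled into a scalar regression head (the skip-connection decoder is omitted here for brevity; channel counts are hypothetical). The 0–10 range is enforced with a scaled sigmoid:

```python
import torch
import torch.nn as nn

class PixelAestheticCNN(nn.Module):
    """Convolutional score regressor over raw 224x224 RGB pixels."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),    # 224 -> 112
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # 112 -> 56
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 56 -> 28
            nn.AdaptiveAvgPool2d(1),                                 # global pool
        )
        self.head = nn.Linear(128, 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.features(images).flatten(1)
        return 10.0 * torch.sigmoid(self.head(x))  # score constrained to (0, 10)

model = PixelAestheticCNN()
scores = model(torch.randn(2, 3, 224, 224))
assert scores.shape == (2, 1)
assert (scores >= 0).all() and (scores <= 10).all()
```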

VAE architecture

Images are passed through a KL VAE (the same variational autoencoder used for Stable Diffusion) to obtain a latent representation, which is then used to predict the aesthetic score between 0 and 10.
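A sketch of a score head over SD-style KL-VAE latents, assuming the VAE downsamples 8× to 4 channels (so a 224×224 image yields a 4×28×28 latent); the encoding step itself (e.g. a pretrained `AutoencoderKL`) is stubbed out with a random tensor here:

```python
import torch
import torch.nn as nn

class LatentAestheticHead(nn.Module):
    """Small CNN regressor over KL-VAE latents instead of raw pixels."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return 10.0 * torch.sigmoid(self.net(latents))  # score in (0, 10)

latents = torch.randn(2, 4, 28, 28)  # stand-in for vae.encode(images)
scores = LatentAestheticHead()(latents)
assert scores.shape == (2, 1)
```

One appeal of this design is that generative pipelines already compute these latents, so the predictor adds little overhead when filtering training data.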

UNet Transformers

This variant uses an attention mechanism to infer aesthetics from images. Transformers scale with how expressive we want the model to be, but this can make the model much larger than necessary; improvements in Transformer modeling efficiency could ease this trade-off.
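A hedged sketch of the attention-based variant: the image is split into patch tokens, a small Transformer encoder attends over them, and a score is regressed from the mean token. Depth, width, and patch size are illustrative knobs (a real ViT would also add positional embeddings, omitted here for brevity):

```python
import torch
import torch.nn as nn

class TransformerAestheticPredictor(nn.Module):
    """Patchify -> Transformer encoder -> mean-pooled score head."""

    def __init__(self, patch: int = 16, dim: int = 128,
                 depth: int = 2, heads: int = 4):
        super().__init__()
        # Conv with stride == kernel size acts as a patch embedding.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        encoded = self.encoder(tokens)
        return 10.0 * torch.sigmoid(self.head(encoded.mean(dim=1)))

scores = TransformerAestheticPredictor()(torch.randn(2, 3, 224, 224))
assert scores.shape == (2, 1)
```

Scaling `depth` and `dim` trades parameter count for expressiveness, which is the trade-off noted above.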