Aesthetic predictor models

These experiments in aesthetic predictor architecture aim to build more expressive aesthetic predictions, which in turn support better generative models through aesthetic quality filtering of training data.

All experiments build on the LAION Aesthetic Predictor.

Baseline architecture

The baseline continues from the LAION Aesthetic Predictor's linear architecture, adding different activation functions to test performance.
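A minimal sketch of such a baseline, assuming (as in the LAION predictor) a 768-dimensional CLIP ViT-L/14 image embedding as input; the layer widths and the set of activations compared are illustrative choices, not values from the original experiments:

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Linear-style head over CLIP embeddings with a swappable activation."""

    def __init__(self, embed_dim: int = 768, activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            activation,
            nn.Linear(1024, 128),
            activation,
            nn.Linear(128, 1),  # scalar aesthetic score
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(clip_embedding)

# Compare activation functions on a dummy batch of CLIP embeddings.
for act in (nn.ReLU(), nn.GELU(), nn.SiLU()):
    model = AestheticHead(activation=act)
    scores = model(torch.randn(4, 768))
    assert scores.shape == (4, 1)
```

In practice each activation variant would be trained on the same (embedding, human rating) pairs and compared on held-out regression error.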

UNet architecture

We use a CNN to build a representation directly from the image pixels. This could yield a more expressive model that exploits fine details in the image to predict aesthetics.

Images are resized to 224×224 and passed through a UNet-style convolutional network that predicts a score between 0 and 10.
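A rough sketch of the pixel-level approach, assuming the UNet's features are eventually pooled into a scalar regression head (the skip-connection decoder is omitted here for brevity; channel counts are hypothetical). The 0–10 range is enforced with a scaled sigmoid:

```python
import torch
import torch.nn as nn

class PixelAestheticCNN(nn.Module):
    """Convolutional score regressor over raw 224x224 RGB pixels."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),    # 224 -> 112
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # 112 -> 56
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 56 -> 28
            nn.AdaptiveAvgPool2d(1),                                 # global pool
        )
        self.head = nn.Linear(128, 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.features(images).flatten(1)
        return 10.0 * torch.sigmoid(self.head(x))  # score constrained to (0, 10)

model = PixelAestheticCNN()
scores = model(torch.randn(2, 3, 224, 224))
assert scores.shape == (2, 1)
assert (scores >= 0).all() and (scores <= 10).all()
```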

VAE architecture

Images are passed through a KL VAE (the same variational autoencoder used for Stable Diffusion) to obtain a latent representation, which is then used to predict the aesthetic score between 0 and 10.
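A sketch of a score head over SD-style KL-VAE latents, assuming the VAE downsamples 8× to 4 channels (so a 224×224 image yields a 4×28×28 latent); the encoding step itself (e.g. a pretrained `AutoencoderKL`) is stubbed out with a random tensor here:

```python
import torch
import torch.nn as nn

class LatentAestheticHead(nn.Module):
    """Small CNN regressor over KL-VAE latents instead of raw pixels."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return 10.0 * torch.sigmoid(self.net(latents))  # score in (0, 10)

latents = torch.randn(2, 4, 28, 28)  # stand-in for vae.encode(images)
scores = LatentAestheticHead()(latents)
assert scores.shape == (2, 1)
```

One appeal of this design is that generative pipelines already compute these latents, so the predictor adds little overhead when filtering training data.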

UNet Transformers

This variant uses an attention mechanism to infer aesthetics from images. Transformers scale with how expressive we want the model to be, but this can make the model much larger than necessary; improvements in Transformer modeling efficiency could ease this trade-off.
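A hedged sketch of the attention-based variant: the image is split into patch tokens, a small Transformer encoder attends over them, and a score is regressed from the mean token. Depth, width, and patch size are illustrative knobs (a real ViT would also add positional embeddings, omitted here for brevity):

```python
import torch
import torch.nn as nn

class TransformerAestheticPredictor(nn.Module):
    """Patchify -> Transformer encoder -> mean-pooled score head."""

    def __init__(self, patch: int = 16, dim: int = 128,
                 depth: int = 2, heads: int = 4):
        super().__init__()
        # Conv with stride == kernel size acts as a patch embedding.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        encoded = self.encoder(tokens)
        return 10.0 * torch.sigmoid(self.head(encoded.mean(dim=1)))

scores = TransformerAestheticPredictor()(torch.randn(2, 3, 224, 224))
assert scores.shape == (2, 1)
```

Scaling `depth` and `dim` trades parameter count for expressiveness, which is the trade-off noted above.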