Aesthetic predictor models
These experiments explore architectures for aesthetic prediction models. The goal is more expressive aesthetic predictions, which in turn support better generative models through aesthetic quality filtering.
This work builds on the LAION Aesthetic Predictor.
Baseline architecture
The baseline continues from the LAION Aesthetic Predictor and uses a linear architecture. We add different activation functions to compare performance:
- Linear
- Linear with SiLU
- Linear with ReLU
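The three variants above differ only in the nonlinearity applied between linear layers. The following is a minimal, framework-free sketch of that difference. The function names, the two-layer layout, and the toy weights are illustrative assumptions, not the project's actual code or layer sizes.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def identity(x):
    """No activation: stacked linear layers stay purely linear."""
    return x

def linear(vec, weights, bias):
    """Dense layer: one output per weight row, y_i = w_i . vec + b_i."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def predict(embedding, hidden, out, activation=identity):
    """Hypothetical two-layer head: linear -> activation -> linear -> scalar."""
    h = [activation(x) for x in linear(embedding, *hidden)]
    return linear(h, *out)[0]

# Toy weights and input, assumed for illustration only.
hidden = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
out = ([[1.0, 1.0]], [0.0])
embedding = [2.0, 0.5]

for name, act in [("linear", identity), ("silu", silu), ("relu", relu)]:
    print(name, predict(embedding, hidden, out, act))
```

With these toy weights the two hidden units cancel in the purely linear case, while ReLU and SiLU each keep a different part of the signal. This is why swapping the activation can change what the head is able to express.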
UNet architecture
We use a CNN to learn a representation directly from the image pixels. This could yield a more expressive model that uses small details in the image to predict aesthetics.
Images are resized to 224×224 and passed through a UNet convolutional neural network, which predicts a score between 0 and 10.
- UNet Convolution from resized images
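As a rough sketch of the shapes involved, the snippet below tracks how a 224×224 input shrinks through stride-2 convolution stages and shows one common way to keep a prediction inside the 0 to 10 range. The stage count, the kernel settings, and the sigmoid scaling are assumptions for illustration; the notes above do not specify how the network is configured or how the range is enforced.

```python
import math

def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size=224, stages=5):
    """Feature-map side length after each stride-2 stage (stage count is assumed)."""
    sizes = [size]
    for _ in range(stages):
        sizes.append(conv_out(sizes[-1]))
    return sizes

def to_score(logit, low=0.0, high=10.0):
    """Squash an unbounded network output into [low, high] with a sigmoid."""
    return low + (high - low) / (1.0 + math.exp(-logit))

print(encoder_sizes())   # [224, 112, 56, 28, 14, 7]
print(to_score(0.0))     # 5.0
```

Only the downsampling path is sketched here. A full UNet would also include the upsampling path and skip connections before a final pooling step reduces the features to a single score.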
VAE architecture
Images are passed through a KL variational autoencoder (the same VAE used for Stable Diffusion) to obtain a latent representation. This latent representation is then used to predict an aesthetic score between 0 and 10.
- UNet Convolution from VAE latents
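To give a sense of how much smaller the predictor's input becomes in latent space, the sketch below computes the latent shape for a given image size. It assumes the Stable Diffusion v1 KL VAE settings (8× spatial downsampling, 4 latent channels) and the same 224×224 input used for the pixel-space model. Both the helper names and the input size are illustrative assumptions.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent tensor shape (C, H, W) for an image of the given pixel size.

    downsample=8 and channels=4 match the Stable Diffusion v1 KL VAE;
    treat them as assumptions for any other autoencoder.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (channels, height // downsample, width // downsample)

def numel(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

image = (3, 224, 224)
latent = latent_shape(224, 224)
print(latent)                         # (4, 28, 28)
print(numel(image) // numel(latent))  # 48
```

Under these assumptions the convolutional predictor sees 48 times fewer input values than it would in pixel space, at the cost of relying on whatever detail the VAE preserves.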
UNet Transformers
Use an attention mechanism to infer aesthetics from images. Transformers can be scaled to however expressive we want the model to be, but this can make the model much larger than necessary. Whether the modeling improvements from Transformers justify the added size is the trade-off here.
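To make the size side of this trade-off concrete, the sketch below counts the learnable parameters in a stack of standard Transformer encoder blocks as width and depth grow. The block layout (multi-head self-attention with biases, a two-layer MLP with a 4× hidden ratio, two LayerNorms) and the example configurations are assumptions; patch embeddings and the prediction head are not counted.

```python
def block_params(d_model, mlp_ratio=4):
    """Learnable parameters in one standard Transformer encoder block."""
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * d_model * d_model + 4 * d_model
    # Two dense layers (d_model -> d_hidden -> d_model) plus their biases.
    d_hidden = mlp_ratio * d_model
    mlp = 2 * d_model * d_hidden + d_hidden + d_model
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return attention + mlp + norms

def encoder_params(d_model, depth, mlp_ratio=4):
    """Parameters in a stack of `depth` identical encoder blocks."""
    return depth * block_params(d_model, mlp_ratio)

# Hypothetical small, medium, and large configurations.
for d_model, depth in [(192, 4), (384, 8), (768, 12)]:
    print(d_model, depth, encoder_params(d_model, depth))
```

The count grows linearly with depth and quadratically with width, so a configuration that is far larger than the task needs is easy to reach.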