Impacts of fine-tuning text encoder for Stable Diffusion LoRA

How does training the text encoder impact our fine-tuning for Stable Diffusion LoRA?

The text encoder conditions the UNet through cross attention. The final hidden states (the last layer of the CLIP or OpenCLIP text encoder, not its logits) are projected into the keys (K) and values (V) of each cross-attention layer, while the queries (Q) come from the UNet's image features. Training the text encoder therefore changes the conditioning signal that every cross-attention layer in the UNet sees.
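As a rough sketch of where this hook lives in code (using the diffusers and transformers libraries; the checkpoint id, prompt, and shapes are illustrative):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import UNet2DConditionModel

# Illustrative SD 1.x checkpoint; any model with the same layout works.
repo = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")

prompt = "a photo of a red fox in the snow"
ids = tokenizer(prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids

# Final hidden states of the text encoder (not logits): [1, 77, 768] for SD 1.x.
text_hidden = text_encoder(ids)[0]

# The UNet consumes them as cross-attention keys/values via encoder_hidden_states.
latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor([10])
noise_pred = unet(latents, timestep, encoder_hidden_states=text_hidden).sample
```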

Cross attention

Cross attention is a way of attending between two distinct modalities so that one can influence the other and produce something new. In Stable Diffusion it is how the text prompt steers the image denoising. The same mechanism is how text-to-3D, text-to-video, and similar models use text conditioning.
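A minimal single-head sketch of the idea (not the actual Stable Diffusion implementation, which is multi-head and lives inside the UNet's attention blocks): queries come from image features, keys and values come from text features.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross attention: queries from image tokens,
    keys/values from text tokens."""

    def __init__(self, img_dim: int, txt_dim: int, inner_dim: int = 320):
        super().__init__()
        self.to_q = nn.Linear(img_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(txt_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(txt_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, img_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, img_tokens, txt_tokens):
        q = self.to_q(img_tokens)              # [B, HW, inner]
        k = self.to_k(txt_tokens)              # [B, 77, inner]
        v = self.to_v(txt_tokens)              # [B, 77, inner]
        attn = (q @ k.transpose(-1, -2)) * self.scale
        attn = attn.softmax(dim=-1)            # each image token attends over the text tokens
        return self.to_out(attn @ v)           # text information written into image features

# Usage: a 64x64 latent flattened to 4096 "image tokens", 77 text tokens.
xattn = CrossAttention(img_dim=320, txt_dim=768)
out = xattn(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
```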

Why train the text encoder?

It is worth noting that the original LoRA training recipes did not include text-encoder training. The CLIP text encoder is zero-shot, roughly meaning it is good at recognizing and representing new things without being trained on them. In other words, it is already quite good at knowing what you are trying to produce in the image from the text alone, without any fine-tuning.
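Later recipes optionally add LoRA adapters to the text encoder as well. A hedged sketch of that setup, mirroring the pattern used in recent diffusers LoRA training scripts (requires peft and reasonably recent diffusers/transformers; the rank and target module names are typical for SD 1.x / CLIP but should be checked against your checkpoint):

```python
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel
from peft import LoraConfig

repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# LoRA on the UNet's attention projections (the part every recipe trains).
unet.add_adapter(LoraConfig(
    r=8, lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

# Optionally also LoRA on the CLIP text encoder's attention projections.
text_encoder.add_adapter(LoraConfig(
    r=8, lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"]))
```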

Downsides of training the text encoder

Flexibility

Training the text encoder moves it away from its zero-shot behaviour and towards your training dataset. In other words, as we train the text encoder we give up some of the flexibility the original CLIP text encoder has.

Language drift

Language drift happens when tokens shared by multiple words get pushed in different directions during training. This can make training more difficult, because you may be pulling certain tokens away from their original meaning towards your dataset's usage. The better your captions align with the tokens' original meanings, the less drift you get and the less zero-shot flexibility you lose.
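One way to see how exposed you are to drift is to look at how your trigger words tokenize. A small sketch (the prompts and trigger word are just examples):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# A made-up trigger word is usually split into existing subword tokens.
print(tokenizer.tokenize("a photo of zkz person"))
print(tokenizer.tokenize("a photo of a cat person"))

# Any subword token that appears both in your training captions and in
# ordinary prompts receives gradients from both usages when the text
# encoder is trained, so its meaning can drift toward your dataset.
```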

Conditioning without a direct loss

When training the text encoder together with the UNet, we have no direct control over a loss value for the text encoder; the only loss comes from the UNet's noise prediction. The text encoder's effect on the result is therefore indirect: it only matters through how it conditions the denoising that creates the image.

This is why it is usually suggested to use a lower learning rate for the text encoder: its updates are not calibrated by an objective of their own, only by how they combine with the UNet training.
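A condensed sketch of a training step makes this concrete (following the usual diffusers-style recipe; the latents, input_ids, noise_scheduler, text_encoder, and unet variables are assumed to come from a standard setup like the earlier sketches):

```python
import torch
import torch.nn.functional as F

# latents:   VAE-encoded training image, [B, 4, 64, 64]
# input_ids: tokenized caption,          [B, 77]
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# The text encoder has no loss of its own; it only feeds the UNet.
text_hidden = text_encoder(input_ids)[0]
noise_pred = unet(noisy_latents, timesteps,
                  encoder_hidden_states=text_hidden).sample

# The only objective is the UNet's denoising error. Gradients reach the
# text encoder indirectly, by flowing back through encoder_hidden_states.
loss = F.mse_loss(noise_pred, noise)
loss.backward()
```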

Alternatives to training the text encoder

Only UNet training

Only training the UNet can work fine, even with fairly new terminology. The zero-shot nature of the text encoder usually conditions the image well enough, because we are adapting the denoising to match the representations the text encoder already produces for those tokens. For some words or tokens this will not be adequate, but it often works better than you might expect.

Embeddings through Textual Inversion

Training embeddings with Textual Inversion lets us create new embeddings that better capture how we want the text to steer image generation.
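In outline, Textual Inversion adds a new token, resizes the embedding table, and trains only that new row while everything else stays frozen. A hedged sketch, assuming the tokenizer, text_encoder, and unet from the earlier sketches (the placeholder token and learning rate are examples):

```python
import torch

# Add a new placeholder token and give it a trainable embedding row.
tokenizer.add_tokens("<my-concept>")
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids("<my-concept>")

# Freeze the UNet and the text encoder; only the embedding table stays
# trainable (in practice you also mask updates so only the new row moves).
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
```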

Pivotal tuning inversion training

Pivotal tuning inversion is an adaptation of the Pivotal Tuning paper, where you first anchor the concept in an editable region of the model's space. For Stable Diffusion, we use Textual Inversion embeddings to localize the concept in a well-behaved, editable part of the text embedding space, and then fine-tune around that pivot. This keeps the resulting image generation highly flexible because the embedding is already well aligned.

Assessing impacts to training the text encoder for Stable Diffusion LoRA

Deciding how much or how little you want to train the text encoder will depend on your goals. Experiment with the results at the extremes and see some of the impacts yourself. At inference you can disable the text-encoder portion of the LoRA (or swap back the base text encoder) to compare results with and without it.

Test X/Y

Make X/Y comparison charts at inference with and without the text-encoder LoRA applied.
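One way to set that up (a sketch; the LoRA path is hypothetical) is to render the same prompt and seed, then swap in an untouched copy of the base text encoder so only the UNet LoRA remains active:

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

repo = "runwayml/stable-diffusion-v1-5"          # illustrative base model
pipe = StableDiffusionPipeline.from_pretrained(
    repo, torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/my_lora")        # hypothetical LoRA path

prompt = "a photo of zkz person, studio lighting"
generator = torch.Generator("cuda").manual_seed(1234)
with_te = pipe(prompt, generator=generator).images[0]

# Swap in a fresh copy of the base text encoder: same UNet LoRA, but the
# text-encoder half of the LoRA is effectively disabled.
pipe.text_encoder = CLIPTextModel.from_pretrained(
    repo, subfolder="text_encoder", torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(1234)
without_te = pipe(prompt, generator=generator).images[0]
```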

Test text encoder LR

Test different learning rates for the text encoder.
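Typically this means giving the text-encoder parameters their own, smaller optimizer group. A sketch, assuming the LoRA parameters are already the only trainable ones and the learning rates are just example values:

```python
import torch

optimizer = torch.optim.AdamW([
    {"params": [p for p in unet.parameters() if p.requires_grad],
     "lr": 1e-4},      # UNet LoRA learning rate
    {"params": [p for p in text_encoder.parameters() if p.requires_grad],
     "lr": 5e-5},      # usually set lower for the text encoder
])
```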

Test embedding training impacts

Test making embeddings and using them at inference or during training, to capture the impact of your text without the language drift.
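At inference, a trained embedding can be loaded alongside (or instead of) the LoRA. A sketch with a hypothetical embedding file:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# Hypothetical embedding file produced by Textual Inversion training.
pipe.load_textual_inversion("path/to/my-concept.safetensors",
                            token="<my-concept>")
image = pipe("a photo of <my-concept> on a beach").images[0]
```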

Test pivotal tuning inversion

Pivotal tuning inversion works by training an embedding first and then training the LoRA afterwards. Continuing to train the embedding during the LoRA stage is possible, but risks overtraining the embedding.
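In outline, the two stages look something like this (a sketch; it assumes the unet, text_encoder, and LoRA config from the earlier sketches, with the training loop itself as described above):

```python
# Stage 1: Textual Inversion. Only the new embedding row is trainable;
# the UNet and the rest of the text encoder are frozen.
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().weight.requires_grad_(True)
# ... run the denoising-loss training loop from the earlier sketch ...

# Stage 2: LoRA. Freeze the embedding (to avoid overtraining it) and
# train LoRA adapters on the UNet (and optionally the text encoder).
text_encoder.get_input_embeddings().weight.requires_grad_(False)
unet.add_adapter(unet_lora_cfg)   # LoraConfig from the earlier sketch
# ... run the same training loop, now updating only the LoRA weights ...
```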