January 22, 2024 (updated April 16, 2025) by Dave

Stable Diffusion pipeline from video

The goal is to use video as a source of images for Stable Diffusion training. First, we extract frames with FFmpeg. Extracted frames can be blurry or show unrelated content, so we use the following steps to capture and filter them:

Capturing frames

1. Keyframes

Keyframes (I-frames) are the frames the encoder stores in full, and they often fall where the video creator cut between scenes. Capturing only keyframes should avoid most badly encoded frames, and it also limits the number of frames we capture.
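As a minimal sketch of keyframe extraction, the snippet below builds an FFmpeg command that keeps only I-frames using the `select` filter. The function names, the output file pattern and the PNG format are assumptions for illustration, and it assumes `ffmpeg` is on the PATH.

```python
import subprocess
from pathlib import Path


def keyframe_command(video_path, out_dir):
    """Build an FFmpeg command that writes only I-frames as numbered PNGs."""
    pattern = str(Path(out_dir) / "keyframe_%05d.png")  # hypothetical naming scheme
    return [
        "ffmpeg",
        "-i", str(video_path),
        # Keep a frame only when its picture type is I.
        "-vf", "select='eq(pict_type,I)'",
        # Do not duplicate frames to fill the gaps left by dropped frames.
        # Newer FFmpeg builds prefer "-fps_mode vfr" for the same purpose.
        "-vsync", "vfr",
        pattern,
    ]


def extract_keyframes(video_path, out_dir):
    """Create the output folder and run FFmpeg (raises if FFmpeg fails)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(keyframe_command(video_path, out_dir), check=True)
```

Calling `extract_keyframes("clip.mp4", "frames")` would write the keyframes into the `frames` folder; both names are placeholders.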

2. Canny edge detection

We can use Canny edge detection to find the edges of the objects in a frame. This can be a fast way to detect blurriness: a blurry frame tends to produce fewer and weaker edges, so we check how many edges are found and how clear they are. Adjust the Canny thresholds to match your target goals.
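One way to turn this into a filter is to measure the fraction of pixels that Canny marks as edges and drop frames below a cutoff. This is a rough sketch, not a definitive blur detector: the function names, the Canny thresholds (100, 200) and the 0.02 cutoff are placeholder assumptions, and edge density also depends on the scene content. It assumes the `opencv-python` package is installed.

```python
def canny_edge_density(image_path, low=100, high=200):
    """Return the fraction of pixels that Canny marks as edges (0.0 to 1.0)."""
    import cv2  # imported lazily so the filter below has no dependencies

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    edges = cv2.Canny(gray, low, high)  # uint8 image: 255 on edges, 0 elsewhere
    return cv2.countNonZero(edges) / edges.size


def keep_sharp_frames(densities, min_density=0.02):
    """Keep frames whose edge density meets the cutoff.

    densities maps a frame path to its edge density.
    """
    return sorted(path for path, d in densities.items() if d >= min_density)
```

In practice it helps to look at the densities of a few known sharp and known blurry frames from the same video before choosing the cutoff.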

Target goal frames

1-12 frames per second, preferring keyframes. We reduce the number of frames from 24 per second to 1-12 per second; for example, a 60-second clip drops from 1,440 frames to between 60 and 720. We also remove the frames we know are blurry, since Canny edge detection is a fast way to filter frames down.

Result: fewer frames, with the blurry ones removed.

Image quality filtering

1. Object detection

Detect whether the target object is in the frame, and filter out poorly captured frames that do not contain it. If the model supports it, we may want to connect object detection to the later masking phase and reuse its output for object masking.
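The filtering step can be sketched independently of the detector. The detection format below (a list of dictionaries with `label` and `score` keys per frame) is a hypothetical structure, as are the function name and the 0.5 score cutoff; adapt them to whatever your detector returns.

```python
def split_by_target(detections, target_label, min_score=0.5):
    """Split frames into those that contain the target object and those that do not.

    detections maps a frame path to a list of {"label": str, "score": float}.
    """
    with_target, without_target = [], []
    for frame, objects in detections.items():
        found = any(
            obj["label"] == target_label and obj["score"] >= min_score
            for obj in objects
        )
        (with_target if found else without_target).append(frame)
    return sorted(with_target), sorted(without_target)
```

Returning both lists keeps the frames without the target available, in case some of them are wanted later for context.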

2. Aesthetic predictor

Use an aesthetic predictor model to score the quality of each image. Train your own model to get the quality filter you want for your dataset. Adjust the aesthetic threshold to balance quality against the number of frames you want to keep.

Target goal frames

This can be the biggest reduction in frames. Depending on the filters we use, we could end up with 0 frames, so decide how many frames need to pass this stage. If too few pass, train different models to filter properly or loosen the thresholds to let more frames through. Aesthetic predictors that score between 0 and 10 may have been trained on a dataset that is not normalized or that has too few quality ratings, so their scores may not separate frames reliably.

We may also want to keep some frames without the target object in them, for context.
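Instead of guessing a threshold, one option is to derive it from the number of frames we want to keep. The sketch below assumes we already have one aesthetic score per frame from some predictor; the function names and the score dictionary are hypothetical.

```python
def threshold_for_count(scores, target_count):
    """Return the highest cutoff that still keeps target_count frames.

    scores maps a frame path to its aesthetic score. If there are fewer
    frames than target_count, the lowest score is returned so all pass.
    """
    if not scores or target_count <= 0:
        raise ValueError("need at least one score and a positive target_count")
    ranked = sorted(scores.values(), reverse=True)
    return ranked[min(target_count, len(ranked)) - 1]


def filter_by_score(scores, threshold):
    """Keep the frames whose score is at or above the threshold."""
    return sorted(frame for frame, s in scores.items() if s >= threshold)
```

If the derived threshold turns out very low, that is a hint the predictor or the earlier filters need adjusting rather than the cutoff.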

Assess capturing and filtering goals

At this point, we have finished capturing and filtering frames, and we can assess whether we have the frames we want from the video. We may have only a few frames or a lot of them; go back and adjust the earlier steps where required to get more or fewer.

Object or text inpainting

Now suppose we want to detect objects that are not our target and remove them from the images. This can make the captioning and training process clearer. We can use segmentation to separate the objects in a frame from the background, then inpaint those areas so they are plainer and simpler.

Examples include cars, background people, pets, houses and other objects that may be in the frame.

Frames may also contain menus or traffic signs with text. We may want to train the model to render these blank, so we can write our own labels on them later.

Be sure to use a proper fine-tuned VAE to get good details and clear representations in the inpainting.

1. Object detection masking and inpainting

Using models like Segment Anything or rembg, we can mask our objects. The mask can then be used for inpainting with Stable Diffusion to create new images, removing the background objects to produce simpler frames.

2. Text detection using OCR, masking and inpainting

Using a toolkit like MMOCR, we can detect and mask the text, then inpaint over it. This can help reduce the complexity of the images and remove text from areas we do not want to train into the model. Alternatively, we may want to train the model in a way that lets us put our own text in those areas.
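Turning detected text regions into an inpainting mask can be sketched as follows. This assumes the OCR step has already been reduced to axis-aligned boxes `(x0, y0, x1, y1)` in pixel coordinates; that box format, the function name and the padding value are assumptions for illustration, since OCR tools often return polygons instead.

```python
def boxes_to_mask(width, height, boxes, padding=4):
    """Build a binary mask (nested lists) that is 255 inside each padded box.

    Boxes are (x0, y0, x1, y1) with x1 and y1 exclusive. Padding grows each
    box so the inpainting also covers the soft edges of the glyphs.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        left, top = max(0, x0 - padding), max(0, y0 - padding)
        right, bottom = min(width, x1 + padding), min(height, y1 + padding)
        for y in range(top, bottom):
            for x in range(left, right):
                mask[y][x] = 255
    return mask
```

The nested-list mask is only for clarity; a real pipeline would build the same mask with an image library and pass it to the inpainting model.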

Improving inpainting results

Inpainting can sometimes alter parts of the original image outside the masked area. To avoid this, we merge the original image with the inpainted image: we take the inpainted area using a slightly larger mask and blend the edges together. This keeps the original features crisp while adding the inpainted area back in.
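The merge can be sketched as three steps: grow the mask, soften its edge, then mix the two images per pixel. The version below works on small grayscale images stored as nested lists so that it runs with no dependencies; the function names and the box blur used for feathering are illustrative choices, and a real pipeline would do the same thing with an image library.

```python
def _window(grid, y, x, radius):
    """Values in the square neighbourhood around (y, x), clipped to the grid."""
    h, w = len(grid), len(grid[0])
    return [
        grid[yy][xx]
        for yy in range(max(0, y - radius), min(h, y + radius + 1))
        for xx in range(max(0, x - radius), min(w, x + radius + 1))
    ]


def dilate(mask, radius):
    """Grow the masked area (values 0 or 1) by radius pixels."""
    return [[max(_window(mask, y, x, radius)) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def feather(mask, radius):
    """Soften mask edges with a box blur; results are in the range 0.0 to 1.0."""
    out = []
    for y in range(len(mask)):
        row = []
        for x in range(len(mask[0])):
            values = _window(mask, y, x, radius)
            row.append(sum(values) / len(values))
        out.append(row)
    return out


def blend(original, inpainted, mask, grow=1, soften=1):
    """Mix the images: soft mask 1.0 takes the inpainted pixel, 0.0 the original."""
    soft = feather(dilate(mask, grow), soften)
    return [
        [m * p_in + (1.0 - m) * p_orig
         for p_orig, p_in, m in zip(row_o, row_i, row_m)]
        for row_o, row_i, row_m in zip(original, inpainted, soft)
    ]
```

Pixels far from the mask come straight from the original, pixels well inside it come from the inpainted image, and the band in between is a weighted mix of both.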

Captioning and labeling

We can now label these images using BLIP or another image captioning model. If the captions are not good enough, we can fine-tune BLIP on the revised captions from our dataset, using a LoRA or another parameter-efficient method. Then we can caption the rest of the images with the fine-tuned model. If needed, we can iterate until we have a good captioning model and all our current images are captioned.

If we have other videos we want to extract frames from, we can use this same captioning model to caption those frames.

Training

Now we have good images, captioned and ready for Stable Diffusion training.

We can train a LoRA or another parameter-efficient method, or fine-tune the whole model.

A training guide is outside the scope of this walk-through.

Post training analysis

Afterward we can test our results. This can help us assess our goals.

1. CLIP score

CLIP score measures how well generated images match their prompts. We use it to make sure the trained model still responds flexibly to different prompts.
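To make the metric concrete: a common definition of CLIP score is the cosine similarity between the CLIP image embedding and the CLIP text embedding, scaled by 100 and clipped at zero. The sketch below takes the embeddings as plain lists of floats, so the CLIP model itself is assumed and not shown; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def clip_score(image_embedding, text_embedding):
    """100 * cosine similarity, clipped so the score is never negative."""
    return max(100.0 * cosine_similarity(image_embedding, text_embedding), 0.0)
```

Averaging this score over a set of prompts and generated images gives one number to compare between training runs.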

2. CLIP IQA

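CLIP IQA estimates the perceived quality of an image without a reference image, by comparing it against paired prompts such as "Good photo" and "Bad photo". We can use it to check the quality of the images our prompts produce after training.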

3. Sampling

Generate sample images from the trained model and assess the results. We could record preferences or ratings for these images to help train a scoring model in the future.

4. FID

If required, we can use FID to measure the distance between the original image dataset and the images generated by our trained model.
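For reference, FID compares the two sets of images through the statistics of their Inception features. In the standard definition below, the symbols are notation introduced here: $\mu_r, \Sigma_r$ are the mean and covariance of the features of the original images, and $\mu_g, \Sigma_g$ are those of the generated images. Lower is better, and 0 means the two feature distributions have the same mean and covariance.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

FID estimates are unreliable on small sets, so with only a few hundred frames the value is best treated as a rough indication.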