(D) Noise generation when training Stable Diffusion from scratch

Hello,

I am training Stable Diffusion 1.4 from scratch on CIFAR-10, using the CompVis implementation with the training script taken directly from https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py, with the necessary modifications: removing the VAE from training & inference, changing the sample size from 64 to 32 and the in/out channels from 4 to 3, and reducing the model size (removing one cross-attention block from both the up and down blocks). Everything else is the same as the original code. I'm doing fp16 training.
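
For reference, one training step after removing the VAE looks roughly like this; I add noise to the pixel images directly instead of to VAE latents (a simplified sketch, with `batch`, `unet`, `text_encoder`, and `noise_scheduler` set up as in the script):

```python
import torch
import torch.nn.functional as F

# One training step, following train_text_to_image.py but with the VAE
# removed: the noise target is defined on raw pixels, not latents.
pixel_values = batch["pixel_values"]          # (B, 3, 32, 32), normalized to [-1, 1]
encoder_hidden_states = text_encoder(batch["input_ids"])[0]

# Sample Gaussian noise and a random timestep per image, then forward-diffuse.
noise = torch.randn_like(pixel_values)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps,
    (pixel_values.shape[0],), device=pixel_values.device,
).long()
noisy_images = noise_scheduler.add_noise(pixel_values, noise, timesteps)

# Predict the added noise and regress against it (epsilon objective).
model_pred = unet(noisy_images, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")
```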

However, after a large number of steps (e.g. 170K steps at batch size 8), the results are still very noisy and highly saturated. For example, the first image below is a generation from SD 1.4 but with (256, 512, 1024) channels, (down/up2d, crossattn, crossattn) block types, and 2 layers per block, trained on the plane class only. The second is (64, 128, 256) with 1 layer per block, trained on all 10 classes.
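
Concretely, the smaller model's UNet is constructed roughly like this (a sketch, not my exact config; the block ordering and cross_attention_dim below are illustrative):

```python
from diffusers import UNet2DConditionModel

# Rough sketch of the smaller (64, 128, 256) UNet; the exact block ordering
# and cross_attention_dim here are illustrative, not copied from my code.
unet = UNet2DConditionModel(
    sample_size=32,   # CIFAR-10 resolution (the SD 1.4 UNet uses 64 in latent space)
    in_channels=3,    # RGB pixels directly, since the VAE is removed (was 4)
    out_channels=3,
    block_out_channels=(64, 128, 256),  # reduced from SD 1.4's (320, 640, 1280, 1280)
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
    layers_per_block=1,
    cross_attention_dim=768,  # CLIP ViT-L/14 hidden size, as in SD 1.4
)
```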

Since the results are so bad, I think I'm doing something fundamentally wrong. This is the first time I've trained an SD model from scratch, so I don't have an educated guess as to what's causing it. If anyone has any ideas, they'd be greatly appreciated 🙂

Thank you very much!!

https://preview.redd.it/y2wbvta1uxhd1.png?width=257&format=png&auto=webp&s=16e4892b46452499363835029944f01c90b9588b

https://preview.redd.it/1u5uigp1uxhd1.png?width=1006&format=png&auto=webp&s=570e7e3f0f9ee23aa61f5183e659862173682793

submitted by /u/ImaginaryAd9209