Lesson 9A 2022 - Stable Diffusion deep dive

Hello everyone, my name is Jonathan, and today I'm going to be taking you through this Stable Diffusion Deep Dive notebook, looking into the code behind the popular high-level APIs, libraries and tools to see what exactly the generation process looks like, how we can modify it, and how each of the individual components works. Feel free to run along with me. If you haven't run it before, this might take a little while, just because it's downloading these large models (if they aren't already downloaded) and loading them up.

We're going to start by recreating what it looks like to generate an image using one of the existing pipelines in Hugging Face. We've basically copied the code from the call method of the default Stable Diffusion pipeline, so if you go and view that here, you'll see that we're basically replicating this code, but now we're doing it in our own notebook, and then we'll slowly understand what each of these different parts is doing. We've got some setup, and we've got a loop running through a number of sampling timesteps and generating an image. This is supposed to be a watercolor picture of an otter, and it's very, very cool that this model can just do this, but now we want to know how that actually works. What's going on?

The first component is the autoencoder. Stable Diffusion is a latent diffusion model, which means that it doesn't operate on pixels; it operates in the latent space of some other autoencoder model, in this case a variational autoencoder that's been trained on a large number of images to compress them down into this latent representation and bring them back up again. I have some functions to do that. We're going to look at what it's like in action by downloading a picture from the internet and opening it up with PIL, so we have this 512 by 512 pixel image. We're going to load it in, and then we're going to use
our function defined above to encode that into some latent representation. What this is doing is calling the vae.encode method on a tensor version of the image. That gives us a distribution, so we'll be sampling from that distribution, and we're scaling by this factor because that's what the authors of the model did: they scaled the latents down before they fed them to the model, so we have to do that scaling, and then the reverse when we're decoding, just to be consistent with that.

But the key idea is that we go from this big image down to this 4 by 64 by 64 latent representation. We've gone from this much larger image down, and if we visualize the four channels here, these four different 64 by 64 channels, we'll see that it's capturing something of the image. You can sort of see the same shapes and things there, but it's not quite a direct mapping or anything. For example, there's this weirdness going on around the beak, and some of the channels look slightly stranger than the others, so there's some sort of rich information captured there. If we decode this back, what we'll see is that the decoded image looks really good. You really have to look closely to tell the difference between our input image here and the decoded version.

So that's very, very impressive compression. This is a factor of eight in each dimension, so 512 by 512 down to 64 by 64 is a factor of 64 reduction in the number of spatial positions, but it's still somehow capturing most of that information. It's a very information-rich representation, and this is going to be great, because now we can work with that with our diffusion model and get nice high-resolution results even though we're only working with these 64 by 64 latents. Now, it doesn't have to be 64 by 64. You can go and modify this to say, what if this is 640, and encode that down, and you'll see that it's just that same factor of 8 reduction. There we go, now we have 80 by 64.
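The shape arithmetic above can be checked with a small sketch that needs no model at all. Everything here is illustrative: the function names are hypothetical, the value 0.18215 is the scaling constant I believe is commonly used with the Stable Diffusion v1 autoencoder (the talk does not state the number), and raising an error for sizes that are not multiples of eight is this sketch's own choice rather than confirmed pipeline behavior.

```python
# Shape arithmetic for the latent encoding (a sketch; no model is loaded).
# Assumptions, not taken from the talk: all names below, and 0.18215 as the
# value of the latent scaling constant.

LATENT_CHANNELS = 4      # the four 64 x 64 channels mentioned in the talk
DOWNSAMPLE_FACTOR = 8    # "a factor of eight in each dimension"
LATENT_SCALE = 0.18215   # assumed value of the scaling constant


def latent_shape(height, width):
    """Return (channels, height // 8, width // 8) for an image of this size."""
    if height % DOWNSAMPLE_FACTOR or width % DOWNSAMPLE_FACTOR:
        # The talk only guesses that the real pipeline errors here;
        # raising is a design choice of this sketch.
        raise ValueError("height and width must be multiples of 8")
    return (LATENT_CHANNELS,
            height // DOWNSAMPLE_FACTOR,
            width // DOWNSAMPLE_FACTOR)


def compression_ratio(height, width, image_channels=3):
    """Number of input values per latent value."""
    c, h, w = latent_shape(height, width)
    return (image_channels * height * width) / (c * h * w)


def scale_latents(latent_values):
    """Scale freshly encoded latents before they go to the diffusion model."""
    return [LATENT_SCALE * v for v in latent_values]


def unscale_latents(scaled_values):
    """Undo the scaling before handing latents back to the decoder."""
    return [v / LATENT_SCALE for v in scaled_values]


print(latent_shape(512, 512))       # (4, 64, 64)
print(latent_shape(512, 640))       # (4, 64, 80)
print(compression_ratio(512, 512))  # 48.0
```

Counting values rather than positions, a 3-channel 512 by 512 image has 48 times as many numbers as its 4-channel 64 by 64 latent; the factor of 64 refers to the spatial grid alone.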
And this just has to be a multiple of eight, otherwise you'll get, I think, an error. Okay, so we have our encoded version of this image, and that's pretty great.

The next component we're going to look at is the scheduler. I'll look more closely at this later, but for now we're going to focus on this idea of adding noise. During training we add some noise to an image, and then the model tries to predict what that noise is, and we do that with different amounts of noise. Here we're going to recreate the same type of scheduler (you can try different schedulers from the library). These parameters here, beta_start, beta_end and beta_schedule, describe how much noise was added at different timesteps, along with how many timesteps are used during training. For sampling we don't want to have to do a thousand steps, so we can set a new number of timesteps, and then we'll see how these correspond, with the scheduler.timesteps attribute, to the original training timesteps. Here we're going to have 15 sampling steps, and that's going to be equivalent to starting at timestep 999 and just moving linearly down to timestep zero. We can also look at the actual amount of noise present with the sigmas attribute, again starting high and moving down. If you want to see what that schedule looks like, we can plot that here, and if you want to see the timesteps, you'll see that it's just a linear relationship. So there we go: we're going to start at a very high noise value, and we're going to slowly, slowly try and reduce this down until ideally we get an image out.

Okay, so this sigma is the amount of noise added. Let's see what that looks like. I'm going to start with some random noise that's the same shape as my latent representation, my encoded image, and then I'd like to be equivalent to sampling step 10 out of 15 here, so I'm going to go and look up what timestep that equates to, and that's going to be one of the arguments that I pass to the scheduler.add_noise function. So I'm calling
scheduler.add_noise, giving it my encoded image, the noise, and what timestep I'd like to be noising equivalent to, and this is going to give me this noisy but still recognizable version of the image. You can go and say, okay, what if I look at somewhere earlier in the process? Does it look more noisy? What about right at the beginning, or right at the end? Feel free to play around there.

Okay, so with this adding noise, what are we actually doing? What does the code look like? Let's inspect the function. You'll see that there's some setup for different types of argument and shapes, but the key line is just this: noisy samples is equal to original samples plus the noise scaled by the sigma parameter. That's all it is. It's not always the same; different papers and implementations will add the noise slightly differently, but in this case that's all it's doing. So scheduler.add_noise is just adding noise that's the same shape as the latents, scaled by the sigma parameter.

If we want to start from random noise instead of a noisy image, we're going to scale it by that same sigma value so that it looks the same as an image that's been noised by that amount. But then, before we feed that to the actual model, we have to handle that scaling again. You could do it like this, but now we have this scale_model_input function associated with the scheduler, just to hide that complexity away.

Okay, so now we're going to look at the same kind of sampling loop as before, but we're going to start with our image. We're going to take our encoded image, we're going to noise it to some timestep, and then we can denoise from there. In code, we are preparing our text and everything the same as before (which we'll look at), and we're setting our number of inference steps to 50, so num_inference_steps is equal to 50 here, and we're saying I'd like to start at the equivalent of step 10 out of 50.
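The noising rule just described can be written out as a minimal, library-free sketch. The function names mirror the scheduler methods mentioned in the talk but are plain re-implementations on Python lists, not the library's code. The division by sqrt(sigma^2 + 1) in scale_model_input is what I understand this type of scheduler to do and is not stated in the talk, so treat it as an assumption; the sigma values used below, including 14.6 as a "large" starting sigma, are purely illustrative.

```python
import random


def add_noise(original, noise, sigma):
    """Key line from the talk: noisy = original + noise * sigma."""
    return [x + sigma * n for x, n in zip(original, noise)]


def scale_model_input(latents, sigma):
    """Rescale latents before the model sees them.

    Assumption: divide by sqrt(sigma^2 + 1), as I believe this scheduler
    family does; the talk only says the scaling is hidden in a helper.
    """
    return [x / (sigma ** 2 + 1) ** 0.5 for x in latents]


def pure_noise_latents(num_values, sigma_max, rng):
    """Starting from noise alone: unit Gaussian noise scaled by the top sigma."""
    return [sigma_max * rng.gauss(0.0, 1.0) for _ in range(num_values)]


rng = random.Random(0)
encoded = [0.5, -1.0, 2.0]                      # stand-in for encoded latents
noise = [rng.gauss(0.0, 1.0) for _ in encoded]  # same shape as the latents

# The same noise at three illustrative noise levels: larger sigma, noisier result.
for sigma in (0.1, 1.0, 10.0):
    print(sigma, add_noise(encoded, noise, sigma))

start = pure_noise_latents(len(encoded), 14.6, rng)  # 14.6 is illustrative
print(scale_model_input(start, 14.6))
```

With a small sigma the result stays close to the encoded values; with a large sigma the noise term dominates, which is why a latent noised to an early sampling step looks like almost pure noise.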
So I'll look up what timestep that equates to, I'll add noise to my image equivalent to that step, and then we're going to run through sampling, but this time we're only going to start doing things once we get to that start step. I'm going to ignore the first 10 out of 50 steps, and then beyond that I'm going to start with this noisy version of my input image and denoise it according to this prompt. The hope here is that by starting from something that has some of the rough structure and color of that input image, I can kind of fix that into my generation. But I've got a new prompt, "A National Geographic photo of a colorful dancer", and here we go: we see this is the same sort of thing as the parrot, but now we have this completely different actual content thanks to a different prompt.

So that's a fun use of this image-to-image process. You might have seen this for taking drawings, adding a bunch of noise, and then denoising them into fancy paintings and so on. Again, this is something that there are existing tools for: the strength parameter in the image-to-image pipeline is just something like this, what step are we starting at, how many steps are we skipping. But you can see that this is a pretty powerful technique for getting a bit of extra control over composition and color and a bit of the structure. Okay, so that's that trick with adding noise and then using that as image-to-image.

The next big section I'd like to look at is how we go from a piece of text that describes what we want into a numerical representation that we can feed to the model. We're going to trace out that pipeline, and along the way we'll see how we can modify that for a bit of fun. Step number one: we're taking our prompt and we're turning it into a sequence of discrete tokens. Here we have, in this case, 77 discrete tokens, because that's the maximum length. It's always going to be that; if your prompt is longer, it'll
truncate it. If we decode these tokens back, we'll see that we have a special token for the start of the text, then "a picture of a puppy", and then the rest is all the same token, which is this kind of end-of-text padding token. So we have this token for puppy, and this special token that has its own meaning, end of text, and the prompts are always going to be padded to be the same length.

Before, in the code that we were using there, we always jumped straight to the so-called output embeddings, which is what we fed to the model as conditioning, and so somehow this captures some information about this prompt. But now we want to say, well, how do we get there? How do we get from this sequence of tokens to these output embeddings? What is this text encoder forward pass doing? We can look at this, and there are going to be multiple steps.

The first is going to be some embeddings. If we look at text_encoder.text_model.embeddings, we'll see there are a couple of different ones. We have token embeddings, and this is to take those individual tokens, token 49408 or whatever, and map each one into a numerical representation. Here it's a learned embedding. There are about 50,000 rows, one for each token, and for each token we have 768 values, so that's the embedding of that token. If we want to feed one in and see what the embedding looks like, here's the token for puppy, and here's the token embedding: 768 numbers that somehow capture the meaning of that token on its own. We can do the same for all of the tokens in our prompt, so we feed them through this token embedding layer, and now we get 77 768-dimensional representations, one for each token.

Now, these are all on their own: no matter where in the sentence it is, the token embedding will be the same. So the next step is to add some positional information. Some models will do this with some kind of learned pattern of positioning, but in this case the positional embedding
is just another learned embedding, but now instead of having one embedding for every token, we have one embedding for every position out of all 77 possible positions. Just like we did for the tokens, we can feed in the position IDs, one for every possible position, and we'll get back out an embedding for every position in the prompt.

For combining them together, there are again multiple ways people do this in the literature, but in this case it's as simple as adding them. That's why they made them the same shape, so that you can just add the two together, and now these input embeddings have some information related to the token and some related to the position. So far we haven't seen any big model, just two learned embeddings, but this is getting everything ready to feed through that model. We can check that this is the same as if we just called the embeddings layer of that model, which is going to do both of those steps at once, but we'll see just now why we want to separate that out into individual ones.

Okay, so we have these individual tokens, and they have some positional information. We have these final embeddings, and now we'd like to turn them into something that has a richer representation, thanks to some big transformer model. So we're going to feed these through, and I made this little diagram here. Each token is going to turn into a token embedding, get combined with the positional embedding, and then it's going to get fed through this transformer encoder, which is just a stack of these blocks. Each block has some magic like attention and some feed-forward components, and there are additions and normalizations and skips and so on as well, but we're going to have some number of these blocks all stacked together, and the outputs of each one get fed into the next block, and so on, until we get our final set of hidden states. These are the encoder hidden states, a.k.a. the output embeddings, and this is what we feed to our UNet to make its predictions. So the way we get this: I just copied the text
encoder's text_model.forward method and pulled out the relevant bits. We are going to take in those input embeddings, the combined positional and token embeddings, and we're going to feed that through the text_model.encoder function, with some additional parameters around attention masking and telling it that we'd like to output the hidden states rather than the final outputs. If we run this, we can just double-check that these embeddings look just like the output embeddings we saw right at the beginning. So we've taken that one step, tokens to output embeddings, and we've broken it down into a number of smaller steps, where we have tokenization, getting our token embeddings, combining with position embeddings, feeding it through the model, and then that gives us those final outputs.

So why have we gone through this trouble? Well, there are a couple of things we can do. One is demoed here. I'm getting the token embeddings, but then I'm looking up where the token for puppy is, and I'm going to replace it with a new set of embeddings, and this is going to be just another learned embedding, of this particular token here, 2368. So I'm kind of cutting out the token embedding for puppy and slipping in this new token embedding, and I'm going to get some output embeddings which at the start look very similar to the previous ones, in fact identical, but as soon as you get past the position of puppy in that prompt, you're going to see that the rest have changed. So we've somehow messed with these embeddings by slipping in this new token embedding right at the start. If we generate with those embeddings, which is what this function is doing, we should see something other than a puppy. And sure enough, drumroll, we do: we get a cat. So now you know what token 2368 means. We've managed to slip in a new token embedding and get a different image.

Okay, what can we do with this? Why is this fun? Well, a couple of tricks. First off, we could look up
the token embedding for skunk, which is this number here, and then, instead of just replacing that in place of puppy, what if I make a new token embedding that's some combination of the embedding of puppy and the embedding of skunk? So I'm taking these two token embeddings, I'm just averaging them, and I'm inserting the result into my set of token embeddings for my prompt, in place of just the word puppy. Hopefully when we generate with this, we get something that looks a bit like a puppy and a bit like a skunk. This doesn't work all the time, but it's pretty cute when it does. There we go: a puppy-skunk hybrid.

Okay, so that's not the real reason we're looking at this. The main application at the moment of being able to mess with these token embeddings is to be able to do something called textual inversion. In textual inversion, we're going to have our prompt, tokenize it, and so on, but here we're going to have a special learned embedding for some new concept. The way that's trained is going to be outside of the scope of this notebook, but there's a good blog post and community notebooks and things for doing that. Let's just see this in application here.

There's a whole library of these concepts, the Stable Diffusion concept library, where you can browse through tons and tons, look, over 1,400 different community-contributed token embeddings that people have trained. I'm going to use this one here, this birb style. Here are some example outputs, and then these are the images it was trained on: these pretty little bird paintings done by my mother. I've trained a new token embedding that tries to capture the essence of the style, and that's represented here in this learned_embeds.bin. If you download this and then upload it to wherever your notebook's running (I have it here, learned_embeds.bin), we can load that in, and you'll see that it's just a dictionary where we have one key that's the name of my new style, and then we have this token
embedding, 768 numbers. So now, instead of slipping in the token embedding for cat, we're going to slip in this new embedding, which we've loaded from the file, into this prompt: "a mouse in the style of puppy". I tokenize, get my token embeddings, and then I'm going to slip in this replacement embedding in place of the embedding for puppy. When we generate with that, we should hopefully get a mouse in the style of this kind of cutesy watercolor-on-rough-paper image, and sure enough, that's what we get: a very cute little drawing of a mouse in an apron, apparently.

Okay, so that's a very, very cool application. Again, there's a nice inference notebook that makes this really easy. You can say "a cat toy in the style of birb style", and you don't have to worry about manually replacing the token embeddings yourself. But it's good to know what the code looks like under the hood: how we're doing that, and what stage of the text embedding process we're modifying. It's very fun to get a bit of extra control, and it's a very useful technique, because now we can augment our model's vocabulary without having to actually retrain the model itself. We're just learning a new token embedding. So it's a very, very powerful idea and really fun to play with, and like I said, there are thousands of community-contributed tokens, but you can also train your own. I think I link the notebook from here, but it's also in all the docs and so on. Here's the training notebook.

Okay, a final little trick with embeddings. Rather than messing with them at the token embedding level, we can push the whole prompt through that entire process to get our final output embeddings, and we can mess with those at that stage as well. Here I have two prompts, a mouse and a leopard. I'm tokenizing them and encoding them with the text encoder, so that's that whole process, and these final output embeddings I'm just going to mix together according to some factor and generate with that. So you can try this with, you know, a cat and a snake, and you should be able
to get some really fun, different chimeras. And, oops, a snail, apparently. Okay, well, I can't spell. But yeah, have fun with that. It doesn't have to be animals. I'd love to see what you create with these weird mixed-up generations.

Okay, we should look at the actual model itself, the key UNet model, the diffusion model. What is it doing? What is it predicting? What is it accepting as arguments? This is the call signature: we call our UNet's forward pass, and we feed in our noisy latents, the timestep (as in the training timestep), and the encoder hidden states, those text embeddings that we've just been having fun with. Doing that without any loops or anything, I'm setting up my scheduler, getting my timestep, getting my noisy latents and my text embeddings, and then we're going to get our model prediction. We'll look at the shape of that, and you'll see that this prediction has the same shape as the latents.

Given these noisy latents, what the model is predicting is the noise component of that, and actually it's predicting the noise component scaled by sigma. So if we wanted to see what the original image looks like, we could say, well, the denoised latents are going to be the current noisy latents minus sigma times the model prediction. When we're denoising, we're not going to go straight to that output prediction; we're going to just remove a little bit of the noise at a time. But it might be useful to visualize what that final prediction looks like, so that's what we're doing here: making a folder to store some images, and preparing our text, scheduler and input. Then we're going to do this loop, but now we're going to get the model prediction, and instead of just updating our latents by one step, we're also going to store an image. I'm decoding these two images, one of them an image of the predicted, completely denoised original sample, so that's this predicted original sample here. You could also calculate this yourself: latents_x0 is equal to
the current latents minus sigma times the noise prediction. Those two should work equivalently. This loop is going to run, and it's going to save those images to the steps folder, which we can then visualize. Once this finishes in a second or two, on the left we're going to see the noisy input to the model at each stage, and on the right we're going to see the noisy input minus the noise prediction, so the denoised version. Just give it a second or two to run. It's taking a little bit longer because it's decoding those images each time and saving them, but once this finishes we should have a nice little preview video.

Okay, here we go. This is the noisy latent, and if we take the model's noise prediction and subtract it from that, we get this very blurry output, and so you'll see as we play this... oh, I've left some modifications in from last time, sorry. Where you see this guidance scale, we'll put it back at, I think it was eight. In the next section we'll talk about classifier-free guidance, and I'd been modifying that example. My bad. I might cut this out of the video, we'll see. So I've got to wait a few seconds again for that to generate, and I'll do so as patiently as I can.

Okay, so here we go again: the noisy inputs and the predicted denoised version. You can see at the start it's very blurry, but over time it gradually converges on our final output. You'll notice that on the left, these are the latents as they are at each step. They don't change particularly drastically, just a little bit at a time. But at the start, when the model doesn't have much to go on, its predictions do change quite a bit at each step. It's much less well defined. Then, as we go forward in time, it gets more and more refined, with better and better predictions, so it's got a more accurate estimation of the noise to remove, and we remove that noise gradually until we finally get our output. It's quite fun to visualize the process, and hopefully that helps you understand
why we don't just make one prediction and do it in one step, because we get this very blurry mess. Instead we do this kind of iterative sampling, which we'll talk about very shortly.

Before then, though, the final thing I should mention is classifier-free guidance. What is that? Well, like you saw when I accidentally generated the version with a much lower guidance scale, the way classifier-free guidance works is that in all of these loops we haven't actually been passing one set of noisy latents through the model; we've been passing two identical versions. And as our text embeddings, we've not just been passing the embeddings of our prompt, these ones here; we've been concatenating them with some unconditional embeddings as well. What the unconditional embeddings are is just a blank prompt, no text whatsoever, so just all padding, passed through in the same way. So when we get our predictions here, we've given it two sets of latents and two sets of text embeddings, and we're going to get out two predictions for the noise. We'll be splitting that apart: one prediction for the unconditional, no-prompt version, and one prediction based on the prompt.

What we can do now is say, well, my final prediction is going to be the unconditional version plus the guidance scale times the difference. If you think about it, if I predict without the text encoding, I'm predicting here, and if I predict with the text encoding, with the prompt, I get this prediction instead, and I'd like to move more in that direction. I'd like to push it even further towards the prompt version and beyond. So this guidance scale can be larger than one to push it even more in that direction, and this, it turns out, is kind of key for getting it to follow the prompt nicely. I think it was first brought up in the GLIDE paper; AI Coffee Break on YouTube has a great video on that. But yeah, it's a really useful trick, or a really neat hack, depending on who you talk to, but it does
seem to work, and the higher the guidance scale, the more the model will try and look like the prompt, kind of in the extreme, whereas with a lower guidance scale it might just try and look like a generic good picture.

Okay, we've been hiding away some complexity in terms of this scheduler.step function, so I think we're going to step away from the notebook now and scribble a bit on some paper to try and explain exactly what's going on with sampling and so on, and then we'll come back to the notebook for one final trick.

All right, so here's my take on sampling. To start with, I'd like you to imagine the space of all possible images. This is a very large, high-dimensional space: for a 256 by 256 by 3 image, it's about 200,000-dimensional, and my paper unfortunately is only two-dimensional, so we're going to have to squish this down a fair bit and use our imagination. Now, if you just look at a random point in this space, it's most likely not going to look like anything recognizable. It'll probably just look like garbled noise. But if we map an image into this space, we'll see that it sits at some fixed point, and a very similar image, almost pixel-equivalent, is going to be very close by.

Now, there's this theory that you'll hear talked about called manifold theory, which says that most real images, like a dataset of images, are going to lie on some lower-dimensional manifold within this higher-dimensional space. In other words, if we map a whole bunch of images into the space, they're not going to fill the whole space; they're going to be clustered onto some surface. I've drawn it as a line here because we're stuck with 2D, but this is a much higher-dimensional equivalent of a plane. Okay, so each of these ones here is some image, and the reason that I'm starting with this is because we'd like to generate images. We'd like to generate plausible-looking images, not just random nonsense, and we'd like to do that with diffusion models. So where do they come in? Well, we can
start with some image here, some real image from our training data, and we can push it away from the manifold of plausible, existing images by corrupting it somehow, for example by just adding random noise. That's equivalent to moving in some random direction in this space of all possible images, so that's going to push the image away. Then we can try and predict, using some model, what this noise looks like. How do I go from here back to a plausible image? What is this noise that's been added? That's going to be our big UNet that does that prediction; that's going to be our diffusion model. In this language, that's going to be called something like a score function: from wherever I am, what's the noise that I need to remove to get back to a plausible image?

Okay, so that's all well and good. We can train this model with a number of examples, because we can just take our training data, add some random noise, try and predict the noise, and update our model parameters, so we can hopefully learn that function fairly well. Now we'd like to generate with this model, so how do we do that? Well, we can start at some random point, let's start over here, and you might think, well, surely I can just predict the noise, remove that, and then I get my output image. That would be great, except that you've got to remember that now we're starting from a random point in the space of all possible images. It just looks like garbled nonsense, and the model's trying to say, well, what does the noise look like? You can imagine that for training, the further away we get from our examples, the sparser our training will have been. But also, it's not at all obvious how we got to this noisy version. We could have come from this image over here and added a bunch of noise, or we could have come from one over here, or one over here. So this model is not going to be able to make a perfect prediction. At best it might
say, well, somewhere in that direction. It could point towards something like the dataset mean, or at least the edge that's closer, but it's not going to be able to perfectly give you one nice solution. And sure enough, that's what we see if we sample from diffusion models in just one step. If we get the prediction and look at what that corresponds to as an image, it's just going to look like a blurry mess, maybe like the mean of the data, or some sort of garbled output. It's definitely not going to look like a nice image.

So how do we do better? For the idea of sampling there are a couple of framings, so I'll start with the existing framing that you'll see talked about a lot with score-based models and so on, and then we'll talk about some other ways to think about it as well. This process of gradually corrupting our images, adding a little bit of noise at a time, is something people like to talk of as a stochastic differential equation: stochastic because there's some randomness (we're picking random amounts of noise and random directions to add), and a differential equation because it's not talking about anything absolute, just how we should change this from moment to moment to get more and more corrupted. With that framing, the question of how I now go back to the image is framed as solving an ordinary differential equation that corresponds to the reverse of this process.

Now, you can't solve ODEs in a single step, but you can find an approximate solution, and the more sub-steps you take, the better your approximation. That's what these samplers are doing. Given that we said we're at this image over here, and here's my prediction, rather than moving the whole way there in one go, we'll remove some of that noise, do a little update, and then we'll get a new prediction. Maybe now the prediction is slightly better. It says "up here", so we move a little bit in that direction, and now it makes an
even better prediction, because as we get closer to the manifold, as we have less and less noise and more and more of some image emerging, the model is able to make more accurate predictions. So in some number of steps we divide up this process, and we get closer and closer and closer until we ideally find some image that looks very plausible as our output. That's what we're doing here with a lot of these samplers: they're effectively trying to solve this ODE in some number of steps by breaking the process up and only moving a small amount at a time.

Now, you get first-order solvers, where all we're doing is just moving linearly within each step. This is equivalent to something called Euler's method (or "Yuler's" method, if you're like me and you've only ever read it), and this is what some of the most basic samplers are doing: just linear approximations for each of these little steps.

But you also get additional approaches. For example, maybe if we were to make a prediction from here, it might look something like this, and if we were to make a prediction from here, it might look something like that. So we have our error here, but as we move in that direction, it's also changing. There's a derivative of a derivative, a gradient of a gradient, and that's where this kind of second-order solver comes in and says, well, if I know how this prediction changes as I move in this direction, what its derivative is, then I can account for that curvature when I make my update step, and maybe know that it's going to curve a bit in that direction. That's where we get things like these so-called second-order solvers and higher-order solvers. The upside of this is that we can take a larger step at a time, because we have a more accurate prediction. We're not just doing a first-order linear approximation; we have this curvature taken into account. The downside is that to estimate
that curvature for a given point we might need to call our model multiple times to get multiple estimates, and that takes time. So we can take a larger step, but we need more model evaluations per step.

A kind of hybrid approach is to say: rather than trying to estimate the curvature here, I'll just take a linear step and look at the next prediction, but I'll keep a history of my previous steps. So over here it predicts like this, and now I have this history, and I'm going to use it to better guess what the trajectory is. I might keep a history of the past three or four or five predictions, and since they're quite close to each other, maybe that tells me some information about the curvature here, which I can use to again take larger steps. That's what we see with the so-called linear multistep samplers: they just keep a buffer of past predictions to try to do a better job of estimating than the simple one-step, linear, first-order solvers.

Okay, so that's the score-based sampling version, and all of the variants and innovation come down to things like: how can we do this in as few steps as possible? Maybe we have a schedule that says we take larger steps at first and then gradually smaller steps as we get closer. I think there are now some dynamic methods too, which ask whether we can estimate how many steps we need to take, and so on. That's all trying to attack it from this score-based, ODE-solving framework.

But there's another way to think of this as well, and that's to say: I don't really care about solving this exact reverse ODE. All I care about is that I end up with an image that's on this manifold, a plausible-looking image. I have a model that estimates how much noise there is. If that noise is very small, that means I've got a good image, and if that noise is really large, that means I've got some work to do. And this starts bringing up some
analogies to training neural networks, because with neural networks we have the space of all possible parameters, and we're trying to adjust those parameters not to solve the gradient flow equation (although in theory you might try to do that); we don't care about that. We just want to find a minimum, a point where our loss is really good. So when we're training a neural network, that's exactly what we do: we set up an optimizer, we take some number of steps trying to reduce some loss, and once that loss reduces over time and levels off, okay, cool, I guess we've found a good neural network.

We can apply that same kind of thinking here. I'll start at some point, and I have an estimate of the gradient, maybe pointing over here. But remember that estimate is not very good, just like the first gradients estimated when training a neural network are pretty bad, because it's all just randomly initialized weights; hopefully it at least points in a useful direction. Then I'll take some step, and for the length of this step I won't try to do some fancy schedule: I'll just offload it to an off-the-shelf optimizer. So I have some learning rate, maybe something like momentum, that determines how big a step I take, and then I update my prediction, take another step in that direction, and so on. Now, instead of following a fixed schedule, we can use tricks that have been developed for training neural networks (adaptive learning rates, momentum, weight decay, and so on) and apply them to this sampling case.

It turns out this works okay. I've tried it for Stable Diffusion; it needs some tricks to get it working, but it's a slightly different way of thinking about sampling. Rather than relying on a hard-coded ODE solver that you've figured out yourself, you say: why don't we treat this like an optimization problem where
if the model predicts almost no noise, that's good, we're doing a good job, and if the model predicts lots of noise, then we can use that as a gradient, take a gradient update step according to our optimizer, and try to converge on a good image as our output. You can also stop early: once your model's predicted amount of noise is sufficiently low, okay, cool, I'm done. I've found that in 10 to 15 steps you can get some pretty good images out. So that's a different way of viewing it. It's not so popular at the moment, but hopefully it's something we'll see more of. It's just a different framing, and for me at least it helps me think about what we're actually doing with these samplers: we're trying to find a point where the model predicts very little noise, starting from a bad prediction and moving towards a better one, using this estimated noise as our gradient and iteratively removing bits of it at a time.

So I hope that helps elucidate the different kinds of samplers and the goal of the whole thing, and also illustrates why we don't just do this in a single step, why we need some sort of iterative approach; otherwise we'd end up with very bad, blurry predictions. Now we're going to head back to the notebook to talk about our final trick: guidance.

Okay, the final part of this notebook: guidance. How do we add some extra control to this generation process? We already have control via the text, and we've seen how we can modify those embeddings. We have some control by starting at a noisy version of an input image rather than pure noise, to control the structure. But what if there's something else? What if we'd like a particular style, or to enforce that the output looks like some input image, or maybe sticks to some color palette? It would be nice to have some way to add this additional control, and the way we do this is to look at
some loss function on the decoded, denoised, predicted image (the predicted denoised final output), and use that loss to update the noisy latents as we generate, in a direction that tries to reduce that loss. For the demo we're going to make a very simple loss function: I would like the image to be quite blue. To enforce that, my error is going to be the difference between the blue channel (red, green, blue: blue is the third of the color channels) and 0.9. So the closer all the blue values are to 0.9, the lower my error will be. That's going to be my guidance loss.

Then during sampling, everything is going to be the same as before, except that every few iterations (you could do it every iteration, but that's a little slow, so here it's every five iterations) I'm going to set requires_grad=True on the latents. I'm then going to compute my predicted denoised version, decode that into image space, calculate my loss using my special blue loss, and scale it with some scaling factor. Then I'm going to use torch to find the gradient of this loss with respect to those noisy latents, and I'm going to modify them. Since I want to reduce the loss, I subtract this gradient, multiplied by sigma squared because we're going to be working at different noise levels.

If we run this, we should see the same sort of sampling process as before, but now we're also occasionally modifying our latents by looking at the gradient of the loss with respect to those latents and updating them in a direction that reduces that loss. And sure enough, we get a very nice blue picture out. If I change the scale down to something lower and run it, the scale is lower, so the loss is lower, so our modifications to the latents are smaller, and we get out a much less blue image. There we go, so that's
the default image, very red and dark, because the prompt is just a picture of a campfire. But as soon as we add our additional loss, our guidance, we get out something that better matches the additional constraint we've imposed. This is very useful not just for making your images blue but, like I said, for color palettes, or using some classifier model to make the output look like a specific class of image, or using a model like CLIP to associate it with some text. So there are lots and lots of different things you can do.

Now, a few things I should note. One: decoding the image back to image space, calculating a loss, and then tracing back is very computationally intensive compared to just working in latent space. We can do it on only every fifth iteration to reduce the time, but it's still much slower than plain sampling. And two: we're actually still cheating a little bit here, because what we should do is set requires_grad=True on the latents, use those to make our noise prediction, use that to calculate the denoised version, decode that, calculate our loss, and trace back all the way through the decoder, the denoising step, and the U-Net to the latents. The reason I'm not doing that is that it takes a lot of memory. You'll see, for example, that the CLIP-guided diffusion notebook from the Hugging Face examples does it that way, but they have to use tricks like gradient checkpointing to keep the RAM usage under control. For simple losses it works fine to do it this way, because now we're just tracing back through "denoised latents equals latents minus sigma times the noise prediction", so we don't have to trace any gradients back through the U-Net. But if you wanted more accurate gradients, maybe because it's not working as well as you'd hoped, you can do it that other way that I described. However you do it, it's a very, very powerful technique, and it's fun to be able to inject some
additional control into this generation process by crafting a loss that expresses exactly what you'd like to see.

All right, that's the end of the notebook for now. If you have any questions, feel free to reach out to me: I'll be in the forums, and you can find me on Twitter and so on. But for now, enjoy, and I can't wait to see what you make.
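
As a rough sketch of the guidance step described in this last section (not the notebook's actual code), here is a toy version in PyTorch. The blue loss and the "subtract the gradient times sigma squared" update follow the description above; everything else is an assumption made so the example runs on its own: `toy_decode` is a hypothetical stand-in for the VAE decoder, the noise prediction is a fixed random tensor rather than a U-Net output, and sigma is held constant instead of following a noise schedule.

```python
import torch

def blue_loss(images):
    """Mean absolute difference between the blue channel and 0.9."""
    # images: (batch, 3, height, width) with values in [0, 1]; index 2 is blue
    return torch.abs(images[:, 2] - 0.9).mean()

def toy_decode(latents):
    """Hypothetical stand-in for the VAE decoder (an assumption, not the real model)."""
    # Squash the first three latent channels into [0, 1] and treat them as RGB
    return torch.sigmoid(latents[:, :3])

def guidance_step(latents, noise_pred, sigma, scale):
    """One guidance update: nudge the noisy latents to reduce the scaled loss."""
    latents = latents.detach().requires_grad_()
    # Predicted denoised latents; noise_pred is treated as a constant,
    # so no gradients flow back through the noise-prediction model
    denoised = latents - sigma * noise_pred
    loss = blue_loss(toy_decode(denoised)) * scale
    grad = torch.autograd.grad(loss, latents)[0]
    # Step against the gradient, scaled by sigma squared
    return latents.detach() - grad * sigma**2

def run(scale, steps=50, sigma=1.0):
    """Apply the guidance update repeatedly and return the final (unscaled) blue loss."""
    torch.manual_seed(0)
    latents = torch.randn(1, 4, 8, 8)              # toy 4-channel latents
    noise_pred = 0.1 * torch.randn_like(latents)   # placeholder for a model output
    for _ in range(steps):
        latents = guidance_step(latents, noise_pred, sigma, scale)
    return blue_loss(toy_decode(latents - sigma * noise_pred)).item()

initial_loss = run(scale=0.0)       # no guidance: latents are left unchanged
loss_low_scale = run(scale=10.0)    # weak guidance
loss_high_scale = run(scale=100.0)  # strong guidance
print(initial_loss, loss_low_scale, loss_high_scale)
```

In this toy setup the larger scale ends up with the lower blue loss, which mirrors the behaviour described above: a smaller scale means smaller modifications to the latents and a less blue result. In the real loop this update would only be applied every few sampling iterations, with the regular scheduler step in between.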

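
Going back to the sampling discussion from earlier in this walkthrough, here is a minimal one-dimensional sketch of why a single step gives a "blurry mean" while many small Euler steps land on a real data point. It is a toy under stated assumptions, not how Stable Diffusion is implemented: the "dataset" is just the two values -1 and +1, and instead of a trained network the noise predictor uses the closed-form ideal denoiser for that dataset, `tanh(x / sigma**2)`. All function names and the noise schedule are illustrative.

```python
import math

def ideal_denoiser(x, sigma):
    """Expected clean value given noisy x, for toy data equally likely to be -1 or +1."""
    return math.tanh(x / sigma**2)

def predict_noise(x, sigma):
    """Noise estimate implied by the denoiser: x = x0 + sigma * noise."""
    return (x - ideal_denoiser(x, sigma)) / sigma

def one_step_sample(x, sigma):
    """Remove all of the predicted noise in a single jump."""
    return x - sigma * predict_noise(x, sigma)

def euler_sample(x, sigma_max=10.0, sigma_min=0.01, n_steps=50):
    """First-order (Euler) sampling: many small moves along the predicted direction."""
    # Geometric noise schedule from sigma_max down to sigma_min, then a final step to 0
    sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (n_steps - 1))
              for i in range(n_steps)] + [0.0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        direction = predict_noise(x, sigma)          # re-predict at the current point
        x = x + (sigma_next - sigma) * direction     # move only part of the way
    return x

x_start = 4.0  # a noisy starting point at sigma = 10
single = one_step_sample(x_start, 10.0)
multi = euler_sample(x_start)
print(single, multi)
```

Here the one-step result sits close to 0, the mean of the two data points, while the multi-step result ends up at one of the actual data points. A second-order or linear multistep sampler would change how each `direction` is estimated, but the outer loop has the same shape.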