Lesson 25: Deep Learning Foundations to Stable Diffusion

Jeremy Howard15,777 words

Full Transcript

hi everybody and welcome to the last lesson of part two greetings jono and greetings Tunisia how are you guys doing good thanks doing well excited for the last lesson it's been an interesting fun Journey yeah I should explain we're not quite completing all of stable diffusion in this part of the course there's going to be one piece left for the next part of the course which is the clip embeddings and that's because clip is NLP and so the next part of the course we will be looking at NLP so we will end up finishing stable diffusion from scratch but we're going to have to have a significant diversion and what we thought was given everything that's happened with um gpt4 and stuff since we started this course we thought it makes more sense to delve into that quite deeply more more more soon um and delay clip as a result so hopefully people there feel comfortable with that decision but I think we'll have a lot of yeah exciting NLP material coming up so that's the rough plan um all right so [Music] um I think what we might do is maybe start by looking at a really interesting and quite successful application of pixel level diffusion by applying it not to pixels that represent an image that pixels that represent a sound which is pretty crazy so maybe jono of course it's going to be jono he does the crazy stuff which is great so jono show your your crazy and crazily successful approach to diffusion for pixels of sounds plates sure thing right so this is um going to be a little bit of Show and Tell and most of the code in the notebook is just copied and pasted from I think Netbook 30. um but we're going to be trying to generate something other than just images and so specifically I'm going to be loading up a data set of bird calls these are just like short samples of I think 10 different classes of birds calling um and so we need to understand like okay well this is a totally different domain right this is audio if you look at the data like let's look at an example of the data this is coming from a hugging face data set so that that line of code will download it automatically if you haven't got it before right yeah yeah so this will download it into a cache and then sort of handle um you created this data set right did you is this already a data set you found somewhere else or you made it or what um this is a subset that I made from a much larger data set of longer call recordings um from an open website called xenocanto so they collect all of these sound recordings from people they have experts who help identify what birds are calling um and so all I did was find the audio Peaks like where is there most likely to be a bird call and clip around those just to get a smaller data set of things where there's actually something happening yeah not a particularly amazing data set um in terms of like the recordings have a lot of background noise and stuff but a fun small audio one to play with um yeah and so when we talk about audio you've got a microphone somewhere it's reading like a pressure level essentially um in the air with these sound waves and it's doing that some number of times per second so we have a sample rate and in this case the data has a sample rate of 32 000 samples per second so every second the waveform that's being approximated there's lots of little up across up across up across kind of things basically yeah right yeah um and so that's great for you know capturing the audio um but it's not so good for modeling because we now have 30 000 values per second in this one big 1D array um and so yeah you can try and find models that can work with that kind of data but what we're going to do is a little hack and we're instead going to use something called a spectrogram so to illustrate the original data is the main issue it's it's it's too big and slow to work with it's it's too big um but also you have some like some sound waves are at 100 Hertz right so they're they're going up and down 100 times a second and some are at a thousand and some are ten thousand and often there's background noise they can have extremely high frequency components and so if you're looking just at the waveform there's lots and lots of change second to Second and there's some very long range dependencies of like oh they're generally High here it's generally low there and so it can be quite hard to capture those patterns um and so part of it is it's just a lot of samples to deal with um but part of it also is that it's not like an image where you can just do like convolution and things nearby each other tend to be related or something like that um it's quite tricky to disentangle what's going on um and so we have this idea of something called a spectrogram uh this is a fancy 3D visualization but it's basically just taking that audio and mapping time on one axis so you can see as time goes by we're moving along the x-axis and then on the y-axis is frequency and so the um the Peaks here show like intensity at different frequencies and so if I make a pure note [Music] you can see that that screen maps in the frequency domain but when I'm talking there's lots and lots of Peaks and that's because our voices tend to produce a lot of overtones so if I go you can see there's a main note but there's also these subsequent notes and if I play something like a chord [Applause] you can see this you know maybe three main Peaks and then each of those have these harmonics as well so it captures a lot of information about the signal um and so we're gonna turn our audio data into something like this where even just visually if I'm a bird you can see this really nice spatial pattern yeah and the hope is if we can generate that and then if we can find some way to turn it back into audio and then we'll be Off to the Races and so yeah that's what I'm doing in this notebook we have um I'm leaning on the diffusers dot pipelines.audiodafusion.mel class and So within the realm of spectrograms there's a few different ways you can do it so this is from the torch audio docs but this notebook is from the hugging face diffusion models class so we had that waveform that's those raw samples and we'd like to convert that into what they call the frequency domain um which is things like these spectrograms um and so you can do a normal normal spectrogram a power spectrogram or something like that but we often use something called the Mel spectrogram which is exactly the same it's actually probably what's being visualized here and it's something that's designed to map The View like the frequency ranges into a range that's um like tied to what human hearing is based on and so rather than trying to capture all frequencies from you know zero Hertz to 40 000 Hertz a lot of which we can't even hear it focuses in on the range of values that we tend to be interested in as as humans and also it does like a transformation into into kind of like a log space um so that the the intensities like highs and lows correspond to loud and quiet for human hearings it's very tuned for um the types of audio information that we actually might care about rather than you know tens of thousands of kilohertz that only bats can hear um okay so we're going to rely on a class to abstract this away but it's going to basically give us a transformation from waveform to spectrogram and then it's also going to help us go from spectrogram back to waveform um and so let me show you my data I have this two image function that's going to take the audio array it's going to use the Mel um class to handle turning that into um spectrograms and the class also does things like it splits it up into chunks based on you can set like a desired um resolution I'd like 128 by 128 spectrogram it says okay great I know how many I know you need 128 like frequency bins for the frequency axis and 128 um steps on the on the time axis so it kind of handles that converting and resizing um and then it gives us these audio slice to image so that's taking a chunk of audio and turning it into the spectrogram and it also has the inverse um so our data set is fairly simple we're just referencing our original audio data set but we're calling that to image function and then returning it into a tensor and we're mapping it to minus 1.5 to 0.5 so similarly to what we've done with like the grayscale images in the past um so if you look at a sample from that data we now have instead of an audio waveform of 32 000 or 64 000 if it's two seconds samples we now have this 128 by 128 pixel um spectrogram which looks like this and it's just it's gray scale so this is just matplotlibs colors um but we can test out going from the spectrogram back to audio using the image to audio function that the the Mel class has and that should give us um now this isn't perfect because the spectrogram shows the intensity at different frequencies but with audio you've also got to worry about something called the phase and so this image to audio function is actually behind the scenes doing a kind of iterative approximation with something called the Griffin Lin algorithm so I'm not going to try and describe that here but it's just it's approximating it's guessing what the phase should be it's creating a spectrogram it's comparing that to the original it's updating it's doing the sort of like iterative very similar to like an optimization thing to try and generate an audio signal that would produce the spectrogram which we're trying to invert so just to clarify that so my understanding what you're saying is that the spectrogram is a lossy conversion of the sound into an image and specifically it's lossy because it tells you the kind of intensity at each point but it's not it's kind of like is it like the difference between a sine wave and a cosine wave like they're just shifted in different ways we don't know how much it's shifted so coming exactly the sound you do have to get that that shifting the phase correct and so it's trying to guess something and it sounds like it's not doing a great guess from the the original audio is also not that amazing um but yes the the spectrogram back to audio task this these dotted lines are like highlighting this is yeah it's an approximation and there are deep learning methods now that can do that um better or at least that sound much higher quality um because you can train a model somehow to go from this image like representation back to a an audio home signal and but we just use the approximation for this notebook um okay so now that we can represent our data as like a grayscale 128 by 128 pixel image um everything else becomes very much the same as the previous diffusion models examples we're going to use this noisify function to add different amounts of noise and so you can see now we have our spectrograms but with varying amounts of noise added we can create a simple diffusion model I'm just copying and pasting the results but with one extra layer um just with very few channels to go from 128 to 64 to 32 to 8 to 6 I mean to 16 by eight to eight um no attention just I think pretty much copied and pasted from notebook 30 and train it for in this case 15 epochs it took about I don't know this is interesting you're using simple diffusion yes um so specifically this is the simple diffusion model that you um I think I've already introduced maybe not um yeah I think we briefly looked at it so let's remind ourselves of what it does here oh okay yeah um so we have some number of down blocks with a specified number of channels and then the key Insight from simple diffusion was that you often want to concentrate the computes in the sort of middle at the low resolution so that's these these mid blocks and they're Transformers yes um yeah um and so we can stack some number of those and then um the corresponding path and this is a Unix so we're passing in the features from there the down path as you go through those up blocks um and so we're gonna go take in um image and time step we can embed the time step we're going to go through our down blocks and saving the results we're going to go through the mid blocks there we go through the Note Blocks yeah before that you've also got the um embedding of the uh locations at self.le is the loanable embeddings using scale and shift I remember all right so this is preparing it to go through the Transformer blocks by adding some learnable embeddings cool um right and then we're reshaping it to be effective gear sequence since that's how we had written our transform it to have expected 1D sequence of embeddings and so once you've gone through those mid blocks we can reshape it back and then we go through the up blocks passing in and also are saved outputs from the down path um yeah so it's a nice it's a nice model you can really control how much parameters and compute you're doing just by setting like what are the number of features or channels at each of those down block stages and how many mid blocks are you going to stack um and so if you want to scale it up it's quite easy to term let me just add more mid blocks maybe I'll add more channels and there's a very easy Mouse to tweak to get a larger or smaller model fun fun thought Shadow is um simple diffusion only came out a couple of months ago and I don't think I think ours might be the first publicly available code Forex I don't think the authors released the code I suspect this is probably the first time maybe it's ever been used to generate audio before possibly yeah I guess I know a couple of people who've at least privately done their implementations when I asked the author if he was releasing Cody said uh but it's important it's just a bunch of Transformer blocks you can get yourself I'll release it eventually um no maybe maybe not I've done it malignant but they were like oh you can see the pseudocode it's pretty easy apparently yeah it is pretty easy yeah cool so trains the last goes down as we hope um sampling is exactly the same as generating images normally and that's going to give us the spectrograms so I'm using DDM sampling with 100 steps um and to actually listen to these samples we're then are just going to use that image to audio function again to take our grayscale image um and in this case actually it expects a pil image so first converted it to pil um and then turn that back into audio and so we can play some of the generated samples wow yeah that's so cool I don't know that I could guarantee what bird is making these calls and some of them are better than others like some of them um yeah some of the original samples sounds like that so exactly yeah so yeah that's generating and fake bird calls with with um spectrogram diffusion there's projects that do this on music um so the refusion projects and yeah there's various other like pre-trained models that do diffusion on spectrograms to produce um you know music clips or Voice or whatever um I may have Frozen refusion is actually the stable diffusion model that's that's fine-tuned specifically for for this for the spectrogram generation which is which I find very impressive it's like a model that was originally for you know text to image is instead can also generate these spectrograms I guess there's still some useful information in you know this sort of text image model that kind of generalizes or you can still be used for text to audio so I found that a very interesting impressive application as well so refusion is an awesome name yeah indeed it is yeah cool and I guess since it's a latent model that leads us on to the next topic right I was just gonna say we've got a natural segue there yes so we're um if we want to replicate refusion then um we'll need latents yeah so the the final non-nlp part of stable diffusion is this ability to use the more compressed representation created by a vae called latents instead of pixels so we're going to start today by creating a vae taking a look at how it works um so to remind you as we learned back in the the first lesson of this part of part two um the vae model converts the um 256 by 256 pixel 3 Channel into a um is it 64 by 64 by 4 it'll be 32 if it's 256. I think it's 512 to 16. oh 512 64. okay so do a 32 by 32 by four so dramatically smaller which makes life so much easier um which is which is really nice um having said that you know simple diffusion does the first you know few in fact you know all the down sampling pretty quickly and and all the hard work happens you know at a 16 by 16 anyway so maybe it's you know with simple diffusion it's not as big a deal as it used to be but it's still you know it's very handy particularly because for us folks with more normal amounts of compute we can take advantage of all that hard work that the uh the uh stability.ai computers did for us by creating the the stable diffusion vae um so that's what we're going to do today but first of all we're going to create our own um so let's do a vae using fashion mnist so the first or the first stuff is just the the normal one thing I am going to do for this simple example though is I'm going to flatten the um fashion mnist pixels into a vector to make it as simple as possible um okay so uh We've we're going to end up with vectors of length 784 because 28 by 28 784. we're going to create a single kitten layer MLP with um 400 hidden and then 200 outputs so here's a linear layer so it's a sequential containing a linear and then an optimal activation function then an optional normalization we'll update init weights so that we initialize linear layers as well so before we create a vae which is a variational autoencoder we'll create a normal Auto encoder we've done this once before and we didn't have any luck in fact we were so unsuccessful that we decided to go back and create a loaner and come back a few weeks later once we knew what we were doing so here we are we're back we think we know what we're doing so we're just going to recreate an auto encoder just like we did some lessons ago so there's going to be an encoder which is a sequential which goes from our 768 inputs to our 400 hidden and then a linear layer with our 400 hidden and then an output layer from the 400 hidden to the 200 outputs of the encoder so there we've got our latents and then the decoder will go from those 200 latents to our 400 hidden have our hidden layer and then come back to our 768 inputs um all right so we can optimize that in the usual way using atom and we'll do it for 20 epochs runs pretty quickly because it's quite a small data set and quite a small model um and so what we can then do is we can grab a batch of our X who actually grab the batch of X earlier uh way back here so I've got my batch of images and we can put it through our model pop it back on the CPU and we can then have a look at our original Mini batch and we have to reshape it to 28 by 28 because we previously had flattened it so there's our original and then we can look at the result after putting it through our model and there it is and as you can see it's you know very roughly regenerated and so this is um not a massive compression it's compressing it from 768 to 200. um and it's also not doing an amazing job of recreating the original details um but you know this is the simplest possible Auto encoder so it's doing you know it's a lot better than our previous attempt so that's good um so what we could now do is we could just generate some noise and then we're not even going to do diffusion we're just going to go and say like okay we've got a decoder so let's just decode that noise and see what it creates and the answer is not anything great I mean I could kind of recognize that might be the start of a shoe maybe that's the start of a bag I don't know but it's not doing anything amazing so we have not successfully created an image generator here but there's a very simple step we can do to make something that's more like an image generator the problem is that um these 200 um this Vector of length 200 we're creating there's no particular reason that things that are not in the data set are going to create items of clothing we haven't done anything to try to make that happen we've only tried to make this work for the things in the data set you know and um therefore when we just randomly generate a bunch of you know a vector of length 200 or 16 vectors of length 300 in this case and then decode them there's no particular reason to think that they're going to create something that's recognizable as clothing so the way a vae tries to fix this is by we've got the exact same encoder as before except it's just missing its final layer its final layer has been moved over to here I'll explain why there's two of them in a moment so we've got the inputs to Hidden the hidden to Hidden and then the hidden to latents the decoder is identical okay latency hidden head into hidden head into inputs um and then just as before we call the encoder um but we do something a little bit weird next which is that we actually have two separate final layers we've got one called mu for the final of the encoder and one called LV which stands for log variance so our encoder has two different final layers so we're going to call both of them Okay so we've now got two encoded 200 long lots of latents what do we do with them what we do is we use them to Generate random numbers and the random numbers have a mean of mu um so when you take a round of zero one so this creates zero one random numbers mean zero standard deviation one so if we add mu to it they now have a mean of mu or approximately and if you multiply the random numbers by half of log variance e to the power of that right so given this log of variance this is going to give you standard deviation so this is going to give you a standard deviation of e to the half LV and a mean of mu y the half it doesn't matter too much but if you think about it standard deviation is the square root so the variance is squared so when you take the log you can move that um that uh half into the multiplication because of the log trick that's why we've just got the half here instead of the square root which would be to the power of a half um so this is just yeah this is just the standard deviation so we've got the standard deviation times normally distributed random noise plus mu so we end up with normally distributed numbers we're going to have 200 of them for each element of the batch where they have a standard deviation of the result of this final layer and a variance which is the result or log variance so if the result of this final layer and then finally we passed that through the decoder as usual I'll explain why we passed back three things but for now we'll just worry about the fact we passed back the result of the decoder so what this is going to do is it's going to generate um the the result of calling encode is going to be a little bit random on average you know it's still generating exactly the same as before which is the result of a sequential model with you know MLP with one hidden layer but it's also going to add some Randomness around that right so this here's the bit which is exactly the same as before this is the same as calling in code before but then here's the bit that adds some randomness to it and the amount of Randomness is also itself random okay so then that gets run through the decoder um okay so if we now just [Music] um or you know train that right using the result of the decoder and using um I think we didn't use MSE last we used uh binary cross entropy loss which we've seen before so if you've forgotten you should definitely go back and re-watch that are really part one um or we've done a bit of it in part two as well binary cross entropy loss um with loggetts means that you don't have to worry about doing the softmax it does the soft Max for you um so if we just um optimize this using BCE now you would expect and it would I believe I haven't checked that it would basically take this final like this layer here and turn these all into zeros as a result of which it would have no variance at all and therefore it would behave exactly the same as the previous Auto encoder does that sound reasonable to you guys yeah okay um so that wouldn't help at all because what we actually want is we want some variants and the reason we want some variance is we actually want to have it generate some latents which are not exactly our data they're around our data but not exactly our data and when it generates latents that are around our data we want them to decode to our to the same thing we want them to decode to the correct image and so as a result if we can train that right something that it it does include some variation and still decodes back to the original image then we've created a much more robust model and then that's something that we would help then that we would hope then when we say okay well now decode some noise it's going to decode to something better than this so that's the idea of a vae so how do we get it to create um a log variance which doesn't just go to zero um well we have a second loss term it's called the KL Divergence loss we've got a care called kod loss and what we're going to do is our vae loss is going to take the branding across entropy between the actual decoded bit so that's input 0 and the target okay so that's this is exactly the same as before is this binary cross entropy and we're going to add it to this kld loss K all diversions now KL Divergence the details don't matter terribly much what's important is when we look at the kld loss it's getting past the input and the targets but if you look it's not actually using the targets at all so if we pull out in the the input into it three pieces which is our predicted image our mu and our log variance we don't use this either so the BCE loss only uses the predicted image and the actual image the KL Divergence loss only uses mu and log variance and all it does is it returns a number which says for each item in the batch is Mu close to zero and is log variance close to one how does it do that well for me it's very easy mu squared so if if mu is close to zero then minimizing mu squared does exactly that right if mu is one then mu squared is one if mu is minus one mu squared is one If U is zero mu squared is zero that's the lowest you can get for a squared um Okay so we've got a mu squared piece here um and we've got a dot mean so we're just taking that's just basically taking the mean of all the mules and then there's another piece which is we've got log variance minus e to the power of log variance so if we look at that so let's just grab a bunch of numbers between neg3 and 3 and do number minus e to the power of that number and I'm just going to pop in the 1 plus and the 0.5 times as well they matter much and you can see that's got a minimum of zero so when that's a minimum of zero e to the power of that which is what we're going to be using it's actually half times e to the power of that but that's okay is what we're going to be using in our um dot forward method that's going to be e to the power of 0 which is going to be one so this is going to be minimized where log variance x equals one so therefore this whole piece here will be minimized when mu is zero and LV is also zero and and so therefore LV e to the power of LV is one now the reason that it's specifically this form is basically because um there's a specific mathematical thing called the KL Divergence which Compares how similar to distributions are and so a normal distribution can be fully characterized by its mean and its variance and so this is actually more precisely calculating the similarity that specifically the KL Divergence between the actual mu and LV that we have and a distribution with a mean of zero and a variance of one um but you can see hopefully why conceptually we have this mu dot power 2 and where we have this LV dot X LV minus lv.x yeah um so that is our Bae loss did you guys have anything to add to any of that description so maybe to highlight the the objective of this is to say rather than having it so that the exact point that that an input is encoded to decodes back to that input to be saying number one the space around that point should also decode that input because we're going to transform some variance and number two the overall variance should be like yeah the the overall space that a user should be roughly zero mean and units and variants right so instead of people to like map each input to like an arbitrary point and then decode only that exact point to an input we're now mapping them to like a restricted range and we're saying that not not just each point but its surroundings as well it should also decode back to something that looks like that image um and that's trying to like condition this latent space to be much nicer so that any object Point within that range will hopefully map to something useful which is a harder problem to solve right so we would expect given that this is exactly the same architecture we would expect its ability to actually decode would be worse than our previous attempt because it's a harder problem that we're trying to solve which is to just we've got random numbers in there as well now but we're hoping that this ability to generate images will improve um thanks johno okay so I actually asked Bing about this um which is this this is more of an example of like I think for you know now that we've got gpt4 and Bing and stuff I find they're pretty good at answering questions that like I wanted to explain to students what would happen if the variants of the latents was very low or what if they were very high so why do we want them to be one and they thought like oh gosh this is hard to explain so maybe Bing can help so I actually thought it's pretty good so I just see what being said so Bing says if the variance of the latents are very well low then the encoder distribution would be very peaked and concentrated around the means that was the thing we were describing earlier if we had trained this without the kld loss at all right it would probably make the variance zero and so therefore the latent space would be less diverse and expressive and limit the ability if the decoder to reconstruct the data accurately make it harder to generate new data if it's different from the training data which is exactly what we're trying to do and if the variance is very high then the encoder would be very spread out and diffuse it would be more the latents would be more noisy and random um make it easier to generate new data that's unrealistic or nonsensical okay so that's why we want it to be exactly at a particular point so when we train this we can just pass vae last as our loss function but it'd be nice to see how well it's going at reconstructing the original image and how it's going at creating a 0 1 distribution data separately so um what I ended up doing was creating just a really simple thing called func metric which I derived from the capital M mean um class in the torch I'm just trying to find it here from the torch eval.metrix so they've already got something that can just calculate means it's obviously this stuff's all very simple and we've created our own metrics class ourselves back a while ago that since we're using torch eval I thought this is useful to see how we can create one a custom metric where you can pass in some function to call before it calculates them in so if you call so you might remember that the way torch eval works is it has this thing called update which gets past the input and the targets so I add to the weighted sum the result of calling some function on the input into targets um so we want two kind of new metrics one is the um we're going to print it out as kld which is a func metric on kld loss someone in which we'll print out as BCE which is a funk metricon BCE loss um and so the actual when we call the learner the loss function we'll use is VE loss but we're going to pass in as metrics um this list of additional metrics to print out so I'm just going to print them out and in some ways it's a little inefficient because it's going to calculate kld loss twice and BCE lost twice one to print it out and one to go into the you know actual loss function but it doesn't take long for that bit so I think that's fine so now when we call learn.fit you can see it's printing them all out so the BCE that we got last time was 0.26 and so this time yeah it's not as good as 0.31 because it's a harder problem and it's got random randomness in it um and you can see here that the BCE and kld are pretty similar scale when it starts that's a good sign if they weren't you know I could always in the loss function scale one of them up or down but they're pretty similar to start with so that's fine so we trained this for a while and then we can use exactly the same code for sampling as before and yeah as we suspected its ability to decode is worse so it's actually not capturing the l e e at all in fact and the shoes got very blurry um but the hope is that when we call it on noise called the decoder on random noise Ah that's much better we're getting it's not amazing but we are getting some recognizable shapes so you know vaes are you know um not generally going to get you as good a results as um diffusion models are although actually if you train really good ones for a really long time they can be pretty impressive but yeah even in this extremely simple quick case we've got something that can generate recognizable items of clothing did you guys want to add anything before we move on to the stable diffusion vae okay so this yeah so this pie is very crappy um and as we mentioned like one of the one of the key reasons to use a vae is actually that you can benefit from all the compute time that somebody else has put into training a good vae maybe just also like one one thing when we say good vae the one that we've checked trained here is is good at generating because it Maps down to this like one 200 dimensional vector and then back um in a very useful way and like if you look at vaes for generating they'll often have a pretty small dimension in the middle um and they and it'll just be like this Vector that gets mapped back up and so vae that's good for generating is slightly different to one that's good for compressing and like the stable diffusion one we'll see has this like spatial component still it doesn't map it down to a single Vector it Maps it down to a 64 by 64 or whatever um I think that's smaller than the original but for generating we can't just put random noise in there and hope like a cohesive image will come out um so it's less good as a generator but it is good because it has this like compression and reconstruction ability cool yeah so let's take a look um now um to demonstrate this we want to move to a more difficult task um because we want to show off how using latent let us do stuff we couldn't do well before um so the more difficult task we're going to do is generating bigger images and specifically generate images of bedrooms using the L Sun bedrooms data set so lsan is a really nice data set which has many many millions of images across 10 scene categories and 20 object categories um and so very it's very rare for people to use of the object character categories to be honest but people quite often use the the scene categories um they're a little um well more than little okay can be extremely slow to download the the website they come from is very often down so what I did was I put a subset of 20 of them um onto AWS they kindly provide some free data set hosting for our students um and also the original L signs in a slightly complicated form it's in something called an lmdb database and so I turned them into just normal images in folders so you can yeah download them directly from from the AWS data set site that they've provided for us um so I'm just using fast core to save it and then using Python's shutl to unpack the gzipped tar file um okay so that's given us once that runs which is going to take a long time um and you know if um it might be you know even more reliable just to do this in the Shell with wget or Aria 2C or something um then doing it through python so this will work but if it's taking a long time or whatever maybe just delete it and do it in the Shell instead um okay so then I thought all right how do we turn these into latents well we could create a data set in the usual ways it's going to have a length so we're going to grab all the files so blob is a built into python which we'll search for in this case star.jpg and if you've got star star slash that's going to search recursively as long as you pass recursive so we're going to search for um all of the JPEG files inside our data slash bedroom folder so that's what this is going to do it's going to put them all into the files attribute and so then when we get an item the I item it will find the I file it will read that image so this is pi torch's read image it's um it's the fastest way to read a JPEG image people often use Pao but it's quite hard to find a really well optimized pil version that's really compiled fast where else the pytorch torch Vision team have created a very very fast read image so that's why I'm using theirs and if you pass in image read mode.rgb it'll automatically turn any one channel black and white images into three Channel images for you or if there are four Channel images with transparency it'll turn those so this is a nice way to make sure they're all the same and then this turns it into floats from not to not to one and these images are generally very close to 256 by 256 pixels so I just crop out the top 256 by 256 bit because I didn't really care that much um and we do need them to all be the same size in order that we can then pass them to the vae stable diffusion vae decoder as a batch otherwise it's going to take forever so I can create a data loader that's going to go through a bunch of them at a time um so 64 at a time and use however many CPUs I have as the number of workers it's going to do it in parallel and so the parallel bit is the bit that's actually reading the jpegs which is otherwise going to be pretty slow so if we grab a batch here it is here's what it looks like okay generally speaking they're just bedrooms although we've got one pretty risque situation in the bedroom but uh on the hair on the whole they're not not safe for work this is the first time I've actually seen a an actual bedroom scene taking place as it were um all right so as you can see this mini batch of um uh if I just grab the first 16 images has um three channels and 256 by 256 pixels so that's how big that is for 16 images so that's uh seven two eight okay so three point one four five million floats to represent this um okay so as we learned in the first lesson of part two we can grab [Music] um an auto encoder directly using diffusers using from pre-trained we can pop it onto our GPU and importantly we don't have to say with no grad anymore if we pass the requires grad false and remember this neat trick in pie torch if it ends in an underscore it actually changes the thing that you're calling in place so this is going to stop it from Computing gradients which would take a lot of time and a lot of memory otherwise so let's test it out let's encode our mini batch and so just like jono was saying this has now made it much smaller it's got just in in our 16 batch of 16 it's now a four Channel 32 by 32 so if we can compare the previous size to the new size it's 48 times smaller so that's 48 times less memory it's going to need and it's also going to be a lot less compute for a convolution to go across that image so there's no good unless we can turn it back into the original image so let's just have a look at what it looks like first now it's a four Channel image so we can't naturally look at it but what I could do is just grab the first three channels and then they're not going to be between naught and one so if I just dot sigmoid now they're between one and one and so you can see that our you know risque bedroom scene you can still recognize it right or this bedroom this bed here you can still recognize it so there's still that kind of like the basic geometry is still clearly there but it's um yeah it's clearly changed it a lot as well so importantly we can call decode on this 48 times smaller tensor and it's really I think absolutely remarkable how good it is I I can't tell the difference um to the original let me if I zoom in a bit her face is a bit blurry with her face always a bit blurry uh it's always a bit blurry first second third oh hang on did that used to look like a proper ND yeah okay so you can see this used to say that clearly there's an ND here and now you can't see those letters so so and this is actually a classic thing that's known for this particular vae is it's um not able to regenerate writing correctly at small font sizes I think it's also pretty it's like I think we hear what the faces are already pretty low resolution but if you are at a higher resolution the faces also would probably not uh be converted appropriately okay cool but overall yeah it's done a great job um a couple of other things I wanted to note was like so like you mentioned like a 40 I guess a factor of 48 degrees uh oftentimes people will refer to mostly at the spatial resolution so since it's going from 256 to uh by 256 to 32 by 32 so that's like a factor of uh eight so they sometimes will know like I think it's like f8 or something like this they'll note the the spatial resolution so sometimes you may see that written out like that um and of course is an eight squared decrease in the number of pixels which is interesting right right and then the other thing I wanted to note was that the vae is also trained with uh with a perceptual loss objective as well as uh technically like a like a a discriminator again objective I don't know if you were going to go into that yeah so um yeah so perceptual loss we've we've already discussed right so the vae is going to you know when they trained it um so I think this was trained by confers right the you know Robin and um gang um uh and use stability.ai donated compute for that and uh they went so typically actually no the the VA was actually trained separately and it's actually a trend on the open images data set and it was just this vae that they trained by themselves on you know a small subset of data but because the VA is so powerful it's actually able to be applied to all these other um data sets as well okay great yeah so they um so they would have had um a a KL diversion loss and they would have either had an MSC or BCE like something might have been an MSE loss they also had a perceptual loss which is the thing we learned about when we talked about super resolution which is where when they compared the uh the output images to the original images they would have run that through a you know imagenet trained or similar classifier and confirmed that the activations they got through that model were similar and then the final bit um is as Tunisia was mentioning is the adversarial loss um which is also known as a as a gan loss so again is a generative adversarial Network and um the the Gan loss what it does is it grabs it is actually um more specifically what's called a patchwise patch wise again loss and what it does is it takes like a little section of an image right and what they've done is they train it's um let's just simplify it for a moment and imagine that they've pre-trained a classifier right where they've basically got something that you can pass it a real you know patch from a bedroom scene and a um and a fake patch from a bedroom scene and they both go into the the What's called the discriminator and this is just a normal you know resnet or whatever which basically outputs something that either says uh yep um uh the the image is real or nope the image is fake so sorry I said it passes in two things you just oh that was wrong you just pass in one thing and it returns either it's real or it's fake and specifically it's going to give you something like the probability that it's real there is another version I don't think it's what they used to be passing to and it tells you which one's relative do you remember tanishq is it a relativistic gan or a normal gam it's I think it's a normal one yeah so the relativistic end is when you pass in two images and it says switch is more real the one we think that we if we remember correctly they use as a regular can which just tells you the probability that it's real and so you can just train that by passing in real images and fake images and having it learn to classify which ones are real and which ones are fake um so now that once you've got that model trained then as you train your Gan you pass in the patches you know of of each image into the discriminator just call D here right and it's going to spit out the probability that that's real and so if it's bad out you know 0.1 or something then you're like oh dear that's terrible our vae is spitting out pictures of bedrooms where the patches of it are easily recognized as not real but the good news is that's going to generate derivatives right and so those derivatives then is going to tell you how to change the pixels of the original generated image to make it trick again better and so what it'll do is it'll then use those derivatives as per usual to update our our vae and the vae in this in this case is going to be called a a generator right that's the thing that's generating the pixels and so the generator gets updated to be better and better at tricking the discriminator and after a while what's going to happen is the generator is going to get so good that the discriminator gets fooled every time right and so then at that point you can you can fine tune the discriminator better by putting in your better generated images right and then once your discriminator learns again how to recognize the difference between real and fake you can then use it to find the generator and so this is kind of ping-ponging back and forth between the discriminator and the generator back when Gans were first created um you know people were finding them very difficult to train and actually a method we developed at fast AI I don't know if we were the first to do it or not um was this idea of kind of pre-training a a generator just using perceptual loss and then pre-training a discriminator to be able to fool the generator and then ping-ponging backwards and forwards between them after that basically whenever the discriminator got too good start using the generator anytime the generator got too good start using the discriminator nowadays that's pretty standard I think to do it this way and so yeah this this this this this this Gan loss which is basically saying penalized for failing to fool the discriminator is called an adversarial adversarial Plus to maybe motivate why why you do this if you just did it with like a mean squared error or even a perceptual loss um with such a high compression ratio the vaes tend to like produce a fairly blurry output because it's not sure whether there's texture or not you know in in this image or the edges on like super well defined where they'll be because it's going from like one four dimensional thing up to like this whole patch of the image um and so it tends to be a little bit blurry and hazy because it's it's kind of hitching its bits whereas that's something that the discriminator can quite easily pick up oh like it's blurry it must be fake you know and so then it's having the discriminator that is adversarial loss is just kind of saying like even if you're not sure exactly where this texture goes rather go with a sharper looking texture that looks real um than with some blurry thing that's going to maximize your MSE um and so it like tricks it into kind of Faking this high resolution looking sharper output yeah and um I'm not sure if we're going to come back and like train our own and at some point but um if you're you know interested in training your own Gan or advocation called again right I mean nowadays we never really just use again we have an adversarial loss as part of a training process so if you want to learn how to use adversarial loss um but in detail and see the code the 2019 first AI course lesson seven at part one um has a has a walk through so we have sample code there and you know yeah maybe given time we'll come back to it um Okay so quite often people will call the vae encoder when they're training a model which to me makes no sense right because the encoded version of an image never changes um unless you're using data augmentation and want to do augmentation on sorry to you know encode augmented images um I think it makes a lot more sense to just do a single run through your whole training set and encode everything once so naturally the question is then well what do you where do you save that so it's going to be a lot of ram if you put this leave it in Ram and also if you you know as soon as you restart your computer we've lost all that work there's a very Nifty file format you can use called a memory mapped numpy file which is what I'm going to use to save our latents a memory mapped numpy file is basically what happens is you you take the memory in Ram that numpy would normally be using and you literally like copy it onto the hard disk basically that's what they mean by memory mapped there's a mapping between the memory and RAM and the memory and hard disk and if you change one it changes the other and vice versa they're kind of two ways of seeing the same thing and so if you and so if you create a memory mapped numpy array then when you modify it it's actually modifying it on disk but thanks to the magic of your operating system it's using all kinds of beautiful caching and stuff to not make that slower than using a normal numpy array and it's going to be very clever at um it doesn't have to store it all in Ram it only stores the bits in Ram that you need at the moment or that you've used recently it's really Nifty at kind of caching and stuff so it's kind of it's like magic but it's using the your operating system to do that magic for you so we're going to create a memory mapped file using np.memap and so it's going to be stored somewhere on your disk right so we're just going to put it here and we're going to say okay so create a memory map file in this place it's going to contain 32-bit floats so write the file and the shape of this array is going to be the size of our data set so 303 125 images and each one is four by 32 by 32. okay so that's our memory mapped file and so now we're going to go through our data loader One Mini batch of 24 at a time and we're going to vae encode that minibatch and then we're going to grab the means from its latents right we don't want random numbers we want the actual you know the midpoints the means so this is using the diffusers version of that pae so pop that onto the CPU after we're done and so that's going to be mini batch of size 64. um as Pi torch let's turn that into numpy because pytorch doesn't have a memory mapped thing as far as I'm aware but numpy does and so now that we've got this memory mapped array called a then everything from initially from zero up to 64 or 60 yeah 0 to 64. not including the 64. that whole sub part of the array is going to be set to the encoded version so it's we're just it looks like we're just changing it in memory but because this is a magic memory map file it's actually going to save it to disk as well um so yeah that's it amazingly enough that's all you need to create a memory mapped numpy array of our latents when you're done you actually have to call Dot Flash and that's just something that says like anything that's just in Cache at the moment make sure it's actually written to disk and then I delete it because I just want to make sure that then I read it back correctly so that's only going to happen once if the path doesn't exist and then after that this whole thing will be skipped and instead we're going to call mp.memap again with their own path but this time in the same data type and the same shape this time we're going to read it mode equals R means read it and so let's check it let's just grab the first 16 latents that we read and decode them and there they are okay so this is like not a very well known technique I would say sadly but it's a really good one um you might be wondering like well what about like compression like shouldn't you be zipping them or something like that but actually remember these latents are already the whole point is they're highly compressed so generally speaking zipping latents from a good VA he doesn't do much um because they're like they kind of almost look a bit random number-ish Okay so we've now saved our entire lsan bedroom it's a 20 subset the bit that I've provided now latents so we can now run it through and this is the nice thing we can use exactly the same process from here on in as usual Okay so we've got the noiseify of our usual collated version um now the blatants are much higher than one standard deviation so if we about divided by five that takes it back to a standard deviation of about one um I think in the paper they use like 0.18 or something but this is close enough to make it a unit standard deviation um so we can split it into a training and a validation set so just grab the first 90 of the training set in the last 10 percent for the validation set so those are our data sets we use a batch size of 128 so now we can use our data loaders class we created with the get DLS we created so these are all things we've created ourselves with the training set the validation set the batch size and alkylation function so yeah it's kind of nice it's amazing you know how easy it is like you know a data set has the same interface as an Umpire array or a list or whatever so we can literally just use the numpy array directly as a data set which I think is really neat this is why it's useful to know about these like foundational Concepts because you don't have to start thinking like oh I wonder if there's some torch Vision thing to use mem map nump profiles or something it's like oh wait they already do provide a data set interface they don't have to do anything I just use them so that's pretty magical so we can test that now by grabbing a batch and so this is being noiseified and so here we can see our noiseified images and so here's something crazy is that we can actually decode noiseified images and so you know here's I guess this one wasn't noisified much because it's a recognizable bedroom and this is what happens when you just decode random noise something in between so I think that's pretty fun um yeah this next bit is always copied from our previous notebook create a model initialize it train for a while so this took me a few hours on a single GPU everything I'm doing is on a single GPU literally nothing in this course other than the stable diffusion stuff itself is trained on more than one GPU um the loss is much higher than usual and that's not surprising because it's trying to generate latent pixels which rare like it's it's it's much more precise as to exactly what it wants you know it's not like lots of pixels where the ones next door are really next to each other really similar or the whole background looks the same or whatever a lot of that stuff it's been compressed out it's a more difficult thing to predict latent pixels so now we can sample from it in exactly the same way that we always have using ddam but now we need to make sure that we decode it because the thing that it's predict that it's sampled latents because the thing that we asked it to learn to predict a latents and so now we can take a look and we have bedrooms ah and some of them look pretty good I think this one looks pretty good I think this one looks pretty good this but I don't have any idea what it is um and this one like clearly there's bedroomy bits but there's something I don't know there's weird bits um so the fact that we're able to create 256 by 256 pixel images where at least some of them look quite good in a couple of hours I can't remember how long it took to train but it's a small number of hours and a single GPU is something that was not previously possible and we're in a sense we're totally cheating because we're using the stable diffusion vae to do a lot of the hard work for us um but that's fine you know because that vae knows how to create all kinds of natural images and drawings and portraits and oil paintings or whatever so you can I think work in that latent space quite comfortably yeah do you guys have anything you wanted to add about that oh actually tanishq I know you've trained this for longer I only trained it for 25 epochs how long did you how many hours did you train it for because you did you did 100 epochs right yes I did 100 it blocks I didn't keep trying exactly but I think it was about 15 hours on an a100 okay a single a100 yeah uh I'd argue I mean the results yeah I'll show it it's it's um it's it's yeah it's I guess maybe slightly better um but you know I guess you can the good ones are certainly slightly better yeah yeah like the bottom left one better than any of mine I think so it's possible maybe at this point we just need to use more data I guess because I guess we were using a 20 subset so maybe having more of that data to provide more diversity or something like that maybe yeah or maybe have you tried doing the diffusers one for 100 yeah so I've got um all right so I'll share my screen if you want to stop sharing yours um so I do have if we get around to this maybe we can add the results back to this notebook because I do have a version that uses diffusers so everything else is identical um 25 epochs except for the model for the previous one I was using our our own Unit Model so we have to change the channels now to four and a number of filters I think I might have increased it a bit um so then I tried using yeah the diffuses unit just with whatever their defaults were and so I got what did I get here 243 with diffusers I got a little bit better 239 um and um yeah I don't know if they're obviously better or not like this is a bit weird this is a bit weird um I think like um actually another thing we could try maybe is um do 100 epochs but use the diffusers number of channels and stuff that they used for because I think the defaults that they use actually for diffusers is not the same as stable diffusion so maybe we could try stable diffusion matched unit for 100 epochs and if we get any nice results maybe we can paste them into the bottom to show people yeah yeah cool um yeah do you guys have anything else to add at this point all right so I'll just mention one more thought in terms of like a bit of a interesting project people could play with [Music] um I don't know if this is too crazy I don't think it's been done before but my thought was like there was a huge difference in our super resolution remember a huge difference in our super resolution results when we used a pre-trained model and when we used um perceptual loss but particularly when we used a pre-trained model um I thought we could use a pre-trained model but we would need a pre-trained latents model right we would want something where our you know down sampling backbone was pre-trained model on latents and so I just wanted to show you what I've done and you guys you know if anybody watching wanted to try taking this further I've just done the first bit for you to give you a sense which is I pre-trained an imagenet model not tiny imagenet but a full image net model on latents as a classifier and if you use this as a backbone you know and also try maybe some of the other tricks that we found helpful like having res Nets on the cross connections There's real things that I don't think anybody's done before I don't know the scientific literature is vast and I might have missed it but I've not come across anybody do these tricks before um so obviously like we're one of the interesting parts of this which is designed to be challenging is that we're using bigger data sets now but they're data sets that you can absolutely like run on a single GPU you know um a few tens of gigabytes which fits on any modern hard drive easily so these like uh good tests of your ability to kind of like move things around and if you're somewhere that doesn't have access to a decent internet connection or whatever this might be out of the question in which case don't worry about it but if you can yes try this because it's good practice I think to make sure you can use these larger data sets um so imagenet itself you can actually grab from kaggle nowadays um so they call it the object localization challenge but actually this contains the Full image net data set well the the version that's used for the imagenet competition um so I think you know people do anything in that one case you just have to accept the terms because it has like some distribution terms yeah exactly so you've got a kind of sign in and then join the competition and then yeah accept the terms um so you can then download the data set or you can also download it from hugging face [Music] um it'll be in a somewhat different format but that'll work as well so I think I grabbed my version from from kaggle so on kaggle you know it's just a zip file you unzip it and it creates an IL svrc um directory which I think is what they called the competition yeah image net large scale visual recognition challenge okay um so then um inside there there is a data and inside there there is a CLS lock and that's actually where the that's where actually everything is going to be um so just like before I wanted to turn these all into latents so I created in that directory I created a latency subdirectory and this time partly just to demonstrate how these things work I wanted to do a slightly different way okay so again we're going to create our pre-trained VRE plop it on the GPU turn off gradients for it and I'm going to create a data set now one thing that's a bit weird about this is that because this is really quite a big um uh data set like it's got 1.3 million files the thing where we go glob star star.jpg takes a few seconds you know and particularly if you're doing this on like um you know an AWS file system or something it can take really quite a long time um on mine it only took like three seconds but I don't want to wait three seconds so I you know a common trick for these kind of big things is to create a cache which is literally just a list of the files so that's what this um this is so I decided that Z pickle means a g zipped pickle so what I do is if if the cache exists we just gzip.open the files if it doesn't we use blob exactly like before to find all the files and then we also save a gzip file containing pickle.dump files so pickle.dump is what we use in Python to take basically any python object list of dictionaries and dictionary of lists whatever you like and save them and it's super fast right and I use gzip with compressed level one to basically be like compress it pretty well but pretty fast so this is a really nice way to create a little cache of that um so this is the same as always and so our get item is going to grab the file it's going to read it in turn it into a float and what I did here was you know I'm doing a little bit lazy but I just decided to Center crop the middle you know um so let's say it was a 300 by 400 file it's going to Center crop the middle 300 by 300 section and then resize it to 256 by 256. um so they'll be the same size so yeah we can now oh I managed to create the vau twice let's delete that one so I can now just confirm I can grab a batch from that data loader encode it and here it is and then decode it again and here it is surpass the first category must have been computer or something so here's as you can see the ve is doing a good job of decoding pictures of computers so I can do something really very similar to what we did before if we haven't got that destination directory yet created go through our daily loader encoder batch and this time I'm not using a memapped file I'm actually going to save separate um numpo files for each one so go through each element of the batch each item um so I'm going to save it into the destination directory which is the latents directory and I'm going to give it exactly the same path as the original one contained because it contains the you know um uh the folder of like what what the label is um uh make sure that the directory exists that we're saving it to and Save that just is a numpy file this is another way to do it so this is going to be a separate numpy file for each item does that make sense so far okay cool so I could create a thing called a numpy data set which is exactly the same as our images data set but to get an item we don't have to use you know open a JPEG anymore we just call mp.load so this is a nice way to like take something you've already got and change it slightly so it's going to return the let's do this file Jeremy just out of Interest sorry whether you do this versus the memory mapped file was it just a different way just to show a different way yeah yeah absolutely no particularly good reason honestly um yeah I like to kind of like demonstrate different approaches and I think it's good for people's python coding if you make sure you understand what all the lines of code do and yeah they both they both work fine actually um it's partly always also forward to my own experimental interest it's like oh which one seems to kind of feel better um yeah all right so create out training and validation data sets by grabbing all the numpy files and so the training and validation folders and um then I'm going to just create a training data loader for the training data set just to see what the mean and standard deviation is on the channel Dimension so this is every Dimension except Channel what I mean over and so there it is and as you can see there the mean and standard deviation are not close to zero and one so we're going to store away that mean and standard deviation such that we then we've seen transform data set before this is just applying a transform to a data set we're going to apply the normalization transform in the past we've just we've used our own normalization that torch Vision has has one as well so this is just demonstrating how to just use torch Visions version but it's literally just subtracting the mean and dividing by the standard deviation um we're also going to apply some data augmentation we're going to use the same trick we've used before for images that are very small which is we're going to add a little bit of padding and then randomly crop our original image size from that so it's just like shifting it slightly each time and we're also going to use our random erasing and it's nice because we did it all with broadcasting this is going to apply equally well to a four Channel image as it is to a three or I think we did originally for one um now you know I don't think anybody as far as I know has built classifiers from latents before so like I didn't even know if this is going to work so I visualized it so so we created from X and a different y so for two from X you can optionally add augmentation and if you do then apply the augmentation transforms now this is going to be applied one image at a time but augmentation transforms some of them expect a batch so we create a extra unit axis on the front to be a batch of one and then remove it again and then Tiff and Y very much like we've seen before we're going to turn those path names into IDs so there's our validation and training transform data sets so so that we can look at our results we need a denormalization um so let's create our data loaders and grab mini batches and show us and so I was very pleased to see that the random arrays Works actually extremely nicely so you can see you get these kind of like weird patches you know weird patches um but it's still they're still recognizable so this is this is like something I very very often do is to and select oh is this like thing I'm doing in computer vision reasonable it's like what can my human brain recognize it so if I couldn't recognize this with a drilling platform myself that I shouldn't expect a computer to be able to do it or that this is a compass or whatever um I'm so glad they got otters so cute and you can see the cropping it's done has also been fine like it's a little bit of a fuzzy Edge but basically like it's not destroying the image at all they're still recognizable it's also a good example here of how difficult like this problem is like the fact that this is Seashore I would have called this Surfers you know but maybe surface is not an imagenet category but um yeah okay this could be food but actually it's a refrigerator okay so our augmentation seems to be working well so then I um yeah basically I've just copied and pasted you know our basic pieces here and I've kind of wanted to have it all in one place just to remind myself of exactly what it is so this is the pre-activation version of convolutions the reason for that is if I want this to be a backbone for a diffusion model or a unit then I remember that we found that pre-activation works best for units so therefore our backbone needs to be trained with pre-activation so we've got pre-activation conf got a res block I rest blocks model with dropouts this is what is copied from previous um so I decided like I wanted to try to you know use the basic trick that we learned about from simple diffusion of trying to put most of our work in the later layers so the first layer just says one block then two blocks and then four blocks and then I figured that we might then delete these final blocks these maybe you're going to just end up being for classification this might end up being our pre-trained backbone or maybe we keep them I don't know you know it's like as I said this hasn't been done before so anyway I try to design it in a way that we've got some you know we can mess around a little bit with how many of these we we keep and so also I try to use very few channels in the first blocks and so I jump up for the channels that are aware of the work's going to do I jump from 128 to 512. um so that's why I designed it this way you know I haven't even taken it any further than this so I don't know if it's going to be a useful backbone or not I didn't even know if this is going to be possible to classify it seemed very likely it was possible to classify even based on the fact that you can still kind of recognize it almost like I could probably recognize it's a computer Maybe so I thought it was going to be possible but yeah this is all new so that was the model I created and then I yeah trained it for 40 epochs um and you can see after one epocket was already 25 accurate and that's it recognizing which one over a thousand categories is it so I thought that was pretty amazing and so after 40 epochs I ended up at 66 which is really quite fantastic because a resnet 34 is kind of like 73 or 74 accuracy when trained for quite a lot longer um you know so to me this is extremely encouraging that you know this is a really pretty good resnet at recognizing images from their latent representations without any decoding or whatever so from here you know if you want to you guys could try yeah building a better bedroom diffusion model or whatever you like let's not be bedrooms actually um one of our colleagues Molly um I'm just going to find it so one of our colleagues Molly actually used the um do you guys remember was it the Celeb faces that she used so there's a celeb a HQ data set that consists of images of faces of celebrities and so what Molly did was she basically yeah used this exact notebook but used this faces data set instead and although this one's really pretty good isn't it you know um this one's really pretty good they certainly look like celebrities that's for sure um so yeah you could try this data set um or whatever but um yeah try yeah maybe try it with the pre-trained backbone try it with um resonates on the cross connections with all the tricks we used in super res try it with perceptual loss some folks we spoke to about the perceptual I think it it won't help with latents because the underlying vae was already trained with perceptual loss but we should try you know or you guys to try all these things um yeah so be sure to check out the Forum as well to see what else the people who've already tried here because it's a whole new world but it's just an example of the kind of like fun research ideas I guess we can play with um yeah what do you guys think about this are you like surprised that we're able to quickly get this kind of accuracy for enlightens or do you think this is a useful research path what are your thoughts yeah I think it's very interesting go ahead the Legends are already like a a slightly compressed richer representation of an image right um so it makes sense that that's a useful thing to train on um 66 I think Alex net is like 63 or something like that so you know we're already at state of the art but eight years ago whatever um it might be more like 10 years ago I know Titan passes quickly yeah yeah I guess next year yeah next year it is 10 years ago um but yeah I'm kind of curious with the pre-training the whole the whole value for me for like using a pre-trained network was someone else has done lots and lots of compute on image to learn some features and now I'm gonna use that so it's kind of funny to be like oh well let's pre-train for ourselves and then try and use that um I'm curious whether like how best you'd allocate that compute whether you should if you've got 10 hours of GPU just do 10 hours of training versus like five hours of pre-training and five hours of training um I mean based on our super is thing the pre-training like it was so much better so that's why I'm feeling somewhat hopeful about this direction yeah yeah I'm really curious to see how it goes I guess I was gonna say it's like yeah I think there's just a lot of opportunities for I guess the late 10 doing stuff in the latents um and like I guess maybe like yeah you could I mean here are your trade classifier as a backbone but you could think of like credit classifiers on other things for you know guidance or things like this yeah of course we've done some experiments with that I know jono has his Midview guidance approach for some of this sort of things but uh there are different approaches that you can play around here um that you know exploring in the latent space can can make it computationally cheaper than you know having to decode it anytime you want to you know like you have to look at the image and then maybe apply a classifier or apply some sort of guidance on the image but if you can do it directly in the latent space a lot of interesting opportunities there as well yeah and you know now we're showing that indeed yes everything on latents models like that's something I've done to make a latent clip is just have it like try and mirror an image space clip and so for classifiers as well you could like distill an imagenet classifier rather than just having the label you try and like copy the logins um and then that's like an even richer signal like you get more value per per example um so then you can create your you know your blatant version of some existing image classifier or object detector or yeah like multimodal model like clip um I feel like I feel funny about this because I'm like both excited about simple diffusion on the basis that it gets rid of latents but I'm also excited about latency on the basis it gets rid of most of the pixels I don't know how I can be cheering for both but somehow I am I guess May the best method win so um you know for folks that are finishing this course well first of all congratulations because it's been a journey particularly part two it's a journey which requires a lot of patience and tenacity um you know if you've kind of zipped through by binging on the videos that's totally fine it's a good approach but you know maybe go back now and do it more slowly and do the you know build build it yourself and really experiment um but assuming you know for folks who have got to the end of this and feel like okay I I get it more or less yeah do you guys have any sense of like what kind of things make sense to do now you know where would you guys go from here I think that uh great opportunities implementing papers that I guess come along these days and and and I think at this point my way yeah um but also at this stage I think you know we're already discussing research ideas and it I think you know we're in a solid position to come up with our own research ideas and explore explore those ideas so I think that's a that's a real opportunity that we we have here yeah but yeah I think that's best done often collaboratively so I'll just you know mention that um fast AI has a Discord which if you've got to this point then you're probably somebody who would benefit from from being there um and yeah just pop your head in and say like there's an introduction straight to say hello and you don't you know maybe say what you're interested in or whatever because it's it's nice to work with others I think um I mean both jono and tanishq I only know because of the the Discord and the forums and so forth so that would be one and we also have a we have a generative channel so anything related to generative models that that's the place so for example Bali was posting some of her experiments in that channel uh I think there are other fast AI members posting their experiments so if you're doing anything generative model uh related that's a great way to also get uh feedback and thoughts from from the community yeah I'd also say that like this this if if you're at the stage where you finish this course you actually understand how diffusion models work you've got a good handle on what the different components and like stable diffusion are and you know how to Wrangle data for training and play for these things you're like so far ahead of most people who are building in the space and I've got lots of lots of companies and people reaching out to me to say do you know anybody who has like more than just oh I know how to like load stable diffusion and make an image like you know someone who knows how to actually like Tinker with it and make it better and if you've got those skills like don't feel like oh I'm definitely not qualified to like apply or like there's lots of stuff where yeah just taking these ideas now and like just simple sensible ideas that we've covered in the course that have come up and saying like oh actually maybe I could try that maybe I could play with this you know take this experimentalist approach I feel like there's actually a lot of people who would love to have you helping them build the million and one little stable diffusion based apps or whatever right everyone's working out and particularly like the thing we always talk about at first AI which is particularly if you can combine that with your domain expertise you know whether it be from your your hobbies or your work in some completely different field or whatever you know there'll be lots of interesting ways to combine you know you probably are one of the only people in the world right now that understand your areas of of passion or of vocation as well as these techniques so um and again that's a good place to kind of get on the Forum or the Discord or whatever and start having those conversations because it can be yeah it can be difficult when you're at The Cutting Edge which you now are by definition all right well we better go away and start um figuring out how on Earth gpt4 Works I don't think we're going to necessarily build the whole dpt4 from scratch uh at least not at that scale but I'm sure we're going to have some interesting things happening with NLP and um jono Denise thank you so much it's been a real pleasure it was nice doing things with the um with a live audience but I gotta say I really enjoyed this uh experience of doing stuff with you guys the last two lessons so thank you so much yeah thanks for having us it was really really fun all right bye see you in part three bye

Need a transcript for another video?

Get free YouTube transcripts with timestamps, translation, and download options.

Transcript content is sourced from YouTube's auto-generated captions or AI transcription. All video content belongs to the original creators. Terms of Service · DMCA Contact

Lesson 25: Deep Learning Foundations to Stable Diffusion ...