AI Captain – Set Sail for the Neighbor's Yard!

Prompt: "Change the drawing style to historical village, high contrast, watercolors, cabins, huts, streets, historical map"

·         Negative Prompt: "ugly, unfinished, photorealistic, low contrast, frame, water, camera",

·         Steps: 30,

·         Sampler: Euler a (tried others also, but Euler A seemed to bring the best results),

·         Image CFG scale: 1.3-2,

·         Text CFG scale: 10-14, Seed: 64373(random for each),

·         Model hash: ffd280ddcf, Model: instructpix2pix, Model Type: instruct-pix2pix
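
If you'd rather drive this from code than the web UI, here is a minimal sketch of roughly how those settings map onto the InstructPix2Pix pipeline in the diffusers library. The file names and the exact guidance values are placeholders I've picked for illustration, not something copied from my run.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline, EulerAncestralDiscreteScheduler

# Placeholder file names -- swap in your own source image and output path
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
# "Euler a" in the web UI corresponds to the Euler ancestral scheduler
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

source = Image.open("neighbors_yard.png").convert("RGB")

result = pipe(
    prompt="Change the drawing style to historical village, high contrast, "
           "watercolors, cabins, huts, streets, historical map",
    negative_prompt="ugly, unfinished, photorealistic, low contrast, frame, water, camera",
    image=source,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # "Image CFG scale" in the UI (I used 1.3-2)
    guidance_scale=12,         # "Text CFG scale" in the UI (I used 10-14)
).images[0]

result.save("pirate_map_v1.png")
# "Piping it back in" is just reloading pirate_map_v1.png as the next input image.
```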

Makes sense, right? I think the CFG scale is off, and I can pipe the result back in as the input to further refine it into a true pirate map with X's and treasure.

If only it were all this simple.



I set sail on an adventure to apply my tools and prior learning to something a bit easier than realistic photomanipulation of my real heroes into fictional superheroes.

I revisited Stable Diffusion and performed some R&D on what each element of the UI means and how Stable Diffusion exposes it. While learning about prompts, CFG scales and samplers, I added to my pirate's booty of knowledge about denoising architectures vs. GANs. I can dive deeper into this later, but here are some high-level definitions.

Diffusion Model - A diffusion model is a deep neural network, built around latent variables, that learns the structure of an image by learning to remove the noise that has been added to it.

Stable Diffusion models use a denoising architecture, where the model is trained to remove added noise from the data, gradually refining the generated image over time. Essentially, noise is added to the training data and the model learns to remove it; at generation time it runs that removal process in reverse to produce new data similar to the original.
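
To make that concrete, here is a tiny PyTorch sketch of the forward half of that process, assuming the standard "predict the added noise" training objective. It is a toy illustration, not the actual Stable Diffusion training code.

```python
import torch

def forward_diffusion(x0: torch.Tensor, alpha_bar_t: torch.Tensor):
    """Blend a clean image x0 with Gaussian noise for one training example.

    alpha_bar_t says how much of the original signal survives at timestep t
    (near 1 = barely noised, near 0 = almost pure noise).
    """
    eps = torch.randn_like(x0)                                     # the noise we add
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    return x_t, eps                                                # model input, training target

# Toy example: a random 3x64x64 "image" noised about halfway
x0 = torch.rand(1, 3, 64, 64)
x_t, eps = forward_diffusion(x0, torch.tensor(0.5))

# Training asks a denoising network to predict eps from (x_t, t) with a mean
# squared error loss; generation then runs the learned denoiser step by step,
# starting from pure noise and ending at an image.
```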

GANs, or Generative Adversarial Networks, take a different approach: two neural networks use each other to train. The generator learns to create plausible data, and its fakes serve as negative examples that teach the discriminator to tell fake data from real data. The discriminator's verdict is in turn used to update the generator, and training alternates, with one network updated while its rival is held fixed. It's game theory in action, and the constant competition pushes both sides to get better. This feedback loop produces higher-quality and more believable output.
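
Here is a deliberately tiny PyTorch sketch of that tug-of-war, using made-up one-dimensional "real" data so the alternating generator/discriminator updates are easy to see. A real GAN would use much bigger networks and image data.

```python
import torch
import torch.nn as nn

# Tiny generator and discriminator working on 1-D data
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 2 + 5       # "real" data: samples from N(5, 2)
    fake = G(torch.randn(64, 8))            # generator turns random noise into fakes

    # Discriminator turn: learn to label real samples 1 and fakes 0
    d_loss = (bce(D(real), torch.ones(64, 1)) +
              bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator turn: try to make the discriminator call the fakes real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```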

With that understanding I was able to dive deeper into custom configurations and produce an outcome from a real image that depicted what I wanted in form, style and size. I can take that image and refine it through inpainting, feeding previous outputs directly back in as inputs. Interestingly, I was also able to take that style and apply it to other images, reusing the prompt and settings from previous iterations on new requests.
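
For the refinement loop, a hedged sketch using the diffusers img2img pipeline looks like this. I actually did it through the web UI, so the base model ID, strength and prompt below are illustrative placeholders.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Placeholder base model -- any Stable Diffusion checkpoint works here
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Feed the previous output straight back in as the starting image
previous = Image.open("pirate_map_v1.png").convert("RGB")

refined = pipe(
    prompt="historical pirate treasure map, watercolors, X marks the spot",
    image=previous,
    strength=0.45,       # lower strength keeps more of the previous output
    guidance_scale=9,
).images[0]
refined.save("pirate_map_v2.png")
```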

 

Once I had a method for pirate maps I went back to my superheroes. I was able to take the method a step further and create my own model trained on a dataset that I specifically wanted to use. In order to do so you need lots of training data, which is why models like ChatGPT (Generative Pre-trained Transformer) do this so well. Does it make a little more sense now?

 

ChatGPT has nearly endless information to draw from the estimated 328 million terabytes of data created daily. From this you can see why the DALL-E 2 diffusion model works so well: it's the training data set. Same goes for ChatGPT.

 

But once again, I don't want a semblance of Will Ferrell with cats in space. I want a trusted source to illustrate Matt as the real superhero he is.

 

In comes DreamBooth. Introduced by Google in 2022, it allows fine-tuning of diffusion models by introducing your own custom content.

 

With DreamBooth you add a set of input images and create specific text prompts that pair a unique identifier with the class of the subject. Here the identifier would be "Matt" and the class name would be "person".
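
In the diffusers DreamBooth training script, for example, those two pieces become the instance prompt and the class prompt. The exact phrasing below is my assumption of how you would word them, not something prescribed by the paper.

```python
# Hypothetical DreamBooth prompt pair: unique identifier + class of the subject
instance_prompt = "a photo of Matt person"  # identifier "Matt" plus class "person"
class_prompt = "a photo of a person"        # class-only prompt, used for prior preservation
```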

 

The source images are best raw, plentiful, diverse in nature and 512x512. They should be taken from different angles and against different backgrounds in well-lit environments, without hats or objects blocking the face, so that the learning mechanism can distinguish the subject from the background. Training on them will create a custom model checkpoint based on your source data.
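
A small sketch of that prep step, assuming a hypothetical folder of raw photos: it center-crops each image to a square and resizes it to the 512x512 training resolution.

```python
from pathlib import Path
from PIL import Image

src = Path("raw_photos")           # hypothetical folder of raw photos of the subject
dst = Path("dreambooth/instance")  # hypothetical folder the training run will read from
dst.mkdir(parents=True, exist_ok=True)

for i, f in enumerate(sorted(src.glob("*.jpg"))):
    img = Image.open(f).convert("RGB")
    # Center-crop to a square, then resize to the 512x512 training resolution
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((512, 512), Image.LANCZOS)
    img.save(dst / f"matt_{i:03d}.png")
```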

 

Save your files in Google Drive, reference the subject by using your identifier in your prompt, and put it all together using the Automatic1111 (Stable Diffusion) GUI.
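
I did this part through the Automatic1111 GUI, but if you prefer to script the generation step, a rough diffusers equivalent would look like the following. The checkpoint path is a placeholder for wherever your DreamBooth training wrote its output.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder path -- wherever your DreamBooth training run saved its checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "dreambooth-output/matt-person", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a comic book illustration of Matt person as a superhero, dramatic lighting",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("matt_superhero.png")
```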

 

The end result is not Matt, but it is closer to what I want. Everything, including this blog, is a work in progress. The important takeaway is that the quality of the input data is the differentiator in the quality of the output. This follows my standard approach of people, process and then technology. If you don't have the right people or the right processes, technology will never deliver what you are looking for.

 


The next update will move away from superheroes and AI image generation and apply these same principles to business processes.
