Prompt: "Change the drawing style to historical village, high contrast, watercolors, cabins, huts, streets, historical map"
·
Negative Prompt:
"ugly, unfinished, photorealistic, low contrast, frame, water,
camera",
·
Steps: 30,
·
Sampler: Euler a (tried
others also, but Euler A seemed to bring the best results),
·
Image CFG scale: 1.3-2,
·
Text CFG scale: 10-14,
Seed: 64373(random for each),
·
Model hash: ffd280ddcf,
Model: instructpix2pix, Model Type: instruct-pix2pix
Makes sense, right? I think the CFG scale is off, and I can pipe the output back in as the input to further refine it into a true pirate map with X's and treasure. If only it were all that simple.
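For anyone who would rather script it than click through the UI, here is a rough sketch of those same settings run through the Hugging Face diffusers library. The model id (timbrooks/instruct-pix2pix), the file names, and the single CFG values picked out of the ranges above are illustrative placeholders, not the exact run.

```python
# A minimal sketch of the settings above, run through diffusers instead of
# the Automatic1111 UI. Model id, file paths, and the exact guidance values
# are illustrative, not the original run.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline, EulerAncestralDiscreteScheduler
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
# "Euler a" in the UI corresponds to Euler Ancestral in diffusers
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

source = Image.open("input_map.png").convert("RGB")  # hypothetical input image

result = pipe(
    prompt="Change the drawing style to historical village, high contrast, "
           "watercolors, cabins, huts, streets, historical map",
    negative_prompt="ugly, unfinished, photorealistic, low contrast, frame, water, camera",
    image=source,
    num_inference_steps=30,
    image_guidance_scale=1.5,   # "Image CFG scale" (the 1.3-2 range above)
    guidance_scale=12,          # "Text CFG scale" (the 10-14 range above)
).images[0]

result.save("pirate_map_pass1.png")
# The output can be fed back in as image= for another refinement pass.
```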
I set sail on an adventure to apply my tools and
prior learning to something a bit easier than realistic photomanipulation of my
real heroes into fictional superheroes.
I revisited Stable Diffusion and did some R&D on what each element of the UI means and how Stable Diffusion actually uses it. While learning about prompts, CFG scales and samplers, I added to my pirate's booty of knowledge about denoising architectures vs. GANs. I can dive deeper into this later, but here are some high-level definitions.
Diffusion Model - A diffusion model is a deep neural network that learns the structure of images by learning to remove noise from them. Stable Diffusion models use a denoising architecture: noise is added to the training data, the model is trained to remove it, and at generation time the model gradually refines an image from noise over a series of steps. Essentially, it adds noise to data and then learns to remove that noise, so it can produce new data similar to the original.
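As a toy illustration only (not the actual Stable Diffusion internals), here is a sketch of the forward noising step from the DDPM formulation; the schedule values and tensor shapes are placeholders.

```python
# Toy illustration of the denoising idea: noise is blended into clean data on
# a schedule, and a network is trained to predict that noise so it can be
# subtracted back out at generation time.
import torch

def add_noise(x0, t, alphas_cumprod):
    """Forward process: mix clean data x0 with Gaussian noise at timestep t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt()
    b = (1 - alphas_cumprod[t]).sqrt()
    return a * x0 + b * noise, noise

# Linear beta schedule over 1000 timesteps, as in the DDPM paper
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

x0 = torch.randn(1, 3, 64, 64)  # stand-in for a clean (latent) image
noisy, true_noise = add_noise(x0, t=500, alphas_cumprod=alphas_cumprod)

# Training objective: a U-Net model(noisy, t) predicts true_noise, with an
# MSE loss. Sampling runs the process in reverse, stepping from pure noise
# back toward a clean image.
```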
GANs, or Generative Adversarial Networks - These are two neural networks that train against each other. The generator learns to create plausible data, and its outputs serve as the negative examples the discriminator uses to learn to tell fake data from real data. When the discriminator correctly flags an output as fake, that feedback is used to improve the generator; while one network is being updated, its rival is held fixed. It is a game in the game-theory sense, and the competition pushes both sides to get better. This feedback loop produces higher-quality and more believable output.
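To make that feedback loop concrete, here is a minimal PyTorch sketch of a GAN training loop on toy two-dimensional data; the tiny networks and synthetic data are placeholders, not anything you would use for images.

```python
# Minimal GAN training loop sketch: alternate between updating the
# discriminator (score real high, fake low) and the generator (fool the
# discriminator), each time holding the other network fixed.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))    # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) + 3.0           # "real" samples from a toy distribution
    fake = G(torch.randn(64, 16))             # generator's attempt at plausible data

    # Discriminator turn: learn to separate real from fake (generator frozen)
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Generator turn: learn to make samples the discriminator scores as real
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()
```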
With that understanding I was able to dive deeper into custom configurations and produce an outcome from a real image that depicted what I wanted in form, style and size. I can take that image and refine it through inpainting, feeding previous outputs directly back in as the next input. Interestingly, I was also able to take that style and apply it to other images, carrying what worked in earlier iterations over to new requests.
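Here is roughly what that inpainting refinement step could look like in diffusers rather than the UI; the inpainting model id, file names, mask, and prompt are all illustrative stand-ins.

```python
# Sketch of an inpainting pass: a previous output goes back in as the image,
# and a mask marks the region to repaint. Paths and prompt are placeholders.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

previous_output = Image.open("pirate_map_pass1.png").convert("RGB").resize((512, 512))
mask = Image.open("mask_treasure_area.png").convert("L").resize((512, 512))  # white = repaint

refined = pipe(
    prompt="weathered pirate treasure map, red X marking the treasure, watercolor",
    image=previous_output,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
refined.save("pirate_map_pass2.png")
```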
Once I had a method for pirate maps, I went back to my superheroes. I was able to take the method a step further and create my own model, trained on a dataset I specifically wanted to use. To do that you need lots of training data, which is a big part of why ChatGPT (Generative Pre-trained Transformer) does what it does so well. Does it make a little more sense now?
ChatGPT has a nearly endless supply of information to draw on, out of the estimated 328 million terabytes of data created every day. From this you can see why the Dall-E 2 diffusion model works so well: it's the training data set. The same goes for ChatGPT.
But once again, I didn't want a semblance of Will Ferrell with cats in space. I wanted a trusted source to illustrate Matt as the real superhero he is.
In comes Dreambooth. Introduced by Google in 2022, it allows fine-tuning of diffusion models by introducing custom content.
With Dreambooth you supply a set of input images and write specific text prompts that include a unique identifier and the class of the subject. Here the identifier would be "Matt" and the class name would be "person".
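One way to run that fine-tune is the DreamBooth example script that ships with diffusers. The sketch below is only a guess at a reasonable invocation: every path, model name, and hyperparameter is a placeholder, and only the identifier/class structure ("Matt" / "person") comes from the text above.

```python
# Hedged sketch of launching the diffusers DreamBooth example script
# (train_dreambooth.py). Paths, model name, and hyperparameters are
# illustrative; the instance/class prompts mirror the "Matt" / "person" idea.
import subprocess

subprocess.run([
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
    "--instance_data_dir", "./matt_photos",          # your source images
    "--class_data_dir", "./person_class_images",     # generic "person" images
    "--instance_prompt", "a photo of Matt person",   # identifier + class
    "--class_prompt", "a photo of a person",         # class alone, for prior preservation
    "--with_prior_preservation",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--learning_rate", "5e-6",
    "--max_train_steps", "800",
    "--output_dir", "./dreambooth-matt",
], check=True)
```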
The source images are best raw, plentiful, diverse in nature and 512x512. They should cover different angles and backgrounds in well-lit environments, without hats or objects blocking the face, so that the learning mechanism can distinguish the subject from the background. This will create a custom model checkpoint based on your source data.
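A small helper like the following can handle the 512x512 prep, assuming a folder of raw photos; the folder and file names are made up.

```python
# Crop and resize a folder of raw photos into the 512x512 squares DreamBooth
# expects. Folder names are placeholders.
from pathlib import Path
from PIL import Image, ImageOps

src_dir = Path("raw_photos_of_matt")
out_dir = Path("matt_photos")
out_dir.mkdir(exist_ok=True)

for i, path in enumerate(sorted(src_dir.glob("*.jpg"))):
    img = Image.open(path).convert("RGB")
    img = ImageOps.exif_transpose(img)      # respect phone-camera rotation
    img = ImageOps.fit(img, (512, 512))     # center-crop to square, resize to 512x512
    img.save(out_dir / f"matt_{i:03d}.png")
```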
Save your files in Google Drive, reference the new model by using your identifier in your prompt, and put it all together using the Automatic1111 (Stable Diffusion) GUI.
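If you skip the Automatic1111 GUI, the fine-tuned checkpoint can also be loaded directly with diffusers; the output directory and prompt below are placeholders.

```python
# Load the DreamBooth-tuned weights and prompt with the identifier + class.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-matt", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a comic book illustration of Matt person as a superhero, cape, dramatic lighting",
    negative_prompt="ugly, blurry, deformed",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("matt_superhero.png")
```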
The end result is not Matt, but it is closer to what I want. Everything, including this blog, is a work in progress. The important takeaway is that the quality of the input data is the differentiator in the quality of the output. This follows my standard approach of people, process and then technology. If you don't have the right people or the right processes, technology will never deliver what you are looking for.
The next update will
move away from superheroes and AI image generation and apply these same principles
to business processes.