Thursday, November 13, 2025

Image generation by AI

So, I asked an AI chatbot, with some prompts, to recreate the sketch I posted earlier in the context of COP30 (it was haphazardly drawn within a few minutes; communicating the idea, more than aesthetic sensibility, was the intent, and I was also occupied with a few things), and it created this!

Quite rudimentary, but I am sure there are much more sophisticated versions behind paywalls. How AI creates these is an interesting question. So here is how this broadly works. I am quite sure it is much more sophisticated than what I explain here, but this is enough to get the gist of how AI functions in the mainstream interface. The first issue is that when you give a prompt, it has to be understood by the computer. This is done by Natural Language Processing (NLP). NLP enables the computer to comprehend and interpret (and also reply in) human language. NLP uses Machine Learning (ML) and Deep Learning (DL) algorithms to discern a word's semantic meaning by deconstructing the sentence grammatically, relationally and structurally, to understand the context of use. It can also understand intent and emotion (whether irritated, frustrated, confused and so on) to draw inferences from a broad array of linguistic models and algorithms.

To get deeper into its working: NLP breaks a sentence into chunks, with each word separated. This is called tokenization. So, if the sentence has 8 words it will be 8 tokens. Next is stemming, i.e., deriving the stem of each word or token. Suffixes, prefixes and tense markers are removed to get the stem. For instance, for the words sitting, sits, sat…the stem is sit. But there is a problem: stemming is not always correct. There are words whose stems would mean something different. For instance, Universal and University do not stem down to the word Universe. For such situations an alternative tool comes into play, called lemmatization. A given token's meaning is learned through its dictionary definition, and then the root (or lemma) is derived. So, the stem of the word better is bet, while its lemma is good. Stemming and lemmatization are therefore carefully done according to the context of the token. The context is derived from part-of-speech tagging, i.e., how the token is used in the sentence, whether as a noun or a verb. Next is finding whether the word has any entity associated with it (named entity recognition): for instance, the token Kerala has the entity Indian state associated with it, and Sanjay has the entity person's name. These are some of the tools NLP uses to convert unstructured human speech into structured data that the computer understands and can then apply in any AI application.
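
To make these steps concrete, here is a minimal sketch using NLTK, an open-source Python toolkit (my choice purely for illustration; whatever pipeline the chatbot actually runs is far more sophisticated and not public). The example sentence is made up, and the download names can vary slightly across NLTK versions:

```python
# A minimal sketch of the NLP steps above using the NLTK toolkit.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the tokenizer, tagger and dictionary data.
for pkg in ("punkt", "averaged_perceptron_tagger",
            "wordnet", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

sentence = "Sanjay was sitting in Kerala hoping for a better sketch"

# 1. Tokenization: break the sentence into word tokens.
tokens = nltk.word_tokenize(sentence)          # 10 words -> 10 tokens

# 2. Stemming: crude affix stripping, e.g. "sitting" -> "sit".
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# 3. Lemmatization: dictionary-based root, e.g. "better" -> "good"
#    when tagged as an adjective.
lemma = WordNetLemmatizer().lemmatize("better", pos="a")   # "good"

# 4. Part-of-speech tagging: noun, verb, etc. for each token.
tagged = nltk.pos_tag(tokens)                  # e.g. ('Kerala', 'NNP')

# 5. Named entity recognition: should mark Sanjay as a PERSON and
#    Kerala as a place (GPE), though results vary by model.
entities = nltk.ne_chunk(tagged)
```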

So now that the computer understands, what happens next? This is where Deep Learning (DL) comes into play, and it has indeed revolutionized AI. DL is a specialized subset of Machine Learning (ML) that layers algorithms to create a Neural Network, a computational model replicating the brain's structure and functionality. It is DL that enables NLP's understanding capabilities, the context and intent of what is conveyed. ML was very much active in the 1980s, while Neural Networks (NN) came into their own in the 1990s but were stuck for a long time. The arrival of Big Data (through the internet), enhanced computational power (GPUs) and of course Deep Learning NNs is what led to the AI revolution (essentially Generative AI). Traditional ML algorithms had a major issue: efficiency and performance plateau as the dataset grows. DL algorithms, on the other hand, continue to learn and improve with more data. A DL NN consists of interconnected nodes known as neurons that take incoming data and learn to make decisions over time. An NN consists of an input layer, hidden layers (their number varies, giving more depth as it increases) and an output layer. Each hidden layer transforms the input data by applying an 'activation function', a mathematical function that allows the network to learn complex patterns. An NN is trained by feeding it data; the error is sent back through the network to adjust internal parameters (weights and biases), helping to reduce error in future predictions. There are different types of NN. In the above case, where it analyses the drawing (image), a 'Convolutional NN' (CNN) is used (convolution is a mathematical operation done by each layer on the output of the previous layer, mixing two functions). Another popular NN is the 'Recurrent NN' (RNN); here each neuron (in the hidden layer) receives input with a specific delay in time, allowing the RNN to consider the context of the input and access previous information in the current iteration. RNNs are used in predicting the next word in a sentence.
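
Here is a toy network in plain NumPy that shows all of these pieces at once: an input layer, a hidden layer with an activation function, an output layer, and the error being sent back to adjust the weights and biases. The XOR problem and the layer sizes are my choices for illustration; this is a didactic sketch, not how production networks are built:

```python
# A toy neural network learning XOR, a classic pattern that a
# single layer cannot capture but a hidden layer can.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Parameters (weights and biases): input(2) -> hidden(8) -> output(1).
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    """Activation function: lets the network learn non-linear patterns."""
    return 1 / (1 + np.exp(-z))

for step in range(5000):
    # Forward pass: each layer transforms the previous layer's output.
    h = sigmoid(X @ W1 + b1)      # hidden layer
    out = sigmoid(h @ W2 + b2)    # output layer

    # Backward pass: the prediction error flows back through the
    # network, nudging weights and biases to reduce future error.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out
    b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h
    b1 -= d_h.sum(axis=0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]
```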

Next is to generate an entirely new image. This is done through Generative AI (of course all of these are interconnected; I am separating them for convenience of understanding), which is all the rage now. GenAI creates new content, whether text, images, music, audio or video. GenAI is a class of AI systems that learn from large datasets, using ML and DL algorithms to recognize patterns and trends and create new content. The design of a GenAI model changes depending on what it is designed to do and how it will be used; each is specifically crafted to generate new content. There are many types of GenAI models, or GenAI architectures. Where images have to be generated (after a Convolutional Neural Network (CNN) is used to analyze the drawing), a Variational Auto-Encoder (VAE) model or a Generative Adversarial Network (GAN) is used. Both are equally interesting, and it is very likely that a VAE is being used in this specific case where I fed in the image. A VAE works by transforming input data through encoding and decoding. The encoder takes the input data and turns it into a simpler form called a latent space representation, which holds the key features of the data. The decoder then uses this latent space representation to create new outputs. It therefore creates new, realistic images based on the patterns it has learned from the data. A GAN, meanwhile, involves two neural networks: the Generator and the Discriminator. The Generator tries to make new data samples that look real. The Discriminator inspects the generated data, trying to tell the difference between real and fake. The process continues until the Generator becomes so good at producing realistic data that the Discriminator is no longer able to distinguish. GANs are used to generate high-quality, realistic images. Another type of GenAI, the one that really took the AI craze mainstream, is the Transformer architecture. AI exploded into the popular imagination through ChatGPT in 2022; GPT stands for Generative Pretrained Transformer. All the major chatbots use the Transformer architecture. Transformers are used in NLP tasks, with encoder and decoder layers that enable the model to effectively generate text sequences.
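
To make the VAE's encoder, latent space and decoder tangible, here is a minimal sketch in PyTorch. The framework, the layer sizes and the 784-pixel input are my illustrative assumptions; the model behind the chatbot is unknown and vastly larger:

```python
# A minimal Variational Auto-Encoder: encode -> latent space -> decode.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, n_pixels=784, n_latent=16):
        super().__init__()
        # Encoder: compress the image into the latent space. It outputs
        # a mean and a log-variance, which is what makes the
        # auto-encoder "variational" rather than deterministic.
        self.encoder = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, n_latent)
        self.to_logvar = nn.Linear(256, n_latent)
        # Decoder: turn a latent vector back into pixel values.
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_pixels), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent point near mu.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

vae = TinyVAE()
# After training, brand-new images come from decoding random latent
# vectors: points in the latent space the model has learned.
new_images = vae.decoder(torch.randn(4, 16))
```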

It is also important to know about Foundation Models (FMs). These are the core of contemporary AI (except of course reinforcement learning, which is what makes my chess engine evaluate positions, and which I can play against!). All the big data, large computation and energy are going into maintaining FMs. They are very large NNs trained using ML and DL on terabytes of unstructured data in an unsupervised manner. The datasets are diverse, capturing a range of knowledge, hence the models can be adapted to wide-ranging tasks. Earlier, each AI model was trained on very specific data to perform a very specific task. Now an FM is able to transfer to multiple different tasks and perform multiple different functions. FMs serve as the base, or foundation, for a multitude of applications. When introduced to a small amount of labelled data, an FM can be tuned to a specific task. Asking or prompting a chatbot is a way of drawing task-specific output from an FM. So, when I ask the chatbot to recreate the image, it is in effect adapting what the FM has learned to this specific task of generating an image. The Large Language Model (LLM) is the text version of the FM that fuels the GenAI chatbot revolution. Models for different domains, like vision, code, science or climate change, are achieved by tuning FMs.
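
As a small illustration of tuning a pretrained model to a new task with a little labelled data, here is a common transfer-learning pattern in PyTorch. The ImageNet-pretrained ResNet is only a stand-in for a foundation model (real FMs are vastly larger), and the five-class "sketch classifier" task is entirely hypothetical:

```python
# Adapting a pretrained network to a specific task: freeze the broad
# knowledge it already has, then train a small task-specific head.
import torch.nn as nn
from torchvision import models

# Load a network pretrained on a large, diverse dataset (ImageNet).
base = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the general-purpose knowledge already in the network...
for param in base.parameters():
    param.requires_grad = False

# ...and replace only the final layer with a task-specific head,
# here a hypothetical five-class classifier.
base.fc = nn.Linear(base.fc.in_features, 5)

# Training now updates just this new head on the small labelled
# dataset, adapting the broad base model to the narrow task.
```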