So, I asked an AI chatbot, with some prompts, to recreate the sketch I posted earlier in the context of COP30 (it was drawn haphazardly within a few minutes; communicating the idea, rather than aesthetic sensibility, was the intent, and I was also occupied with a few other things), and it created this!
Quite rudimentary, but I am sure there are much more sophisticated versions behind paywalls. How AI creates these is an interesting question, so here is how this broadly works. I am quite sure it is far more sophisticated than what I explain here, but this is enough to get the gist of how AI functions in mainstream interfaces. The first issue is that when you give a prompt, it has to be understood by the computer. This is done through Natural Language Processing (NLP). NLP enables a computer to comprehend and interpret (and also reply in) human language. NLP uses Machine Learning (ML) and Deep Learning (DL) algorithms to discern a word's semantic meaning by deconstructing the sentence grammatically, relationally and structurally, to understand the context of use. It can also infer intent and emotion (whether irritated, frustrated, confused and so on) by drawing on a broad array of linguistic models and algorithms. To get deeper into its working: NLP breaks a sentence into chunks, with each word separated out. This is called Tokenization. So, if the sentence has 8 words, it will have 8 tokens. Next is Stemming, i.e., deriving the stem of each word or token. Suffixes, prefixes and tense markers are removed to get the stem. For instance, for the words sitting, sits and sat, the stem is sit. But there is a problem: stemming is not always correct. There are words whose stems would mean something different; for instance, Universal and University don't stem down to the word Universe. For such situations an alternative tool comes into play, called Lemmatization. A given token's meaning is learned through its dictionary definition, and then the root (or lemma) is derived. So, the stem of the word better is bet, while its lemma is good. Stemming and lemmatization are therefore done carefully, according to the context of the token. The context is derived from part-of-speech tagging, i.e., how the token is used in the sentence, whether as a noun or a verb. Next is finding whether the word has any entity associated with it (entity recognition): for instance, the token Kerala has the entity Indian state associated with it, and Sanjay has the entity of a person's name. These are some of the tools NLP uses to convert unstructured human speech into structured data that the computer understands, which can then be applied in any AI application.
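The steps above can be sketched in a few lines of Python. This is only a toy illustration: the suffix list, lemma dictionary and entity dictionary below are invented for the example; real NLP systems use far richer rules and learned models.

```python
# Toy versions of the NLP steps above: tokenization, stemming,
# lemmatization and entity recognition. All rules and dictionaries
# here are made up for illustration.

def tokenize(sentence):
    """Break the sentence into word tokens (8 words -> 8 tokens)."""
    return sentence.lower().strip(".").split()

SUFFIXES = ("ting", "ing", "ter", "s")  # crude, illustrative suffix list

def stem(token):
    """Strip a suffix to get the stem: sitting/sits -> sit, better -> bet."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

# Lemmatization instead looks the word up, so better -> good.
LEMMAS = {"better": "good", "sat": "sit", "sitting": "sit", "sits": "sit"}

def lemmatize(token):
    return LEMMAS.get(token, token)

# Entity recognition: attach known entities to tokens.
ENTITIES = {"kerala": "Indian state", "sanjay": "person's name"}

def entities(tokens):
    return {t: ENTITIES[t] for t in tokens if t in ENTITIES}

tokens = tokenize("Sanjay sits in Kerala.")
print(tokens)                               # 4 words -> 4 tokens
print(stem("sitting"), stem("sits"))        # both stem to 'sit'
print(stem("better"), lemmatize("better"))  # stem 'bet', lemma 'good'
print(entities(tokens))                     # structured data for the computer
```

Note how the crude suffix rule turns better into bet, which is exactly why lemmatization exists alongside stemming.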
So now that the computer understands, what happens next? This is where Deep Learning (DL) comes into play, and it has indeed revolutionized AI. DL is a specialized subset of Machine Learning (ML) that layers algorithms to create a Neural Network, a computational model replicating the brain's structure and functionality. It is DL that enables NLP's understanding capabilities: the context and intent of what is conveyed. ML was very much active in the 1980s, and Neural Networks (NN) had been around even longer, but the field was stuck for a long time. The arrival of Big Data (through the internet), enhanced computation power (GPUs) and of course Deep Learning NNs is what led to the AI revolution (essentially Generative AI). Traditional ML algorithms had a major issue: their efficiency and performance plateau as the dataset grows. DL algorithms, on the other hand, continue to learn and improve with more data. A DL NN consists of interconnected nodes, known as neurons, that take incoming data and learn to make decisions over time. An NN consists of an input layer, hidden layers (their number varies, giving more depth as it increases) and an output layer. Each hidden layer transforms the input data by applying an 'activation function', a mathematical function that allows the network to learn complex patterns. An NN is trained by feeding it data; the error is sent back through the network to adjust the internal parameters (weights and biases), helping to reduce the error in future predictions. There are different types of NN. In the above case, where it analyses the drawing (image), a 'Convolutional NN' (CNN) is used (a convolution is a mathematical operation, mixing two functions, that each layer performs on the output of the previous layer).
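To make the convolution concrete, here is a minimal sketch in plain Python: a small kernel slides over a tiny image, and each output value mixes the kernel with the patch of image beneath it. The image and kernel values are invented for the example; a real CNN learns its kernel values during training instead of hard-coding them.

```python
# A single 2D convolution, the operation each CNN layer applies:
# slide a small kernel over the image and sum elementwise products.
# The kernel here is hard-coded for illustration; a CNN learns it.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Mix the kernel with the image patch under it.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
            row.append(total)
        out.append(row)
    return out

image = [  # tiny 4x4 "image": bright left half, dark right half
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
]
kernel = [  # responds to vertical edges (left minus right)
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
print(convolve2d(image, kernel))  # large values where the edge is
```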
Another popular NN is the 'Recurrent NN' (RNN), in which each neuron (in a hidden layer) receives input with a specific delay in time, allowing the RNN to consider the context of the input, that is, to access previous information in the current iteration. RNNs are used, for example, in predicting the next word in a sentence.
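The training loop described above, a forward pass through an activation function, measuring the error, and pushing a correction back into the weights and biases, can be sketched with a single neuron. This is a minimal illustration with invented numbers, not a real training framework:

```python
import math

# One neuron learning to output 1 for input 1: forward pass through
# a sigmoid activation, then gradient descent on weight and bias.

def sigmoid(z):
    """Activation function: squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

weight, bias = 0.1, 0.0   # internal parameters, start arbitrary
x, target = 1.0, 1.0      # a single made-up training example
lr = 0.5                  # learning rate

for step in range(200):
    pred = sigmoid(weight * x + bias)   # forward pass (activation)
    error = pred - target               # how wrong was the prediction?
    grad = error * pred * (1 - pred)    # error sent back (chain rule)
    weight -= lr * grad * x             # adjust the weights and biases
    bias -= lr * grad                   # to reduce future error

print(round(sigmoid(weight * x + bias), 2))  # now close to the target 1.0
```

Real networks do exactly this, just with millions of neurons, many layers, and batches of data instead of one example.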
Next is generating an entirely new image. This is done through Generative AI (of course all of these are interconnected; I am separating them for convenience of understanding), which is all the rage now. GenAI creates new content, whether text, images, music, audio or video. It is a class of AI systems that learn from large datasets, using ML and DL algorithms to recognize patterns and trends and then create new content. The design of a GenAI model changes depending on what it is meant to do and how it will be used; these models are specifically crafted to generate new content. There are many types of GenAI models, or GenAI architectures. Where images have to be generated (after a Convolutional Neural Network (CNN) is used to analyze the drawing), a Variational Auto-Encoder (VAE) model or a Generative Adversarial Network (GAN) is used. Both are equally interesting, and it is very likely that a VAE is being used in this specific case where I fed in the image. A VAE works by transforming input data through encoding and decoding. The encoder takes the input data and turns it into a simpler form called a latent space representation, which holds the key features of the data. The decoder then uses this latent space representation to create new outputs. The model therefore creates new, realistic images based on the patterns it has learned from the data. A GAN, meanwhile, involves two neural networks: the Generator and the Discriminator. The Generator tries to make new data samples that look real. The Discriminator examines the generated data, trying to tell the difference between real and fake data. The process continues until the Generator becomes so good at producing realistic data that the Discriminator is no longer able to distinguish them. GANs are used to generate high-quality, realistic images. Another type of GenAI, the one that really pushed the AI craze into the mainstream, is the Transformer architecture. AI exploded into the popular imagination through ChatGPT in 2022; GPT stands for Generative Pretrained Transformer. All the chatbots use the Transformer architecture. Transformers are used in NLP tasks, with encoder and decoder layers that enable the model to effectively generate text sequences.
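The core operation inside a Transformer, scaled dot-product attention, can be sketched numerically: each token's "query" is compared against every token's "key", the scores are softmaxed into weights, and the weights blend the "values" into a context-aware output. This shows only the attention step, with tiny made-up 2-dimensional token vectors, not a full encoder/decoder stack:

```python
import math

# Scaled dot-product attention, the heart of the Transformer.
# The 2-d token vectors below are invented for illustration.

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])                     # key dimension, used for scaling
    outputs = []
    for q in queries:
        # Compare this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)        # how much to attend to each token
        # Blend the values using the attention weights.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Three "tokens", each a 2-d vector; queries = keys = values is
# self-attention, as in a Transformer layer.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
print(out)  # each token is now a context-aware blend of all three
```

This weighing of every token against every other is what lets a Transformer use the whole sentence as context when predicting the next word.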
It is also important to know about Foundation Models (FM). These are the core of contemporary AI (except of course reinforcement learning, which is what makes my chess engine evaluate positions, or lets me play against it!). All the big data, large computation and energy are going into building and maintaining FMs. They are very large NNs, trained using ML and DL on terabytes of unstructured data in an unsupervised manner. The datasets are diverse, capturing a wide range of knowledge, hence an FM can be adapted to wide-ranging tasks. Earlier, each AI model was trained on very specific data to perform a very specific task. Now an FM is able to transfer to multiple different tasks and perform multiple different functions. FMs serve as the base, or foundation, for a multitude of applications. When introduced to a small amount of labelled data, you can tune an FM to a specific task. Asking or prompting a chatbot is one way of drawing on an FM. So, when I ask the chatbot to recreate the image, it is adapting what the FM has learned to this specific task of generating an image. A Large Language Model (LLM) is the text version of an FM, and it fuels the GenAI chatbot revolution. Different domains, such as models for vision, code, science or climate change, are served by tuning an FM.
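The tuning idea can be sketched in miniature: keep the big pretrained network frozen and adapt only a small piece to your task with a little labelled data. In this toy, the "foundation model" is a frozen stand-in function whose outputs never change, and only a tiny task head is trained; every number and function here is invented for illustration:

```python
# Fine-tuning in miniature: the "foundation model" below is frozen
# (never updated); only the small task head is trained on a handful
# of labelled examples.

def foundation_features(x):
    """Stand-in for a huge pretrained network: frozen, never updated."""
    return [x, x * x]          # pretend these are rich learned features

# Tiny labelled dataset for the downstream task: y = 2*x + 3*x^2
data = [(1.0, 5.0), (2.0, 16.0), (3.0, 33.0)]

head = [0.0, 0.0]              # the only trainable parameters
lr = 0.01                      # learning rate

for _ in range(1000):          # adapt the head with gradient descent
    for x, y in data:
        feats = foundation_features(x)          # frozen forward pass
        pred = sum(w * f for w, f in zip(head, feats))
        err = pred - y
        head = [w - lr * err * f for w, f in zip(head, feats)]

print([round(w, 1) for w in head])  # approaches [2.0, 3.0]
```

The point of the split is economy: the expensive part (the frozen features) is reused across tasks, while only the small head needs task-specific data.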










