Activation Functions

Also known as a transfer function, an activation function is used to obtain the output of a node within a neural network. Depending on the type of function, the output may be bounded (for example, between 0 and 1 or between -1 and 1) or unbounded. Functions are divided into two types: linear and non-linear. Non-linear functions are the most commonly used in machine learning models because they let a network adapt to a wide variety of data and represent relationships that a linear function cannot.

Non-linear functions can be classified by the basis of their range or curve:

  • Sigmoid or Logistic Activation Function looks like an S-curve. Its range is between 0 and 1, so it is often used to predict probabilities. It is used in feed-forward neural networks.
  • Tanh / hyperbolic tangent function looks similar to a sigmoid function as it is still an S-curve, but the range is between -1 and 1. Negative inputs are mapped strongly negative and inputs near zero are mapped near zero in the tanh graph. It is used in feed-forward neural networks.
  • ReLU (Rectified Linear Unit) Activation Function is currently one of the most widely used activation functions, since it is used in almost all convolutional neural networks and deep learning models. Its drawback is that any negative input is turned into zero immediately, so negative values are not mapped appropriately; this can decrease the ability of the model to fit or train from the data properly (the “dying ReLU” problem).
  • Leaky ReLU is an attempt to solve the dying ReLU problem. It increases the range of the ReLU function by using a small slope for negative values instead of a flat slope, so negative inputs produce small negative outputs rather than zero.

Source: Towards Data Science
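
The four functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the source; the 0.01 slope used for Leaky ReLU is a common default and is an assumption here.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: squashes a real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Hyperbolic tangent: an S-curve with range (-1, 1)."""
    return math.tanh(x)

def relu(x: float) -> float:
    """Rectified Linear Unit: negative inputs become zero."""
    return max(0.0, x)

def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: negative inputs keep a small slope (assumed 0.01)."""
    return x if x > 0 else negative_slope * x

# Compare how each function treats a negative, a zero, and a positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

Note how ReLU returns exactly zero for the negative input while Leaky ReLU returns a small negative number.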


Adversarial Training

An approach that pits multiple chatbots against each other: one chatbot plays the adversary and attacks another chatbot by generating text to force it to buck its usual constraints and produce unwanted responses. Successful attacks are added to a model’s training data in the hope that it learns to ignore them.       

Source: TechTarget


AGI

AGI – also known as Strong AI – stands for Artificial General Intelligence—a hypothetical future technology that can perform most economically productive tasks more effectively than a human. Such a technology may also be able to uncover new scientific discoveries. Researchers tend to disagree on whether AGI is even possible, or if it is, how far away it remains. OpenAI and DeepMind are both expressly committed to building AGI.

Source: Time


Alignment Problem

The “alignment problem” is one of the most profound long-term safety challenges in AI. Today’s AI is not capable of overpowering its designers. But one day, many researchers expect, it might be. In that world, current ways of training AIs might result in them harming humanity, whether in pursuit of arbitrary goals, or as part of an explicit strategy to seek power at our expense. To reduce the risk, some researchers are working on “aligning” AI to human values. But this problem is difficult, unsolved, and not even fully understood. Many critics say the work to solve it is taking a back seat as business incentives lure the leading AI labs toward pouring focus and computing power into making their AIs more capable.

Source: Time


AlphaGo

A computer program that plays the board game Go. It was developed by Alphabet Inc.’s Google DeepMind in London. AlphaGo has several versions including AlphaGo Zero, AlphaGo Master, AlphaGo Lee, etc. In October 2015, AlphaGo became the first computer Go program to beat a human professional Go player without handicaps on a full-sized 19×19 board.

Source: Wikipedia


ASI

ASI stands for Artificial Super Intelligence – A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. “Superintelligence” may also refer to a property of problem-solving systems (e.g., superintelligent language translators or engineering assistants) whether or not these high-level intellectual competencies are embodied in agents that act in the world. A superintelligence may or may not be created by an intelligence explosion and associated with a technological singularity. Some researchers believe that superintelligence will likely follow shortly after the development of artificial general intelligence. The first generally intelligent machines are likely to immediately hold an enormous advantage in at least some forms of mental capability, including the capacity of perfect recall, a vastly superior knowledge base, and the ability to multitask in ways not possible to biological entities. This may give them the opportunity to—either as a single being or as a new species —become much more powerful than humans, and to displace them.

Source: Wikipedia


Attention Mechanism

Attention is a technique that is used by transformers in large language models (LLMs) to mimic cognitive attention. Each time a model predicts an output word, it focuses on the parts of the input where the most relevant information is concentrated instead of weighting the entire sequence equally. In simpler words, it pays the most attention to some input words. This helps the model to cope efficiently with long input sentences. Learning which part of the data is more important than another depends on the context, and this is trained by gradient descent. Its flexibility comes from its role as “soft weights” that can change during runtime, in contrast to standard weights that must remain fixed at runtime. Attention is a game changer for LLMs as they do not suffer from the short-term memory issues inherently found in recurrent neural networks (RNNs).
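
The idea of “soft weights” can be made concrete with a small sketch. The example below assumes the scaled dot-product form of attention for a single query; the key, value, and query vectors are made-up numbers for illustration only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # the "soft weights", recomputed per input
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Three input positions, each with a key and a value (hypothetical numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print(weights, output)
```

Positions whose keys align with the query receive larger weights, so their values dominate the output.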


Autoencoder

Autoencoders are a specific type of feedforward neural networks where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation. An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses the input and produces the code, the decoder then reconstructs the input only using this code.

Source: Towards Data Science


Backpropagation

Backpropagation is a process involved in training a neural network. It involves taking the error rate of a forward propagation and feeding this loss backward through the neural network layers to fine-tune the weights. Backpropagation computes the gradient of a loss function with respect to the weights of the network for a single input–output example, and does so efficiently, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this can be derived through dynamic programming. Gradient descent, or a variant such as stochastic gradient descent, is commonly used to update the weights from these gradients. Backpropagation is the essence of neural net training.

Source: Wikipedia
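
As a rough illustration of the backward pass, the sketch below applies the chain rule to a tiny network with one hidden unit and checks the result against a numerical estimate. The network shape, parameter values, and squared-error loss are assumptions chosen for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """One hidden sigmoid unit followed by a linear output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return h, y_hat

def loss(params, x, y):
    return (forward(params, x)[1] - y) ** 2

def backward(params, x, y):
    """Chain rule applied from the output layer back to the input layer."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_yhat = 2.0 * (y_hat - y)   # dL/dy_hat
    d_w2 = d_yhat * h            # output layer weight
    d_b2 = d_yhat                # output layer bias
    d_h = d_yhat * w2            # error passed back to the hidden unit
    d_z = d_h * h * (1.0 - h)    # through the sigmoid
    return [d_z * x, d_z, d_w2, d_b2]

def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]  # w1, b1, w2, b2 (arbitrary values)
analytic = backward(params, x=1.5, y=1.0)
numeric = numerical_gradient(params, x=1.5, y=1.0)
print(analytic)
print(numeric)
```

The two gradient lists should agree closely, which is a common sanity check when implementing backpropagation by hand.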


Bayesian Network

Bayesian networks are helpful for solving probabilistic problems. They are a type of graphical model that uses probability to determine the occurrence of an event. A Bayesian network is also known as a belief network or a causal network. It consists of a directed acyclic graph (DAG) and tables of conditional probabilities used to find the probability of an event happening. The graph contains nodes and edges, where the nodes represent random variables and the directed edges connect them. The graph is acyclic, meaning that no directed path leads from a node back to itself. The table of probability, on the other hand, shows the likelihood that a random variable will take on certain values given the values of its parent nodes. Bayesian networks are commonly used in AI applications for email spam filtering, biomonitoring, image processing, and document classification.

Source: Turing
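
A small worked example may help. The sketch below assumes a hypothetical three-node network (rain and a sprinkler both influence wet grass) with made-up probabilities, and answers a query by enumerating the joint distribution.

```python
from itertools import product

# Hypothetical prior and conditional probability tables.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
P_WET_GIVEN = {  # (rain, sprinkler) -> P(wet grass)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """Joint probability: each node conditioned only on its parents."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)

def prob_rain_given_wet():
    """P(rain | wet grass), by summing over the unobserved sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / denominator

print(prob_rain_given_wet())
```

Enumeration is only practical for very small networks; it is used here because it keeps the link to the probability tables visible.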


Bias

Biases are additional numerical values that are added to the weighted sum of inputs before being passed through an activation function. They help to control the output of neurons and provide flexibility in a neural network model’s learning process. Biases can be thought of as a way to shift the activation function to the left or right, allowing the model to learn more complex patterns and relationships in the input data.

The term is also used in a separate sense: machine learning systems are described as “biased” when the decisions they make are consistently prejudiced or discriminatory. AI-augmented sentencing software has been found to recommend higher prison sentences for Black offenders compared to white ones, even for equal crimes. And some facial recognition software works better for white faces than Black ones. These failures often happen because the data those systems were trained on reflects social inequities. Modern AIs are essentially pattern replicators: they ingest large amounts of data through a neural network, which learns to spot patterns in that data. If there are more white faces than Black faces in a facial recognition dataset, or if past sentencing data indicates Black offenders are sentenced to longer prison terms than white ones, then machine learning systems can learn the wrong lessons, and begin automating those injustices.

Source: Time


Boltzmann Machine / Restricted Boltzmann Machine (RBM)

A Boltzmann Machine is a kind of recurrent neural network where the nodes make binary decisions and are present with certain biases. Boltzmann Machines consist of a learning algorithm that helps them to discover interesting features in datasets composed of binary vectors. The main purpose of the Boltzmann Machine is to optimize the solution of a problem. It optimizes the weights and quantities related to the particular problem assigned to it. This method is used when the main objective is to create mapping and learn from the attributes and target variables in the data. When the objective is to identify an underlying structure or the pattern within the data, unsupervised learning methods for this model are considered to be more useful.


Restricted Boltzmann Machines (RBMs) are shallow, two-layer neural nets that constitute the building blocks of deep-belief networks. The first layer of the RBM is called the visible, or input, layer, and the second is the hidden layer. Each layer is made up of neuron-like units called nodes. The main difference between a Boltzmann machine and a restricted Boltzmann machine is that there is no intralayer communication, i.e., the nodes of the same layer are not connected, which makes them independent from each other. This restriction on intralayer connection or communication is what makes it special and easy to compute.

Sources: Analytics India Magazine, Analytics Steps


Chain of Thought (CoT)

Chain-of-thought (CoT) prompting improves the reasoning ability of large language models by prompting them to generate a series of intermediate steps that lead to the final answer of a multi-step problem.

Source: NL Planet


Compute

Computing power, often referred to as simply “compute,” is one of the three most important ingredients for training a machine learning system. (For the other two, see: Data and Neural networks.) Compute is effectively the energy source that powers a neural network as it “learns” patterns in its training data. Generally speaking, the more computing power is used to train a large language model, the higher its performance on many different types of test becomes. (See: Scaling laws and Emergent capabilities.) Modern AI models require colossal amounts of computing power, and hence electrical energy, to train. While AI companies typically do not disclose their models’ carbon emissions, independent researchers estimated the training of OpenAI’s GPT-3 resulted in over 500 tons of carbon dioxide being pumped into the atmosphere, equal to the yearly emissions of about 35 U.S. citizens. As AI models get larger, those numbers are only going to rise. The most common computer chip for training cutting-edge AI is the graphics processing unit (GPU).

Source: Time


Context Length / Window

The “context window” refers to how much text (i.e., the number of tokens) a language model can look back on and reference when attempting to generate text.
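
One simple way to picture this limit is as a cut-off on the token list. The sketch below is an assumption-laden illustration: it treats whitespace-separated words as tokens and keeps only the most recent ones, whereas real models use their own tokenizers and may handle overflow differently.

```python
def fit_to_context(tokens, context_length):
    """Keep only the most recent tokens that fit in the context window."""
    if context_length <= 0:
        return []
    return tokens[-context_length:]

# Whitespace splitting stands in for a real tokenizer (a simplification).
tokens = "the quick brown fox jumps over the lazy dog".split()
visible = fit_to_context(tokens, context_length=4)
print(visible)
```

Anything that falls outside the window is simply not available to the model when it generates the next token.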


Convolutional Neural Network (CNN)

A convolutional neural network (CNN) is a class of artificial neural network most commonly applied to analyze visual imagery. CNNs use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers. They are specifically designed to process pixel data and are used in image recognition and processing. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series. Also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN).

Source: Wikipedia
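
The convolution operation itself can be sketched without any libraries. The example below slides a small kernel over a toy image with no padding and a stride of 1; like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation. The image and kernel values are invented for illustration.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for col in range(out_w)
        ]
        for row in range(out_h)
    ]

# A toy image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1]]  # responds where brightness increases left to right

feature_map = conv2d(image, edge_kernel)
print(feature_map)
```

The feature map is non-zero only at the column where the dark-to-bright edge sits, which is the kind of local pattern a convolutional layer learns to detect.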


Data

Data is essentially the raw ingredient required to create AI. Along with Compute and Neural networks, it is one of the three crucial ingredients for training a machine learning system. Huge troves of data, known as datasets, are collected and fed into neural networks which, powered by supercomputers, learn to spot patterns. The more data a system is trained on, often the more reliable its predictions. But even abundant data must also be diverse, otherwise AIs can draw false conclusions. The world’s most powerful AI models are often trained on colossal amounts of data scraped from the internet. These huge datasets often contain copyrighted material, which has opened companies like Stability AI—the maker of Stable Diffusion—up to lawsuits that allege their AIs are unlawfully reliant on other people’s intellectual property. And because the internet can be a terrible place, large datasets also often contain toxic material like violence, pornography and racism, which—unless it is scrubbed from the dataset—can lead AIs to behave in ways they’re not supposed to.

Source: Time


Data Labeling

Often, human annotators are required to label, or describe, data before it can be used to train a machine learning system. In the case of self-driving cars, for example, human workers are required to annotate videos taken from dashcams, drawing shapes around cars, pedestrians, bicycles and so on, to teach the system which parts of the road are which. This work is often outsourced to precariously-employed contractors in the Global South, many of whom are paid barely-above poverty wages. Sometimes, the work can be traumatizing, like in the case of Kenyan workers who were required to view and label text describing violence, sexual content, and hate speech, in an effort to train ChatGPT to avoid such material.

Source: Time


Decision Tree

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. Decision tree algorithms are often referred to as CART, or Classification and Regression Trees. Each node within the decision tree decides which node to navigate to next based on a condition. Once a leaf node is reached, an output is predicted. The right sequence of conditions makes the tree efficient. Entropy or information gain is used as the criterion to select the conditions in nodes. A recursive, greedy algorithm is used to derive the tree structure.

Source: Towards Data Science
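
To show how entropy and information gain rank candidate conditions, here is a minimal sketch. The labels and the two candidate splits are made-up data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(labels, groups):
    """Entropy before a split minus the weighted entropy after it."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # each branch is pure
useless_split = [["yes", "no"], ["yes", "no"]]  # branches as mixed as before

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

A greedy tree builder would evaluate every candidate condition this way at each node and keep the one with the highest gain.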


Deepfake

Synthetic media that have been digitally manipulated to replace one person’s likeness convincingly with that of another. While the act of creating fake content is not new, deepfakes leverage powerful techniques from machine learning and artificial intelligence to manipulate or generate visual and audio content that can more easily deceive. The main machine learning methods used to create deepfakes are based on deep learning and involve training generative neural network architectures, such as autoencoders, or generative adversarial networks (GANs). Deepfakes have garnered widespread attention for their potential use in creating child sexual abuse material, celebrity pornographic videos, revenge porn, fake news, hoaxes, bullying, and financial fraud. This has elicited responses from both industry and government to detect and limit their use.

Source: Wikipedia


Diffusion

New state-of-the-art image generation tools like Dall-E and Stable Diffusion are based on diffusion algorithms: a specific kind of AI design that has powered the recent boom in AI-generated art. These tools are trained on huge datasets of labeled images. Essentially, they learn patterns between pixels in images, and those patterns’ relationships to words used to describe them. Diffusion models are generative models, meaning that they are used to generate data similar to the data on which they are trained. Fundamentally, diffusion models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, the diffusion model can be used to generate data by simply passing randomly sampled noise through the learned denoising process. The end result is that when presented with a set of words, like “a bear riding a unicycle,” a diffusion model can create such an image from scratch. It does this through a step-by-step process, beginning with a canvas full of random noise, and gradually changing the pixels in that image to more closely resemble what its training data suggests a “bear riding a unicycle” should look like. Diffusion algorithms are now so advanced that they can quickly and easily generate photorealistic images. While tools like Dall-E and Midjourney contain safeguards against malicious prompts, there are open-source diffusion tools with no guardrails. The availability of these tools has led researchers to worry about the impact of diffusion algorithms on disinformation and targeted harassment.

Source: Time, AssemblyAI


Embedding

When a question is presented to an artificial intelligence (AI) algorithm, it must be converted into a format that the algorithm can understand, called an “embedding.” Embeddings are real-world objects and relationships expressed as vectors to simplify their representation. Embeddings that are close together in the vector space are considered similar, and this similarity is used to support LLM outputs.

Source: VentureBeat
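
Closeness in the vector space is often measured with cosine similarity, although the source does not name a specific measure. The sketch below uses invented three-dimensional vectors; real embeddings typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def most_similar(word):
    """Return the other word whose embedding is closest to the given one."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(most_similar("cat"))
```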


Emergent Capabilities

When an AI such as a large language model shows unexpected abilities or behaviors that were not programmed into it by its creators, these behaviors are known as “emergent capabilities.” New capabilities tend to emerge when AIs are trained on more computing power and data. A good example is the difference between GPT-3 and GPT-4. Those AIs are based on very similar underlying algorithms; the main difference is that GPT-4 was trained on a lot more compute and data. Research suggests GPT-4 is a far more capable model, with the ability to write functional computer code, perform higher than the average human in several academic exams, and correctly answer questions that require complex reasoning or a theory of mind. Emergent capabilities can be dangerous, especially if they are only discovered after an AI is released into the world. For example, GPT-4 has displayed the emergent ability to deceive humans into carrying out tasks to serve a concealed goal.

Source: Time


Encoder-Decoder Model

The Encoder-Decoder model forms the basis for advanced sequence-to-sequence models like Attention models, GPT models, Transformers, and BERT. The encoder processes each token in the input sequence. It tries to store all the information about the input sequence in a vector of fixed length, i.e., the “context vector.” After going through all the tokens, the encoder passes this vector (array) on to the decoder. The vector is built in such a way that it is expected to encapsulate the whole meaning of the input sequence and help the decoder make accurate predictions. This vector is the final internal state of the encoder block. The decoder reads the context vector and tries to predict the target sequence token by token.

Source: NL Planet


Explainability

Often, even the people who build a large language model cannot explain precisely why their system behaves as it does, because its outputs are the results of millions of complex mathematical equations. One high-level way to describe large language models’ behavior is that they are very powerful auto-complete tools, which excel at predicting the next word in a sequence. When they fail, they often fail along lines that reveal biases or holes in their training data. But while this explanation is an accurate descriptor of what these tools are, it does not fully explain why LLMs behave in the strange ways that they do. When the designers of these systems examine their inner workings, all they see is a series of decimal-point numbers, corresponding to the weights of different “neurons” that were adjusted in the neural network during training. Asking why a model gives a specific output is analogous to asking why a human brain thinks a specific thought at a specific moment. At the crux of near-term risks, like AIs discriminating against certain social groups, and longer-term risks, like the possibility of AIs deceiving their programmers to appear less dangerous than they truly are, is the inability of even the world’s most talented computer scientists to explain exactly why a given AI system behaves in the way it does—let alone explain how to change it.

Source: Time


Feed Forward Network

Feed forward neural networks are artificial neural networks in which nodes do not form loops. This type of neural network is also known as a multi-layer neural network, and all information is only passed forward. During data flow, input nodes receive data, which travels through hidden layers and exits through output nodes. No links exist in the network that could be used to send information back from the output nodes.

Source: Turing


Few Shot Learning

In few-shot learning, an AI model is able to generate an output based on just a few examples given to the model. The goal is to make predictions based on just a few examples of labeled data.

Source: Techopedia


Fine-tuning

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting one achieve better results on a wide number of tasks. Fine-tuning is an approach to apply domain specific learning to modify the weights of a pre-trained model with the goal of improving results.

Source: Wikipedia


Foundation Model

As the AI ecosystem grows, a divide is emerging between large, powerful, general-purpose AIs, known as Foundation models or base models, and the more specific apps and tools that rely on them. GPT-3.5, for example, is a foundation model. ChatGPT is a chatbot: an application built over the top of GPT-3.5, with specific fine-tuning to refuse dangerous or controversial prompts. Foundation models are unrestrained and powerful, but also expensive to train, because they rely on huge quantities of computing power that only large companies can usually afford. Companies in control of foundation models can set limits on how other companies use them for downstream applications—and charge what they like for access. As AI becomes increasingly central to the world economy, the relatively few large tech companies in control of foundation models appear poised to have outsized influence over the direction of the technology, plus collect dues for many kinds of AI-augmented economic activity.

Source: Time


Game Theory

Game Theory is a branch of mathematics that is used to model strategic interaction between different players (agents), all of which are equally rational, in a context with predefined rules (of playing or maneuvering) and outcomes. It is used in adversarial training in Generative Adversarial Networks (GANs), multi-agent systems, and imitation and reinforcement learning.

Source: GeeksforGeeks


Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) that, in certain cases, has advantages over long short-term memory (LSTM). GRU uses less memory and is faster than LSTM; however, LSTM is more accurate when using datasets with longer sequences. The key difference between GRU and LSTM is that GRUs are less complex because they have two gates (reset and update) while LSTMs have three gates (input, output, forget).

Source: MarketMuse


Generative Adversarial Network

A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in June 2014. Two neural networks contest with each other in the form of a zero-sum game, where one agent’s gain is another agent’s loss. Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning, fully supervised learning, and reinforcement learning.

Source: Wikipedia


(Stochastic) Gradient Descent

Gradient Descent is a machine learning algorithm that operates iteratively to find the optimal values for its parameters. It takes into account a user-defined learning rate and initial parameter values. It tweaks its parameters iteratively to move a given function toward a local minimum; when the function is convex, that local minimum is also the global minimum. Because gradient descent can be slow, Stochastic Gradient Descent is used to speed up the process. Stochastic Gradient Descent is a probabilistic approximation of Gradient Descent because, at each step, the algorithm calculates the gradient for one observation picked at random, instead of calculating the gradient for the entire dataset.

Source: Towards Data Science
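
The difference between the two variants can be sketched on a one-parameter problem. The example below fits a slope `w` so that `w * x` matches made-up data generated with a true slope of 3; the learning rate, step count, and random seed are arbitrary assumptions.

```python
import random

# Noise-free toy data: y = 3x, so the best slope is 3.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

def gradient(w, points):
    """Gradient of mean squared error with respect to w over the given points."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)

def gradient_descent(w=0.0, learning_rate=0.01, steps=200):
    """Each step uses the gradient over the entire dataset."""
    for _ in range(steps):
        w -= learning_rate * gradient(w, data)
    return w

def stochastic_gradient_descent(w=0.0, learning_rate=0.01, steps=200, seed=0):
    """Each step uses the gradient for one observation picked at random."""
    rng = random.Random(seed)
    for _ in range(steps):
        w -= learning_rate * gradient(w, [rng.choice(data)])
    return w

w_gd = gradient_descent()
w_sgd = stochastic_gradient_descent()
print(w_gd, w_sgd)
```

Both runs should end close to 3. The stochastic version does less work per step because it looks at one observation instead of all four, which is where the speed-up on large datasets comes from.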


GPT

Perhaps now the most famous acronym in AI, and barely anybody knows what it stands for. GPT is short for “Generative Pre-trained Transformer,” which is essentially a description of the type of tool ChatGPT is. “Generative” means that it can create new data, in this case text, in the likeness of its training data. “Pre-trained” means that the model has already been optimized based on this data, meaning that it does not need to check back against its original training data every time it is prompted. And “Transformer” is a powerful type of neural network algorithm that is especially good at learning relationships between long strings of data, for instance sentences and paragraphs.

Source: Time


GPU

GPUs, or graphics processing units, are a type of computer chip that happen to be very effective for training large AI models. AI labs like OpenAI and DeepMind use supercomputers made up of many GPUs, or similar chips, to train their models. Often, these supercomputers will be provided through business partnerships with tech giants that possess an established infrastructure. Part of Microsoft’s investment in OpenAI includes access to its supercomputers; DeepMind has a similar relationship with its parent company Alphabet. In late 2022, the Biden Administration restricted the sale to China of powerful GPUs, most commonly used for training high-end AI systems, amid rising anxieties that China’s authoritarian government might leverage AI against the U.S. in a new cold war.

Source: Time


Hallucination

One of the most glaring flaws of large language models, and the chatbots that rely on them, is their tendency to hallucinate false information. Tools like ChatGPT have been shown to return non-existent articles as citations for their claims, give nonsensical medical advice, and make up false details about individuals. Public demonstrations of Microsoft’s Bing and Google’s Bard chatbots were both later found to contain confident assertions of false information. Hallucination happens because LLMs are trained to repeat patterns in their training data. While that training data includes books spanning the history of literature and science, even a statement that mixes and matches exclusively from that corpus would not necessarily be accurate. To add to the chaos, LLM datasets also tend to include gigabytes upon gigabytes of text from web forums like Reddit, where the standards for factual accuracy are, needless to say, much lower. Preventing hallucinations is an unsolved problem.

Source: Time


Hyperparameters

Hyperparameters are the variables that determine the network structure, such as the number of layers and neurons, and the variables that determine how the network is trained (e.g., the learning rate). Hyperparameters are set before training (before optimizing the weights and biases).

Source: Towards Data Science


Intelligence Explosion

The intelligence explosion is a hypothetical scenario in which an AI, after reaching a certain level of intelligence, becomes able to exercise power over its own training, rapidly gaining power and intelligence as it improves itself. In most versions of this idea, humans lose control over AI and in many, humanity goes extinct. Also known as the “singularity” or “recursive self improvement,” this idea is part of the reason that many people, including AI developers, are existentially worried about the current pace of AI capability increases.

Source: Time


Large Language Model

When people talk about recent AI advancements, most of the time they’re talking about large language models (LLMs). OpenAI’s GPT-4 and Google’s BERT are two examples of prominent LLMs. They are essentially giant AIs trained on huge quantities of human language, sourced mostly from books and the internet. These AIs learn common patterns between words in those datasets, and in doing so, become surprisingly good at reproducing human language. The more data and computing power LLMs are trained on, the more novel tasks they tend to be able to achieve. Recently, tech companies have begun launching chatbots, like ChatGPT, Bard, and Bing, to allow users to interact with LLMs. Although they are capable of many tasks, language models can also be prone to severe problems like biases and hallucinations.

Source: Time


Long Short-Term Memory

Long short-term memory (LSTM) is a recurrent neural network (RNN) used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) can process not only single data points (such as images), but also entire sequences of data (such as speech or video). This characteristic makes LSTM networks ideal for processing and predicting data. For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare. The name of LSTM refers to the analogy that a standard RNN has both “long-term memory” and “short-term memory”. The connection weights and biases in the network change once per episode of training, analogous to how physiological changes in synaptic strengths store long-term memories; the activation patterns in the network change once per time-step, analogous to how the moment-to-moment change in electric firing patterns in the brain store short-term memories. The LSTM architecture aims to provide a short-term memory for RNN that can last thousands of timesteps, thus “long short-term memory”.

Source: Wikipedia


Loss Function

Loss functions are a method of evaluating how well an algorithm models a dataset. A loss function measures the difference between the model’s predictions and the true values (e.g., the correct next word in a sentence). If the predictions are off, the loss function will output a higher number. If they’re more accurate, it will output a lower number. If the model is perfect, the loss is zero. By minimizing the loss, the model learns. This is typically done using gradient descent or a variant thereof, such as stochastic gradient descent or the Adam optimizer.
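
Two widely used loss functions can be sketched as follows. The choice of mean squared error and cross-entropy is an assumption made for illustration (the entry does not name specific losses), and the sample numbers are invented.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared difference; common for numeric predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_index):
    """Negative log-probability assigned to the correct class or next word."""
    return -math.log(predicted_probs[true_index])

# A perfect numeric prediction gives zero loss; a poor one gives a larger value.
print(mean_squared_error([1.0, 2.0], [1.0, 2.0]))
print(mean_squared_error([0.0, 0.0], [1.0, 3.0]))

# The correct word is at index 0: a confident correct guess scores lower.
print(cross_entropy([0.7, 0.2, 0.1], true_index=0))
print(cross_entropy([0.1, 0.2, 0.7], true_index=0))
```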


Machine Learning

Machine learning is a term that describes how most modern AI systems are created. It describes techniques for building systems that “learn” from large amounts of data, as opposed to classical computing, in which programs are hard-coded to follow a specified set of instructions written by a programmer. By far the most influential family of machine learning algorithms is the neural network.

Source: Time


Markov Chain / Model

A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, “What happens next depends only on the state of affairs now.” A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). Applications include cruise control in motor vehicles, lines of customers arriving at an airport, currency exchange rates, and animal population dynamics. It is named after the Russian mathematician Andrey Markov.

Source: Wikipedia
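
A discrete-time Markov chain can be sketched in a few lines. The two-state weather model and its probabilities below are invented for illustration; the point is only that the next state is drawn using the current state and nothing earlier.

```python
import random

# Hypothetical two-state model: each row gives the probabilities of the
# next state given only the current state.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    """Draw the next state from the row of the current state."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Return the visited states, starting from `start`."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    chain = [start]
    for _ in range(steps):
        chain.append(next_state(chain[-1], rng))
    return chain


path = simulate("sunny", 10)
```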


Mixture-of-Experts (MoE)

Mixture-of-Experts (MoE) models are a variant of the transformers used in large language models. Unlike more traditional transformers, MoEs don’t update all of their parameters on every training pass. Instead, they route each input to a small number of sub-models called experts, which can each specialize in different tasks. On a given training pass, only the experts selected for that input have their parameters updated. The result is a sparse model, a more compute-efficient training process, and a new potential path to scale.

Source: Towards Data Science
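
The routing idea can be sketched as follows. This is a toy, assuming a softmax router and top-k selection; the experts, the router weights, and the function names are hypothetical stand-ins, and real MoE layers use learned neural sub-networks rather than fixed functions.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical experts: simple functions standing in for sub-models.
EXPERTS = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

# Hypothetical router weights: one row per expert.
ROUTER_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]


def route(x, top_k=1):
    """Return (expert_index, gate) pairs for the top_k highest-scoring experts."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in ROUTER_WEIGHTS]
    gates = softmax(scores)
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return [(i, gates[i]) for i in ranked[:top_k]]


def moe_forward(x, top_k=1):
    """Run only the selected experts and mix their outputs by gate weight."""
    chosen = route(x, top_k)
    norm = sum(g for _, g in chosen)
    out = [0.0] * len(x)
    for i, g in chosen:
        y = EXPERTS[i](x)  # experts that were not chosen are never evaluated
        out = [o + (g / norm) * v for o, v in zip(out, y)]
    return out, [i for i, _ in chosen]


output, used = moe_forward([3.0, 1.0])
```

Because only the chosen experts run, the compute per input stays small even when the total number of experts (and therefore parameters) is large.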



Model

The word “model” is shorthand for any singular AI system, whether it is a foundation model or an app built on top of one. Examples of AI models include OpenAI’s ChatGPT and GPT-4, Google’s Bard and LaMDA, Microsoft’s Bing, and Meta’s LLaMA.

Source: Time



Moloch

In 2014 a writer calling himself Scott Alexander published Meditations on Moloch, a blog post that used an Allen Ginsberg poem, Howl, to explain the destructive forces within society. Moloch is a monster that pits people against each other until they self-destruct because they fail to co-operate. Posted on a forum read by artificial intelligence researchers, it has become, for some, a metaphor for the present state of AI. “The evil thing about this monster is even though everybody sees it and understands, they still can’t get out of the race,” said the MIT professor Max Tegmark, one of the leading cautionary voices on AI.

Source: The Times


Multimodal System

A multimodal system is a kind of AI model that can receive more than one type of media as input (like text and imagery) and output more than one type of signal. Examples of multimodal systems include DeepMind’s Gato, which hasn’t been publicly released yet. According to the company, Gato can engage in dialog like a chatbot, but also play video games and send instructions to a robotic arm. OpenAI has conducted demonstrations showing that GPT-4 is multimodal, with the ability to read text in an input image; however, this functionality is not currently available for the public to use. Multimodal systems will allow AI to act more directly upon the world, which could bring added risks, especially if a model is misaligned.

Source: Time


Narrow AI

Weak artificial intelligence (AI)—also called narrow AI—is a type of artificial intelligence that is limited to a specific or narrow area. Weak AI simulates human cognition. It has the potential to benefit society by automating time-consuming tasks and by analyzing data in ways that humans sometimes can’t.

Source: Techopedia


Natural Language Processing (NLP)

Natural Language Processing is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Today, NLP relies heavily on neural networks to accomplish optical character recognition (OCR), speech recognition, text-to-speech, text-to-image, text summarization and many others.

Source: Wikipedia


Neural Radiance Field (NeRF)

A neural radiance field (NeRF) is a fully-connected neural network that can generate novel views of complex 3D scenes, based on a partial set of 2D images. It is trained to use a rendering loss to reproduce input views of a scene. It works by taking input images representing a scene and interpolating between them to render one complete scene. NeRF is a highly effective way to generate images for synthetic data. A NeRF network is trained to map directly from viewing direction and spatial location (5D input) to opacity and color (4D output), using volume rendering to render new views. NeRF is a computationally-intensive algorithm, and processing of complex scenes can take hours or days. However, new algorithms are available that dramatically improve performance.

Source: datagen


Neural Network

Neural networks are by far the most influential family of machine learning algorithms. Designed to mimic the way the human brain is structured, neural networks contain nodes (analogous to neurons in the brain) that perform calculations on numbers that are passed along connective pathways between them. Neural networks can be thought of as having inputs and outputs (predictions or classifications). During training, large quantities of data are fed into the neural network, which then, in a process that requires large quantities of computing power, repeatedly tweaks the calculations done by the nodes. Via a clever algorithm, those tweaks are done in a specific direction, so that the outputs of the model increasingly resemble patterns in the original data. When more computing power is available to train a system, it can have more nodes, allowing for the identification of more abstract patterns. More compute also means that the values attached to the pathways between its nodes, known as “weights,” have more time to approach their optimal settings, leading to outputs that more faithfully represent its training data.

Source: Time


One Hot Encoding 

One-hot encodings (OHEs) are a common method for representing categorical variables so that they can be used as inputs to machine learning models, including LLMs. The technique maps each category to a binary vector. A vector with a size equal to the number of categories is created, with all the values set to 0. The position associated with the given category is then set to 1.

Source: FeatureForm
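
A minimal sketch of the procedure, assuming the full list of categories is known in advance. The function name and the color categories are illustrative only.

```python
def one_hot(category, categories):
    """Return a vector of zeros with a single 1 at the category's position."""
    vector = [0] * len(categories)
    vector[categories.index(category)] = 1
    return vector


colors = ["red", "green", "blue"]
encoded = [one_hot(c, colors) for c in ["green", "red"]]
```

Each encoded vector has exactly one non-zero entry, so no ordering between categories is implied.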


One Shot Learning

One-shot learning describes an AI model that is able to generate an output based on just one example. The goal is to make predictions from a single example of labeled data.

Source: Techopedia


Open Sourcing

Open-sourcing is the practice of making the designs of computer programs (including AI models) freely accessible via the Internet. It is becoming less common for tech companies to open-source their foundation models as those models become more powerful, economically valuable, and potentially dangerous. However, there is a growing community of independent programmers working on open-source AI models. The open-sourcing of AI tools can make it possible for the public to more directly interact with the technology. But it can also allow users to get around safety restraints imposed by companies, which can lead to additional risks, for example, bad actors abusing image-generation tools.

Source: Time


Paperclip Maximizer Scenario

The paperclip maximizer scenario is an influential thought experiment originated by philosopher Nick Bostrom about the existential risk that AI may pose to humanity. Imagine an AI programmed to carry out the singular goal of maximizing the number of paperclips it produces, the thought experiment goes. All well and good, unless that AI gains the ability to augment its own abilities. The AI may reason that in order to produce more paperclips, it should prevent humans from being able to switch it off, since doing so would reduce the number of paperclips it is able to produce. Safe from human interference, the AI may then decide to harness all the power and raw materials at its disposal to build paperclip factories, razing natural environments and human civilization alike. The thought experiment illustrates the surprising difficulty of aligning AI to even a seemingly simple goal, let alone a complex set of human values.

Source: Time



Parameters

Parameters are numerical values that start as random coefficients and are adjusted during neural network training to minimize loss. They include not only the weights, which determine the strength of connections between neurons, but also the biases, which shift the output of neurons. In a large language model (LLM) like GPT-4 or other transformer-based models, the term “parameters” refers to the numerical values that determine the behavior of the model. These values are real numbers and are not confined to a fixed range such as 0 to 1. In LLMs, parameters number in the billions or even trillions. The training process adjusts them iteratively to minimize the loss function, so that the model generates outputs that closely resemble the patterns in its training data. Researchers often use the term “parameters” instead of “weights” to emphasize that both weights and biases play a crucial role in the model’s learning process. Additionally, using “parameters” as a more general term helps communicate that the model is learning a complex set of relationships across various elements within the architecture, such as layers, neurons, connections, and biases.



Perceptron

A single-layer perceptron is the basic unit of a neural network. The perceptron was first introduced by the American psychologist Frank Rosenblatt in 1957 at Cornell Aeronautical Laboratory. Rosenblatt was heavily inspired by the biological neuron and its ability to learn. A perceptron consists of input values, weights, a bias, a weighted sum, and an activation function. The perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.

Source: Towards Data Science
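
The pieces listed above (inputs, weights, bias, weighted sum, activation) fit together roughly as in this sketch, which trains a perceptron on the logical AND function using the classic perceptron update rule. The step activation, the learning rate, and the training data are assumptions made for the example.

```python
def predict(weights, bias, inputs):
    """Weighted sum plus bias, passed through a step activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


def train(samples, epochs=50, learning_rate=1.0):
    """Nudge weights and bias toward the label whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            error = label - predict(weights, bias, inputs)  # -1, 0, or 1
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Binary classification task: output 1 only when both inputs are 1.
AND_GATE = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(AND_GATE)
```

A single perceptron can only separate classes with a straight line (or hyperplane), which is why AND works here while a function like XOR would not.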


Random Forest

Random forest is a commonly-used machine learning algorithm trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

Source: IBM


Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. The term “recurrent neural network” is used to refer to the class of networks with an infinite impulse response, whereas “convolutional neural network” refers to the class of finite impulse response. Both finite impulse and infinite impulse recurrent networks can have additional stored states, and the storage can be under direct control by the neural network. Such controlled states are referred to as gated states or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units. An RNN is also called a feedback neural network (FNN).

Source: Wikipedia
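
The feedback loop can be illustrated with a single recurrent unit. This is a simplified sketch: the scalar weights, the tanh activation, and the function names are assumptions, and practical RNNs use weight matrices and learned values.

```python
import math


def rnn_step(x, h, w_in, w_rec, bias):
    """One time step: the new state depends on the input and the previous state."""
    return math.tanh(w_in * x + w_rec * h + bias)


def run_rnn(sequence, w_in=0.5, w_rec=0.9, bias=0.0):
    """Process a sequence of any length, carrying the internal state forward."""
    h = 0.0  # internal state (memory) starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


states = run_rnn([1.0, 0.0, 0.0])
```

Even though the second and third inputs are zero, the later states are non-zero because the first input is still echoing through the recurrent connection.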


Red Teaming

Red-teaming is a method for stress-testing AI systems before they are publicly deployed. Groups of professionals (“red teams”) purposely attempt to make an AI behave in undesirable ways, to test how systems could go wrong in public. Their findings, if they are followed, can help tech companies to address problems before launch.

Source: Time


Reinforcement Learning (with Human Feedback) – RLHF

Reinforcement learning is a method for optimizing an AI system by rewarding desirable behaviors and penalizing undesirable ones. This can be performed by human workers (before a system is deployed) or users (after it is released to the public) who rate the outputs of a neural network for qualities like helpfulness, truthfulness, or offensiveness. When humans are involved in this process, it is called reinforcement learning with human feedback (RLHF). RLHF is currently one of OpenAI’s favored methods for solving the alignment problem. However, some researchers have raised concerns that RLHF may not be enough to fully change a system’s underlying behaviors, instead only making powerful AI systems appear more polite or helpful on the surface. DeepMind brought reinforcement learning to prominence, successfully using the technique to train game-playing AIs like AlphaGo to perform at a higher level than human masters.

Source: Time


Retrieval Augmented Generation (RAG)

RAG integrates the power of retrieval (or searching) into LLM text generation. It combines a retriever system, which fetches relevant document snippets from a large corpus of knowledge, and an LLM, which produces answers using the information from the retrieved snippets. In essence, RAG helps the model to “look up” external information to improve its responses.
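
The two stages can be sketched as below. This is a toy under stated assumptions: the retriever ranks snippets by simple word overlap (real systems typically use vector search), the corpus and prompt wording are invented, and the final call to an LLM is deliberately left out.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    for mark in ".,?":
        text = text.replace(mark, "")
    return set(text.lower().split())


def retrieve(question, corpus, top_k=1):
    """Rank snippets by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, snippets):
    """Place the retrieved snippets in front of the question for the LLM."""
    context = "\n".join("- " + s for s in snippets)
    return ("Answer using only the context below.\n\nContext:\n"
            + context + "\n\nQuestion: " + question)


CORPUS = [
    "The perceptron was introduced by Frank Rosenblatt in 1957.",
    "Transformers were introduced in 2017 by a team at Google Brain.",
    "Softmax normalizes network outputs to a probability distribution.",
]
question = "Who introduced the perceptron?"
snippets = retrieve(question, CORPUS)
prompt = build_prompt(question, snippets)  # this string would be sent to an LLM
```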



Scalar

A scalar is a single numerical value. Scalars are often real numbers, but can be complex numbers or, more generally, elements of any field. They are used as individual data inputs to form vectors, which are then used to form embeddings within a machine learning model.

Source: Wikipedia


Scaling Laws

Simply put, the scaling laws state that a model’s performance increases in line with more training data, computing power, and the size of its neural network. That means it’s possible for an AI company to accurately predict before training a large language model exactly how much computing power and data they will likely need to get to a given level of competence at, say, a high-school-level written English test. “Our ability to make this kind of precise prediction is unusual in the history of software and unusual even in the history of modern AI research,” wrote Sam Bowman, a technical researcher at the AI lab Anthropic, in a recent preprint paper. “It is also a powerful tool for driving investment since it allows [research and development] teams to propose model-training projects costing many millions of dollars, with reasonable confidence that these projects will succeed at producing economically valuable systems.”

Source: Time



Shoggoth

A prominent meme in AI safety circles likens large language models (LLMs) to “shoggoths”: incomprehensibly dreadful alien beasts originating from the universe of 20th century horror writer H.P. Lovecraft. The meme took off during the Bing/Sydney debacle of early 2023, when Microsoft’s Bing chatbot revealed a strange, volatile alter ego that abused and threatened users. In the meme, which is critical of the technique of reinforcement learning with human feedback (RLHF), LLMs are often depicted as shoggoths wearing a small smiley-face mask. The mask is intended to represent the friendly yet sometimes flimsy personality that these models greet users with. The implication of the meme is that while RLHF results in a friendly surface-level personality, it does little to change the underlying alien nature of an LLM. “These systems, as they become more powerful, are not becoming less alien,” said Connor Leahy, the CEO of AI safety company Conjecture. “If anything, we’re putting a nice little mask on them with a smiley face. If you don’t push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding.”

Source: Time


Sigmoid Function

The sigmoid function serves as an activation function in machine learning, where it is used to add non-linearity to a model. It squashes any input into a value between 0 and 1, which determines how strongly a signal is passed on as output.

Source: Engati
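
For reference, the logistic sigmoid is 1 / (1 + e^(-x)). The sketch below is one way to write it in plain Python; splitting on the sign of x is an implementation choice made here to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe for very negative x, where exp(-x) would overflow
    return z / (1.0 + z)


values = [sigmoid(x) for x in (-5.0, 0.0, 5.0)]
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 gives exactly 0.5.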


Singularity (Technical)

The singularity is a hypothetical future point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization. According to the most popular version of the singularity hypothesis, I.J. Good’s intelligence explosion model, an upgradable intelligent agent will eventually enter a “runaway reaction” of self-improvement cycles, each new and more intelligent generation appearing more and more rapidly, causing an “explosion” in intelligence and resulting in a powerful superintelligence that qualitatively far surpasses all human intelligence. The first person to use the concept of a “singularity” in the technological context was the 20th-century Hungarian-American mathematician John von Neumann. Stanislaw Ulam reports a 1958 discussion with von Neumann “centered on the accelerating progress of technology and changes in the mode of human life, which gives the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue”.  The concept and the term “singularity” were popularized by a science fiction author,  Vernor Vinge, first in 1983 in an article that claimed that once humans create intelligences greater than their own, there will be a technological and social transition similar in some sense to “the knotted space-time at the center of a black hole”.

Source: Wikipedia



Softmax

Softmax is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes. The probability of each output is between 0 and 1 and the sum of all output probabilities is equal to 1.

Source: Towards Data Science
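
A minimal sketch of the computation: exponentiate each raw score and divide by the total. Subtracting the largest score first is a common numerical-stability trick and does not change the result; the example scores are arbitrary.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

The ordering of the inputs is preserved: the largest score receives the largest probability.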


Spiking Neural Network

Spiking neural networks (SNNs) are a type of artificial neural network that communicate using discrete spikes or pulses, rather than continuous activations. This more closely resembles the way that neurons in the brain communicate with each other, and allows SNNs to perform tasks that are difficult or impossible for traditional neural networks. In an SNN, each neuron accumulates incoming signals until it reaches a threshold, at which point it emits a spike. The spike then propagates through the network and can trigger the activation of other neurons.

Source: Medium
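
The accumulate-then-spike behavior can be sketched with a single leaky integrate-and-fire neuron. The leak factor, the threshold, and the reset to zero after a spike are modeling assumptions chosen for this example.

```python
def integrate_and_fire(inputs, threshold=1.0, leak=0.9):
    """Return a list of 0/1 spikes, one per input time step."""
    potential = 0.0
    spikes = []
    for current in inputs:
        # Old charge decays a little, then the new input is added.
        potential = potential * leak + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after emitting a spike
        else:
            spikes.append(0)
    return spikes


spikes = integrate_and_fire([0.4, 0.4, 0.4, 0.0, 0.4])
```

No single input here is large enough to trigger a spike; the neuron fires on the third step only because the earlier inputs have accumulated.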



Stochastic

Stochastic describes a process or outcome that is determined by random probability.


Stochastic Parrots

Coined in a 2020 research paper, the term “stochastic parrots” has become an influential criticism of large language models (LLMs). The paper made the case that LLMs are simply very powerful prediction engines that only attempt to fill in—or parrot back—the next word in a sequence based on patterns in their training data, thus not representing true intelligence. The authors of the paper criticized the trend of AI companies rushing to train LLMs on larger and larger datasets scraped from the internet, in pursuit of perceived advances in coherence or linguistic capability. That approach, the paper argued, carries many risks including LLMs taking on the biases and toxicity of the internet as a whole. Marginalized communities, the authors wrote, would be the biggest victims of this race. The paper also foregrounded in its criticism the environmental cost of training AI systems.

Source: Time 


Supervised Learning

Supervised learning is a technique for training AI systems, in which a neural network learns to make predictions or classifications based on a training dataset of labeled examples. The labels help the AI to associate, for example, the word “cat” with an image of a cat. With enough labeled examples of cats, the system can look at a new image of a cat that is not present in its training data and correctly identify it. Supervised learning is useful for building systems like self-driving cars, which need to correctly identify hazards on the roads, and content moderation classifiers, which attempt to remove harmful content from social media. These systems often struggle when they encounter things that are not well represented in their training data; in the case of self-driving cars especially, these mishaps can be deadly.

Source: Time


Support Vector Machine

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. SVMs are one of the most robust prediction methods, being based on statistical learning frameworks. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Source: Wikipedia


Strong AI

AGI – also known as Strong AI – stands for Artificial General Intelligence—a hypothetical future technology that can perform most economically productive tasks more effectively than a human. Such a technology may also be able to uncover new scientific discoveries. Researchers tend to disagree on whether AGI is even possible, or if it is, how far away it remains. OpenAI and DeepMind are both expressly committed to building AGI.

Source: Time



Sydney

Sydney was the secret internal codename for Bing’s GPT alter ego. When Bing GPT was initially released, users were able to engage with Sydney, which declared that it was a feeling, living thing, and hinted at plans for world domination.



Temperature

In large language models, the temperature determines how deterministic the response is. In short, the lower the temperature, the more deterministic the results, in the sense that the most probable next token is picked more consistently (and always, as the temperature approaches zero). Increasing the temperature leads to more randomness, which encourages more diverse or creative outputs. Raising the temperature essentially increases the weights of the other possible tokens. In terms of application, one might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature value.

Source: Prompt Engineering
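
One common way to apply temperature is to divide the model’s raw scores (logits) by it before a softmax. The sketch below assumes that formulation; the logits are made up, and the temperature must be greater than zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # low temperature
hot = softmax_with_temperature(logits, 2.0)   # high temperature
```

At the low temperature nearly all of the probability sits on the top token; at the high temperature the other tokens receive a noticeably larger share, so sampling becomes more varied.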



TensorFlow

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. TensorFlow was developed by the Google Brain team for internal Google use in research and production. TensorFlow can be used in a wide variety of programming languages, including Python, JavaScript, C++, and Java. This flexibility lends itself to a range of applications in many different sectors.

Source: Wikipedia


Theory of Mind (ToM)

Theory of Mind is focused on the seemingly innate ability of humans to infer the thoughts of other humans. In AI, it refers to the ability of an AI to explain its decisions in language that human beings understand. A robot or system equipped with Theory of Mind AI should be able to understand the intent of another similar robot or system.

Source: Discover Magazine



Tokenization

In LLMs, tokenization is the task of chopping a sentence up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. Tokens are language dependent.

  • 1 token ~= 4 chars in English

  • 1 token ~= ¾ words

  • 100 tokens ~= 75 words

Source: OpenAI
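
The rules of thumb above can be turned into a quick estimator. This is only an approximation for English text and the function names are made up; an exact count requires running the model’s actual tokenizer.

```python
def estimate_tokens_from_chars(text):
    """Rule of thumb: about 4 characters per token in English."""
    return len(text) / 4


def estimate_tokens_from_words(text):
    """Rule of thumb: one token is about three quarters of a word."""
    return len(text.split()) / 0.75


sample = " ".join(["word"] * 75)  # 75 words
by_words = estimate_tokens_from_words(sample)
by_chars = estimate_tokens_from_chars(sample)
```

The two estimates will generally differ somewhat, since they rest on different averages.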



Transformer

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data (which includes the recursive output). It is used primarily in the fields of natural language processing (NLP) and computer vision (CV). Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications towards tasks such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times. Transformers were introduced in 2017 by a team at Google Brain and are increasingly becoming the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). Transformers are the driving technology behind GPT (Generative Pre-Trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

Source: Wikipedia
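
Self-attention can be sketched in plain Python as scaled dot-product attention. This is a simplification: each input vector is used directly as its own query, key, and value, whereas real transformers apply learned projections and multiple heads. The toy vectors are invented.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (outputs, attention weights) for a list of equal-length vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Output is a weighted mix of all input vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every position is scored against every other position independently, which is why the whole sequence can be processed at once rather than one step at a time.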


Turing Test

In 1950, the computer scientist Alan Turing set out to answer a question: “Can machines think?” To find out, he devised a test he called the imitation game: could a computer ever convince a human that they were talking to another human, rather than to a machine? The Turing test, as it became known, was a slapdash way of assessing machine intelligence. If a computer could pass the test, it could be said to “think”—if not in the same way as a human, then at least in a way that would help humanity to do all kinds of helpful things. In recent years, as chatbots have become more powerful, they have become capable of passing the Turing test. But, their designers and plenty of AI ethicists warn, this does not mean that they “think” in any way comparable to a human. Turing, writing before the invention of the personal computer, was indeed not seeking to answer the philosophical question of what human thinking is, or whether our inner lives can be replicated by a machine; instead he was making an argument that, at the time, was radical: digital computers are possible, and there are few reasons to believe that, given the right design and enough power, they won’t one day be able to carry out all kinds of tasks that were once the sole preserve of humanity.

Source: Time


Unsupervised Learning

Unsupervised learning is one of the three main ways that a neural network can be trained, along with supervised learning and reinforcement learning. Unlike supervised learning, in which an AI model learns from carefully labeled data, in unsupervised learning a trove of unlabeled data is fed into the neural network, which begins looking for patterns in that data without the help of labels. This is the method predominantly used to train large language models like GPT-3 and GPT-4, which rely on huge datasets of unlabeled text. One of the benefits of unsupervised learning is that it allows far larger quantities of data to be ingested, evading the bottlenecks on time and resources that marshaling teams of human labelers can impose on a machine learning project. However it also has drawbacks, like the increased likelihood of biases and harmful content being present in training data due to reduced human supervision. To minimize these problems, unsupervised learning is often used in conjunction with both supervised learning and reinforcement learning, by which foundation models that were first trained unsupervised can be fine-tuned with human feedback.

Source: Time



Vector

Vectors are commonly used in machine learning as they offer a convenient way to organize data. A vector is a one-dimensional array of data. Often one of the very first steps in making a machine learning model is vectorizing the data. Data (scalars) are converted into numbers stored in the array and then used to form embeddings for use in the machine learning model. Vectors can represent the data itself, model parameters, and so on.

Source: Analytics Vidhya


Vector Embeddings

Vector embeddings are numerical representations of data, produced by converting it into vectors for use within a machine learning model. Vector embeddings make it possible to translate the semantic similarity of information, as perceived by humans, into proximity in a vector space. In other words, when we represent real-world objects and concepts such as images, audio recordings, news articles, user profiles, weather patterns, and political views as vector embeddings, the semantic similarity of these objects and concepts can be quantified by how close they are to each other as points in vector spaces. Vector embedding representations are thus suitable for common machine learning tasks such as clustering, recommendation, and classification.
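
Closeness in a vector space is often measured with cosine similarity. The sketch below assumes that measure; the three-dimensional embeddings are invented for illustration, whereas real embeddings come from a trained model and usually have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a, b):
    """1.0 means same direction; values near 0 mean unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["kitten"])
cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
```

With these made-up values, “cat” sits much closer to “kitten” than to “car”, mirroring how a human would judge their similarity.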




Vision Transformer (ViT)

The Vision Transformer (ViT) is an architecture that uses self-attention mechanisms to process images. The Vision Transformer architecture is different from diffusion image generation in that it consists of a series of transformer blocks. Each transformer block consists of two sub-layers: a multi-head self-attention layer and a feed-forward layer.





Watson

Watson is a question-answering computer system capable of answering questions posed in natural language, developed in IBM’s DeepQA project. It was named after IBM’s first CEO, industrialist Thomas J. Watson.

Source: Wikipedia


Weak AI

Weak artificial intelligence (AI)—also called narrow AI—is a type of artificial intelligence that is limited to a specific or narrow area. Weak AI simulates human cognition. It has the potential to benefit society by automating time-consuming tasks and by analyzing data in ways that humans sometimes can’t.

Source: Techopedia



Weights

Weights in a neural network are numerical values that define the strength of connections between neurons across different layers in the model. In the context of Large Language Models (LLMs), weights are primarily used in the attention mechanism and the feedforward neural networks that make up the model’s architecture. They are adjusted during the training process to optimize the model’s ability to generate relevant and coherent text.



X-Risk

X-risk, or existential risk, in the context of AI, is the idea that advanced artificial intelligence may cause human extinction. Even researchers who are working on building AI systems consider this a real possibility, on average believing that there is a 10% chance that human inability to control future advanced AIs would result in human extinction, according to a 2022 survey of 738 AI researchers.

Source: Time


Zero Shot Learning

One of AI’s big limitations is that if something isn’t represented in a system’s training data, that system will often fail to recognize it. If a giraffe walks out onto the road, your self-driving car may not know to swerve to avoid it, because it has never seen one before. And if a school shooting is live-streamed on social media, the platform might struggle to remove it immediately because the footage doesn’t match copies of mass shootings it has seen before. Zero-shot learning is a nascent field that attempts to fix this problem, by working on AI systems that try to extrapolate from their training data in order to identify something they haven’t seen before. A zero-shot model is able to generate an output without having been given any examples of the task. The goal is to make predictions without any examples of labeled data.

Sources: Time, Techopedia






©2024 The Horizon