A Guide to How Generative AI Works (A Look Inside the Mind of GPT-3)

This article aims explain the GPT-3 (“Generative Pretrained Transformer 3.”) technology to everyone, without overhyping the topic.

March 27, 2023 —

Gabriel Lim

Chapter 1

Introduction

Chapter 2

What Does 'GPT-3' Stand For?

Chapter 3

Breaking Down How it Works

Chapter 4

Limitations of GPT-3

Chapter 5

Conclusion

Chapter 6

Sources

Chapter 1

Introduction

Editor’s Note: OpenAI released GPT-4 on March 14, 2023, but has kept details under wraps about its language model. While real world performance capabilities continue to grow and GPT-4’s abilities surpass GPT-3’s, the fundamental nature of the technology remains the same.

“I noticed you’re in charge of marketing operations. That’s great! I think you’ll love this email because your job sucks.”
— GPT-3

The sentence you see above wasn’t written by me. It was written entirely by OpenAI’s GPT-3 artificial intelligence language model, in one of my early (and pretty bad) attempts to craft the introduction to a sales email.

Probably not the most shining example of AI, I admit.

There’s lots of hype around GPT-3 today. I’m not here to add to the hype.

From where I stand, writing about GPT-3 online falls into two categories:

Amazement and Doom! “Artificial intelligence is here, and it’s coming for your jobs!”
Technotalk. “Here dmodel is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, dff = 4 ∗ dmodel).”

We need a better way to tell this story. This article aims explain the GPT-3 technology to everyone, without overhyping the topic. I’ll try to make state-of-the-art AI research accessible, without diluting insight, by explaining concepts as simply as possible.

The goal is to help you:

Build intuition as a marketer for the current state of technology today
Understand how the technology works on a conceptual level, so you can evaluate vendors better
Avoid being fooled by 90% of vendors that claim to use AI

Chapter 2

What Does 'GPT-3' Stand For?

GPT-3 stands for “Generative Pretrained Transformer 3.” That’s a mouthful, so let’s break it down:

Generative

“Generative” means that the language model is designed to “generate” words in an unsupervised manner. It was trained to predict the next word from a sequence of words. In other words, it uses existing text to create more text.

Pretrained

“Pretrained” means that the language model is trained with data “as is” out of the box. In the past, most language models needed to be trained on datasets which were too large and computationally expensive. This made it infeasible for most people to use.

Transformer

“Transformer” means that it uses the “transformer architecture” which we will cover below.

3 (Three)

“3” means that this is the third time OpenAI, the company which created GPT-3, is doing this.

Chapter 3

Breaking Down How it Works

GPT-3 is what we call a trained language model.

A trained language model generates text, based on input, which influences its output.

That’s all GPT-3 does.

Granted, it does it extremely well — but it’s not the AI singularity that tech blogs have been heralding lately.

Let’s take a closer look at what’s happening within a GPT-3 process. We’ll start by explaining what the G, P, T, and 3 stand for in GPT-3.

We’re going to do it in a slightly different order, though:

Starting with Pretrained
Then tackling Generative
Then addressing Transformer

At the end of this quick explanation, you’ll have a better understanding of how GPT-3 works.

Pretrained

GPT-3 is trained with 45 terabytes of text data found online, hailing from Wikipedia, books, the common web, etc. “Training” is the process of:

Exposing the model to lots of text
Getting the model to make predictions based on that text, and
Correcting the predictions so that it learns to make better predictions over time

The above example is run at scale, across millions of examples, millions of times. And the GPT-3 you see today is just one model, trained on the above process — which was estimated to cost OpenAI $12M and 355 years of computing time just to train.

Before GPT-3, if you wanted to build a specific system for a specific natural language processing task, you needed custom architecture, lots of specific training data, lots of hand-tuning, and lots of PhDs.

But GPT-3 provided a general purpose model that’s pretty good at many things out of the box (such as sci-fi writing, marketing copy, arithmetic, SQL generation, and more).

The interesting thing about GPT-3 is that its creators didn’t have an end use-case in mind.

Which is a lot like how people are.

You didn’t come out of the womb really good at poker, or at running demand gen campaigns, or administering Marketo without breaking a sweat. But your brain is a general-purpose computer that can figure out how to get very good at that stuff with enough practice and intentionality.

The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.
— Sam Altman (@sama) July 19, 2020

So, is GPT-3 the AI singularity we have all been worried about? No. GPT-3 feels tantalizingly close to artificial general intelligence (AGI), but it’s not.

Generative

At the heart of it, GPT-3 is trained to predict (or “generate”) the next word from a sequence of words.

In the above simplified examples I’ve shared, you may have assumed that GPT-3 outputs entire phrases. For instance:

“All the world’s a stage,” -> “and all the men and women merely players.”

But that’s not entirely accurate.

GPT-3 actually generates one word at a time (‌to be precise, these are called “tokens,” but let’s just assume a token is a word for now).

How does GPT-3 make these predictions? By basically using a lot of matrix multiplication and linear algebra.

Wait. You can do linear algebra on words?

Yep. It’s pretty clever, actually. At a high-level, GPT-3‌ converts a word to what is called a vector (a list of numbers representing the word).

Here’s an example:

The above is how the word “King” is represented as a vector.

Ultimately, it’s just a list of 50 numbers. By comparing these numbers to other numbers from other words (word vectors), you can capture a lot of information, meaning and associations around these words.

Source: GloVe: Global Vectors for Word Representation (https://nlp.stanford.edu/pubs/glove.pdf).

This interactive tool shows how GPT-2 attempts to use predictions to build content. You can play with this tool here to appreciate how the technology has evolved. Compared to the latest tools, it feels pretty primitive.

In our Shakespeare example, we’re only interested in the first word it outputs (“women”), and as you can see, GPT-2 gets it right most of the time.

Does this look familiar?

This is similar to the “next-word prediction” feature on your smartphone keyboards.

Think of GPT-3 generation capabilities as just a souped-up version of your smartphone keyboard’s word predictor.

To summarize, GPT-3 produces outputs by:

Converting the input words into a vector (which is a list of numbers, as mentioned above)
Putting it into a magic box (which we will talk more about later), doing some matrix multiplication, and calculating a prediction
Converting the result (which is a vector) into a word

Transformer

In June 2017, a bunch of really smart folks published a paper titled, “Attention Is All You Need.” In it, they introduced an architecture called a Transformer, which was quite “transformative” (pardon me) to the field of natural language processing.

Now I’m going to explain “self-attention” — a crucial component of the 2017 paper. Don’t worry: I’m oversimplifying things so everyone can understand it without specialized knowledge.

What is Self-Attention?

Imagine we have this input sentence that we want to process:

“The whale didn’t swim to the surface because it was too lazy.”

What does “it” in this sentence refer to? Is it referring to the surface or the whale? (Spoiler: It’s the whale.) This is a simple question for a toddler to answer … but it’s not that easy for an algorithm.

Self-attention enables our model to correctly associate the word “it” with “whale.” It enables a text model to “look at” every single word in the original sentence before making a decision about what word to include in its output.

This allows the AI to “bake” the understanding of other relevant words into the word it’s currently processing.

Here’s let’s see how GPT-3 maintains coherence of the new text it generates. After each token is produced, it’s added to the sequence of inputs:

This new sequence becomes the new input to the model in its next step. Rinse and repeat.

This is called “auto-regression,” and is one of the ideas which made Recurrent Neural Networks (RNNs) “unreasonably effective.”

This paper, “The Unreasonable Effectiveness of Recurrent Neural Networks,” captured my attention in May 2015, and was one of the founding inspirations for Saleswhale, which was acquired by 6sense in 2021 and evolved into Conversational Email.

As an example, look at the prompts given to our Conversational Email tool for a company called LaunchDarkly:

“LaunchDarkly is a feature management platform. It enables dev and ops teams to control their whole feature lifecycle, from concept to launch to value. Feature flagging is an industry best practice of wrapping a new or risky section of code or infrastructure change with a flag.”

In this instance, words like “it” and “their” are referring to other words — specifically “LaunchDarkly” and “dev and ops teams.” There’s no way for the AI to understand or process this prompt without incorporating that context. When the model ingests this sentence, it has to know that:

The sentence refers to LaunchDarkly
“Their” refers to dev and ops teams
“With a flag” refers to the concept of feature flagging

This is what self-attention does. It bakes in the model’s understanding of relevant terms that explain the context of certain words before processing it.

Why are transformers such a large breakthrough? It’s the ability to scale up this process immensely through a concept called parallelization.

Before 2017, the way we used deep learning to understand text was by using models like Recurrent Neural Networks (RNN), which we mentioned briefly above.

Source: Wikipedia

An RNN basically takes an input (such as “The whale didn’t swim to the surface because it was too lazy”), processes the words one at a time, and then sequentially spits out the output.

But RNNs had issues. They struggled with large inputs of text, and by the time they’d get to the end of a sentence, they’d forget what had happened at its beginning. An RNN might forget that it was “the whale” who was too lazy and think it was “the surface” instead.

And the biggest problem with RNNs? Because they processed words sequentially, it was hard to distribute the computing tasks to multiple processors. And as LLMs got larger and larger, the sheer amount of computation pushed through individual GPUs caused slowdowns in even the fastest machines. This slowdown also meant you couldn’t train them on much data.

Transformers changed everything. Transformers could be very efficiently when parallelized. And with enough hardware, you could train some really big models.

How big? Really, really big.

GPT-3 was trained on 45 terabytes (!) of text, which includes almost the entire entirety of the public internet. That’s mind-boggling, and brings us to the final part of this article.

3 (Three)

As mentioned above, the invention of Transformers gave us the ability to throw massive data and inputs at our models.

And GPT-3 is MASSIVE.

It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which word token to generate at each run.

Think of parameters as dials on a radio that you move up and down to find the right signal.

To understand GPT-3, it helps to understand how GPT-2 came about.

A Wild Idea

In October 2019, Google released BERT, which overshadowed the release of OpenAI’s “GPT-1” (just called GPT back then). Instead of trying to beat BERT at its own game, the OpenAI team decided to take a different route.

Here’s how I imagine the conversation among the OpenAI team went down:

Engineer #1: “Hey, this BERT thing seems to be working better than our GPT idea. What do we do?”

Engineer #2: “We can try copying them…”

Engineer #3: “I know! Let’s just throw more GPUs and data at our model and call it GPT-2!”

All three together: “Awesome!”

We’ll never know if that conversation happened, but it’s essentially what the OpenAI team set out to do. (This is another gross simplification.) But the real question is: Did it work? Does cranking all the dials to 11 and throwing more money at the problem lead to better results?

The answer was surprisingly, yes.

Since our work on "Semi-supervised sequence learning", ELMo, BERT and others have shown changes in the algorithm give big accuracy gains. But now given these nice results with a vanilla language model, it's possible that a big factor for gains can come from scale. Exciting! https://t.co/fL4acZE2Cn
— Quoc Le (@quocleix) February 18, 2019

They blew BERT, with an already impressive 340 million parameters, out of the water with GPT-2’s stupendous 1.5 billion parameters.

GPT-2 showed that it’s possible to improve the performance of a model by “simply” increasing the model size.

With the success of GPT-2… guess what happened next? Someone went even further, saying, “Hey… what if we scaled this even more? Like way more?”

And that’s how GPT-3 was born.

Two Orders of Magnitude Larger

GPT-3 is a MASSIVE, MASSIVE, 175 billion parameter model. That’s two orders of magnitude larger than the already huge GPT-2 model.

Think of GPT-3 as a much bigger GPT-2. An approach that was made possible by the transformer architecture that we covered above.

GPT-3 cost 200x more to train than GPT-2 — a whopping $12 million dollars in computation just to train the model.

Reading the OpenAI GPT-3 paper. Impressive performance on many few-shot language tasks. The cost to train this 175 billion parameter language model appears to be staggering: Nearly $12 million dollars in compute based on public cloud GPU/TPU cost models (200x the price of GPT-2) pic.twitter.com/5ztr4cMm3L
— Elliot Turner (@eturner303) May 29, 2020

GPT-3 showed the potential to “scale it till you break it” with the transformer architecture.

Chapter 4

Limitations of GPT-3

For all of its might and ability to appear human-like in its output, GPT-3 loses coherence, especially on longer texts. And if you read carefully, there can be a surrealist quality to the text that it generates.

Let’s compare GPT-3 to a toddler.

A toddler may have grammatically incorrect or broken speech, but it mostly has a coherent line of thought. It has some concept it wants to express, but it may not know the right words or grammar to express it.

GPT-3 is the exact opposite. It knows the words and the right grammar, but lacks an underlying coherent concept of it wants to express.

Here’s my favorite GPT-3 generated passage from a research paper. Even though it’s obviously incoherent, and easily identified as machine written, for some absurdist reason, I really like it:

Title: Star’s Tux Promise Draws Megyn Kelly’s Sarcasm

Subtitle: Joaquin Phoenix pledged to not change for each awards event

Article: A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedo with a paper bag over his head that read, “I am a shape-shifter. I can’t change the world. I can only change myself.” It was a promise to not change to fit into the Hollywood mold: “I think that’s a really special thing, to not change yourself. I think it’s a really special thing to say, ‘This is what’s inside of me, I’m proud of it, and I’m not going to be ashamed because of the way that someone else thinks I should be.’” Now, it’s the Oscars, and Phoenix is at it again. But this time, his publicist is saying he’ll be wearing a tux no matter what.

Megyn Kelly was not impressed, and she let him have it on The Tonight Show. “You know, I feel like, I feel like you could have worn the tux,” she says. “But you’re saying you’re a shape-shifter. I don’t know if you can change your tux, but you can change your mind. You can change your mind. You can change your mind.” Phoenix says he did, but it didn’t stick. “I was like, ‘Okay, I’m going to wear a tuxedo to this thing.’ And then I thought, ‘I don’t want to wear a tuxedo to this thing.’” Kelly goes on to encourage him to change his mind again, but Phoenix says it’s too late: “I’m committed to wearing this.”

Chapter 5

Conclusion

GPT-3 has captured the imagination of the AI community. An interesting discussion on a programming forum compared the human brain with where GPT-3 is now.

A typical human brain has over 100 trillion synapses, which is just another 3 orders of magnitude larger than the current GPT-3 175 billion parameter model.

It took OpenAI just slightly more than a year to increase their GPT model by 2 orders of magnitude (1.5 billion -> 175 billion).

What would happen if we had a trillion parameter model?

I guess we will find out in another year or two.

Chapter 6

Sources

The fall of RNN – https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Transformer: A Novel Neural Network Architecture for Language Understanding – https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Generating Wikipedia by Summarizing Long Sequences – https://arxiv.org/pdf/1801.10198.pdf
Transfer learning – https://www.tensorflow.org/tutorials/images/transfer_learning
Semi-supervised Sequence Learning – https://arxiv.org/abs/1511.01432
BERT Paper: Pre-training of Deep Bidirectional Transformers for Language Understanding
GPT-1 Paper: Improving Language Understanding by Generative Pre-Training
GPT-2 Paper: Language Models are unsupervised multitask learners
GPT-3 Paper: Language models are few shot learners
Interactive GPT-2 demo: https://transformer.huggingface.co/
Attention Is All You Need (https://arxiv.org/abs/1706.03762)
The Unreasonable Effectiveness of Recurrent Neural Networks: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

Ready to see 6sense in action?

Gabriel Lim

Singaporean born and bred, Gabriel Lim is a self-taught programmer. As founder of the conversational email tool Saleswhale, he was one of the first Singaporean entrepreneurs invited to participate in the Y Combinator program. He was also one of the first 100 people on the planet to get access to GPT-3 through OpenAI’s private beta program. He joined 6sense upon Saleswhale’s acquisition in 2022, and leads our Generative AI efforts.

How We Do It

Create Your Very Own AI Email Assistant

How We Do It

Industries

Data

Ecosystem

Dive into 6sense Revenue AI™

6sense Named a Leader in The 2024 Gartner® Magic Quadrant™ for ABM Platforms

WE GET YOU. WE GOT YOU.

Become Our Partner

A Guide to How Generative AI Works (A Look Inside the Mind of GPT-3)

Chapters

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Table of Contents

Introduction

What Does 'GPT-3' Stand For?

Breaking Down How it Works

Pretrained

Generative

Transformer

What is Self-Attention?

3 (Three)

A Wild Idea

Two Orders of Magnitude Larger

Limitations of GPT-3

Conclusion

Sources

Gabriel Lim

Solutions

Company

Browse 6sense Data