Getting word embeddings¶
This tutorial introduces how to extract features, or word embeddings, from our stimulus transcript. Features are numeric vectors that capture linguistic properties of the words in our transcript. Here, we will present two types of features: interpretable syntactic features and high-dimensional word embeddings from a language model.
Acknowledgments: This tutorial draws heavily on the encling tutorial by Samuel A. Nastase.
[ ]:
# only run this cell in colab
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install accelerate transformers spacy scikit-learn
First, we’ll import some general-purpose Python packages.
[1]:
import numpy as np
import pandas as pd
Extracting syntactic features¶
One type of linguistic feature is the set of explicit grammatical features we are familiar with and have names for. These include parts of speech (e.g., noun, verb) and syntactic dependencies (e.g., root, subject, object). We will extract them using the spaCy library (Honnibal et al., 2020).
[2]:
import spacy
from sklearn.preprocessing import LabelBinarizer
First we need to load the transcript as a pandas dataframe. It contains columns of words and their start and end timestamps.
[ ]:
bids_root = "" # if using a local dataset, set this variable accordingly
# Download the transcript, if required
transcript_path = f"{bids_root}stimuli/podcast_transcript.csv"
if not len(bids_root):
    !wget -nc https://s3.amazonaws.com/openneuro.org/ds005574/$transcript_path
    transcript_path = "podcast_transcript.csv"
df = pd.read_csv(transcript_path)
df.head(10)
spaCy requires us to download and load a model that enables its features. Here, we will download the en_core_web_sm model, which is trained on English text and includes components for part-of-speech tagging and dependency parsing.
[ ]:
!python -m spacy download en_core_web_sm
[4]:
modelname = "en_core_web_sm"
nlp = spacy.load(modelname)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
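To get a quick feel for the kinds of labels this model produces, here is a small illustrative sketch that runs the loaded pipeline on a made-up example sentence (not part of the stimulus) and prints each token's part-of-speech tag, dependency relation, and stop-word status.
[ ]:
# Quick illustration: tag a short example sentence with the loaded pipeline
example_doc = nlp("So there's a monkey in the middle.")
for token in example_doc:
    print(f"{token.text:<10} tag={token.tag_:<6} dep={token.dep_:<10} stop={token.is_stop}")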
Language processing pipelines typically use a tokenizer to standardize the (sub-)word units (called tokens) they operate on. Some words and punctuation marks get separated into multiple tokens. For example, the word “there’s” will be tokenized into “there” and “’s”. Thus, the first step for us is to transform our transcript words into tokens that spaCy can work with.

To keep track of our words and their indices, we first create a word_idx column. We then tokenize the words using the tokenizer and explode the dataframe so that each row of the dataframe is a token (and not a word). Note that we will add white space to the end of each word before tokenization so we can track the boundary of each word. Compare the dataframe above with the one below.
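Before applying this to the full transcript, here is a small illustrative check (the input string is just an example) of how the tokenizer splits “there’s” into separate tokens and records trailing white space:
[ ]:
# Illustrate how the spaCy tokenizer splits "there's " into multiple tokens
for token in nlp.tokenizer("there's "):
    print(repr(token.text), repr(token.whitespace_))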
[5]:
df.insert(0, "word_idx", df.index.values)
df["word_with_ws"] = df.word.astype(str) + " "
df["hftoken"] = df.word_with_ws.apply(nlp.tokenizer)
df = df.explode("hftoken", ignore_index=True)
df.head(10)
[5]:
|   | word_idx | word | start | end | word_with_ws | hftoken |
|---|---|---|---|---|---|---|
| 0 | 0 | Act | 3.710 | 3.790 | Act | Act |
| 1 | 1 | one, | 3.990 | 4.190 | one, | one |
| 2 | 1 | one, | 3.990 | 4.190 | one, | , |
| 3 | 2 | monkey | 4.651 | 4.931 | monkey | monkey |
| 4 | 3 | in | 4.951 | 5.011 | in | in |
| 5 | 4 | the | 5.051 | 5.111 | the | the |
| 6 | 5 | middle. | 5.151 | 5.391 | middle. | middle |
| 7 | 5 | middle. | 5.151 | 5.391 | middle. | . |
| 8 | 6 | So | 6.592 | 6.732 | So | So |
| 9 | 7 | there's | 6.752 | 6.912 | there's | there |
Now we will create a doc object (essentially a list of token objects) from our tokenized text:
[6]:
words = [token.text for token in df.hftoken.tolist()]
spaces = [token.whitespace_ == " " for token in df.hftoken.tolist()]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp(doc)
We will loop through the doc and get the features for each token. The features include text, tag (detailed part-of-speech tag), dep (syntactic dependency, i.e., the relation between tokens), and is_stop (whether the token is part of a stop list). We will organize the features into a second dataframe and add those columns back to df. We will then drop the two columns we don’t need anymore and save df for future encoding.
[7]:
features = []
for token in doc:
    features.append([token.text, token.tag_, token.dep_, token.is_stop])
df2 = pd.DataFrame(
    features, columns=["token", "pos", "dep", "stop"], index=df.index
)
df = pd.concat([df, df2], axis=1)
df.drop(["hftoken", "word_with_ws"], axis=1, inplace=True)
df.head(10)
[7]:
|   | word_idx | word | start | end | token | pos | dep | stop |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Act | 3.710 | 3.790 | Act | NNP | ROOT | False |
| 1 | 1 | one, | 3.990 | 4.190 | one | CD | nummod | True |
| 2 | 1 | one, | 3.990 | 4.190 | , | , | punct | False |
| 3 | 2 | monkey | 4.651 | 4.931 | monkey | NN | appos | False |
| 4 | 3 | in | 4.951 | 5.011 | in | IN | prep | True |
| 5 | 4 | the | 5.051 | 5.111 | the | DT | det | True |
| 6 | 5 | middle. | 5.151 | 5.391 | middle | NN | pobj | False |
| 7 | 5 | middle. | 5.151 | 5.391 | . | . | punct | False |
| 8 | 6 | So | 6.592 | 6.732 | So | RB | advmod | True |
| 9 | 7 | there's | 6.752 | 6.912 | there | EX | expl | True |
Since the features we extracted are all categorical, we need to turn them into numerical vectors. We will use LabelBinarizer from sklearn, which fits to all the possible category labels for a feature and then transforms our labels into one-hot binary vectors. There are 50 possible labels for tag and 45 possible labels for dep, so those two features will be turned into 50-dimensional and 45-dimensional vectors, respectively. Our is_stop feature is binary, so it will just be one-dimensional. We concatenate all three to form a 96-dimensional syntactic feature vector and save it for future encoding.
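As a quick illustration of what LabelBinarizer does, here is a toy sketch (using a handful of made-up labels rather than the full spaCy tag set) showing how categorical labels become one-hot vectors:
[ ]:
# Toy example: LabelBinarizer maps each label to a one-hot row vector
toy_encoder = LabelBinarizer().fit(["DT", "NN", "VB"])
print(toy_encoder.classes_)
print(toy_encoder.transform(["DT", "NN", "NN"]))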
[8]:
taggerEncoder = LabelBinarizer().fit(nlp.get_pipe("tagger").labels)
dependencyEncoder = LabelBinarizer().fit(nlp.get_pipe("parser").labels)
a = taggerEncoder.transform(df.pos.tolist())
b = dependencyEncoder.transform(df.dep.tolist())
c = LabelBinarizer().fit_transform(df.stop.tolist())
embeddings = np.hstack((a, b, c))
print(f"Embeddings have a shape of: {embeddings.shape}")
Embeddings have a shape of: (5305, 96)
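If you want to keep these syntactic features for the encoding tutorial, a minimal sketch is shown below; the filenames are placeholders (not part of the dataset) that you can adapt to your own directory layout.
[ ]:
# Save the syntactic features and the token-level dataframe (placeholder filenames)
np.save("syntactic_embeddings.npy", embeddings)
df.to_csv("syntactic_features.csv", index=False)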
Extracting GPT-2 Features¶
Now we will extract contextual word embeddings from an autoregressive (or “causal”) large language model (LLM) called GPT-2 (Radford et al., 2019). GPT-2 relies on the Transformer architecture to sculpt the embedding of a given word based on the preceding context. The model is composed of a repeated circuit motif—called the “attention head”—by which the model can “attend” to previous words in the context window when determining the meaning of the current word. This GPT-2 implementation is composed of 12 layers, each of which contains 12 attention heads that influence the embedding as it proceeds to the subsequent layer. The embeddings at each layer of the model comprise 768 features and the context window includes the preceding 1024 tokens. Note that certain words will be broken up into multiple tokens; we’ll need to use GPT-2’s “tokenizer” to convert words into the appropriate tokens. GPT-2 has been (pre)trained on large corpora of text according to a simple self-supervised objective function: predict the next word based on the prior context.
We will be using the HuggingFace transformers library for working with these models. If you want to learn more about LLMs and GPT-2, here are some great blogs explaining transformers and GPT-2 architecture. The HuggingFace website also has many useful resources.
Note
Using large language models, even small ones, requires a lot of compute resources. If you’re using Colab, go to Edit → Notebook Settings and select a GPU. Restart the runtime and try running again. Afterwards, you can run !nvidia-smi in a new cell to verify that a GPU is available.
[9]:
import torch
from accelerate import Accelerator, find_executable_batch_size
from transformers import AutoModelForCausalLM, AutoTokenizer
Let’s reload the stimulus transcript.
[10]:
df = pd.read_csv(transcript_path)
df.head(10)
[10]:
|   | word | start | end |
|---|---|---|---|
| 0 | Act | 3.710 | 3.790 |
| 1 | one, | 3.990 | 4.190 |
| 2 | monkey | 4.651 | 4.931 |
| 3 | in | 4.951 | 5.011 |
| 4 | the | 5.051 | 5.111 |
| 5 | middle. | 5.151 | 5.391 |
| 6 | So | 6.592 | 6.732 |
| 7 | there's | 6.752 | 6.912 |
| 8 | some | 6.892 | 7.052 |
| 9 | places | 7.072 | 7.342 |
We will define some of the general arguments, including the model name as it appears on HuggingFace, the context length (i.e., how many tokens we input into the model), and the compute device. We can set the device to cuda to utilize a GPU if it’s available.
[11]:
modelname = "gpt2"
context_len = 32
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)
    print("Using cuda!")
We will now load the GPT-2 tokenizer to convert words into a list of tokens. Then, we will explode the dataframe so that each row of the dataframe is a token. We will convert tokens to token_ids (integer IDs corresponding to words in the GPT-2 vocabulary, which contains approximately 50,000 tokens) to use as input into GPT-2.
[12]:
# Load model
tokenizer = AutoTokenizer.from_pretrained(modelname)
df.insert(0, "word_idx", df.index.values)
df["hftoken"] = df.word.apply(lambda x: tokenizer.tokenize(" " + x))
df = df.explode("hftoken", ignore_index=True)
df["token_id"] = df.hftoken.apply(tokenizer.convert_tokens_to_ids)
df.head(10)
[12]:
|   | word_idx | word | start | end | hftoken | token_id |
|---|---|---|---|---|---|---|
| 0 | 0 | Act | 3.710 | 3.790 | ĠAct | 2191 |
| 1 | 1 | one, | 3.990 | 4.190 | Ġone | 530 |
| 2 | 1 | one, | 3.990 | 4.190 | , | 11 |
| 3 | 2 | monkey | 4.651 | 4.931 | Ġmonkey | 21657 |
| 4 | 3 | in | 4.951 | 5.011 | Ġin | 287 |
| 5 | 4 | the | 5.051 | 5.111 | Ġthe | 262 |
| 6 | 5 | middle. | 5.151 | 5.391 | Ġmiddle | 3504 |
| 7 | 5 | middle. | 5.151 | 5.391 | . | 13 |
| 8 | 6 | So | 6.592 | 6.732 | ĠSo | 1406 |
| 9 | 7 | there's | 6.752 | 6.912 | Ġthere | 612 |
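As an optional sanity check (this cell is purely illustrative), we can confirm the size of the GPT-2 vocabulary and decode the first few token IDs back into text:
[ ]:
# Sanity check: vocabulary size and round-trip decoding of the first few token IDs
print(f"Vocabulary size: {len(tokenizer)}")
print(tokenizer.decode([int(i) for i in df.token_id.head(8)]))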
Then we will download and load the pretrained GPT-2 model. You can inspect its configurations in model.config for more detailed information (e.g., number of layers, max context length).
[13]:
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(modelname)
print(
f"Model : {modelname}"
f"\nLayers: {model.config.num_hidden_layers}"
f"\nEmbDim: {model.config.hidden_size}"
f"\nConfig: {model.config}"
)
model = model.eval()
model = model.to(device)
Loading model...
Model : gpt2
Layers: 12
EmbDim: 768
Config: GPT2Config {
"_name_or_path": "gpt2",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.45.2",
"use_cache": true,
"vocab_size": 50257
}
Since our transcript contains more tokens than the context window (32), we will reformat all the token_ids into data, a torch tensor with a shape of (number of tokens x 33). To extract features for a token from GPT-2 using a context length of 32, we need to input 33 tokens: the token itself and the 32 preceding tokens. Note that for the first 32 tokens in the transcript, we use the pad_token_id (or 0) to pad the input length to 33.
[14]:
token_ids = df.token_id.tolist()
fill_value = 0
if tokenizer.pad_token_id is not None:
    fill_value = tokenizer.pad_token_id
data = torch.full((len(token_ids), context_len + 1), fill_value, dtype=torch.long)
for i in range(len(token_ids)):
    example_tokens = token_ids[max(0, i - context_len) : i + 1]
    data[i, -len(example_tokens) :] = torch.tensor(example_tokens)
print(f"Data has a shape of: {data.shape}")
Data has a shape of: torch.Size([5491, 33])
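An optional check (just for illustration) that the tensor is built as described: every row should end with the token’s own ID, and the earliest rows should be left-padded.
[ ]:
# Optional check: the last column holds each token's own ID; early rows are left-padded
assert torch.equal(data[:, -1], torch.tensor(token_ids))
print(data[0])  # 32 pad values followed by the first token ID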
We will use Accelerator to make extracting features more efficient. It includes a find_executable_batch_size utility, which finds a workable batch size by halving the batch size after each failed run of the decorated function (in this case, our inference_loop function).

Inside the inference_loop function, we will use a PyTorch DataLoader to supply token IDs to the model in batches and extract the features. In addition to the embeddings, we’ll also extract several other features of potential interest from the model. As GPT-2 proceeds through the text, it generates a probability distribution (the logits extracted below) across all words in the vocabulary with the goal of correctly predicting the next word. We can use this probability distribution to derive other features of the model’s internal computations. We’ll extract the following features from GPT-2:

- embeddings: the 768-dimensional contextual embedding capturing the meaning of the current word
- top_guesses: the highest-probability word GPT-2 predicts for the current word
- ranks: the rank of the correct word in the predicted probability distribution over the vocabulary
- true_probs: the probability GPT-2 assigned to the actual current word
- entropies: how uncertain GPT-2 was about the current word (see the toy example below)
  - low entropy indicates that the probability distribution was “focused” on certain words
  - high entropy indicates that the probability distribution was more uniform/dispersed across words
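To make the entropy intuition concrete, here is a toy example (the probability values are made up) contrasting a focused distribution with a dispersed one over a five-word vocabulary:
[ ]:
# Toy example: entropy of a focused vs. a dispersed probability distribution
focused = torch.tensor([0.9, 0.04, 0.03, 0.02, 0.01])
dispersed = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2])
print(torch.distributions.Categorical(probs=focused).entropy())    # low entropy (~0.45 nats)
print(torch.distributions.Categorical(probs=dispersed).entropy())  # high entropy (ln 5 ~ 1.61 nats)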
[15]:
accelerator = Accelerator()


@find_executable_batch_size(starting_batch_size=32)
def inference_loop(batch_size=32):
    # nonlocal accelerator  # Ensure they can be used in our context
    accelerator.free_memory()  # Free all lingering references
    data_dl = torch.utils.data.DataLoader(
        data, batch_size=batch_size, shuffle=False
    )

    top_guesses = []
    ranks = []
    true_probs = []
    entropies = []
    embeddings = []

    with torch.no_grad():
        for batch in data_dl:
            # Get output from model
            output = model(batch.to(device), output_hidden_states=True)
            logits = output.logits
            states = output.hidden_states

            # ID of the current (last) token in each example
            true_ids = batch[:, -1]
            brange = list(range(len(true_ids)))

            # Top prediction and rank of the true token from the logits
            logits_order = logits[:, -2, :].argsort(descending=True)
            batch_top_guesses = logits_order[:, 0]
            batch_ranks = torch.eq(
                logits_order, true_ids.reshape(-1, 1).to(device)
            ).nonzero()[:, 1]

            # Probability assigned to the true token and entropy of the distribution
            batch_probs = torch.softmax(logits[:, -2, :], dim=-1)
            batch_true_probs = batch_probs[brange, true_ids]
            batch_entropy = torch.distributions.Categorical(probs=batch_probs).entropy()

            # Hidden states (one per layer) for the current token
            batch_embeddings = [state[:, -1, :].numpy(force=True) for state in states]

            top_guesses.append(batch_top_guesses.numpy(force=True))
            ranks.append(batch_ranks.numpy(force=True))
            true_probs.append(batch_true_probs.numpy(force=True))
            entropies.append(batch_entropy.numpy(force=True))
            embeddings.append(batch_embeddings)

    return top_guesses, ranks, true_probs, entropies, embeddings
top_guesses, ranks, true_probs, entropies, embeddings = inference_loop()
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Now we will add the additional information from GPT-2 as columns to df.
[16]:
df["rank"] = np.concatenate(ranks)
df["true_prob"] = np.concatenate(true_probs)
df["top_pred"] = np.concatenate(top_guesses)
df["entropy"] = np.concatenate(entropies)
df.head(10)
[16]:
|   | word_idx | word | start | end | hftoken | token_id | rank | true_prob | top_pred | entropy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Act | 3.710 | 3.790 | ĠAct | 2191 | 3185 | 1.000139e-08 | 0 | 0.092728 |
| 1 | 1 | one, | 3.990 | 4.190 | Ġone | 530 | 46 | 2.847577e-03 | 352 | 5.294118 |
| 2 | 1 | one, | 3.990 | 4.190 | , | 11 | 2 | 8.006448e-02 | 0 | 4.976894 |
| 3 | 2 | monkey | 4.651 | 4.931 | Ġmonkey | 21657 | 6978 | 6.075863e-06 | 734 | 5.869678 |
| 4 | 3 | in | 4.951 | 5.011 | Ġin | 287 | 24 | 1.004823e-03 | 0 | 2.478687 |
| 5 | 4 | the | 5.051 | 5.111 | Ġthe | 262 | 0 | 3.898537e-01 | 262 | 4.340655 |
| 6 | 5 | middle. | 5.151 | 5.391 | Ġmiddle | 3504 | 2 | 4.331103e-02 | 5228 | 5.842120 |
| 7 | 5 | middle. | 5.151 | 5.391 | . | 13 | 3 | 4.237065e-02 | 286 | 2.115351 |
| 8 | 6 | So | 6.592 | 6.732 | ĠSo | 1406 | 116 | 1.016026e-03 | 2191 | 5.861630 |
| 9 | 7 | there's | 6.752 | 6.912 | Ġthere | 612 | 16 | 8.699116e-03 | 11 | 5.249004 |
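The top_pred column contains raw token IDs, which are hard to read on their own. The optional sketch below (the new top_pred_token column name is just a suggestion) decodes them back into tokens so you can compare GPT-2’s top guess with the actual token:
[ ]:
# Decode GPT-2's top predictions from token IDs back into readable tokens
df["top_pred_token"] = df.top_pred.apply(lambda i: tokenizer.convert_ids_to_tokens(int(i)))
df[["word", "hftoken", "top_pred_token", "rank", "true_prob"]].head(10)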
Let’s confirm the size and number of embeddings we extracted. Note that there are 13 layers (instead of the expected 12) because the initial embeddings before the first layer of the network are also included. Also note that the list of embeddings is organized by batch, so it will require flattening to match the number of tokens.
[17]:
print(f"There are {len(embeddings[0])} layers of embeddings")
print(f"Each word embedding is {embeddings[0][0].shape[1]} dimensions long")
There are 13 layers of embeddings
Each word embedding is 768 dimensions long
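Since the embeddings list has one entry per batch (each containing 13 layer-wise arrays), a minimal sketch for flattening it into per-layer arrays of shape (number of tokens, 768) might look like this; printing layer 8 at the end is just an illustrative choice:
[ ]:
# Flatten the batch-wise embeddings into one (n_tokens, 768) array per layer
n_layers = len(embeddings[0])
layer_embeddings = [
    np.concatenate([batch[layer] for batch in embeddings], axis=0)
    for layer in range(n_layers)
]
print(layer_embeddings[8].shape)  # e.g., layer 8: (number of tokens, 768)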