Getting word embeddings

This tutorial shows how to extract features, or word embeddings, from our stimulus transcript. Features are numeric vectors that capture properties of the words in our transcript. Here, we will present two types of features: interpretable syntactic features and high-dimensional contextual word embeddings from a language model.

Acknowledgments: This tutorial draws heavily on the encling tutorial by Samuel A. Nastase.

Open in Colab

[ ]:
# only run this cell in colab
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install accelerate transformers spacy scikit-learn

First, we’ll import some general-purpose Python packages.

[1]:
import numpy as np
import pandas as pd

Extracting syntactic features

One type of linguistic features are explicit grammatical features we are familiar with and have names for. These can include parts of speech (e.g., noun, verb) or syntactic dependencies (e.g., root, subject, object). We will use the spaCy library (Honnibal et al., 2020).

[2]:
import spacy
from sklearn.preprocessing import LabelBinarizer

First we need to load the transcript as a pandas dataframe. It contains columns of words and their start and end timestamps.

[ ]:
bids_root = ""  # if using a local dataset, set this variable accordingly

# Download the transcript, if required
transcript_path = f"{bids_root}stimuli/podcast_transcript.csv"
if not len(bids_root):
    !wget -nc https://s3.amazonaws.com/openneuro.org/ds005574/$transcript_path
    transcript_path = "podcast_transcript.csv"

df = pd.read_csv(transcript_path)
df.head(10)

spaCy requires us to download and load a model that enables its features. First, we will download the en_core_web_sm model, which is trained on English text and includes components for part-of-speech tagging and dependency parsing.

[ ]:
!python -m spacy download en_core_web_sm
[4]:
modelname = "en_core_web_sm"
nlp = spacy.load(modelname)

Language processing pipelines typically use a tokenizer to standardize the (sub-)word units (called tokens) they operate on. Some words and punctuation will get separated into multiple tokens. For example, the word “there’s” will be tokenized into “there” and “‘s”. Thus, the first step for us is to transform our transcript words into tokens that spaCy can work with.
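
As a quick illustration, we can run spaCy’s tokenizer on a single contraction to see this splitting in action (a minimal, optional check):

[ ]:
# spaCy's tokenizer splits contractions into separate tokens
print([t.text for t in nlp.tokenizer("there's ")])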

To keep track of our words and their indices, we first create a word_idx column. We then tokenize the words using the tokenizer and explode the dataframe so that each row of the dataframe is a token (and not a word). Note that we add a white space to the end of each word before tokenization so we can track the boundary of each word. Compare the dataframe above with the one below.

[5]:
df.insert(0, "word_idx", df.index.values)
df["word_with_ws"] = df.word.astype(str) + " "
df["hftoken"] = df.word_with_ws.apply(nlp.tokenizer)
df = df.explode("hftoken", ignore_index=True)
df.head(10)
[5]:
word_idx word start end word_with_ws hftoken
0 0 Act 3.710 3.790 Act Act
1 1 one, 3.990 4.190 one, one
2 1 one, 3.990 4.190 one, ,
3 2 monkey 4.651 4.931 monkey monkey
4 3 in 4.951 5.011 in in
5 4 the 5.051 5.111 the the
6 5 middle. 5.151 5.391 middle. middle
7 5 middle. 5.151 5.391 middle. .
8 6 So 6.592 6.732 So So
9 7 there's 6.752 6.912 there's there

Now we will create a doc object (essentially a list of token objects) from our tokenized text:

[6]:
words = [token.text for token in df.hftoken.tolist()]
spaces = [token.whitespace_ == " " for token in df.hftoken.tolist()]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp(doc)

We will loop through the doc and get the features for each token. The features include text, tag (detailed part-of-speech tag), dep (syntactic dependency, i.e., the relation between tokens), and is_stop (whether the token is part of a stop list). We will organize the features into a second dataframe and add those columns back to df. Finally, we will drop the two columns we no longer need and save df for future encoding.

[7]:
features = []
for token in doc:
    features.append([token.text, token.tag_, token.dep_, token.is_stop])

df2 = pd.DataFrame(
        features, columns=["token", "pos", "dep", "stop"], index=df.index
    )
df = pd.concat([df, df2], axis=1)
df.drop(["hftoken", "word_with_ws"], axis=1, inplace=True)
df.head(10)
[7]:
word_idx word start end token pos dep stop
0 0 Act 3.710 3.790 Act NNP ROOT False
1 1 one, 3.990 4.190 one CD nummod True
2 1 one, 3.990 4.190 , , punct False
3 2 monkey 4.651 4.931 monkey NN appos False
4 3 in 4.951 5.011 in IN prep True
5 4 the 5.051 5.111 the DT det True
6 5 middle. 5.151 5.391 middle NN pobj False
7 5 middle. 5.151 5.391 . . punct False
8 6 So 6.592 6.732 So RB advmod True
9 7 there's 6.752 6.912 there EX expl True
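
If any of the tag or dependency labels are unfamiliar, spacy.explain returns a short human-readable description (optional, just for inspection):

[ ]:
# Look up human-readable descriptions of part-of-speech and dependency labels
print(spacy.explain("NNP"))    # e.g., proper noun, singular
print(spacy.explain("appos"))  # e.g., appositional modifier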

Since the features we extracted are all categorical, we need to turn them into numerical vectors. We will use LabelBinarizer from sklearn, which fits to all the possible category labels for a feature and then transforms our labels into one-hot binary vectors. There are 50 possible labels for tag and 45 possible labels for dep, so those two features will be turned into 50-dimensional and 45-dimensional vectors, respectively. Our is_stop feature is binary, so it will just be one-dimensional. We concatenate all three features to form a 96-dimensional syntactic feature vector and save it for future encoding.

[8]:
taggerEncoder = LabelBinarizer().fit(nlp.get_pipe("tagger").labels)
dependencyEncoder = LabelBinarizer().fit(nlp.get_pipe("parser").labels)

a = taggerEncoder.transform(df.pos.tolist())
b = dependencyEncoder.transform(df.dep.tolist())
c = LabelBinarizer().fit_transform(df.stop.tolist())
embeddings = np.hstack((a, b, c))
print(f"Embeddings have a shape of: {embeddings.shape}")
Embeddings have a shape of: (5305, 96)
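
Both the token-level dataframe and the syntactic feature matrix can now be written to disk for later encoding analyses. The filenames below are placeholders; adjust them to your own directory layout:

[ ]:
# Save the token-level dataframe and the syntactic feature matrix
# (placeholder filenames; change these to wherever you keep derivatives)
df.to_csv("syntactic_features_df.csv", index=False)
np.save("syntactic_features.npy", embeddings)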

Extracting GPT-2 Features

Now we will extract contextual word embeddings from an autoregressive (or “causal”) large language model (LLM) called GPT-2 (Radford et al., 2019). GPT-2 relies on the Transformer architecture to sculpt the embedding of a given word based on the preceding context. The model is composed of a repeated circuit motif—called the “attention head”—by which the model can “attend” to previous words in the context window when determining the meaning of the current word. This GPT-2 implementation is composed of 12 layers, each of which contains 12 attention heads that influence the embedding as it proceeds to the subsequent layer. The embeddings at each layer of the model comprise 768 features and the context window includes the preceding 1024 tokens. Note that certain words will be broken up into multiple tokens; we’ll need to use GPT-2’s “tokenizer” to convert words into the appropriate tokens. GPT-2 has been (pre)trained on large corpora of text according to a simple self-supervised objective function: predict the next word based on the prior context.

We will be using the HuggingFace transformers library for working with these models. If you want to learn more about LLMs and GPT-2, here are some great blogs explaining transformers and GPT-2 architecture. The HuggingFace website also has many useful resources.

Note

Using large language models, even small ones, requires substantial compute resources. If you’re using Colab, go to Edit → Notebook settings and select a GPU. Restart the runtime and try running again. Afterwards, you can run !nvidia-smi in a new cell to verify that a GPU is available.
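
For example, running the following cell should list any visible GPU (on Colab or another Linux machine with NVIDIA drivers):

[ ]:
# Check that a GPU is visible (Colab / Linux with NVIDIA drivers)
!nvidia-smi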

[9]:
import torch
from accelerate import Accelerator, find_executable_batch_size
from transformers import AutoModelForCausalLM, AutoTokenizer

Let’s reload the stimulus transcript.

[10]:
df = pd.read_csv(transcript_path)
df.head(10)
[10]:
word start end
0 Act 3.710 3.790
1 one, 3.990 4.190
2 monkey 4.651 4.931
3 in 4.951 5.011
4 the 5.051 5.111
5 middle. 5.151 5.391
6 So 6.592 6.732
7 there's 6.752 6.912
8 some 6.892 7.052
9 places 7.072 7.342

We will define some general arguments: the model name as it appears on HuggingFace, the context length (i.e., how many tokens we input into the model), and the compute device. We set the device to cuda to utilize a GPU if one is available.

[11]:
modelname = "gpt2"
context_len = 32
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)
    print("Using cuda!")

We will now load the GPT-2 tokenizer to convert words into a list of tokens. Then, we will explode the dataframe so that each row of the dataframe is a token. We will convert tokens to token_ids (integer IDs corresponding to words in the GPT-2 vocabulary, which contains approximately 50,000 tokens) to use as input into GPT-2.

[12]:
# Load model
tokenizer = AutoTokenizer.from_pretrained(modelname)

df.insert(0, "word_idx", df.index.values)
df["hftoken"] = df.word.apply(lambda x: tokenizer.tokenize(" " + x))

df = df.explode("hftoken", ignore_index=True)
df["token_id"] = df.hftoken.apply(tokenizer.convert_tokens_to_ids)

df.head(10)
[12]:
word_idx word start end hftoken token_id
0 0 Act 3.710 3.790 ĠAct 2191
1 1 one, 3.990 4.190 Ġone 530
2 1 one, 3.990 4.190 , 11
3 2 monkey 4.651 4.931 Ġmonkey 21657
4 3 in 4.951 5.011 Ġin 287
5 4 the 5.051 5.111 Ġthe 262
6 5 middle. 5.151 5.391 Ġmiddle 3504
7 5 middle. 5.151 5.391 . 13
8 6 So 6.592 6.732 ĠSo 1406
9 7 there's 6.752 6.912 Ġthere 612

Then we will download and load the pretrained GPT-2 model. You can inspect its configuration via model.config for more detailed information (e.g., number of layers, maximum context length).

[13]:
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(modelname)

print(
    f"Model : {modelname}"
    f"\nLayers: {model.config.num_hidden_layers}"
    f"\nEmbDim: {model.config.hidden_size}"
    f"\nConfig: {model.config}"
)
model = model.eval()
model = model.to(device)
Loading model...
Model : gpt2
Layers: 12
EmbDim: 768
Config: GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.45.2",
  "use_cache": true,
  "vocab_size": 50257
}

Since our transcript contains more tokens than the context window (32), we will reformat all the token_ids into data, a torch tensor with a shape of (number of tokens × 33). To extract features for a token from GPT-2 using a context length of 32, we need to input 33 tokens: the token itself plus the 32 preceding tokens. Note that for the first 32 tokens in the transcript, we pad the input to length 33 using the tokenizer’s pad_token_id (or 0 if the tokenizer does not define one).

[14]:
token_ids = df.token_id.tolist()
fill_value = 0
if tokenizer.pad_token_id is not None:
    fill_value = tokenizer.pad_token_id

data = torch.full((len(token_ids), context_len + 1), fill_value, dtype=torch.long)
for i in range(len(token_ids)):
    example_tokens = token_ids[max(0, i - context_len) : i + 1]
    data[i, -len(example_tokens) :] = torch.tensor(example_tokens)

print(f"Data has a shape of: {data.shape}")
Data has a shape of: torch.Size([5491, 33])

We will use Accelerator to make extracting features more efficient. It includes a find_executable_batch_size decorator, which finds a workable batch size by halving the batch size after each failed (e.g., out-of-memory) run of the decorated function (in this case, our inference_loop function).

Inside the inference_loop function, we will use a PyTorch DataLoader to supply token IDs to the model in batches and extract the features. In addition to the embeddings, we’ll also extract several other features of potential interest from the model. As GPT-2 proceeds through the text, it generates a probability distribution (derived from the logits extracted below) across all words in the vocabulary with the goal of correctly predicting the next word. We can use this probability distribution to derive other features of the model’s internal computations. We’ll extract the following features from GPT-2 (a toy sketch after the list illustrates how the rank, probability, and entropy values are computed):

  • embeddings: the 768-dimensional contextual embedding capturing the meaning of the current word

  • top_guesses: the highest probability word GPT-2 predicts for the current word

  • ranks: the rank of the correct word given probabilities across the vocabulary

  • true_probs: the probability at which GPT-2 predicted the current word

  • entropies: how uncertain GPT-2 was about the current word

    • low entropy indicates that the probability distribution was “focused” on certain words

    • high entropy indicates the probability distribution was more uniform/dispersed across words
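
Before running the full loop, here is a toy sketch of how rank, true probability, and entropy fall out of a probability distribution, using a made-up four-token vocabulary (purely illustrative; the loop below does the same thing over GPT-2’s full vocabulary):

[ ]:
# Toy example: rank, true probability, and entropy for a 4-token "vocabulary"
toy_logits = torch.tensor([2.0, 0.5, 0.1, -1.0])  # made-up logits
toy_probs = torch.softmax(toy_logits, dim=-1)      # probability distribution
true_id = 1                                        # pretend this is the actual next token
rank = (toy_logits.argsort(descending=True) == true_id).nonzero().item()
true_prob = toy_probs[true_id].item()
entropy = torch.distributions.Categorical(probs=toy_probs).entropy().item()
print(f"rank={rank}, true_prob={true_prob:.3f}, entropy={entropy:.3f}")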

[15]:
accelerator = Accelerator()

@find_executable_batch_size(starting_batch_size=32)
def inference_loop(batch_size=32):
    # nonlocal accelerator  # Ensure they can be used in our context
    accelerator.free_memory()  # Free all lingering references

    data_dl = torch.utils.data.DataLoader(
        data, batch_size=batch_size, shuffle=False
        )

    top_guesses = []
    ranks = []
    true_probs = []
    entropies = []
    embeddings = []

    with torch.no_grad():
        for batch in data_dl:
            # Get output from model
            output = model(batch.to(device), output_hidden_states=True)
            logits = output.logits
            states = output.hidden_states

            true_ids = batch[:, -1]
            brange = list(range(len(true_ids)))
            logits_order = logits[:, -2, :].argsort(descending=True)
            batch_top_guesses = logits_order[:, 0]
            batch_ranks = torch.eq(logits_order, true_ids.reshape(-1, 1).to(device)).nonzero()[:, 1]
            batch_probs = torch.softmax(logits[:, -2, :], dim=-1)
            batch_true_probs = batch_probs[brange, true_ids]
            batch_entropy = torch.distributions.Categorical(probs=batch_probs).entropy()
            batch_embeddings = [state[:, -1, :].numpy(force=True) for state in states ]

            top_guesses.append(batch_top_guesses.numpy(force=True))
            ranks.append(batch_ranks.numpy(force=True))
            true_probs.append(batch_true_probs.numpy(force=True))
            entropies.append(batch_entropy.numpy(force=True))
            embeddings.append(batch_embeddings)

        return top_guesses, ranks, true_probs, entropies, embeddings

top_guesses, ranks, true_probs, entropies, embeddings = inference_loop()

Now we will add the additional information from GPT-2 as columns to df.

[16]:
df["rank"] = np.concatenate(ranks)
df["true_prob"] = np.concatenate(true_probs)
df["top_pred"] = np.concatenate(top_guesses)
df["entropy"] = np.concatenate(entropies)

df.head(10)
[16]:
word_idx word start end hftoken token_id rank true_prob top_pred entropy
0 0 Act 3.710 3.790 ĠAct 2191 3185 1.000139e-08 0 0.092728
1 1 one, 3.990 4.190 Ġone 530 46 2.847577e-03 352 5.294118
2 1 one, 3.990 4.190 , 11 2 8.006448e-02 0 4.976894
3 2 monkey 4.651 4.931 Ġmonkey 21657 6978 6.075863e-06 734 5.869678
4 3 in 4.951 5.011 Ġin 287 24 1.004823e-03 0 2.478687
5 4 the 5.051 5.111 Ġthe 262 0 3.898537e-01 262 4.340655
6 5 middle. 5.151 5.391 Ġmiddle 3504 2 4.331103e-02 5228 5.842120
7 5 middle. 5.151 5.391 . 13 3 4.237065e-02 286 2.115351
8 6 So 6.592 6.732 ĠSo 1406 116 1.016026e-03 2191 5.861630
9 7 there's 6.752 6.912 Ġthere 612 16 8.699116e-03 11 5.249004
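
The top_pred column contains token IDs; to make them readable, we can convert them back into strings with the tokenizer (an optional check; the column name top_pred_token is just our own addition):

[ ]:
# Convert the predicted token IDs back into readable strings
df["top_pred_token"] = df.top_pred.apply(lambda i: tokenizer.decode([int(i)]))
df[["word", "hftoken", "top_pred_token"]].head(10)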

Finally, we confirm the size and number of embeddings we got. Note that there are 13 layers (instead of the expected 12) because the initial embeddings before the first layer of the network are also included. Also note that the list of embeddings is organized by batch, so it will require flattening to match the number of tokens (a sketch of this flattening step follows the cell below).

[17]:
print(f"There are {len(embeddings[0])} layers of embeddings")
print(f"Each word embedding is {embeddings[0][0].shape[1]} dimensions long")
There are 13 layers of embeddings
Each word embedding is 768 dimensions long
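
As a minimal sketch of that flattening step, we can concatenate the batch-wise arrays for a single layer (layer 8 here is an arbitrary choice) and check that the number of rows matches the number of tokens:

[ ]:
# Flatten the batch-wise embeddings for one layer into a (tokens x 768) matrix
layer = 8  # arbitrary choice; 0 is the input embedding, 1-12 are the transformer layers
layer_embeddings = np.concatenate([batch[layer] for batch in embeddings], axis=0)
print(f"Layer {layer} embeddings have a shape of: {layer_embeddings.shape}")
assert len(layer_embeddings) == len(df)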