Getting word embeddings

This tutorial shows how to extract features, or word embeddings, from our stimulus transcript. Features are numeric vectors that capture properties of the words in our transcript. Here, we will present two types of features: interpretable syntactic features and high-dimensional contextual word embeddings from a language model.

Acknowledgments: This tutorial draws heavily on the encling tutorial by Samuel A. Nastase.

Open in Colab

[ ]:
# only run this cell in colab
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install accelerate transformers spacy scikit-learn

First, we’ll import some general-purpose Python packages.

[1]:
import numpy as np
import pandas as pd

Extracting syntactic features

One type of linguistic features are explicit grammatical features we are familiar with and have names for. These can include parts of speech (e.g., noun, verb) or syntactic dependencies (e.g., root, subject, object). We will use the spaCy library (Honnibal et al., 2020).

[2]:
import spacy
from sklearn.preprocessing import LabelBinarizer

First we need to load the transcript as a pandas dataframe. It contains columns of words and their start and end timestamps.

[ ]:
bids_root = ""  # if using a local dataset, set this variable accordingly

# Download the transcript, if required
transcript_path = f"{bids_root}stimuli/podcast_transcript.csv"
if not len(bids_root):
    !wget -nc https://s3.amazonaws.com/openneuro.org/ds005574/$transcript_path
    transcript_path = "podcast_transcript.csv"

df = pd.read_csv(transcript_path)
df.head(10)

spaCy requires us to download and load a model that enables its features. First, we will download the en_core_web_sm model, which is trained on English text and includes components for part-of-speech tagging and dependency parsing.

[ ]:
!python -m spacy download en_core_web_sm
[4]:
modelname = "en_core_web_sm"
nlp = spacy.load(modelname)

Language processing pipelines typically use a tokenizer to standardize the (sub-)word units (called tokens) they operate on. Some words and punctuation will get separated into multiple tokens. For example, the word “there’s” will be tokenized into “there” and “‘s”. Thus, the first step for us is to transform our transcript words into tokens that spaCy can work with.
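
As a quick illustration, we can run spaCy’s tokenizer on a single contraction to see this splitting in action (a minimal, optional check):

[ ]:
# spaCy's tokenizer splits contractions into separate tokens
print([t.text for t in nlp.tokenizer("there's ")])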

To keep track of our words and their indices, we first create a word_idx column. We then tokenize the words using the tokenizer and explode the dataframe so that each row of the dataframe is a token (and not a word). Note that we add a white space to the end of each word before tokenization so we can track the boundary of each word. Compare the dataframe above with the one below.

[5]:
df.insert(0, "word_idx", df.index.values)
df["word_with_ws"] = df.word.astype(str) + " "
df["hftoken"] = df.word_with_ws.apply(nlp.tokenizer)
df = df.explode("hftoken", ignore_index=True)
df.head(10)
[5]:
word_idx word start end word_with_ws hftoken
0 0 Act 3.710 3.790 Act Act
1 1 one, 3.990 4.190 one, one
2 1 one, 3.990 4.190 one, ,
3 2 monkey 4.651 4.931 monkey monkey
4 3 in 4.951 5.011 in in
5 4 the 5.051 5.111 the the
6 5 middle. 5.151 5.391 middle. middle
7 5 middle. 5.151 5.391 middle. .
8 6 So 6.592 6.732 So So
9 7 there's 6.752 6.912 there's there

Now we will create a doc object (essentially a list of token objects) from our tokenized text:

[6]:
words = [token.text for token in df.hftoken.tolist()]
spaces = [token.whitespace_ == " " for token in df.hftoken.tolist()]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp(doc)

We will loop through the doc and get the features for each token. The features include text, tag (detailed part-of-speech tag), dep (syntactic dependency, i.e., the relation between tokens), and is_stop (whether the token is part of a stop list). We will organize the features into a second dataframe and add those columns back to df. Finally, we will drop the two columns we no longer need and save df for future encoding.

[7]:
features = []
for token in doc:
    features.append([token.text, token.tag_, token.dep_, token.is_stop])

df2 = pd.DataFrame(
        features, columns=["token", "pos", "dep", "stop"], index=df.index
    )
df = pd.concat([df, df2], axis=1)
df.drop(["hftoken", "word_with_ws"], axis=1, inplace=True)
df.head(10)
[7]:
word_idx word start end token pos dep stop
0 0 Act 3.710 3.790 Act NNP ROOT False
1 1 one, 3.990 4.190 one CD nummod True
2 1 one, 3.990 4.190 , , punct False
3 2 monkey 4.651 4.931 monkey NN appos False
4 3 in 4.951 5.011 in IN prep True
5 4 the 5.051 5.111 the DT det True
6 5 middle. 5.151 5.391 middle NN pobj False
7 5 middle. 5.151 5.391 . . punct False
8 6 So 6.592 6.732 So RB advmod True
9 7 there's 6.752 6.912 there EX expl True
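
If any of the tag or dependency labels are unfamiliar, spacy.explain returns a short human-readable description (optional, just for inspection):

[ ]:
# Look up human-readable descriptions of part-of-speech and dependency labels
print(spacy.explain("NNP"))    # e.g., proper noun, singular
print(spacy.explain("appos"))  # e.g., appositional modifier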

Since the features we extracted are all categorical, we need to turn them into numerical vectors. We will use LabelBinarizer from sklearn, which fits to all the possible category labels for a feature and then transforms our labels into one-hot binary vectors. There are 50 possible labels for tag and 45 possible labels for dep, so those two features will be turned into 50-dimensional and 45-dimensional vectors, respectively. Our is_stop feature is binary, so it will just be one-dimensional. We concatenate all three features to form a 96-dimensional syntactic feature vector and save it for future encoding.

[8]:
taggerEncoder = LabelBinarizer().fit(nlp.get_pipe("tagger").labels)
dependencyEncoder = LabelBinarizer().fit(nlp.get_pipe("parser").labels)

a = taggerEncoder.transform(df.pos.tolist())
b = dependencyEncoder.transform(df.dep.tolist())
c = LabelBinarizer().fit_transform(df.stop.tolist())
embeddings = np.hstack((a, b, c))
print(f"Embeddings have a shape of: {embeddings.shape}")
Embeddings have a shape of: (5305, 96)
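
Both the token-level dataframe and the syntactic feature matrix can now be written to disk for later encoding analyses. The filenames below are placeholders; adjust them to your own directory layout:

[ ]:
# Save the token-level dataframe and the syntactic feature matrix
# (placeholder filenames; change these to wherever you keep derivatives)
df.to_csv("syntactic_features_df.csv", index=False)
np.save("syntactic_features.npy", embeddings)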

Extracting GPT-2 Features

Now we will extract contextual word embeddings from an autoregressive (or “causal”) large language model (LLM) called GPT-2 (Radford et al., 2019). GPT-2 relies on the Transformer architecture to sculpt the embedding of a given word based on the preceding context. The model is composed of a repeated circuit motif—called the “attention head”—by which the model can “attend” to previous words in the context window when determining the meaning of the current word. This GPT-2 implementation is composed of 12 layers, each of which contains 12 attention heads that influence the embedding as it proceeds to the subsequent layer. The embeddings at each layer of the model comprise 768 features and the context window includes the preceding 1024 tokens. Note that certain words will be broken up into multiple tokens; we’ll need to use GPT-2’s “tokenizer” to convert words into the appropriate tokens. GPT-2 has been (pre)trained on large corpora of text according to a simple self-supervised objective function: predict the next word based on the prior context.

We will be using the HuggingFace transformers library for working with these models. If you want to learn more about LLMs and GPT-2, here are some great blogs explaining transformers and GPT-2 architecture. The HuggingFace website also has many useful resources.

Note

Using large language models, even small ones, requires substantial compute resources. If you’re using Colab, go to Edit → Notebook settings and select a GPU. Restart the runtime and try running again. Afterwards, you can run !nvidia-smi in a new cell to verify that a GPU is available.
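
For example, running the following cell should list any visible GPU (on Colab or another Linux machine with NVIDIA drivers):

[ ]:
# Check that a GPU is visible (Colab / Linux with NVIDIA drivers)
!nvidia-smi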

[9]:
import torch
from accelerate import Accelerator, find_executable_batch_size
from transformers import AutoModelForCausalLM, AutoTokenizer

Let’s reload the stimulus transcript.

[10]:
df = pd.read_csv(transcript_path)
df.head(10)
[10]:
word start end
0 Act 3.710 3.790
1 one, 3.990 4.190
2 monkey 4.651 4.931
3 in 4.951 5.011
4 the 5.051 5.111
5 middle. 5.151 5.391
6 So 6.592 6.732
7 there's 6.752 6.912
8 some 6.892 7.052
9 places 7.072 7.342

We will define some general arguments: the model name as it appears on HuggingFace, the context length (i.e., how many tokens we input into the model), and the compute device. We set the device to cuda to utilize a GPU if one is available.

[11]:
modelname = "gpt2"
context_len = 32
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)
    print("Using cuda!")

We will now load the GPT-2 tokenizer to convert words into a list of tokens. Then, we will explode the dataframe so that each row of the dataframe is a token. We will convert tokens to token_ids (integer IDs corresponding to words in the GPT-2 vocabulary, which contains approximately 50,000 tokens) to use as input into GPT-2.

[12]:
# Load model
tokenizer = AutoTokenizer.from_pretrained(modelname)

df.insert(0, "word_idx", df.index.values)
df["hftoken"] = df.word.apply(lambda x: tokenizer.tokenize(" " + x))

df = df.explode("hftoken", ignore_index=True)
df["token_id"] = df.hftoken.apply(tokenizer.convert_tokens_to_ids)

df.head(10)
[12]:
word_idx word start end hftoken token_id
0 0 Act 3.710 3.790 ĠAct 2191
1 1 one, 3.990 4.190 Ġone 530
2 1 one, 3.990 4.190 , 11
3 2 monkey 4.651 4.931 Ġmonkey 21657
4 3 in 4.951 5.011 Ġin 287
5 4 the 5.051 5.111 Ġthe 262
6 5 middle. 5.151 5.391 Ġmiddle 3504
7 5 middle. 5.151 5.391 . 13
8 6 So 6.592 6.732 ĠSo 1406
9 7 there's 6.752 6.912 Ġthere 612

Then we will download and load the pretrained GPT-2 model. You can inspect its configuration via model.config for more detailed information (e.g., number of layers, maximum context length).

[13]:
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(modelname)

print(
    f"Model : {modelname}"
    f"\nLayers: {model.config.num_hidden_layers}"
    f"\nEmbDim: {model.config.hidden_size}"
    f"\nConfig: {model.config}"
)
model = model.eval()
model = model.to(device)
Loading model...
Model : gpt2
Layers: 12
EmbDim: 768
Config: GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.45.2",
  "use_cache": true,
  "vocab_size": 50257
}

Since our transcript contains more tokens than the context window (32), we will reformat all the token_ids into data, a torch tensor with a shape of (number of tokens × 33). To extract features for a token from GPT-2 using a context length of 32, we need to input 33 tokens: the token itself plus the 32 preceding tokens. Note that for the first 32 tokens in the transcript, we pad the input to length 33 using the tokenizer’s pad_token_id (or 0 if the tokenizer does not define one).

[14]:
token_ids = df.token_id.tolist()
fill_value = 0
if tokenizer.pad_token_id is not None:
    fill_value = tokenizer.pad_token_id

data = torch.full((len(token_ids), context_len + 1), fill_value, dtype=torch.long)
for i in range(len(token_ids)):
    example_tokens = token_ids[max(0, i - context_len) : i + 1]
    data[i, -len(example_tokens) :] = torch.tensor(example_tokens)

print(f"Data has a shape of: {data.shape}")
Data has a shape of: torch.Size([5491, 33])

We will use Accelerator to make extracting features more efficient. It includes a find_executable_batch_size decorator, which finds a workable batch size by halving the batch size after each failed (e.g., out-of-memory) run of the decorated function (in this case, our inference_loop function).

Inside the inference_loop function, we will use a PyTorch DataLoader to supply token IDs to the model in batches and extract the features. In addition to the embeddings, we’ll also extract several other features of potential interest from the model. As GPT-2 proceeds through the text, it generates a probability distribution (derived from the logits extracted below) across all words in the vocabulary with the goal of correctly predicting the next word. We can use this probability distribution to derive other features of the model’s internal computations. We’ll extract the following features from GPT-2 (a toy sketch after the list illustrates how the rank, probability, and entropy values are computed):

  • embeddings: the 768-dimensional contextual embedding capturing the meaning of the current word

  • top_guesses: the highest probability word GPT-2 predicts for the current word

  • ranks: the rank of the correct word given probabilities across the vocabulary

  • true_probs: the probability at which GPT-2 predicted the current word

  • entropies: how uncertain GPT-2 was about the current word

    • low entropy indicates that the probability distribution was “focused” on certain words

    • high entropy indicates the probability distribution was more uniform/dispersed across words
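
Before running the full loop, here is a toy sketch of how rank, true probability, and entropy fall out of a probability distribution, using a made-up four-token vocabulary (purely illustrative; the loop below does the same thing over GPT-2’s full vocabulary):

[ ]:
# Toy example: rank, true probability, and entropy for a 4-token "vocabulary"
toy_logits = torch.tensor([2.0, 0.5, 0.1, -1.0])  # made-up logits
toy_probs = torch.softmax(toy_logits, dim=-1)      # probability distribution
true_id = 1                                        # pretend this is the actual next token
rank = (toy_logits.argsort(descending=True) == true_id).nonzero().item()
true_prob = toy_probs[true_id].item()
entropy = torch.distributions.Categorical(probs=toy_probs).entropy().item()
print(f"rank={rank}, true_prob={true_prob:.3f}, entropy={entropy:.3f}")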

[15]:
accelerator = Accelerator()

@find_executable_batch_size(starting_batch_size=32)
def inference_loop(batch_size=32):
    # nonlocal accelerator  # Ensure they can be used in our context
    accelerator.free_memory()  # Free all lingering references

    data_dl = torch.utils.data.DataLoader(
        data, batch_size=batch_size, shuffle=False
        )

    top_guesses = []
    ranks = []
    true_probs = []
    entropies = []
    embeddings = []

    with torch.no_grad():
        for batch in data_dl:
            # Get output from model
            output = model(batch.to(device), output_hidden_states=True)
            logits = output.logits
            states = output.hidden_states

            true_ids = batch[:, -1]
            brange = list(range(len(true_ids)))
            logits_order = logits[:, -2, :].argsort(descending=True)
            batch_top_guesses = logits_order[:, 0]
            batch_ranks = torch.eq(logits_order, true_ids.reshape(-1, 1).to(device)).nonzero()[:, 1]
            batch_probs = torch.softmax(logits[:, -2, :], dim=-1)
            batch_true_probs = batch_probs[brange, true_ids]
            batch_entropy = torch.distributions.Categorical(probs=batch_probs).entropy()
            batch_embeddings = [state[:, -1, :].numpy(force=True) for state in states ]

            top_guesses.append(batch_top_guesses.numpy(force=True))
            ranks.append(batch_ranks.numpy(force=True))
            true_probs.append(batch_true_probs.numpy(force=True))
            entropies.append(batch_entropy.numpy(force=True))
            embeddings.append(batch_embeddings)

        return top_guesses, ranks, true_probs, entropies, embeddings

top_guesses, ranks, true_probs, entropies, embeddings = inference_loop()

Now we will add the additional information from GPT-2 as columns to df.

[16]:
df["rank"] = np.concatenate(ranks)
df["true_prob"] = np.concatenate(true_probs)
df["top_pred"] = np.concatenate(top_guesses)
df["entropy"] = np.concatenate(entropies)

df.head(10)
[16]:
word_idx word start end hftoken token_id rank true_prob top_pred entropy
0 0 Act 3.710 3.790 ĠAct 2191 3185 1.000139e-08 0 0.092728
1 1 one, 3.990 4.190 Ġone 530 46 2.847577e-03 352 5.294118
2 1 one, 3.990 4.190 , 11 2 8.006448e-02 0 4.976894
3 2 monkey 4.651 4.931 Ġmonkey 21657 6978 6.075863e-06 734 5.869678
4 3 in 4.951 5.011 Ġin 287 24 1.004823e-03 0 2.478687
5 4 the 5.051 5.111 Ġthe 262 0 3.898537e-01 262 4.340655
6 5 middle. 5.151 5.391 Ġmiddle 3504 2 4.331103e-02 5228 5.842120
7 5 middle. 5.151 5.391 . 13 3 4.237065e-02 286 2.115351
8 6 So 6.592 6.732 ĠSo 1406 116 1.016026e-03 2191 5.861630
9 7 there's 6.752 6.912 Ġthere 612 16 8.699116e-03 11 5.249004
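
The top_pred column contains token IDs; to make them readable, we can convert them back into strings with the tokenizer (an optional check; the column name top_pred_token is just our own addition):

[ ]:
# Convert the predicted token IDs back into readable strings
df["top_pred_token"] = df.top_pred.apply(lambda i: tokenizer.decode([int(i)]))
df[["word", "hftoken", "top_pred_token"]].head(10)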

Finally, we confirm the size and number of embeddings we got. Note that there are 13 layers (instead of the expected 12) because the initial embeddings before the first layer of the network are also included. Also note that the list of embeddings is organized by batch, so it will require flattening to match the number of tokens (a sketch of this flattening step follows the cell below).

[17]:
print(f"There are {len(embeddings[0])} layers of embeddings")
print(f"Each word embedding is {embeddings[0][0].shape[1]} dimensions long")
There are 13 layers of embeddings
Each word embedding is 768 dimensions long
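
As a minimal sketch of that flattening step, we can concatenate the batch-wise arrays for a single layer (layer 8 here is an arbitrary choice) and check that the number of rows matches the number of tokens:

[ ]:
# Flatten the batch-wise embeddings for one layer into a (tokens x 768) matrix
layer = 8  # arbitrary choice; 0 is the input embedding, 1-12 are the transformer layers
layer_embeddings = np.concatenate([batch[layer] for batch in embeddings], axis=0)
print(f"Layer {layer} embeddings have a shape of: {layer_embeddings.shape}")
assert len(layer_embeddings) == len(df)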