Training and evaluating encoding models

This tutorial introduces a typical encoding framework for mapping stimulus features onto brain activity during natural language comprehension.

The previous tutorial walked through extracting two types of linguistic features: syntactic features and language model word embeddings. The podcast-ecog dataset comes with several other feature spaces as well. For this tutorial, we will use the LLM contextual embeddings. Encoding models (Naselaris et al., 2011) use linear regression to map these features onto brain activity. Here, we use the Himalaya package (Dupré La Tour et al., 2022) to train encoding models with an L2 penalty on the regression weights (also called ridge regression).
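
Concretely, given a feature matrix $X$ (words by feature dimensions) and brain activity $Y$ (words by electrodes), ridge regression estimates the weights $B$ that minimize a penalized least-squares objective:

$$\hat{B} = \mathop{\mathrm{argmin}}_{B} \; \lVert Y - XB \rVert_F^2 + \alpha \lVert B \rVert_F^2$$

where the regularization strength $\alpha$ is chosen by cross-validation, as RidgeCV does below.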

Acknowledgments: This tutorial draws heavily on the encling tutorial by Samuel A. Nastase.

Open in Colab

[ ]:
# only run this cell in colab
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install mne mne_bids himalaya scikit-learn pandas matplotlib nilearn
[1]:
import mne
import h5py
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nilearn.plotting import plot_markers
from mne_bids import BIDSPath

from himalaya.backend import set_backend, get_backend
from himalaya.ridge import RidgeCV
from himalaya.scoring import correlation_score

from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

We will set the Himalaya backend to torch_cuda so we can utilize a GPU for training, if available.

[2]:
if torch.cuda.is_available():
    set_backend("torch_cuda")
    print("Using cuda!")
else:
    print("No GPU available; using the default numpy backend.")

Loading features

We will now load contextual word embeddings from GPT-2 (Radford et al., 2019). The loaded features should be a numpy array with shape (number of tokens, feature dimensions). Note that the number of tokens differs from the number of words because of the model's tokenization scheme.

[3]:
bids_root = ""  # if using a local dataset, set this variable accordingly

# Download the embeddings, if required
embedding_path = f"{bids_root}stimuli/gpt2-xl/features.hdf5"
if not len(bids_root):
    !wget -nc https://s3.amazonaws.com/openneuro.org/ds005574/$embedding_path
    embedding_path = "features.hdf5"

print(f"Using embedding file path: {embedding_path}")
Using embedding file path: features.hdf5
[4]:
modelname, layer = 'gpt2-xl', 24
with h5py.File(embedding_path, "r") as f:
    contextual_embeddings = f[f"layer-{layer}"][...]
print(f"LLM embedding matrix has shape: {contextual_embeddings.shape}")
LLM embedding matrix has shape: (5491, 1600)

We will also load the stimulus transcript associated with these features. The transcript contains information about each token: the word it belongs to, the token itself, and its start (onset) and end (offset) times. It also includes prediction information extracted from GPT-2, like rank, probability, and entropy. For instance, we can calculate how accurately the model predicts the next token in the transcript from the rank column, whose integers give the rank of the actual token among all possible tokens in GPT-2's vocabulary.

Note: Check that the transcript contains the same number of tokens as the features we loaded before.

[ ]:
# Download the transcript, if required
transcript_path = f"{bids_root}stimuli/gpt2-xl/transcript.tsv"
if not len(bids_root):
    !wget -nc https://s3.amazonaws.com/openneuro.org/ds005574/$transcript_path
    transcript_path = "transcript.tsv"

# Load transcript
df_contextual = pd.read_csv(transcript_path, sep="\t", index_col=0)
if "rank" in df_contextual.columns:
    model_acc = (df_contextual["rank"] == 0).mean()
    print(f"Model accuracy: {model_acc*100:.3f}%")

df_contextual.head()
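
As a quick sanity check for the note above, we can confirm that the transcript has one row per embedding token (a minimal sketch using the variables defined above):

[ ]:
# The transcript should contain exactly one row per token in the embedding matrix
assert len(df_contextual) == contextual_embeddings.shape[0]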

When we extracted features, some words were split into multiple tokens. Since we only have start and end times for whole words, we will align the features from tokens to words for the encoding models. Here, we simply average the token features within each word. The resulting features should be a numpy array with shape (number of words, feature dimensions).

[5]:
aligned_embeddings = []
for _, group in df_contextual.groupby("word_idx"): # group by word index
    indices = group.index.to_numpy()
    average_emb = contextual_embeddings[indices].mean(0) # average features
    aligned_embeddings.append(average_emb)
aligned_embeddings = np.stack(aligned_embeddings)
print(f"LLM embeddings matrix has shape: {aligned_embeddings.shape}")
LLM embeddings matrix has shape: (5136, 1600)

We will also construct a dataframe containing words with their start and end timestamps.

[6]:
df_word = df_contextual.groupby("word_idx").agg(dict(word="first", start="first", end="last"))
df_word.head()
[6]:
            word  start    end
word_idx
0            Act  3.710  3.790
1           one,  3.990  4.190
2         monkey  4.651  4.931
3             in  4.951  5.011
4            the  5.051  5.111

Loading brain data

Next, we will load the preprocessed high-gamma ECoG data using MNE. Here, we will demonstrate loading data from our third subject.

[7]:
file_path = BIDSPath(root=f"{bids_root}derivatives/ecogprep",
                    subject="03", task="podcast", datatype="ieeg", description="highgamma",
                    suffix="ieeg", extension=".fif")
print(f"File path within the dataset: {file_path}")

# You only need to run this if using Colab (i.e. if you did not set bids_root to a local directory)
if not len(bids_root):
    !wget -nc https://s3.amazonaws.com/openneuro.org/ds005574/$file_path
    file_path = file_path.basename

raw = mne.io.read_raw_fif(file_path, verbose=False)
picks = mne.pick_channels_regexp(raw.ch_names, "LG[AB]*")
raw = raw.pick(picks)
raw
[7]:
General
  Filename(s)                 sub-03_task-podcast_desc-highgamma_ieeg.fif
  MNE object type             Raw
  Measurement date            2019-03-11 at 10:54:21 UTC
  Participant                 sub-03
  Experimenter                Unknown
Acquisition
  Duration                    00:30:00 (HH:MM:SS)
  Sampling frequency          512.00 Hz
  Time points                 921,600
Channels
  ECoG
  Head & sensor digitization  253 points
Filters
  Highpass                    70.00 Hz
  Lowpass                     200.00 Hz

We will map the start time (in seconds) of each word in the dataframe onto the brain signal data by multiplying by the sampling rate. The first column of events marks the onset sample of each word in the brain signal data.

[8]:
events = np.zeros((len(df_word), 3), dtype=int)
events[:, 0] = (df_word.start * raw.info['sfreq']).astype(int)
events.shape
[8]:
(5136, 3)

Then we’ll take advantage of MNE’s tools for creating epochs around stimulus events, which here are the starts (onsets) of each word, to examine the brain signal around word onsets. We take a fixed-width window ranging from -2 seconds to +2 seconds relative to each word onset. Since the sampling rate is 512 Hz, the 4-second window contains 4 * 512 + 1 = 2049 samples (lags). The epoched ECoG data is a numpy array with shape (number of words, number of electrodes, number of lags).

[9]:
epochs = mne.Epochs(
    raw,
    events,
    tmin=-2.0,
    tmax=2.0,
    baseline=None,
    proj=False,
    event_id=None,
    preload=True,
    event_repeated="merge",
)
print(f"Epochs object has a shape of: {epochs._data.shape}")
Not setting metadata
5136 matching events found
No baseline correction applied
Loading data for 5136 events and 2049 original time points ...
/tmp/ipykernel_3774742/3482921024.py:1: RuntimeWarning: The events passed to the Epochs constructor are not chronologically ordered.
  epochs = mne.Epochs(
6 bad epochs dropped
Epochs object has a shape of: (5130, 235, 2049)

Next, we’ll downsample the epochs to 32 Hz, which reduces the number of lags to 32 * 4 = 128.

Note

This code block may take ~3 minutes to run.

[10]:
epochs = epochs.resample(sfreq=32, npad='auto', method='fft', window='hamming')
print(f"Epochs object has a shape of: {epochs._data.shape}")
Epochs object has a shape of: (5130, 235, 128)

Setting up feature and brain data

Now we have both the features and the ECoG data ready. We plan to fit encoding models at each electrode and for each lag, so we’ll reshape our target matrix Y to horizontally stack both electrodes and lags along the second dimension.

[11]:
epochs_data = epochs.get_data(copy=True)
epochs_data = epochs_data.reshape(len(epochs), -1)
print(f"ECoG data matrix shape: {epochs_data.shape}")
ECoG data matrix shape: (5130, 30080)

We will also align our features with the ECoG data.

[12]:
selected_df = df_word.iloc[epochs.selection]
averaged_embeddings = aligned_embeddings[epochs.selection]
print(averaged_embeddings.shape)
(5130, 1600)

We will cast all data to float32 to save GPU memory and speed up computation.

[13]:
X = averaged_embeddings
Y = epochs_data

if "torch" in get_backend().__name__:
    X = X.astype(np.float32)
    Y = Y.astype(np.float32)

X.shape, Y.shape
[13]:
((5130, 1600), (5130, 30080))

Building encoding models

Now, we will use ridge regression to estimate the encoding model. We create a model pipeline using sklearn, which includes a StandardScaler that standardizes the features (X) and a RidgeCV model that performs ridge regression with cross-validation over our specified alpha values.

[14]:
alphas = np.logspace(1, 10, 10) # specify alpha values
inner_cv = KFold(n_splits=5, shuffle=False) # inner 5-fold cross-validation setup
model = make_pipeline(
    StandardScaler(), RidgeCV(alphas, fit_intercept=True, cv=inner_cv) # pipeline
)
model
[14]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('ridgecv',
                 RidgeCV(alphas=array([1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08,
       1.e+09, 1.e+10]),
                         cv=KFold(n_splits=5, random_state=None, shuffle=False),
                         fit_intercept=True))])

Training encoding models

While RidgeCV contains an inner cross-validation loop to find the best alpha, we will also set up an outer cross-validation loop to evaluate our encoding model. Here, we will use k = 2, meaning we train on half of the data and evaluate on the other half. Within each fold, we split the data into training and test sets, and standardize Y the same way the pipeline standardizes X. We then fit our model on the training set and use it to predict the test set. For evaluation, we calculate correlation scores between Y_preds, the ECoG signal predicted by our model, and Y_test, the actual ECoG signal. The encoding model is trained and evaluated separately for each electrode and each lag.
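
For intuition, correlation_score computes the Pearson correlation between predicted and actual signals independently for each column of Y, i.e., for each electrode-lag pair. A rough numpy equivalent (a minimal sketch, not Himalaya's actual implementation):

[ ]:
def column_correlation(y_true, y_pred):
    # z-score each column, then average the elementwise products (Pearson r per column)
    z_true = (y_true - y_true.mean(0)) / y_true.std(0)
    z_pred = (y_pred - y_pred.mean(0)) / y_pred.std(0)
    return (z_true * z_pred).mean(0)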

This code block may take a while to run. Make sure you are using a GPU if you have one (verify by running nvidia-smi). You may also consider resampling the epochs even further to use fewer lags, and/or selecting a subset of electrodes, as sketched below.
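
For example, to reduce computation you could restrict Y to a subset of electrodes before training (a hypothetical sketch; the choice of 64 electrodes is arbitrary, and you would need to adjust epochs_shape in the next cell accordingly):

[ ]:
# Optional: keep only the first 64 electrodes (all lags) to reduce computation
n_words = len(epochs)
n_electrodes, n_lags = epochs._data.shape[1:]
Y_small = Y.reshape(n_words, n_electrodes, n_lags)[:, :64, :].reshape(n_words, -1)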

[15]:
epochs_shape = epochs._data.shape[1:] # number of electrodes * number of lags

def train_encoding(X, Y):

    corrs = [] # empty array to store correlation results
    kfold = KFold(2, shuffle=False) # outer 2-fold cross-validation setup
    for train_index, test_index in kfold.split(X): # loop through folds

        # Split train and test datasets
        X_train, X_test = X[train_index], X[test_index]
        Y_train, Y_test = Y[train_index], Y[test_index]

        # Standardize Y
        scaler = StandardScaler()
        Y_train = scaler.fit_transform(Y_train)
        Y_test = scaler.transform(Y_test)

        model.fit(X_train, Y_train) # Fit pipeline with scaler and ridge estimator
        Y_preds = model.predict(X_test) # Use trained model to predict on the test set
        corr = correlation_score(Y_test, Y_preds).reshape(epochs_shape) # Compute correlation score

        if "torch" in get_backend().__name__: # if using gpu, transform tensor back to numpy
            corr = corr.numpy(force=True)

        corrs.append(corr) # append fold correlation results to final results
    return np.stack(corrs)

# set_backend("torch") # resort to torch or numpy if cuda out of memory
corrs_embedding = train_encoding(X, Y)
print(f"Encoding performance correlating matrix shape: {corrs_embedding.shape}")
Encoding performance correlating matrix shape: (2, 235, 128)

Plotting encoding performance

We trained and evaluated many encoding models: one for each of the 235 electrodes at each of the 128 lags. On top of that, we split the podcast into two 15-minute halves, training on one and testing on the other, so we have a correlation for every combination of fold, electrode, and lag in a single array. Below, we will summarize these results in two ways: spatially and temporally.

First we summarize spatially by averaging over all electrodes and looking at the average temporal pattern of correlations. We notice that encoding performance increases after word onset and is lower near ± 2 seconds.

[16]:
lags = np.arange(-2 * 512, 2 * 512, 16) / 512 # specify the lags
mean = corrs_embedding.mean((0,1))
err = corrs_embedding.std((0,1)) / np.sqrt(np.prod(corrs_embedding.shape[:2]))

fig, ax = plt.subplots()
ax.plot(lags, mean, color='black')
ax.fill_between(lags, mean - err, mean + err, alpha=0.1, color='black')
ax.set_xlabel("lag (s)")
ax.set_ylabel("encoding performance (r ± sem)")
ax.axvline(0, c=(.9, .9, .9), ls="--")
ax.axhline(0, c=(.9, .9, .9), ls="--")
fig.show()
../_images/notebooks_04-encoding_39_0.png

Next we summarize temporally by selecting the maximum correlation across lags per electrode. Now that we have one correlation per electrode, we plot the results on the brain.

[17]:
values = corrs_embedding.mean(0).max(-1)

ch2loc = {ch['ch_name']: ch['loc'][:3] for ch in raw.info['chs']}
coords = np.vstack([ch2loc[ch] for ch in raw.info['ch_names']])
coords *= 1000  # convert MNE's meters to millimeters for nilearn
print("Coordinate matrix shape: ", coords.shape)

order = values.argsort()
plot_markers(values[order], coords[order],
             node_size=30, display_mode='lzr',
             node_vmin=0, node_cmap='inferno_r', colorbar=True)
plt.show()
Coordinate matrix shape:  (235, 3)
../_images/notebooks_04-encoding_41_1.png