Baseline Results Summary

This page summarizes the baseline results for all tasks in the podcast benchmark.

Overview

Baseline results for all tasks, produced with a simple deep network trained only on our data.

Note: For detailed metrics across all lags, see the lag_performance.csv file in each task's results directory linked below.
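
Each `lag_performance.csv` reports one score per lag; a minimal sketch for pulling out the best lag, assuming the file has a `lag` column and a per-metric score column (the exact column names are assumptions, not documented here):

```python
import csv

def best_lag(rows, metric):
    """Return (lag_ms, score) for the row with the highest score.

    `rows` is an iterable of dicts (e.g. from csv.DictReader); `metric`
    names the score column. Column names are assumed, not verified.
    """
    best = max(rows, key=lambda r: float(r[metric]))
    return int(best["lag"]), float(best[metric])

# Hypothetical usage against one of the linked results files:
# with open("baseline-results/<run>/lag_performance.csv") as f:
#     print(best_lag(csv.DictReader(f), "roc_auc"))
```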

Content/Non-Content Classification

Config: configs/neural_conv_decoder/neural_conv_decoder_content_noncontent.yml

Detailed Results: baseline-results/content_noncontent_task_sig_elecs_mlp_early_stop_roc_2025-12-19-00-34-17/lag_performance.csv

Best Performance:

  • Lag: 200ms
  • ROC-AUC: 0.5900

Word Embedding Decoding

Performance Across Lags

Figure: word embedding decoding ROC-AUC across lags.

Best Performance by Model

Arbitrary

Config: configs/neural_conv_decoder/neural_conv_decoder_arbitrary.yml

Detailed Results: baseline-results/ensemble_model_10_arbitrary_2025-12-19-00-17-32/lag_performance.csv

Best Performance:

  • Lag: 400ms
  • ROC-AUC: 0.5549

GloVe

Config: configs/neural_conv_decoder/neural_conv_decoder_glove.yml

Detailed Results: baseline-results/ensemble_model_10_glove_2025-12-19-00-17-41/lag_performance.csv

Best Performance:

  • Lag: 400ms
  • ROC-AUC: 0.6046

GPT-2

Config: configs/neural_conv_decoder/neural_conv_decoder_gpt2.yml

Detailed Results: baseline-results/ensemble_model_10_gpt2_2026-02-17-19-54-55/lag_performance.csv

Best Performance:

  • Lag: 400ms
  • ROC-AUC: 0.6050

Whisper Embedding Decoding

Config: configs/neural_conv_decoder/neural_conv_decoder_whisper_embedding.yml

Detailed Results: baseline-results/neural_conv_whisper_embedding_2026-02-17-19-25-13/lag_performance.csv

Best Performance:

  • Lag: 400ms
  • Pairwise Accuracy: 0.7074

Multimodal Score

The multimodal score combines word embedding decoding (GPT-2) and Whisper embedding decoding via the harmonic mean of their pairwise accuracies. Both metrics have a chance level of 0.5 and range [0, 1], so the harmonic mean is well-calibrated. It is computed by scripts/multimodal_score.py.
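
The combination rule above can be sketched as follows (this mirrors the described computation, not the actual interface of scripts/multimodal_score.py):

```python
def multimodal_score(word_acc, whisper_acc):
    """Harmonic mean of the two pairwise accuracies.

    Both inputs have chance level 0.5 and range [0, 1], so the
    harmonic mean stays on the same scale.
    """
    return 2 * word_acc * whisper_acc / (word_acc + whisper_acc)

# Using the best-lag values reported below:
# multimodal_score(0.5116, 0.7074)  # → 0.5938 (rounded to 4 decimals)
```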

Detailed Results: baseline-results/ensemble_model_10_gpt2_2026-02-17-14-01-23/multimodal_score.csv

Best Performance:

  • Lag: 400ms
  • Word Pairwise Accuracy: 0.5116
  • Whisper Pairwise Accuracy: 0.7074
  • Multimodal Score: 0.5938

GPT Surprisal (Regression)

Config: configs/neural_conv_decoder/neural_conv_decoder_gpt_surprise.yml

Detailed Results: baseline-results/gpt_surprise_2025-12-19-00-18-44/lag_performance.csv

Best Performance:

  • Lag: 400ms
  • Correlation: 0.0591

GPT Surprisal (Multiclass)

Config: configs/neural_conv_decoder/neural_conv_decoder_gpt_surprise_multiclass.yml

Detailed Results: baseline-results/gpt_surprise_2025-12-19-00-18-43/lag_performance.csv

Best Performance:

  • Lag: 200ms
  • ROC-AUC (Multiclass): 0.5333

Part of Speech

Config: configs/neural_conv_decoder/neural_conv_decoder_pos.yml

Detailed Results: baseline-results/pos_task_sig_elecs_without_other_classes_2025-12-19-00-34-17/lag_performance.csv

Best Performance:

  • Lag: 600ms
  • ROC-AUC (Multiclass): 0.5305

Sentence Onset Detection

Config: configs/neural_conv_decoder/neural_conv_decoder_sentence_onset.yml

Detailed Results: baseline-results/sentence_onset_lr_2025-12-19-00-18-44/lag_performance.csv

Best Performance:

  • Lag: 0ms
  • ROC-AUC: 0.8800

Intonation Unit Boundary Detection

Config: configs/neural_conv_decoder/neural_conv_decoder_iu_boundaries.yml

Detailed Results: baseline-results/iu_boundary_lr_2026-02-26-09-46-13/lag_performance.csv

Best Performance:

  • Lag: -200ms
  • ROC-AUC: 0.5151

Volume Level Prediction

Config: configs/time_pooling_model/simple_model.yml

Detailed Results: baseline-results/volume_level_simple_2025-12-19-00-34-56/lag_performance.csv

Best Performance:

  • Lag: 200ms
  • Correlation: 0.4479

LLM Token Decoding

This section compares two approaches to LLM-based decoding from brain activity: one conditioned on brain data (LLM Token Finetuning) and a control without brain data (LLM Decoding). Lower perplexity is better.
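
Perplexity is the standard exponential of the mean per-token negative log-likelihood; a minimal sketch of that relationship (the benchmark's actual aggregation across lags and examples may differ):

```python
import math

def perplexity(token_nlls):
    """exp(mean negative log-likelihood) over a sequence of tokens.

    A model that assigns each token probability 1/k has perplexity k.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))
```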

LLM Decoding Comparison

LLM Token Finetuning (Brain Data)

Config: configs/neural_conv_decoder/llm_two_stage_multi.yml

Detailed Results: baseline-results/llm_token_finetune_2025-12-26-12-44-36/lag_performance.csv

Best Performance:

  • Lag: 200ms
  • Perplexity: 60.40

LLM Decoding (No Brain Data - Control)

Config: configs/controls/llm_decoding_no_brain_data.yml

Detailed Results: baseline-results/llm_decoding_control_2025-12-28-15-55-38/lag_performance.csv

Best Performance:

  • Lag: -200ms
  • Perplexity: 67.22