Baseline Results Summary¶
This page summarizes the baseline results for all tasks in the podcast benchmark.
Overview¶
Every baseline below uses a simple deep network trained only on our data.
Note: For detailed metrics across all lags, see the lag_performance.csv file in each task's results directory linked below.
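As a rough illustration of how a "Best Performance" entry could be read off a lag_performance.csv file, here is a minimal sketch. The column names (`lag`, `roc_auc`) and the helper `best_lag` are assumptions made for this example; the actual files may use different headers, so check them before adapting it. The sample rows are placeholders, not values from any results file.

```python
import csv
import io


def best_lag(csv_text, metric, higher_is_better=True):
    """Return (lag, value) for the row with the best value of `metric`.

    Assumes a `lag` column in milliseconds and one numeric column per metric;
    both names are hypothetical and may not match the real CSV headers.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda row: float(row[metric]))
    return int(best["lag"]), float(best[metric])


# Placeholder rows for illustration only; not taken from the benchmark.
example = "lag,roc_auc\n-200,0.51\n0,0.55\n200,0.58\n"
print(best_lag(example, "roc_auc"))
```

For metrics where lower is better, such as perplexity, pass `higher_is_better=False`.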
Content/Non-Content Classification¶
Config: configs/neural_conv_decoder/neural_conv_decoder_content_noncontent.yml
Detailed Results: baseline-results/content_noncontent_task_sig_elecs_mlp_early_stop_roc_2025-12-19-00-34-17/lag_performance.csv

Best Performance:
- Lag: 200ms
- ROC-AUC: 0.5900
Word Embedding Decoding¶
Performance Across Lags¶

Best Performance by Model¶
Arbitrary¶
Config: configs/neural_conv_decoder/neural_conv_decoder_arbitrary.yml
Detailed Results: baseline-results/ensemble_model_10_arbitrary_2025-12-19-00-17-32/lag_performance.csv
Best Performance:
- Lag: 400ms
- ROC-AUC: 0.5549
GloVe¶
Config: configs/neural_conv_decoder/neural_conv_decoder_glove.yml
Detailed Results: baseline-results/ensemble_model_10_glove_2025-12-19-00-17-41/lag_performance.csv
Best Performance:
- Lag: 400ms
- ROC-AUC: 0.6046
GPT-2¶
Config: configs/neural_conv_decoder/neural_conv_decoder_gpt2.yml
Detailed Results: baseline-results/ensemble_model_10_gpt2_2026-02-17-19-54-55/lag_performance.csv
Best Performance:
- Lag: 400ms
- ROC-AUC: 0.6050
Whisper Embedding Decoding¶
Config: configs/neural_conv_decoder/neural_conv_decoder_whisper_embedding.yml
Detailed Results: baseline-results/neural_conv_whisper_embedding_2026-02-17-19-25-13/lag_performance.csv

Best Performance:
- Lag: 400ms
- Pairwise Accuracy: 0.7074
Multimodal Score¶
The multimodal score combines word embedding decoding (GPT-2) and Whisper embedding decoding by taking the harmonic mean of their pairwise accuracies. Both metrics have a chance level of 0.5 and lie in [0, 1], so they share a common scale and can be combined without rescaling. The score is computed with scripts/multimodal_score.py.

Detailed Results: baseline-results/ensemble_model_10_gpt2_2026-02-17-14-01-23/multimodal_score.csv
Best Performance:
- Lag: 400ms
- Word Pairwise Accuracy: 0.5116
- Whisper Pairwise Accuracy: 0.7074
- Multimodal Score: 0.5938
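The harmonic mean behind the multimodal score can be checked by hand from the two accuracies listed above. The sketch below is not the contents of scripts/multimodal_score.py; the function name `harmonic_mean` is an assumption for illustration, and because the inputs are the rounded values shown on this page, the result only matches the reported score to the displayed precision.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0.0 if either is 0."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


# Rounded values from the "Best Performance" list at lag 400ms.
word_acc = 0.5116
whisper_acc = 0.7074

score = harmonic_mean(word_acc, whisper_acc)
print(f"{score:.4f}")
```

The harmonic mean never exceeds the arithmetic mean and is pulled toward the smaller input, so a strong Whisper accuracy cannot fully compensate for a word accuracy near chance.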
GPT Surprisal (Regression)¶
Config: configs/neural_conv_decoder/neural_conv_decoder_gpt_surprise.yml
Detailed Results: baseline-results/gpt_surprise_2025-12-19-00-18-44/lag_performance.csv

Best Performance:
- Lag: 400ms
- Correlation: 0.0591
GPT Surprisal (Multiclass)¶
Config: configs/neural_conv_decoder/neural_conv_decoder_gpt_surprise_multiclass.yml
Detailed Results: baseline-results/gpt_surprise_2025-12-19-00-18-43/lag_performance.csv

Best Performance:
- Lag: 200ms
- ROC-AUC (Multiclass): 0.5333
Part of Speech¶
Config: configs/neural_conv_decoder/neural_conv_decoder_pos.yml
Detailed Results: baseline-results/pos_task_sig_elecs_without_other_classes_2025-12-19-00-34-17/lag_performance.csv

Best Performance:
- Lag: 600ms
- ROC-AUC (Multiclass): 0.5305
Sentence Onset Detection¶
Config: configs/neural_conv_decoder/neural_conv_decoder_sentence_onset.yml
Detailed Results: baseline-results/sentence_onset_lr_2025-12-19-00-18-44/lag_performance.csv

Best Performance:
- Lag: 0ms
- ROC-AUC: 0.8800
Intonation Unit Boundary Detection¶
Config: configs/neural_conv_decoder/neural_conv_decoder_iu_boundaries.yml
Detailed Results: baseline-results/iu_boundary_lr_2026-02-26-09-46-13/lag_performance.csv

Best Performance:
- Lag: -200ms
- ROC-AUC: 0.5151
Volume Level Prediction¶
Config: configs/time_pooling_model/simple_model.yml
Detailed Results: baseline-results/volume_level_simple_2025-12-19-00-34-56/lag_performance.csv

Best Performance:
- Lag: 200ms
- Correlation: 0.4479
LLM Token Decoding¶
This section compares two LLM-based decoding setups: LLM Token Finetuning, which uses brain data, and LLM Decoding, a control that uses no brain data. Lower perplexity is better.

LLM Token Finetuning (Brain Data)¶
Config: configs/neural_conv_decoder/llm_two_stage_multi.yml
Detailed Results: baseline-results/llm_token_finetune_2025-12-26-12-44-36/lag_performance.csv
Best Performance:
- Lag: 200ms
- Perplexity: 60.40
LLM Decoding (No Brain Data - Control)¶
Config: configs/controls/llm_decoding_no_brain_data.yml
Detailed Results: baseline-results/llm_decoding_control_2025-12-28-15-55-38/lag_performance.csv
Best Performance:
- Lag: -200ms
- Perplexity: 67.22
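To put the two perplexities side by side, the sketch below computes the absolute and relative difference between the control and the brain-data model. It uses only the best-lag values reported above; note that the two best lags differ (200ms versus -200ms), so this is a comparison of best cases rather than a matched-lag comparison. The variable names are illustrative.

```python
# Best-lag perplexities as reported in this section (lower is better).
brain_ppl = 60.40    # LLM Token Finetuning, lag 200ms
control_ppl = 67.22  # LLM Decoding control, lag -200ms

absolute_gap = control_ppl - brain_ppl
relative_reduction = absolute_gap / control_ppl

print(f"absolute gap: {absolute_gap:.2f}")
print(f"relative reduction: {relative_reduction:.1%}")
```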