
Christopher Potts
Stanford University · Symbolic Systems
Active 2001–2026
About
Christopher Potts is a Professor of Linguistics at Stanford University. He holds a B.A. in Linguistics with a German minor from New York University, obtained in 1999, and an M.A. and Ph.D. in Linguistics from the University of California, Santa Cruz, completed in 2000 and 2003 respectively. His academic focus includes applied logic, artificial intelligence, cognitive science, natural language, and philosophical foundations. In addition to his role in the Department of Linguistics, he is a member of the Bio-X Faculty and an affiliate of the Institute for Human-Centered Artificial Intelligence (HAI).
Research topics
- Computer Science
- Artificial Intelligence
- Political Science
- Data science
- Management science
- Engineering
- Psychology
- Engineering ethics
- Law
Selected publications
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Open MIND · 2026-02-24
preprint · Senior author
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
arXiv (Cornell University) · 2026-02-24
article · Open access · Senior author
Do Language Models Use Their Depth Efficiently?
arXiv.org · 2025-05-20 · 1 citation
preprint · Open access · Senior author
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
arXiv (Cornell University) · 2025-01-28
preprint · Open access · Senior author
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
WARP: An Efficient Engine for Multi-Vector Retrieval
2025-07-13 · 5 citations
article · Open access
Multi-vector retrieval methods such as ColBERT and its recent variant, the ConteXtualized Token Retriever (XTR), offer high accuracy but face efficiency challenges at scale. To address this, we present WARP, a retrieval engine that substantially improves the efficiency of retrievers trained with the XTR objective through three key innovations: (1) WARPSELECT for dynamic similarity imputation; (2) implicit decompression, avoiding costly vector reconstruction during retrieval; and (3) a two-stage reduction process for efficient score aggregation. Combined with highly-optimized C++ kernels, our system reduces end-to-end latency compared to XTR's reference implementation by 41x, and achieves a 3x speedup over the ColBERTv2/PLAID engine, while preserving retrieval quality. WARP also reduces index sizes by a factor of 2x-4x compared to XTR, enabling deployment on memory-constrained devices.
Recent grants
Expressive Content and the Semantics of Contexts
NSF · $218k · 2007–2012
RI: Medium: Bringing Sentiment Analysis and Social Network Analysis Together
NSF · $1.0M · 2012–2017
Frequent coauthors
- Atticus Geiger · 50 shared
- Zhengxuan Wu · 45 shared
- Christopher D. Manning · 44 shared
- Omar Khattab (Stanford University) · 33 shared
- Elisa Kreiss · 33 shared
- Noah D. Goodman · 27 shared
- Matei Zaharia · 25 shared
- Samuel R. Bowman · 19 shared
Awards & honors
- Stanford Honors Thesis Prizes - Symbolic Systems
- Glushko Prize for Excellence in Undergraduate Research in Sy…
- Barwise Award for Distinguished Contributions to Symbolic Sy…
- Symbolic Systems Distinguished Teaching Award