Meghan Sumner
Associate Professor of Linguistics · Verified · Stanford University · Linguistics
Active 2003–2026
Research topics
- Computer Science
- Speech recognition
- Psychology
- Audiology
- Cognitive psychology
- Artificial Intelligence
- Natural Language Processing
- Communication
- Medicine
- Neuroscience
Selected publications
Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition
ArXiv.org · 2026-01-11
article · Open access
In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.
Indexical weight is distributed: Evidence from social evaluations of /s/
Proceedings of the Linguistic Society of America · 2026-05-08
article · Open access
Though it is well established that listeners can infer social meanings from linguistic variation, less is known about how listeners use properties of the linguistic signal to form such indexical relationships. We examine this question using the well-documented association between /s/ center of gravity (CoG) and perceived masculinity. Using a matched guise paradigm, we ask how two properties of the phonetic signal, the F2 of the vowel following /s/ and speaker voice, modulate listeners' masculinity evaluations of sibilant acoustics. Furthermore, we investigate whether this modulation depends on the variability of the /s/ tokens listeners hear within the experimental context by comparing listeners who heard a single invariant /s/ token per categorical /s/ CoG condition (fronted, mid, backed) with those who heard /s/ tokens that varied with the phonological environment. Bayesian modeling confirmed that high-CoG (fronted) /s/ is typically perceived as less masculine than low-CoG (backed) /s/, replicating prior findings. Critically, masculinity judgments were primarily influenced by speaker voice, with /s/ CoG and F2 of the following vowel serving as secondary, modulating cues to speaker masculinity. These findings suggest that social evaluation reflects integration of the full available signal rather than extraction of a single variable.
ArXiv.org · 2025-05-20
preprint · Open access
Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided, though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
Sublexical ARTifacts: Bottom-up Interference in a Lexical Category Search
Underline Science Inc. · 2025-06-18
other · Open access · Senior author
How listeners adapt to unfamiliar talkers and accents is a central question in psycholinguistics. In this study, we explored how listeners dynamically shift mappings from acoustic information to mental representations after hearing a new talker via novel eye-tracking methods. We tested a prediction from Adaptive Resonance Theory (ART) that an anomaly in the signal (in this case, a change in talker) increases the influence of bottom-up relative to top-down information, creating an environment where sublexical competitors (e.g. 'Arch' within 'Archer') would be more likely to interfere with lexical access for the target. In two experiments (Exp. 1: General American English [GA] talkers; Exp. 2: GA and Spanish-accented [SP] talkers), this prediction was supported via analyses of accuracy, latency, and gaze. In Exp. 2, we found that the effect replicated but did not differ based on the accent of the talker. The data suggest new paths forward in speech adaptation research.
2025-01-01 · 2 citations
article · Open access
ArXiv.org · 2025-11-25
preprint · Open access
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model–consensus agreement vs. 0.286 for human–human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
Talker-specificity beyond the lexicon: Recognition memory for spoken sentences
Psychonomic Bulletin & Review · 2025-08-27
article · Senior author
Talker-based asymmetries in memory for spoken sentences
The Journal of the Acoustical Society of America · 2025-04-01
article · Senior author
It is well-established that memory plays a central role in the human ability to understand speech, but not all experiences with speech are remembered equally well. One hypothesis about how these asymmetries emerge is that representation strength depends on how listeners allocate cognitive resources to the speech signal, partially based on the social characteristics of the talker. To test this, we conducted three recognition memory experiments with 12 diverse, but roughly equally non-standard talkers (i.e., no speakers of mainstream American English). We manipulated attention at encoding, as well as retrieval modality. Participants heard spoken sentences at study with different test blocks: auditorily presented sentences (Exp. 1), orthographically presented sentences (Exp. 2), and images (Exp. 3). In all three experiments, memory was stronger in Full than Divided Attention. Crucially, we also found that memory performance depended heavily on the talker and that talker interacted with voice of repetition (Exp. 1) and attention (Exps. 1, 2, and 3) in complex ways. These results point to a highly dynamic, context-sensitive network of speech representations where encoding and recognition behaviors are patterned by resource allocation in addition to frequency and typicality. We discuss implications for understanding voice-based biases in everyday situations.
Speech patterns during memory recall relates to early tau burden across adulthood
Alzheimer's & Dementia · 2024 · 22 citations
- Psychology
- Audiology
- Cognitive psychology
INTRODUCTION: Early cognitive decline may manifest in subtle differences in speech. METHODS: We examined 238 cognitively unimpaired adults from the Framingham Heart Study (32-75 years) who completed amyloid and tau PET imaging. Speech patterns during delayed recall of a story memory task were quantified via five speech markers, and their associations with global amyloid status and regional tau signal were examined. RESULTS: Total utterance time, number of between-utterance pauses, speech rate, and percentage of unique words significantly correlated with delayed recall score although the shared variance was low (2%-15%). Delayed recall score was not significantly different between β-amyloid-positive (Aβ+) and -negative (Aβ-) groups and was not associated with regional tau signal. However, longer and more between-utterance pauses, and slower speech rate were associated with increased tau signal across medial temporal and early neocortical regions. DISCUSSION: Subtle speech changes during memory recall may reflect cognitive impairment associated with early Alzheimer's disease pathology. HIGHLIGHTS: Speech during delayed memory recall relates to tau PET signal across adulthood. Delayed memory recall score was not associated with tau PET signal. Speech shows greater sensitivity to detecting subtle cognitive changes associated with early tau accumulation. Our cohort spans adulthood, while most PET imaging studies focus on older adults.
Recent grants
NIH · $131k · 2006
The perception, representation, and use of non-native voicing cues
NSF · $349k · 2007–2011
NSF · $400k · 2012–2018
Frequent coauthors
- Arthur G. Samuel · Basque Center on Cognition, Brain and Language · 7 shared
- Seung Kyung Kim · University of Utah · 6 shared
- Ed King · Stanford University · 5 shared
- Marisa Tice · Stanford University · 4 shared
- Kevin B. McGowan · 4 shared
- William Clapp · Stanford University · 3 shared
- Marie-Catherine de Marneffe · 3 shared
- John M. Tomlinson · Humboldt-Universität zu Berlin · 3 shared