Dani Byrd
Professor of Linguistics · University of Southern California · Linguistics
Active 1990–2026
About
Dani Byrd is a Professor in the Department of Linguistics at the University of Southern California, within the Dornsife College of Letters, Arts and Sciences. Her research focuses on phonetics and phonology, and she is actively involved in research groups such as the USC Phonetics & Phonology Group and the USC SPAN Research Group. She contributes to the academic community through her work on speech, words, and the mind, and she has authored an introductory textbook titled 'Discovering Speech, Words, and Mind.' Professor Byrd is engaged in teaching courses related to phonetics and phonology and provides resources and support for students and colleagues in her field.
Research topics
- Artificial Intelligence
- Computer Science
- Speech recognition
- Physics
- Linguistics
- Computer vision
- Mathematics
Selected publications
Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis
Open MIND · 2026-01-20
Preprint
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
Open MIND · 2026-03-05
Preprint
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.
Learning-free L2-Accented Speech Generation using Phonological Rules
ArXiv.org · 2026-03-08
Article · Open access
Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose an accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.
Deep learning characterizes depression and suicidal ideation in young adults from eye movements
npj Digital Medicine · 2026-03-28 · 1 citation
Article · Open access
Objective biobehavioral markers for mental health conditions remain elusive, with diagnosis typically relying on self-reports and clinical interviews. We investigate eye tracking as a potential marker of attentional and mood biases associated with symptoms of depression and suicidal ideation from self-reported screening questionnaires. We analyze eye movements from 126 young adults during reading and responding to emotionally loaded sentences. A deep learning framework was designed to account for intra-trial and inter-trial variations in eye movements, achieving an AUC of 0.793 (95% CI: 0.766-0.819) for identifying depression/suicidality against healthy controls, and 0.826 (95% CI: 0.798-0.853) for suicidality specifically. The model also exhibited moderate accuracy in differentiating depressed from suicidal individuals (AUC: 0.609, 95% CI: 0.569-0.646). Discriminative patterns were more pronounced during response generation and for stimuli of negative sentiment. These findings suggest that eye tracking can provide objective markers of self-reported symptom severity by measuring the impact of emotional stimuli on oculomotor control.
Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis
2026-04-21
Article
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
2026-05-14
Article · Open access
The current real-time vocal tract MRI study examines the articulatory encoding of prosodic boundary, prominence and their interaction through kinematic analysis of penultimate and final lengthening near an intonational phrase (IP) boundary in Setswana. One hypothesis is that penultimate lengthening represents a specific case of final lengthening initiated on the IP-penultimate position. Alternatively, penultimate lengthening and final lengthening may result from the interaction between phrase-level prominence and boundary events. Our results reveal two phases of lengthening in the IP-penultimate and IP-final positions. Displacement and peak velocity are also greater IP-finally than IP-medially, but boundary-related increase in displacement and peak velocity only shows a single progressive trend approaching the final IP boundary, with no IP-penultimate alterations comparable to durational patterns. Additionally, there is some evidence for greater duration, displacement and peak velocity of initial consonant gestures on word-penultimate syllables than on word-final ones regardless of utterance positions, indicating a possible word-penultimate prominence effect. These findings suggest that penultimate and final lengthening in Setswana are better understood as the interaction between disparate prominence and boundary events. The results are interpreted according to a prosodic gestural approach that posits the coordination of a phrasal-prominence-encoding gesture and a boundary-encoding gesture.
Learning-free L2-Accented Speech Generation using Phonological Rules
Open MIND · 2026-03-08
Preprint
Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose an accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.
JASA Express Letters · 2026-03-01
Article · Open access
Modeling articulatory representations is critical to the scientific study of speech production, including its relation to speech acoustics. However, discretizing articulatory dynamics in continuous speech has proven computationally taxing. For example, segmentation analyses of real-time vocal tract images deploying contour-tracking methods, while successful, require manual creation of templates and human supervised assessment [e.g., Bresch and Narayanan (2009). IEEE Trans. Med. Imaging. 28(3), 323-338]. In this paper, we utilize Segment Anything Model 2 (SAM 2.0) [Ravi et al. (2024). arXiv:2408.00714] to efficiently segment critical articulators in real-time magnetic resonance imaging speech production data without fine-tuning and with global nonlinear image filtering to examine such systems' ability to segment speech dynamics, which have both language- and subject-specific characteristics.
Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis
ArXiv.org · 2026-01-20
Article · Open access
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
Interpretable Modeling of Articulatory Temporal Dynamics from Real-Time MRI for Phoneme Recognition
2026-04-21
Article · Open access
Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing. The source code is publicly available.
Recent grants
Speech Prosody and Articulatory Dynamics in Spoken Language
NIH · $5.9M · 1997–2019
Doctoral Dissertation Research: Articulatory Dynamics and Stability of Multi-gesture Complexes
NSF · $19k · 2021–2023
NIH · $553k · 2002
Frequent coauthors
- Louis Goldstein · 77 shared
- Shrikanth Narayanan · 60 shared
- Sungbok Lee · Korea Advanced Institute of Science and Technology · 41 shared
- Krishna S. Nayak · 21 shared
- Asterios Toutios · University of Southern California · 21 shared
- Elliot Saltzman · Boston University · 19 shared
- Erik Bresch · Philips (Netherlands) · 17 shared
- Benjamin Parrell · University of Wisconsin–Madison · 15 shared
Education
Ph.D., Linguistics
University of Southern California
M.A., Linguistics
University of Southern California
B.A., Linguistics
University of Southern California