About
Gopala Krishna Anumanchipalli is the Robert E. and Beverly A. Brooks Assistant Professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, and also holds a position in the Department of Neurosurgery at UC San Francisco. He leads the Berkeley Speech Group, focusing on the intersection of speech processing, neuroscience, and artificial intelligence. His research emphasizes human-centered speech and assistive technologies, including the development of bio-inspired spoken language technologies, automated methods for early diagnosis, and approaches to characterize and rehabilitate disordered speech. His broader interests include healthcare, conversational AI, and multimodal learning. The Berkeley Speech Group is part of the Berkeley AI Research (BAIR) community. Gopala Anumanchipalli completed his PhD at Carnegie Mellon University and Instituto Superior Técnico under the advisement of Alan Black and Luis Oliveira, followed by a postdoctoral fellowship at UC San Francisco with Edward Chang. He earned his B.Tech and M.S. degrees from IIIT Hyderabad under Raj Reddy.
Research topics
- Computer Science
- Artificial Intelligence
- Neuroscience
- Psychology
- Speech recognition
- Telecommunications
- Human–computer interaction
- Linguistics
- Audiology
- Medicine
- Physical medicine and rehabilitation
Selected publications
Automatic Detection of Articulatory-Based Disfluencies in Primary Progressive Aphasia
IEEE Journal of Selected Topics in Signal Processing · 2025-06-16 · 4 citations
article · Open access · Senior author
Speech corpora are collections of textual data derived from human verbal output and speech signals that can be processed from a variety of perspectives, including formal or semantic content, to serve analyses of different levels of linguistic organisation (phonemic, morphosyntactic, lexico-semantic and content information, prosody and intonation) and to serve analyses of important phenomena such as speech fluency and errors (non-fluencies). We focus on transcribing speech along with non-fluencies or dysfluencies, the detection of which plays an important role in the diagnosis of primary progressive aphasia, where we specifically examine articulation-based dysfluencies in nfvPPA speech. In this work, we propose SSDM 2.0, which is built on top of the current state-of-the-art system of dysfluency detection [1] and tackles its shortcomings via four main contributions: (1) We propose a novel Neural Articulatory Flow for deriving highly scalable, dysfluency-aware speech representations. (2) We develop a full-stack connectionist subsequence aligner to capture all major dysfluency types. (3) We introduce a mispronunciation prompt pipeline and consistency learning into LLMs to enable in-context dysfluency learning. (4) We curate and open-source Libri-Co-Dys [1], the largest co-dysfluency corpus to date. (5) We also present SSDM-L, a modular, non-end-to-end, lightweight model designed for clinical deployment. In clinical experiments on pathological speech transcription, we tested SSDM 2.0 using an nfvPPA corpus primarily characterized by articulatory dysfluencies. Overall, SSDM 2.0 outperforms SSDM and all other dysfluency transcription models by a large margin. See our project demo page at https://berkeley-speech-group.github.io/SSDM2.0/.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
ArXiv.org · 2025-03-12 · 1 citation
preprint · Open access
Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them to complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
ArXiv.org · 2025-03-06
preprint · Open access
Spoken dialogue modeling poses challenges beyond text-based language modeling, requiring real-time interaction, turn-taking, and backchanneling. While most Spoken Dialogue Models (SDMs) operate in half-duplex mode, processing one turn at a time, emerging full-duplex SDMs can listen and speak simultaneously, enabling more natural conversations. However, current evaluations remain limited, focusing mainly on turn-based metrics or coarse corpus-level analyses. To address this, we introduce Full-Duplex-Bench, a benchmark that systematically evaluates key interactive behaviors: pause handling, backchanneling, turn-taking, and interruption management. Our framework uses automatic metrics for consistent, reproducible assessment and provides a fair, fast evaluation setup. By releasing our benchmark and code, we aim to advance spoken dialogue modeling and foster the development of more natural and engaging SDMs.
A High-Performance Neuroprosthesis for Speech Decoding and Avatar Control
Springer briefs in electrical and computer engineering · 2025-01-01
book-chapter
A streaming brain-to-voice neuroprosthesis to restore naturalistic communication
Nature Neuroscience · 2025-03-31 · 56 citations
article · Senior author
TART: A Comprehensive Tool for Technique-Aware Audio-to-Tab Guitar Transcription
ArXiv.org · 2025-10-02
preprint · Open access · Senior author
Automatic Music Transcription (AMT) has advanced significantly for the piano, but transcription for the guitar remains limited due to several key challenges. Existing systems fail to detect and annotate expressive techniques (e.g., slides, bends, percussive hits) and map notes to the wrong string and fret combination in the generated tablature. Furthermore, prior models are typically trained on small, isolated datasets, limiting their generalizability to real-world guitar recordings. To overcome these limitations, we propose a four-stage end-to-end pipeline that produces detailed guitar tablature directly from audio. Our system consists of (1) Audio-to-MIDI pitch conversion through a piano transcription model adapted to guitar datasets; (2) MLP-based expressive technique classification; (3) Transformer-based string and fret assignment; and (4) LSTM-based tablature generation. To the best of our knowledge, this framework is the first to generate detailed tablature with accurate fingerings and expressive labels from guitar audio.
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
2024-03-18 · 4 citations
article · Senior author
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space and speech units beyond phonemes are largely underexplored. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt a "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames exhibit salient syllabic structures. We demonstrate that this emergent structure largely corresponds to the ground truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech. When compared to previous models, our model outperforms in both unsupervised syllable discovery and learning sentence-level representation. Together, we demonstrate that the self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.
Coding Speech Through Vocal Tract Kinematics
IEEE Journal of Selected Topics in Signal Processing · 2024-11-20 · 8 citations
article · Senior author
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech – Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
2024-09-01 · 3 citations
article · Senior author
Coding Speech through Vocal Tract Kinematics
arXiv (Cornell University) · 2024-06-18 · 2 citations
preprint · Open access · Senior author
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech – Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
Frequent coauthors
- 53 shared
Edward F. Chang
Neurological Surgery
- 39 shared
Alan W. Black
- 39 shared
Peter Wu
- 27 shared
Cheol Jun Cho
- 23 shared
Kaylo T. Littlejohn
University of California, San Francisco
- 20 shared
Josh Chartier
Neurological Surgery
- 18 shared
Jiachen Lian
- 17 shared
Inga Zhuravleva
Berkeley College
Education
- 2005
Ph.D., Electrical Engineering and Computer Sciences
University of California, Berkeley
- 2001
M.S., Electrical Engineering and Computer Sciences
University of California, Berkeley
- 1999
B.S., Electrical Engineering
Indian Institute of Technology, Madras
Awards & honors
- Hellman Fellow (2023)
- Google Faculty Research Award (2022)
- Rose Hills Innovator Program (2021)