Gopala Krishna Anumanchipalli

Assistant Professor

University of California, Berkeley · Department of Electrical Engineering and Computer Sciences

Active 2007–2025

h-index: 19

Citations: 2.9k

Papers: 94 (60 in the last 5 years)

About

Gopala Krishna Anumanchipalli is the Robert E. and Beverly A. Brooks Assistant Professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, and also holds a position in the Department of Neurosurgery at UC San Francisco. He leads the Berkeley Speech Group, focusing on the intersection of speech processing, neuroscience, and artificial intelligence. His research emphasizes human-centered speech and assistive technologies, including the development of bio-inspired spoken language technologies, automated methods for early diagnosis, and approaches to characterize and rehabilitate disordered speech. His broader interests include healthcare, conversational AI, and multimodal learning. The Berkeley Speech Group is part of the Berkeley AI Research (BAIR) community. Gopala Anumanchipalli completed his PhD at Carnegie Mellon University and Instituto Superior Tecnico under the advisement of Alan Black and Luis Oliveira, followed by a postdoctoral fellowship at UC San Francisco with Edward Chang. He earned his B.Tech and M.S. degrees from IIIT Hyderabad under Raj Reddy.

Research topics

Computer Science
Artificial Intelligence
Neuroscience
Psychology
Speech recognition
Telecommunications
Human–computer interaction
Linguistics
Audiology
Medicine
Physical medicine and rehabilitation

Selected publications

Automatic Detection of Articulatory-Based Disfluencies in Primary Progressive Aphasia
IEEE Journal of Selected Topics in Signal Processing · 2025-06-16 · 4 citations
article · Open access · Senior author
Speech corpora are collections of textual data derived from human verbal output and speech signals that can be processed from a variety of perspectives, including formal or semantic content, to serve analyses of different levels of linguistic organisation (phonemic, morphosyntactic, lexico-semantic and content information, prosody and intonation) and to serve analyses of important phenomena such as speech fluency and errors (non-fluencies). We focus on transcribing speech along with non-fluencies or dysfluencies, the detection of which plays an important role in the diagnosis of primary progressive aphasia, where we specifically examine articulation-based dysfluencies in nfvPPA speech. In this work, we propose SSDM 2.0, which is built on top of the current state-of-the-art system of dysfluency detection [1] and tackles its shortcomings via four main contributions: (1) We propose a novel Neural Articulatory Flow for deriving highly scalable, dysfluency-aware speech representations. (2) We develop a full-stack connectionist subsequence aligner to capture all major dysfluency types. (3) We introduce a mispronunciation prompt pipeline and consistency learning into LLMs to enable in-context dysfluency learning. (4) We curate and open-source Libri-Co-Dys [1], the largest co-dysfluency corpus to date. We also present SSDM-L, a modular, non-end-to-end, lightweight model designed for clinical deployment. In clinical experiments on pathological speech transcription, we tested SSDM 2.0 using an nfvPPA corpus primarily characterized by articulatory dysfluencies. Overall, SSDM 2.0 outperforms SSDM and all other dysfluency transcription models by a large margin. See our project demo page at https://berkeley-speech-group.github.io/SSDM2.0/.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
ArXiv.org · 2025-03-12 · 1 citation
preprint · Open access
Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them to complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities
ArXiv.org · 2025-03-06
preprint · Open access
Spoken dialogue modeling poses challenges beyond text-based language modeling, requiring real-time interaction, turn-taking, and backchanneling. While most Spoken Dialogue Models (SDMs) operate in half-duplex mode, processing one turn at a time, emerging full-duplex SDMs can listen and speak simultaneously, enabling more natural conversations. However, current evaluations remain limited, focusing mainly on turn-based metrics or coarse corpus-level analyses. To address this, we introduce Full-Duplex-Bench, a benchmark that systematically evaluates key interactive behaviors: pause handling, backchanneling, turn-taking, and interruption management. Our framework uses automatic metrics for consistent, reproducible assessment and provides a fair, fast evaluation setup. By releasing our benchmark and code, we aim to advance spoken dialogue modeling and foster the development of more natural and engaging SDMs.
A High-Performance Neuroprosthesis for Speech Decoding and Avatar Control
Springer briefs in electrical and computer engineering · 2025-01-01
book-chapter
A streaming brain-to-voice neuroprosthesis to restore naturalistic communication
Nature Neuroscience · 2025-03-31 · 56 citations
article · Senior author
TART: A Comprehensive Tool for Technique-Aware Audio-to-Tab Guitar Transcription
ArXiv.org · 2025-10-02
preprint · Open access · Senior author
Automatic Music Transcription (AMT) has advanced significantly for the piano, but transcription for the guitar remains limited due to several key challenges. Existing systems fail to detect and annotate expressive techniques (e.g., slides, bends, percussive hits) and map notes to the wrong string and fret combination in the generated tablature. Furthermore, prior models are typically trained on small, isolated datasets, limiting their generalizability to real-world guitar recordings. To overcome these limitations, we propose a four-stage end-to-end pipeline that produces detailed guitar tablature directly from audio. Our system consists of (1) Audio-to-MIDI pitch conversion through a piano transcription model adapted to guitar datasets; (2) MLP-based expressive technique classification; (3) Transformer-based string and fret assignment; and (4) LSTM-based tablature generation. To the best of our knowledge, this framework is the first to generate detailed tablature with accurate fingerings and expressive labels from guitar audio.
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in Hubert
2024-03-18 · 4 citations
article · Senior author
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space, and speech units beyond phonemes are largely underexplored. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt a "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames exhibit salient syllabic structures. We demonstrate that this emergent structure largely corresponds to the ground truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech. Our model outperforms previous models in both unsupervised syllable discovery and learning sentence-level representation. Together, we demonstrate that the self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.
Coding Speech Through Vocal Tract Kinematics
IEEE Journal of Selected Topics in Signal Processing · 2024-11-20 · 8 citations
article · Senior author
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech: Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
2024-09-01 · 3 citations
article · Senior author
Coding Speech through Vocal Tract Kinematics
arXiv (Cornell University) · 2024-06-18 · 2 citations
preprint · Open access · Senior author
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech: Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.

Frequent coauthors

Edward F. Chang
Neurological Surgery
53 shared
Alan W. Black
39 shared
Peter Wu
39 shared
Cheol Jun Cho
27 shared
Kaylo T. Littlejohn
University of California, San Francisco
23 shared
Josh Chartier
Neurological Surgery
20 shared
Jiachen Lian
18 shared
Inga Zhuravleva
Berkeley College
17 shared

Education

Ph.D., Electrical Engineering and Computer Sciences
University of California, Berkeley
2005
M.S., Electrical Engineering and Computer Sciences
University of California, Berkeley
2001
B.S., Electrical Engineering
Indian Institute of Technology, Madras
1999

Awards & honors

Hellman Fellow (2023)
Google Faculty Research Award (2022)
Rose Hills Innovator Program (2021)
