
Sarah Moeller
Ph.D. · University of Florida · Linguistics
Active 1998–2026
About
Sarah Moeller is an Assistant Professor of Computational Language Science at the University of Florida, based in Turlington 4129, Gainesville, FL. She is an interdisciplinary researcher dedicated to bridging the technological and knowledge gaps between natural language processing (NLP) and traditional linguistics. Her research fosters a virtuous cycle where NLP contributes to the scientific study of minority languages, and in turn, the study of these languages expands NLP and AI to new linguistic contexts. Her special interest lies in the languages of the former Soviet Union, where she has engaged in fieldwork with Nakh-Daghestanian languages and encountered language endangerment firsthand. Moeller's work explores how best practices in linguistics might evolve with AI integration and how computational linguistics can empower both academic and community linguists to leverage AI positively. Before her academic career, she gained extensive experience teaching English as a Foreign Language, working as a freelance Russian-English interpreter, and spending several years in the NLP industry. She is passionate about helping linguists and individuals from humanities backgrounds adopt computational methods. Her educational background includes a Ph.D. in Linguistics and Cognitive Science from the University of Colorado Boulder, an M.A. in Applied Linguistics from Dallas International University, and a B.A. in History from Thomas Edison State College.
Research topics
- Computer Science
- Natural Language Processing
- Linguistics
- Artificial Intelligence
- Programming language
- Philosophy
- Epistemology
- Psychology
- Cognitive science
- Engineering
Selected publications
Computational Methods for Language Documentation and Description
Annual Review of Linguistics · 2026-01-30
article · Open access · 1st author · Corresponding
In this era of rapid artificial intelligence (AI) expansion, computational approaches are reshaping methods for language documentation and description. We survey the history of computational methods that have been applied to research in languages with limited digital resources and also present cutting-edge methods, such as large language models (LLMs), that have the potential to benefit documentary and descriptive fieldwork. We highlight how these methods affect data collection and annotation, transcription and phonological analysis, morphosyntactic description, and translation. Linguists, natural language processing engineers, and speech communities must consider how the use of computational methods such as data mining and machine learning should influence ethical best practices in linguistic field methods and how communities can continue to guide the documentation and maintenance of their languages in the age of AI. Looking forward, LLMs and making computational methods broadly usable through user interfaces are likely to emerge as prominent themes in documentary and descriptive research.
Analysis of LLM as a grammatical feature tagger for African American English
ArXiv.org · 2025-02-09
preprint · Open access
African American English (AAE) presents unique challenges in natural language processing (NLP). This research systematically compares the performance of available NLP models--rule-based, transformer-based, and large language models (LLMs)--capable of identifying key grammatical features of AAE, namely Habitual Be and Multiple Negation. These features were selected for their distinct grammatical complexity and frequency of occurrence. The evaluation involved sentence-level binary classification tasks, using both zero-shot and few-shot strategies. The analysis reveals that while LLMs show promise compared to the baseline, they are influenced by biases such as recency and unrelated features in the text such as formality. This study highlights the necessity for improved model training and architectural adjustments to better accommodate AAE's unique linguistic characteristics. Data and code are available.
Analysis of LLM as a grammatical feature tagger for African American English
2025-01-01
article · Open access · Senior author
African American English (AAE) presents unique challenges in natural language processing (NLP). This research systematically compares the performance of available NLP models--rule-based, transformer-based, and large language models (LLMs)--capable of identifying key grammatical features of AAE, namely Habitual Be and Multiple Negation. These features were selected for their distinct grammatical complexity and frequency of occurrence. The evaluation involved sentence-level binary classification tasks, using both zero-shot and few-shot strategies. The analysis reveals that while LLMs show promise compared to the baseline, they are influenced by biases such as recency and unrelated features in the text such as formality. This study highlights the necessity for improved model training and architectural adjustments to better accommodate AAE's unique linguistic characteristics. Data and code are available.
Challenges in Processing Chinese Texts Across Genres and Eras
2025-01-01
article · Open access · Senior author
Pre-trained Chinese Natural Language Processing (NLP) tools show reduced performance when analyzing poetry compared to prose. This study investigates the discrepancies between tools trained on either Classical or Modern Chinese prose when handling Classical Chinese prose and Classical Chinese poetry. Three experiments reveal error patterns indicating that the weaker performance on Classical Chinese poems is due to challenges in identifying word boundaries. Specifically, tools trained on Classical prose struggle to recognize word boundaries within Classical poetic structures, and tools trained on Modern prose have difficulty with word segmentation in both Classical Chinese genres. These findings provide valuable insights into the limitations of current NLP tools for studying Classical Chinese literature.
The Oral History Review · 2025-07-03
article · Senior author
2024-01-01
article · Open access · 1st author · Corresponding
Sarah Moeller, Godfred Agyapong, Antti Arppe, Aditi Chaudhary, Shruti Rijhwani, Christopher Cox, Ryan Henke, Alexis Palmer, Daisy Rosenblum, Lane Schwartz. Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages. 2024.
Machine-in-the-Loop with Documentary and Descriptive Linguists
2024-01-01
article · Open access · 1st author · Corresponding
This paper describes a curriculum for teaching linguists how to apply a machine-in-the-loop (MitL) approach to documentary and descriptive tasks. It also shares observations about the learning participants, who are primarily noncomputational linguists, and how they interact with the MitL approach. We found that they prefer cleaning over increasing the training data and then proceed to reanalyze their analytical decisions, before finally undertaking small actions that emphasize analytical strategies. Overall, participants display an understanding of the curriculum, which covers fundamental concepts of machine learning and statistical modeling.
Leveraging syntactic dependencies in disambiguation: the case of African American English
2024-04-01 · 1 citation
preprint · Open access · Senior author
African American English (AAE) has received recent attention in the field of natural language processing (NLP). Efforts to address bias against AAE in NLP systems tend to focus on lexical differences. Whenever the structural uniqueness of AAE is considered, the solution is often to remove or neutralize the differences. This work leverages knowledge about the unique morphosyntactic structures to improve automatic disambiguation of habitual and nonhabitual meanings of “be” in naturally produced AAE transcribed speech. Both meanings are employed in AAE but examples of Habitual be are rare in the already limited AAE data. Generally, representing contextual syntactic information improves semantic disambiguation of habituality. Using an ensemble of classical machine learning models with a representation of the unique POS and dependency patterns of Habitual be, we show that integrating syntactic information improves the identification of habitual uses of “be” by about 65 F1 points over a simple baseline model of n-grams, and as much as 74 points. The success of this approach demonstrates the potential impact when we embrace, rather than neutralize, the structural uniqueness of African American English.
A Comparison of Fine-Tuning and In-Context Learning for Clause-Level Morphosyntactic Alternation
2024-01-01
article · Open access
This paper presents our submission to the AmericasNLP 2024 Shared Task on the Creation of Educational Materials for Indigenous Languages. We frame this task as one of morphological inflection generation, treating each sentence as a single word. We investigate and compare two distinct approaches: fine-tuning neural encoder-decoder models such as NLLB-200, and in-context learning with proprietary large language models (LLMs). Our findings demonstrate that for this task, no one approach is perfect. Anthropic's Claude 3 Opus, when supplied with grammatical description entries, achieves the highest performance on Bribri among the evaluated models. This outcome corroborates and extends previous research exploring the efficacy of in-context learning in low-resource settings. For Maya, fine-tuning NLLB-200-3.3B using StemCorrupt augmented data yielded the best performance.
The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task
2024-01-01
article · Open access · Senior author
We contribute a seed dataset for the Bangla/Bengali language as part of the WMT24 Open Language Data Initiative shared task. We validate the quality of the dataset against a mined and automatically aligned dataset (NLLBv1) and two other existing datasets of crowdsourced manual translations. The validation is performed by investigating the performance of state-of-the-art translation models fine-tuned on the different datasets after controlling for training set size. Machine translation models fine-tuned on our dataset outperform models tuned on the other datasets in both translation directions (English-Bangla and Bangla-English). These results confirm the quality of our dataset. We hope our dataset will support machine translation for the Bangla/Bengali community and related low-resource languages.
Frequent coauthors
- Omri Abend · 30 shared
- Jakob Prange · Technische Hochschule Augsburg · 30 shared
- Austin Blodgett · DEVCOM Army Research Laboratory · 30 shared
- Vivek Srikumar · 30 shared
- Nathan Schneider · Georgetown University · 30 shared
- Jena D. Hwang · Allen Institute · 30 shared
- Adi Bitan · 27 shared
- Aviram Stern · University of Utah · 27 shared
Education
Ph.D., Linguistics and Cognitive Science
University of Colorado Boulder
M.A., Applied Linguistics
Dallas International University
B.A., History
Thomas Edison State College