
Zoey Liu
Ph.D. · Verified · University of Florida · Linguistics
Active 2018–2025
About
Zoey Liu, PhD, is an Assistant Professor in the Department of Linguistics at the University of Florida, where she leads the Computational Linguistics Lab. Her research program studies variation and generalization in languages, machines, and their intersections, addressing questions from language typology, language learning, and multilingual NLP evaluation, with a focus on a non-western mind. She employs a data-driven methodology, incorporating principles such as The #BenderRule and methods of number counting at varying degrees of carbon dioxide consumption. Dr. Liu received her Ph.D. in Linguistics from the University of California, Davis in 2020. Her academic background includes a B.A. in Translation with honors from Nankai University and a summer internship as a software developer at the Cognitive Computing Lab at Baidu. She was a 2021-2023 Computing Innovation Fellow supported by the National Science Foundation / Computing Research Association and served as a Postdoctoral Research Fellow in Computer Science at Boston College. Beyond her academic pursuits, she actively serves Advocates for Indigenous California Language Survival and their biennial institute Breath of Life. Her personal interests include music, food, and simple methods.
Research topics
- Computer Science
- Natural Language Processing
- Artificial Intelligence
- Humanities
- Linguistics
- Information Retrieval
- Art
- Philosophy
- Epistemology
- Programming language
Selected publications
I Speak for the Árboles: Developing a Dependency Treebank for Spanish L2 and Heritage Speakers
2025-01-01
article · Open access · Senior author
We introduce the first dependency treebank containing Universal Dependencies (UD) annotations for Spanish learner writing from the UC Davis COWSL2H corpus. Our annotations include lemmatization, POS tagging, and syntactic dependencies. We adapt the existing UD framework for Spanish L1 to account for learner-specific features such as code-switching and non-canonical syntax. A suite of parsing evaluation experiments shows that parsers trained on learner data together with moderate sizes of Spanish L1 data can yield reasonable performance. Our annotations are openly accessible to motivate future development of learner-oriented language technologies.
The development of English negative constructions and communicative functions
Language Learning and Development · 2025-03-10 · 5 citations
article · 1st author · Corresponding
Evaluating learning trajectories of neural morphology acquisition models
Linguistics Vanguard · 2025-09-08
article · Senior author
Computational models of morphology acquisition have played a central role in debates over the nature of morphological representations since the origin of the “past tense debate” in the 1980s. The apparent success of recent artificial neural network architectures for morphological inflection in natural language processing has revitalized this debate. However, despite their often good performance, the actual suitability of these advanced neural networks as models of human morphology acquisition remains uncertain. We argue that much of this confusion stems from inconsistent methods of training and evaluation. In this work, we demonstrate that more careful dataset creation and an evaluation combining quantitative analysis and comparison with human development puts the evaluation of neural models on firmer ground.
ArXiv.org · 2025-04-28
preprint · Open access
CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank. It is derived from previously dependency-annotated CHILDES data, which we harmonize to follow unified annotation principles. The gold-standard trees encompass utterances sampled from 11 children and their caregivers, totaling over 48K sentences (236K tokens). We validate these gold-standard annotations under the UD v2 framework and provide an additional 1M silver-standard sentences, offering a consistent resource for computational and linguistic research.
Modeling the Dative Alternation in English Early Child Language
Open Mind · 2025-01-01
article · Open access · 1st author · Corresponding
How do children develop syntactic ordering preferences, and what affects their syntactic choices? This study probes these questions with the English dative alternation as the test case. We built the largest-to-date dataset of the dative alternation that contains utterances produced by children and their parents, and subjected it to growth curve modeling and logistic regression analysis. Our results demonstrate: (1) the double object structure emerges slightly earlier than the prepositional object structure in child speech; (2) the production level of the double object structure is consistently higher and reaches maximum production growth at a later stage along children’s developmental trajectory; (3) length, givenness, nominal type, and structural persistence are among the most predictive factors of the ordering preferences in both child and parent production, revealing no pronounced difference in their effects dependent on the speaker role or children’s age; (4) children’s ordering preferences start becoming more parent-like as early as 18–24 months.
What data should I include in my POS tagging training set?
2025-01-01
article · Open access · 1st author · Corresponding
Using NLI to Identify Potential Collocation Transfer in L2 English
2025-01-01
article · Open access
Identifying instances of first language (L1) transfer, the application of the linguistic structures of a speaker's first language to their second language(s), can facilitate second language (L2) learning, as it can inform learning and teaching resources, especially when instances of negative transfer (that is, interference) can be identified. While studies of transfer between two languages A and B require a priori linguistic structures to be analyzed with three datasets (data from L1 speakers of language A, L1 speakers of language B, and L2 speakers of A or B), native language identification (NLI), a machine learning task to predict one's L1 based on one's L2 production, has the advantage of detecting instances of subtle and unpredicted transfer, casting a "wide net" to capture patterns of transfer that were missed before (Jarvis and Crossley, 2018). This study aims to apply NLI tasks to find potential instances of transfer of collocations. Our results, compared to previous transfer studies, indicate that NLI can be used to reveal collocation transfer, also in understudied L2 languages.
Towards Cross-Linguistic Semantic Grounding using Dictionary Graph Analysis
2024-01-01
article · Open access · Senior author
Previous work has explored the structure of dictionaries as directed graphs, with arcs between words when one word is used in the definition of another. We analyze the efficacy of these methodologies for analyzing semantic grounding and explore the cross-linguistic patterns of the strongly connected components of multiple monolingual dictionaries. We find that the number of sources in the condensation graph of a directed dictionary graph is roughly stable across multiple languages, and present future research directions.
arXiv (Cornell University) · 2024-11-07
preprint · Open access
The unchecked spread of digital information, combined with increasing political polarization and the tendency of individuals to isolate themselves from opposing political viewpoints, has driven researchers to develop systems for automatically detecting political bias in media. This trend has been further fueled by discussions on social media. We explore methods for categorizing bias in US news articles, comparing rule-based and deep learning approaches. The study highlights the sensitivity of modern self-learning systems to unconstrained data ingestion, while reconsidering the strengths of traditional rule-based systems. Applying both models to left-leaning (CNN) and right-leaning (FOX) news articles, we assess their effectiveness on data beyond the original training and test sets. This analysis highlights each model's accuracy, offers a framework for exploring deep-learning explainability, and sheds light on political bias in US news media. We contrast the opaque architecture of a deep learning model with the transparency of a linguistically informed rule-based model, showing that the rule-based model performs consistently across different data conditions and offers greater transparency, whereas the deep learning model is dependent on the training set and struggles with unseen data.
How Important is a Language Model for Low-resource ASR?
2024-01-01
article · Open access · 1st author · Corresponding
N-gram language models (LMs) are the innovation that first made large-vocabulary continuous automatic speech recognition (ASR) viable. With neural end-to-end ASR architectures, however, LMs have become an afterthought. While the effect on accuracy may be negligible for English and Mandarin, jettisoning the LM might not make sense for the world's remaining 6000+ languages. In this paper, we investigate the role of the LM in low-resource ASR. First we ask: does using an n-gram LM in decoding in neural architectures help ASR performance? While it may seem obvious that it should, its absence in most implementations suggests otherwise. Second, we ask: when an n-gram LM is used in ASR, is there a relationship between the size of the LM and ASR accuracy? We have discovered that gut feelings on this question vary considerably, but there is little empirical work to support any particular claim. We explore these questions "in the wild" using a deliberately diverse set of 9 very small ASR corpora. The results show that: (1) decoding with an n-gram LM, regardless of its size, leads to lower word error rates; and (2) increasing the size of the LM appears to yield improvements only when the audio corpus itself is already relatively large. This suggests that collecting additional LM training text may benefit widely-spoken languages which typically have larger audio corpora. In contrast, for endangered languages where data of any kind will always be limited, efforts may be better spent collecting additional transcribed audio.
Frequent coauthors
- Emily Prud’hommeaux · 16 shared
- Salam Khalifa · 5 shared
- Bonnie J. Dorr · 4 shared
- Marc Allassonnière‐Tang · Éco-Anthropologie · 4 shared
- Clara Vania · 4 shared
- Tiago Pimentel · 4 shared
- Duanchen Liu · Boston College · 3 shared
- William G. Dyer · University of Florida · 3 shared
Education
- B.A., Translation · Nankai University
- Ph.D., Linguistics · University of California, Davis · 2020
Awards & honors
- 2021-2023 Computing Innovation Fellow supported by the Natio…