Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Julia Hockenmaier

Professor and Willett Faculty Scholar · Verified

University of Illinois Urbana-Champaign · Computer Science

Active 1998–2025

h-index: 33
Citations: 10.4k
Papers: 101 (24 in the last 5 years)
Funding: $841k
About

Julia Hockenmaier is a professor and Willett Faculty Scholar at the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. Her research area is artificial intelligence, with a focus on natural language processing. Recent courses she has taught include Natural Language Processing, Machine Learning in NLP, and Embodied Natural Language Processing. She has also contributed to discussions on the impact of AI technologies such as ChatGPT on the academic community.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Natural Language Processing
  • Linguistics
  • Programming language

Selected publications

  • Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

    2025-01-01 · 1 citation

    article · Open access · Senior author
  • Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

    ArXiv.org · 2025-03-31

    preprint · Open access · Senior author

    Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link that the $\ell_2$-norm of the sparse feature vector can be approximated with the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.

  • Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models

    2025-01-01

    article · Open access · Senior author
  • ReasoningFlow: Semantic Structure of Complex Reasoning Traces

    ArXiv.org · 2025-06-03

    preprint · Open access · Senior author

    Large reasoning models (LRMs) generate complex reasoning traces with planning, reflection, verification, and backtracking. In this work, we introduce ReasoningFlow, a unified schema for analyzing the semantic structures of these complex traces. ReasoningFlow parses traces into directed acyclic graphs, enabling the characterization of distinct reasoning patterns as subgraph structures. This human-interpretable representation offers promising applications in understanding, evaluating, and enhancing the reasoning processes of LRMs.

  • On the Versatility of Sparse Autoencoders for In-Context Learning

    2025-01-01

    article · Open access · Senior author

    Sparse autoencoders (SAEs) are emerging as a key analytical tool in the field of mechanistic interpretability for large language models (LLMs). While SAEs have primarily been used for interpretability, we shift focus and explore an understudied question: "Can SAEs be applied to practical tasks beyond interpretability?" Given that SAEs are trained on billions of tokens for sparse reconstruction, we believe they can serve as effective extractors, offering a wide range of useful knowledge that can benefit practical applications. Building on this motivation, we demonstrate that SAEs can be effectively applied to in-context learning (ICL). In particular, we highlight the utility of the SAE-reconstruction loss by showing that it provides a valuable signal in ICL, exhibiting a strong correlation with LLM performance and offering a powerful unsupervised approach for prompt selection. These findings underscore the versatility of SAEs and reveal their potential for real-world applications beyond interpretability. Our code is available at https://github.com/ihcho2/SAE-GPS.

  • Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

    ArXiv.org · 2025-10-31

    preprint · Open access · Senior author

    As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.

  • BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

    ArXiv.org · 2025-01-18

    preprint · Open access · Senior author

    Developing interactive agents that can understand language, perceive their surroundings, and act within the physical world is a long-standing goal of AI research. The Minecraft Collaborative Building Task (MCBT) (Narayan-Chen, Jayannavar, and Hockenmaier 2019), a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated 3D Blocks World environment, offers a rich platform to work towards this goal. In this work, we focus on the Builder Action Prediction (BAP) subtask: predicting B's actions in a multimodal game context (Jayannavar, Narayan-Chen, and Hockenmaier 2020) - a challenging testbed for grounded instruction following, with limited training data. We holistically re-examine this task and introduce BAP v2 to address key challenges in evaluation, training data, and modeling. Specifically, we define an enhanced evaluation benchmark, featuring a cleaner test set and fairer, more insightful metrics that also reveal spatial reasoning as the primary performance bottleneck. To address data scarcity and to teach models basic spatial skills, we generate different types of synthetic MCBT data. We observe that current, LLM-based SOTA models trained on the human BAP dialogues fail on these simpler, synthetic BAP ones, but show that training models on this synthetic data improves their performance across the board. We also introduce a new SOTA model, Llama-CRAFTS, which leverages richer input representations, and achieves an F1 score of 53.0 on the BAP v2 task and strong performance on the synthetic data. While this result marks a notable 6 points improvement over previous work, it also underscores the task's remaining difficulty, establishing BAP v2 as a fertile ground for future research, and providing a useful measure of the spatial capabilities of current text-only LLMs in such embodied tasks.

  • The Power of Bullet Lists: A Simple Yet Effective Prompting Approach to Enhancing Spatial Reasoning in Large Language Models

    2025-01-01

    article · Open access · Senior author

    While large language models (LLMs) are dominating the field of natural language processing, it remains an open question how well these models can perform spatial reasoning. Contrary to recent studies suggesting that LLMs struggle with spatial reasoning tasks, we demonstrate in this paper that a novel prompting technique, termed Patient Visualization of Thought (PATIENT-VOT), can boost LLMs' spatial reasoning abilities. The core idea behind PATIENT-VOT is to explicitly integrate bullet lists, coordinates, and visualizations into the reasoning process. By applying PATIENT-VOT, we achieve a significant boost in spatial reasoning performance compared to prior prompting techniques. We also show that integrating bullet lists into reasoning is effective in planning tasks, highlighting its general effectiveness across different applications.

  • Entailment-Preserving First-order Logic Representations in Natural Language Entailment

    ArXiv.org · 2025-02-24

    preprint · Open access · Senior author

    First-order logic (FOL) can represent the logical entailment semantics of natural language (NL) sentences, but determining natural language entailment using FOL remains a challenge. To address this, we propose the Entailment-Preserving FOL representations (EPF) task and introduce reference-free evaluation metrics for EPF, the Entailment-Preserving Rate (EPR) family. In EPF, one should generate FOL representations from multi-premise natural language entailment data (e.g. EntailmentBank) so that the automatic prover's result preserves the entailment labels. Experiments show that existing methods for NL-to-FOL translation struggle in EPF. To this extent, we propose a training method specialized for the task, iterative learning-to-rank, which directly optimizes the model's EPR score through a novel scoring function and a learning-to-rank objective. Our method achieves a 1.8-2.7% improvement in EPR and a 17.4-20.6% increase in EPR@16 compared to diverse baselines in three datasets. Further analyses reveal that iterative learning-to-rank effectively suppresses the arbitrariness of FOL representation by reducing the diversity of predicate signatures, and maintains strong performance across diverse inference types and out-of-domain data.

  • Entailment-Preserving First-order Logic Representations in Natural Language Entailment

    2025-01-01

    article · Senior author

Education

  • Ph.D., Computer Science

    University of Illinois at Urbana-Champaign

    2006
  • M.S., Computer Science

    University of Illinois at Urbana-Champaign

    2001
  • B.S., Computer Science

    University of Illinois at Urbana-Champaign

    1998

Awards & honors

  • Willett Faculty Scholar