Jason Eisner

· Joint Appointment; Professor, Whiting School of Engineering, Computer Science

Johns Hopkins University · Neuroscience

Active 1983–2026

h-index53

Citations8.6k

Papers27076 last 5y

Funding$2.7M

Faculty page Lab page

See your match with Jason Eisner — sign in to PhdFit.Sign in

About

Jason Eisner is the John C. Malone Professor of Computer Science at Johns Hopkins University. The page does not provide specific details about his research focus, background, or key contributions. The content primarily consists of a humorous song and personal notes related to the department's move to Malone Hall, with no explicit biographical or research information included.

Research topics

Computer Science
Artificial Intelligence
Natural Language Processing
Speech recognition
Chemistry
Physics
Philosophy
Linguistics

Selected publications

LLMs Know More About Numbers than They Can Say
Underline Science Inc. · 2026-03-06
otherOpen access
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, 5.7 x 102 or 580?'' This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities.
Publisher DOI
LLMs Know More About Numbers than They Can Say
ArXiv.org · 2026-02-08
articleOpen accessSenior author
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities. Our code is available at https://github.com/VCY019/Numeracy-Probing.
Publisher OA PDF
LLMs Know More About Numbers than They Can Say
Open MIND · 2026-02-08
preprintSenior author
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities. Our code is available at https://github.com/VCY019/Numeracy-Probing.
DOI
Accelerating Language Model Workflows with <scp>Prompt Choreography</scp>
Transactions of the Association for Computational Linguistics · 2026-03-16
articleOpen accessSenior author
Abstract Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages’ encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0–6.2× faster time-to-first-token) and achieves substantial end-to-end speedups (&gt;2.2×) in some workflows dominated by redundant computation.
Publisher OA PDF DOI
Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans
ArXiv.org · 2025-12-29
articleOpen accessSenior author
We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking ``liked'' and ``disliked'' spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
Publisher OA PDF
MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
ArXiv.org · 2025-04-28
preprintOpen access
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
Publisher OA PDF DOI
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
ArXiv.org · 2025-04-07 · 1 citations
preprintOpen access
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
Publisher OA PDF DOI
Automating the Analysis of Parsing Algorithms (and other Dynamic Programs)
arXiv (Cornell University) · 2025-12-29
preprintOpen accessSenior author
Much algorithmic research in NLP aims to efficiently manipulate rich formal structures. An algorithm designer typically seeks to provide guarantees about their proposed algorithm -- for example, that its running time or space complexity is upper-bounded as a certain function of its input size. They may also wish to determine the necessary properties of the quantities derived by the algorithm to synthesize efficient data structures and verify type errors. In this paper, we develop a system for helping programmers to perform these types of analyses. We apply our system to a number of NLP algorithms and find that it successfully infers types, dead and redundant code, and parametric runtime and space complexity bounds.
Publisher DOI
Accelerating Language Model Workflows with Prompt Choreography
ArXiv.org · 2025-12-28
articleOpen accessSenior author
Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($>$2.2$\times$) in some workflows dominated by redundant computation.
Publisher OA PDF
Accelerating Language Model Workflows with Prompt Choreography
arXiv (Cornell University) · 2025-12-28
preprintOpen accessSenior author
Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($>$2.2$\times$) in some workflows dominated by redundant computation.
Publisher DOI

Recent grants

CAREER: Finite-State Machine Learning on Strings and Sequences
NSF · $500k · 2004–2010
RI: Medium: Learned Dynamic Prioritization
NSF · $900k · 2010–2014
ITR: Weighted Dynamic Programming for Statistical Natural Language Processing
NSF · $425k · 2003–2007
XPS: FULL: Collaborative Research: Parallel and Distributed Circuit Programming for Structured Prediction
NSF · $415k · 2016–2019
RI: Small: CompCog: Modeling Latent Discrete Knowledge Across Utterances
NSF · $458k · 2014–2018

Frequent coauthors

Ryan Cotterell
73 shared
Tim Vieira
23 shared
Benjamin Van Durme
19 shared
Matthew R. Gormley
Broad Center
18 shared
Luc De Raedt
KU Leuven
18 shared
Christo Kirov
17 shared
Mans Hulden
15 shared
Chu‐Cheng Lin
15 shared

Labs

Jason Eisner LabPI

Awards & honors

Fellow of the Association for Computational Linguistics

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Jason Eisner

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you