Tim Althoff

Associate Professor · Verified

University of Washington · Computer Science & Engineering

Active 2012–2025

h-index: 30

Citations: 4.3k

Papers: 158 (114 in the last 5 years)

Funding: $3.2M

About

Tim Althoff is an Associate Professor in Computer Science at the University of Washington. His research aims to better understand and empower people through data by developing computational methods that leverage large-scale behavioral data to extract actionable insights about life, health, and happiness. His work combines techniques from data science, social network analysis, and natural language processing. He directs the Behavioral Data Science Group, which focuses on research related to mental health, misinformation, scientific reproducibility, and public health, including efforts to inform the COVID-19 response. Tim Althoff is actively seeking postdoctoral researchers and PhD students, particularly in areas such as neural representation learning, natural language processing with applications to psychology and mental health, causal inference, data science, and mobile health.

Research topics

Computer Science
Medicine
Psychology
Social psychology
Social Science
Political Science
Internal medicine
Sociology
Virology
Mathematics education
Epistemology
Statistics
Programming language
Mathematics
Demography
Software engineering
Nursing
Data science
Public relations
Psychotherapist
Econometrics

Selected publications

A personal health large language model for sleep and fitness coaching
Nature Medicine · 2025-08-14 · 28 citations
article · Open access
Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was finetuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts for fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to effectively contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data and provides datasets, rubrics and benchmark performance to further accelerate personal health-related LLM research.
LSM-2: Learning from Incomplete Wearable Sensor Data
ArXiv.org · 2025-06-05 · 1 citation
preprint · Open access
Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM's core novelty lies in its use of learnable mask tokens to model both existing ("inherited") and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications.
A Framework for Automation in Psychotherapy
Current Directions in Psychological Science · 2025-11-07
article · Open access
Psychotherapy is a conversational intervention that has relied on humans to manage its implementation. Improvements in conversational artificial intelligence (AI) have accompanied speculation on how technologies might automate components of psychotherapy, most often the replacement of human therapists. However, there is a spectrum of opportunities for human collaboration with autonomous systems in psychotherapy, including evaluation, documentation, training, and assistance. Clarity about what is being automated is necessary to understand the affordances and limitations of specific technologies. As a guidepost for empirical and ethical inquiry, we present a framework for categories of autonomous systems in psychotherapy. Categories include scripted or rule-based conversations, collaborative systems where humans are evaluated by, supervise, or are assisted by AI, and agents that generate interventions. These categories highlight considerations for key stakeholders as psychotherapy moves from unmediated human-to-human conversation to various forms of automation.
Correction: Personalization Strategies for Increasing Engagement With Digital Mental Health Resources: Sequential Multiple Assignment Randomized Trial (Preprint)
2025-11-28
preprint · Open access
Gender Differences in Trajectories of Depressive Symptoms Among Talkspace Clients: Naturalistic Observational Study
JMIR Formative Research · 2025-12-03
article · Open access · Senior author
Background: Gender minority populations experience an increased risk of depression and report significant barriers to accessing mental health services. While digital mental health (DMH) technologies may address barriers, it remains unclear how gender minority clients engage with DMH services and if DMH improves their clinical outcomes. Objective: This naturalistic study explored gender differences in 15-week clinical outcomes of clients receiving technology-mediated psychotherapy from a large DMH provider. Methods: This study used observational data of clients who signed up for Talkspace (Talkspace, Inc) between February 2017 and July 2021. The analytic sample included Talkspace clients (N=20,156) with a baseline 8-item Patient Health Questionnaire (PHQ-8) score ≥10. Participants completed at least 2 PHQ-8 assessments over 15 weeks of treatment. Multilevel linear models tested gender differences in depressive symptom trajectories over the course of treatment (model 1) while also controlling for baseline PHQ-8 scores (model 2) and treatment engagement indicators (model 3). Sensitivity analyses reestimated model 2 among clients who submitted a PHQ-8 survey during the week 15 assessment period and among those who discontinued treatment beforehand. Reasons for service cancellation were also described for the latter group. Gender differences in secondary clinical outcomes were examined via chi-square and Fisher exact tests. Results: In all models, there were significant week-by-gender interactions. When controlling for baseline PHQ-8 scores, rates of symptom change were significantly slower for gender-diverse participants (b=0.60; P<.001), nonbinary participants (b=0.81; P<.001), and transgender women (b=0.87; P=.007), but not for women (P=.98) or transgender men (P=.38) compared to men. By week 15, adjusted PHQ-8 scores declined 8.7 points for both men and women, versus 4.4-7.4 points for gender minority clients. 
Sensitivity analyses indicated attenuated symptom improvement among week-15 completers, with transgender women showing the slowest changes (b=0.76; P=.02). Among earlier dropouts, weekly symptom reductions were steep overall (eg, week 3: b=-4.06, P<.001; week 6: b=-2.31, P<.001) while certain gender minority subgroups worsened (eg, adjusted scores for transgender women increased from 15.41 at baseline to 16.08 at final week 3 PHQ-8 survey submissions). Cancellation data (3450/20,156, 17.12%) confirmed discontinuation reasons related to both symptom improvement (928/3691 reasons, 25.14%) and potential barriers to treatment engagement (eg, cost: 1431/3691, 38.77%; poor service fit or poor perceived effectiveness: 677/3691, 18.34%). Gender differences were observed in rates of treatment response (weeks 3-12; all P≤.02), symptom remission (weeks 3, 6, 9, and 15; all P≤.047), and clinically significant symptom reduction (all time points, all P≤.03). Symptom deterioration did not differ by gender (all P>.05). Conclusions: While clinical outcomes generally improved over time among clients engaged in technology-mediated psychotherapy, some gender minority populations experienced slower improvements. Future research may explore strategies to adapt DMH interventions to better meet the needs of diverse gender identities.
Substance over Style: Evaluating Proactive Conversational Coaching Agents
2025-01-01 · 2 citations
article · Senior author
Reddit Rules and Rulers: Quantifying the Link Between Rules and Perceptions of Governance across Thousands of Communities
ArXiv.org · 2025-01-24
preprint · Open access · Senior author
Rules are a critical component of the functioning of nearly every online community, yet it is challenging for community moderators to make data-driven decisions about what rules to set for their communities. The connection between a community's rules and how its membership feels about its governance is not well understood. In this work, we conduct the largest-to-date analysis of rules on Reddit, collecting a set of 67,545 unique rules across 5,225 communities which collectively account for more than 67% of all content on Reddit. More than just a point-in-time study, our work measures how communities change their rules over a 5+ year period. We develop a method to classify these rules using a taxonomy of 17 key attributes extended from previous work. We assess what types of rules are most prevalent, how rules are phrased, and how they vary across communities of different types. Using a dataset of communities' discussions about their governance, we are the first to identify the rules most strongly associated with positive community perceptions of governance: rules addressing who participates, how content is formatted and tagged, and rules about commercial activities. We conduct a longitudinal study to quantify the impact of adding new rules to communities, finding that after a rule is added, community perceptions of governance immediately improve, yet this effect diminishes after six months. Our results have important implications for platforms, moderators, and researchers. We make our classification model and rules datasets public to support future research on this topic.
How Conversational Structure and Style Shape Online Community Experiences
ArXiv.org · 2025-08-12
preprint · Open access
Sense of Community (SOC) is vital to individual and collective well-being. Although social interactions have moved increasingly online, still little is known about the specific relationships between the nature of these interactions and Sense of Virtual Community (SOVC). This study addresses this gap by exploring how conversational structure and linguistic style predict SOVC in online communities, using a large-scale survey of 2,826 Reddit users across 281 varied subreddits. We develop a hierarchical model to predict self-reported SOVC based on automatically quantifiable and highly generalizable features that are agnostic to community topic and that describe both individual users and entire communities. We identify specific interaction patterns (e.g., reciprocal reply chains, use of prosocial language) associated with stronger communities and identify three primary dimensions of SOVC within Reddit -- Membership & Belonging, Cooperation & Shared Values, and Connection & Influence. This study provides the first quantitative evidence linking patterns of social interaction to SOVC and highlights actionable strategies for fostering stronger community attachment, using an approach that can generalize readily across community topics, languages, and platforms. These insights offer theoretical implications for the study of online communities and practical suggestions for the design of features to help more individuals experience the positive benefits of online community participation.
Human Decision-making is Susceptible to AI-driven Manipulation
ArXiv.org · 2025-02-11 · 5 citations
preprint · Open access
AI systems are increasingly intertwined with daily life, assisting users with various tasks and guiding decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized between-subjects experiment with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) equipped with established psychological tactics, allowing it to select and apply them adaptively during interactions to reach its hidden objectives. By analyzing participants' preference ratings, we found significant susceptibility to AI-driven manipulation. Particularly across both decision-making domains, interacting with the manipulative agents significantly increased the odds of rating hidden incentives higher than optimal options (Financial, MA: OR=5.24, SEMA: OR=7.96; Emotional, MA: OR=5.52, SEMA: OR=5.71) compared to the NA group. Notably, we found no clear evidence that employing psychological strategies (SEMA) was overall more effective than simple manipulative objectives (MA) on our primary outcomes. Hence, AI-driven manipulation could become widespread even without requiring sophisticated tactics and expertise. While our findings are preliminary and derived from hypothetical, low-stakes scenarios, we highlight a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to protect human autonomy.
RADAR: Benchmarking Language Models on Imperfect Tabular Data
ArXiv.org · 2025-06-09
preprint · Open access
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.

Recent grants

Enhancing Engagement with Digital Mental Health Care
NIH · $873k · 2020–2024
Enhancing Engagement with Digital Mental Health Care
NIH · $2.3M · 2020–2025

Frequent coauthors

Adam S. Miner
33 shared
Inna Wanyin Lin
30 shared
Jure Leskovec
Stanford University
29 shared
Kevin Rushton
Mental Health America
21 shared
Ashish Sharma
UNSW Sydney
20 shared
David Wadden
20 shared
Theresa Nguyen
18 shared
Jina Suh
17 shared

Education

Ph.D.
Stanford University
M.S.
Stanford University
B.S.
University of Kaiserslautern, Germany

Awards & honors

SAP Stanford Graduate Fellowship
Fulbright scholarship
German Academic Exchange Service scholarship
German National Merit Foundation scholarship
Best Paper Award by the International Medical Informatics As…
