Catherine Blake

Professor · Verified

University of Illinois Urbana-Champaign · Information Sciences

Active 1958–2025

h-index: 17

Citations: 12.7k

Papers: 100 (13 in the last 5 years)

Funding: $841k

About

Catherine Blake is a professor in the School of Information Sciences and Health Innovation Professor in the Carle Illinois College of Medicine at the University of Illinois Urbana-Champaign (UIUC). She holds a courtesy appointment in the Siebel School of Computing and Data Science and is affiliated with the National Center for Supercomputing Applications (NCSA), Informatics, the Center for Health Informatics, and the Personalized Nutrition Initiative (PNI). Her research explores biomedical informatics, natural language processing, evidence-based discovery, learning health systems, socio-technical systems, data analytics, and literature-based discovery. Blake's work centers on synthesizing evidence from text using both human and automated methods, with a focus on extracting key findings from empirical studies in medicine, toxicology, and epidemiology through the Claim Framework she developed. She has industry experience as a software developer and nearly two decades of research experience in text mining and natural language processing, primarily in biomedicine. Blake earned her master's and doctoral degrees in information and computer science at the University of California, Irvine, and her bachelor's and master's degrees in computer science at the University of Wollongong. She currently teaches courses such as Independent Study (IS589CLB) and has led multiple research projects, including the Midwest Big Data Hub and various initiatives in evidence-based discovery and data analytics.

Research topics

Computer Science
Artificial Intelligence
Data Mining
Data Science
Machine Learning
Information Retrieval
Natural Language Processing
Linguistics
Engineering
World Wide Web
Risk analysis (engineering)

Selected publications

In the Weeds: Entity Detection for Plant Based Foods
Proceedings of the Association for Information Science and Technology · 2025-10-01
Article · Open access · 1st author · Corresponding
Funding for nutrition research in the United States is less than 5% of the National Institutes of Health budget, so nutrition researchers often turn to published work. This provides an ideal environment for text mining, where entity detection is the task of finding food mentions in text and entity linking connects each food expression to a specific food. For example, the system should harmonize the expressions soybean, soy bean, soya bean and the scientific name Glycine max (L) Merrill along with their plural forms to a single concept soybean. However, the system must not harmonize soybean with soy sprouts because these different forms of soya foods have very different nutritional profiles. Despite the numerous food ontologies available, our work on developing a gold standard for food entities revealed a unique set of challenges that would limit the utility of automated extraction for nutrition researchers.
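The harmonization example in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the synonym table, the function names `normalize` and `link_food`, and the naive plural rule are all assumptions made for illustration, and a real pipeline would likely draw on a food ontology instead of a hand-written dictionary.

```python
import re
from typing import Optional

# Hypothetical synonym table built from the expressions named in the abstract.
# Each surface form maps to one concept; "soy sprout" stays a separate concept.
SYNONYMS = {
    "soybean": "soybean",
    "soy bean": "soybean",
    "soya bean": "soybean",
    "glycine max (l) merrill": "soybean",
    "soy sprout": "soy sprout",
}


def normalize(mention: str) -> str:
    """Lowercase the mention and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", mention.strip().lower())


def link_food(mention: str) -> Optional[str]:
    """Return the concept for a food mention, or None if it is unknown."""
    text = normalize(mention)
    if text in SYNONYMS:
        return SYNONYMS[text]
    # Naive plural handling (an assumption): drop one trailing "s".
    if text.endswith("s") and text[:-1] in SYNONYMS:
        return SYNONYMS[text[:-1]]
    return None


if __name__ == "__main__":
    for mention in ["Soya beans", "Glycine max (L) Merrill", "soy sprouts", "tofu"]:
        print(mention, "->", link_food(mention))
```

Keeping soy sprouts as its own concept reflects the abstract's point that different soya foods must not be merged.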
PrivacyChat: Utilizing Large Language Model for Fine-Grained Information Extraction over Privacy Policies
Lecture notes in computer science · 2024-01-01 · 2 citations
Book chapter
Uncertainty analysis on support vector machine for measuring organizational factors in probabilistic risk assessment of nuclear power plants
Progress in Nuclear Energy · 2022 · 4 citations
- Computer Science
- Data Mining
Human‐Driven Models: A Case Study of Geologists as They Engage with Data for Decision Making
Proceedings of the Association for Information Science and Technology · 2022-10-01
Article · Senior author · Corresponding
The value of geologic data is well established and demonstrated by efforts such as EarthChem and EarthCube. Although these communities are active in the documentation and preservation of geologic data, more work is needed to understand how geologists use this data to address specific problems. In this preliminary analysis, we focus on the information behaviors of professional geologists as they engage with multiple data streams to make decisions. Using semi‐structured interviews and grounded theory, our findings document how a single data point can drive changes to existing models. Responses also show that geologists view their experiences in data collection as critical and they use their knowledge and experience to iteratively re‐assess the context and fitness of their data as they search for coherent interpretations that resolve data‐model conflicts.
Designing a Database to Manage Uncertain Lithologic and Stratigraphic Picks
Abstracts with programs - Geological Society of America · 2021-01-01
Article
Using semantics to scale up evidence-based chemical risk-assessments
PLoS ONE · 2021-12-15
Article · Open access · 1st author · Corresponding
BACKGROUND: The manual processes used for risk assessments are not scaling to the amount of data available. Although automated approaches appear promising, they must be transparent in a public policy setting. OBJECTIVE: Our goal is to create an automated approach that moves beyond retrieval to the extraction step of the information synthesis process, where evidence is characterized as supporting, refuting, or neutral with respect to a given outcome. METHODS: We combine knowledge resources and natural language processing to resolve coordinated ellipses and thus avoid surface level differences between concepts in an ontology and outcomes in an abstract. As with a systematic review, the search criteria and the inclusion and exclusion criteria are explicit. RESULTS: The system scales to 482K abstracts on 27 chemicals. Results for three endpoints that are critical for cancer risk assessments show that refuting evidence (where the outcome decreased) was higher for cell proliferation (45.9%), and general cell changes (37.7%) than for cell death (25.0%). Moreover, cell death was the only endpoint where supporting claims were the majority (61.3%). If the number of abstracts that measure an outcome was used as a proxy for association there would be a stronger association with cell proliferation than cell death (20/27 chemicals). However, if the amount of supporting evidence was used (where the outcome increased) the conclusion would change for 21/27 chemicals (20 from proliferation to death and 1 from death to proliferation). CONCLUSIONS: We provide decision makers with a visual representation of supporting, neutral, and refuting evidence whilst maintaining the reproducibility and transparency needed for public policy. Our findings show that results from the retrieval step where the number of abstracts that measure an outcome are reported can be misleading if not accompanied with results from the extraction step where the directionality of the outcome is established.
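The contrast this abstract draws between counting abstracts and counting evidence by direction can be sketched roughly as follows. The function name `evidence_shares` and the sample labels are hypothetical and are not data from the paper; the sketch only shows how per-direction shares might be tallied once each claim has been labeled.

```python
from collections import Counter
from typing import Dict, Iterable

DIRECTIONS = ("supporting", "neutral", "refuting")


def evidence_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of claims in each direction for one endpoint."""
    counts = Counter(labels)
    total = sum(counts[d] for d in DIRECTIONS)
    if total == 0:
        return {d: 0.0 for d in DIRECTIONS}
    return {d: counts[d] / total for d in DIRECTIONS}


if __name__ == "__main__":
    # Hypothetical labels for one chemical and one endpoint (not from the paper).
    labels = ["supporting", "supporting", "supporting", "refuting"]
    print(len(labels), "abstracts measure the outcome")
    print(evidence_shares(labels))
```

A raw count treats all four abstracts alike, while the shares show that one of them reports the outcome decreasing.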
Finding Keystone Citations for Constructing Validity Chains among Research Papers
Companion Proceedings of the The Web Conference 2018 · 2021 · 2 citations
Senior author · Corresponding
- Computer Science
- Artificial Intelligence
New discoveries in science are often built upon previous knowledge. Ideally, such dependency information should be made explicit in a scientific knowledge graph. The Keystone Framework was proposed for tracking the validity dependency among papers. A keystone citation indicates that the validity of a given paper depends on a previously published paper it cites. In this paper, we propose and evaluate a strategy that repurposes rhetorical category classifiers for the novel application of extracting keystone citations that relate to research methods. Five binary rhetorical category classifiers were constructed to identify Background, Objective, Methods, Results, and Conclusions sentences in biomedical papers. The resulting classifiers were used to test the strategy against two datasets. The initial strategy assumed that only citations contained in Methods sentences were methods keystone citations, but our analysis revealed that citations contained in sentences classified as either Methods or Results had a high likelihood to be methods keystone citations. Future work will focus on fine tuning the rhetorical category classifiers, experimenting with multiclass classifiers, evaluating the revised strategy with more data, and constructing a larger gold standard citation context sentence dataset for model training.
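The selection rule described in this abstract can be sketched in a few lines. This is an illustrative assumption and not the paper's implementation: it presumes each sentence already carries a predicted rhetorical category, and that citations appear as bracketed numbers such as `[3]`. The name `keystone_candidates` is hypothetical.

```python
import re
from typing import FrozenSet, Iterable, List, Tuple

# Assumed citation style: numeric markers in square brackets, e.g. "[12]".
CITATION = re.compile(r"\[(\d+)\]")


def keystone_candidates(
    sentences: Iterable[Tuple[str, str]],
    categories: FrozenSet[str] = frozenset({"Methods", "Results"}),
) -> List[str]:
    """Collect citation markers from sentences whose predicted category is selected.

    `sentences` holds (text, predicted_category) pairs. The default follows the
    revised rule (Methods or Results); pass frozenset({"Methods"}) for the
    initial rule.
    """
    found: List[str] = []
    for text, label in sentences:
        if label not in categories:
            continue
        for ref in CITATION.findall(text):
            if ref not in found:  # keep first-seen order, no duplicates
                found.append(ref)
    return found


if __name__ == "__main__":
    example = [
        ("We followed the protocol in [3].", "Methods"),
        ("Earlier studies [1] motivate this work.", "Background"),
        ("Accuracy was comparable to [7] and [3].", "Results"),
    ]
    print(keystone_candidates(example))
    print(keystone_candidates(example, frozenset({"Methods"})))
```

Passing the category set as a parameter makes it easy to compare the initial and revised rules on the same sentences.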
The Reproducible Data Reuse (ReDaR) Framework to Capture and Assess Multiple Data Streams
Proceedings of the Association for Information Science and Technology · 2021-10-01
Article · Senior author · Corresponding
Much of the literature in knowledge discovery from data (KDD) focuses on algorithms that are faster and more accurate at capturing patterns in a given data set. However, answering a research question is fundamentally connected with how well the data is aligned with the questions being asked. Thus, data selection is one of the most important steps to ensure that models produced from the KDD process are useful in practice. A lack of documentation about the data selection rationale and the transformations needed to semantically align the data streams prevents others from reproducing the research and obfuscates development of best practices in data integration. Our goal in this paper is to provide KDD practitioners with a framework that brings together theories in provenance, information quality, and contextual reasoning, to enable researchers to achieve a semantically aligned dataset with data selection, description, and documentation based on an application‐focused assessment.
Real Life Experience of the Use of CDK 4/6 Inhibitors in the First Line Treatment of Metastatic ER+ Her2– Breast Cancer: Focus in Neutropenia and Dose Reduction
Clinical Oncology · 2020-07-08
Article · 1st author · Corresponding
Data-theoretic approach for socio-technical risk analysis: Text mining licensee event reports of U.S. nuclear power plants
Safety Science · 2020 · 25 citations
- Data Mining
- Computer Science

Recent grants

Towards Evidence-Based Discovery
NSF · $392k · 2009–2012
Towards Evidence-Based Discovery
NSF · $449k · 2008–2011

Frequent coauthors

Michael B. Twidale
University of Illinois Urbana-Champaign
11 shared
Jeffrey M. Stanton
Syracuse University
10 shared
Suzie Allard
University of Tennessee at Knoxville
9 shared
Caryn L. Anderson
United States Department of Veterans Affairs
9 shared
Carole L. Palmer
University of Washington
9 shared
Ana Lučić
University of Illinois Urbana-Champaign
8 shared
Maria Souden
8 shared
Wanda Pratt
7 shared

Education

B.S., Computer Science
University of Wollongong
M.A., Computer Science
University of Wollongong
M.S., Information and Computer Science
University of California, Irvine
Ph.D., Information and Computer Science
University of California, Irvine
