Sorin Istrail

James A. and Julie N. Brown Professor of Computational and Mathematical Sciences

Brown University · Computer Science

Active 1978–2022

h-index: 54

Citations: 27.8k

Papers: 238 (10 in the last 5 years)

Funding: $1.5M


OpenAlex


About

Sorin Istrail is the James A. and Julie N. Brown Professor of Computational and Mathematical Sciences and Professor of Computer Science at Brown University. He is associated with the Istrail Laboratory, a research group focused on computational biology and computer science within the Department of Computer Science at Brown University. His work involves research in genomics, including the sequencing of the human genome, whole genome shotgun assembly and comparison of human genome assemblies, immune peptidomics of humans and their pathogens, and the genomic study of the sea urchin, including its genome and transcriptome. His research emphasizes understanding the logic functions of the genomic cis-regulatory code, contributing to the broader field of computational molecular biology.

Research topics

Artificial Intelligence
Computer Science
Data Mining
Engineering
Genetics
Bioinformatics
Biology
Data science
Computational biology
Theoretical computer science
Management science

Selected publications

Special Issue: Professor Michael Waterman's 80th Birthday, Part 1
Journal of Computational Biology · 2022-06-15
article · 1st author
Michael Waterman's Contributions to Computational Biology and Bioinformatics
Journal of Computational Biology · 2022-06-21 · 2 citations
review · Senior author
On the occasion of Dr. Michael Waterman's 80th birthday, we review his major contributions to the field of computational biology and bioinformatics including the famous Smith-Waterman algorithm for sequence alignment, the probability and statistics theory related to sequence alignment, algorithms for sequence assembly, the Lander-Waterman model for genome physical mapping, combinatorics and predictions of ribonucleic acid structures, word counting statistics in molecular sequences, alignment-free sequence comparison, and algorithms for haplotype block partition and tagSNP selection related to the International HapMap Project. His books Introduction to Computational Biology: Maps, Sequences and Genomes for graduate students and Computational Genome Analysis: An Introduction geared toward undergraduate students played key roles in computational biology and bioinformatics education. We also highlight his efforts of building the computational biology and bioinformatics community as the founding editor of the Journal of Computational Biology and a founding member of the International Conference on Research in Computational Molecular Biology (RECOMB).
Computational Advances in Bio and Medical Sciences
Lecture notes in computer science · 2021 · 2 citations
1st author · Corresponding
- Computer Science
- Artificial Intelligence
Combinatorial and statistical prediction of gene expression from haplotype sequence
Bioinformatics · 2020-05-01 · 2 citations
article · Open access
MOTIVATION: Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation. RESULTS: In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm that models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes (r² > 0.1) among all methods. We show that variant and haplotype features selected by HAPLEXR are smaller in size than competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage dependent and intragenic epistatic effects when predicting expression. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available at https://github.com/rapturous/HAPLEX. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Proteinarium: Multi-sample protein-protein interaction analysis and visualization tool
Genomics · 2020 · 18 citations
- Data Mining
- Computer Science
- Biology
We posit the likely architecture of complex diseases is that subgroups of patients share variants in genes in specific networks which are sufficient to give rise to a shared phenotype. We developed Proteinarium, a multi-sample protein-protein interaction (PPI) tool, to identify clusters of patients with shared gene networks. Proteinarium converts user defined seed genes to protein symbols and maps them onto the STRING interactome. A PPI network is built for each sample using Dijkstra's algorithm. Pairwise similarity scores are calculated to compare the networks and cluster the samples. A layered graph of PPI networks for the samples in any cluster can be visualized. To test this newly developed analysis pipeline, we reanalyzed publicly available data sets, from which modest outcomes had previously been achieved. We found significant clusters of patients with unique genes which enhanced the findings in the original study.
Invariant Patterns in Crystal Lattices: Implications for Protein Folding Algorithms
TUGraz OPEN Library (Graz University of Technology) · 2020-04-07 · 1 citation
article · Open access · 1st author · Corresponding
Preface Special Issue: RECOMB 2018
Journal of Computational Biology · 2020-03-01
article · 1st author · Corresponding
Proteinarium: Multi-Sample Protein-Protein Interaction Analysis and Visualization Tool
bioRxiv (Cold Spring Harbor Laboratory) · 2019-03-26 · 2 citations
preprint · Open access
Background: Data analysis has become crucial in the post genomic era where the accumulation of genomic information is mounting exponentially. Analyzing protein-protein interactions in the context of the interactome is a powerful approach to understanding disease phenotypes. Results: We describe Proteinarium, a multi-sample protein-protein interaction network analysis and visualization tool. Proteinarium can be used to analyze data for samples with dichotomous phenotypes, multiple samples from a single phenotype or a single sample. Then, by similarity clustering, the network-based relations of samples are identified and clusters of related samples are presented as a dendrogram. Each branch of the dendrogram is built based on network similarities of the samples. The protein-protein interaction networks can be analyzed and visualized on any branch of the dendrogram. Proteinarium’s input can be derived from transcriptome analysis, whole exome sequencing data or any high-throughput screening approach. Its strength lies in use of gene lists for each sample as a distinct input which are further analyzed through protein interaction analyses. Proteinarium output includes the gene lists of visualized networks and PPI interaction files where users can analyze the network(s) on other platforms such as Cytoscape. In addition, since the dendrogram is written in Newick tree format, users can visualize it in other software platforms like Dendroscope, ITOL. Conclusions: Proteinarium, through the analysis and visualization of PPI networks, allows researchers to make important observations on high throughput data for a variety of research questions. Proteinarium identifies significant clusters of patients based on their shared network similarity for the disease of interest and the associated genes. Proteinarium is a command-line tool written in Java with no external dependencies and it is freely available at https://github.com/Armanious/Proteinarium.
How Does the Regulatory Genome Work?
Journal of Computational Biology · 2019-06-05 · 7 citations
article · Open access · 1st author
The regulatory genome controls genome activity throughout the life of an organism. This requires that complex information processing functions are encoded in, and operated by, the regulatory genome. Although much remains to be learned about how the regulatory genome works, we here discuss two cases where regulatory functions have been experimentally dissected in great detail and at the systems level, and formalized by computational logic models. Both examples derive from the sea urchin embryo, but assess two distinct organizational levels of genomic information processing. The first example shows how the regulatory system of a single gene, endo16, executes logic operations through individual transcription factor binding sites and cis-regulatory modules that control the expression of this gene. The second example shows information processing at the gene regulatory network (GRN) level. The GRN controlling development of the sea urchin endomesoderm has been experimentally explored at an almost complete level. A Boolean logic model of this GRN suggests that the modular logic functions encoded at the single-gene level show compositionality and suffice to account for integrated function at the network level. We discuss these examples both from a biological-experimental point of view and from a computer science-informational point of view, as both illuminate principles of how the regulatory genome works.
Eric Davidson's Regulatory Genome for Computer Science: Causality, Logic, and Proof Principles of the Genomic cis-Regulatory Code
Journal of Computational Biology · 2019-07-01 · 6 citations
article · Open access · 1st author · Corresponding
"I think that it is a relatively good approximation to truth, which is much too complicated to allow anything but approximations, that mathematical ideas originate in empirics. But, once they are conceived, the subject begins to live a peculiar life of its own and is governed by almost entirely aesthetical motivations. In other words, at a great distance from its empirical source, or after much 'abstract' inbreeding, a mathematical subject is in danger of degeneration. Whenever this stage is reached the only remedy seems to me to be the rejuvenating return to the source: the reinjection of more or less directly empirical ideas." (John von Neumann, 1947)

Recent grants

III: Small: Genome-Wide Algorithms for Haplotype Reconstruction and Beyond: A Combined Haplotype Assembly and Identical-by-Descent Tracts Approach
NSF · $500k · 2013–2018
EAGER: Haplotype Phasing Algorithms and Clark Consistency Graphs
NSF · $200k · 2011–2012
The cisGRN Browser and Database: cis-Regulatory Information Behind the Network
NSF · $850k · 2007–2012

Frequent coauthors

Pavel A. Pevzner
University of California, San Diego
223 shared
Michael S. Waterman
219 shared
Roberto Tagliaferri
98 shared
Waraporn Tongprasit
75 shared
Manoj P. Samanta
74 shared
Viktor Štolc
Ames Research Center
74 shared
Eric H. Davidson
65 shared
Marie-France Sagot
Institut national de recherche en informatique et en automatique
64 shared
