Ilan Shomorony

Assistant Professor, Electrical and Computer Engineering · Verified

University of Illinois Urbana-Champaign · Computer Science

Active 2007–2026

h-index: 19

Citations: 1.3k

Papers: 147 (71 in the last 5 years)

Funding: $984k (1 active)



About

Ilan Shomorony is an Assistant Professor in the Electrical and Computer Engineering department at the University of Illinois Urbana-Champaign, affiliated with the Siebel School of Computing and Data Science. His research focuses on applying principles from digital communication processes to biological data, specifically DNA sequencing and genomic data problems. He has earned recognition for his work, including the NSF CAREER Award in 2021, which highlights his contributions to understanding complex data challenges in genomics. Shomorony's academic background includes teaching courses related to data science, information theory, and signal processing. His research interests encompass areas such as information theory, data science, and the intersection of communication systems with biological data analysis. He is actively involved in advancing computational methods for biological data storage and sequencing, contributing to the broader field of computing and data science through innovative research and application.

Research topics

Artificial Intelligence
Computer Science
Genetics
Biology
Machine Learning
Computer Security
Computer hardware
Medicine
Pathology
Internal medicine
Bioinformatics
Algorithm
Computational biology

Selected publications

Recovering a Message from an Incomplete Set of Noisy Fragments
IEEE Transactions on Information Theory · 2026-01-01
preprint · Open access · Senior author
We consider the problem of communicating over a channel that breaks the message block into fragments of random lengths, shuffles them out of order, and deletes a random fraction of the fragments. Such a channel is motivated by applications in molecular data storage and forensics, and we refer to it as the torn-paper channel. We characterize the capacity of this channel under arbitrary i.i.d. fragment length distributions and deletion probabilities. Precisely, we show that the capacity is given by a closed-form expression that can be interpreted as F−A, where F is the coverage fraction, i.e., the fraction of the input codeword that is covered by output fragments, and A is an alignment cost incurred due to the lack of ordering in the output fragments. We then consider a noisy version of the problem, where the fragments are corrupted by binary symmetric noise. We derive upper and lower bounds to the capacity, both of which can be seen as F−A expressions. These bounds match for specific choices of fragment length distributions, and they are approximately tight in cases where there are not too many short fragments.
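The channel described in this abstract can be simulated in a few lines. The sketch below is illustrative only and is not taken from the paper: it assumes geometrically distributed fragment lengths (the abstract allows arbitrary i.i.d. length distributions), and the names `torn_paper_channel` and `coverage_fraction` are hypothetical. It estimates only the coverage fraction F; the alignment cost A is not computed because its closed form is not given here.

```python
import random

def torn_paper_channel(message, mean_len, delete_prob, rng):
    """Tear `message` into fragments, delete some, and shuffle the rest.

    Assumption: fragment lengths are i.i.d. geometric with mean `mean_len`.
    """
    fragments = []
    start = 0
    while start < len(message):
        # Geometric length on {1, 2, ...} with success probability 1/mean_len.
        length = 1
        while rng.random() > 1.0 / mean_len:
            length += 1
        # The final fragment is truncated at the end of the message.
        fragments.append(message[start:start + length])
        start += length
    # Each fragment is deleted independently with probability delete_prob.
    kept = [f for f in fragments if rng.random() >= delete_prob]
    rng.shuffle(kept)  # the receiver sees the surviving fragments out of order
    return kept

def coverage_fraction(message, received):
    """Fraction of input symbols covered by surviving fragments (F)."""
    return sum(len(f) for f in received) / len(message)

# Example run with hypothetical parameters.
rng = random.Random(0)
msg = "".join(rng.choice("01") for _ in range(1000))
out = torn_paper_channel(msg, mean_len=20, delete_prob=0.3, rng=rng)
print(f"{len(out)} fragments received, coverage F = {coverage_fraction(msg, out):.3f}")
```

Because deletions act on whole fragments, the empirical F fluctuates around 1 minus the deletion probability, more noticeably when fragments are long relative to the message.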
Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling
ArXiv.org · 2026-01-29
article · Open access · Senior author
Cell Painting is a microscopy-based, high-content imaging assay that produces rich morphological profiles of cells and can support drug discovery by quantifying cellular responses to chemical perturbations. At scale, however, Cell Painting data is strongly affected by batch effects arising from differences in laboratories, instruments, and protocols, which can obscure biological signal. We present BALANS (Batch Alignment via Local Affinities and Subsampling), a scalable batch-correction method that aligns samples across batches by constructing a smoothed affinity matrix from pairwise distances. Given $n$ data points, BALANS builds a sparse affinity matrix $A \in \mathbb{R}^{n \times n}$ using two ideas. (i) For points $i$ and $j$, it sets a local scale using the distance from $i$ to its $k$-th nearest neighbor within the batch of $j$, then computes $A_{ij}$ via a Gaussian kernel calibrated by these batch-aware local scales. (ii) Rather than forming all $n^2$ entries, BALANS uses an adaptive sampling procedure that prioritizes rows with low cumulative neighbor coverage and retains only the strongest affinities per row, yielding a sparse but informative approximation of $A$. We prove that this sampling strategy is order-optimal in sample complexity and provides an approximation guarantee, and we show that BALANS runs in nearly linear time in $n$. Experiments on diverse real-world Cell Painting datasets and controlled large-scale synthetic benchmarks demonstrate that BALANS scales to large collections while improving runtime over native implementations of widely used batch-correction methods, without sacrificing correction quality.
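Idea (i) in this abstract can be sketched as a small dense computation. This is not the BALANS implementation: the abstract does not state the exact kernel, so the form exp(-d(i,j)^2 / (sigma_i(batch of j) * sigma_j(batch of i))) is an assumption, the function name `batch_aware_affinity` is hypothetical, and the adaptive sampling and sparsification of idea (ii) are omitted. Every batch is assumed to contain at least two points.

```python
import math

def batch_aware_affinity(points, batch, k):
    """Dense Gaussian affinity with batch-aware local scales (illustrative).

    sigma[i][b] is the distance from point i to its k-th nearest neighbor
    among the points of batch b, excluding i itself.
    Assumed kernel: A[i][j] = exp(-d_ij^2 / (sigma[i][batch[j]] * sigma[j][batch[i]])).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    labels = set(batch)
    sigma = []
    for i in range(n):
        row = {}
        for b in labels:
            d = sorted(dist[i][j] for j in range(n) if batch[j] == b and j != i)
            row[b] = d[min(k, len(d)) - 1]  # fall back to the farthest point in small batches
        sigma.append(row)
    eps = 1e-12  # avoids division by zero when points coincide
    return [
        [
            math.exp(-dist[i][j] ** 2 / (sigma[i][batch[j]] * sigma[j][batch[i]] + eps))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Two well-separated batches of three points each (made-up data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
batch = ["a", "a", "a", "b", "b", "b"]
A = batch_aware_affinity(points, batch, k=2)
print(round(A[0][1], 3), round(A[0][3], 3))
```

Because the scale for a cross-batch pair is measured into the other batch, cross-batch affinities stay comparable in size to within-batch ones even when the batches are far apart, which is the effect a batch-aware scale is meant to produce.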
On the Capacity of Noisy Frequency-based Channels
ArXiv.org · 2026-01-15
article · Open access
We investigate the capacity of noisy frequency-based channels, motivated by DNA data storage in the short-molecule regime, where information is encoded in the frequency of item types rather than their order. The channel output is a histogram formed by random sampling of items, followed by noisy item identification. While the capacity of the noiseless frequency-based channel has been previously addressed, the effect of identification noise has not been fully characterized. We present a converse bound on the channel capacity that follows from stochastic degradation and the data processing inequality. We then establish an achievable bound, which is based on a Poissonization of the multinomial sampling process, and an analysis of the resulting vector Poisson channel with inter-symbol interference. This analysis refines concentration inequalities for the information density used in the Feinstein bound, and explicitly characterizes an additive loss in the mutual information due to identification noise. We apply our results to a DNA storage channel in the short-molecule regime, and quantify the resulting loss in the scaling of the total number of reliably stored bits.
Guaranteed Recovery of Unambiguous Clusters
ArXiv.org · 2025-01-22
preprint · Open access · Senior author
Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the $K$-clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.
Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach
IEEE Robotics and Automation Letters · 2025-10-03
preprint · Open access
We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into $r$ disjoint sets ($r$ = capacity) that maximise diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.
Fundamental Limits of Non-Adaptive Group Testing with Markovian Correlation
ArXiv.org · 2025-01-22
preprint · Open access · Senior author
We study a correlated group testing model where items are infected according to a Markov chain, which creates bursty infection patterns. Focusing on a very sparse infection regime, we propose a non-adaptive testing strategy with an efficient decoding scheme that is nearly optimal. Specifically, it achieves asymptotically vanishing error with a number of tests that is within a $1/\ln(2) \approx 1.44$ multiplicative factor of the fundamental entropy bound, a result that parallels the independent group testing setting. We show that the number of tests decreases as the expected burst length of infected items increases, quantifying the advantage of exploiting correlation in test design.
Fragmentation in Data Deduplication Systems II: The Jump Metric
2025-06-22
article
Data deduplication refers to a collection of data processing strategies that aim to remove repeated data chunks stored by different users. Despite providing excellent storage savings, deduplication can lead to severe file fragmentation issues: data chunks of the same file may be stored at distal locations on the server, making reconstruction time-consuming. Here, we continue our analytical study of uncoded and coded deduplication methods with reduced fragmentation levels. We model files as self-avoiding (simple) paths in specialized graphs whose nodes correspond to data chunks. To measure the level of fragmentation, we introduce the jump metric which captures the worst-case number of times during the reconstruction process of a file that one has to change the readout location on the server. We derive lower and upper bounds on the degree of jump fragmentation, and provide a new algorithm for computing the jump number of hierarchical data structures captured by trees. We also present examples that show how repetition and coded redundancy in chunk stores can reduce jump fragmentation.
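The jump metric can be made concrete with a toy calculation. The sketch below is a simplified reading of the abstract, not the paper's formal definition: it assumes an uncoded store in which each chunk appears exactly once in a linear order, and it counts a jump whenever the next chunk of a file is not stored immediately after the current one. The names `jump_count` and `worst_case_jumps` are hypothetical.

```python
def jump_count(file_chunks, position):
    """Number of times the readout location must change to read one file in order."""
    return sum(
        1
        for cur, nxt in zip(file_chunks, file_chunks[1:])
        if position[nxt] != position[cur] + 1  # next chunk is not the adjacent one
    )

def worst_case_jumps(files, store):
    """Largest jump count over all files for a given linear chunk layout."""
    position = {chunk: idx for idx, chunk in enumerate(store)}
    return max(jump_count(f, position) for f in files)

# Made-up example: four deduplicated chunks shared by three files.
store = ["a", "b", "c", "d"]
files = [["a", "b", "c"], ["a", "c", "d"], ["c", "a", "d"]]
print(worst_case_jumps(files, store))  # → 2
```

In this toy layout the first file reads contiguously, the second needs one relocation (from "a" to "c"), and the third needs two, so the worst case is 2.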
Reducing Fragmentation in Data Deduplication Systems via Partial Repetition and Coding
IEEE Transactions on Information Theory · 2025-09-23
article
Data deduplication, one of the key features of modern Big Data storage devices, is the process of removing replicas of data chunks stored by different users. Despite the importance of deduplication, several drawbacks of the method, such as storage robustness and file fragmentation, have not been previously analyzed from a theoretical point of view. Storage robustness pertains to ensuring that deduplicated data can be used to reconstruct the original files without service disruptions and data loss. Fragmentation pertains to the problems of placing deduplicated data chunks of different user files in a proximity-preserving linear order, since neighboring chunks of the same file may be stored in sectors far apart on the server. This work proposes a new theoretical model for data fragmentation and introduces novel graph- and coding-theoretic approaches for reducing fragmentation via limited duplication (repetition coding) and coded deduplication (e.g., linear coding). In addition to alleviating issues with fragmentation, limited duplication and coded deduplication can also serve the dual purpose of increasing the robustness of the system design. The contributions of our work are three-fold. First, we describe a new model for file structures of the form of self-avoiding (simple) paths in specialized graphs. Second, we introduce several new metrics for measuring the fragmentation level in deduplication systems on graph-structured files, including the stretch metric, which captures the worst-case “spread” of adjacent data chunks within a file when deduplicated and placed on the server, and the jump metric, which captures the worst-case number of times during the reconstruction process of a file that one has to change the readout location on the server. For the stretch metric, we establish a connection between the level of fragmentation and the bandwidth of the file-graph. In particular, we derive lower and upper bounds on the degree of fragmentation and describe instances of the problem where repetition and coding reduce fragmentation. The key ideas behind our approach are graph folding and information-theoretic arguments coupled with graph algorithms such as matching. For the jump metric, we provide a new algorithm for computing the jump number of hierarchical data structures captured by trees. Third, we describe how controlled repetition and coded redundancy added after deduplication can ensure valuable trade-offs between the storage volume and the degree of fragmentation.

Recent grants

CAREER: Genomic Data Science: From Informational Limits to Efficient Algorithms
NSF · $500k · 2021–2027
CIF: Small: Fundamental Limits of DNA-Based Storage
NSF · $484k · 2020–2024

Frequent coauthors

A. Salman Avestimehr
University of Southern California
31 shared
David Tse
27 shared
Reinhard Heckel
23 shared
Mikel Hernáez
Roche (France)
23 shared
Idoia Ochoa
Universidad de Navarra
23 shared
Thomas A. Courtade
University of California, Berkeley
19 shared
Govinda M. Kamath
10X Genomics (United States)
19 shared
Alireza Vahid
Rochester Institute of Technology
17 shared

Awards & honors

Research Honors NSF CAREER Award (2021)
List of Teachers Ranked as Excellent by Their Students, ECE…
