
Dana Yang
Verified · Cornell University · Industrial and Labor Relations
Active 1998–2025
About
Dana Yang joined the Department of Statistics and Data Sciences at Cornell University in Spring 2022. She completed a Simons-Berkeley fellowship at UC Berkeley, participating in the Computational Complexity of Statistical Inference program. Prior to her appointment at Cornell, she was a postdoctoral associate at Duke University’s Fuqua School of Business. Yang holds a B.S. in mathematics from Tsinghua University and earned both an M.A. and a Ph.D. in statistics from Yale University. Her research focuses on statistical inference and optimization, with notable presentations including topics such as learner-private convex optimization and the planted matching problem. She has been recognized with the Simons-Berkeley Research Fellowship from the Simons Institute for the Theory of Computing at UC Berkeley.
Research topics
- Computer Science
- Artificial Intelligence
- Mathematics
- Combinatorics
- Quantum mechanics
- Mathematical analysis
- Physics
- Mathematical optimization
- Theoretical computer science
Selected publications
"All-Something-Nothing" Phase Transitions in Planted k-Factor Recovery
ArXiv.org · 2025-03-12
preprint · Open access · Senior author
This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erdős-Rényi random graph $G(n,λ/n)$. We uncover an interesting "all-something-nothing" phase transition. Specifically, we show that as the average degree $λ$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery ("all" phase) to partial recovery ("something" phase). Moreover, as $λ$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the "nothing" phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik who established that for certain sufficiently dense graphs, the problem undergoes an "all-or-nothing" phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $λ< 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $Θ(n)$ many small trees of size $Θ(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $Θ(\sqrt{n\log n})$ suffices.
Generative Adversarial Networks based on Parallel Structured Generators for Training Stability
Journal of Korea Multimedia Society · 2024-06-30
article · Open access · Senior author
Over the past few years, Generative Adversarial Networks (GANs) have experienced significant growth in various applications as generative models. However, training stability remains a challenge for GANs. To mitigate this problem, this paper proposes a novel GAN model that applies dual-parallelized generators. This study designs new methodologies by inputting three sets of data to the discriminator and updating the average of the loss values. Experimental results show that the proposed model shows an ideal convergence graph and reduces the loss by about 40%. The results also show an improvement in the quality of the generated data, with the model achieving stability during the training process.
Learner-Private Convex Optimization
IEEE Transactions on Information Theory · 2022 · 6 citations
Senior author · Corresponding
- Computer Science
- Artificial Intelligence
Convex optimization with feedback is a framework where a learner relies on iterative queries and feedback to arrive at the minimizer of a convex function. It has gained considerable popularity thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this paper, we study how to optimally obfuscate the learner’s queries in convex optimization with first-order feedback, so that their learned optimal value is provably difficult to estimate for an eavesdropping adversary. We consider two formulations of learner privacy: a Bayesian formulation in which the convex function is drawn randomly, and a maximin formulation in which the function is fixed and the adversary’s probability of error is measured with respect to a minimax criterion. Suppose that the learner wishes to ensure the adversary cannot estimate accurately with probability greater than $1/L$ for some $L > 0$. Our main results show that the query complexity overhead is additive in $L$ in the maximin formulation, but multiplicative in $L$ in the Bayesian formulation.
Compared to existing learner-private sequential learning models with binary feedback, our results apply to the significantly richer family of general convex functions with full-gradient feedback. Our proofs rely on tools from the theory of Dirichlet processes, as well as a novel strategy designed for measuring information leakage under a full-gradient oracle.
Is it easier to count communities than find them?
arXiv (Cornell University) · 2022-12-21
preprint · Open access · Senior author
Random graph models with community structure have been studied extensively in the literature. For both the problems of detecting and recovering community structure, an interesting landscape of statistical and computational phase transitions has emerged. A natural unanswered question is: might it be possible to infer properties of the community structure (for instance, the number and sizes of communities) even in situations where actually finding those communities is believed to be computationally hard? We show the answer is no. In particular, we consider certain hypothesis testing problems between models with different community structures, and we show (in the low-degree polynomial framework) that testing between two options is as hard as finding the communities. Our methods give the first computational lower bounds for testing between two different "planted" distributions, whereas previous results have considered testing between a planted distribution and an i.i.d. "null" distribution. We also show a formal relationship between the low-degree frameworks for recovery in a planted model and for testing two planted models.
Estimation of convex supports from noisy measurements
Bernoulli · 2021-04-14 · 3 citations
article · Senior author
A popular class of problems in statistics deals with estimating the support of a density from $n$ observations drawn at random from a $d$-dimensional distribution. In the one-dimensional case, if the support is an interval, the problem reduces to estimating its end points. In practice, an experimenter may only have access to a noisy version of the original data. Therefore, a more realistic model allows for the observations to be contaminated with additive noise. In this paper, we consider estimation of convex bodies when the additive noise is distributed according to a multivariate Gaussian (or nearly Gaussian) distribution, even though our techniques could easily be adapted to other noise distributions. Unlike standard methods in deconvolution that are implemented by thresholding a kernel density estimate, our method avoids tuning parameters and Fourier transforms altogether. We show that our estimator, computable in $(O(\log n))^{(d-1)/2}$ time, converges at a rate of $O_d(\log\log n/\log n)$ in Hausdorff distance, in accordance with the polylogarithmic rates encountered in Gaussian deconvolution problems. Part of our analysis also involves the optimality of the proposed estimator. We provide a lower bound for the minimax rate of estimation in Hausdorff distance that is $\Omega_d(1/\log^2 n)$.
Optimal query complexity for private sequential learning against eavesdropping
International Conference on Artificial Intelligence and Statistics · 2021-03-18 · 3 citations
article · Senior author
We study the query complexity of a learner-private sequential learning problem, motivated by the privacy and security concerns due to eavesdropping that arise in practical applications such as pricing and Federated Learning. A learner tries to estimate an unknown scalar value by sequentially querying an external database and receiving binary responses; meanwhile, a third-party adversary observes the learner's queries but not the responses. The learner's goal is to design a querying strategy with the minimum number of queries (optimal query complexity) so that she can accurately estimate the true value, while the eavesdropping adversary, even with complete knowledge of her querying strategy, cannot. We develop new querying strategies and analytical techniques and use them to prove tight upper and lower bounds on the optimal query complexity. The bounds almost match across the entire parameter range, substantially improving upon existing results. We thus obtain a complete picture of the optimal query complexity as a function of the estimation accuracy and the desired levels of privacy. We also extend the results to sequential learning models in higher dimensions, and where the binary responses are noisy. Our analysis leverages a crucial insight into the nature of the private learning problem, which suggests that the query trajectory of an optimal learner can be divided into distinct phases that focus on pure learning versus learning and obfuscation, respectively.
Learner-Private Convex Optimization
arXiv (Cornell University) · 2021-02-23 · 1 citation
preprint · Open access · Senior author
Convex optimization with feedback is a framework where a learner relies on iterative queries and feedback to arrive at the minimizer of a convex function. It has gained considerable popularity thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this paper, we study how to optimally obfuscate the learner's queries in convex optimization with first-order feedback, so that their learned optimal value is provably difficult to estimate for an eavesdropping adversary. We consider two formulations of learner privacy: a Bayesian formulation in which the convex function is drawn randomly, and a minimax formulation in which the function is fixed and the adversary's probability of error is measured with respect to a minimax criterion. Suppose that the learner wishes to ensure the adversary cannot estimate accurately with probability greater than $1/L$ for some $L>0$. Our main results show that the query complexity overhead is additive in $L$ in the minimax formulation, but multiplicative in $L$ in the Bayesian formulation. Compared to existing learner-private sequential learning models with binary feedback, our results apply to the significantly richer family of general convex functions with full-gradient feedback. Our proofs rely on tools from the theory of Dirichlet processes, as well as a novel strategy designed for measuring information leakage under a full-gradient oracle.
The planted matching problem: Sharp threshold and infinite-order phase transition
arXiv (Cornell University) · 2021 · 3 citations
Senior author · Corresponding
- Combinatorics
- Mathematics
- Physics
We study the problem of reconstructing a perfect matching $M^*$ hidden in a randomly weighted $n\times n$ bipartite graph. The edge set includes every node pair in $M^*$ and each of the $n(n-1)$ node pairs not in $M^*$ independently with probability $d/n$. The weight of each edge $e$ is independently drawn from the distribution $\mathcal{P}$ if $e \in M^*$ and from $\mathcal{Q}$ if $e \notin M^*$. We show that if $\sqrt{d} B(\mathcal{P},\mathcal{Q}) \le 1$, where $B(\mathcal{P},\mathcal{Q})$ stands for the Bhattacharyya coefficient, the reconstruction error (average fraction of misclassified edges) of the maximum likelihood estimator of $M^*$ converges to $0$ as $n\to \infty$. Conversely, if $\sqrt{d} B(\mathcal{P},\mathcal{Q}) \ge 1+ε$ for an arbitrarily small constant $ε>0$, the reconstruction error for any estimator is shown to be bounded away from $0$ under both the sparse and dense model, resolving the conjecture in [Moharrami et al. 2019, Semerjian et al. 2020]. Furthermore, in the special case of the complete exponentially weighted graph with $d=n$, $\mathcal{P}=\exp(λ)$, and $\mathcal{Q}=\exp(1/n)$, for which the sharp threshold simplifies to $λ=4$, we prove that when $λ\le 4-ε$, the optimal reconstruction error is $\exp\left( - Θ(1/\sqrt{ε}) \right)$, confirming the conjectured infinite-order phase transition in [Semerjian et al. 2020].
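The reduction of the threshold to $\lambda=4$ in the exponential special case can be checked by hand. The following is only a sketch under two assumptions that the abstract does not spell out: that $\exp(\mu)$ denotes the exponential distribution with rate $\mu$ (density $\mu e^{-\mu x}$ for $x \ge 0$), and that the Bhattacharyya coefficient is $B(\mathcal{P},\mathcal{Q}) = \int \sqrt{p(x)\,q(x)}\,dx$ for densities $p$ and $q$.

$$
\begin{aligned}
B(\mathcal{P},\mathcal{Q})
  &= \int_0^\infty \sqrt{\lambda e^{-\lambda x} \cdot \tfrac{1}{n} e^{-x/n}}\,dx
   = \sqrt{\frac{\lambda}{n}} \int_0^\infty e^{-(\lambda + 1/n)x/2}\,dx
   = \frac{2\sqrt{\lambda/n}}{\lambda + 1/n}, \\
\sqrt{d}\, B(\mathcal{P},\mathcal{Q})
  &= \sqrt{n} \cdot \frac{2\sqrt{\lambda/n}}{\lambda + 1/n}
   = \frac{2\sqrt{\lambda}}{\lambda + 1/n}
   \;\longrightarrow\; \frac{2}{\sqrt{\lambda}} \quad (n \to \infty).
\end{aligned}
$$

Under these assumptions, setting $2/\sqrt{\lambda} = 1$ gives $\lambda = 4$, and the condition $\sqrt{d}\,B(\mathcal{P},\mathcal{Q}) \le 1$ corresponds asymptotically to $\lambda \ge 4$, which matches the threshold quoted in the abstract.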
Learner-Private Online Convex Optimization.
arXiv (Cornell University) · 2021-02-23 · 1 citation
preprint · Open access · Senior author
Online convex optimization is a framework where a learner sequentially queries an external data source in order to arrive at the optimal solution of a convex function. The paradigm has gained significant popularity recently thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this paper, we study how to optimally obfuscate the learner's queries in first-order online convex optimization, so that their learned optimal value is provably difficult to estimate for the eavesdropping adversary. We consider two formulations of learner privacy: a Bayesian formulation in which the convex function is drawn randomly, and a minimax formulation in which the function is fixed and the adversary's probability of error is measured with respect to a minimax criterion. We show that, if the learner wants to keep the adversary's probability of accurate prediction below $1/L$, then the overhead in query complexity is additive in $L$ in the minimax formulation, but multiplicative in $L$ in the Bayesian formulation. Compared to existing learner-private sequential learning models with binary feedback, our results apply to the significantly richer family of general convex functions with full-gradient feedback. Our proofs are largely enabled by tools from the theory of Dirichlet processes, as well as more sophisticated lines of analysis aimed at measuring the amount of information leakage under a full-gradient oracle.
The Cost-free Nature of Optimally Tuning Tikhonov Regularizers and Other Ordered Smoothers
International Conference on Machine Learning · 2020-07-12 · 1 citation
article · Senior author
We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, of selecting a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $C\sigma^2$, where $\sigma^2$ is the noise level and $C>0$ is an absolute constant. Remarkably, the error term does not depend on the penalty matrix or the number of estimators as long as they share the same penalty matrix, i.e., it applies to any grid of tuning parameters, no matter how large the cardinality of the grid is. This reveals the surprising cost-free nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators where one typically has to pay a cost of $\sigma^2\log(M)$ where $M$ is the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers. This encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.
Frequent coauthors
- Jiaming Xu · 10 shared
- Kuang Xu · 5 shared
- Jason M. Klusowski · 5 shared
- Yihong Wu · Yale University · 4 shared
- David B. Zax · 4 shared
- Jian Ding · Peking University · 4 shared
- W. D. Brinda · Yale University · 3 shared
- David Pollard · 3 shared
Awards & honors
- Simons-Berkeley Research Fellowship, The Simons Institute fo…