Jen Tang
Professor · Verified · Purdue University · Quantitative Methods
Active 1987–2024
Research topics
- Artificial Intelligence
- Data Mining
- Computer Science
- Statistics
- Algorithm
- Mathematics
Selected publications
Clustering High-Dimensional Noisy Categorical Data
Figshare · 2024-01-01
Dataset · Open access · Senior author
Clustering is a widely used unsupervised learning technique that groups data into homogeneous clusters. However, when dealing with real-world data that contain categorical values, existing algorithms can be computationally costly in high dimensions and can struggle with noisy data that has missing values. Furthermore, except for one algorithm, no others provide theoretical guarantees of clustering accuracy. In this article, we propose a general categorical data encoding method and a computationally efficient spectral-based algorithm to cluster high-dimensional noisy categorical data (nominal or ordinal). Under a statistical model for data on $m$ attributes from $n$ subjects in $r$ clusters with missing probability $\epsilon$, we show that our algorithm exactly recovers the true clusters with high probability when $mn(1-\epsilon) \geq C M r^2 \log^3 M$, with $M = \max(n, m)$ and a fixed constant $C$. In addition, we show that $mn(1-\epsilon)^2 \geq r\delta/2$ with $0 < \delta < 1$ is necessary for any algorithm to succeed with probability at least $(1+\delta)/2$. In cases where $m = n$ and $r$ are fixed, the sufficient condition matches with the necessary condition up to a polylog($n$) factor. In numerical studies our algorithm outperforms several existing algorithms in both clustering accuracy and computational efficiency. Supplementary materials for this article are available online.
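The two sample-size conditions quoted in the abstract can be evaluated numerically. The sketch below is only an illustration: the function names are hypothetical, the logarithm is assumed to be natural, and the default `C = 1.0` is a placeholder because the abstract does not give the value of the constant.

```python
import math


def sufficient_condition_holds(n, m, r, eps, C=1.0):
    """Check m*n*(1 - eps) >= C * M * r^2 * log^3(M), with M = max(n, m).

    C = 1.0 is a placeholder; the abstract only says C is a fixed constant.
    The natural logarithm is an assumption of this sketch.
    """
    M = max(n, m)
    lhs = m * n * (1 - eps)
    rhs = C * M * r ** 2 * math.log(M) ** 3
    return lhs >= rhs


def necessary_condition_holds(n, m, r, eps, delta):
    """Check m*n*(1 - eps)^2 >= r * delta / 2 for 0 < delta < 1."""
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return m * n * (1 - eps) ** 2 >= r * delta / 2


if __name__ == "__main__":
    # Half of the entries missing, two clusters, square data matrix.
    print(sufficient_condition_holds(n=10_000, m=10_000, r=2, eps=0.5))
    print(necessary_condition_holds(n=100, m=100, r=2, eps=0.5, delta=0.5))
```

Because the left-hand side grows like $mn$ while the right-hand side grows like $M \log^3 M$, the sufficient condition becomes easier to satisfy as both dimensions grow together.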
Clustering High-Dimensional Noisy Categorical Data
Journal of the American Statistical Association · 2024 · 4 citations
Senior author · Corresponding
- Computer Science
- Artificial Intelligence
- Data Mining
A Two-Stage Latent Variable Estimation Procedure for Time-Censored Accelerated Degradation Tests
IEEE Transactions on Reliability · 2017-08-21 · 9 citations
Article · Senior author
Parallel constant-stress accelerated degradation testing (PCSADT) is widely used to assess the reliability of highly reliable products in a timely manner when the products' degradation can be measured. Under a time-censored PCSADT, several groups of units are tested simultaneously, but under different stress levels, until a prespecified censoring time is reached. At this time, degradation values from the censored units, and failure times of the failed units are obtained. When the degradation follows a Wiener process where the parameters depend on the stress level through a life-stress model containing an unknown nuisance parameter, estimating this parameter often biases the maximum likelihood and least-squares estimators of the lifetime parameters. In this paper, we propose a two-stage procedure to address this problem. In the first stage, we transform the data under the different stress levels of a PCSADT so that the resulting data can be considered to have been obtained under normal stress. In the second stage, we introduce a latent variable for the unobserved degradation after the failure time for each failed unit to obtain a pseudodegradation value at the censoring time. We then use all degradation values (pseudo or observed) at the censoring time to develop latent variable estimators for all model parameters. Unlike other existing estimators, the proposed estimators are shown to be s-consistent, have closed-form expressions, and are easy to interpret. We use a real example of light-emitting diodes to illustrate the proposed method. In addition to proving s-consistencies, we conduct a simulation study to demonstrate that the proposed estimators also perform well in finite samples.
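The data structure described in this abstract (a failure time for each failed unit, a degradation value at the censoring time for each surviving unit) can be illustrated with a small simulation. This is a minimal sketch of a time-censored Wiener degradation path only, not the two-stage estimation procedure; the function name, the Euler time step `dt`, and all parameter values are assumptions made for illustration.

```python
import math
import random


def simulate_unit(drift, sigma, threshold, censor_time, dt=0.01, rng=None):
    """Simulate one unit whose degradation follows a Wiener process with drift.

    Returns (failed, time, degradation):
      - failed unit:   (True, first step time at or above threshold, value there)
      - censored unit: (False, censor_time, degradation at the censoring time)
    The discrete step dt is an approximation chosen for this sketch.
    """
    rng = rng or random.Random()
    steps = round(censor_time / dt)
    x = 0.0
    for k in range(1, steps + 1):
        # Increment is Normal(drift * dt, sigma^2 * dt).
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if x >= threshold:
            return True, k * dt, x
    return False, censor_time, x


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(simulate_unit(drift=1.0, sigma=0.3, threshold=1.0,
                            censor_time=1.2, rng=rng))
```

For a failed unit the degradation after the failure time is not observed, which is the quantity the abstract's latent variable stands in for.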
IIE Transactions · 2014-06-06 · 13 citations
Article · Senior author
Accelerated Life Testing (ALT) is used to provide timely estimates of a product's lifetime distribution. Step-Stress ALT (SSALT) is one of the most widely adopted stress loadings and the optimum design of a SSALT plan has been extensively studied. However, few research efforts have been devoted to establishing the theoretical rationale for using SSALT in lieu of other types of stress loadings. This article proves the existence of statistically equivalent SSALT plans that can provide equally precise estimates to those derived from any continuous stress loading for the log-location-scale lifetime distributions with Type-I censoring. That is, for any optimization criterion based on the Fisher information matrix, SSALT is identical in comparison to other continuous stress loadings. The Weibull and lognormal distributions are introduced as special cases. For these two distributions, the relationship among statistical equivalencies is investigated and it is shown that two equivalent ALT plans must be equivalent in terms of the strongest version of equivalency for many objective functions. A numerical example for a ramp-stress ALT, using data from an existing study on miniature lamps, is used to illustrate equivalent SSALT plans. Results show that SSALT is not only equivalent to the existing ramp-stress test plans but also more cost-effective in terms of the total test cost.
Minimum cost allocation of quality improvement targets under supplier process disruption
RePEc: Research Papers in Economics · 2014-01-01 · 3 citations
Article · Senior author
This paper presents a system cost model to assist a manufacturer in assessing the minimum cost allocations of quality improvement targets to suppliers. The model accounts for the effects of autonomous learning and induced learning on quality improvement, via variance reductions of supplier processes. The model further accounts for the effects of planned and unplanned disruptions in supplier production processes, where such gaps in production decrease the amount of autonomous learning while providing an opportunity for induced learning, thereby counteracting the effect of disruptions on process improvement. An optimization model is developed that obtains the quality improvement allocations that minimize system expected cost to both suppliers and manufacturer. The proposed models also account for both the uncertainty in the realized induced learning rate and the uncertainty in the realized level of process disruptions. An example is used to demonstrate an implementation of the proposed models and to assess the sensitivity of the optimal target allocations to several model parameters.
Optimum step-stress accelerated degradation test for Wiener degradation process under constraints
European Journal of Operational Research · 2014-09-22 · 136 citations
Article · Senior author
Minimum cost allocation of quality improvement targets under supplier process disruption
European Journal of Operational Research · 2013-02-09 · 24 citations
Article · Senior author
Naval Research Logistics (NRL) · 2012-12-20 · 26 citations
Article · Senior author
Accelerated life testing (ALT) is commonly used to obtain reliability information about a product in a timely manner. Several stress loading designs have been proposed and recent research interests have emerged concerning the development of equivalent ALT plans. Step-stress ALT (SSALT) is one of the most commonly used stress loadings because it usually shortens the test duration and reduces the number of required test units. This article considers two fundamental questions when designing a SSALT and provides formal proofs in answer to each. Namely: (1) can a simple SSALT be designed so that it is equivalent to other stress loading designs? (2) when optimizing a multilevel SSALT, does it degenerate to a simple SSALT plan? The answers to both queries, under certain reasonable model assumptions, are shown to be a qualified YES. In addition, we provide an argument to support the rationale of a common practice in designing a SSALT, that is, setting the higher stress level as high as possible in a SSALT plan. © 2012 Wiley Periodicals, Inc. Naval Research Logistics, 2013
Step-stress accelerated life tests: a proportional hazards–based non-parametric model
IIE Transactions · 2012-06-14 · 20 citations
Article · Senior author
Using data from a simple step-stress accelerated life test procedure, a non-parametric proportional hazards model is proposed for obtaining upper confidence bounds for the cumulative failure probability of a product under normal use conditions. The approach is non-parametric in the sense that most of the functions involved in the model do not assume any specific forms, except for certain verifiable conditions. Test statistics are introduced to verify assumptions about the model and to test the goodness of fit of the proposed model to the data. A numerical example, using data simulated from the lifetime distribution of an existing parametric study on metal-oxide semiconductor capacitors, is used to illustrate the proposed methods. Discussions on how to determine the optimal stress levels and sample size are also given.
Methods for identifying influential variables in an out-of-control multivariate normal process
Statistica Sinica · 2011-07-27 · 1 citation
Article · Senior author
Hotelling's $T^2$ is a well-known statistic for testing the mean vector of a multivariate normal distribution. Control charts based on $T^2$ have been widely used in statistical process control for monitoring a multivariate process. Although it is a powerful tool, the $T^2$ statistic has a practical problem, namely, that a significant $T^2$-value that normally signals an overall out-of-control condition in the process mean vector does not provide direct information about which variable or group of variables may have caused this out-of-control condition. We propose a diagnostic method to identify the influential variable(s) for cases with and without a specified out-of-control mean vector. Our approach, based on the likelihood principle, computes the conditional likelihood of a variable or sub-group of variables causing or not causing the overall out-of-control condition. Unlike many existing methods, our method assumes that an out-of-control condition already exists; hence, all conditional likelihoods in this paper are based on non-central distributions of the monitoring/testing statistics. By comparing these conditional likelihoods, we identify the influential variable(s). We use an example from the literature to illustrate our method and to demonstrate its effectiveness.
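For readers unfamiliar with the statistic this abstract builds on, the sketch below computes the standard one-sample Hotelling's $T^2 = n(\bar{x} - \mu_0)^\top S^{-1}(\bar{x} - \mu_0)$ for the two-variable case, using only the standard library. It shows the overall statistic only, not the conditional-likelihood diagnostic proposed in the paper; the function name and the sample data are hypothetical.

```python
from statistics import mean


def hotelling_t2_bivariate(samples, mu0):
    """One-sample Hotelling's T^2 for two variables.

    samples: list of (x, y) pairs; mu0: hypothesized mean (mu_x, mu_y).
    Uses the unbiased sample covariance matrix S (divisor n - 1).
    """
    n = len(samples)
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    syy = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - xbar) * (y - ybar) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("sample covariance matrix is singular")
    dx, dy = xbar - mu0[0], ybar - mu0[1]
    # Quadratic form d' S^{-1} d with the 2x2 inverse written out explicitly.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return n * quad


if __name__ == "__main__":
    data = [(1, 2), (2, 1), (3, 4), (4, 3)]  # illustrative values only
    print(hotelling_t2_bivariate(data, mu0=(1.5, 2.5)))
```

A large value signals that the mean vector has shifted but, as the abstract notes, does not by itself say which variable is responsible.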
Frequent coauthors
- 12 shared
Robert Plante
Purdue University West Lafayette
- 12 shared
Herbert Moskowitz
Purdue University West Lafayette
- 12 shared
Kwei Tang
National Chengchi University
- 4 shared
Weijia Wang
Emory University
- 3 shared
Sheng‐Tsaing Tseng
National Tsing Hua University
- 2 shared
Regina Y. Liu
Rutgers, The State University of New Jersey
- 2 shared
Peter Sing-Lai Lam
PAREXEL International (United States)
- 2 shared
Suresh Chand
Purdue University West Lafayette