Pen-Chung Yew


University of Minnesota · Computer Science and Engineering

Active 1981–2026

h-index: 42

Citations: 7.0k

Papers: 501 · 30 last 5y

Funding: $829k



About

Pen-Chung Yew is a professor in the Department of Computer Science & Engineering at the University of Minnesota Twin Cities. He joined the department in 1994 and has served as Associate Head and Head of the department and as the William Norris Land-Grant Chair Professor. His educational background includes a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign, an M.S. in Computer Engineering from the University of Massachusetts Amherst, and a B.S. in Electrical Engineering from National Taiwan University. His research and teaching interests focus on computer architectures and compilers targeting future generations of secure high-performance multi- and many-core systems. Recent research areas include enhancing system security at the micro-architectural and source/binary-code levels, system virtualization, dynamic binary translation, leveraging machine-learning approaches to improve compilation techniques, high-performance memory systems, and debugging and testing parallel programs. Yew has also held leadership roles in research organizations, including Program Director of Microelectronic Systems Architecture at the NSF and Director of the Institute of Information Science in Taiwan. He was named an IEEE Fellow in 1998.

Research topics

Computer Science
Operating system
Machine Learning
Data Mining
Parallel computing
Embedded system
Artificial Intelligence
Computer architecture
Programming language
Mathematics
Algorithm
Arithmetic
Computer network

Selected publications

XuanJia: A Comprehensive Virtualization-Based Code Obfuscator for Binary Protection
arXiv (Cornell University) · 2026-01-15
preprint · Open access · Senior author
Virtualization-based binary obfuscation is widely adopted to protect software intellectual property, yet existing approaches leave exception-handling (EH) metadata unprotected to preserve ABI compatibility. This exposed metadata leaks rich structural information, such as stack layouts, control-flow boundaries, and object lifetimes, which can be exploited to facilitate reverse engineering. In this paper, we present XuanJia, a comprehensive VM-based binary obfuscation framework that provides end-to-end protection for both executable code and exception-handling semantics. At the core of XuanJia is ABI-Compliant EH Shadowing, a novel exception-aware protection mechanism that preserves compatibility with unmodified operating system runtimes while eliminating static EH metadata leakage. XuanJia replaces native EH metadata with ABI-compliant shadow unwind information to satisfy OS-driven unwinding, and securely redirects exception handling into a protected virtual machine where the genuine EH semantics are decrypted, reversed, and replayed using obfuscated code. We implement XuanJia from scratch, supporting 385 x86 instruction encodings and 155 VM handler templates, and design it as an extensible research testbed. We evaluate XuanJia across correctness, resilience, and performance dimensions. Our results show that XuanJia preserves semantic equivalence under extensive dynamic and symbolic testing, effectively disrupts automated reverse-engineering tools such as IDA Pro, and incurs negligible space overhead and modest runtime overhead. These results demonstrate that XuanJia achieves strong protection of exception-handling logic without sacrificing correctness or practicality.
GPU Stream-Aware Communication for Effective Pipelining
2025-11-03
article
Modern heterogeneous supercomputing systems consist of CPUs, GPUs, and high-speed network interconnects. Communication libraries that support efficient inter-process data movement between memory buffers, especially those involving GPU memory, typically require the CPU to orchestrate the data transfer operations. This approach necessitates expensive synchronization between the CPU and GPU, and is ineffective for achieving better compute/communication overlap in applications using techniques like pipelining. A new offload-friendly communication strategy, stream-triggered (ST) communication, is explored to offload the synchronization and data movement operations from the CPU to the GPU. A Message Passing Interface (MPI) one-sided active target synchronization-based implementation is used to illustrate the proposed strategy. A latency-sensitive nearest-neighbor microbenchmark was used to examine various performance characteristics of the implementation. The offloaded implementation showed significant performance improvements both between nodes (inter-node) and within a single node (intra-node) when compared to standard MPI active RMA (33% and 27%, respectively) and point-to-point communication (9% and 38%, respectively).
EVeREST-C: An Effective and Versatile Runtime Energy Saving Tool for CPUs
2025-06-08
article · Open access
Power and energy efficiency are increasingly important challenges within HPC. However, it is still important to achieve these goals while maintaining desired/high application performance. Balancing these goals involves the challenge of precise application characterization. For successful user adoption, this must avoid modifying the application and/or extraneous application profiling, and also be portable to different processors across processor generations and vendors. We propose EVeREST-C to solve these challenges. Everest targets the finer-grained individual application functions for exploiting power/energy saving opportunities via Dynamic Voltage Frequency Scaling (DVFS) in both the core and the uncore, without application-specific knowledge. Since Everest relies on a single standard and accurate performance event, IPS (instructions per second), for its characterization rather than on the (many) performance counters that can differ across platforms, it is portable across processors. Finally, the fine-grained approach enables Everest to additionally save power/energy for select communication (MPI) phases, where appropriate phases are chosen based on both their length and position in the application with regards to the memory/compute boundedness of surrounding user routines. We evaluate Everest using SPEC CPU 2017 and various MPI applications, on Intel and AMD platforms. We find that Everest saves on average 11% more energy for SPEC compared to the baseline and 8% more energy on MPI applications compared to a state-of-the-art solution.
EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs
2025-02-28 · 2 citations
article · Open access
Amid conflicting demands for ever-improving performance and maximizing energy savings, it is important to have a tool that automatically identifies opportunities to save power/energy at runtime without compromising performance. GPUs in particular present challenges due to (1) reduced savings available from memory bound applications, and (2) limited availability of low overhead performance counters. Thus, a successful tool must address these issues while still tackling the challenges of dynamic application characterization, versatility across processors from different vendors, and effectiveness at making the right power-performance tradeoffs for desired energy savings.
DeCOS: Data-Efficient Reinforcement Learning for Compiler Optimization Selection Ignited by LLM
2025-06-08
article
JavART: A Lightweight Rule-Based JIT Compiler using Translation Rules Extracted from a Learning Approach
Proceedings of the ACM on Programming Languages · 2025-04-09
article · Open access
The balance between the compilation/optimization time and the produced code quality is very important for Just-In-Time (JIT) compilation. Time-consuming optimizations can cause delayed deployment of the optimized code, and thus more execution time needs to be spent either in the interpretation or less optimized code, leading to a performance drag. Such a performance drag can be detrimental to mobile and client-side devices such as those running Android, where applications are often short-running, frequently restarted and updated. To tackle this issue, this paper presents a lightweight learning-based, rule-guided dynamic compilation approach to generate good-quality native code directly without the need to go through the interpretive phase and the first-level optimization at runtime. Different from existing JIT compilers, the compilation process is driven by translation rules, which are automatically learned offline by taking advantage of existing JIT compilers. We have implemented a prototype of our approach based on Android 14 to demonstrate the feasibility and effectiveness of such a lightweight rule-based approach using several real-world applications. Results show that, compared to the default mode running with the interpreter and two tiers of JIT compilers, our prototype can achieve a 1.23× speedup on average. Our proposed compilation approach can also generate native code 5.5× faster than the existing first-tier JIT compiler in Android, with the generated code running 6% faster. We also implement and evaluate our approach on a client-side system running Hotspot JVM, and the results show an average of 1.20× speedup.
Shield Bash: Abusing Defensive Coherence State Retrieval to Break Timing Obfuscation
ArXiv.org · 2025-04-14
preprint · Open access · Senior author
Microarchitectural attacks are a significant concern, leading to many hardware-based defense proposals. However, different defenses target different classes of attacks, and their impact on each other has not been fully considered. To raise awareness of this problem, we study an interaction between two state-of-the-art defenses in this paper, timing obfuscations of remote cache lines (TORC) and delaying speculative changes to remote cache lines (DSRC). TORC mitigates cache-hit based attacks and DSRC mitigates speculative coherence state change attacks. We observe that DSRC enables coherence information to be retrieved into the processor core, where it is out of the reach of timing obfuscations to protect. This creates an unforeseen consequence that redo operations can be triggered within the core to detect the presence or absence of remote cache lines, which constitutes a security vulnerability. We demonstrate that a new covert channel attack is possible using this vulnerability. We propose two ways to mitigate the attack, whose performance varies depending on an application's cache usage. One way is to never send remote exclusive coherence state (E) information to the core even if it is created. The other way is to never create a remote E state, which is responsible for triggering redos. We demonstrate the timing difference caused by this microarchitectural defense assumption violation using GEM5 simulations. Performance evaluation on SPECrate 2017 and PARSEC benchmarks of the two fixes shows less than 32% average overhead across both sets of benchmarks. The repair which prevented the creation of remote E state had less than 2.8% average overhead.
A System-Level Dynamic Binary Translator Using Automatically-Learned Translation Rules
2024-02-28 · 5 citations
article
System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has been shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU's helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode.
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
arXiv (Cornell University) · 2024-02-15
preprint · Open access
System-level emulators have been used extensively for system design, debugging and evaluation. They work by providing a system-level virtual machine to support a guest operating system (OS) running on a platform with the same or different native OS that uses the same or different instruction-set architecture. For such system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based DBT approach has shown a significantly improved performance with a higher quality of translated code using automatically learned translation rules. However, it has only been applied to user-level emulation, and not yet to system-level emulation. In this paper, we explore the feasibility of applying this approach to improve system-level emulation, and use QEMU to build a prototype. ... To achieve better performance, we leverage several optimizations that include coordination overhead reduction to reduce the overhead of each coordination, and coordination elimination and code scheduling to reduce the coordination frequency. Experimental results show that it can achieve an average of 1.36X speedup over QEMU 6.1 with negligible coordination overhead in the system emulation mode using SPEC CINT2006 as application benchmarks and 1.15X on real-world applications.

Recent grants

CSR: Medium: Dynamic Binary Translation for a Retargetable and Behaviorally-Accurate Cross-Architecture Whole System Virtual Machine
NSF · $829k · 2015–2019

Frequent coauthors

Guy L. Steele
Oracle (United States)
418 shared
Laxmikant V. Kalé
University of Illinois Urbana-Champaign
318 shared
Josep Torrellas
University of Illinois Urbana-Champaign
113 shared
Charles E. Leiserson
107 shared
Cédric Bastoul
107 shared
Edward F. Gehringer
107 shared
Pedro J. García
University of Castilla-La Mancha
106 shared
Krishna Kandalla
106 shared

Labs

Pen-Chung Yew · PI

Education

Doctor of Philosophy, Computer Science
University of Illinois at Urbana-Champaign
1981
Master of Science, Electrical and Computer Engineering
University of Massachusetts Amherst
1977
Bachelor of Science, Electrical Engineering
National Taiwan University
1972

Awards & honors

William Norris Land Grant Chair in Large-Scale Computing (20…
IEEE Fellow (1998)
