Xun (Steve) Jian

· Assistant ProfessorVerified

Virginia Tech · Computer Science

Active 2012–2025

h-index12

Citations285

Papers2811 last 5y

Funding$183k

Faculty page Lab page

See your match with Xun (Steve) Jian — sign in to PhdFit.Sign in

About

Xun (Steve) Jian is an Associate Professor in the Department of Computer Science at Virginia Tech. He holds a Ph.D. in computer engineering from the University of Illinois, Urbana-Champaign, earned in 2017, and a B.S. from the same university in 2011. His research interests focus on computer architecture. He is based at the Gilbert Place location in Blacksburg, VA, and is involved with the Institute for Advanced Computing in Alexandria, VA. His contact information includes an email at xunj@vt.edu and a phone number (540) 232-8433. His professional profile is linked through his homepage and social media platforms such as Facebook, Instagram, LinkedIn, YouTube, and X (Twitter).

Research topics

Computer Science
Computer hardware
Parallel computing
Algorithm
Computer network
Operating system
Engineering
Mathematics
Electrical engineering
Embedded system
Reliability engineering

Selected publications

ResBench: A Comprehensive Framework for Evaluating Database Resilience
ArXiv.org · 2025-11-14
preprintOpen access
Existing database benchmarks primarily focus on performance under ideal running environments. However, in real-world scenarios, databases probably face numerous adverse events. Quantifying the ability to cope with these events from a comprehensive perspective remains an open problem. We provide the definition of database resilience to describe its performance when facing adversity and propose ResBench, a benchmark for evaluating database resilience. This framework achieves automation, standardization, and visualization of the testing process through clear hierarchical decoupling. ResBench simulates adverse events and injects them during normal transaction processing, utilizing a module to gather multiple metrics for the evaluation model. We assess database resilience across eight dimensions: throughput, latency, stability, resistance, recovery, disturbance period, adaptation capability and metric deviation. All the results are presented to users via a user-friendly graphical interface. We demonstrate the execution process and result interpretation of ResBench using two types of adversity datasets.
Publisher OA PDF DOI
Systematic CXL Memory Characterization and Performance Analysis at Scale
2025-03-27 · 26 citations
articleOpen access
Compute Express Link (CXL) has emerged as a pivotal interconnect for memory expansion. Despite its potential, the performance implications of CXL across devices, latency regimes, processors, and workloads remain underexplored. We present Melody, a framework for systematic characterization and analysis of CXL memory performance. Melody builds on an extensive evaluation spanning 265 workloads, 4 real CXL devices, 7 latency levels, and 5 CPU platforms. Melody yields many insights: workload sensitivity to sub-μs CXL latencies (140-410ns), the first disclosure of CXL tail latencies, CPU tolerance to CXL latencies, a novel approach (SPA) for pinpointing CXL bottlenecks, and CPU prefetcher inefficiencies under CXL.
Publisher DOI
FedCaSe: Enhancing Federated Learning with Heterogeneity-aware Caching and Scheduling
2024-11-14 · 1 citations
articleOpen access
Federated learning (FL) has emerged as a new paradigm of machine learning (ML) with the goal of collaborative learning on the vast pool of private data available across distributed edge devices. The focus of most existing works in FL systems has been on addressing the challenges of computation and communication heterogeneity inherent in training with edge devices. However, the crucial impact of I/O and the role of limited on-device storage has not been explored fully in FL context. Without policies to exploit the on-device storage for placement of client data samples, and schedule clients based on I/O benefits, FL training can lead to inefficiencies, such as increased training time and impacted accuracy convergence.
Publisher OA PDF DOI
Memory Allocation Under Hardware Compression
2024-11-02
articleSenior author
As the scaling of memory density slows physically, a promising solution is to scale memory logically by enhancing the CPU's memory controller to encode and store data more densely in memory. This is known as hardware memory compression. Hardware memory compression decouples OS-managed physical memory from actual memory (i.e., DRAM); the memory controller spends a dynamically varying amount of DRAM on each physical page, depending on the compressibility of the page's content. The newly-decoupled actual memory effectively forms a new layer of memory beyond the traditional layers of virtual, pseudo-physical, and physical memory. We note unlike these traditional memory layers, each with its own specialized allocation interface (e.g., malloc/mmap for virtual memory, page tables+MMU for physical memory), this new layer of memory introduced by hardware memory compression still awaits its own unique memory allocation interface; its absence makes the allocation of actual memory imprecise and, sometimes, even impossible. Imprecisely allocating less actual memory, and/or unable to allocate more, can harm performance. Even imprecisely allocating more actual memory to some jobs can be harmful as it can result in allocating less actual memory to other jobs in highly-occupied memory systems, where compression is useful. To restore precise memory allocation, we design a new memory allocation specialized for this new layer of memory and, subsequently, architect a new MMU-like component in the memory controller and tackle the corresponding design challenges. We create a full-system FPGA prototype of a hardware-compressed memory system with precise memory allocation. Our evaluations using the prototype show that jobs perform stably under colocation. The performance variation is only 1%-2%; in comparison, it is 19%-89% under the prior art.
Publisher DOI
DyLeCT: Achieving Huge-page-like Translation Performance for Hardware-compressed Memory
2024-06-29 · 3 citations
articleSenior author
To expand effective memory capacity, hardware memory compression transparently compresses and packs memory values more densely together in DRAM. This requires introducing a new layer of hardware-managed address translation in the memory controller (MC). However, for large and irregular workloads that already suffer from frequent virtual address translation misses in the TLB, adding an additional layer of address translation can double the translation misses (e.g., by adding a new miss in the MC per TLB miss). While TLB misses can be drastically reduced by using huge pages, no prior work has explored huge-page-like translation reach for hardware memory compression. While compressing and moving an entire huge page worth of data at a time can lead to huge-page-like address translation, moving a huge page worth of data together can consume an exorbitant amount of memory bandwidth.This paper explores how to achieve huge-page-like translation performance in this new address translation layer, while keeping compression at the page (instead of huge page) granularity. We propose dynamically shortening the translation entries of hot pages to only a few bits per entry by migrating hot pages to the limited number of DRAM locations whose addresses can be encoded using a few bits; colder pages still use the bigger fulllength translations so that colder pages can be placed anywhere in memory to fully utilize all the space in memory. Each short translation is tiny (e.g., 2 bits); as such, a 128KB translation cache filled mostly with short translations can achieve similar (e.g., 2GB) total translation reach as a TLB filled entirely with huge page entries. Evaluations show our idea – Dynamic Length Compressed-Memory Translations (DyLeCT) – improves average performance by 10.25% over the prior art.
Publisher DOI
Counter-light Memory Encryption
2024-06-29 · 3 citations
articleSenior author
Unlike the well-known counter mode memory encryption (e.g., SGX1), more recent memory encryption (e.g., SGX2, SEV) has no counters. Without accessing any counters, such counterless memory encryption improves performance over counter mode encryption and gains wide adoption as a result.Counterless encryption, however, still incurs a costly performance overhead. Under counterless encryption, the cipher calculations take data as their direct inputs. As such, the ciphers for decrypting data can only be calculated sequentially after the missing data arrive from memory; this requires every last-level cache miss to stall on the cipher calculations after the needed data arrive from memory. Our real-system measurements find counterless encryption can slow down irregular workloads by 9%, on average.We observe while counter mode encryption incurs costly memory access overhead, its cipher calculations can often complete before data arrive because they take counters as input, instead of data, and counters can fit on-chip much better than data. As such, we explore how to combine both modes of encryption to achieve the best of both worlds – the efficient memory accesses of counterless encryption and fast cipher calculations of counter mode encryption. For irregular workloads, our proposed memory encryption – Counter-light Encryption – achieves 98% the average performance of no memory encryption. When memory bandwidth is starved, Counter-light Encryption is slower than counterless encryption by only 1.4% in the worst case.
Publisher DOI
Translation-optimized Memory Compression for Capacity
2022 · 10 citations
Senior authorCorresponding
- Computer Science
- Computer Science
- Computer hardware
The demand for memory is ever increasing. Many prior works have explored hardware memory compression to increase effective memory capacity. However, prior works compress and pack/migrate data at a small - memory block-level - granularity; this introduces an additional block-level translation after the page-level virtual address translation. In general, the smaller the granularity of address translation, the higher the translation overhead. As such, this additional block-level translation exacerbates the well-known address translation problem for large and/or irregular workloads. A promising solution is to only save memory from cold (i.e., less recently accessed) pages without saving memory from hot (i.e., more recently accessed) pages (e.g., keep the hot pages uncompressed); this avoids block-level translation overhead for hot pages. However, it still faces two challenges. First, after a compressed cold page becomes hot again, migrating the page to a full 4KB DRAM location still adds another level (albeit page-level, instead of block-level) of translation on top of existing virtual address translation. Second, only compressing cold data require compressing them very aggressively to achieve high overall memory savings; decompressing very aggressively compressed data is very slow (e.g., $\gt 800 ns$ assuming the latest Deflate ASIC in industry). This paper presents Translation-optimized Memory Compression for Capacity (TMCC) to tackle the two challenges above. To address the first challenge, we propose compressing page table blocks in hardware to opportunistically embed compression translations into them in a software-transparent manner to effectively prefetch compression translations during a page walk, instead of serially fetching them after the walk. To address the second challenge, we perform a large design space exploration across many hardware configurations and diverse workloads to derive and implement in HDL an ASIC Deflate that is specialized for memory; for memory pages, it is 4X as fast as the state-of-the art ASIC Deflate, with little to no sacrifice in compression ratio. Our evaluations show that for large and/or irregular workloads, TMCC can either improve performance by 14% without sacrificing effective capacity or provide 2.2x the effective capacity without sacrificing performance compared to a state-of-the-art hardware memory compression for capacity.
Publisher DOI
Eager Memory Cryptography in Caches
2022-10-01 · 6 citations
articleSenior author
To protect memory values from adversaries with physical access to data centers, secure memory systems ensure memory confidentiality and integrity via memory encryption and verification. The corresponding cryptography calculations require a memory block’s write counter as input. As such, CPUs today cache counters in the memory controller (MC). Due to the large memory footprint and irregular access patterns of many real-world applications, MC’s counter cache is too small to achieve high hit rate. A promising solution is also caching counters in the much bigger Last Level cache (LLC). As such, many prior works use LLC as a second level cache for counters to back up the smaller counter cache in MC. Caching counters in LLC introduces a new problem, however. Modern server CPUs have a long LLC access latency that not only can diminish the benefit of caching counters in LLC, but also can sometimes significantly increase counter access latency compared to not caching counters in LLC. We note the problem lies with MC sitting behind LLC; due to its physical location, MC can only see LLC misses and, therefore, can only serially access and use counters after data miss in LLC has completed. However, prior designs without caching counters in LLC can access and use counters in parallel with accessing data. If a block’s counter misses in MC’s counter cache, MC can fetch the counter from DRAM in parallel with data; if the counter hits in MC’s counter cache, MC can use counters for cryptography calculation in parallel with data traveling from DRAM to MC. To parallelize the access and use of counters with data access while caching counters in LLC, we observe that in modern CPUs, L2 is typically the first place that caches data from DRAM (i.e., L2 and L3 are non-inclusive); as such, data from DRAM need not be decrypted and verified until they reach L2. So it is possible to offload some decryption and verification tasks from MC to L2. Since L2 sits before L3, L2 can access counter and data in parallel from L3; L2 can also use counters for cryptography calculation in parallel with data traveling from DRAM to L2, instead of just from DRAM to MC. As such, we propose caching and using counters directly in L2 and refer to this idea as Eager Memory Cryptography in Caches (EMCC). Our evaluation shows that when applied to the state-of-the-art baseline, EMCC improves performance of large and/or irregular workloads by 7%, on average.
Publisher DOI
Self-Reinforcing Memoization for Cryptography Calculations in Secure Memory Systems
2022 · 9 citations
Senior authorCorresponding
- Computer Science
- Computer Science
- Parallel computing
Modern memory systems use encryption and message authentication codes to ensure confidentiality and integrity. Encryption and integrity verification rely on cryptography calculations, which are slow. To hide the latency of cryptography calculations, prior works exploit the fact that many cryptography steps only require a memory block’s write counter (i.e., a value that increases whenever the block is written to memory), but not the block itself. As such, memory controller (MC) caches counters so that MC can start calculating before missing blocks arrive from memory.Irregular workloads suffer from high counter miss rates, however, just like they suffer from high miss rates of page table entries. Many prior works have looked at the problem of page table entry misses for irregular workloads, but not the problem of counter misses for the irregular workloads.This paper addresses the memory latency overheads that irregular workloads suffer due to their high counter miss rate. We observe many (e.g., unlimited number of) counters can have the same value. As such, we propose memoizing cryptography calculations for hot counter values. When a counter arrives from memory, MC can use the counter value to look up a memoization table to quickly obtain the counter’s memoized results instead of slowly recalculating them. To maximize memoization table hit rate, we observe whenever writing a block to memory, increasing its counter to any value higher than the current counter value can satisfy the security requirement of always using different counter values to encrypt the same block. As such, we also propose a memoization-aware counter update: when writing a block to memory, increase its counter to a value whose cryptography calculation is currently memoized. We refer to memoizing the calculation results of counters and the corresponding memoization-aware counter update collectively as Self-Reinforcing Memoization for Cryptography Calculations (RMCC). Our evaluations show that RMCC improves average performance by 6% compared to the state-of-the-art. On average across the lifetimes of different workloads, RMCC accelerates decryption and verification for 92% of counter misses.
Publisher DOI
External Review Committee
2021-02-01
articleOpen access
Publisher OA PDF DOI

Recent grants

CRII: SHF: Pointer-aware Memory: Boosting Cybersecurity by Making Strong Memory Protection Affordable for Irregular Applications
NSF · $183k · 2019–2022

Frequent coauthors

Rakesh Kumar
GLA University
8 shared
Vilas Sridharan
7 shared
Henry Duwe
Iowa State University
5 shared
Gagandeep Panwar
Virginia Tech
5 shared
Nathan DeBardeleben
Los Alamos National Laboratory
4 shared
Rakesh Kumar
4 shared
Esha Choukse
4 shared
Jagadish Kotra
Google (United States)
4 shared

Education

PhD, Electrical Computer Engineering
University of Illinois at Urbana-Champaign
2017

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Xun (Steve) Jian

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you