David Hung-Chang Du

University of Minnesota · Computer Science and Engineering

Active 1986–2025

h-index: 21
Citations: 2.0k
Papers: 177 (32 in the last 5 years)
Funding: $4.3M

About

David Hung-Chang Du is a Professor and the Qwest Chair Professor in the Department of Computer Science & Engineering at the University of Minnesota Twin Cities. He joined the department as an assistant professor in 1981 and was promoted to full professor in 1991. He served as Director of the NSF I/UCRC Center for Research in Intelligent Storage from 2014 to 2021 and has held the position of Qwest Chair Professor since 2007. His professional background includes serving as vice president of engineering at 3CX in 1998 and as a program director at the National Science Foundation from 2006 to 2008. Du has been an IEEE Fellow since 1998 and is a Fellow of the Minnesota Supercomputer Institute, and he has received the IEEE and ACM Recognition of Service Award. His research focuses on intelligent storage systems, sensor and vehicular networks, and cyber-physical systems; his previous research interests include high-speed networking, database design, multimedia computing, and CAD for VLSI circuits.

Research topics

  • Computer Science
  • Machine Learning
  • Computer Security
  • Business
  • Pathology
  • Medicine
  • Chemistry
  • Biology
  • Endocrinology
  • Anatomy
  • Cell biology
  • Biochemistry

Selected publications

  • PM-Dedup: Secure Deduplication with Partial Migration from Cloud to Edge Servers

    arXiv (Cornell University) · 2025-01-04 · 1 citations

    preprint · Open access · Senior author

    Currently, an increasing number of users and enterprises are storing their data in the cloud but do not fully trust cloud providers with their data in plaintext form. To address this concern, they encrypt their data before uploading it to the cloud. However, encryption with different keys means that even identical data will become different ciphertexts, making deduplication less effective. Encrypted deduplication avoids this issue by ensuring that identical data chunks generate the same ciphertext with content-based keys, enabling the cloud to efficiently identify and remove duplicates even in encrypted form. Current encrypted data deduplication work can be classified into two types: target-based and source-based. Target-based encrypted deduplication requires clients to upload all encrypted chunks (the basic unit of deduplication) to the cloud with high network bandwidth overhead. Source-based deduplication involves clients uploading fingerprints (hashes) of encrypted chunks for duplicate checking and only uploading unique encrypted chunks, which reduces network transfer but introduces high latency and potential side-channel attacks, which need to be mitigated by Proof of Ownership (PoW), and high computing overhead of the cloud. So, reducing the latency and the overheads of network and cloud while ensuring security has become a significant challenge for secure data deduplication in cloud storage. In response to this challenge, we present PM-Dedup, a novel secure source-based deduplication approach that relocates a portion of the deduplication checking process and PoW tasks from the cloud to the trusted execution environments (TEEs) in the client-side edge servers. We also propose various designs to enhance the security and efficiency of data deduplication.

  • GAIA: Glass-Aware I/O Middleware

    2025-10-26

    article · Senior author

    As cloud-scale services and data-centric applications continue to generate massive volumes of data, the need for ultra-durable, energy-efficient, and cost-effective archival storage becomes increasingly urgent. Quartz glass has recently emerged as a promising archival medium, offering multi-century durability, radiation and thermal resistance, and support for three-dimensional data encoding using femtosecond laser writing. However, the hybrid mechanical-optical architecture of glass storage, which requires mechanical movement along the X and Y axes and optical focal tuning along the Z axis, introduces unique performance bottlenecks during data access, which conventional I/O scheduling strategies are not equipped to handle. In this work, we present GAIA, a Glass-Aware I/O middlewAre designed to optimize data access in quartz glass storage systems. GAIA features three coordinated strategies: (1) Zigzag Data Placement, which aligns data with the mechanical stage’s natural motion to minimize direction-switching latency; (2) Z-Axis First Placement, which prioritizes low-latency optical traversal along the depth dimension; and (3) Shortest Moving Time First (SMTF) scheduling, which selects I/O operations based on predicted movement time rather than geometric distance. Through trace-driven simulations using enterprise-scale workloads and various glass sizes, GAIA reduces data read latency by up to 82% compared to traditional baseline schedulers. These results demonstrate the critical importance of middleware-level co-design in unlocking the performance potential of next-generation glass-based storage systems.

  • CPI: A Collaborative Partial Indexing Design for Large-Scale Deduplication Systems

    IEEE Transactions on Computers · 2024-11-08

    article · Senior author

    Data deduplication relies on a chunk index to identify the redundancy of incoming chunks. As backup data scales, it is impractical to maintain the entire chunk index in memory. Consequently, an index lookup needs to search a portion of the on-storage index, causing a dramatic regression of index lookup throughput. Existing studies propose to search a subset of the whole index (partial index) to limit the storage I/Os and guarantee a high index lookup throughput. However, several core factors of designing partial indexing are not fully exploited. In this paper, we first comprehensively investigate the trade-offs of using different meta-groups, sampling methods, and meta-group selection policies for a partial index. We then propose a Collaborative Partial Index (CPI) which takes advantage of two meta-groups, recipe-segment and container-catalog, to achieve more efficient and effective unique chunk identification. CPI further introduces a hook-entry sharing technology and a two-stage eviction policy to reduce memory usage without hurting the deduplication ratio. According to our evaluation, with the same constraints of memory usage and storage I/O, CPI achieves a 1.21x-2.17x higher deduplication ratio than the state-of-the-art partial indexing schemes. Alternatively, CPI achieves 1.8x-4.98x higher index lookup throughput than others when the same deduplication ratio is achieved. Compared with full indexing, CPI's maximum deduplication ratio is only 4.07% lower, but its throughput is 37.1x-122.2x that of full indexing, depending on the storage I/O constraints in our evaluation cases.

  • Collision Aware Data Allocation In Multi-tube DNA Storage

    arXiv (Cornell University) · 2024-03-21

    preprint · Open access · Senior author

    DNA storage is a promising archival data storage solution to today's big data problem. A DNA storage system encodes and stores digital data with synthetic DNA sequences and decodes DNA sequences back to digital data via sequencing. For efficient target data retrieval, existing Polymerase Chain Reaction (PCR) based DNA storage systems apply primers as specific identifiers to tag different sets of DNA strands. However, if a primer has collisions with any payload in the same DNA tube, the primer cannot safely serve as an identifier and must be disabled in this tube. In a DNA storage system with multiple DNA tubes, the primer-payload collisions can spread over all DNA tubes, repeatedly disable many primers, and cause a significant overall capacity reduction. This paper proposes using a collision-aware data allocation scheme to allocate data with different collisions into different tubes so that a primer banned in a tube because of primer-payload collision can be reused in other tubes. This allocation helps increase the number of usable primers over all tubes, thus enhancing the overall storage capacity. The execution time of our scheme is $O(n^2)$ in the number of digital data chunks. The scheme serves as a pre-processing method for any DNA storage system. The evaluation of the state-of-the-art encoding scheme shows that the scheme can increase overall storage capacity by 20%-25%.

  • VL-DNA: Enhance DNA Storage Capacity with Variable Payload (Strand) Lengths

    arXiv (Cornell University) · 2024-03-21

    preprint · Open access · Senior author

    DNA storage is a promising archival data storage solution to today's big data problem. A DNA storage system encodes and stores digital data with synthetic DNA sequences and decodes DNA sequences back to digital data via sequencing. For efficient target data retrieval, existing Polymerase Chain Reaction (PCR) based DNA storage systems apply primers as specific identifiers to tag different sets of DNA strands. However, the PCR-based DNA storage system suffers from primer-payload collisions, causing a significant reduction of storage capacity. This paper proposes using variable strand length, which takes advantage of the inherent payload-cutting process, to split collisions and recover primers. The execution time of our scheme is linear in the number of primer-payload collisions. The scheme serves as a post-processing method for any DNA encoding scheme. The evaluation of three state-of-the-art encoding schemes shows that the scheme can recover thousands of usable primers and improve tube capacity ranging from 18.27% to 19x.

  • K8sES: Optimizing Kubernetes with Enhanced Storage Service-Level Objectives

    2023-11-06

    article

    Kubernetes (k8s) is a system for managing containerized applications across multiple hosts. It offers automatic deployment, maintenance, scaling, and resource management for applications. Applications in k8s usually have different storage requirements in the form of service-level objectives (SLOs). However, the current k8s storage management has several limitations which cause explicit performance and cost overhead. K8s administrators have to configure storage in advance manually, and users must know configurations and capabilities of provided storage. Users' storage SLOs can be easily violated in k8s. In this paper, we design and implement k8s Enhanced Storage (k8sES) which efficiently supports applications with various storage SLOs along with all other requirements in the Kubernetes environment. We design and incorporate storage scheduling as part of the node scheduling process in k8s. Applications will be scheduled onto the correct nodes and storage without intervention from either users or administrators. Proper storage resources will be dynamically carved based on users' storage SLOs. In addition, we provide a tool to monitor the I/O activities of both applications and storage devices in k8sES. The evaluation shows that k8sES can better meet users' storage SLOs along with other requirements. Also, k8sES can achieve higher resource utilization efficiency with overhead similar to that of the current k8s.

  • IS-HBase: An In-Storage Computing Optimized HBase with I/O Offloading and Self-Adaptive Caching in Compute-Storage Disaggregated Infrastructure

    ACM Transactions on Storage · 2022-03-29 · 17 citations

    article · Senior author

    Active storage devices and in-storage computing have been proposed and developed in recent years to effectively reduce the amount of required data traffic and to improve the overall application performance. They are especially preferred in the compute-storage disaggregated infrastructure. In both techniques, a simple computing module is added to storage devices/servers such that some stored data can be processed in the storage devices/servers before being transmitted to application servers. This can reduce the required network bandwidth and offload certain computing requirements from application servers to storage devices/servers. However, several challenges exist when designing an in-storage computing-based architecture for applications. These include what computing functions need to be offloaded, how to design the protocol between in-storage modules and application servers, and how to deal with the caching issue in application servers. HBase is an important and widely used distributed Key-Value Store. It stores and indexes key-value pairs in large files in a storage system like HDFS. However, its performance, especially read performance, is impacted by the heavy traffic between HBase RegionServers and storage servers in the compute-storage disaggregated infrastructure when the available network bandwidth is limited. We propose an In-Storage-based HBase architecture, called IS-HBase, to improve the overall performance and to address the aforementioned challenges. First, IS-HBase executes a data pre-processing module (In-Storage ScanNer, called ISSN) for some read queries and returns the requested key-value pairs to RegionServers instead of returning data blocks in HFile. IS-HBase carries out compactions in storage servers to reduce the large amount of data being transmitted through the network and thus the compaction execution time is effectively reduced. Second, a set of new protocols is proposed to address the communication and coordination between HBase RegionServers at computing nodes and ISSNs at storage nodes. Third, a new self-adaptive caching scheme is proposed to better serve the read queries with fewer I/O operations and less network traffic. According to our experiments, the IS-HBase can reduce up to 97% network traffic for read queries and the throughput (queries per second) is significantly less affected by the fluctuation of available network bandwidth. The execution time of compaction in IS-HBase is only about 6.31% – 41.84% of the execution time of legacy HBase. In general, IS-HBase demonstrates the potential of adopting in-storage computing for other data-intensive distributed applications to significantly improve performance in compute-storage disaggregated infrastructure.

  • Machine Learning-based Adaptive Migration Algorithm for Hybrid Storage Systems

    2022-10-01 · 8 citations

    article · Senior author

    Hybrid storage systems are prevalent in most large-scale enterprise storage systems since they balance storage performance, storage capacity and cost. The goal of such systems is to serve the majority of the I/O requests from high-performance devices and store less frequently used data in low-performance devices. A large data migration volume between tiers can cause a huge overhead in practical hybrid storage systems. Therefore, how to balance the trade-off between the migration cost and potential performance gain is a challenging and critical issue in hybrid storage systems. In this paper, we focused on the data migration problem of hybrid storage systems with two classes of storage devices. A machine learning-based migration algorithm called K-Means assisted Support Vector Machine (K-SVM) migration algorithm is proposed. This algorithm is capable of more precisely classifying and efficiently migrating data between performance and capacity tiers. Moreover, this K-SVM migration algorithm involves a K-Means clustering algorithm to dynamically select a proper training dataset such that the proposed algorithm can significantly reduce the volume of migrating data. Finally, the real implementation results indicate that the ML-based algorithm reduces the migration data volume by about 40% and achieves 70% lower latency than other algorithms.

  • HL-DNA: A Hybrid Lossy/Lossless Encoding Scheme to Enhance DNA Storage Density and Robustness for Images

    2022 IEEE 40th International Conference on Computer Design (ICCD) · 2022-10-01 · 9 citations

    article

    With the demand of storage for high density and long-term preservation, Deoxyribonucleic Acid (DNA) has become a promising candidate to satisfy the requirement of archival storage for rapidly increasing digital volume. However, due to the biochemical constraints, DNA storage faces critical issues of low practical capacity and robustness. In this paper, we target image applications and propose to apply approximation to DNA storage to improve the overall encoding density and robustness of DNA storage by using a hybrid lossy and lossless encoding scheme (called HL-DNA). Several lossy and lossless encoding schemes (lossy and lossless codes) are proposed and used to encode incoming binary sequences. These two types of codes are coordinated to balance the encoding density and errors. The lossless codes are used to limit the errors and the lossy codes are used to improve the encoding density. Moreover, the introduced approximation and newly proposed hybrid encoding schemes in one DNA strand can improve the robustness of DNA storage. Finally, the experimental results indicate that the proposed HL-DNA improves the encoding density of DNA storage and makes it much closer to the ideal case. Also, HL-DNA achieves higher robustness to the injected errors than other DNA storage codes.

  • DNA Storage: A Promising Large Scale Archival Storage?

    arXiv (Cornell University) · 2022-04-04 · 3 citations

    preprint · Open access · Senior author

    Deoxyribonucleic Acid (DNA), with its high density and long durability, is a promising storage medium for long-term archival storage and has attracted much attention. Several studies have verified the feasibility of using DNA for archival storage with a small amount of data. However, the achievable storage capacity of DNA as archival storage has not been comprehensively investigated yet. Theoretically, the DNA storage density is about 1 exabyte/mm$^3$ ($10^9$ GB/mm$^3$). However, according to our investigation, DNA storage tube capacity based on the current synthesizing and sequencing technologies is only at hundreds of Gigabytes due to the limitation of multiple bio and technology constraints. This paper identifies and investigates the critical factors affecting the single DNA tube capacity for archival storage. Finally, we suggest several promising directions to overcome the limitations and enhance DNA storage capacity.
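
The PM-Dedup abstract in the list above notes that encrypting with content-based keys makes identical chunks produce identical ciphertexts, so duplicates can still be detected after encryption. The Python sketch below illustrates only that general idea and is not the PM-Dedup design. The names `content_key`, `toy_encrypt`, and `DedupStore` are hypothetical, the choice of SHA-256 for key derivation and fingerprints is an assumption, and the XOR keystream is a toy stand-in for a real cipher that should not be treated as secure.

```python
import hashlib


def content_key(chunk: bytes) -> bytes:
    """Derive the key from the chunk itself (assumption: SHA-256 of the plaintext)."""
    return hashlib.sha256(chunk).digest()


def toy_encrypt(chunk: bytes, key: bytes) -> bytes:
    """Toy deterministic cipher: XOR with a SHA-256 counter keystream.

    Illustration only, NOT secure. Applying it twice with the same key
    returns the original bytes, so it also serves as decryption.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(chunk):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(chunk, stream))


def fingerprint(ciphertext: bytes) -> str:
    """Fingerprint (hash) of an encrypted chunk, used for duplicate checks."""
    return hashlib.sha256(ciphertext).hexdigest()


class DedupStore:
    """Hypothetical server-side store that keeps one copy per fingerprint."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def upload(self, ciphertext: bytes) -> bool:
        """Store the chunk if it is new; return False when it is a duplicate."""
        fp = fingerprint(ciphertext)
        if fp in self.chunks:
            return False
        self.chunks[fp] = ciphertext
        return True


# Two users hold the same plaintext chunk.
alice_chunk = b"quarterly report, chunk 0"
bob_chunk = b"quarterly report, chunk 0"

# Content-based keys: same plaintext, same key, same ciphertext.
c_alice = toy_encrypt(alice_chunk, content_key(alice_chunk))
c_bob = toy_encrypt(bob_chunk, content_key(bob_chunk))

# Per-user keys: same plaintext, different ciphertexts, so dedup would fail.
c_alice_private = toy_encrypt(alice_chunk, b"alice-secret-key")
c_bob_private = toy_encrypt(bob_chunk, b"bob-secret-key")

store = DedupStore()
stored_first = store.upload(c_alice)
stored_second = store.upload(c_bob)
```

With content-based keys the second upload is recognised as a duplicate, while the per-user ciphertexts differ even though the plaintext is identical. A source-based scheme of the kind the abstract describes would send only the fingerprint first and upload the chunk only if the store does not already hold it.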

Frequent coauthors

  • Bingzhe Li

    28 shared
  • Xiongzi Ge

    NetApp (United States)

    25 shared
  • Zhichao Cao

    Arizona State University

    23 shared
  • Jim Diehl

    Twin Cities Orthopedics

    16 shared
  • David J. Lilja

    University of Minnesota

    13 shared
  • Fenggang Wu

    Meta (United States)

    12 shared
  • Dongchul Park

    Hyundai Motor Group (South Korea)

    12 shared
  • Young-Jin Nam

    Kwangwoon University

    12 shared

Awards & honors

  • IEEE Fellow (1998)
  • Fellow of the Minnesota Supercomputer Institute
  • IEEE and ACM Recognition of Service Award
  • US West Chair in Telecommunications (1994-2000)
  • Selected Grants CSR: Small: Heterogeneous Storage Systems wi…