Nam Sung Kim

Professor, Electrical and Computer Engineering

University of Illinois Urbana-Champaign · Computer Science

Active 1995–2024

h-index: 48
Citations: 10.9k
Papers: 339 · 86 last 5y
Funding: $2.4M

About

Nam Sung Kim is the W.J. 'Jerry' Sanders III Advanced Micro Devices, Inc. Endowed Chair Professor at the University of Illinois, Urbana-Champaign, and a fellow of ACM, IEEE, and NAI. His interdisciplinary research incorporates device, circuit, architecture, and software for power-efficient computing. Prior to joining the University of Illinois in 2015, he was an associate professor at the University of Wisconsin, Madison, where he received early tenure in 2013. His professional experience includes roles as a senior research scientist at Intel and a senior vice president at Samsung Electronics, where he led the development of next-generation DRAM products, including the industry's first HBM-PIM. Kim's research focuses on systems for AI/ML; high-performance and energy-efficient processor, memory, storage, network, and system architectures; and energy-efficient computing techniques for mobile/wearable devices and data centers. He has published nearly 300 refereed articles and received numerous prestigious awards, including the IEEE MICRO Best Paper Award, NSF CAREER Award, ACM/IEEE ISCA Paper Award, SIGMICRO Test of Time Awards, and the Intel Outstanding Researcher Award. His academic positions include assistant and associate professorships at the University of Wisconsin, Madison, and associate and full professorships at the University of Illinois, Urbana-Champaign, where he currently holds the endowed chair.

Research topics

  • Computer Science
  • Computer architecture
  • Embedded system
  • Parallel computing
  • Computer hardware
  • Operating system
  • Computer network
  • Distributed computing
  • Human–computer interaction
  • Engineering
  • Computer engineering
  • Telecommunications
  • Algorithm
  • Data science

Selected publications

  • Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices

    2023 · 108 citations

    Senior author · Corresponding
    • Computer Science
    • Parallel computing

    The ever-growing demands for memory with larger capacity and higher bandwidth have driven recent innovations on memory expansion and disaggregation technologies based on Compute eXpress Link (CXL). In particular, CXL-based memory expansion technology has recently gained notable attention for its ability not only to economically expand memory capacity and bandwidth but also to decouple memory technologies from a specific memory interface of the CPU. However, since CXL memory devices have not been widely available, they have been emulated using DDR memory in a remote NUMA node. In this paper, for the first time, we comprehensively evaluate a true CXL-ready system based on the latest 4th-generation Intel Xeon CPU with three CXL memory devices from different manufacturers. Specifically, we run a set of microbenchmarks not only to compare the performance of true CXL memory with that of emulated CXL memory but also to analyze the complex interplay between the CPU and CXL memory in depth. This reveals important differences between emulated CXL memory and true CXL memory, some of which will compel researchers to revisit the analyses and proposals from recent work. Next, we identify opportunities for memory-bandwidth-intensive applications to benefit from the use of CXL memory. Lastly, we propose a CXL-memory-aware dynamic page allocation policy, Caption, to more efficiently use CXL memory as a bandwidth expander. We demonstrate that Caption can automatically converge to an empirically favorable percentage of pages allocated to CXL memory, which improves the performance of memory-bandwidth-intensive applications by up to 24% when compared to the default page allocation policy designed for traditional NUMA systems.

  • Coordinated Science Laboratory 70th Anniversary Symposium: The Future of Computing

    arXiv (Cornell University) · 2022

    • Computer Science
    • Data science

    In 2021, the Coordinated Science Laboratory (CSL), an Interdisciplinary Research Unit at the University of Illinois Urbana-Champaign, hosted the Future of Computing Symposium to celebrate its 70th anniversary. CSL's research covers the full computing stack, computing's impact on society, and the resulting need for social responsibility. In this white paper, we summarize the major technological points, insights, and directions that speakers brought forward during the Future of Computing Symposium. Participants discussed topics related to new computing paradigms, technologies, algorithms, behaviors, and research challenges to be expected in the future. The symposium focused on new computing paradigms that are going beyond traditional computing and the research needed to support their realization. These needs included an emphasis on security and privacy, end-to-end human cyber-physical systems, and, with them, the analysis of end-to-end artificial intelligence needs. Furthermore, with advances that enable immersive environments for users, the boundaries between humans and machines will blur and become seamless. Particular integration challenges were made clear in the final discussion on the integration of autonomous driving, robotaxis, pedestrians, and future cities. Innovative approaches were outlined to motivate the next generation of researchers to work on these challenges. The discussion brought out the importance of considering not just individual research areas, but innovations at the intersections between computing research efforts and relevant application domains, such as health care, transportation, energy systems, and manufacturing.

  • Harmony

    Proceedings of the VLDB Endowment · 2022 · 18 citations

    Senior author · Corresponding
    • Computer Science
    • Parallel computing

    Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training massive models efficiently on a single commodity server. Across various massive DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.

  • 25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

    2022 IEEE International Solid-State Circuits Conference (ISSCC) · 2021 · 188 citations

    Senior author · Corresponding
    • Computer Science
    • Embedded system

    In recent years, artificial intelligence (AI) technology has proliferated rapidly and widely into application areas such as speech recognition, health care, and autonomous driving. To increase the capabilities of AI, more powerful systems are needed to process a larger amount of data. This requirement has made domain-specific accelerators, such as GPUs and TPUs, popular, as they can provide orders of magnitude higher performance than state-of-the-art CPUs. However, these accelerators can only operate at their peak performance when they get the necessary data from memory as quickly as it is processed, requiring off-chip memory with a high bandwidth and a large capacity [1]. HBM has thus far met the bandwidth and capacity requirement [2-6], but recent AI technologies such as recurrent neural networks require an even higher bandwidth than HBM [7-8]. While a further increase in off-chip bandwidth can be accomplished by various techniques, it is often limited by power constraints at the chip or system level [9]. Hence, it is essential to decrease demand for off-chip bandwidth with unconventional architectures such as processing-in-memory. In this paper, we present function-in-memory DRAM (FIMDRAM) that integrates a 16-wide single-instruction multiple-data engine within the memory banks and that exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution. Second, we show techniques that do not require any modification to conventional memory controllers and their command protocols, which make FIMDRAM more practical for quick industry adoption. Finally, we conclude this paper with circuit- and system-level evaluations of our fabricated FIMDRAM.

  • Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product

    2021 · 230 citations

    Senior author · Corresponding
    • Computer Science
    • Embedded system

    Emerging applications such as deep neural networks demand high off-chip memory bandwidth. However, under stringent physical constraints of chip packages and system boards, it becomes very expensive to further increase the bandwidth of off-chip memory. Besides, transferring data across the memory hierarchy constitutes a large fraction of total energy consumption of systems, and the fraction has steadily increased with the stagnant technology scaling and poor data reuse characteristics of such emerging applications. To cost-effectively increase the bandwidth and energy efficiency, researchers began to reconsider the past processing-in-memory (PIM) architectures and advance them further, especially exploiting recent integration technologies such as 2.5D/3D stacking. Despite the recent advances, no major memory manufacturer has developed even a proof-of-concept silicon yet, not to mention a product. This is because the past PIM architectures often require changes in host processors and/or application code which memory manufacturers cannot easily govern. In this paper, elegantly tackling the aforementioned challenges, we propose an innovative yet practical PIM architecture. To demonstrate its practicality and effectiveness at the system level, we implement it with a 20nm DRAM technology, integrate it with an unmodified commercial processor, develop the necessary software stack, and run existing applications without changing their source code. Our evaluation at the system level shows that our PIM improves the performance of memory-bound neural network kernels and applications by 11.2× and 3.5×, respectively. Atop the performance improvement, PIM also reduces the energy per bit transfer by 3.5×, and the overall energy efficiency of the system running the applications by 3.2×.

  • Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks

    2020 · 111 citations

    • Computer Science
    • Computer architecture

    Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling-out the number of accelerators is not economically cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria that can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allows omni-directional flow of data. Second, it uses this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic array based DNN accelerators. Architecture fission and its associated flexibility enables an extra degree of freedom for task scheduling, that even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, it can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness. We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency, the same amount of compute and memory resources for both accelerators. The results show significant benefits with (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).

Frequent coauthors

  • Trevor Mudge

    University of Michigan–Ann Arbor

    29 shared
  • Jung Ho Ahn

    Seoul National University

    28 shared
  • Myoungsoo Jung

    College of Micronesia-FSM

    24 shared
  • Michael Schulte

    Advanced Micro Devices (Canada)

    23 shared
  • Mohammad Alian

    University of Kansas

    22 shared
  • Ulya R. Karpuzcu

    20 shared
  • Abhishek Sinkar

    Oracle (United States)

    19 shared
  • Murali Annavaram

    18 shared

Labs

  • Siebel School of Computing and Data Science · PI

Education

  • Ph.D., Computer Science

    University of Illinois at Urbana-Champaign

    2000
  • M.S., Computer Science

    University of Illinois at Urbana-Champaign

    1995
  • B.S., Computer Science

    Seoul National University

    1991

Awards & honors

  • IEEE International Symposium on Microarchitecture (MICRO) Be…
  • NSF CAREER Award (2010)
  • ACM/IEEE Most Influential International Symposium on Compute…
  • SIGMICRO 2021 Test of Time Awards (2021)
  • Intel Outstanding Researcher Award (2024)
