Nam Sung Kim

Professor, Electrical and Computer Engineering

University of Illinois Urbana-Champaign · Computer Science

Active 1995–2024

h-index 48

Citations 10.9k

Papers 339 · 86 last 5y

Funding $2.4M

About

Nam Sung Kim is the W.J. 'Jerry' Sanders III Advanced Micro Devices, Inc. Endowed Chair Professor at the University of Illinois, Urbana-Champaign, and a fellow of ACM, IEEE, and NAI. His interdisciplinary research incorporates device, circuit, architecture, and software for power-efficient computing. Prior to joining the University of Illinois in 2015, he was an associate professor at the University of Wisconsin, Madison, where he received early tenure in 2013. His professional experience includes roles as a senior research scientist at Intel and a senior vice president at Samsung Electronics, where he led the development of next-generation DRAM products, including the industry's first HBM-PIM. Kim's research focuses on systems for AI/ML; high-performance and energy-efficient processor, memory, storage, network, and system architectures; and energy-efficient computing techniques for mobile/wearable devices and data centers. He has published nearly 300 refereed articles and received numerous prestigious awards, including the IEEE MICRO Best Paper Award, NSF CAREER Award, ACM/IEEE ISCA Paper Award, SIGMICRO Test of Time Awards, and the Intel Outstanding Researcher Award. His academic positions include assistant and associate professorships at the University of Wisconsin, Madison, and associate and full professorships at the University of Illinois, Urbana-Champaign, where he currently holds the endowed chair.

Research topics

Computer Science
Computer architecture
Embedded system
Parallel computing
Computer hardware
Operating system
Computer network
Distributed computing
Human–computer interaction
Engineering
Computer engineering
Telecommunications
Algorithm
Data science

Selected publications

Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices
2023 · 108 citations
Senior author · Corresponding
- Computer Science
- Parallel computing
The ever-growing demands for memory with larger capacity and higher bandwidth have driven recent innovations on memory expansion and disaggregation technologies based on Compute eXpress Link (CXL). In particular, CXL-based memory expansion technology has recently gained notable attention for its ability not only to economically expand memory capacity and bandwidth but also to decouple memory technologies from a specific memory interface of the CPU. However, since CXL memory devices have not been widely available, they have been emulated using DDR memory in a remote NUMA node. In this paper, for the first time, we comprehensively evaluate a true CXL-ready system based on the latest 4th-generation Intel Xeon CPU with three CXL memory devices from different manufacturers. Specifically, we run a set of microbenchmarks not only to compare the performance of true CXL memory with that of emulated CXL memory but also to analyze the complex interplay between the CPU and CXL memory in depth. This reveals important differences between emulated CXL memory and true CXL memory, some of which will compel researchers to revisit the analyses and proposals from recent work. Next, we identify opportunities for memory-bandwidth-intensive applications to benefit from the use of CXL memory. Lastly, we propose a CXL-memory-aware dynamic page allocation policy, Caption, to use CXL memory more efficiently as a bandwidth expander. We demonstrate that Caption can automatically converge to an empirically favorable percentage of pages allocated to CXL memory, which improves the performance of memory-bandwidth-intensive applications by up to 24% when compared to the default page allocation policy designed for traditional NUMA systems.
Coordinated Science Laboratory 70th Anniversary Symposium: The Future of Computing
arXiv (Cornell University) · 2022
- Computer Science
- Data science
In 2021, the Coordinated Science Laboratory (CSL), an Interdisciplinary Research Unit at the University of Illinois Urbana-Champaign, hosted the Future of Computing Symposium to celebrate its 70th anniversary. CSL's research covers the full computing stack, computing's impact on society, and the resulting need for social responsibility. In this white paper, we summarize the major technological points, insights, and directions that speakers brought forward during the Future of Computing Symposium. Participants discussed topics related to new computing paradigms, technologies, algorithms, behaviors, and research challenges to be expected in the future. The symposium focused on new computing paradigms that go beyond traditional computing and the research needed to support their realization. These needs included stressing security and privacy, end-to-end human cyber-physical systems, and with them the analysis of end-to-end artificial intelligence needs. Furthermore, with advances that enable immersive environments for users, the boundaries between humans and machines will blur and become seamless. Particular integration challenges were made clear in the final discussion on the integration of autonomous driving, robo-taxis, pedestrians, and future cities. Innovative approaches were outlined to motivate the next generation of researchers to work on these challenges. The discussion brought out the importance of considering not just individual research areas, but innovations at the intersections between computing research efforts and relevant application domains, such as health care, transportation, energy systems, and manufacturing.
Harmony
Proceedings of the VLDB Endowment · 2022 · 18 citations
Senior author · Corresponding
- Computer Science
- Parallel computing
Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training massive models efficiently on a single commodity server. Across various massive DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.
25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications
2021 IEEE International Solid-State Circuits Conference (ISSCC) · 2021 · 188 citations
Senior author · Corresponding
- Computer Science
- Embedded system
In recent years, artificial intelligence (AI) technology has proliferated rapidly and widely into application areas such as speech recognition, health care, and autonomous driving. To increase the capabilities of AI, more powerful systems are needed to process a larger amount of data. This requirement has made domain-specific accelerators, such as GPUs and TPUs, popular, as they can provide orders of magnitude higher performance than state-of-the-art CPUs. However, these accelerators can only operate at their peak performance when they get the necessary data from memory as quickly as it is processed, requiring off-chip memory with a high bandwidth and a large capacity [1]. HBM has thus far met the bandwidth and capacity requirement [2-6], but recent AI technologies such as recurrent neural networks require an even higher bandwidth than HBM [7-8]. While a further increase in off-chip bandwidth can be accomplished by various techniques, it is often limited by power constraints at the chip or system level [9]. Hence, it is essential to decrease demand for off-chip bandwidth with unconventional architectures, such as processing-in-memory. In this paper, we present function-in-memory DRAM (FIMDRAM) that integrates a 16-wide single-instruction multiple-data engine within the memory banks and that exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution. Second, we show techniques that do not require any modification to conventional memory controllers and their command protocols, which make FIMDRAM more practical for quick industry adoption. Finally, we conclude this paper with circuit- and system-level evaluations of our fabricated FIMDRAM.
Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product
2021 · 230 citations
Senior author · Corresponding
- Computer Science
- Embedded system
Emerging applications such as deep neural networks demand high off-chip memory bandwidth. However, under stringent physical constraints of chip packages and system boards, it becomes very expensive to further increase the bandwidth of off-chip memory. Besides, transferring data across the memory hierarchy constitutes a large fraction of total energy consumption of systems, and the fraction has steadily increased with the stagnant technology scaling and poor data reuse characteristics of such emerging applications. To cost-effectively increase the bandwidth and energy efficiency, researchers began to reconsider the past processing-in-memory (PIM) architectures and advance them further, especially exploiting recent integration technologies such as 2.5D/3D stacking. Despite the recent advances, no major memory manufacturer has developed even a proof-of-concept silicon yet, not to mention a product. This is because the past PIM architectures often require changes in host processors and/or application code which memory manufacturers cannot easily govern. In this paper, elegantly tackling the aforementioned challenges, we propose an innovative yet practical PIM architecture. To demonstrate its practicality and effectiveness at the system level, we implement it with a 20nm DRAM technology, integrate it with an unmodified commercial processor, develop the necessary software stack, and run existing applications without changing their source code. Our evaluation at the system level shows that our PIM improves the performance of memory-bound neural network kernels and applications by 11.2× and 3.5×, respectively. Atop the performance improvement, PIM also reduces the energy per bit transfer by 3.5×, and the overall energy efficiency of the system running the applications by 3.2×.
Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks
2020 · 111 citations
- Computer Science
- Computer architecture
Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling out the number of accelerators is not economically cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria that can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allow omni-directional flow of data. Second, it uses this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic array based DNN accelerators. Architecture fission and its associated flexibility enable an extra degree of freedom for task scheduling, which even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, it can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness.
We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency, the same amount of compute and memory resources for both accelerators. The results show significant benefits with (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).

Recent grants

CI-P: Planning Simulation Infrastructure Evaluation for Parallel/Distributed Computer Systems
NSF · $100k · 2015–2015
CNS: CSR: Small: Runtime System, Architecture, and Technology Codesign Approach for Heterogeneous Many-Core Processors and Clusters
NSF · $223k · 2015–2016
CAREER: Approximate Computing Systems for Future Teraflops Workloads
NSF · $91k · 2015–2016
CNS: CSR: Small: Runtime System, Architecture, and Technology Codesign Approach for Heterogeneous Many-Core Processors and Clusters
NSF · $450k · 2012–2015
CSR: Medium: Collaborative Research: Scale-Out Near-Data Acceleration of Machine Learning
NSF · $600k · 2017–2022

Frequent coauthors

Trevor Mudge
University of Michigan–Ann Arbor
29 shared
Jung Ho Ahn
Seoul National University
28 shared
Myoungsoo Jung
College of Micronesia-FSM
24 shared
Michael Schulte
Advanced Micro Devices (Canada)
23 shared
Mohammad Alian
University of Kansas
22 shared
Ulya R. Karpuzcu
20 shared
Abhishek Sinkar
Oracle (United States)
19 shared
Murali Annavaram
18 shared

Labs

Siebel School of Computing and Data Science · PI

Education

Ph.D., Computer Science
University of Illinois at Urbana-Champaign
2000
M.S., Computer Science
University of Illinois at Urbana-Champaign
1995
B.S., Computer Science
Seoul National University
1991

Awards & honors

IEEE International Symposium on Microarchitecture (MICRO) Be…
NSF CAREER Award (2010)
ACM/IEEE Most Influential International Symposium on Compute…
SIGMICRO 2021 Test of Time Awards (2021)
Intel Outstanding Researcher Award (2024)
