
Haotian Jiang
Assistant Professor of Computer Science · University of Chicago · Computer Science
Active 2013–2024
About
Haotian Jiang is an Assistant Professor of Computer Science at the University of Chicago. He previously worked as a Postdoctoral Researcher at Microsoft Research, Redmond. He obtained his PhD from the Paul G. Allen School of Computer Science & Engineering at the University of Washington in December 2022, under the supervision of Yin Tat Lee. His research interests broadly encompass theoretical computer science and applied mathematics, with a primary focus on the design and analysis of algorithms for continuous and discrete optimization problems. He approaches algorithm design through the lens of discrepancy theory. His work in optimization has been recognized with awards such as a Best Student Paper Award at SODA 2021 and a Best Paper Award at SODA 2025.
Research topics
- Computer Science
- Artificial Intelligence
- Data Mining
- Materials science
- Mathematics
- Speech recognition
- Multimedia
- Computer vision
- Human–computer interaction
- Psychology
- Computational science
- Parallel computing
- Algorithm
Selected publications
The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
2024-06-16 · 9 citations
Article
In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model.
Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning
2024-06-16 · 2 citations
Article · 1st author · Corresponding
Halftoning is a time-honored printing technique that simulates continuous tones using ink dots (halftone dots). The resurgence of deep learning has catalyzed the emergence of innovative technologies in the printing industry, fostering the advancement of data-driven halftoning methods. Nevertheless, current deep learning-based approaches produce halftones through image-to-image black-box transformations, lacking direct control over the movement of individual halftone dots. In this paper, we propose an innovative halftoning method termed “neural dot-controllable halftoning”. This method allows dot-level image dithering by providing direct control over the motion of each ink dot. We conceptualize halftoning as the process of sprinkling dots on a canvas. Initially, a specific number of dots is randomly dispersed on the canvas and subsequently adjusted based on the surrounding grayscale and gradient. To establish differentiable transformations between discrete ink dot positions and halftone matrices, we devise a lightweight dot encoding network to spread dense gradients to sparse dots. Dot control offers several advantages to our approach, including the capability to regulate the quantity of halftone dots and to enhance specific areas with artifacts in the generated halftones by adjusting the placement of the dots. Our proposed method outperforms previous approaches in extensive quantitative and qualitative experiments.
Ego4D: Around the World in 3,600 Hours of Egocentric Video
IEEE Transactions on Pattern Analysis and Machine Intelligence · 2024-07-26 · 11 citations
Article · Open access
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.
Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2022-06-01 · 45 citations
Article · 1st author · Corresponding
Video summarization has recently attracted increasing attention in the computer vision community. However, the scarcity of annotated data has been a key obstacle in this task. To address it, this work explores a new solution for video summarization by transferring samples from a correlated task (i.e., video moment localization) equipped with abundant training data. Our main insight is that the annotated video moments also indicate the semantic highlights of a video, essentially similar to a video summary. Approximately, the video summary can be treated as a sparse, redundancy-free version of the video moments. Inspired by this observation, we propose an importance Propagation based collaborative Teaching Network (iPTNet). It consists of two separate modules that conduct video summarization and moment localization, respectively. Each module estimates a frame-wise importance map for indicating keyframes or moments. To perform cross-task sample transfer, we devise an importance propagation module that realizes the conversion between summarization-guided and localization-guided importance maps. This makes it possible to optimize one of the tasks using the data from the other task. Additionally, in order to avoid error amplification caused by batch-wise joint training, we devise a collaborative teaching scheme, which adopts a cross-task mean teaching strategy to realize the joint optimization of the two tasks and provide robust frame-level teaching signals. Extensive experiments on video summarization benchmarks demonstrate that iPTNet significantly outperforms previous state-of-the-art video summarization methods, serving as an effective solution that overcomes the data scarcity issue in video summarization.
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2022 · 40 citations
1st author · Corresponding
- Computer Science
- Artificial Intelligence
Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
Ego4D: Around the World in 3,000 Hours of Egocentric Video
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2022 · 540 citations
- Computer Science
- Artificial Intelligence
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/
Good to the Last Bit: Data-Driven Encoding with CodecDB
2021-06-09 · 43 citations
Article · 1st author · Corresponding
Columnar databases rely on specialized encoding schemes to reduce storage requirements. These encodings also enable efficient in-situ data processing. Nevertheless, many existing columnar databases are encoding-oblivious. When storing the data, these systems rely on a global understanding of the dataset or the data types to derive simple rules for encoding selection. Such rule-based selection leads to unsatisfactory performance. Specifically, when performing queries, the systems always decode data into memory, ignoring the possibility of optimizing access to encoded data. We develop CodecDB, an encoding-aware columnar database, to demonstrate the benefit of tightly coupling the database design with the data encoding schemes. CodecDB chooses in a principled manner the most efficient encoding for a given data column and relies on encoding-aware query operators to optimize access to encoded data. Storage-wise, CodecDB achieves on average 90% accuracy for selecting the best encoding and improves the compression ratio by up to 40% compared to the state-of-the-art encoding selection solution. Query-wise, CodecDB is on average one order of magnitude faster than the latest open-source and commercial columnar databases on the TPC-H benchmark, and on average 3x faster than a recent research project on the Star-Schema Benchmark (SSB).
Decomposed bounded floats for fast compression and queries
Proceedings of the VLDB Endowment · 2021 · 54 citations
- Computer Science
- Data Mining
Modern data-intensive applications often generate large amounts of low-precision float data with a limited range of values. Despite the prevalence of such data, there is a lack of an effective solution to ingest, store, and analyze bounded, low-precision, numeric data. To address this gap, we propose Buff, a new compression technique that uses decomposed columnar storage and encoding methods to provide effective compression, fast ingestion, and high-speed in-situ adaptive query operators with SIMD support.
arXiv (Cornell University) · 2021-07-09 · 32 citations
Preprint · Open access
Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beam-forming and speech enhancement require high quality representative data. To the best of the author's knowledge, as of publication there are no available datasets that contain synchronized egocentric multi-channel audio and video with dynamic movement and conversations in a noisy environment. In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer. We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics. The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head bounding boxes, target of speech and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
Proceedings of the VLDB Endowment · 2020-02-01 · 32 citations
Article · 1st author · Corresponding
We propose PIDS, Pattern Inference Decomposed Storage, an innovative storage method for decomposing string attributes in columnar stores. Using an unsupervised approach, PIDS identifies common patterns in string attributes from relational databases, and uses the discovered pattern to split each attribute into sub-attributes. First, by storing and encoding each sub-attribute individually, PIDS can achieve a compression ratio comparable to Snappy and Gzip. Second, by decomposing the attribute, PIDS can push down many query operators to sub-attributes, thereby minimizing I/O and potentially expensive comparison operations, resulting in the faster execution of query operators.
Frequent coauthors
- 9 shared
Aaron J. Elmore
University of Chicago
- 7 shared
Chunwei Liu
Nanjing University of Information Science and Technology
- 6 shared
John Paparrizos
The Ohio State University
- 5 shared
Vamsi Krishna Ithapu
- 4 shared
James M. Rehg
- 4 shared
Andrew A. Chien
Argonne National Laboratory
- 4 shared
Ilija Radosavovic
- 4 shared
Wenqi Jia
Labs
Education
- 2022
Ph.D., Computer Science
Paul G. Allen School of Computer Science & Engineering
Other
Microsoft Research, Redmond
Awards & honors
- Best Paper Award at SODA 2025
- Best Student Paper Award at SODA 2021