I build compilers, programming models, and design automation tools that make it easier to design and program specialized hardware.
From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, André Rösti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang
ACM TRETS · paper · code
Abstract
We introduce MLIR-AIR, an open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs, achieving up to 78.7% compute efficiency on matrix multiplication.
Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device
Niansong Zhang, Wenbo Zhu, Courtney Golden, Dan Ilan, Hongzheng Chen, Christopher Batten, Zhiru Zhang
MICRO 2025 · paper
Abstract
This paper characterizes a commercial compute-in-SRAM device using realistic workloads, proposes key data management optimizations, and demonstrates that it can match GPU-level performance on retrieval-augmented generation tasks while achieving over 46x energy savings.
ASPEN: LLM-Guided E-Graph Rewriting for RTL Datapath Optimization
Niansong Zhang, Chenhui Deng, Johannes Maximilian Kuehn, Chia-Tung Ho, Cunxi Yu, Zhiru Zhang, Haoxing Ren
MLCAD 2025 · paper
Abstract
ASPEN uses LLM-guided e-graph rewriting with real PPA feedback for RTL datapath optimization. With 16.51% area and 6.65% delay improvements over prior methods, ASPEN shows that LLM guidance and equivalence-preserving rewriting can be combined in a fully automated flow.
Cypress: VLSI-Inspired PCB Placement with GPU Acceleration
Niansong Zhang, Anthony Agnesina, Noor Shbat, Yuval Leader, Zhiru Zhang, Haoxing Ren
ISPD 2025 · paper · code
🏆 Best Paper Award
Abstract
We present Cypress, a GPU-accelerated, VLSI-inspired PCB placer that boosts routability by up to 5.9x, cuts track length by 19.7x, and runs up to 492x faster on new realistic benchmarks.
ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines
Jinming Zhuang*, Shaojie Xiang*, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou * Equal Contribution
FPGA 2025 · paper · code
🏅 Best Paper Nominee
Abstract
We propose ARIES, a unified MLIR-based compilation flow that abstracts task, tile, and instruction-level parallelism across AMD AI Engine arrays (and optional FPGA fabric), boosting Versal VCK190 GEMM throughput by up to 1.6x over prior work.
Allo: A Programming Model for Composable Accelerator Design
Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang * Equal Contribution
PLDI 2024 · paper · code
Abstract
Allo, a new composable programming model, decouples hardware customizations from algorithms and outperforms existing languages in performance and productivity for specialized hardware accelerator design.
Formal Verification of Source-to-Source Transformations for HLS
Louis-Noël Pouchet, Emily Tucker, Niansong Zhang, Hongzheng Chen, Debjit Pal, Gabriel Rodríguez, Zhiru Zhang
FPGA 2024 · paper · code
🏆 Best Paper Award
Abstract
We target the problem of efficiently checking semantic equivalence between two C/C++ programs as a means of ensuring the correctness of the description provided to the HLS toolchain.
Understanding the Potential of FPGA-based Spatial Acceleration for Large Language Model Inference
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
ACM TRETS, Vol. 18, No. 1, Article 5 · paper
Abstract
We design a spatial FPGA accelerator for LLM inference that assigns each operator its own hardware block and connects them with on-chip dataflow to cut memory traffic and latency.
Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator
Courtney Golden, Dan Ilan, Caroline Huang, Niansong Zhang, Zhiru Zhang, Christopher Batten
IEEE Computer Architecture Letters · paper
Abstract
We implement a virtual vector instruction set on a commercial Compute-in-SRAM device, and perform detailed instruction microbenchmarking to identify performance benefits and overheads.
Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective
Shulin Zeng, Guohao Dai, Niansong Zhang, Xinhao Yang, Haoyu Zhang, Zhenhua Zhu, Huazhong Yang, Yu Wang
IEEE Transactions on Computers · paper
🏅 Featured Paper in the May 2023 Issue
Abstract
This paper proposes a Design Space Exploration framework to jointly optimize heterogeneous multi-core architecture, layer scheduling, and compiler mapping for serving DNN workloads on cloud FPGAs.
Accelerator Design with Decoupled Hardware Customizations: Benefits and Challenges
Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini, Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, Zhiru Zhang
Invited Paper, DAC 2022 · paper
Abstract
We show the advantages of the decoupled programming model and discuss our recent efforts toward a robust and viable verification solution.
CodedVTR: Codebook-Based Sparse Voxel Transformer with Geometric Guidance
Tianchen Zhao, Niansong Zhang, Xuefei Ning, He Wang, Li Yi, Yu Wang
CVPR 2022 · paper · website · slides · poster · video
Abstract
We propose CodedVTR, a flexible 3D Transformer on sparse voxels that decomposes the attention space into linear combinations of learnable prototypes to regularize attention learning, and uses geometry-aware self-attention to guide training.
HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs
Shaojie Xiang, Yi-Hsiang Lai, Yuan Zhou, Hongzheng Chen, Niansong Zhang, Debjit Pal, Zhiru Zhang
FPGA 2022 · paper · code
Abstract
We propose an FPGA accelerator programming model that decouples the algorithm specification from optimizations related to orchestrating the placement of data across a customized memory hierarchy.
RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms
Niansong Zhang, Xiang Chen, Nachiket Kapre
Invited Paper, ACM TRETS, Vol. 15, Issue 4, Article 38 · paper
🏆 Best Paper Award
Abstract
We extend our earlier RapidLayout work with cross-SLR routing, placement transfer learning, and placement bootstrapping from a much smaller device, improving runtime and design quality.
RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms
Niansong Zhang, Xiang Chen, Nachiket Kapre
FPL 2020 · paper · code
🏅 Michal Servit Best Paper Award Nominee
Abstract
We build a fast, high-performance evolutionary placer for FPGA-optimized hard block designs targeting high clock frequencies of 650+ MHz.