TY - JOUR
T1 - A 3-D IC for Mitigating Energy of Memory Accessing and Data Movement in Accelerator-Based Streaming Architectures
AU - Chen, Lung Yen
AU - Tao, Sen
AU - Verma, Naveen
N1 - Funding Information:
This work was supported in part by Semiconductor Research Corporation, in part by the Air Force Office of Scientific Research, in part by the National Science Foundation, in part by C-FAR/SONIC, in part by MARCO, and in part by DARPA. This paper was approved by Associate Editor Vivek De.
Manuscript received August 2, 2018; revised December 9, 2018; accepted January 4, 2019. Date of publication January 31, 2019; date of current version May 24, 2019. (Corresponding author: Lung-Yen Chen.) L.-Y. Chen and N. Verma are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: lungyenc@princeton.edu).
Publisher Copyright:
© 1966-2012 IEEE.
PY - 2019/6
Y1 - 2019/6
N2 - This paper presents a 3-D integrated circuit (3-D IC) for heterogeneous domain-specific streaming architectures. In such architectures, an array of fine-grained accelerators is provided for executing kernels, and applications are mapped via configuration of the accelerators into a desired computation pipeline. The two-layer 3-D IC addresses architectures for different application domains, through a generic routing-and-memory (RM) layer and a separate compute-accelerator (CA) layer, which could ultimately be selected at assembly time for different application domains. The RM layer provides a configurable routing network, as well as memory for pipeline buffering and computation scratch pad. The routing network is based on a 2-D mesh with low-swing signaling. The memory is organized as 32 fine-grained (1-kB) SRAM tiles for increased interface parallelism, reduced access energy, and modularity, to interface with different accelerators in the CA layer. Memory-driver and sensing circuits are reused by the low-swing routing network, both for repeaters and to directly load pipeline data into accelerator input buffers. For the prototype, the CA layer is implemented as an array of multiplexers, providing off-chip interfacing to any memory tile, thereby enabling different accelerators to be emulated by an off-chip field-programmable gate array (FPGA). The 3-D interconnection is achieved by 8-μm-pitch face-to-face (F2F) vias and wafer-level assembly. For the 2.47 × 3.38 mm² two-layer die, implemented in 130-nm CMOS, the total peak memory bandwidth is 9.2 GB/s/mm². A compute pipeline for computational photography is demonstrated, with the total energy of the accelerators reduced by over 2×, by exploiting parallelism enabled by interfaces to fine-grained RM-layer memory.
AB - This paper presents a 3-D integrated circuit (3-D IC) for heterogeneous domain-specific streaming architectures. In such architectures, an array of fine-grained accelerators is provided for executing kernels, and applications are mapped via configuration of the accelerators into a desired computation pipeline. The two-layer 3-D IC addresses architectures for different application domains, through a generic routing-and-memory (RM) layer and a separate compute-accelerator (CA) layer, which could ultimately be selected at assembly time for different application domains. The RM layer provides a configurable routing network, as well as memory for pipeline buffering and computation scratch pad. The routing network is based on a 2-D mesh with low-swing signaling. The memory is organized as 32 fine-grained (1-kB) SRAM tiles for increased interface parallelism, reduced access energy, and modularity, to interface with different accelerators in the CA layer. Memory-driver and sensing circuits are reused by the low-swing routing network, both for repeaters and to directly load pipeline data into accelerator input buffers. For the prototype, the CA layer is implemented as an array of multiplexers, providing off-chip interfacing to any memory tile, thereby enabling different accelerators to be emulated by an off-chip field-programmable gate array (FPGA). The 3-D interconnection is achieved by 8-μm-pitch face-to-face (F2F) vias and wafer-level assembly. For the 2.47 × 3.38 mm² two-layer die, implemented in 130-nm CMOS, the total peak memory bandwidth is 9.2 GB/s/mm². A compute pipeline for computational photography is demonstrated, with the total energy of the accelerators reduced by over 2×, by exploiting parallelism enabled by interfaces to fine-grained RM-layer memory.
KW - 3-D integrated circuit (3-D IC)
KW - accelerator
KW - high-bandwidth interface
KW - low-swing interconnect
KW - streaming architecture
UR - http://www.scopus.com/inward/record.url?scp=85066427523&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066427523&partnerID=8YFLogxK
U2 - 10.1109/JSSC.2019.2892605
DO - 10.1109/JSSC.2019.2892605
M3 - Article
AN - SCOPUS:85066427523
SN - 0018-9200
VL - 54
SP - 1778
EP - 1788
JO - IEEE Journal of Solid-State Circuits
JF - IEEE Journal of Solid-State Circuits
IS - 6
M1 - 8630831
ER -