TY - GEN
T1 - Tiny but Mighty
T2 - 49th IEEE/ACM International Symposium on Computer Architecture, ISCA 2022
AU - Orenes-Vera, Marcelo
AU - Manocha, Aninda
AU - Balkind, Jonathan
AU - Gao, Fei
AU - Aragón, Juan L.
AU - Wentzlaff, David
AU - Martonosi, Margaret
N1 - Publisher Copyright: © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2022/6/18
Y1 - 2022/6/18
N2 - Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy. Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification. We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls, and enabling greater memory parallelism (MLP). In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targeting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques.
AB - Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy. Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification. We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls, and enabling greater memory parallelism (MLP). In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targeting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques.
KW - Decoupling
KW - Latency tolerance
KW - Memory
KW - Modular RTL
UR - http://www.scopus.com/inward/record.url?scp=85132823454&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132823454&partnerID=8YFLogxK
U2 - 10.1145/3470496.3527400
DO - 10.1145/3470496.3527400
M3 - Conference contribution
AN - SCOPUS:85132823454
T3 - Proceedings - International Symposium on Computer Architecture
SP - 817
EP - 830
BT - ISCA 2022 - Proceedings of the 49th Annual International Symposium on Computer Architecture
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 June 2022 through 22 June 2022
ER -