TY - GEN
T1 - GhOST
T2 - 51st ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2024
AU - Chaturvedi, Ishita
AU - Godala, Bhargav Reddy
AU - Wu, Yucan
AU - Xu, Ziyang
AU - Iliakis, Konstantinos
AU - Eleftherakis, Panagiotis Eleftherios
AU - Xydis, Sotirios
AU - Soudris, Dimitrios
AU - Sorensen, Tyler
AU - Campanoni, Simone
AU - Aamodt, Tor M.
AU - August, David I.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.
AB - Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.
KW - GPU
KW - GPU Microarchitecture
KW - Parallelism
KW - low overhead out-of-order execution
KW - out-of-order execution
UR - https://www.scopus.com/pages/publications/85201156120
UR - https://www.scopus.com/pages/publications/85201156120#tab=citedBy
U2 - 10.1109/ISCA59077.2024.00011
DO - 10.1109/ISCA59077.2024.00011
M3 - Conference contribution
AN - SCOPUS:85201156120
T3 - Proceedings - International Symposium on Computer Architecture
SP - 1
EP - 16
BT - Proceedings - 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture, ISCA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 29 June 2024 through 3 July 2024
ER -