TY - GEN
T1 - Support for high-frequency streaming in CMPs
AU - Rangan, Ram
AU - Vachharajani, Neil
AU - Stoler, Adam
AU - Ottoni, Guilherme
AU - August, David I.
AU - Cai, George Z.N.
PY - 2006
Y1 - 2006
N2 - As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intrathread delays from the repeated execution of the communication operations. Using this observation, we define the design-space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.
AB - As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intrathread delays from the repeated execution of the communication operations. Using this observation, we define the design-space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.
UR - http://www.scopus.com/inward/record.url?scp=40349096761&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=40349096761&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2006.47
DO - 10.1109/MICRO.2006.47
M3 - Conference contribution
AN - SCOPUS:40349096761
SN - 0769527329
SN - 9780769527321
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 259
EP - 269
BT - Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-39
T2 - 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-39
Y2 - 9 December 2006 through 13 December 2006
ER -