TY - GEN
T1 - Inter-core cooperative TLB prefetchers for chip multiprocessors
AU - Bhattacharjee, Abhishek
AU - Martonosi, Margaret Rose
PY - 2010
Y1 - 2010
AB - Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for uniprocessors. With the growing dominance of chip multiprocessors (CMPs), it is necessary to examine TLB performance in the context of parallel workloads. This work is the first to present TLB prefetchers that exploit commonality in TLB miss patterns across cores in CMPs. We propose and evaluate two Inter-Core Cooperative (ICC) TLB prefetching mechanisms, assessing their effectiveness at eliminating TLB misses both individually and together. Our results show these approaches require at most modest hardware and can collectively eliminate 19% to 90% of data TLB (D-TLB) misses across the surveyed parallel workloads. We also compare performance improvements across a range of hardware and software implementation possibilities. We find that while a fully-hardware implementation results in average performance improvements of 8-46% for a range of TLB sizes, a hardware/software approach yields improvements of 4-32%. Overall, our work shows that TLB prefetchers exploiting inter-core correlations can effectively eliminate TLB misses.
KW - Parallelism
KW - Prefetching
KW - Translation lookaside buffer
UR - http://www.scopus.com/inward/record.url?scp=77952252973&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952252973&partnerID=8YFLogxK
U2 - 10.1145/1736020.1736060
DO - 10.1145/1736020.1736060
M3 - Conference contribution
AN - SCOPUS:77952252973
SN - 9781605588391
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 359
EP - 370
BT - ASPLOS XV - 15th International Conference on Architectural Support for Programming Languages and Operating Systems
T2 - 15th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV
Y2 - 13 March 2010 through 17 March 2010
ER -