TY - GEN
T1 - Coherence domain restriction on large scale systems
AU - Fu, Yaosheng
AU - Nguyen, Tri M.
AU - Wentzlaff, David
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/12/5
Y1 - 2015/12/5
N2 - Designing massive scale cache coherence systems has been an elusive goal. Whether it be on large-scale GPUs, future thousand-core chips, or across million-core warehouse scale computers, having shared memory, even to a limited extent, improves programmability. This work sidesteps the traditional challenges of creating massively scalable cache coherence by restricting coherence to flexible subsets (domains) of a system's total cores and home nodes. This paper proposes Coherence Domain Restriction (CDR), a novel coherence framework that enables the creation of thousand to million core systems that use shared memory while maintaining low storage and energy overhead. Inspired by the observation that the majority of cache lines are only shared by a subset of cores either due to limited application parallelism or limited page sharing, CDR restricts the coherence domain from global cache coherence to VM-level, application-level, or page-level. We explore two types of restriction, one which limits the total number of sharers that can access a coherence domain and one which limits the number and location of home nodes that partake in a coherence domain. Each independent coherence domain only tracks the cores in its domain instead of the whole system, thereby removing the need for a coherence scheme built on top of CDR to scale. Sharer Restriction achieves constant storage overhead as core count increases while Home Restriction provides localized communication enabling higher performance. Unlike previous systems, CDR is flexible and does not restrict the location of the home nodes or sharers within a domain. We evaluate CDR in the context of a 1024-core chip and in the novel application of shared memory to a 1,000,000-core warehouse scale computer. Sharer Restriction results in significant area savings, while Home Restriction in the 1024-core chip and 1,000,000-core system increases performance by 29% and 23.04x respectively when comparing with global home placement. We implemented the entire CDR framework in a 25-core processor taped out in IBM's 32nm SOI process and present a detailed area characterization.
AB - Designing massive scale cache coherence systems has been an elusive goal. Whether it be on large-scale GPUs, future thousand-core chips, or across million-core warehouse scale computers, having shared memory, even to a limited extent, improves programmability. This work sidesteps the traditional challenges of creating massively scalable cache coherence by restricting coherence to flexible subsets (domains) of a system's total cores and home nodes. This paper proposes Coherence Domain Restriction (CDR), a novel coherence framework that enables the creation of thousand to million core systems that use shared memory while maintaining low storage and energy overhead. Inspired by the observation that the majority of cache lines are only shared by a subset of cores either due to limited application parallelism or limited page sharing, CDR restricts the coherence domain from global cache coherence to VM-level, application-level, or page-level. We explore two types of restriction, one which limits the total number of sharers that can access a coherence domain and one which limits the number and location of home nodes that partake in a coherence domain. Each independent coherence domain only tracks the cores in its domain instead of the whole system, thereby removing the need for a coherence scheme built on top of CDR to scale. Sharer Restriction achieves constant storage overhead as core count increases while Home Restriction provides localized communication enabling higher performance. Unlike previous systems, CDR is flexible and does not restrict the location of the home nodes or sharers within a domain. We evaluate CDR in the context of a 1024-core chip and in the novel application of shared memory to a 1,000,000-core warehouse scale computer. Sharer Restriction results in significant area savings, while Home Restriction in the 1024-core chip and 1,000,000-core system increases performance by 29% and 23.04x respectively when comparing with global home placement. We implemented the entire CDR framework in a 25-core processor taped out in IBM's 32nm SOI process and present a detailed area characterization.
KW - cache coherence
KW - home placement
KW - shared memory
UR - http://www.scopus.com/inward/record.url?scp=84959886551&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959886551&partnerID=8YFLogxK
U2 - 10.1145/2830772.2830832
DO - 10.1145/2830772.2830832
M3 - Conference contribution
AN - SCOPUS:84959886551
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 686
EP - 698
BT - Proceedings - 48th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2015
PB - IEEE Computer Society
T2 - 48th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2015
Y2 - 5 December 2015 through 9 December 2015
ER -