TY - GEN
T1 - Integrating performance monitoring and communication in parallel computers
AU - Martonosi, Margaret
AU - Ofelt, David
AU - Heinrich, Mark
N1 - Publisher Copyright:
© 1996 ACM.
PY - 1996/5/15
Y1 - 1996/5/15
N2 - A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior.Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure, memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
AB - A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior.Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure, memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
UR - http://www.scopus.com/inward/record.url?scp=0039771960&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0039771960&partnerID=8YFLogxK
U2 - 10.1145/233013.233035
DO - 10.1145/233013.233035
M3 - Conference contribution
AN - SCOPUS:0039771960
T3 - SIGMETRICS 1996 - Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
SP - 138
EP - 147
BT - SIGMETRICS 1996 - Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
PB - Association for Computing Machinery, Inc
T2 - 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1996
Y2 - 23 May 1996 through 26 May 1996
ER -