TY - GEN
T1 - Remote store programming
T2 - 5th International Conference on High Performance Embedded Architectures and Compilers, HiPEAC 2010
AU - Hoffmann, Henry
AU - Wentzlaff, David
AU - Agarwal, Anant
PY - 2010
Y1 - 2010
AB - This paper presents remote store programming (RSP), a programming paradigm which combines usability and efficiency through the exploitation of a simple hardware mechanism, the remote store, which can easily be added to existing multicores. The RSP model and its hardware implementation trade a relatively high store latency for a low load latency because loads are more common than stores, and it is easier to tolerate store latency than load latency. This paper demonstrates the performance advantages of remote store programming by comparing it to cache-coherent shared memory (CCSM) for several important embedded benchmarks using the TILEPro64 processor. RSP is shown to be faster than CCSM for all eight benchmarks using 64 cores. For five of the eight benchmarks, RSP is shown to be more than 1.5× faster than CCSM. For a 2D FFT implemented on 64 cores, RSP is over 3× faster than CCSM. RSP's features, performance, and hardware simplicity make it well suited to the embedded processing domain.
UR - http://www.scopus.com/inward/record.url?scp=77949650516&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77949650516&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-11515-8_3
DO - 10.1007/978-3-642-11515-8_3
M3 - Conference contribution
AN - SCOPUS:77949650516
SN - 3642115144
SN - 9783642115141
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 3
EP - 17
BT - High Performance Embedded Architectures and Compilers - 5th International Conference, HiPEAC 2010, Proceedings
Y2 - 25 January 2010 through 27 January 2010
ER -