TY - JOUR
T1 - Generation of distributed logic-memory architectures through high-level synthesis
AU - Huang, Chao
AU - Ravi, Srivaths
AU - Raghunathan, Anand
AU - Jha, Niraj K.
N1 - Funding Information:
Manuscript received October 4, 2002; revised June 2, 2003 and June 7, 2004. This work was supported by Defense Advanced Research Projects Agency (DARPA) under Contract DAAB07-00-C-L516. This paper was recommended by Associate Editor M. F. Jacome.
PY - 2005/11
Y1 - 2005/11
N2 - With the increasing cost of on-chip global communication, high-performance designs for data-intensive applications require architectures that distribute hardware resources (computing logic, memories, interconnect, etc.) throughout the chip, while restricting computations and communications to geographic proximities. In this paper, we present a methodology for high-level synthesis (HLS) of distributed logic-memory architectures, i.e., architectures that have logic and memory distributed across several partitions in a chip. Conventional HLS tools are capable of extracting parallelism from a behavior for architectures that assume a monolithic controller/datapath communicating with a memory or memory hierarchy. This paper provides techniques to extend the synthesis frontier to more general architectures that can extract both coarse- and fine-grained parallelism from data accesses and computations in a synergistic manner. Our methodology selects many possible ways of organizing data and computations, carefully examines the tradeoffs (i.e., communication overheads, synchronization costs, area overheads) in choosing one solution over another, and utilizes conventional HLS techniques for intermediate steps. We have evaluated the proposed framework on several benchmarks by generating register-transfer level (RTL) implementations using an existing commercial HLS tool with and without our enhancements, and by subjecting the resulting RTL circuits to logic synthesis and layout. The results show that circuits designed as distributed logic-memory architectures using our framework achieve significant (up to 5.3 ×, average of 3.5 ×) performance improvements over well-optimized conventional designs with small area overheads (up to 19.3%, 15.1% on average). At the same time, the reduction in the energy-delay product is by an average of 5.9 × (up to 11.0 ×).
AB - With the increasing cost of on-chip global communication, high-performance designs for data-intensive applications require architectures that distribute hardware resources (computing logic, memories, interconnect, etc.) throughout the chip, while restricting computations and communications to geographic proximities. In this paper, we present a methodology for high-level synthesis (HLS) of distributed logic-memory architectures, i.e., architectures that have logic and memory distributed across several partitions in a chip. Conventional HLS tools are capable of extracting parallelism from a behavior for architectures that assume a monolithic controller/datapath communicating with a memory or memory hierarchy. This paper provides techniques to extend the synthesis frontier to more general architectures that can extract both coarse- and fine-grained parallelism from data accesses and computations in a synergistic manner. Our methodology selects many possible ways of organizing data and computations, carefully examines the tradeoffs (i.e., communication overheads, synchronization costs, area overheads) in choosing one solution over another, and utilizes conventional HLS techniques for intermediate steps. We have evaluated the proposed framework on several benchmarks by generating register-transfer level (RTL) implementations using an existing commercial HLS tool with and without our enhancements, and by subjecting the resulting RTL circuits to logic synthesis and layout. The results show that circuits designed as distributed logic-memory architectures using our framework achieve significant (up to 5.3 ×, average of 3.5 ×) performance improvements over well-optimized conventional designs with small area overheads (up to 19.3%, 15.1% on average). At the same time, the reduction in the energy-delay product is by an average of 5.9 × (up to 11.0 ×).
KW - Distributed architectures
KW - High-level synthesis
KW - Logic-memory architectures
KW - Scheduling
UR - http://www.scopus.com/inward/record.url?scp=27744551146&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=27744551146&partnerID=8YFLogxK
U2 - 10.1109/TCAD.2005.852276
DO - 10.1109/TCAD.2005.852276
M3 - Article
AN - SCOPUS:27744551146
SN - 0278-0070
VL - 24
SP - 1694
EP - 1710
JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IS - 11
ER -