TY - JOUR
T1 - Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis
AU - Huang, Chao
AU - Ravi, Srivaths
AU - Raghunathan, Anand
AU - Jha, Niraj K.
N1 - Funding Information:
Manuscript received July 25, 2005; revised August 15, 2006. This work was supported in part by NJCST Center for Embedded System-On-A-Chip Design and by NEC Laboratories America. C. Huang is with the Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061 USA. S. Ravi is with the DSPS Design Team, Texas Instruments, Bangalore 560 093, India. A. Raghunathan is with NEC Laboratories America, Princeton, NJ 08540 USA. N. K. Jha is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA. Digital Object Identifier 10.1109/TVLSI.2007.904096
PY - 2007/11
Y1 - 2007/11
N2 - Memory-intensive applications present unique challenges to an application-specific integrated circuit (ASIC) designer in terms of the choice of memory organization, memory size requirements, bandwidth and access latencies, etc. The high potential of single-chip distributed logic-memory architectures in addressing many of these issues has been recognized in general-purpose computing, and more recently, in ASIC design. The high-level synthesis (HLS) techniques presented in this paper are motivated by the fact that many memory-intensive applications exhibit irregular array data access patterns. Synthesis should, therefore, be capable of determining a partitioned architecture, wherein array data and computations may have to be heterogeneously distributed for achieving the best performance speed-up. We use a combination of clustering and min-cut style partitioning techniques to yield distributed architectures, based on simulation profiling while considering various factors including data access locality, balanced workloads, inter-partition communication, etc. Our experiments with several benchmark applications show that the proposed techniques yielded two-way partitioned architectures that can achieve up to 2.1× (average of 1.9×) performance speed-up over conventional HLS solutions, while achieving up to 1.5× (average of 1.4×) performance speed-up over the best homogeneous partitioning solution feasible. At the same time, the reduction in the energy-delay product over conventional single-memory designs is up to 2.7× (average of 2.0×). A larger amount of partitioning makes further system performance improvement achievable at the cost of chip area.
AB - Memory-intensive applications present unique challenges to an application-specific integrated circuit (ASIC) designer in terms of the choice of memory organization, memory size requirements, bandwidth and access latencies, etc. The high potential of single-chip distributed logic-memory architectures in addressing many of these issues has been recognized in general-purpose computing, and more recently, in ASIC design. The high-level synthesis (HLS) techniques presented in this paper are motivated by the fact that many memory-intensive applications exhibit irregular array data access patterns. Synthesis should, therefore, be capable of determining a partitioned architecture, wherein array data and computations may have to be heterogeneously distributed for achieving the best performance speed-up. We use a combination of clustering and min-cut style partitioning techniques to yield distributed architectures, based on simulation profiling while considering various factors including data access locality, balanced workloads, inter-partition communication, etc. Our experiments with several benchmark applications show that the proposed techniques yielded two-way partitioned architectures that can achieve up to 2.1× (average of 1.9×) performance speed-up over conventional HLS solutions, while achieving up to 1.5× (average of 1.4×) performance speed-up over the best homogeneous partitioning solution feasible. At the same time, the reduction in the energy-delay product over conventional single-memory designs is up to 2.7× (average of 2.0×). A larger amount of partitioning makes further system performance improvement achievable at the cost of chip area.
KW - Application-specific integrated circuit (ASIC)
KW - Behavioral synthesis
KW - High-level synthesis
KW - Memory-intensive application
KW - Partitioning
UR - http://www.scopus.com/inward/record.url?scp=35448990090&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35448990090&partnerID=8YFLogxK
U2 - 10.1109/TVLSI.2007.904096
DO - 10.1109/TVLSI.2007.904096
M3 - Article
AN - SCOPUS:35448990090
SN - 1063-8210
VL - 15
SP - 1191
EP - 1203
JO - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
JF - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
IS - 11
ER -