TY - GEN
T1 - Eliminating memory bottlenecks for a JPEG encoder through distributed logic-memory architecture and computation-unit integrated memory
AU - Huang, Chao
AU - Ravi, Srivaths
AU - Raghunathan, Anand
AU - Jha, Niraj K.
PY - 2005
Y1 - 2005
N2 - Several application domains, including multimedia and network processing, are highly memory intensive, making memory a bottleneck to designing higher-performance and lower-power application-specific integrated circuits (ASICs). Design methodologies based on innovative architectures, namely distributed logic-memory architectures and computation-unit integrated memories, have been shown to improve circuit performance significantly. In this paper, these design methodologies are discussed and evaluated through the implementation of an ASIC for the JPEG still image compression standard. The implemented system is capable of stand-alone image compression and has been synthesized using the TSMC 0.13 μm, 1.20 V, eight-layer metal CMOS process. A four-way distributed implementation can achieve an execution time of 2.23 ms (a speed-up of 2.87×) for a 128 × 128 input image at the cost of a chip area overhead of 51.4%, while the energy-delay product is reduced by 2.35×. Design metrics of various other implementations are also compared.
AB - Several application domains, including multimedia and network processing, are highly memory intensive, making memory a bottleneck to designing higher-performance and lower-power application-specific integrated circuits (ASICs). Design methodologies based on innovative architectures, namely distributed logic-memory architectures and computation-unit integrated memories, have been shown to improve circuit performance significantly. In this paper, these design methodologies are discussed and evaluated through the implementation of an ASIC for the JPEG still image compression standard. The implemented system is capable of stand-alone image compression and has been synthesized using the TSMC 0.13 μm, 1.20 V, eight-layer metal CMOS process. A four-way distributed implementation can achieve an execution time of 2.23 ms (a speed-up of 2.87×) for a 128 × 128 input image at the cost of a chip area overhead of 51.4%, while the energy-delay product is reduced by 2.35×. Design metrics of various other implementations are also compared.
UR - http://www.scopus.com/inward/record.url?scp=33847095113&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33847095113&partnerID=8YFLogxK
U2 - 10.1109/CICC.2005.1568651
DO - 10.1109/CICC.2005.1568651
M3 - Conference contribution
AN - SCOPUS:33847095113
SN - 0780390237
SN - 9780780390232
T3 - Proceedings of the Custom Integrated Circuits Conference
SP - 239
EP - 242
BT - Proceedings of the IEEE 2005 Custom Integrated Circuits Conference
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE 2005 Custom Integrated Circuits Conference
Y2 - 18 September 2005 through 21 September 2005
ER -