Parallel-stage decoupled software pipelining

Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, David I. August

Research output: Chapter in Book/Report/Conference proceedingConference contribution

102 Scopus citations


In recent years, the microprocessor industry has embraced chip multiprocessors (CMPs), also known as multi-core architectures, as the dominant design paradigm. For existing and new applications to make effective use of CMPs, it is desirable that compilers automatically extract thread-level parallelism from single-threaded applications. DOALL is a popular automatic technique for loop-level parallelization employed successfully in the domains of scientific and numeric computing. While DOALL generally scales well with the number of iterations of the loop, its applicability is limited by the presence of loop-carried dependences. A parallelization technique with greater applicability is decoupled software pipelining (DSWP), which parallelizes loops even in the presence of loop-carried dependences. However, the scalability of DSWP is limited by the size of the loop body and the number of recurrences it contains, which are usually smaller than the loop iteration count. This work proposes a novel non-speculative compiler parallelization technique called parallel-stage decoupled software pipelining (PS-DSWP). The goal of PS-DSWP is to combine the applicability of DSWP with the scalability of DOALL parallelization. A key insight of PS-DSWP is that, after isolating the recurrences in their own stages in DSWP, portions of the loop suitable for DOALL parallelization may be exposed. PS-DSWP extends DSWP to benefit from these opportunities, utilizing multiple threads to execute the same stage of a DSWPed loop in parallel. This paper describes the PS-DSWP transformation in detail and discusses its implementation in a research compiler. PS-DSWP produces an average speedup of 114% (up to a maximum of 155%) with 6 threads on loops from a set of 5 applications. Our experiments also demonstrate that PS-DSWP achieves better scalability with the number of threads than DSWP.

Original languageEnglish (US)
Title of host publicationProceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization
Number of pages10
StatePublished - 2008

Publication series

NameProceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization

All Science Journal Classification (ASJC) codes

  • Software


  • Automatic parallelization
  • DSWP
  • Multi-core architectures
  • Pipelined parallelism
  • TLP


Dive into the research topics of 'Parallel-stage decoupled software pipelining'. Together they form a unique fingerprint.

Cite this