TY - GEN
T1 - Skia
T2 - 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
AU - Pepi, Chrysanthos
AU - Godala, Bhargav Reddy
AU - Tibrewala, Krishnam
AU - Chacon, Gino A.
AU - Gratz, Paul V.
AU - Jiménez, Daniel A.
AU - Pokam, Gilles A.
AU - August, David I.
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/3/30
Y1 - 2025/3/30
N2 - Modern processors implement a decoupled front-end, often using a form of Fetch Directed Instruction Prefetching (FDIP), to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1-I cache). As contemporary data center applications become more complex, their code footprints also grow, resulting in a high number of Branch Target Buffer (BTB) misses. These BTB missing branches typically have previously been decoded and placed in the BTB, but have since been evicted, leading to BTB misses now. FDIP can alleviate L1-I cache misses, but its reliance on the BPU's tracking structures means that when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1-I cache. We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched. Nevertheless, these missing branches have not yet been decoded and inserted into the BTB. This is because the instruction line is decoded from an entry point (which is the target of the previous taken branch) till an exit point (taken branch). We call branch instructions present in the ignored portion of the cache line ''Shadow Branches.'' Here we present Skia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting them into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss. With a minimal storage state of 12.25KB, Skia delivers a geomean speedup of ∼5.7% over an 8K-entry BTB (78KB) and ∼2% versus adding an equal amount of state to the BTB, across 16 front-end bound applications. Since many branches stored in the SBB are distinct compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skia across all examined sizes until saturation.
AB - Modern processors implement a decoupled front-end, often using a form of Fetch Directed Instruction Prefetching (FDIP), to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1-I cache). As contemporary data center applications become more complex, their code footprints also grow, resulting in a high number of Branch Target Buffer (BTB) misses. These BTB missing branches typically have previously been decoded and placed in the BTB, but have since been evicted, leading to BTB misses now. FDIP can alleviate L1-I cache misses, but its reliance on the BPU's tracking structures means that when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1-I cache. We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched. Nevertheless, these missing branches have not yet been decoded and inserted into the BTB. This is because the instruction line is decoded from an entry point (which is the target of the previous taken branch) till an exit point (taken branch). We call branch instructions present in the ignored portion of the cache line ''Shadow Branches.'' Here we present Skia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting them into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss. With a minimal storage state of 12.25KB, Skia delivers a geomean speedup of ∼5.7% over an 8K-entry BTB (78KB) and ∼2% versus adding an equal amount of state to the BTB, across 16 front-end bound applications. Since many branches stored in the SBB are distinct compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skia across all examined sizes until saturation.
UR - http://www.scopus.com/inward/record.url?scp=105002559459&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105002559459&partnerID=8YFLogxK
U2 - 10.1145/3676641.3716273
DO - 10.1145/3676641.3716273
M3 - Conference contribution
AN - SCOPUS:105002559459
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 1091
EP - 1106
BT - ASPLOS 2025 - Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
PB - Association for Computing Machinery
Y2 - 30 March 2025 through 3 April 2025
ER -