This paper shows that the performance bottleneck in software MPEG-2 video decoders has shifted to memory operations, as microprocessor technology has improved rapidly over the past few years. We exploit concurrency between the processor and the memory subsystem at the macroblock level to alleviate this bottleneck. First, the paper introduces an interleaved-block-order data layout to improve cache performance. Second, it describes an algorithm that explicitly prefetches macroblocks for motion compensation. Finally, it presents an algorithm that schedules interleaved decoding and output at the macroblock level. Our implementation and experiments show that these methods successfully hide the latency of both memory and the frame buffer. Together, these techniques improve the performance of an already optimized software MPEG-2 decoder by about a factor of two. On a 933 MHz Pentium III PC, the decoder can play 720p HDTV streams at over 62 frames per second.
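To give a rough sense of why a block-oriented frame layout can improve cache behavior, the following Python sketch counts how many cache lines one 16x16 luma macroblock touches under a conventional planar (row-major) layout and under a macroblock-contiguous layout. It is only an illustration under stated assumptions, not the layout defined in this paper: the function names, the choice to store each macroblock's 256 luma bytes followed by its two 64-byte 4:2:0 chroma blocks, and the 32-byte cache line size are all assumptions made for the example.

```python
# Illustrative sketch only: all names and layout details are assumptions,
# not the interleaved-block order defined in the paper.

MB_SIZE = 16                       # a luma macroblock is 16x16 pixels
Y_BYTES = MB_SIZE * MB_SIZE        # 256 luma bytes per macroblock
C_BYTES = 8 * 8                    # 64 bytes per chroma block (4:2:0)
MB_BYTES = Y_BYTES + 2 * C_BYTES   # 384 bytes: Y, then Cb, then Cr (assumed order)


def planar_luma_offset(x, y, width):
    """Byte offset of luma pixel (x, y) in a row-major luma plane."""
    return y * width + x


def block_luma_offset(x, y, width):
    """Byte offset of luma pixel (x, y) when each macroblock is contiguous."""
    mbs_per_row = width // MB_SIZE
    mb_index = (y // MB_SIZE) * mbs_per_row + (x // MB_SIZE)
    return mb_index * MB_BYTES + (y % MB_SIZE) * MB_SIZE + (x % MB_SIZE)


def cache_lines_touched(layout, mb_x, mb_y, width, line_bytes=32):
    """Count distinct cache lines covering the luma of macroblock (mb_x, mb_y)."""
    lines = set()
    for dy in range(MB_SIZE):
        for dx in range(MB_SIZE):
            offset = layout(mb_x * MB_SIZE + dx, mb_y * MB_SIZE + dy, width)
            lines.add(offset // line_bytes)
    return len(lines)


if __name__ == "__main__":
    width = 1280  # 720p luma width
    print(cache_lines_touched(planar_luma_offset, 3, 2, width))
    print(cache_lines_touched(block_luma_offset, 3, 2, width))
```

Under these assumptions, each of the 16 rows of a macroblock falls in a different cache line in the planar layout, while the contiguous layout packs the same 256 bytes into 8 consecutive lines. A contiguous macroblock is also a natural unit for an explicit prefetch, since its address range is a single short run.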