Recently there has been a surge of interest in developing performance debugging tools to help programmers tune their applications for better memory performance [2, 4, 10]. These tools vary both in the detail of feedback provided to the user, and in the runtime overhead of using them. MemSpy  is a simulation-based tool which gives programmers detailed statistics on the memory system behavior of applications. It provides information on the frequency and causes of cache misses, and presents it in terms of source-level data and code objects with which the programmer is familiar. However, using MemSpy increases a program's execution time by roughly 10 to 40 fold. This overhead is generally acceptable for applications with execution times of several minutes or less, but it can be inconvenient when tuning applications with very long execution times. This paper examines the use of trace sampling techniques to reduce the execution time overhead of tools like MemSpy. When simulating one tenth of the references, we find that MemSpy's execution time overhead is improved by a factor of 4 to 6. That is, the execution time when using MemSpy is generally within a factor of 3 to 8 times the normal execution time. With this improved performance, we observe only small errors in the performance statistics reported by MemSpy. On moderate sized caches of 16KB to 128KB, simulating as few as one tenth of the references (in samples of 0.5M references each) allows us to estimate the program's actual cache miss rate with an absolute error no greater than 0.3% on our five benchmarks. These errors are quite tolerable within the context of performance debugging. With larger caches we can also obtain good accuracy by using longer sample lengths. We conclude that, used with care, trace sampling is a powerful technique that makes possible performance debugging tools which provide both detailed memory statistics and low execution time overheads.