Abstract
Loop tiling is an effective optimizing transformation to reduce the memory access cost of a program, especially for dense matrix computations. However, the success of loop tiling is heavily dependent on the appropriate selection of tile shapes and sizes. In this paper we examine several existing tile selection algorithms in a unified framework, and quantify their performance improvements for three dense matrix computation kernels and three target architectures. In addition, a new tiling algorithm is discussed that was inspired by the observed behavior of previous algorithms. Four different quality metrics are introduced to measure the performance improvements of the algorithms over untitled versions of the three program kernels. The experiments showed that tile selection algorithms can be either very similar in performance, or significantly different depending on the chosen performance metric. For the measured test cases, our new selection algorithm had a better overall performance across the different performance metrics.