How Much Access to RAM Costs

Every computer program uses main memory (RAM) to store data. RAM is considered fast storage, even compared to the fastest NVMe SSD or a 100 Gbit network. It may seem that RAM is fast enough and no optimizations are needed: just read or write any memory cell whenever the program needs it.

From the famous article What Every Programmer Should Know About Memory, I learned that memory access is trickier than I used to think.

Modern CPUs have a small amount of blazing-fast internal memory: the registers. There is no delay when accessing data in the registers; the functional units can operate on this data directly. Each CPU core has its own registers, so the access time does not depend on the other cores' load.

If the requested data is not in the CPU registers, it has to be loaded from RAM before operations can be executed on it. Accessing RAM is orders of magnitude slower than accessing the registers, so the CPU has to request the data and wait a significant amount of time before it is loaded into the registers. In addition, RAM is connected to the CPUs through a shared bus, so if one core actively accesses RAM, it can affect the access times of the other cores.

To reduce the average cost of accessing data in main memory, a mechanism called the CPU cache was introduced. It is a small amount of fast on-chip memory that stores copies of data from RAM. Access times to the cache are much shorter than access times to RAM. Most CPUs have a hierarchy of multiple cache levels (L1, L2, L3); each next level is bigger but slower.

Data is transferred from RAM and stored in the cache in blocks of fixed size called cache lines. When the CPU needs to read a location in memory, it first checks for a corresponding entry in the cache. If the requested data is not found there, a cache miss occurs. A cache miss is quite expensive because it requires not only loading new data from memory into the cache but also evicting one of the existing entries. That is why one of a software developer's crucial tasks is to organize the program's memory accesses in a way that minimizes the number of cache misses.
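
To illustrate how cache lines affect access patterns, here is a minimal sketch (my own illustration, not code from the post) that sums the same two-dimensional array row by row and column by column. The row-wise loop walks through consecutive elements of each cache line, while the column-wise loop jumps a whole row ahead on every step and misses the cache far more often:

#include <stddef.h>

#define N 2048

static double grid[N][N];

/* Row-wise traversal: consecutive accesses fall into the same cache line. */
double sum_by_rows(void) {
  double sum = 0.0;
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < N; j++)
      sum += grid[i][j];
  return sum;
}

/* Column-wise traversal: each access is N doubles away from the previous
   one, so almost every access touches a different cache line. */
double sum_by_columns(void) {
  double sum = 0.0;
  for (size_t j = 0; j < N; j++)
    for (size_t i = 0; i < N; i++)
      sum += grid[i][j];
  return sum;
}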

Consider a practical task. Matrix multiplication is one of the most important matrix operations. It is widely used in areas such as machine learning and 3D graphics. For huge matrices the computation can take a lot of time, so optimizing this operation can speed up the whole program significantly.

A description of the algorithm can easily be found on Wikipedia.
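
In short, for square matrices of size sz x sz, each element of the result is the dot product of a row of the first matrix and a column of the second:

res[i][j] = mul1[i][0] * mul2[0][j] + mul1[i][1] * mul2[1][j] + ... + mul1[i][sz - 1] * mul2[sz - 1][j]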

The naive Plain C implementation looks like this:

void matrix_multiply(size_t sz, double **mul1, double **mul2, double **res) {
  size_t i, j, k;

  for (i = 0; i < sz; i++) {
    for (j = 0; j < sz; j++) {
      for (k = 0; k < sz; k++) {
        res[i][j] += mul1[i][k] * mul2[k][j];
      }
    }
  }
}

The function receives two square matrices mul1 and mul2 of size sz x sz and calculates their product into res. It is expected that res is initialized with zeros.

Assume matrices of size 1000x1000. In this case, the CPU cache can't hold an entire matrix. In the inner loop, the elements of mul1 are accessed sequentially, so the CPU cache is used effectively. The elements of mul2, however, are accessed column by column, so each access lands a whole row away from the previous one. For each element of the result matrix, 1000 scattered elements have to be retrieved from RAM.

I got the following measurements on my PC:

matrix size    execution time (cycles)
100x100        733
300x300        21485
500x500        101610
1000x1000      843546
2000x2000      25787985
3000x3000      107763896
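
The benchmark harness itself is not shown here. As an illustration only, a simple way to time each function is the standard clock() call (the helper below and its name are my assumptions, not the measurement code used for the tables):

#include <stdio.h>
#include <time.h>

/* Hypothetical timing helper: run one multiplication and report the elapsed
   processor time. The actual benchmark used for the tables may differ. */
static void benchmark(const char *name,
                      void (*fn)(size_t, double **, double **, double **),
                      size_t sz, double **mul1, double **mul2, double **res) {
  clock_t start = clock();
  fn(sz, mul1, mul2, res);
  clock_t ticks = clock() - start;

  printf("%s %zux%zu: %ld ticks (%.3f s)\n",
         name, sz, sz, (long)ticks, (double)ticks / CLOCKS_PER_SEC);
}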

The first optimization is a transposition of mul2:

void matrix_multiply_t(size_t sz, double **mul1, double **mul2, double **res) {
  size_t i, j, k;

  double tmp;

  /* Transpose mul2 in place. */
  for (i = 0; i < sz; i++) {
    for (j = i; j < sz; j++) {
      tmp = mul2[i][j];
      mul2[i][j] = mul2[j][i];
      mul2[j][i] = tmp;
    }
  }

  /* Now both matrices are traversed row by row in the inner loop. */
  for (i = 0; i < sz; i++) {
    for (j = 0; j < sz; j++) {
      for (k = 0; k < sz; k++) {
        res[i][j] += mul1[i][k] * mul2[j][k];
      }
    }
  }
}

After transposing mul2, it is possible to access both arguments sequentially, so the CPU cache is used effectively. The function spends some time on the transposition itself, which is why the results can be similar to the naive implementation for small input matrices.

The function modifies the mul2 argument, so its original contents cannot be used afterwards. It is possible to make a copy of mul2 instead, but in that case the function requires additional memory, so the developer should decide which option is best for the particular task.
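
For example, a non-destructive variant could transpose mul2 into a scratch buffer instead (a minimal sketch; the function name and the flat scratch layout are my assumptions, not code from the repo):

#include <stdlib.h>

/* Hypothetical variant that keeps mul2 intact: transpose it into a
   temporary contiguous buffer and multiply against that copy. */
int matrix_multiply_t_copy(size_t sz, double **mul1, double **mul2, double **res) {
  double *t = malloc(sz * sz * sizeof(double));
  size_t i, j, k;

  if (t == NULL)
    return -1;

  /* t[j * sz + k] holds mul2[k][j], i.e. the transposed matrix. */
  for (k = 0; k < sz; k++)
    for (j = 0; j < sz; j++)
      t[j * sz + k] = mul2[k][j];

  for (i = 0; i < sz; i++)
    for (j = 0; j < sz; j++)
      for (k = 0; k < sz; k++)
        res[i][j] += mul1[i][k] * t[j * sz + k];

  free(t);
  return 0;
}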

The execution results:

matrix size    execution time (cycles)    relative execution time
100x100        719                        98.09%
300x300        20885                      97.21%
500x500        96832                      95.30%
1000x1000      788834                     93.51%
2000x2000      7268999                    28.19%
3000x3000      24179636                   22.44%

The second optimization is to access the data in a way that utilizes the CPU cache as much as possible, without transposing the matrix:

#define SM (CLS / sizeof(double))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void matrix_multiply_fast(size_t sz, double **mul1, double **mul2, double **res) {
  size_t i, i2, j, j2, k, k2;

  /* Process the matrices in SM x SM blocks that fit the cache better. */
  for (i = 0; i < sz; i += SM) {
    for (j = 0; j < sz; j += SM) {
      for (k = 0; k < sz; k += SM) {
        size_t i2_max = MIN(SM, sz - i);
        for (i2 = 0; i2 < i2_max; i2++) {
          size_t k2_max = MIN(SM, sz - k);
          for (k2 = 0; k2 < k2_max; k2++) {
            size_t j2_max = MIN(SM, sz - j);
            for (j2 = 0; j2 < j2_max; j2++) {
              res[i + i2][j + j2] += mul1[i + i2][k + k2] * mul2[k + k2][j + j2];
            }
          }
        }
      }
    }
  }
}

The algorithm performs the same calculations as the naive implementation, but in a different order. There are six nested loops. The outer loops iterate with a step of SM (the cache line size divided by sizeof(double)). This splits the multiplication into several smaller problems that can be handled with more cache locality. The inner loops iterate over the indexes missing from the outer loops. The k2 and j2 loops are in a different order because, in the actual computation, only one expression depends on k2 but two depend on j2.

The CLS macro is defined at compile time from the output of the getconf LEVEL1_DCACHE_LINESIZE command. This command returns the L1 data cache line size in bytes, which lets the inner loops utilize the CPU cache effectively.
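
For example, with a 64-byte cache line and 8-byte doubles, SM is 8, so the matrices are processed in 8x8 blocks. One way to pass the value to the compiler (the source file name here is only a placeholder) is:

gcc -O2 -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) matmul.c -o matmul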

The execution results:

matrix size    execution time (cycles)    relative execution time
100x100        606                        82.67%
300x300        15503                      72.16%
500x500        72558                      71.41%
1000x1000      600223                     71.15%
2000x2000      5203588                    20.18%
3000x3000      17501126                   16.24%

The most optimized version can be found in the article (p.50):

#define SM (CLS / sizeof(double))

void matrix_multiply_seq(size_t sz, double **mul1, double **mul2, double **res) {
  size_t i, i2, j, j2, k, k2;

  double *rres, *rmul1, *rmul2;

  for (i = 0; i < sz; i += SM) {
    for (j = 0; j < sz; j += SM) {
      for (k = 0; k < sz; k += SM) {
        size_t i2_max = MIN(SM, sz - i);
        /* rres and rmul1 point into the current rows of res and mul1;
           += sz advances them to the next row of the contiguous storage. */
        for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k]; i2 < i2_max; ++i2, rres += sz, rmul1 += sz) {
          size_t k2_max = MIN(SM, sz - k);
          for (k2 = 0, rmul2 = &mul2[k][j]; k2 < k2_max; ++k2, rmul2 += sz) {
            size_t j2_max = MIN(SM, sz - j);
            for (j2 = 0; j2 < j2_max; ++j2) {
              rres[j2] += rmul1[k2] * rmul2[j2];
            }
          }
        }
      }
    }
  }
}

Compared to matrix_multiply_fast, this function uses temporary pointer variables to hold intermediate addresses and advances them with += sz to iterate over the matrices. As a result, the compiler produces more compact machine code, so fewer instructions are executed on each iteration.

This function requires the matrix data to be stored contiguously in memory, so that mul1[i + 1] == mul1[i] + sz.
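
A minimal allocation sketch that satisfies this requirement (the helper name is mine, not taken from the repo) could look like this:

#include <stdlib.h>

/* Hypothetical helper: allocate a sz x sz matrix as one contiguous,
   zero-initialized block plus an array of row pointers into it,
   so that rows[i + 1] == rows[i] + sz holds for every row. */
double **alloc_matrix(size_t sz) {
  double *data = calloc(sz * sz, sizeof(double));
  double **rows = malloc(sz * sizeof(double *));
  size_t i;

  if (data == NULL || rows == NULL) {
    free(data);
    free(rows);
    return NULL;
  }

  for (i = 0; i < sz; i++)
    rows[i] = data + i * sz;  /* row i starts right after row i - 1 */

  return rows;
}

To release such a matrix, free rows[0] (the data block) first and then the rows array itself.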

The execution results:

matrix size    execution time (cycles)    relative execution time
100x100        459                        62.62%
300x300        11773                      54.80%
500x500        56441                      55.55%
1000x1000      480246                     56.93%
2000x2000      4164429                    16.15%
3000x3000      14001852                   12.99%

The results show that the execution time does not differ much for small matrices, but as the matrices grow, the difference becomes significant, which makes the optimizations worth implementing.

(Chart: execution time comparison of the algorithms)

I ran all the examples on a machine with an Intel Core i7-10700K processor and 3000 MHz DDR4 memory, running Ubuntu 20.04 with gcc 9.3.0.

I have uploaded the code I used to a GitHub repo, so anyone can run it and measure the execution time on different hardware.

In conclusion, writing code with cache locality in mind can increase performance dramatically, but sometimes it makes the program harder to read and understand. As Donald Knuth wrote, "premature optimization is the root of all evil", so apply such performance tuning only when it is really required.