
# Memory hierarchy

So far we have used the term memory when referring to the volatile storage attached to the CPU, which is shared between the CPU cores within a supercomputer node. In practice, however, there is a hierarchy of different levels of memory. Sometimes the memory of other nodes (which cannot be accessed directly) is also included as a part of the hierarchy.

When a CPU core calculates 2.1 + 4.3, it first fetches the two numbers from the main memory into registers in the CPU core. The result also appears first in a register, and the value is then pushed from the register back to the main memory. As we have discussed earlier, the speed of accessing memory can be a performance bottleneck. In order to alleviate that, modern CPUs (both in desktops and in supercomputers) have memory caches.
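As a minimal sketch in C (the exact instruction sequence depends on the compiler, optimization level, and architecture), the addition above conceptually compiles to two loads, a register-to-register add, and a store:

```c
#include <stdio.h>

int main(void) {
    double a = 2.1, b = 4.3;
    /* Conceptually: load a and b from main memory into registers,
       add them within the registers, then store the result back. */
    double c = a + b;
    printf("%f\n", c);
    return 0;
}
```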

## Memory cache

A memory cache is essentially a small amount of very fast scratch memory close to the CPU core. The basic idea is that when a CPU core needs to load something from the main memory, it first looks in the cache. If the data is already in the cache, it can be fetched from there into a register much faster than from the main memory. Likewise, when the result of a computation is stored, it is first put from the register into the cache, and copied to the main memory only later on (e.g. when the cache becomes full).

Think again of the analogy with workers in an office with a whiteboard. The whiteboard is the main memory, and the workers are now doing their computations at their desks. Every time a worker reads from or writes to the whiteboard, they need to leave their desk. Imagine now that each worker has a small notebook: when a worker needs to read data from the whiteboard, they fill their notebook with everything they need and can then work happily at their desk for a longer period of time. Especially when doing many computations with the same data, the notebook/cache can speed up the overall memory access a lot.
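We can sketch this effect with a small microbenchmark (illustrative only; the array sizes below assume an L1 cache of roughly 32 KiB, and the actual timings depend heavily on the CPU and compiler flags). Both runs perform the same number of additions, but one repeatedly sums a small, cache-resident array while the other streams through a large array living in the main memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum over the array n elements long, `passes` times in a row. */
static double sum_repeatedly(const double *data, size_t n, size_t passes) {
    double s = 0.0;
    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            s += data[i];
    return s;
}

int main(void) {
    size_t small_n = 1 << 12;     /* 4 Ki doubles = 32 KiB: fits in L1 cache */
    size_t large_n = 1 << 24;     /* 16 Mi doubles = 128 MiB: main memory    */
    size_t total = large_n * 4;   /* same number of element accesses in both */

    double *small = malloc(small_n * sizeof *small);
    double *large = malloc(large_n * sizeof *large);
    for (size_t i = 0; i < small_n; i++) small[i] = 1.0;
    for (size_t i = 0; i < large_n; i++) large[i] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_repeatedly(small, small_n, total / small_n);
    clock_t t1 = clock();
    double s2 = sum_repeatedly(large, large_n, total / large_n);
    clock_t t2 = clock();

    printf("small array: %.2f s (sum %.0f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("large array: %.2f s (sum %.0f)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);

    free(small);
    free(large);
    return 0;
}
```

On a typical CPU the cache-resident loop often finishes several times faster per element, even though the arithmetic is identical.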

## Writing data

Caching also works very well for multiple workers as long as they only ever read data. Unfortunately, real programs also write data, i.e. workers will want to modify the data on the whiteboard. If two people are working on the same data at the same time, we have a problem: if one worker changes some numbers in their notebook, then the other workers need to be informed about it. The compromise solution is to let everyone know whenever you modify any results in your notebook. Whenever you alter a number, you have to shout out:

"I’ve just changed the entry for the 231st salary - if you have a copy of it then you’ll need to get the new value from me!"

Although this is OK for a small number of workers, it clearly has problems when there are lots of workers. Imagine 100 workers: whenever you change a number you have to let 99 other people know about it, which wastes time. Even worse, you have to be continually listening for updates from 99 other workers instead of concentrating on doing your own calculations.

This is the fundamental dilemma: memory access is so slow that we need small and fast caches to access data as fast as we can process it. However, whenever we write data there is an overhead which grows with the number of CPU cores and will eventually make everything slow down again.

Keeping the data consistent and up to date on all the CPU cores is called cache coherency. It means that we always have up-to-date values in our notebook (or, at the very least, that we know when our notebook is out of date and we must return to the whiteboard). Ensuring cache coherency is the major obstacle to building very large multicore processors.
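The cost of this coherency traffic can be demonstrated with a sketch of so-called false sharing (an illustrative example, not from the original text; it assumes OpenMP, a typical 64-byte cache line, and compilation with e.g. `gcc -fopenmp`). Two threads increment counters that sit on the same cache line, so the coherency protocol bounces the line between the cores' caches; padding each counter onto its own line removes the traffic:

```c
#include <stdio.h>
#include <omp.h>

#define ITERS 100000000L

/* Pad each counter to a full (assumed) 64-byte cache line. */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };

int main(void) {
    volatile long shared_line[2] = {0, 0};  /* typically on the same line */
    struct padded separate[2] = {{0}, {0}}; /* one line per counter       */

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            shared_line[id]++;              /* line bounces between caches */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            separate[id].value++;           /* no coherency traffic        */
    }
    double t2 = omp_get_wtime();

    printf("same cache line:      %.2f s\n", t1 - t0);
    printf("separate cache lines: %.2f s\n", t2 - t1);
    return 0;
}
```

Note that the two threads never touch the same variable; the slowdown in the first loop comes purely from the coherency protocol operating on whole cache lines.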

*Figure: schematic view of a memory cache.*

Moving data into and out of the caches is handled by the hardware, and programmers cannot directly control it. However, the way that data and computations are organized in the program code can affect how efficiently the caches are utilized. For example, if we are simulating particles in three dimensions, the coordinates of the particles can be stored either in three lists of N numbers (each list containing the x, y, or z coordinate of all N particles) or in N lists of three numbers (each list containing the three coordinates of a single particle). Depending on the computations done with the coordinates, the two data layouts can behave quite differently with respect to the caches. For example, if one first performs computations only on the x coordinates, using three lists of N numbers may be more efficient.
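A sketch of the two layouts in C (the names `particles_soa` and `particle` are illustrative, not from the original text): in the structure-of-arrays layout a loop over only the x coordinates streams through contiguous memory, while in the array-of-structures layout the same loop drags the unused y and z values through the cache as well.

```c
#include <stdio.h>
#include <stddef.h>

#define N 1000000

/* "Three lists of N numbers" (structure of arrays): all x coordinates
   are contiguous, so a loop touching only x uses every byte of every
   cache line it loads. */
struct particles_soa { double x[N], y[N], z[N]; };

/* "N lists of three numbers" (array of structures): each particle's
   x, y, and z are adjacent, so a loop touching only x still pulls the
   unused y and z of every particle into the cache. */
struct particle { double x, y, z; };

static struct particles_soa soa;   /* static: too large for the stack */
static struct particle aos[N];

double shift_x_soa(struct particles_soa *p, double dx) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++) {
        p->x[i] += dx;             /* unit stride: cache friendly */
        sum += p->x[i];
    }
    return sum;
}

double shift_x_aos(struct particle *p, double dx) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++) {
        p[i].x += dx;              /* 24-byte stride: 2/3 of each line unused */
        sum += p[i].x;
    }
    return sum;
}

int main(void) {
    printf("%f %f\n", shift_x_soa(&soa, 1.0), shift_x_aos(aos, 1.0));
    return 0;
}
```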

## Memory hierarchy pyramid

In order to further improve the memory access speed, most modern CPUs have not only one, but multiple levels of cache. Together, the registers, different caches, and the main memory constitute a memory hierarchy pyramid.

*Figure: memory hierarchy in a supercomputer.*

The memory levels in the pyramid have the following characteristics:

- Physical location: the higher a memory type is in the pyramid, the closer it is physically to the CPU core.
- Performance: the transfer speed (MB/s) increases as we move from the bottom of the pyramid towards the top.
- Access time: the latency, i.e. the time between issuing a read/write request and the data arriving, decreases as we move up in the hierarchy.
- Capacity: the amount of information that the memory type can store increases towards the bottom.
- Cost per byte: the higher in the pyramid, the more costly the memory is; for example, the main memory is more expensive per byte than disk.

Types of memory:

- Registers are the fastest type of memory. All the arithmetic operations on our data are done in registers. CPUs have general-purpose registers, as well as registers for specific types of data or for specific operations. Physically, registers are a part of the CPU core.
- L1, L2, and L3 caches are intermediate caches between the main memory and the registers. The L1 cache is the smallest and fastest, while L3 is the largest and slowest. When the CPU core needs to fetch data, it looks first in L1, then L2, and finally L3, before fetching from the main memory. Often each core has its own L1 and L2 caches, while the L3 cache may be shared between several cores. As an example, the AMD Rome 64-core processors in CSC's Mahti supercomputer have the following characteristics (a sketch after this list shows how to query these sizes programmatically on Linux):
  - L1 cache: 32 KiB (+ 32 KiB for instructions), private to the core.
  - L2 cache: 512 KiB, private to the core.
  - L3 cache: 16 MiB, shared between 4 cores.
- Main memory is the memory within a node, where all instructions and data of active programs reside.
- Remote memory is the main memory in another node. Accessing remote memory requires communication via the interconnect.
- Disks can store data after the program has ended or the computer has shut down, unlike all the other types of memory discussed here. As accessing a disk is slow, normally the disk is used only at the start of a program (to load the data) and at the end of the program (to save the results). Sometimes a small amount of logging data is written during the run, and in order to recover from unexpected crashes (due to either hardware or software), checkpoint data might also be written periodically during the runtime.
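As a best-effort sketch of how to inspect the cache sizes of the machine you are running on: the `_SC_LEVEL*_CACHE*` names below are a glibc extension (Linux-only), and `sysconf` may return 0 or -1 for levels the system does not report.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: query the cache geometry of the current CPU.
       A result of 0 means the level was not reported by the system. */
    printf("L1d cache:  %ld KiB\n",  sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    printf("L2 cache:   %ld KiB\n",  sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
    printf("L3 cache:   %ld KiB\n",  sysconf(_SC_LEVEL3_CACHE_SIZE) / 1024);
    printf("cache line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}
```

On Linux the same information is also available from `lscpu` or `getconf -a | grep CACHE`.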