-
Code/data locality (compactness, % of each cache line that gets used)
-
Predictable access patterns: pre-fetch (instructions and data) friendly. This explains branching costs, why linear transversal might be faster than trees at smaller scales because of pointer chasing, why bubble sort is the fastest if the chunks fit in the cache.
-
Avoid false sharing: shared cache line unnecessarily with other threads/cores (due to how the data is packed) might have cache invalidating each other when anyone writes.