Caches
On this page, we provide details on the caches of some of the processors that we analyzed.
In particular, we provide information on the cache replacement policies, which are undocumented in the official manuals.
The results were obtained using the nanoBench Cache Analyzer, which is available on GitHub.
The repository also contains a simulator for all the policies described on this page.
Further information on the policies can be found in our paper nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems and in Andreas Abel's PhD thesis Automatic Generation of Models of Microarchitectures.
We also provide a set of graphs for all CPUs that show the latencies when accessing memory areas of different sizes.
We provide results for the following processors:
- Core i5-1035G1 (Ice Lake)
- L1 data cache
- Size: 48 kB
- Associativity: 12
- Number of sets: 64
- Way size: 4 kB
- Latency: 5 cycles Link
- Replacement policy: LRU3PLRU4 Link 1 Link 2
- This policy uses three PLRU trees with 4 elements each
- The trees are ordered in an LRU fashion
- Upon a cache miss, the line that the PLRU bits of the least-recently accessed tree point to is replaced
- L2 cache
- Size: 512 kB
- Associativity: 8
- Number of sets: 1024
- Way size: 64 kB
- Latency: 13 cycles Link
- Replacement policy: QLRU_H00_M1_R0_U1 Link
- Same as the Cannon Lake L2 policy (see below)
- L3 cache
- Size: 6 MB
- Associativity: 12
- Number of CBoxes: 4
- Number of slices: 8
- Number of sets (per slice): 1024
- Way size (per slice): 64 kB
- Latency: 41 cycles Link
- Replacement policy: Adaptive Link
- Sets 0, 33, 132, 165, 264, 297, 396, 429, 528, 561, 660, 693, 792, 825, 924, 957 (in all CBoxes): QLRU_H00_M1_R0_U1 Link
- The remaining sets can switch between the QLRU_H00_M1_R0_U1 policy and a variant of this policy that occasionally inserts new blocks with age 3 instead of age 1
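The LRU3PLRU4 description above can be illustrated with the following minimal Python sketch of a single 12-way set. It is not the simulator from the nanoBench repository. The class names, the PLRU bit convention (0 points to the left half), starting with empty (None) lines, and updating the tree order on every hit and miss are assumptions made here.

```python
class PLRU4:
    """4-way tree-PLRU: three bits that point toward the less recently used half."""

    def __init__(self):
        # bits[0]: root (0 -> ways 0/1, 1 -> ways 2/3)
        # bits[1]: ways 0/1 (0 -> way 0, 1 -> way 1)
        # bits[2]: ways 2/3 (0 -> way 2, 1 -> way 3)
        self.bits = [0, 0, 0]

    def victim(self):
        if self.bits[0] == 0:
            return 0 if self.bits[1] == 0 else 1
        return 2 if self.bits[2] == 0 else 3

    def touch(self, way):
        # Make the bits on the path to the accessed way point away from it.
        if way < 2:
            self.bits[0] = 1
            self.bits[1] = 1 if way == 0 else 0
        else:
            self.bits[0] = 0
            self.bits[2] = 1 if way == 2 else 0


class LRU3PLRU4Set:
    """One 12-way set made of three 4-way PLRU trees kept in LRU order."""

    def __init__(self):
        self.trees = [PLRU4() for _ in range(3)]
        self.lines = [[None] * 4 for _ in range(3)]  # blocks held by each tree
        self.order = [0, 1, 2]  # tree indices, least to most recently accessed

    def access(self, block):
        """Return True on a hit and False on a miss."""
        for t in range(3):
            if block in self.lines[t]:
                self._use(t, self.lines[t].index(block))
                return True
        t = self.order[0]  # least-recently accessed tree
        way = self.trees[t].victim()  # the line its PLRU bits point to
        self.lines[t][way] = block
        self._use(t, way)
        return False

    def _use(self, t, way):
        # Update the PLRU bits of the tree and move the tree to the MRU position.
        self.trees[t].touch(way)
        self.order.remove(t)
        self.order.append(t)
```

In this sketch, twelve distinct blocks fill all twelve lines, and a thirteenth block then replaces the first one, because tree 0 is the least recently accessed tree and its PLRU bits point to way 0.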
- Core i3-8121U (Cannon Lake)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 4
- Number of sets: 1024
- Way size: 64 kB
- Latency: 12 cycles Link
- Replacement policy: QLRU_H00_M1_R0_U1 Link
- Similar to the policy used in sets 512-575 of the L3 cache on Haswell (see below), but:
- A cache hit always sets the age to 0
- If, after an access, no line has age 3 anymore, only the ages of the other lines (i.e., all lines except the accessed one) are increased
- L3 cache
- Size: 4 MB
- Associativity: 16
- Number of CBoxes: 2
- Number of slices: 4
- Number of sets (per slice): 1024
- Way size (per slice): 64 kB
- Latency: 37 cycles Link
- Replacement policy: Adaptive Link
- Sets 0, 33, 132, 165, 264, 297, 396, 429, 528, 561, 660, 693, 792, 825, 924, 957 (in all CBoxes): QLRU_H11_M1_R0_U0 Link
- The remaining sets can switch between the QLRU_H11_M1_R0_U0 policy and a variant of this policy that occasionally inserts new blocks with age 3 instead of age 1 when the cache is not full
- Same as the Skylake L3 policy
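Tree-PLRU itself is not described on this page. As a point of reference, the following is a minimal Python sketch of the common binary-tree variant for an 8-way set, with "linear insertion order if empty" modeled as filling the leftmost empty way first. The class name and the bit convention are assumptions made here; the exact hardware implementation is not documented.

```python
class TreePLRUSet:
    """Tree-PLRU for a power-of-two associativity (e.g., the 8-way L1 caches above)."""

    def __init__(self, assoc=8):
        assert assoc >= 2 and assoc & (assoc - 1) == 0, "associativity must be a power of two"
        self.assoc = assoc
        self.bits = [0] * (assoc - 1)  # heap layout: children of node i are 2i+1 and 2i+2
        self.lines = [None] * assoc

    def _victim(self):
        node = 0
        while node < self.assoc - 1:
            node = 2 * node + 1 + self.bits[node]  # 0 -> left child, 1 -> right child
        return node - (self.assoc - 1)  # leaf index -> way

    def _touch(self, way):
        node = way + self.assoc - 1  # leaf index of the accessed way
        while node > 0:
            parent = (node - 1) // 2
            # Point the parent at the sibling subtree, away from the accessed way.
            self.bits[parent] = 1 if node == 2 * parent + 1 else 0
            node = parent

    def access(self, block):
        """Return True on a hit and False on a miss."""
        if block in self.lines:
            self._touch(self.lines.index(block))
            return True
        if None in self.lines:
            way = self.lines.index(None)  # linear insertion order while not full
        else:
            way = self._victim()
        self.lines[way] = block
        self._touch(way)
        return False
```

With this sketch, filling the set with blocks 0 to 7 and then accessing block 8 evicts block 0 (as LRU would), but the next miss evicts block 4 rather than block 1, which shows that tree-PLRU only approximates LRU.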
- Core i7-8700K (Coffee Lake)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 4
- Number of sets: 1024
- Way size: 64 kB
- Latency: 12 cycles Link
- Replacement policy: QLRU_H00_M1_R2_U1 Link
- Same as the Skylake L2 policy
- L3 cache
- Size: 12 MB
- Associativity: 16
- Number of CBoxes: 6
- Number of slices: 12
- Number of sets (per slice): 1024
- Way size (per slice): 64 kB
- Latency: 41 cycles Link
- Replacement policy: Adaptive Link
- Sets 0, 33, 132, 165, 264, 297, 396, 429, 528, 561, 660, 693, 792, 825, 924, 957 (in all CBoxes): QLRU_H11_M1_R0_U0 Link
- The remaining sets can switch between the QLRU_H11_M1_R0_U0 policy and a variant of this policy that occasionally inserts new blocks with age 3 instead of age 1 when the cache is not full
- Same as the Skylake L3 policy
- Core i7-7700 (Kaby Lake)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 4
- Number of sets: 1024
- Way size: 64 kB
- Latency: 12 cycles Link
- Replacement policy: QLRU_H00_M1_R2_U1 Link
- Same as the Skylake L2 policy
- L3 cache
- Size: 8 MB
- Associativity: 16
- Number of CBoxes: 4
- Number of slices: 8
- Number of sets (per slice): 1024
- Way size (per slice): 64 kB
- Latency: 41 cycles Link
- Replacement policy: Adaptive Link
- Sets 0, 33, 132, 165, 264, 297, 396, 429, 528, 561, 660, 693, 792, 825, 924, 957 (in all CBoxes): QLRU_H11_M1_R0_U0 Link
- The remaining sets can switch between the QLRU_H11_M1_R0_U0 policy and a variant of this policy that occasionally inserts new blocks with age 3 instead of age 1 when the cache is not full
- Same as the Skylake L3 policy
- Core i7-6500U (Skylake)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 4
- Number of sets: 1024
- Way size: 64 kB
- Latency: 12 cycles Link
- Replacement policy: QLRU_H00_M1_R2_U1 Link
- Similar to the Cannon Lake L2 policy, but:
- If the cache is empty (after executing the WBINVD instruction), blocks are inserted from right to left
- The initial ages of blocks inserted into an empty cache can depend on the previous state
- See also Vila et al.
- L3 cache
- Size: 4 MB
- Associativity: 16
- Number of CBoxes: 2
- Number of slices: 4
- Number of sets (per slice): 1024
- Way size (per slice): 64 kB
- Latency: 34 cycles Link
- Replacement policy: Adaptive Link
- Sets 0, 33, 132, 165, 264, 297, 396, 429, 528, 561, 660, 693, 792, 825, 924, 957 (in all CBoxes): QLRU_H11_M1_R0_U0 Link
- Same as the policy used in sets 512-575 on Haswell (see below)
- See also Vila et al.
- The remaining sets can switch between the QLRU_H11_M1_R0_U0 policy and a variant of this policy that occasionally inserts new blocks with age 3 instead of age 1 when the cache is not full
- Core i5-5200U (Broadwell)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (insertion order if empty depends on previous state) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 8
- Number of sets: 512
- Way size: 32 kB
- Latency: 12 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L3 cache
- Size: 3 MB
- Associativity: 12
- Number of CBoxes: 2
- Number of slices: 2
- Number of sets (per slice): 2048
- Way size (per slice): 128 kB
- Latency: 38 cycles Link
- Replacement policy: Set Dueling Link 1 Link 2 Link 3
- Same as the Haswell L3 policy
- Xeon E3-1225 v3 (Haswell)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (insertion order if empty depends on previous state) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 8
- Number of sets: 512
- Way size: 32 kB
- Latency: 12 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L3 cache
- Size: 8 MB
- Associativity: 16
- Number of CBoxes: 4
- Number of slices: 4
- Number of sets (per slice): 2048
- Way size (per slice): 128 kB
- Latency: 36 cycles Link
- Replacement policy: Set Dueling Link
- Sets 512-575 in CBox 0: QLRU_H11_M1_R0_U0 Link
- Two bits per cache line; they represent the age of the block
- A cache hit sets the age to 1 if it was 2 or 3 before, and to 0 otherwise
- Upon a cache miss, the leftmost line whose age is 3 is replaced, and the age is set to 1
- If after an access (hit or miss), there is no more line whose age is 3, the ages of all lines (including the accessed line) are increased by 1; this is repeated until there is at least one line whose age is 3
- If the cache is empty (after executing the WBINVD instruction), blocks are inserted from left to right
- Note that the description of the policy in this paper is incorrect. The policy from the paper corresponds to the QLRU_H21_M2_R0_U0_UMO variant in the linked table; the table (and the corresponding tables for the other CPUs above) contains several counterexamples showing that this is not the policy that is actually implemented.
- Sets 768-831 in CBox 0: Link
- Similar to the policy described above, but it seems that in about 15/16 of the cases, new blocks are inserted with age 3 instead of age 1
- The remaining sets are "follower sets" that use the policy that performs better
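The QLRU variants described on this page differ in their hit (H), replacement (R), and update (U) rules. The following minimal Python sketch combines the rules as they are described above: QLRUSet(16) models the Haswell leader sets (QLRU_H11_M1_R0_U0), QLRUSet(4, hit="H00", update="U1") the Cannon Lake L2 (QLRU_H00_M1_R0_U1), and QLRUSet(12, replace="R1", update="U2") the Ivy Bridge leader sets (QLRU_H11_M1_R1_U2). The R2 rule and the UMO suffix are not described on this page and are not modeled. The class and parameter names are hypothetical and are not the API of the nanoBench simulator; the initial age of empty lines and the repetition of the U1 step are assumptions.

```python
class QLRUSet:
    """One cache set with 2-bit ages (0-3); lines with age 3 are eviction candidates.

    hit:     "H11": 3 or 2 -> 1, otherwise 0;  "H00": always 0
    replace: "R0": leftmost age-3 line;  "R1": leftmost age-3 line, else leftmost line
    update:  "U0": age all lines, repeat until an age-3 line exists
             "U1": like U0, but the accessed line is not aged (repetition assumed)
             "U2": age all lines once, without repeating
    Misses always insert the new block with age 1 (M1).
    """

    def __init__(self, assoc, hit="H11", replace="R0", update="U0"):
        assert assoc >= 2
        self.hit, self.replace, self.update = hit, replace, update
        self.lines = [None] * assoc
        # Assumption: empty lines have age 3, which yields left-to-right filling.
        self.ages = [3] * assoc

    def access(self, block):
        """Return True on a hit and False on a miss."""
        if block in self.lines:
            i = self.lines.index(block)
            if self.hit == "H11":
                self.ages[i] = 1 if self.ages[i] >= 2 else 0
            else:  # H00
                self.ages[i] = 0
            is_hit = True
        else:
            if 3 in self.ages:
                i = self.ages.index(3)  # leftmost line whose age is 3
            elif self.replace == "R1":
                i = 0  # no age-3 line: replace the leftmost line
            else:
                raise RuntimeError("R0 requires a line with age 3")
            self.lines[i] = block
            self.ages[i] = 1
            is_hit = False
        self._age(i)
        return is_hit

    def _age(self, accessed):
        # Runs only while no age is 3, so no age can exceed 3.
        while 3 not in self.ages:
            for j in range(len(self.ages)):
                if not (self.update == "U1" and j == accessed):
                    self.ages[j] += 1
            if self.update == "U2":
                break
```

For example, in a 4-way set with the Haswell rules, accessing A, B, C, and D leaves all four ages at 3; a subsequent miss replaces A (the leftmost age-3 line), and a hit on B then lowers its age from 3 to 1.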
- Core i5-3470 (Ivy Bridge)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (insertion order if empty depends on previous state) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 8
- Number of sets: 512
- Way size: 32 kB
- Latency: 12 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L3 cache
- Size: 6 MB
- Associativity: 12
- In CBox 0, the associativity seems to be only 11
- Number of CBoxes: 4
- Number of slices: 4
- Number of sets (per slice): 2048
- Way size (per slice): 128 kB
- Latency: 30 cycles Link
- Replacement policy: Set Dueling Link
- Sets 512-575 (in all CBoxes): QLRU_H11_M1_R1_U2 Link
- Similar to the policy used in sets 512-575 on Haswell, but:
- If, after an access, there is no more line whose age is 3, the ages of all lines (including the accessed line) are increased by 1; however, unlike with the Haswell policy, this step is not repeated if there is still no line whose age is 3
- Upon a cache miss, the leftmost line whose age is 3 is replaced; if there is no such line, the leftmost line is replaced
- Sets 768-831 (in all CBoxes): Link
- Similar to the policy described above, but it seems that in about 15/16 of the cases, new blocks are inserted with age 3 instead of age 1
- The remaining sets are "follower sets" that use the policy that performs better
- See also Intel Ivy Bridge Cache Replacement Policy
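The page does not describe how the follower sets determine which policy "performs better". A common mechanism for this is textbook set dueling with a saturating policy-selection counter (PSEL), sketched below in Python with the Ivy Bridge leader-set ranges. The counter width (10 bits), its initial value, the class name, and the fact that slices are not modeled are all assumptions; this is not a measured description of Intel's implementation.

```python
class SetDuelingSelector:
    """Set dueling between two policies using a saturating counter (PSEL)."""

    def __init__(self, leaders_a=range(512, 576), leaders_b=range(768, 832), bits=10):
        self.leaders_a = set(leaders_a)  # e.g., QLRU_H11_M1_R1_U2 (inserts with age 1)
        self.leaders_b = set(leaders_b)  # variant that mostly inserts with age 3
        self.max = (1 << bits) - 1
        self.psel = self.max // 2  # start at the midpoint

    def policy_for(self, set_index):
        """Return "A" or "B": the policy that a given set currently uses."""
        if set_index in self.leaders_a:
            return "A"
        if set_index in self.leaders_b:
            return "B"
        # Follower sets: a high counter means the A leaders miss more often.
        return "B" if self.psel > self.max // 2 else "A"

    def record_miss(self, set_index):
        """Update the counter on a miss; misses in follower sets are ignored."""
        if set_index in self.leaders_a:
            self.psel = min(self.psel + 1, self.max)
        elif set_index in self.leaders_b:
            self.psel = max(self.psel - 1, 0)
```

Leader sets always use their fixed policy; only the follower sets switch, and they do so as soon as the counter crosses its midpoint.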
- Core i7-2600 (Sandy Bridge)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (insertion order if empty depends on previous state) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 8
- Number of sets: 512
- Way size: 32 kB
- Latency: 12 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L3 cache
- Size: 8 MB
- Associativity: 16
- In CBox 0, the associativity seems to be only 15; see also Kayaalp et al.
- Number of CBoxes: 4
- Number of slices: 4
- Number of sets (per slice): 2048
- Way size (per slice): 128 kB
- Latency: 29 cycles Link
- Replacement policy: MRU_N Link
- Similar to the Nehalem policy (see below), but the status bits are not updated until all lines are filled
- Core i5-650 (Westmere)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 8
- Number of sets: 512
- Way size: 32 kB
- Latency: 10 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L3 cache
- Size: 4 MB
- Associativity: 16
- Number of sets: 4096
- Way size: 256 kB
- Latency: 45 cycles Link
- Replacement policy: MRU Link
- Same as the Nehalem policy (see below)
- Core i5-750 (Nehalem)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 4 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L2 cache
- Size: 256 kB
- Associativity: 8
- Number of sets: 512
- Way size: 32 kB
- Latency: 10 cycles Link
- Replacement policy: Tree-PLRU (with linear insertion order if empty) Link 1 Link 2
- L3 cache
- Size: 8 MB
- Associativity: 16
- Number of sets: 8192
- Way size: 512 kB
- Latency: 46 cycles Link
- Replacement policy: MRU Link
- One bit per cache line
- An access sets the bit to 0
- Upon a cache miss, the leftmost line whose bit is 1 is replaced, and the bit is set to 0
- Whenever the last 1-bit is set to 0, all other bits are set to 1
- Initially (after a WBINVD instruction), all bits are 1
- See also Eklov et al.
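The MRU rules listed above for Nehalem translate directly into code. The following minimal Python sketch models one set; the class name is hypothetical, and the Sandy Bridge MRU_N variant (which does not update the status bits until all lines are filled) is not modeled.

```python
class MRUSet:
    """Bit-based "MRU" policy as described for the Nehalem L3 above."""

    def __init__(self, assoc=16):
        assert assoc >= 2
        self.lines = [None] * assoc
        self.bits = [1] * assoc  # after WBINVD, all bits are 1

    def access(self, block):
        """Return True on a hit and False on a miss."""
        if block in self.lines:
            i = self.lines.index(block)
            is_hit = True
        else:
            i = self.bits.index(1)  # leftmost line whose bit is 1
            self.lines[i] = block
            is_hit = False
        self.bits[i] = 0  # any access sets the bit to 0
        if 1 not in self.bits:
            # The last 1-bit was just cleared: set all other bits to 1.
            self.bits = [1] * len(self.bits)
            self.bits[i] = 0
        return is_hit
```

In a 4-way set, accessing A, B, C, and D clears the last 1-bit on D, so the bits become 1, 1, 1, 0; the next miss then replaces A, the leftmost line whose bit is 1.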
- Core 2 Duo E8400 (Wolfdale)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 3 cycles Link
- Replacement policy: Tree-PLRU (initial insertion order unknown) Link 1 Link 2
- L2 cache
- Size: 6 MB
- Associativity: 24
- Number of sets: 4096
- Way size: 256 kB
- Latency: 15 cycles Link
- Replacement policy: Rand-PLRU Link
- Core 2 Duo E6750 (Conroe)
- L1 data cache
- Size: 32 kB
- Associativity: 8
- Number of sets: 64
- Way size: 4 kB
- Latency: 3 cycles Link
- Replacement policy: Tree-PLRU (initial insertion order unknown) Link 1 Link 2
- L2 cache
- Size: 4 MB
- Associativity: 16
- Number of sets: 4096
- Way size: 256 kB
- Latency: 14 cycles Link
- Replacement policy: PLRU-Rand Link
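The size, way size, and number of sets given for each cache are related. Assuming 64-byte cache lines (the line size of these Intel processors, although it is not stated on this page), the way size is sets × 64 B, and the total size is slices × sets × associativity × 64 B. The following Python sketch checks a few entries from the lists above and computes the set index for a non-sliced cache such as the L1. The function names are hypothetical, and the undocumented mapping of addresses to L3 slices is not modeled.

```python
LINE_SIZE = 64  # bytes; assumed cache-line size
KB, MB = 1024, 1024 * 1024


def way_size(sets, line_size=LINE_SIZE):
    """Bytes covered by one way: one line per set."""
    return sets * line_size


def cache_size(assoc, sets, slices=1, line_size=LINE_SIZE):
    """Total capacity in bytes: slices x sets x associativity x line size."""
    return slices * way_size(sets, line_size) * assoc


def set_index(addr, sets, line_size=LINE_SIZE):
    """Set index in a non-sliced cache: the address bits just above the line offset."""
    return (addr // line_size) % sets


# (size, associativity, sets per slice, slices), copied from the lists above
EXAMPLES = {
    "Ice Lake L1": (48 * KB, 12, 64, 1),
    "Ice Lake L3": (6 * MB, 12, 1024, 8),
    "Haswell L3": (8 * MB, 16, 2048, 4),
    "Wolfdale L2": (6 * MB, 24, 4096, 1),
}

for name, (size, assoc, sets, slices) in EXAMPLES.items():
    assert cache_size(assoc, sets, slices) == size, name
```

For a 64-set L1, set_index(0x12345, 64) is 13, and addresses that differ by a multiple of the 4 kB way size map to the same set.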