MUL (R32) - Throughput and Uops
With unroll_count=500 and no inner loop
Code:
0: 41 f7 e0 mul r8d
Results:
Instructions retired: 1.0
Core cycles: 4.75
Reference cycles: 4.75
RS_UOPS_DISPATCHED: 3.0
UOPS_PORT_0: 1.0
UOPS_PORT_1: 1.0
UOPS_PORT_2: 0.0
UOPS_PORT_3: 0.0
UOPS_PORT_4: 0.0
UOPS_PORT_5: 1.0
With loop_count=1000 and unroll_count=10
Code:
0: 41 f7 e0 mul r8d
Results:
Instructions retired: 1.2
Core cycles: 4.77
Reference cycles: 4.77
RS_UOPS_DISPATCHED: 3.2
UOPS_PORT_0: 1.03
UOPS_PORT_1: 1.03
UOPS_PORT_2: 0.0
UOPS_PORT_3: 0.0
UOPS_PORT_4: 0.0
UOPS_PORT_5: 1.13
With loop_count=100 and unroll_count=100
Code:
0: 41 f7 e0 mul r8d
Results:
Instructions retired: 1.02
Core cycles: 4.75
Reference cycles: 4.75
RS_UOPS_DISPATCHED: 3.02
UOPS_PORT_0: 0.94
UOPS_PORT_1: 1.01
UOPS_PORT_2: 0.0
UOPS_PORT_3: 0.0
UOPS_PORT_4: 0.0
UOPS_PORT_5: 1.07
With additional dependency-breaking instructions
With unroll_count=500 and no inner loop
Code:
0: 48 31 c0 xor rax,rax
3: 41 f7 e0 mul r8d
Results:
Instructions retired: 2.0
Core cycles: 1.53
Reference cycles: 1.53
RS_UOPS_DISPATCHED: 4.0
UOPS_PORT_0: 1.29
UOPS_PORT_1: 1.32
UOPS_PORT_2: 0.0
UOPS_PORT_3: 0.0
UOPS_PORT_4: 0.0
UOPS_PORT_5: 1.39
With loop_count=1000 and unroll_count=10
Code:
0: 48 31 c0 xor rax,rax
3: 41 f7 e0 mul r8d
Results:
Instructions retired: 2.2
Core cycles: 1.58
Reference cycles: 1.58
RS_UOPS_DISPATCHED: 4.2
UOPS_PORT_0: 1.36
UOPS_PORT_1: 1.38
UOPS_PORT_2: 0.0
UOPS_PORT_3: 0.0
UOPS_PORT_4: 0.0
UOPS_PORT_5: 1.46
With loop_count=100 and unroll_count=100
Code:
0: 48 31 c0 xor rax,rax
3: 41 f7 e0 mul r8d
Results:
Instructions retired: 2.02
Core cycles: 1.52
Reference cycles: 1.52
RS_UOPS_DISPATCHED: 4.02
UOPS_PORT_0: 1.31
UOPS_PORT_1: 1.32
UOPS_PORT_2: 0.0
UOPS_PORT_3: 0.0
UOPS_PORT_4: 0.0
UOPS_PORT_5: 1.39
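As a reading aid, the unroll_count=500 figures above can be cross-checked with a few lines of Python. This is only a sketch and not part of nanoBench: the dictionary layout and the helper names port_sum and speedup are made up for illustration, and the numbers are copied by hand from the results in this section.

```python
# Values copied from the two unroll_count=500 result blocks in this section.
# "chained" is the plain `mul r8d` run; "broken" is the run that adds
# `xor rax,rax` as a dependency-breaking instruction.
chained = {"cycles": 4.75, "uops": 3.0, "ports": {0: 1.0, 1: 1.0, 5: 1.0}}
broken = {"cycles": 1.53, "uops": 4.0, "ports": {0: 1.29, 1: 1.32, 5: 1.39}}


def port_sum(run):
    """Sum the per-port uop counts, rounded to the precision of the report."""
    return round(sum(run["ports"].values()), 2)


def speedup(slow, fast):
    """Ratio of core cycles per iteration between two runs."""
    return slow["cycles"] / fast["cycles"]


if __name__ == "__main__":
    for name, run in (("chained", chained), ("broken", broken)):
        print(name, "port sum:", port_sum(run), "RS_UOPS_DISPATCHED:", run["uops"])
    print("cycle ratio:", round(speedup(chained, broken), 2))
```

With these inputs, the per-port counts add up to RS_UOPS_DISPATCHED in both runs, and the variant with xor rax,rax needs roughly a third of the core cycles per iteration.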