PCMPESTRM (XMM, XMM, I8)
Summary:
"Packed Compare Explicit Length Strings, Return Mask"
Reference:
https://www.felixcloutier.com/x86/PCMPESTRM.html
Extension:
SSE4
Category:
SSE
ISA-Set:
SSE42
CPL:
3
iform:
PCMPESTRM_XMMdq_XMMdq_IMMb
iclass:
PCMPESTRM
ASM:
PCMPESTRM
Operands
Operand 1 (r): Register (XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15)
Operand 2 (r): Register (XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15)
Operand 3 (r): 8-bit immediate
Operand 4 (r, suppressed): Register (EAX)
Operand 5 (r, suppressed): Register (EDX)
Operand 6 (w, suppressed): Register (XMM0)
Operand 7 (w, suppressed): Flags (AF: w, CF: w, OF: w, PF: w, SF: w, ZF: w)
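The operand list above can be read as a data flow: the two XMM registers and the explicit lengths in EAX and EDX go in, and a mask in XMM0 plus the flags come out. The following is a minimal sketch of that flow for one mode only, assuming an immediate of 0x00 (unsigned bytes, "equal any" aggregation, positive polarity, bit-mask output) and following the behavior described in the Intel SDM as understood here. The function name `pcmpestrm_equal_any` and the dictionary of flags are hypothetical illustrations, not part of any library.

```python
def pcmpestrm_equal_any(op1, op2, eax, edx):
    """Model PCMPESTRM for imm8 = 0x00 only (assumed mode:
    unsigned bytes, equal any, positive polarity, bit-mask output).

    op1, op2: 16-byte values standing in for operands 1 and 2.
    eax, edx: explicit lengths for op1 and op2 (operands 4 and 5).
    Returns (mask, flags); mask stands in for the low bits of XMM0.
    """
    if len(op1) != 16 or len(op2) != 16:
        raise ValueError("both XMM operands must be 16 bytes")

    # Explicit lengths use the absolute value, saturated to 16 byte elements.
    len1 = min(abs(eax), 16)
    len2 = min(abs(edx), 16)

    # "Equal any": bit j is set when valid byte j of op2 occurs
    # anywhere in the valid part of op1.
    charset = set(op1[:len1])
    mask = 0
    for j in range(len2):
        if op2[j] in charset:
            mask |= 1 << j

    flags = {
        "CF": int(mask != 0),   # set when the result mask is non-zero
        "ZF": int(len2 < 16),   # set when op2 is shorter than a full register
        "SF": int(len1 < 16),   # set when op1 is shorter than a full register
        "OF": mask & 1,         # bit 0 of the result mask
        "AF": 0,                # cleared
        "PF": 0,                # cleared
    }
    return mask, flags


vowels = b"aeiou".ljust(16, b"\0")
text = b"hello world".ljust(16, b"\0")
mask, flags = pcmpestrm_equal_any(vowels, text, eax=5, edx=11)
print(bin(mask), flags)
```

In this example the vowels in "hello world" sit at byte positions 1, 4 and 7, so those three mask bits are set. Other immediate values (word elements, ranges, equal each, equal ordered, negated polarity, byte-mask output) are not modeled here.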
Available performance data
Alder Lake-P
Alder Lake-E
Rocket Lake
Tiger Lake
Ice Lake
Cascade Lake
Cannon Lake
Skylake-X
Coffee Lake
Kaby Lake
Skylake
Broadwell
Haswell
Ivy Bridge
Sandy Bridge
Westmere
Nehalem
Tremont
Goldmont Plus
Goldmont
Airmont
AMD Zen 4
AMD Zen 3
AMD Zen 2
AMD Zen+
Alder Lake-P
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤16
Latency operand 4 → 7:
16
Latency operand 5 → 6:
≤16
Latency operand 5 → 7:
16
Throughput
Computed from the port usage: 3.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+3*p015+1*p06+1*p1+1*p5
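The "Computed from the port usage" figure can be reproduced from the port usage string. The sketch below assumes the value is the best achievable load on the busiest port when each μop may go to any port in its group. Under that assumption the value equals the maximum, over all subsets of ports, of the number of μops confined to that subset divided by the subset size. The helper names `parse_port_usage` and `throughput_bound` are hypothetical, and this is not necessarily how the published figures were produced.

```python
import re
from itertools import combinations


def parse_port_usage(usage):
    """Parse e.g. '3*p0+1*p06' into [(3, {'0'}), (1, {'0', '6'})].

    Assumes each term looks like '<count>*<prefix><port digits>',
    where the prefix is letters such as 'p' or 'FP'.
    """
    terms = []
    for term in usage.split("+"):
        m = re.fullmatch(r"(\d+)\*[A-Za-z]+(\d+)", term.strip())
        if m is None:
            raise ValueError(f"unrecognised term: {term!r}")
        terms.append((int(m.group(1)), frozenset(m.group(2))))
    return terms


def throughput_bound(usage):
    """Lower bound on cycles per instruction from port contention alone."""
    terms = parse_port_usage(usage)
    ports = sorted(set().union(*(p for _, p in terms)))
    best = 0.0
    for k in range(1, len(ports) + 1):
        for subset in combinations(ports, k):
            allowed = set(subset)
            # Count the uops that can only execute on ports in this subset.
            confined = sum(n for n, p in terms if p <= allowed)
            best = max(best, confined / k)
    return best


print(throughput_bound("3*p0+3*p015+1*p06+1*p1+1*p5"))
```

For the port usage above, the three μops that can only run on p0 dominate, which gives the listed value of 3.00. The bound models execution-port contention only, so the measured throughput (5.00 here) can be higher.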
Alder Lake-E
Measurements
Latencies
Latency operand 1 → 6:
6
Latency operand 1 → 7:
≤17
Latency operand 2 → 6:
9
Latency operand 2 → 7:
≤19
Latency operand 4 → 6:
≤17
Latency operand 4 → 7:
19
Latency operand 5 → 6:
≤18
Latency operand 5 → 7:
20
Throughput
Measured (loop):
9.11
Measured (unrolled):
10.00
Number of μops
Executed: 11
Microcode Sequencer (MS): 10
Requires the complex decoder
Rocket Lake
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤16
Latency operand 4 → 7:
16
Latency operand 5 → 6:
≤16
Latency operand 5 → 7:
16
Throughput
Computed from the port usage: 3.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+3*p015+1*p06+1*p1+1*p5
Tiger Lake
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤16
Latency operand 4 → 7:
16
Latency operand 5 → 6:
≤16
Latency operand 5 → 7:
16
Throughput
Computed from the port usage: 3.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+3*p015+1*p06+1*p1+1*p5
Ice Lake
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤16
Latency operand 4 → 7:
16
Latency operand 5 → 6:
≤16
Latency operand 5 → 7:
16
Throughput
Computed from the port usage: 3.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+3*p015+1*p06+1*p1+1*p5
Cascade Lake
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤13
Latency operand 4 → 6:
≤17
Latency operand 4 → 7:
17
Latency operand 5 → 6:
≤17
Latency operand 5 → 7:
17
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
Cannon Lake
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤16
Latency operand 4 → 7:
16
Latency operand 5 → 6:
≤16
Latency operand 5 → 7:
16
Throughput
Computed from the port usage: 3.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+3*p015+1*p06+1*p1+1*p5
Skylake-X
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤13
Latency operand 4 → 6:
≤17
Latency operand 4 → 7:
17
Latency operand 5 → 6:
≤17
Latency operand 5 → 7:
17
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
IACA 2.3
Throughput
Computed from the port usage: 4.00
IACA:
4.05
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
IACA 3.0
Throughput
Computed from the port usage: 4.00
IACA:
3.47
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
Coffee Lake
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤13
Latency operand 4 → 6:
≤17
Latency operand 4 → 7:
17
Latency operand 5 → 6:
≤17
Latency operand 5 → 7:
17
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
Kaby Lake
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤13
Latency operand 4 → 6:
≤17
Latency operand 4 → 7:
17
Latency operand 5 → 6:
≤17
Latency operand 5 → 7:
17
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
Skylake
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤12
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤13
Latency operand 4 → 6:
≤17
Latency operand 4 → 7:
17
Latency operand 5 → 6:
≤17
Latency operand 5 → 7:
17
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
IACA 2.3
Throughput
Computed from the port usage: 4.00
IACA:
4.05
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
IACA 3.0
Throughput
Computed from the port usage: 4.00
IACA:
3.47
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
Broadwell
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤11
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤11
Latency operand 4 → 6:
≤15
Latency operand 4 → 7:
15
Latency operand 5 → 6:
≤15
Latency operand 5 → 7:
15
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
IACA 2.2
Throughput
Computed from the port usage: 4.00
IACA:
4.00 (with the -no_interiteration flag: 4.00)
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
IACA 2.3
Throughput
Computed from the port usage: 4.00
IACA:
4.05
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
IACA 3.0
Throughput
Computed from the port usage: 4.00
IACA:
3.55
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
Haswell
Measurements
Latencies
Latency operand 1 → 6:
11
Latency operand 1 → 7:
≤11
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤11
Latency operand 4 → 6:
≤15
Latency operand 4 → 7:
15
Latency operand 5 → 6:
≤15
Latency operand 5 → 7:
15
Throughput
Computed from the port usage: 4.00
Measured (loop):
5.00
Measured (unrolled):
5.00
Number of μops
Executed: 9
Retire slots: 9
Decoded (MITE): 4
Microcode Sequencer (MS): 5
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p06+1*p1+4*p5
IACA 2.2
Throughput
Computed from the port usage: 4.00
IACA:
4.00 (with the -no_interiteration flag: 4.00)
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
IACA 2.3
Throughput
Computed from the port usage: 4.00
IACA:
4.05
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
IACA 3.0
Throughput
Computed from the port usage: 4.00
IACA:
3.57
Number of μops:
9
Port usage:
4*p0+1*p015+1*p0156+3*p5
Ivy Bridge
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤11
Latency operand 2 → 6:
13
Latency operand 2 → 7:
≤14
Latency operand 4 → 6:
≤19
Latency operand 4 → 7:
20
Latency operand 5 → 6:
≤19
Latency operand 5 → 7:
20
Throughput
Computed from the port usage: 3.50
Measured (loop):
4.00
Measured (unrolled):
4.00
Number of μops
Executed: 8
Retire slots: 8
Decoded (MITE): 4
Microcode Sequencer (MS): 4
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p05+1*p1+3*p5
Sandy Bridge
Measurements
Latencies
Latency operand 1 → 6:
10
Latency operand 1 → 7:
≤11
Latency operand 2 → 6:
13
Latency operand 2 → 7:
≤14
Latency operand 4 → 6:
≤19
Latency operand 4 → 7:
20
Latency operand 5 → 6:
≤19
Latency operand 5 → 7:
20
Throughput
Computed from the port usage: 3.50
Measured (loop):
4.00
Measured (unrolled):
4.00
Number of μops
Executed: 8
Retire slots: 8
Decoded (MITE): 4
Microcode Sequencer (MS): 4
Requires the complex decoder (no other instruction can be decoded with simple decoders in the same cycle)
Port usage:
3*p0+1*p05+1*p1+3*p5
Westmere
Measurements
Latencies
Latency operand 1 → 6:
7
Latency operand 1 → 7:
≤9
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤14
Latency operand 4 → 7:
14
Latency operand 5 → 6:
≤14
Latency operand 5 → 7:
14
Throughput
Computed from the port usage: 3.00
Measured (loop):
6.00
Measured (unrolled):
6.00
Number of μops
Executed: 9
Retire slots: 9
Microcode Sequencer (MS): 31
Requires the complex decoder
Port usage:
5*p015+1*p05+3*p1
Nehalem
Measurements
Latencies
Latency operand 1 → 6:
7
Latency operand 1 → 7:
≤9
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤12
Latency operand 4 → 6:
≤14
Latency operand 4 → 7:
14
Latency operand 5 → 6:
≤14
Latency operand 5 → 7:
14
Throughput
Computed from the port usage: 3.00
Measured (loop):
6.00
Measured (unrolled):
6.00
Number of μops
Executed: 9
Retire slots: 9
Microcode Sequencer (MS): 25
Requires the complex decoder
Port usage:
5*p015+1*p05+3*p1
Tremont
Measurements
Latencies
Latency operand 1 → 6:
7
Latency operand 1 → 7:
≤17
Latency operand 2 → 6:
8
Latency operand 2 → 7:
≤18
Latency operand 4 → 6:
≤18
Latency operand 4 → 7:
19
Latency operand 5 → 6:
≤19
Latency operand 5 → 7:
20
Throughput
Measured (loop):
9.02
Measured (unrolled):
9.00
Number of μops
Executed: 11
Microcode Sequencer (MS): 10
Requires the complex decoder
Goldmont Plus
Measurements
Latencies
Latency operand 1 → 6:
17
Latency operand 1 → 7:
≤29
Latency operand 2 → 6:
11
Latency operand 2 → 7:
≤24
Latency operand 4 → 6:
≤26
Latency operand 4 → 7:
31
Latency operand 5 → 6:
≤25
Latency operand 5 → 7:
30
Throughput
Measured (loop):
12.00
Measured (unrolled):
12.00
Number of μops
Executed: 10
Microcode Sequencer (MS): 10
Requires the complex decoder
Goldmont
Measurements
Latencies
Latency operand 1 → 6:
15
Latency operand 1 → 7:
≤28
Latency operand 2 → 6:
12
Latency operand 2 → 7:
≤23
Latency operand 4 → 6:
≤25
Latency operand 4 → 7:
30
Latency operand 5 → 6:
≤24
Latency operand 5 → 7:
29
Throughput
Measured (loop):
14.00
Measured (unrolled):
14.00
Number of μops
Executed: 9
Microcode Sequencer (MS): 9
Requires the complex decoder
Airmont
Measurements
Latencies
Latency operand 1 → 6:
17
Latency operand 1 → 7:
≤23
Latency operand 2 → 6:
17
Latency operand 2 → 7:
≤23
Latency operand 4 → 6:
≤22
Latency operand 4 → 7:
23
Latency operand 5 → 6:
≤21
Latency operand 5 → 7:
22
Throughput
Measured (loop):
17.00
Measured (unrolled):
17.00
Number of μops
Executed: 8
Microcode Sequencer (MS): 8
Requires the complex decoder
AMD Zen 4
Measurements
Latencies
Latency operand 1 → 6:
8
Latency operand 1 → 7:
≤11
Latency operand 2 → 6:
7
Latency operand 2 → 7:
≤11
Latency operand 4 → 6:
≤16
Latency operand 4 → 7:
15
Latency operand 5 → 6:
≤14
Latency operand 5 → 7:
13
Throughput
Computed from the port usage: 1.50
Measured (loop):
3.00
Measured (unrolled):
3.00
Number of μops
Executed: 7
Port usage:
2*FP01+1*FP0123+1*FP1+1*FP45
AMD Zen 3
Measurements
Latencies
Latency operand 1 → 6:
6
Latency operand 1 → 7:
≤10
Latency operand 2 → 6:
6
Latency operand 2 → 7:
≤10
Latency operand 4 → 6:
≤13
Latency operand 4 → 7:
13
Latency operand 5 → 6:
≤12
Latency operand 5 → 7:
12
Throughput
Computed from the port usage: 1.00
Measured (loop):
3.00
Measured (unrolled):
3.00
Number of μops
Executed: 7
Port usage:
3*FP0123+1*FP1+1*FP45
Documentation
Latency: 4
Throughput: 3.00
Number of μops: ucode
AMD Zen 2
Measurements
Latencies
Latency operand 1 → 6:
8
Latency operand 1 → 7:
≤10
Latency operand 2 → 6:
7
Latency operand 2 → 7:
≤10
Latency operand 4 → 6:
≤14
Latency operand 4 → 7:
13
Latency operand 5 → 6:
≤13
Latency operand 5 → 7:
12
Throughput
Measured (loop):
3.00
Measured (unrolled):
3.00
Number of μops
Executed: 7
Documentation
Latency: 4
Throughput: 3.00
Number of μops: ucode
AMD Zen+
Measurements
Latencies
Latency operand 1 → 6:
7
Latency operand 1 → 7:
≤10
Latency operand 2 → 6:
7
Latency operand 2 → 7:
≤10
Latency operand 4 → 6:
≤14
Latency operand 4 → 7:
13
Latency operand 5 → 6:
≤13
Latency operand 5 → 7:
12
Throughput
Measured (loop):
3.00
Measured (unrolled):
3.00
Number of μops
Executed: 7
Documentation
Latency: 4
Throughput: 3.00
Number of μops: ucode