VPSHUFB_EVEX (YMM, YMM, YMM)
Summary: "Packed Shuffle Bytes"
Reference: https://www.felixcloutier.com/x86/pshufb
Extension: AVX512EVEX
Category: AVX512
ISA-Set: AVX512BW_256
CPL: 3
iform: VPSHUFB_YMMu8_MASKmskw_YMMu8_YMMu8_AVX512
iclass: VPSHUFB
ASM: {evex} VPSHUFB
Operands
Operand 1 (w): Register (YMM0, YMM1, YMM2, YMM3, YMM4, YMM5, YMM6, YMM7, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13, YMM14, YMM15, YMM16, YMM17, YMM18, YMM19, YMM20, YMM21, YMM22, YMM23, YMM24, YMM25, YMM26, YMM27, YMM28, YMM29, YMM30, YMM31)
Operand 2 (r): Register (YMM0, YMM1, YMM2, YMM3, YMM4, YMM5, YMM6, YMM7, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13, YMM14, YMM15, YMM16, YMM17, YMM18, YMM19, YMM20, YMM21, YMM22, YMM23, YMM24, YMM25, YMM26, YMM27, YMM28, YMM29, YMM30, YMM31)
Operand 3 (r): Register (YMM0, YMM1, YMM2, YMM3, YMM4, YMM5, YMM6, YMM7, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13, YMM14, YMM15, YMM16, YMM17, YMM18, YMM19, YMM20, YMM21, YMM22, YMM23, YMM24, YMM25, YMM26, YMM27, YMM28, YMM29, YMM30, YMM31)
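As a rough illustration of what the instruction computes, the following Python sketch models the 256-bit byte shuffle. It assumes no write mask is applied, as the three-operand form listed here suggests. Operand 2 supplies the data bytes and operand 3 supplies the shuffle control. The function name `vpshufb_ymm` and its `src`/`ctrl` parameters are hypothetical names chosen for this example; they are not part of any library.

```python
def vpshufb_ymm(src: bytes, ctrl: bytes) -> bytes:
    """Reference model of an unmasked 256-bit VPSHUFB.

    src  models operand 2 (the data bytes).
    ctrl models operand 3 (the shuffle control bytes).
    The return value models operand 1 (the destination).
    """
    if len(src) != 32 or len(ctrl) != 32:
        raise ValueError("both operands must be 32 bytes (one YMM register)")

    out = bytearray(32)
    for i in range(32):
        # Each 128-bit lane is shuffled independently:
        # bytes 0..15 read from lane 0, bytes 16..31 read from lane 1.
        lane_base = i & 0x10
        c = ctrl[i]
        if c & 0x80:
            # Bit 7 of the control byte set: the result byte is zeroed.
            out[i] = 0
        else:
            # The low 4 bits select a byte within the same lane.
            out[i] = src[lane_base + (c & 0x0F)]
    return bytes(out)


# Usage: reverse the byte order inside each 128-bit lane.
data = bytes(range(32))
reverse_in_lane = bytes(15 - (i % 16) for i in range(32))
reversed_lanes = vpshufb_ymm(data, reverse_in_lane)
```

Because the index is taken modulo 16 within a lane, a control byte can never pull data across the 128-bit lane boundary; the model above reflects that by adding `lane_base` to the 4-bit index.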
Available performance data
Emerald Rapids
Alder Lake-P
Rocket Lake
Tiger Lake
Ice Lake
Cascade Lake
Cannon Lake
Skylake-X
AMD Zen 5
AMD Zen 4
Emerald Rapids
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p15
Alder Lake-P
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p15
Rocket Lake
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p15
Tiger Lake
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p15
Ice Lake
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p15
Documentation
Latency: 1.0
Throughput: 0.5
Cascade Lake
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 1.00
Measured (loop): 1.00
Measured (unrolled): 1.00
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p5
Cannon Lake
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 1.00
Measured (loop): 1.00
Measured (unrolled): 1.00
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p5
Skylake-X
Measurements
Latencies
Latency operand 2 → 1: 1
Latency operand 3 → 1: 1
Throughput
Computed from the port usage: 1.00
Measured (loop): 1.00
Measured (unrolled): 1.00
Number of μops
Executed: 1
Retire slots: 1
Decoded (MITE): 1
Microcode Sequencer (MS): 0
Port usage: 1*p5
IACA 2.3
Throughput
Computed from the port usage: 1.00
IACA: 1.00
Number of μops: 1
Port usage: 1*p5
IACA 3.0
Throughput
Computed from the port usage: 1.00
IACA: 0.98
Number of μops: 1
Port usage: 1*p5
AMD Zen 5
Measurements
Latencies
Latency operand 2 → 1: 2
Latency operand 3 → 1: 2
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Port usage: 1*FP12
Documentation
Latency: 2
Throughput: 0.50
Number of μops: 1
Port usage: FP1/2
AMD Zen 4
Measurements
Latencies
Latency operand 2 → 1: 2
Latency operand 3 → 1: 2
Throughput
Computed from the port usage: 0.50
Measured (loop): 0.50
Measured (unrolled): 0.50
Number of μops
Executed: 1
Port usage: 1*FP12
Documentation
Latency: 2
Throughput: 0.50
Number of μops: 1
Port usage: FP1/2
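The "Computed from the port usage" values listed above are consistent with a simple rule for the single-group case: if N μops can each issue on any of K ports, the best-case reciprocal throughput is N / K cycles per instruction. The sketch below applies that rule to the port-usage strings that appear on this page. It is an illustration only: the helper name `throughput_from_port_usage` is hypothetical, it assumes each digit after `p` or `FP` names one port, and it does not attempt the more general computation needed when an instruction has several port groups.

```python
import re


def throughput_from_port_usage(usage: str) -> float:
    """Best-case cycles per instruction for a single port group.

    Accepts strings of the form "N*pXYZ" (Intel) or "N*FPXYZ" (AMD),
    where each digit after the prefix is assumed to name one port.
    """
    match = re.fullmatch(r"(\d+)\*(?:p|FP)(\d+)", usage)
    if match is None:
        raise ValueError(f"unsupported port usage string: {usage!r}")

    uop_count = int(match.group(1))
    port_count = len(set(match.group(2)))  # distinct ports in the group
    return uop_count / port_count


# The three port-usage strings that occur on this page.
for usage in ("1*p15", "1*p5", "1*FP12"):
    print(f"{usage}: {throughput_from_port_usage(usage):.2f}")
```

Under these assumptions, `1*p15` and `1*FP12` give 0.50 (one μop spread over two ports) and `1*p5` gives 1.00 (one μop restricted to a single port), matching the computed figures in the listings.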