OV - turborvb-mpi.x - Loop 10287

0x7f4dd4 XOR	%R8D,%R8D

0x7f4dd7 INC	%R8

0x7f4dda MOV	%RSI,%RDI

0x7f4ddd TEST	%RDX,%RDX

0x7f4de0 JLE	7f4e15

0x7f4de2 MOV	-0x2f8(%RBP),%R11

0x7f4de9 LEA	(%R11,%R9,1),%RCX

  (10288) 0x7f4ded XOR	%R15D,%R15D

  (10288) 0x7f4df0 INC	%R15

  (10288) 0x7f4df3 TEST	%R13,%R13

  (10288) 0x7f4df6 JLE	7f4e09

  (10288) 0x7f4df8 LEA	(%RCX,%RDI,1),%R11

    (10289) 0x7f4dfc MOV	%RSI,-0x8(%R11,%R15,8)

    (10289) 0x7f4e01 INC	%R15

    (10289) 0x7f4e04 CMP	%R13,%R15

    (10289) 0x7f4e07 JLE	7f4dfc

  (10288) 0x7f4e09 INC	%R8

  (10288) 0x7f4e0c LEA	(%RDI,%R13,8),%RDI

  (10288) 0x7f4e10 CMP	%RDX,%R8

  (10288) 0x7f4e13 JLE	7f4ded

0x7f4e15 INC	%R10

0x7f4e18 ADD	-0xa8(%RBP),%R9

0x7f4e1f CMP	-0x2e8(%RBP),%R10

0x7f4e26 JLE	7f4dd4
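
Read together, the three nested loops (10287, 10288, 10289) appear to write the value held in %RSI into consecutive 8-byte elements. The Python sketch below models only the address arithmetic of the listing. The function name, the parameter names, and the assumption that %R9 starts at 0 are illustrative and not taken from the binary.

```python
def stored_offsets(n_inner, n_mid, n_outer, outer_stride, rsi=0):
    """Byte offsets, relative to the base loaded from -0x2f8(%RBP), written by the nest.

    n_inner and n_mid play the roles of %R13 and %RDX, n_outer is the trip
    count of loop 10287, and outer_stride stands for the value at -0xa8(%RBP).
    rsi is both the stored value and the initial %RDI (MOV %RSI,%RDI).
    """
    offsets = []
    r9 = 0                          # ASSUMED initial value of %R9
    for _ in range(n_outer):        # loop 10287
        rdi = rsi                   # MOV %RSI,%RDI
        for _ in range(n_mid):      # loop 10288, skipped when %RDX <= 0 (TEST/JLE)
            for r15 in range(1, n_inner + 1):            # loop 10289
                offsets.append(r9 + rdi + 8 * r15 - 8)   # MOV %RSI,-0x8(%R11,%R15,8)
            rdi += 8 * n_inner      # LEA (%RDI,%R13,8),%RDI
        r9 += outer_stride          # ADD -0xa8(%RBP),%R9
    return offsets


print(stored_offsets(n_inner=2, n_mid=3, n_outer=2, outer_stride=48))
```

With rsi = 0 the pattern is consistent with zero-initialising a three-dimensional array of 8-byte elements, although this cannot be confirmed without debug symbols.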

  ***  This Panel is Intentionally Left Blank.  ***

It is due to a lack of debug symbols in the given object
(loop or function).

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	13.06
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.40
Bottlenecks
Function	compute_eloc_logpsi_IP_task4
Source
Source loop unroll info	NA
Source loop unroll confidence level	NA
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	2.33
CQA cycles if no scalar integer	2.33
CQA cycles if FP arith vectorized	2.33
CQA cycles if fully vectorized	0.18
Front-end cycles	2.33
P0 cycles	1.67
P1 cycles	1.67
P2 cycles	1.33
P3 cycles	1.33
P4 cycles	0.00
P5 cycles	1.67
P6 cycles	1.67
P7 cycles	0.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	2.42
Stall cycles (UFS)	0.00
Nb insns	10.33
Nb uops	9.33
Nb loads	2.67
Nb stores	0.00
Nb stack references	2.67
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	9.07
Bytes prefetched	0.00
Bytes loaded	21.33
Bytes stored	0.00
Stride 0	1.00
Stride 1	0.00
Stride n	0.00
Stride unknown	0.67
Stride indirect	0.00
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	NA
Vectorization ratio mul	NA
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	0.00
Vector-efficiency ratio all	11.04
Vector-efficiency ratio load	12.50
Vector-efficiency ratio store	NA
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	12.50
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	10.42

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	12.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.33
Bottlenecks	micro-operation queue,
Function	compute_eloc_logpsi_IP_task4
Source
Source loop unroll info	NA
Source loop unroll confidence level	NA
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	2.00
CQA cycles if no scalar integer	2.00
CQA cycles if FP arith vectorized	2.00
CQA cycles if fully vectorized	0.17
Front-end cycles	2.00
P0 cycles	1.50
P1 cycles	1.50
P2 cycles	1.00
P3 cycles	1.00
P4 cycles	0.00
P5 cycles	1.50
P6 cycles	1.50
P7 cycles	0.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	2.09
Stall cycles (UFS)	0.00
Nb insns	9.00
Nb uops	8.00
Nb loads	2.00
Nb stores	0.00
Nb stack references	2.00
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	8.00
Bytes prefetched	0.00
Bytes loaded	16.00
Bytes stored	0.00
Stride 0	1.00
Stride 1	0.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	0.00
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	NA
Vectorization ratio mul	NA
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	0.00
Vector-efficiency ratio all	11.25
Vector-efficiency ratio load	12.50
Vector-efficiency ratio store	NA
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	12.50
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	10.42

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	13.54
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.43
Bottlenecks	micro-operation queue,
Function	compute_eloc_logpsi_IP_task4
Source
Source loop unroll info	NA
Source loop unroll confidence level	NA
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	2.50
CQA cycles if no scalar integer	2.50
CQA cycles if FP arith vectorized	2.50
CQA cycles if fully vectorized	0.18
Front-end cycles	2.50
P0 cycles	1.75
P1 cycles	1.75
P2 cycles	1.50
P3 cycles	1.50
P4 cycles	0.00
P5 cycles	1.75
P6 cycles	1.75
P7 cycles	0.00
DIV/SQRT cycles	0.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	2.59
Stall cycles (UFS)	0.00
Nb insns	11.00
Nb uops	10.00
Nb loads	3.00
Nb stores	0.00
Nb stack references	3.00
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	9.60
Bytes prefetched	0.00
Bytes loaded	24.00
Bytes stored	0.00
Stride 0	1.00
Stride 1	0.00
Stride n	0.00
Stride unknown	1.00
Stride indirect	0.00
Vectorization ratio all	0.00
Vectorization ratio load	NA
Vectorization ratio store	NA
Vectorization ratio mul	NA
Vectorization ratio add_sub	0.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	0.00
Vector-efficiency ratio all	10.94
Vector-efficiency ratio load	NA
Vector-efficiency ratio store	NA
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	12.50
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	10.42

Average path: Display a virtual path defined by average values of all real paths
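
The averaged figures can be cross-checked against the per-path tables. The sketch below assumes that the third execution path, which is not listed in this report, has the same values as the 11-instruction path. Under that assumption the means reproduce the average-path metrics.

```python
# Per-path values copied from the metric tables. The third path is not
# listed in the report, so it is ASSUMED to match the 11-instruction path.
paths = [
    {"insns": 9, "uops": 8, "loads": 2, "bytes_loaded": 16, "cycles": 2.00},
    {"insns": 11, "uops": 10, "loads": 3, "bytes_loaded": 24, "cycles": 2.50},
    {"insns": 11, "uops": 10, "loads": 3, "bytes_loaded": 24, "cycles": 2.50},
]


def average(metric):
    """Mean of one metric over all paths, rounded like the report."""
    return round(sum(p[metric] for p in paths) / len(paths), 2)


averages = {metric: average(metric) for metric in paths[0]}
print(averages)
```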

Function	compute_eloc_logpsi_IP_task4
Source file and lines
Module	turborvb-mpi.x

Warnings:
Non-innermost loop: analyzing only self part (ignoring child loops).
This loop has 3 execution paths.

The presence of multiple execution paths is typically the main/first bottleneck.
Try to simplify control inside the loop: ideally, remove all conditional expressions, for example (if applicable) by:

- hoisting them (moving them outside the loop)
- turning them into conditional moves, MIN or MAX

Ex: if (x<0) x=0 => x = max(0,x) (Fortran intrinsic procedure)
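
The two rewrites named above can be illustrated with a minimal Python sketch. It only demonstrates that the branch-free forms give the same results, not the performance effect, which depends on the compiler. All function names are hypothetical. In this report the alternative paths seem to stem from the TEST %RDX,%RDX / JLE trip-count guard rather than from a source-level IF, so the hint may not apply directly.

```python
def clamp_with_branch(values):
    """if (x<0) x=0, written as a conditional inside the loop."""
    out = []
    for x in values:
        if x < 0:
            x = 0
        out.append(x)
    return out


def clamp_with_max(values):
    """Same result with MAX, leaving a single path through the loop body."""
    return [max(0, x) for x in values]


def scale_branch_inside(values, double):
    """Loop-invariant test re-evaluated on every iteration."""
    out = []
    for x in values:
        if double:
            out.append(2 * x)
        else:
            out.append(x)
    return out


def scale_branch_hoisted(values, double):
    """Same test hoisted: decided once, before the loop."""
    factor = 2 if double else 1
    return [factor * x for x in values]


data = [-3, 0, 4, -1, 7]
print(clamp_with_branch(data) == clamp_with_max(data))
print(scale_branch_inside(data, True) == scale_branch_hoisted(data, True))
```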

Vectorization

Your loop is not vectorized. Only 11% of vector register length is used (average across all SSE/AVX instructions). By vectorizing your loop, you can lower the cost of an iteration from 2.33 to 0.18 cycles (13.06x speedup).
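
Dividing the rounded figures quoted above (2.33 / 0.18) gives about 12.94 rather than 13.06. The sketch below shows one way the reported value can be reproduced: average the unrounded per-path costs first, then take the ratio. It assumes the unlisted third path equals the 2.50-cycle path. This is an inference from the numbers, not documented CQA behaviour.

```python
# (CQA cycles, speedup if fully vectorized) per path, from the metric tables.
# The third path is not listed; it is ASSUMED to equal the 2.50-cycle path.
paths = [(2.00, 12.00), (2.50, 13.54), (2.50, 13.54)]

avg_cycles = sum(c for c, _ in paths) / len(paths)
avg_vec_cycles = sum(c / s for c, s in paths) / len(paths)
implied_speedup = avg_cycles / avg_vec_cycles
naive_speedup = 2.33 / 0.18  # ratio of the already-rounded figures

print(round(avg_cycles, 2), round(avg_vec_cycles, 2))
print(round(implied_speedup, 2), round(naive_speedup, 2))
```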

Details

All SSE/AVX instructions are used in scalar version (process only one data element in vector registers). Since your execution units are vector units, only a vectorized loop can use their full power.

Workaround

Try another compiler or update/tune your current one:
- use the vec-report option to understand why your loop was not vectorized. If "existence of vector dependences", try the IVDEP directive. If, using IVDEP, "vectorization possible but seems inefficient", try the VECTOR ALWAYS directive.
Remove inter-iteration dependences from your loop and make it unit-stride:
- If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly. Fortran storage order is column-major, so the first index should vary in the innermost loop: do i; do j; a(i,j) = b(i,j) (slow, non stride 1) => do j; do i; a(i,j) = b(i,j) (fast, stride 1)
- If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA): do i; a(i)%x = b(i)%x (slow, non stride 1) => do i; a%x(i) = b%x(i) (fast, stride 1)
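
The effect of loop order on stride follows from the column-major layout. The following sketch computes the distance, in elements, between consecutive inner-loop accesses to a(i,j). The helper names and the array height of 100 rows are illustrative.

```python
def column_major_offset(i, j, n_rows):
    """0-based linear offset of Fortran element a(i,j) (1-based indices)."""
    return (i - 1) + (j - 1) * n_rows


def inner_loop_stride(inner, n_rows):
    """Elements skipped between two consecutive inner-loop accesses to a(i,j)."""
    first = column_major_offset(1, 1, n_rows)
    if inner == "i":
        return column_major_offset(2, 1, n_rows) - first
    return column_major_offset(1, 2, n_rows) - first


print(inner_loop_stride("j", n_rows=100))  # do i; do j: j is the inner index
print(inner_loop_stride("i", n_rows=100))  # do j; do i: i is the inner index
```

A stride of 1 means each iteration touches the neighbouring element, which is the access pattern vectorizing compilers generally handle best.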

Type of elements and instruction set

No instructions are processing arithmetic or math operations on FP elements. This loop is probably writing/copying data or processing integer elements.

Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetical operations. The binary loop is loading 21 bytes.

General properties

nb instructions	10.33
nb uops	9.33
loop length	40.33
used x86 registers	8.33
used mmx registers	0
used xmm registers	0
used ymm registers	0
used zmm registers	0
nb stack references	2.67

Front-end

MACRO FUSION NOT POSSIBLE

micro-operation queue	2.33 cycles
front end	2.33 cycles
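
The front-end figures in this report are consistent with an issue width of 4 uops per cycle. That width is inferred from the numbers and is not stated by the tool. The value 28/3 is likewise an assumed unrounded form of the 9.33 uops shown for the average path.

```python
UOPS_PER_CYCLE = 4  # ASSUMED issue width, inferred from the reported figures


def front_end_cycles(nb_uops):
    """Cycles needed to issue nb_uops at the assumed width."""
    return nb_uops / UOPS_PER_CYCLE


cases = [("average path", 28 / 3), ("9-instruction path", 8), ("11-instruction path", 10)]
for label, nb_uops in cases:
    print(label, round(front_end_cycles(nb_uops), 2))
```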

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	1.67	1.67	1.33	1.33	0.00	1.67	1.67	0.00
cycles	1.67	1.67	1.33	1.33	0.00	1.67	1.67	0.00

Execution ports to units layout:

P0 (256 bits): VPU, ALU, DIV/SQRT
P1 (256 bits): ALU, VPU
P2 (512 bits): store address, load
P3 (512 bits): store address, load
P4 (512 bits): store data
P5 (512 bits): ALU, VPU
P6: ALU
P7: store address

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Front-end and detailed OoO resources (UFS)

FE+BE cycles	2.42
Stall cycles	0.00

Cycles summary

Front-end	2.33
Dispatch	1.67
Data deps.	1.00
Overall L1	2.33

Vectorization ratios

all	0%
load	0%
store	NA (no store vectorizable/vectorized instructions)
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	0%

Vector efficiency ratios

all	11%
load	12%
store	NA (no store vectorizable/vectorized instructions)
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	12%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	10%

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 2.33 cycles. At this rate:

7% of peak load performance is reached (9.07 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))
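
The 128 bytes/cycle peak matches two load ports (P2 and P3) of 512 bits each, as given in the ports layout. The sketch below recomputes the load throughput and the percentage of peak. It assumes that the unlisted third path equals the 24-byte path and that percentages are truncated rather than rounded. Both assumptions are chosen because they match the printed values.

```python
LOAD_PORTS = 2                  # P2 and P3 in the ports layout
PORT_WIDTH_BYTES = 512 // 8     # each load port is 512 bits wide
peak = LOAD_PORTS * PORT_WIDTH_BYTES  # bytes loadable per cycle


def percent_of_peak(bytes_per_cycle):
    """Share of peak load throughput; truncation is ASSUMED."""
    return int(100 * bytes_per_cycle / peak)


# Bytes loaded / cycles per path; the third path is ASSUMED equal to the second.
per_path = [16 / 2.00, 24 / 2.50, 24 / 2.50]
average = sum(per_path) / len(per_path)

print(peak, round(average, 2), percent_of_peak(average))
```

Dividing the averaged figures directly (21.33 bytes by 2.33 cycles) gives a slightly higher value of about 9.1, so the reported 9.07 looks like a mean of per-path ratios.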

Function	compute_eloc_logpsi_IP_task4
Source file and lines
Module	turborvb-mpi.x

Warnings:
Non-innermost loop: analyzing only self part (ignoring child loops).
This loop has 3 execution paths.

Vectorization

Your loop is not vectorized. Only 11% of vector register length is used (average across all SSE/AVX instructions). By vectorizing your loop, you can lower the cost of an iteration from 2.00 to 0.17 cycles (12.00x speedup).

Execution units bottlenecks

Found no such bottlenecks but see expert reports for more complex bottlenecks.

Type of elements and instruction set

No instructions are processing arithmetic or math operations on FP elements. This loop is probably writing/copying data or processing integer elements.

Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetical operations. The binary loop is loading 16 bytes.

General properties

nb instructions	9
nb uops	8
loop length	33
used x86 registers	7
used mmx registers	0
used xmm registers	0
used ymm registers	0
used zmm registers	0
nb stack references	2

Front-end

ASSUMED MACRO FUSION FIT IN UOP CACHE

micro-operation queue	2.00 cycles
front end	2.00 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	1.50	1.50	1.00	1.00	0.00	1.50	1.50	0.00
cycles	1.50	1.50	1.00	1.00	0.00	1.50	1.50	0.00

Execution ports to units layout:

P0 (256 bits): VPU, ALU, DIV/SQRT
P1 (256 bits): ALU, VPU
P2 (512 bits): store address, load
P3 (512 bits): store address, load
P4 (512 bits): store data
P5 (512 bits): ALU, VPU
P6: ALU
P7: store address

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Front-end and detailed OoO resources (UFS)

FE+BE cycles	2.09
Stall cycles	0.00

Cycles summary

Front-end	2.00
Dispatch	1.50
Data deps.	1.00
Overall L1	2.00

Vectorization ratios

all	0%
load	0%
store	NA (no store vectorizable/vectorized instructions)
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	0%

Vector efficiency ratios

all	11%
load	12%
store	NA (no store vectorizable/vectorized instructions)
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	12%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	10%

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 2.00 cycles. At this rate:

6% of peak load performance is reached (8.00 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))

Front-end bottlenecks

Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck). By removing all these bottlenecks, you can lower the cost of an iteration from 2.00 to 1.50 cycles (1.33x speedup).
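
The quoted speedup can be reproduced from the Cycles summary if the iteration cost is taken as the largest of the front-end, dispatch, and data-dependency bounds. That model is an assumption that happens to match the "Overall L1" line. The function name is hypothetical.

```python
def next_bottleneck_speedup(front_end, dispatch, data_deps):
    """Speedup obtained if the largest of the three cycle bounds is removed."""
    bounds = sorted([front_end, dispatch, data_deps], reverse=True)
    return bounds[0] / bounds[1]


# Values from the Cycles summary of the 9-instruction and 11-instruction paths.
print(round(next_bottleneck_speedup(2.00, 1.50, 1.00), 2))
print(round(next_bottleneck_speedup(2.50, 1.75, 1.00), 2))
```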

ASM code

In the binary file, the address of the loop is: 7f4dd4

Instruction	Nb FU	P0	P1	P2	P3	P5	P6	Latency	Recip. throughput
XOR %R8D,%R8D	1	0	0	0	0	0	0	0	0.25
INC %R8	1	0.25	0.25	0	0	0.25	0.25	1	0.25
MOV %RSI,%RDI	1	0	0	0	0	0	0	0	0.25
TEST %RDX,%RDX	1	0.25	0.25	0	0	0.25	0.25	1	0.25
JLE 7f4e15 <compute_eloc_logpsi_IP_task4_+0x190>	1	0.50	0	0	0	0	0.50	0	0.50-1
INC %R10	1	0.25	0.25	0	0	0.25	0.25	1	0.25
ADD -0xa8(%RBP),%R9	1	0.25	0.25	0.50	0.50	0.25	0.25	1	0.50
CMP -0x2e8(%RBP),%R10	1	0.25	0.25	0.50	0.50	0.25	0.25	1	0.50
JLE 7f4dd4 <compute_eloc_logpsi_IP_task4_+0x14f>	1	0.50	0	0	0	0	0.50	0	0.50-1

Function	compute_eloc_logpsi_IP_task4
Source file and lines
Module	turborvb-mpi.x

Warnings:
Non-innermost loop: analyzing only self part (ignoring child loops).
This loop has 3 execution paths.

Vectorization

Your loop is not vectorized. Only 10% of vector register length is used (average across all SSE/AVX instructions). By vectorizing your loop, you can lower the cost of an iteration from 2.50 to 0.18 cycles (13.54x speedup).

Execution units bottlenecks

Found no such bottlenecks but see expert reports for more complex bottlenecks.

Slow data structures access

Detected data structures (typically arrays) that cannot be efficiently read/written

Details

Constant unknown stride: 1 occurrence(s)

Non-unit stride (non-contiguous) accesses do not use data caches efficiently.

Workaround

- Try to reorganize arrays of structures into structures of arrays
- Consider permuting loops (see vectorization gain report)

Type of elements and instruction set

No instructions are processing arithmetic or math operations on FP elements. This loop is probably writing/copying data or processing integer elements.

Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetical operations. The binary loop is loading 24 bytes.

General properties

nb instructions	11
nb uops	10
loop length	44
used x86 registers	9
used mmx registers	0
used xmm registers	0
used ymm registers	0
used zmm registers	0
nb stack references	3

Front-end

ASSUMED MACRO FUSION FIT IN UOP CACHE

micro-operation queue	2.50 cycles
front end	2.50 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	1.75	1.75	1.50	1.50	0.00	1.75	1.75	0.00
cycles	1.75	1.75	1.50	1.50	0.00	1.75	1.75	0.00

Execution ports to units layout:

P0 (256 bits): VPU, ALU, DIV/SQRT
P1 (256 bits): ALU, VPU
P2 (512 bits): store address, load
P3 (512 bits): store address, load
P4 (512 bits): store data
P5 (512 bits): ALU, VPU
P6: ALU
P7: store address

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00

Front-end and detailed OoO resources (UFS)

FE+BE cycles	2.59
Stall cycles	0.00

Cycles summary

Front-end	2.50
Dispatch	1.75
Data deps.	1.00
Overall L1	2.50

Vectorization ratios

all	0%
load	NA (no load vectorizable/vectorized instructions)
store	NA (no store vectorizable/vectorized instructions)
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	0%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	0%

Vector efficiency ratios

all	10%
load	NA (no load vectorizable/vectorized instructions)
store	NA (no store vectorizable/vectorized instructions)
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	12%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	10%

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 2.50 cycles. At this rate:

7% of peak load performance is reached (9.60 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))

Front-end bottlenecks

Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck). By removing all these bottlenecks, you can lower the cost of an iteration from 2.50 to 1.75 cycles (1.43x speedup).

ASM code

In the binary file, the address of the loop is: 7f4dd4

Instruction	Nb FU	P0	P1	P2	P3	P5	P6	Latency	Recip. throughput
XOR %R8D,%R8D	1	0	0	0	0	0	0	0	0.25
INC %R8	1	0.25	0.25	0	0	0.25	0.25	1	0.25
MOV %RSI,%RDI	1	0	0	0	0	0	0	0	0.25
TEST %RDX,%RDX	1	0.25	0.25	0	0	0.25	0.25	1	0.25
JLE 7f4e15 <compute_eloc_logpsi_IP_task4_+0x190>	1	0.50	0	0	0	0	0.50	0	0.50-1
MOV -0x2f8(%RBP),%R11	1	0	0	0.50	0.50	0	0	4-5	0.50
LEA (%R11,%R9,1),%RCX	1	0	0.50	0	0	0.50	0	1	0.50
INC %R10	1	0.25	0.25	0	0	0.25	0.25	1	0.25
ADD -0xa8(%RBP),%R9	1	0.25	0.25	0.50	0.50	0.25	0.25	1	0.50
CMP -0x2e8(%RBP),%R10	1	0.25	0.25	0.50	0.50	0.25	0.25	1	0.50
JLE 7f4dd4 <compute_eloc_logpsi_IP_task4_+0x14f>	1	0.50	0	0	0	0	0.50	0	0.50-1
