OV - Loop 10501 - turborvb-mpi.x

0x800cb4 VMOVSD	0x8(%R11,%R10,1),%XMM0    [1]

0x800cbb VMULSD	%XMM0,%XMM0,%XMM3

0x800cbf VMOVSD	(%R11,%R10,1),%XMM1    [1]

0x800cc5 VFMADD231SD	%XMM1,%XMM1,%XMM3

0x800cca VMOVSD	0x10(%R11,%R10,1),%XMM2    [1]

0x800cd1 VFMADD231SD	%XMM2,%XMM2,%XMM3

0x800cd6 ADD	$0x18,%R11

0x800cda VSQRTSD	%XMM3,%XMM3,%XMM3

0x800cde VMAXSD	%XMM4,%XMM3,%XMM3

0x800ce2 VMOVSD	%XMM3,(%R8,%RDI,8)    [2]

0x800ce8 INC	%RDI

0x800ceb CMP	%R14,%RDI

0x800cee JLE	800cb4

/home/eoseret/TREX/turborvb/src/c_adjoint_forward/upwf.f90: 249 - 250

--------------------------------------------------------------------------------

249:                         r(i, k) = max(dsqrt(rmu(1, i, k)**2 + rmu(2, i, k)**2 + rmu(3, i, k)**2), mindist)

250:                     end do
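As a reading aid, the statement on source line 249 can be modelled in plain Python. This is only a sketch of the arithmetic, not the TurboRVB code: the function name `clamp_distances` and the flat list `rmu_flat` are assumptions introduced for illustration. The flat list mirrors what the assembly does, reading three consecutive doubles per iteration (`ADD $0x18,%R11` advances the read pointer by 24 bytes) and storing one double.

```python
import math


def clamp_distances(rmu_flat, mindist):
    """Return max(sqrt(x^2 + y^2 + z^2), mindist) for each x, y, z triple.

    rmu_flat is a hypothetical flat list holding the triples back to back,
    the way rmu(1:3, i, k) is laid out in column-major order.
    """
    n = len(rmu_flat) // 3
    r = [0.0] * n
    for i in range(n):
        x, y, z = rmu_flat[3 * i: 3 * i + 3]      # the three scalar loads
        norm = math.sqrt(x * x + y * y + z * z)   # multiply, two FMAs, square root
        r[i] = max(norm, mindist)                 # VMAXSD, then the scalar store
    return r


print(clamp_distances([3.0, 4.0, 0.0, 0.0, 0.0, 0.0], 1e-3))
```

The second triple is all zeros, so its result is clamped to `mindist`.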

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	2.00
CQA speedup if fully vectorized	2.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.50 - 2.00
Bottlenecks	P0
Function	upnewwf
Source	upwf.f90:249-250
Source loop unroll info	not unrolled or unrolled with no peel/tail loop
Source loop unroll confidence level	max
Unroll/vectorization loop type	NA
Unroll factor	NA
CQA cycles	4.50 - 6.00
CQA cycles if no scalar integer	4.50 - 6.00
CQA cycles if FP arith vectorized	2.25 - 3.00
CQA cycles if fully vectorized	2.25 - 3.00
Front-end cycles	3.00
P0 cycles	2.50
P1 cycles	2.50
P2 cycles	1.50
P3 cycles	1.50
P4 cycles	1.00
P5 cycles	1.50
P6 cycles	1.50
P7 cycles	1.00
DIV/SQRT cycles	4.50 - 6.00
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	5.37 - 6.42
Stall cycles (UFS)	1.92 - 2.96
Nb insns	13.00
Nb uops	12.00
Nb loads	3.00
Nb stores	1.00
Nb stack references	0.00
FLOP/cycle	1.00 - 1.33
Nb FLOP add-sub	0.00
Nb FLOP mul	1.00
Nb FLOP fma	2.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	1.00
Nb FLOP rsqrt	0.00
Bytes/cycle	5.33 - 7.11
Bytes prefetched	0.00
Bytes loaded	24.00
Bytes stored	8.00
Stride 0	0.00
Stride 1	2.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	0.00
Vectorization ratio all	0.00
Vectorization ratio load	0.00
Vectorization ratio store	0.00
Vectorization ratio mul	0.00
Vectorization ratio add_sub	NA
Vectorization ratio fma	0.00
Vectorization ratio div_sqrt	0.00
Vectorization ratio other	0.00
Vector-efficiency ratio all	12.50
Vector-efficiency ratio load	12.50
Vector-efficiency ratio store	12.50
Vector-efficiency ratio mul	12.50
Vector-efficiency ratio add_sub	NA
Vector-efficiency ratio fma	12.50
Vector-efficiency ratio div_sqrt	12.50
Vector-efficiency ratio other	12.50
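The FLOP/cycle and Bytes/cycle rows can be re-derived from the counts in the same table. The short check below is a sketch under one assumption, namely that each FMA is counted as two floating-point operations; the variable names are illustrative.

```python
# Counts and cycle bounds copied from the metric table.
flop_mul, flop_fma, flop_sqrt = 1, 2, 1
bytes_loaded, bytes_stored = 24, 8
cycles_low, cycles_high = 4.50, 6.00

# Assumption: one FMA counts as two FLOP (one multiply plus one add).
total_flop = flop_mul + 2 * flop_fma + flop_sqrt
total_bytes = bytes_loaded + bytes_stored

# The slow bound (6.00 cycles) gives the low rate, the fast bound the high rate.
flop_per_cycle = (total_flop / cycles_high, total_flop / cycles_low)
bytes_per_cycle = (total_bytes / cycles_high, total_bytes / cycles_low)

print(total_flop, [round(v, 2) for v in flop_per_cycle])
print(total_bytes, [round(v, 2) for v in bytes_per_cycle])
```

With these inputs the rates come out at 1.00 to 1.33 FLOP per cycle and 5.33 to 7.11 bytes per cycle, matching the table.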


Function	upnewwf
Source file and lines	upwf.f90:249-250
Module	turborvb-mpi.x

The loop is defined in /home/eoseret/TREX/turborvb/src/c_adjoint_forward/upwf.f90:249-250.

The related source loop is not unrolled or unrolled with no peel/tail loop.

Vectorization

Your loop is not vectorized. 8 data elements could be processed at once in vector registers.

By vectorizing your loop, you can lower the cost of an iteration from 6.00 to 3.00 cycles (2.00x speedup).

Details

All SSE/AVX instructions are used in scalar version (process only one data element in vector registers). Since your execution units are vector units, only a vectorized loop can use their full power.

Workaround

Try another compiler or update/tune your current one:
- Use the vec-report option to understand why your loop was not vectorized. If the report says "existence of vector dependences", try the IVDEP directive. If, with IVDEP, it says "vectorization possible but seems inefficient", try the VECTOR ALWAYS directive.
Remove inter-iteration dependences from your loop and make it unit-stride:
- If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, if not, permute the loops accordingly. Fortran storage order is column-major, so the innermost loop should run over the first index: do i; do j; a(i,j) = b(i,j) (slow, non stride 1) => do j; do i; a(i,j) = b(i,j) (fast, stride 1)
- If your loop streams arrays of structures (AoS), try to use structures of arrays (SoA) instead: do i; a(i)%x = b(i)%x (slow, non stride 1) => do i; a%x(i) = b%x(i) (fast, stride 1)
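The AoS to SoA advice is about data layout, which can be pictured with a few lines of Python. This is a layout illustration only: Python lists say nothing about vectorization or speed, and the helper name `aos_to_soa` is hypothetical. In this loop the interleaved form corresponds to reading x, y, z of one point from adjacent addresses, while the split form keeps each coordinate in its own contiguous array.

```python
def aos_to_soa(interleaved, width=3):
    """Split [x0, y0, z0, x1, y1, z1, ...] into one contiguous list per component."""
    if len(interleaved) % width != 0:
        raise ValueError("length must be a multiple of width")
    # Component c sits at positions c, c + width, c + 2 * width, ...
    return [interleaved[c::width] for c in range(width)]


xs, ys, zs = aos_to_soa([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(xs, ys, zs)
```

After the split, consecutive iterations read consecutive elements of `xs`, `ys` and `zs`, which is the unit-stride pattern a vectorizing compiler looks for.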

Execution units bottlenecks

Performance is limited by execution of divide and square root operations (the divide/square root unit is a bottleneck). By removing all these bottlenecks, you can lower the cost of an iteration from 6.00 to 3.00 cycles (2.00x speedup).

Workaround

Reduce the number of division or square root instructions:
- If the denominator is constant over iterations, use its reciprocal (replace x/y with x*(1/y)) and check the precision impact. Your compiler does this automatically with no-prec-div or Ofast.
Check whether you really need double precision. If not, switch to single precision to speed up execution.
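The reciprocal rewrite is generic advice; the loop analysed here contains a square root and no division, so the sketch below only illustrates what "check precision impact" means in practice. The function names and the test values are made up for the example.

```python
import math


def scale_by_division(values, y):
    """Reference version: one division per element."""
    return [x / y for x in values]


def scale_by_reciprocal(values, y):
    """Rewritten version: one division in total, then multiplications."""
    inv_y = 1.0 / y  # hoisted out of the loop because y is loop-invariant
    return [x * inv_y for x in values]


values = [0.1 * k for k in range(1, 1001)]
exact = scale_by_division(values, 3.0)
fast = scale_by_reciprocal(values, 3.0)

# The two versions round differently, so compare them with a relative tolerance.
worst = max(abs(a - b) / abs(a) for a, b in zip(exact, fast))
print("largest relative difference:", worst)
```

The results may differ in the last bits because the reciprocal version rounds twice (once for `1.0 / y`, once for the product). Whether that matters depends on the application.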

Expensive FP math instructions/calls

Detected performance impact from expensive FP math instructions/calls. By removing/reexpressing them, you can lower the cost of an iteration from 6.00 to 3.00 cycles (2.00x speedup).

FMA

Detected 2 FMA (fused multiply-add) operations.

Type of elements and instruction set

5 SSE or AVX instructions are processing arithmetic or math operations on double precision FP elements in scalar mode (one at a time).

Matching between your loop (in the source code) and the binary loop

The binary loop is composed of 6 FP arithmetical operations:

2: addition or subtraction (all inside FMA instructions)
3: multiply (2 inside FMA instructions)
1: square root

The binary loop is loading 24 bytes (3 double precision FP elements). The binary loop is storing 8 bytes (1 double precision FP element).

Arithmetic intensity

Arithmetic intensity is 0.19 FP operations per loaded or stored byte.
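The figure follows directly from the operation and byte counts listed in the previous section. A minimal check, with an illustrative function name:

```python
def arithmetic_intensity(flop, bytes_loaded, bytes_stored):
    """FP operations per byte loaded or stored."""
    return flop / (bytes_loaded + bytes_stored)


# 2 add-sub + 3 multiply + 1 square root, against 24 bytes loaded and 8 bytes stored.
intensity = arithmetic_intensity(2 + 3 + 1, 24, 8)
print(intensity)
```

The exact value is 0.1875, which the report displays as 0.19.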

General properties

nb instructions	13
nb uops	12
loop length	60
used x86 registers	5
used mmx registers	0
used xmm registers	5
used ymm registers	0
used zmm registers	0
nb stack references	0
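The loop length is a size in bytes and can be cross-checked against the instruction addresses in the listing. The sketch assumes the closing `JLE` uses the 2-byte short encoding (opcode plus 8-bit displacement), which fits because the branch target is less than 128 bytes back.

```python
first_insn = 0x800cb4   # address of the first VMOVSD in the loop
last_insn = 0x800cee    # address of the closing JLE
jle_short_size = 2      # assumption: short conditional jump, opcode + rel8

loop_length = last_insn + jle_short_size - first_insn
print(loop_length)
```

The result is 60 bytes, the value reported as loop length.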

Front-end

ASSUMED MACRO FUSION
FIT IN UOP CACHE

micro-operation queue	3.00 cycles
front end	3.00 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	2.50	2.50	1.50	1.50	1.00	1.50	1.50	1.00
cycles	2.50	2.50	1.50	1.50	1.00	1.50	1.50	1.00

Execution ports to units layout:

P0 (256 bits): VPU, ALU, DIV/SQRT
P1 (256 bits): ALU, VPU
P2 (512 bits): store address, load
P3 (512 bits): store address, load
P4 (512 bits): store data
P5 (512 bits): ALU, VPU
P6: ALU
P7: store address

Cycles executing div or sqrt instructions	4.50-6.00
Longest recurrence chain latency (RecMII)	1.00

Front-end and detailed OoO resources (UFS)

FE+BE cycles	5.37-6.42
Stall cycles	1.92-2.96
PRF full (events)	1.92-3.95

Cycles summary

Front-end	3.00
Dispatch	2.50
DIV/SQRT	4.50-6.00
Data deps.	1.00
Overall L1	4.50-6.00
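The numbers in this summary are consistent with reading the overall cost as the largest of the individual components, and the "speedup if next bottleneck killed" metric as the ratio between that cost and the largest remaining component. The sketch below reproduces both under that reading; it is an interpretation of the table, not a description of CQA internals.

```python
# (low, high) cycle bounds per component, copied from the cycles summary.
components = {
    "front-end": (3.00, 3.00),
    "dispatch": (2.50, 2.50),
    "div/sqrt": (4.50, 6.00),
    "data deps": (1.00, 1.00),
}


def bound(parts):
    """Largest low bound and largest high bound over all components."""
    parts = list(parts)
    return (max(lo for lo, _ in parts), max(hi for _, hi in parts))


overall = bound(components.values())
without_div_sqrt = bound(v for k, v in components.items() if k != "div/sqrt")
speedup = (overall[0] / without_div_sqrt[0], overall[1] / without_div_sqrt[1])

print(overall, without_div_sqrt, speedup)
```

With the divide/square root unit out of the picture, the front end at 3.00 cycles becomes the limit, giving a speedup between 1.5x and 2.0x.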

Vectorization ratios

all	0%
load	0%
store	0%
mul	0%
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	0%
div/sqrt	0%
other	0%

Vector efficiency ratios

all	12%
load	12%
store	12%
mul	12%
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	12%
div/sqrt	12%
other	12%
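The efficiency ratio expresses how much of a vector register each instruction actually uses. The sketch assumes 512-bit registers, which is consistent with the earlier statement that 8 double precision elements could be processed at once; the function name is illustrative.

```python
def vector_efficiency(element_bits, register_bits, elements_used=1):
    """Percentage of the vector register that holds useful data."""
    return 100.0 * elements_used * element_bits / register_bits


scalar_double = vector_efficiency(64, 512)                   # this loop: one double per register
full_vector = vector_efficiency(64, 512, elements_used=8)    # a fully vectorized loop

print(scalar_double, full_vector)
```

One 64-bit element in a 512-bit register gives 12.5%, shown as 12% in the list above.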

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 6.00 cycles. At this rate:

3% of peak load performance is reached (4.00 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))
2% of peak store performance is reached (1.33 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))
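Both percentages come from dividing the bytes moved per iteration by the iteration cost and comparing with the peak rates quoted in the same lines. A short check with illustrative variable names:

```python
cycles_per_iteration = 6.00
bytes_loaded, bytes_stored = 24, 8
peak_load, peak_store = 128.00, 64.00   # bytes per cycle, as quoted in the report

loaded_per_cycle = bytes_loaded / cycles_per_iteration
stored_per_cycle = bytes_stored / cycles_per_iteration

load_pct = 100 * loaded_per_cycle / peak_load
store_pct = 100 * stored_per_cycle / peak_store

print(round(loaded_per_cycle, 2), round(load_pct, 1))
print(round(stored_per_cycle, 2), round(store_pct, 1))
```

The loop moves 4.00 bytes per cycle on the load side and about 1.33 on the store side, roughly 3% and 2% of peak.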

Front-end bottlenecks

Found no such bottlenecks.

ASM code

In the binary file, the address of the loop is: 800cb4

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	Latency	Recip. throughput
VMOVSD 0x8(%R11,%R10,1),%XMM0	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50
VMULSD %XMM0,%XMM0,%XMM3	1	0.50	0.50	0	0	0	0	0	0	4	0.50
VMOVSD (%R11,%R10,1),%XMM1	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50
VFMADD231SD %XMM1,%XMM1,%XMM3	1	0.50	0.50	0	0	0	0	0	0	4	0.50
VMOVSD 0x10(%R11,%R10,1),%XMM2	1	0	0	0.50	0.50	0	0	0	0	4-5	0.50
VFMADD231SD %XMM2,%XMM2,%XMM3	1	0.50	0.50	0	0	0	0	0	0	4	0.50
ADD $0x18,%R11	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25
VSQRTSD %XMM3,%XMM3,%XMM3	1	1	0	0	0	0	0	0	0	13-19	4.50-6
VMAXSD %XMM4,%XMM3,%XMM3	1	0.50	0.50	0	0	0	0	0	0	4	0.50
VMOVSD %XMM3,(%R8,%RDI,8)	1	0	0	0.33	0.33	1	0	0	0.33	3	1
INC %RDI	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25
CMP %R14,%RDI	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25
JLE 800cb4 <upnewwf_+0x11fc>	1	0.50	0	0	0	0	0	0.50	0	0	0.50-1
