| Loop Id: 184 | Module: exec | Source: advec_mom_kernel.f90:81-196 [...] | Coverage: 0.01% |
|---|
| Loop Id: 184 | Module: exec | Source: advec_mom_kernel.f90:81-196 [...] | Coverage: 0.01% |
|---|
0x433ec0 SUB 0x2c0(%RSP),%R13 |
0x433ec8 MOV %R11,%R15 |
0x433ecb IMUL %R13,%R15 |
0x433ecf LEA 0x1(%R13),%RDX |
0x433ed3 IMUL %R11,%RDX |
0x433ed7 IMUL %R12,%R13 |
0x433edb XOR %ECX,%ECX |
0x433edd VPBROADCASTQ %RCX,%ZMM7 |
0x433ee3 VPSUBQ %ZMM7,%ZMM0,%ZMM7 |
0x433ee9 VPCMPNLEUQ %ZMM1,%ZMM7,%K1 |
0x433ef0 MOV 0x118(%RSP),%R12 |
0x433ef8 ADD %R12,%R15 |
0x433efb ADD 0x18(%RSP),%RCX |
0x433f00 MOV %RCX,%RDI |
0x433f03 SUB %R9,%RDI |
0x433f06 VMOVUPD (%R15,%RDI,8),%ZMM7{%K1}{z} |
0x433f0d VMOVAPD %ZMM7,%ZMM6{%K1} |
0x433f13 MOV %R9,%R11 |
0x433f16 NOT %R11 |
0x433f19 ADD %RCX,%R11 |
0x433f1c VMOVUPD (%R15,%R11,8),%ZMM7{%K1}{z} |
0x433f23 VMOVAPD %ZMM7,%ZMM5{%K1} |
0x433f29 VADDPD %ZMM5,%ZMM6,%ZMM7 |
0x433f2f ADD %R12,%RDX |
0x433f32 VMOVUPD (%RDX,%R11,8),%ZMM8{%K1}{z} |
0x433f39 VMOVAPD %ZMM8,%ZMM4{%K1} |
0x433f3f VMOVUPD (%RDX,%RDI,8),%ZMM8{%K1}{z} |
0x433f46 VMOVAPD %ZMM8,%ZMM3{%K1} |
0x433f4c VADDPD %ZMM3,%ZMM4,%ZMM8 |
0x433f52 VADDPD %ZMM8,%ZMM7,%ZMM7 |
0x433f58 VMULPD %ZMM2,%ZMM7,%ZMM7 |
0x433f5e ADD 0x58(%RSP),%R13 |
0x433f63 VMOVUPD %ZMM7,(%R13,%RDI,8){%K1} |
0x433f6b JMP 433f70 |
0x433f70 LEA 0x1(%R14),%ECX |
0x433f74 INC %EBX |
0x433f76 CMP %EAX,%R14D |
0x433f79 MOV %ECX,%R14D |
0x433f7c JE 4322ab |
0x433f82 TEST %ESI,%ESI |
0x433f84 JS 433f70 |
0x433f86 MOV 0x80(%RSP),%RCX |
0x433f8e LEA -0x2(%RCX,%R14,1),%ECX |
0x433f93 MOV 0x128(%RSP),%RDX |
0x433f9b MOV (%RDX),%R11 |
0x433f9e MOV 0x98(%RSP),%RDX |
0x433fa6 MOV (%RDX),%R12 |
0x433fa9 MOVSXD %ECX,%R13 |
0x433fac TEST %R8,%R8 |
0x433faf JE 433ec0 |
0x433fb5 MOVSXD %EBX,%RDI |
0x433fb8 MOV 0x28(%RSP),%RCX |
0x433fbd ADD %RDI,%RCX |
0x433fc0 ADD 0x20(%RSP),%RDI |
0x433fc5 SUB 0x2c0(%RSP),%R13 |
0x433fcd MOV %R11,%R15 |
0x433fd0 IMUL %R13,%R15 |
0x433fd4 LEA 0x1(%R13),%RDX |
0x433fd8 IMUL %R11,%RDX |
0x433fdc IMUL %R12,%R13 |
0x433fe0 IMUL %R11,%RCX |
0x433fe4 ADD %R10,%RCX |
0x433fe7 IMUL %RDI,%R11 |
0x433feb ADD %R10,%R11 |
0x433fee IMUL %R12,%RDI |
0x433ff2 ADD 0x50(%RSP),%RDI |
0x433ff7 XOR %R12D,%R12D |
0x433ffa NOPW (%RAX,%RAX,1) |
(185) 0x434000 VMOVUPD (%R11,%R12,8),%ZMM7 |
(185) 0x434007 VADDPD -0x8(%R11,%R12,8),%ZMM7,%ZMM7 |
(185) 0x434012 VADDPD -0x8(%RCX,%R12,8),%ZMM7,%ZMM7 |
(185) 0x43401d VADDPD (%RCX,%R12,8),%ZMM7,%ZMM7 |
(185) 0x434024 VMULPD %ZMM2,%ZMM7,%ZMM7 |
(185) 0x43402a VMOVUPD %ZMM7,(%RDI,%R12,8) |
(185) 0x434031 ADD $0x8,%R12 |
(185) 0x434035 CMP %R8,%R12 |
(185) 0x434038 JB 434000 |
0x43403a MOV %R8,%RCX |
0x43403d CMP 0x1e0(%RSP),%R8 |
0x434045 JNE 433edd |
0x43404b JMP 433f70 |
/scratch_na/users/xoserete/qaas_runs/171-415-7919/intel/CloverLeafFC/build/CloverLeafFC/CloverLeaf_ref/kernels/advec_mom_kernel.f90: 81 - 196 |
-------------------------------------------------------------------------------- |
81: IF(mom_sweep.EQ.1)THEN ! x 1 |
[...] |
191: DO k=y_min-2,y_max+2 |
192: !$OMP SIMD |
193: DO j=x_min,x_max+1 |
194: ! Find staggered mesh mass fluxes and nodal masses and volumes. |
195: node_flux(j,k)=0.25_8*(mass_flux_y(j-1,k )+mass_flux_y(j ,k ) & |
196: +mass_flux_y(j-1,k+1)+mass_flux_y(j ,k+1)) |
| Coverage (%) | Name | Source Location | Module |
|---|---|---|---|
| ►100.00+ | __kmp_invoke_microtask | libiomp5.so | |
| ○ | __kmp_invoke_task_func | libiomp5.so |
| Path / |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 1.71 |
| CQA speedup if FP arith vectorized | 1.00 |
| CQA speedup if fully vectorized | 1.63 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.24 |
| Bottlenecks | |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source | advec_mom_kernel.f90:81-81,advec_mom_kernel.f90:191-196 |
| Source loop unroll info | NA |
| Source loop unroll confidence level | NA |
| Unroll/vectorization loop type | NA |
| Unroll factor | NA |
| CQA cycles | 7.58 |
| CQA cycles if no scalar integer | 4.43 |
| CQA cycles if FP arith vectorized | 7.58 |
| CQA cycles if fully vectorized | 4.65 |
| Front-end cycles | 6.42 |
| DIV/SQRT cycles | 4.53 |
| P0 cycles | 7.27 |
| P1 cycles | 3.33 |
| P2 cycles | 3.33 |
| P3 cycles | 0.25 |
| P4 cycles | 4.55 |
| P5 cycles | 4.47 |
| P6 cycles | 0.25 |
| P7 cycles | 0.25 |
| P8 cycles | 0.25 |
| P9 cycles | 4.55 |
| P10 cycles | 3.33 |
| P11 cycles | 0.00 |
| Inter-iter dependencies cycles | 0 |
| FE+BE cycles (UFS) | 6.80 - 6.80 |
| Stall cycles (UFS) | 0.08 |
| Nb insns | 39.75 |
| Nb uops | 38.50 |
| Nb loads | 10.00 |
| Nb stores | 0.50 |
| Nb stack references | 6.50 |
| FLOP/cycle | 2.11 |
| Nb FLOP add-sub | 12.00 |
| Nb FLOP mul | 4.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 23.65 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 192.00 |
| Bytes stored | 32.00 |
| Stride 0 | 0.75 |
| Stride 1 | 0.00 |
| Stride n | 0.00 |
| Stride unknown | 4.75 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 34.16 |
| Vectorization ratio load | 44.44 |
| Vectorization ratio store | 100.00 |
| Vectorization ratio mul | 66.67 |
| Vectorization ratio add_sub | 40.00 |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 28.13 |
| Vector-efficiency ratio all | 39.92 |
| Vector-efficiency ratio load | 51.39 |
| Vector-efficiency ratio store | 100.00 |
| Vector-efficiency ratio mul | 70.83 |
| Vector-efficiency ratio add_sub | 44.92 |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 33.91 |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 1.00 |
| CQA speedup if FP arith vectorized | 1.00 |
| CQA speedup if fully vectorized | 16.00 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.00 |
| Bottlenecks | micro-operation queue, P0, P1, P5, P6, P10, |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source | advec_mom_kernel.f90:81-81,advec_mom_kernel.f90:191-196 |
| Source loop unroll info | NA |
| Source loop unroll confidence level | NA |
| Unroll/vectorization loop type | NA |
| Unroll factor | NA |
| CQA cycles | 1.00 |
| CQA cycles if no scalar integer | 1.00 |
| CQA cycles if FP arith vectorized | 1.00 |
| CQA cycles if fully vectorized | 0.06 |
| Front-end cycles | 1.00 |
| DIV/SQRT cycles | 1.00 |
| P0 cycles | 1.00 |
| P1 cycles | 0.00 |
| P2 cycles | 0.00 |
| P3 cycles | 0.00 |
| P4 cycles | 1.00 |
| P5 cycles | 1.00 |
| P6 cycles | 0.00 |
| P7 cycles | 0.00 |
| P8 cycles | 0.00 |
| P9 cycles | 1.00 |
| P10 cycles | 0.00 |
| P11 cycles | 0.00 |
| Inter-iter dependencies cycles | 0 |
| FE+BE cycles (UFS) | 1.07 - 1.08 |
| Stall cycles (UFS) | 0.00 |
| Nb insns | 7.00 |
| Nb uops | 6.00 |
| Nb loads | 0.00 |
| Nb stores | 0.00 |
| Nb stack references | 0.00 |
| FLOP/cycle | 0.00 |
| Nb FLOP add-sub | 0.00 |
| Nb FLOP mul | 0.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 0.00 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 0.00 |
| Bytes stored | 0.00 |
| Stride 0 | 0.00 |
| Stride 1 | 0.00 |
| Stride n | 0.00 |
| Stride unknown | 0.00 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 0.00 |
| Vectorization ratio load | NA |
| Vectorization ratio store | NA |
| Vectorization ratio mul | NA |
| Vectorization ratio add_sub | 0.00 |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 0.00 |
| Vector-efficiency ratio all | 6.25 |
| Vector-efficiency ratio load | NA |
| Vector-efficiency ratio store | NA |
| Vector-efficiency ratio mul | NA |
| Vector-efficiency ratio add_sub | 6.25 |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 6.25 |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 2.53 |
| CQA speedup if FP arith vectorized | 1.00 |
| CQA speedup if fully vectorized | 1.12 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.19 |
| Bottlenecks | micro-operation queue, |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source | advec_mom_kernel.f90:81-81,advec_mom_kernel.f90:191-196 |
| Source loop unroll info | NA |
| Source loop unroll confidence level | NA |
| Unroll/vectorization loop type | NA |
| Unroll factor | NA |
| CQA cycles | 8.00 |
| CQA cycles if no scalar integer | 3.17 |
| CQA cycles if FP arith vectorized | 8.00 |
| CQA cycles if fully vectorized | 7.13 |
| Front-end cycles | 8.00 |
| DIV/SQRT cycles | 5.20 |
| P0 cycles | 6.73 |
| P1 cycles | 4.33 |
| P2 cycles | 4.33 |
| P3 cycles | 0.50 |
| P4 cycles | 5.20 |
| P5 cycles | 5.20 |
| P6 cycles | 0.50 |
| P7 cycles | 0.50 |
| P8 cycles | 0.50 |
| P9 cycles | 5.20 |
| P10 cycles | 4.33 |
| P11 cycles | 0.00 |
| Inter-iter dependencies cycles | 0 |
| FE+BE cycles (UFS) | 8.32 |
| Stall cycles (UFS) | 0.00 |
| Nb insns | 50.00 |
| Nb uops | 48.00 |
| Nb loads | 13.00 |
| Nb stores | 1.00 |
| Nb stack references | 7.00 |
| FLOP/cycle | 4.00 |
| Nb FLOP add-sub | 24.00 |
| Nb FLOP mul | 8.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 49.00 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 328.00 |
| Bytes stored | 64.00 |
| Stride 0 | 1.00 |
| Stride 1 | 0.00 |
| Stride n | 0.00 |
| Stride unknown | 6.00 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 71.43 |
| Vectorization ratio load | 66.67 |
| Vectorization ratio store | 100.00 |
| Vectorization ratio mul | 100.00 |
| Vectorization ratio add_sub | 80.00 |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 62.50 |
| Vector-efficiency ratio all | 74.11 |
| Vector-efficiency ratio load | 70.83 |
| Vector-efficiency ratio store | 100.00 |
| Vector-efficiency ratio mul | 100.00 |
| Vector-efficiency ratio add_sub | 81.25 |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 65.63 |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 1.00 |
| CQA speedup if FP arith vectorized | 1.00 |
| CQA speedup if fully vectorized | 11.62 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.59 |
| Bottlenecks | P1, |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source | advec_mom_kernel.f90:81-81,advec_mom_kernel.f90:191-196 |
| Source loop unroll info | NA |
| Source loop unroll confidence level | NA |
| Unroll/vectorization loop type | NA |
| Unroll factor | NA |
| CQA cycles | 10.07 |
| CQA cycles if no scalar integer | 10.07 |
| CQA cycles if FP arith vectorized | 10.07 |
| CQA cycles if fully vectorized | 0.87 |
| Front-end cycles | 6.33 |
| DIV/SQRT cycles | 4.40 |
| P0 cycles | 10.07 |
| P1 cycles | 3.33 |
| P2 cycles | 3.33 |
| P3 cycles | 0.00 |
| P4 cycles | 4.60 |
| P5 cycles | 4.40 |
| P6 cycles | 0.00 |
| P7 cycles | 0.00 |
| P8 cycles | 0.00 |
| P9 cycles | 4.60 |
| P10 cycles | 3.33 |
| P11 cycles | 0.00 |
| Inter-iter dependencies cycles | 0 |
| FE+BE cycles (UFS) | 7.17 |
| Stall cycles (UFS) | 0.33 |
| Nb insns | 38.00 |
| Nb uops | 38.00 |
| Nb loads | 10.00 |
| Nb stores | 0.00 |
| Nb stack references | 8.00 |
| FLOP/cycle | 0.00 |
| Nb FLOP add-sub | 0.00 |
| Nb FLOP mul | 0.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 7.95 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 80.00 |
| Bytes stored | 0.00 |
| Stride 0 | 1.00 |
| Stride 1 | 0.00 |
| Stride n | 0.00 |
| Stride unknown | 4.00 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 0.00 |
| Vectorization ratio load | 0.00 |
| Vectorization ratio store | NA |
| Vectorization ratio mul | 0.00 |
| Vectorization ratio add_sub | 0.00 |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 0.00 |
| Vector-efficiency ratio all | 10.83 |
| Vector-efficiency ratio load | 12.50 |
| Vector-efficiency ratio store | NA |
| Vector-efficiency ratio mul | 12.50 |
| Vector-efficiency ratio add_sub | 10.94 |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 9.38 |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 3.22 |
| CQA speedup if FP arith vectorized | 1.00 |
| CQA speedup if fully vectorized | 1.07 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.09 |
| Bottlenecks | P1, |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source | advec_mom_kernel.f90:81-81,advec_mom_kernel.f90:191-196 |
| Source loop unroll info | NA |
| Source loop unroll confidence level | NA |
| Unroll/vectorization loop type | NA |
| Unroll factor | NA |
| CQA cycles | 11.27 |
| CQA cycles if no scalar integer | 3.50 |
| CQA cycles if FP arith vectorized | 11.27 |
| CQA cycles if fully vectorized | 10.53 |
| Front-end cycles | 10.33 |
| DIV/SQRT cycles | 7.50 |
| P0 cycles | 11.27 |
| P1 cycles | 5.67 |
| P2 cycles | 5.67 |
| P3 cycles | 0.50 |
| P4 cycles | 7.40 |
| P5 cycles | 7.30 |
| P6 cycles | 0.50 |
| P7 cycles | 0.50 |
| P8 cycles | 0.50 |
| P9 cycles | 7.40 |
| P10 cycles | 5.67 |
| P11 cycles | 0.00 |
| Inter-iter dependencies cycles | 0 |
| FE+BE cycles (UFS) | 10.63 |
| Stall cycles (UFS) | 0.00 |
| Nb insns | 64.00 |
| Nb uops | 62.00 |
| Nb loads | 17.00 |
| Nb stores | 1.00 |
| Nb stack references | 11.00 |
| FLOP/cycle | 2.84 |
| Nb FLOP add-sub | 24.00 |
| Nb FLOP mul | 8.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 37.63 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 360.00 |
| Bytes stored | 64.00 |
| Stride 0 | 1.00 |
| Stride 1 | 0.00 |
| Stride n | 0.00 |
| Stride unknown | 9.00 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 65.22 |
| Vectorization ratio load | 66.67 |
| Vectorization ratio store | 100.00 |
| Vectorization ratio mul | 100.00 |
| Vectorization ratio add_sub | 80.00 |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 50.00 |
| Vector-efficiency ratio all | 68.48 |
| Vector-efficiency ratio load | 70.83 |
| Vector-efficiency ratio store | 100.00 |
| Vector-efficiency ratio mul | 100.00 |
| Vector-efficiency ratio add_sub | 81.25 |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 54.38 |
| Path / |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source file and lines | advec_mom_kernel.f90:81-196 |
| Module | exec |
| nb instructions | 39.75 |
| nb uops | 38.50 |
| loop length | 183.75 |
| used x86 registers | 12 |
| used mmx registers | 0 |
| used xmm registers | 0 |
| used ymm registers | 0 |
| used zmm registers | 4.50 |
| nb stack references | 6.50 |
| micro-operation queue | 6.42 cycles |
| front end | 6.42 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 4.53 | 5.15 | 3.33 | 3.33 | 0.25 | 4.55 | 4.48 | 0.25 | 0.25 | 0.25 | 4.55 | 3.33 |
| cycles | 4.53 | 7.27 | 3.33 | 3.33 | 0.25 | 4.55 | 4.48 | 0.25 | 0.25 | 0.25 | 4.55 | 3.33 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 0.00 |
| FE+BE cycles | 6.80-6.80 |
| Stall cycles | 0.08 |
| PRF_INT full (events) | 0.13 |
| Front-end | 6.42 |
| Dispatch | 7.27 |
| Data deps. | 0.00 |
| Overall L1 | 7.58 |
| all | 11% |
| load | 0% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | 0% |
| add-sub | 25% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 10% |
| all | 100% |
| load | 100% |
| store | 100% |
| mul | 100% |
| add-sub | 100% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 100% |
| all | 34% |
| load | 44% |
| store | 100% |
| mul | 66% |
| add-sub | 40% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 28% |
| all | 19% |
| load | 12% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | 12% |
| add-sub | 30% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 17% |
| all | 100% |
| load | 100% |
| store | 100% |
| mul | 100% |
| add-sub | 100% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 100% |
| all | 39% |
| load | 51% |
| store | 100% |
| mul | 70% |
| add-sub | 44% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 33% |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source file and lines | advec_mom_kernel.f90:81-196 |
| Module | exec |
| nb instructions | 7 |
| nb uops | 6 |
| loop length | 22 |
| used x86 registers | 5 |
| used mmx registers | 0 |
| used xmm registers | 0 |
| used ymm registers | 0 |
| used zmm registers | 0 |
| nb stack references | 0 |
| micro-operation queue | 1.00 cycles |
| front end | 1.00 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| cycles | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 0.00 |
| FE+BE cycles | 1.07-1.08 |
| Stall cycles | 0.00 |
| Front-end | 1.00 |
| Dispatch | 1.00 |
| Data deps. | 0.00 |
| Overall L1 | 1.00 |
| all | 0% |
| load | NA (no load vectorizable/vectorized instructions) |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | 0% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 0% |
| all | 6% |
| load | NA (no load vectorizable/vectorized instructions) |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | 6% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 6% |
| Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LEA 0x1(%R14),%ECX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
| INC %EBX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| CMP %EAX,%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| MOV %ECX,%R14D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| JE 4322ab <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x5cb> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| TEST %ESI,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JS 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source file and lines | advec_mom_kernel.f90:81-196 |
| Module | exec |
| nb instructions | 50 |
| nb uops | 48 |
| loop length | 242 |
| used x86 registers | 14 |
| used mmx registers | 0 |
| used xmm registers | 0 |
| used ymm registers | 0 |
| used zmm registers | 9 |
| nb stack references | 7 |
| ADD-SUB / MUL ratio | 3.00 |
| micro-operation queue | 8.00 cycles |
| front end | 8.00 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 5.20 | 5.20 | 4.33 | 4.33 | 0.50 | 5.20 | 5.20 | 0.50 | 0.50 | 0.50 | 5.20 | 4.33 |
| cycles | 5.20 | 6.73 | 4.33 | 4.33 | 0.50 | 5.20 | 5.20 | 0.50 | 0.50 | 0.50 | 5.20 | 4.33 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 0.00 |
| FE+BE cycles | 8.32 |
| Stall cycles | 0.00 |
| Front-end | 8.00 |
| Dispatch | 6.73 |
| Data deps. | 0.00 |
| Overall L1 | 8.00 |
| all | 25% |
| load | 0% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | 50% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 25% |
| all | 100% |
| load | 100% |
| store | 100% |
| mul | 100% |
| add-sub | 100% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 100% |
| all | 71% |
| load | 66% |
| store | 100% |
| mul | 100% |
| add-sub | 80% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 62% |
| all | 32% |
| load | 12% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | 53% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 31% |
| all | 100% |
| load | 100% |
| store | 100% |
| mul | 100% |
| add-sub | 100% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 100% |
| all | 74% |
| load | 70% |
| store | 100% |
| mul | 100% |
| add-sub | 81% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 65% |
| Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SUB 0x2c0(%RSP),%R13 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| MOV %R11,%R15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| IMUL %R13,%R15 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| LEA 0x1(%R13),%RDX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| IMUL %R11,%RDX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| IMUL %R12,%R13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| XOR %ECX,%ECX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| VPBROADCASTQ %RCX,%ZMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| VPSUBQ %ZMM7,%ZMM0,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
| VPCMPNLEUQ %ZMM1,%ZMM7,%K1 | |||||||||||||||
| MOV 0x118(%RSP),%R12 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| ADD %R12,%R15 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| ADD 0x18(%RSP),%RCX | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| MOV %RCX,%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| SUB %R9,%RDI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| VMOVUPD (%R15,%RDI,8),%ZMM7{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM7,%ZMM6{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| MOV %R9,%R11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| NOT %R11 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| ADD %RCX,%R11 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| VMOVUPD (%R15,%R11,8),%ZMM7{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM7,%ZMM5{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| VADDPD %ZMM5,%ZMM6,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
| ADD %R12,%RDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| VMOVUPD (%RDX,%R11,8),%ZMM8{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM8,%ZMM4{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| VMOVUPD (%RDX,%RDI,8),%ZMM8{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM8,%ZMM3{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| VADDPD %ZMM3,%ZMM4,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
| VADDPD %ZMM8,%ZMM7,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
| VMULPD %ZMM2,%ZMM7,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
| ADD 0x58(%RSP),%R13 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| VMOVUPD %ZMM7,(%R13,%RDI,8){%K1} | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 1 |
| JMP 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.84 |
| LEA 0x1(%R14),%ECX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
| INC %EBX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| CMP %EAX,%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| MOV %ECX,%R14D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| JE 4322ab <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x5cb> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| TEST %ESI,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JS 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| MOV 0x80(%RSP),%RCX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| LEA -0x2(%RCX,%R14,1),%ECX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| MOV 0x128(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV (%RDX),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV 0x98(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV (%RDX),%R12 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOVSXD %ECX,%R13 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
| TEST %R8,%R8 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JE 433ec0 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x21e0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source file and lines | advec_mom_kernel.f90:81-196 |
| Module | exec |
| nb instructions | 38 |
| nb uops | 38 |
| loop length | 166 |
| used x86 registers | 14 |
| used mmx registers | 0 |
| used xmm registers | 0 |
| used ymm registers | 0 |
| used zmm registers | 0 |
| nb stack references | 8 |
| micro-operation queue | 6.33 cycles |
| front end | 6.33 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 4.40 | 7.00 | 3.33 | 3.33 | 0.00 | 4.60 | 4.40 | 0.00 | 0.00 | 0.00 | 4.60 | 3.33 |
| cycles | 4.40 | 10.07 | 3.33 | 3.33 | 0.00 | 4.60 | 4.40 | 0.00 | 0.00 | 0.00 | 4.60 | 3.33 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 0.00 |
| FE+BE cycles | 7.17 |
| Stall cycles | 0.33 |
| PRF_INT full (events) | 0.50 |
| Front-end | 6.33 |
| Dispatch | 10.07 |
| Data deps. | 0.00 |
| Overall L1 | 10.07 |
| all | 0% |
| load | 0% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | 0% |
| add-sub | 0% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 0% |
| all | 10% |
| load | 12% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | 12% |
| add-sub | 10% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 9% |
| Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LEA 0x1(%R14),%ECX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
| INC %EBX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| CMP %EAX,%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| MOV %ECX,%R14D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| JE 4322ab <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x5cb> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| TEST %ESI,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JS 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| MOV 0x80(%RSP),%RCX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| LEA -0x2(%RCX,%R14,1),%ECX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| MOV 0x128(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV (%RDX),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV 0x98(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV (%RDX),%R12 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOVSXD %ECX,%R13 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
| TEST %R8,%R8 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JE 433ec0 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x21e0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| MOVSXD %EBX,%RDI | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
| MOV 0x28(%RSP),%RCX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| ADD %RDI,%RCX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| ADD 0x20(%RSP),%RDI | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| SUB 0x2c0(%RSP),%R13 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| MOV %R11,%R15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| IMUL %R13,%R15 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| LEA 0x1(%R13),%RDX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| IMUL %R11,%RDX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| IMUL %R12,%R13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| IMUL %R11,%RCX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| ADD %R10,%RCX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| IMUL %RDI,%R11 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| ADD %R10,%R11 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| IMUL %R12,%RDI | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| ADD 0x50(%RSP),%RDI | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| XOR %R12D,%R12D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| NOPW (%RAX,%RAX,1) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| MOV %R8,%RCX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| CMP 0x1e0(%RSP),%R8 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| JNE 433edd <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x21fd> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| JMP 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.08 |
| Function | advec_mom_kernel_.DIR.OMP.PARALLEL.2 |
| Source file and lines | advec_mom_kernel.f90:81-196 |
| Module | exec |
| nb instructions | 64 |
| nb uops | 62 |
| loop length | 305 |
| used x86 registers | 15 |
| used mmx registers | 0 |
| used xmm registers | 0 |
| used ymm registers | 0 |
| used zmm registers | 9 |
| nb stack references | 11 |
| ADD-SUB / MUL ratio | 3.00 |
| micro-operation queue | 10.33 cycles |
| front end | 10.33 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 7.50 | 7.40 | 5.67 | 5.67 | 0.50 | 7.40 | 7.30 | 0.50 | 0.50 | 0.50 | 7.40 | 5.67 |
| cycles | 7.50 | 11.27 | 5.67 | 5.67 | 0.50 | 7.40 | 7.30 | 0.50 | 0.50 | 0.50 | 7.40 | 5.67 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 0.00 |
| FE+BE cycles | 10.63 |
| Stall cycles | 0.00 |
| Front-end | 10.33 |
| Dispatch | 11.27 |
| Data deps. | 0.00 |
| Overall L1 | 11.27 |
| all | 20% |
| load | 0% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | 50% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 16% |
| all | 100% |
| load | 100% |
| store | 100% |
| mul | 100% |
| add-sub | 100% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 100% |
| all | 65% |
| load | 66% |
| store | 100% |
| mul | 100% |
| add-sub | 80% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 50% |
| all | 27% |
| load | 12% |
| store | NA (no store vectorizable/vectorized instructions) |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | 53% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 23% |
| all | 100% |
| load | 100% |
| store | 100% |
| mul | 100% |
| add-sub | 100% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 100% |
| all | 68% |
| load | 70% |
| store | 100% |
| mul | 100% |
| add-sub | 81% |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 54% |
| Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VPBROADCASTQ %RCX,%ZMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| VPSUBQ %ZMM7,%ZMM0,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.50 |
| VPCMPNLEUQ %ZMM1,%ZMM7,%K1 | |||||||||||||||
| MOV 0x118(%RSP),%R12 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| ADD %R12,%R15 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| ADD 0x18(%RSP),%RCX | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| MOV %RCX,%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| SUB %R9,%RDI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| VMOVUPD (%R15,%RDI,8),%ZMM7{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM7,%ZMM6{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| MOV %R9,%R11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| NOT %R11 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| ADD %RCX,%R11 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| VMOVUPD (%R15,%R11,8),%ZMM7{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM7,%ZMM5{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| VADDPD %ZMM5,%ZMM6,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
| ADD %R12,%RDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| VMOVUPD (%RDX,%R11,8),%ZMM8{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM8,%ZMM4{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| VMOVUPD (%RDX,%RDI,8),%ZMM8{%K1}{z} | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.50 |
| VMOVAPD %ZMM8,%ZMM3{%K1} | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
| VADDPD %ZMM3,%ZMM4,%ZMM8 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
| VADDPD %ZMM8,%ZMM7,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
| VMULPD %ZMM2,%ZMM7,%ZMM7 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
| ADD 0x58(%RSP),%R13 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| VMOVUPD %ZMM7,(%R13,%RDI,8){%K1} | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 1 |
| JMP 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.84 |
| LEA 0x1(%R14),%ECX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1-2 | 0.20 |
| INC %EBX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| CMP %EAX,%R14D | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| MOV %ECX,%R14D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| JE 4322ab <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x5cb> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| TEST %ESI,%ESI | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JS 433f70 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x2290> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| MOV 0x80(%RSP),%RCX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| LEA -0x2(%RCX,%R14,1),%ECX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| MOV 0x128(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV (%RDX),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV 0x98(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOV (%RDX),%R12 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| MOVSXD %ECX,%R13 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
| TEST %R8,%R8 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 2 | 0.20 |
| JE 433ec0 <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x21e0> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
| MOVSXD %EBX,%RDI | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
| MOV 0x28(%RSP),%RCX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
| ADD %RDI,%RCX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| ADD 0x20(%RSP),%RDI | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| SUB 0x2c0(%RSP),%R13 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| MOV %R11,%R15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| IMUL %R13,%R15 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| LEA 0x1(%R13),%RDX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| IMUL %R11,%RDX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| IMUL %R12,%R13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| IMUL %R11,%RCX | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| ADD %R10,%RCX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| IMUL %RDI,%R11 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| ADD %R10,%R11 | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
| IMUL %R12,%RDI | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
| ADD 0x50(%RSP),%RDI | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| XOR %R12D,%R12D | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| NOPW (%RAX,%RAX,1) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| MOV %R8,%RCX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
| CMP 0x1e0(%RSP),%R8 | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 |
| JNE 433edd <advec_mom_kernel_mod_mp_advec_mom_kernel_.DIR.OMP.PARALLEL.2+0x21fd> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |
