OV - exec - Global


Total Time (s)		6.88
Max (Thread Active Time) (s)		4.49
Average Active Time (s)		3.22
Activity Ratio (%)		80.4
Average number of active threads		29.996
Affinity Stability (%)		83.1
Time in analyzed loops (%)		95.5
Time in analyzed innermost loops (%)		93.2
Time in user code (%)		95.9
Compilation Options Score (%)		100.0
Array Access Efficiency (%)		75.0
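The thread-activity figures above can be cross-checked against one another. The sketch below assumes that the average number of active threads is roughly the average per-thread active time, multiplied by the number of observed threads (64, as listed in this report), divided by the total time. That relation is inferred from the reported numbers, not taken from MAQAO documentation.

```python
# Values copied from the report.
total_time_s = 6.88
avg_active_time_s = 3.22
threads_observed = 64
reported_avg_active_threads = 29.996

# Assumed relation (inferred from the numbers, not a documented MAQAO formula):
# active threads ~= average active time * thread count / total time
estimated = avg_active_time_s * threads_observed / total_time_s

# The inputs are rounded to two decimals, so compute the range that
# rounding alone could produce and check the reported value falls inside it.
low = (avg_active_time_s - 0.005) * threads_observed / (total_time_s + 0.005)
high = (avg_active_time_s + 0.005) * threads_observed / (total_time_s - 0.005)
consistent = low <= reported_avg_active_threads <= high

print(f"estimated active threads: {estimated:.2f} (rounding range {low:.2f}..{high:.2f})")
print("consistent with report:", consistent)
```

Under that assumption, roughly 30 of the 64 threads are busy on average over the run.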

Potential Speedups
Perfect Flow Complexity		1.01
Perfect OpenMP/MPI/Pthread/TBB		1.02
Perfect OpenMP/MPI/Pthread/TBB + Perfect Load Distribution		1.44
	Potential Speedup	Nb Loops to get 80%
No Scalar Integer	1.01	2
FP Vectorised	1.17	1
Fully Vectorised	3.61	1
FP Arithmetic Only	3.99	1
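To make the speedup figures concrete, the sketch below converts each one into a projected run time. It assumes that each potential speedup applies to the whole 6.88 s total time and that the scenarios are considered one at a time, not combined; both are simplifying assumptions, and `projected_time` is a hypothetical helper, not part of MAQAO.

```python
total_time_s = 6.88  # "Total Time (s)" from the report

# Potential speedups copied from the report.
potential_speedups = {
    "Perfect Flow Complexity": 1.01,
    "Perfect OpenMP/MPI/Pthread/TBB": 1.02,
    "Perfect OpenMP/MPI/Pthread/TBB + Perfect Load Distribution": 1.44,
    "No Scalar Integer": 1.01,
    "FP Vectorised": 1.17,
    "Fully Vectorised": 3.61,
    "FP Arithmetic Only": 3.99,
}


def projected_time(total_s, speedup):
    """Run time if the whole run were accelerated by `speedup` (assumption)."""
    return total_s / speedup


projections = {
    name: projected_time(total_time_s, speedup)
    for name, speedup in potential_speedups.items()
}

# List scenarios from the shortest projected time to the longest.
for name, seconds in sorted(projections.items(), key=lambda item: item[1]):
    saved = total_time_s - seconds
    print(f"{name:<60} {seconds:5.2f} s  (saves {saved:4.2f} s)")
```

Read this way, the two vectorisation-related scenarios dominate: each would bring the run below 2 s, while the other scenarios save at most about 2 s.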

Source Object	Issue
libllama.so
  hashtable.h
libggml-cpu.so
  binary-ops.cpp
  traits.cpp
  kai_rhs_pack_nxk_qsi4c32pscalef16_qsu4c32s16s0.c
  kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm.c
  vec.cpp
  kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon.c
  ops.cpp
  ggml-cpu.c
  quants.c
libggml-base.so
	-g is missing for some functions (possibly ones added by the compiler), it is needed to have more accurate reports. Other recommended flags are: -O2/-O3, -march=(target)
	-O2, -O3 or -Ofast is missing.
	-mcpu=native is missing.
exec
	-g is missing for some functions (possibly ones added by the compiler), it is needed to have more accurate reports. Other recommended flags are: -O2/-O3, -march=(target)
	-O2, -O3 or -Ofast is missing.
	-mcpu=native is missing.
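The three messages reported for libggml-base.so and exec can be approximated by a simple flag check. The sketch below is only an illustration of that kind of check, not MAQAO's implementation; `missing_flags` and the `REQUIRED` table are hypothetical names, and the sample option string is an abbreviated copy of the libggml-cpu.so flags listed under Compilation Options in this report.

```python
# Each entry maps a human-readable requirement to a test on the flag list.
REQUIRED = {
    "-g": lambda flags: "-g" in flags,
    "-O2, -O3 or -Ofast": lambda flags: any(
        level in flags for level in ("-O2", "-O3", "-Ofast")
    ),
    "-mcpu=<target>": lambda flags: any(f.startswith("-mcpu=") for f in flags),
}


def missing_flags(options):
    """Return the requirements that the given option string does not satisfy."""
    flags = options.split()
    return [name for name, is_present in REQUIRED.items() if not is_present(flags)]


# Abbreviated libggml-cpu.so options from the report.
cpu_lib_options = (
    "-mcpu=neoverse-v1 -msve-vector-bits=256 -g -O3 "
    "-funroll-loops -ffast-math -fPIC -fopenmp"
)

print("libggml-cpu.so:", missing_flags(cpu_lib_options))
# exec and libggml-base.so report "N/A", so every requirement looks unmet.
print("exec:", missing_flags("N/A"))
```

This matches the pattern in the report: the binaries whose compilation options are recorded as N/A are the ones flagged for all three missing options.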

Application	/home/eoseret/Tools/QaaS/qaas_runs/ip-172-31-18-66/176-131-5415/llama.cpp/run/binaries/gcc_2/exec
Timestamp	2025-10-24 14:34:00	Universal Timestamp	1761316440
Number of processes observed	1	Number of threads observed	64
Experiment Type	MPI; OpenMP;
Machine	ip-172-31-18-66
Architecture	aarch64	Micro Architecture	ARM_NEOVERSE_V1
OS Version	Linux 6.14.0-1012-aws #12~24.04.1-Ubuntu SMP Fri Aug 15 00:07:14 UTC 2025
Architecture used during static analysis	aarch64	Micro Architecture used during static analysis	ARM_NEOVERSE_V1
Frequency Driver	NA	Frequency Governor	NA
Huge Pages	madvise	Hyperthreading	off
Number of sockets	1	Number of cores per socket	64
Compilation Options	exec: N/A
	libggml-base.so: N/A
	libggml-cpu.so: GNU C11 14.2.0 -mcpu=neoverse-v1 -msve-vector-bits=256 -mlittle-endian -mabi=lp64 -g -O3 -O3 -O3 -std=gnu11 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only -fPIC -fopenmp
	libllama.so: GNU C++17 14.2.0 -mcpu=neoverse-v1 -msve-vector-bits=256 -mlittle-endian -mabi=lp64 -g -O3 -O3 -O3 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only -fPIC
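The two timestamp fields above can be checked against each other. The sketch below assumes that Universal Timestamp is a Unix epoch value in seconds, and tests whether the human-readable Timestamp is expressed in UTC.

```python
from datetime import datetime, timezone

# Values copied from the report.
universal_timestamp = 1761316440
reported_timestamp = "2025-10-24 14:34:00"

# Interpret the universal timestamp as seconds since the Unix epoch (assumption).
utc_time = datetime.fromtimestamp(universal_timestamp, tz=timezone.utc)
formatted = utc_time.strftime("%Y-%m-%d %H:%M:%S")

matches_utc = formatted == reported_timestamp
print(formatted, "| matches reported Timestamp as UTC:", matches_utc)
```

The two values agree when read as UTC, so the Timestamp field carries no local-time offset for this run.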

Dataset
Run Command	<executable> -m meta-llama-3.1-8b-instruct-Q4_0.gguf -t 64 -b 2048 -ub 512 -npp 128 -ntg 0 -npl 16 -c 16384 --seed 0 --output-format jsonl
MPI Command	mpirun -n <number_processes> --bind-to none --report-bindings
Number Processes	1
Number Nodes	1
Filter	Not Used
Profile Start	Not Used
Profile Stop	Not Used