Perfect hashing benchmark - https://en.wikipedia.org/wiki/Perfect_hash_function (with keys known at compile-time)
Compilers
- clang++-19 - https://clang.llvm.org
- g++-13 - https://gcc.gnu.org
CXXFLAGS
- -std=c++20 - mph requires C++20 (and optionally std::experimental::simd); frozen requires C++14; unordered_map and boost_unordered_flat_map require C++11; map, if_else, and switch_case require C++98
- -DNDEBUG
- -O3 - see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- -march=native - tested on x86-64; mph requires bmi2 and simd, boost_unordered_flat_map requires sse; see the cells below for details of the hardware used for testing (the CPU supports -mbmi2 and -mavx2); a quick flag check is sketched after this list
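Since mph and boost_unordered_flat_map depend on specific instruction-set extensions, it can be worth confirming up front that the CPU exposes them. The helper below is only an illustrative check (not part of the benchmark harness); it looks for the flags in /proc/cpuinfo, the same source dumped in a cell further down.

def has_cpu_flags(*flags):
    # illustrative helper (not part of the harness): report which of the requested
    # instruction-set extensions /proc/cpuinfo lists for the first CPU
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('flags'):
                present = set(line.split(':', 1)[1].split())
                return {flag: flag in present for flag in flags}
    return {flag: False for flag in flags}

print(has_cpu_flags('bmi2', 'avx2', 'sse2'))  # expected: all True on the machine described below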
Targets (with default settings) - https://github.com/boost-ext/mph/tree/benchmark/perfect_hashing (all targets are compiled against the same generated key/value data - see the sketch after this list)
- if_else - recursive if else
- switch_case - nested switch case
- map - std::map (with the default std::less comparator)
- unordered_map - std::unordered_map (with std::hash)
- boost_unordered_flat_map - https://www.boost.org/doc/libs/1_85_0/libs/unordered/doc/html/unordered.html (with default hash)
- gperf - https://www.gnu.org/software/gperf (supports only string-like keys)
- frozen - https://github.com/serge-sans-paille/frozen (with default hash)
- mph - https://github.com/boost-ext/mph (requires bmi2, simd)
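All targets are built from the same generated key/value data. To make "keys known at compile-time" concrete, the snippet below prints the shape of the generated data.ipp entries; the actual generator is the generate() helper defined later in this notebook, and the keys here are just examples.

# illustrative: the shape of the compile-time key/value data shared by the targets
# (see generate() further down for the real generator; keys below are examples)
for index, key in enumerate(["foo", "bar42"]):
    print(f"std::pair{{\"{key}\"sv, {index + 1}}},")   # lines written to data.ipp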
Benchmarks
- int_to_int - integer to integer
- str_to_int - string_view to integer
Measurements
- lookup time - how long it takes to lookup given keys (smaller is better)
- build time - how long it takes to compile with includes (smaller is better)
- binary size - size of the stripped binary (smaller is better); a short summary of how each metric is collected follows this list
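As a quick orientation, the sketch below summarizes how each of these three numbers is obtained by the harness cells further down; the run times are example values only.

# summary (illustrative) of how the metrics above are collected by the cells below:
#   lookup time - median of several runs of the stripped binary pinned to one CPU via taskset
#   build time  - wall-clock seconds reported by /usr/bin/time -f %e wrapped around make
#   binary size - stat -c %s of the stripped binary
import numpy as np
run_times = [0.84, 0.82, 0.83]   # example values for three runs of one target
print(np.median(run_times))      # the value plotted as "lookup time [s]"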
Parameters (known at compile-time)
- size - number of elements (gperf, frozen, and mph support up to thousands of pairs known at compile/generation time)
- range - min, max values when generating integers ([0-65535] - random integers from 0 to 65535)
- length - min, max length when generating strings ([1-8] - random strings from 1 to 8 characters; mph represents strings as uint32/uint64 or x86 __m128i/__m256i/__m512i, hence it supports strings up to 64 characters)
- probability - successful lookups ratio (0.00 - keys never found, 0.50 - keys found 50% of the time, 1.00 - keys always found); an illustrative sketch follows this list
- seed - pseudo random numbers seed
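For intuition only, the sketch below shows one way a lookup-key sequence with a given hit ratio could be drawn from the generated keys. This is not the benchmark's own key-selection code - the targets receive PROBABILITY and SEED through make variables and implement the selection themselves.

import random

def lookup_keys(keys, probability, n, seed, miss=-1):
    # illustrative only: roughly `probability` of the returned keys are taken from
    # `keys` (guaranteed hits); the rest use a value outside the set (guaranteed miss)
    rng = random.Random(seed)
    return [rng.choice(keys) if rng.random() < probability else miss for _ in range(n)]

print(lookup_keys([3, 17, 42, 65535], probability=0.50, n=8, seed=42))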
Notes - 'There is no silver bullet when it comes to performance, just trade-offs!'
- if_else - fast for a few pairs, slower with low probabilities and larger sizes
- switch_case - similar to if_else but the compiler can generate a jump table or bit tests depending on the input keys; the switch-case flattening optimization is compiler dependent (see https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-switch-conversion.cc for details)
- map - slower with larger sizes; ordered and comparison-based (std::less) rather than hashed (see https://en.cppreference.com/w/cpp/container/map for details)
- unordered_map - slower than boost_unordered_flat_map, hash function dependent with closed addressing (see https://en.cppreference.com/w/cpp/container/unordered_map for details)
- boost_unordered_flat_map - fast lookup, open addressing, fast with large sizes and low probabilities, hash function dependent (see http://bannalia.blogspot.com/2022/11/inside-boostunorderedflatmap.html for details)
- gperf - fast lookup and fast compilation times (it requires separate code generation and works only with string-like keys - see https://www.dre.vanderbilt.edu/~schmidt/PDF/gperf.pdf for details)
- frozen - relatively fast lookup but slower than boost_unordered_flat_map, gperf, mph, hash function dependent (see http://stevehanov.ca/blog/index.php?id=119 for details)
- mph - the fastest lookup but has the most limitations and hardware requirements, compilation times might be longer for larger sizes depending on the input (see https://github.com/boost-ext/mph#faq for details)
mph [lookup/find] - https://github.com/boost-ext
- https://godbolt.org/z/ndr5K8n8K
- https://godbolt.org/z/aE3117Goh
- https://godbolt.org/z/35oYfaWGo
- https://godbolt.org/z/5T3EYnsnW
Acknowledgments
In [5]:
!pyperf system tune
Tune the system configuration to run benchmarks

System state
============
CPU: use 16 logical CPUs: 0-15
Perf event: Maximum sample rate: 100000 per second
ASLR: Full randomization
Linux scheduler: No CPU is isolated
CPU Frequency: 0-3,8-11=min=400 MHz, max=4600 MHz; 4-7=min=400 MHz, max=4700 MHz; 12-15=min=400 MHz, max=3500 MHz
CPU scaling governor (intel_pstate): powersave
Turbo Boost (intel_pstate): Turbo Boost enabled
IRQ affinity: irqbalance service: inactive
IRQ affinity: Default IRQ affinity: CPU 0-15
Power supply: the power cable is plugged

Advices
=======
Perf event: Set max sample rate to 1
Linux scheduler: Use isolcpus=<cpu list> kernel parameter to isolate CPUs
Linux scheduler: Use rcu_nocbs=<cpu list> kernel parameter (with isolcpus) to not schedule RCU on isolated CPUs
CPU scaling governor (intel_pstate): Use CPU scaling governor 'performance'
Turbo Boost (intel_pstate): Disable Turbo Boost to get more reliable CPU frequency

Errors
======
Perf event: Failed to write into /proc/sys/kernel/perf_event_max_sample_rate: [Errno 13] Permission denied: '/proc/sys/kernel/perf_event_max_sample_rate'
CPU Frequency: Unable to write scaling_max_freq of CPU 0: [Errno 13] Permission denied: '/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq'
CPU scaling governor (intel_pstate): Failed to set the CPU scaling governor: [Errno 13] Permission denied: '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
Turbo Boost (intel_pstate): Failed to disable Turbo Boost: failed to write into /sys/devices/system/cpu/intel_pstate/no_turbo: [Errno 13] Permission denied: '/sys/devices/system/cpu/intel_pstate/no_turbo'
IRQ affinity: Failed to write '0000ffff' into /proc/irq/27/smp_affinity: [Errno 13] Permission denied: '/proc/irq/27/smp_affinity'

ERROR: At least one operation failed with permission error, retry as root
In [9]:
!cat /proc/cpuinfo | grep -m 1 -A 26 '^processor'
!getconf -a | grep CACHE
!lstopo-no-graphics --no-io --no-legend
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 154
model name : 12th Gen Intel(R) Core(TM) i7-12650H
stepping : 3
microcode : 0x430
cpu MHz : 1647.921
cache size : 24576 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 10
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb
bogomips : 5376.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 49152
LEVEL1_DCACHE_ASSOC 12
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 1310720
LEVEL2_CACHE_ASSOC 10
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 25165824
LEVEL3_CACHE_ASSOC 12
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC
LEVEL4_CACHE_LINESIZE

Machine (62GB total) + Package L#0
  NUMANode L#0 (P#0 62GB)
  L3 L#0 (24MB)
    L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
    L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#9)
    L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#11)
    L2 L#6 (2048KB)
      L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6 + PU L#12 (P#12)
      L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7 + PU L#13 (P#13)
      L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#14 (P#14)
      L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#15 (P#15)
In [1]:
import os
import numpy as np

TMP = '/dev/shm'
CXXFLAGS = '-std=c++20 -DNDEBUG -O3 -march=native'
LDDFLAGS = ''

def buildit(name, compiler, target, size, cfg):
    # remove any previously built binary, then rebuild it via make and capture
    # the wall-clock build time reported by /usr/bin/time -f %e
    !rm -f {TMP}/{target}
    CMD = f"CXX={compiler} CXXFLAGS=\"{CXXFLAGS}\" LDDFLAGS=\"{LDDFLAGS}\" SIZE={size} PROBABILITY={cfg['probability']} SEED={cfg['seed']} TMP={TMP} BENCHMARK={name} /usr/bin/time -f %e make -j -s"
    time = !{CMD} {target}
    assert _exit_code == 0
    try:
        return float(time[0])
    except (IndexError, ValueError):
        print(time)

def timeit(target, cpu=1, runs=3):
    # run the stripped binary `runs` times pinned to a single CPU; each run
    # prints "checksum:seconds", parsed into a checksum and a lookup time
    times = []
    checksums = []
    !strip {TMP}/{target}
    for _ in range(runs):
        result = !taskset -c {cpu} {TMP}/{target}
        tokens = result[0].split(':')
        times.append(float(tokens[1]))
        checksums.append(int(tokens[0]))
        print('.', end='')
    binary_size = !stat -c %s {TMP}/{target}
    return times, checksums, int(binary_size[0])
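Note on the output format assumed by timeit(): each benchmark binary is expected to print a single "checksum:seconds" line, which is what the split(':') above parses. A minimal illustration with an example value:

# illustrative parse of one benchmark run's stdout, mirroring timeit() above
line = "123456789:0.8421"            # example output: "<checksum>:<lookup time in seconds>"
checksum, seconds = line.split(':')
print(int(checksum), float(seconds))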
In [2]:
import joblib, random, string
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from collections import defaultdict

options = ['if_else', 'switch_case', 'map', 'unordered_map', 'boost_unordered_flat_map', 'frozen', 'gperf', 'mph']
palette = sns.color_palette()

# gperf input file header (the generated key/value lines are appended after %%)
gperf = """
struct data { const char* key; int value; };
%compare-strncmp
%language=C++
%enum
%includes
%readonly-tables
%global-table
%struct-type
%define slot-name key
%define lookup-function-name find
%%
"""

def generate_integers(size, rng):
    # unique random integers in [rng[0], rng[1]]
    result = []
    for i in range(size):
        while True:
            new = random.randint(rng[0], rng[1])
            if new not in result:
                result.append(new)
                break
    return result

def generate_strings(size, length):
    # unique random alphanumeric strings with lengths in [length[0], length[1]]
    result = []
    for i in range(size):
        while True:
            new = ''.join(random.choices(string.ascii_letters + string.digits, k=random.randint(length[0], length[1])))
            if new not in result:
                result.append(new)
                break
    return result

def generate(filename, size, cfg):
    # write the generated key/value pairs as compile-time data:
    # <filename>.ipp for the C++ targets and <filename>.gperf for gperf
    if 'length' in cfg:
        data = generate_strings(size, cfg['length'])
        with open(filename + ".ipp", 'w') as file:
            for index, name in enumerate(data):
                file.write(f"std::pair{{\"{name}\"sv, {index + 1}}},\n")
        with open(filename + ".gperf", 'w') as file:
            file.write(gperf)
            for index, name in enumerate(data):
                file.write(f"\"{name}\", {index + 1}\n")
        return data
    elif 'range' in cfg:
        data = generate_integers(size, cfg['range'])
        with open(filename + ".ipp", 'w') as file:
            for index, value in enumerate(data):
                file.write(f"std::pair{{{value}, {index + 1}}},\n")
        return data

def benchmark(name, compilers, targets, sizes, combinations):
    for combination in combinations:
        _, ax_time = plt.subplots(1, len(compilers), figsize=(22, 7))
        _, ax_build = plt.subplots(1, len(compilers), figsize=(22, 7))
        _, ax_binary = plt.subplots(1, len(compilers), figsize=(22, 7))
        hashed_checksums = defaultdict(list)
        y_min = [np.iinfo(np.uint32).max, np.iinfo(np.uint32).max, np.iinfo(np.uint32).max]
        y_max = [np.iinfo(np.uint32).min, np.iinfo(np.uint32).min, np.iinfo(np.uint32).min]
        for i, compiler in enumerate(compilers):
            times = defaultdict(list)
            builds = defaultdict(list)
            binaries = defaultdict(list)
            for size in sizes:
                data = generate(f"{TMP}/data", size, combination)
                for target in targets:
                    time = buildit(name, compiler, target, size, combination)
                    builds[target].append(time)
                    ts, checksums, binary_size = timeit(target)
                    times[target].append(np.median(ts))
                    binaries[target].append(binary_size)
                    # all runs of one binary must agree on the checksum...
                    assert len(checksums) > 0 and all(checksum == checksums[0] for checksum in checksums), f"{compiler}, {target}, {size}, {combination}, {checksums}"
                    # ...and all targets built from the same data/combination must agree with each other
                    hash = joblib.hash([data, combination])
                    hashed_checksums[hash].append(checksums[0])
                    assert all(checksum == hashed_checksums[hash][0] for checksum in hashed_checksums[hash]), f"{compiler}, {target}, {size}, {combination}, {hashed_checksums[hash]}"

            def update(ax, n, data, title, ylabel):
                for target, values in data.items():
                    ax[i].plot(sizes, values, marker='o', markersize=3, label=target, color=palette[options.index(target)])
                    y_min[n] = min(y_min[n], min(values))
                    y_max[n] = max(y_max[n], max(values))
                ax[i].set_title(title, fontsize=9)
                ax[i].grid(True)
                for spine in ax[i].spines.values():
                    spine.set_visible(False)
                ax[i].set_xlabel('size')
                ax[i].set_ylabel(ylabel)
                ax[i].ticklabel_format(style='plain', axis='both')
                ax[i].legend(loc='upper left')

            update(ax_time, 0, times, f"[{name} - lookup time | smaller is better]\ncompiler: {compiler}, cxxflags: {CXXFLAGS}" + " \n" + f"{', '.join([f'{name}:{value}' for name, value in combination.items()])}", "lookup time [s]")
            update(ax_build, 1, builds, f"[{name} - build time | smaller is better]\ncompiler: {compiler}, cxxflags: {CXXFLAGS}" + " \n" + f"{', '.join([f'{name}:{value}' for name, value in combination.items()])}", "build time [s]")
            update(ax_binary, 2, binaries, f"[{name} - binary size (stripped) | smaller is better]\ncompiler: {compiler}, cxxflags: {CXXFLAGS}" + " \n" + f"{', '.join([f'{name}:{value}' for name, value in combination.items()])}", "binary size [bytes]")

        def ylim(ax, n):
            ax[1].yaxis.tick_right()
            for i in range(len(compilers)):
                ax[i].set_ylim(y_min[n], y_max[n] * 1.1)

        ylim(ax_time, 0)
        ylim(ax_build, 1)
        ylim(ax_binary, 2)
        plt.tight_layout()
        plt.show()
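Before launching the full sweeps in the next cell, a small smoke test can be run with the same helpers (it assumes the Makefile from the benchmark repository in the working directory, like the sweeps below). The compilers match the sweeps; the targets, size, and combination here are just an example.

# illustrative smoke test (not part of the original sweeps): two compilers,
# two targets, a single small size and one parameter combination
benchmark('int_to_int', ['clang++-19', 'g++-13'], ['frozen', 'mph'], sizes=[8],
          combinations=[{'range': [0, np.iinfo(np.uint16).max], 'probability': 1.00, 'seed': 42}])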
In [6]:
seed = 42
range_1_32 = [1, 4, 8, 12, 16, 20, 24, 28, 32]
range_1_1024 = [1, 128, 256, 512, 640, 768, 896, 1024]
benchmark('int_to_int', ['clang++-19', 'g++-13'], ['if_else', 'switch_case', 'map', 'unordered_map', 'boost_unordered_flat_map', 'frozen', 'mph'], sizes = range_1_32, combinations = [
{'range': [0, np.iinfo(np.uint16).max], 'probability': 1.00, 'seed': seed},
{'range': [0, np.iinfo(np.uint16).max], 'probability': 0.50, 'seed': seed},
{'range': [0, np.iinfo(np.uint16).max], 'probability': 0.00, 'seed': seed},
])
benchmark('int_to_int', ['clang++-19', 'g++-13'], ['unordered_map', 'boost_unordered_flat_map', 'frozen', 'mph'], sizes = range_1_1024, combinations = [
{'range': [0, np.iinfo(np.uint32).max], 'probability': 1.00, 'seed': seed},
{'range': [0, np.iinfo(np.uint32).max], 'probability': 0.50, 'seed': seed},
{'range': [0, np.iinfo(np.uint32).max], 'probability': 0.00, 'seed': seed},
])
benchmark('str_to_int', ['clang++-19', 'g++-13'], ['if_else', 'map', 'unordered_map', 'boost_unordered_flat_map', 'frozen', 'gperf', 'mph'], sizes = range_1_32, combinations = [
{'length': [1, 8], 'probability': 1.00, 'seed': seed},
{'length': [1, 8], 'probability': 0.50, 'seed': seed},
{'length': [1, 8], 'probability': 0.00, 'seed': seed},
])
benchmark('str_to_int', ['clang++-19', 'g++-13'], ['unordered_map', 'boost_unordered_flat_map', 'frozen', 'gperf', 'mph'], sizes = range_1_1024, combinations = [
{'length': [1, 8], 'probability': 1.00, 'seed': seed},
{'length': [1, 8], 'probability': 0.50, 'seed': seed},
{'length': [1, 8], 'probability': 0.00, 'seed': seed},
])
benchmark('str_to_int', ['clang++-19', 'g++-13'], ['unordered_map', 'boost_unordered_flat_map', 'frozen', 'gperf', 'mph'], sizes = range_1_32, combinations = [
{'length': [8, 8], 'probability': 1.00, 'seed': seed},
{'length': [8, 8], 'probability': 0.50, 'seed': seed},
{'length': [8, 8], 'probability': 0.00, 'seed': seed},
])
benchmark('str_to_int', ['clang++-19', 'g++-13'], ['unordered_map', 'boost_unordered_flat_map', 'frozen', 'gperf', 'mph'], sizes = range_1_1024, combinations = [
{'length': [8, 8], 'probability': 1.00, 'seed': seed},
{'length': [8, 8], 'probability': 0.50, 'seed': seed},
{'length': [8, 8], 'probability': 0.00, 'seed': seed},
])
[progress output from the benchmark runs; the generated lookup time, build time, and binary size plots for each compiler and parameter combination are not included in this text export]