Beware with Geekbench v6 results!

By Davidson Francis | Published: 2024-04-23 | Last Modified: 2024-04-23

Geekbench is a widely renowned benchmarking tool that I have personally used for several years to compare various machines I have access to. Among them are my desktop and my laptop, featuring an i7 2600 and an i5 7300HQ, respectively.

I have been using both machines for many years, and their performance is remarkably similar. This has been confirmed through synthetic benchmarks as well as real-world use cases over the years, such as lengthy program builds, and many other CPU-bound tasks.

Recently, I ran Geekbench v6.2.0 on both my machines, and a significant surprise occurred:

Image 1: My desktop, and my laptop, respectively

This had never happened before: a nearly 50% difference is undeniably substantial! I have consistently run benchmarks on these two machines, including (but not limited to):

Geekbench v5: i5 7300HQ v5 and i7 2600 v5
PassMark: i5 7300HQ vs i7 2600
Stress-NG:

# i5 7300HQ:
$ time ./stress-ng --matrix 0 -t 30s --metrics-brief
stress-ng: info:  [5690] setting to a 30 secs run per stressor
stress-ng: info:  [5690] dispatching hogs: 4 matrix
stress-ng: metrc: [5690] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: metrc: [5690]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: metrc: [5690] matrix           353700     30.00    119.94      0.00     11789.51        2948.90
stress-ng: info:  [5690] skipped: 0
stress-ng: info:  [5690] passed: 4: matrix (4)
stress-ng: info:  [5690] failed: 0
stress-ng: info:  [5690] metrics untrustworthy: 0
stress-ng: info:  [5690] successful run completed in 30.01 secs

# i7 2600
$ time ./stress-ng --matrix 0 -t 30s --metrics-brief
stress-ng: info:  [1465] setting to a 30 secs run per stressor
stress-ng: info:  [1465] dispatching hogs: 8 matrix
stress-ng: metrc: [1465] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: metrc: [1465]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: metrc: [1465] matrix           455846     30.00    239.97      0.00     15194.20        1899.56
stress-ng: info:  [1465] skipped: 0
stress-ng: info:  [1465] passed: 8: matrix (8)
stress-ng: info:  [1465] failed: 0
stress-ng: info:  [1465] metrics untrustworthy: 0
stress-ng: info:  [1465] successful run completed in 30.01 secs

And in all my tests —literally all of them (as evident above)— I had never encountered such a significant difference, which typically remains around a maximum of 10%.

What's Changed?

A discerning reader will notice a subtle distinction at the top, just below the scores, between the benchmarks: Geekbench 6.2.0 for Linux AVX2 for the i5 and Geekbench 6.2.0 for Linux x86 (64-bit) for the i7. Does this resolve the issue? Did the i5 run with AVX2 support while the i7 did not, thereby explaining the discrepancy in benchmarks and everything else, right?

Unfortunately, no.

Upon downloading Geekbench v6.2.0, we observed the following files:

$ ls -lah
total 477M
drwxr-xr-x 2 david users 4.0K Apr 23 22:53 .
drwxr-xr-x 4 david users 4.0K Apr 23 22:53 ..
-rw-r--r-- 1 david users 302M Sep 11  2023 geekbench-workload.plar
-rw-r--r-- 1 david users 4.2M Sep 11  2023 geekbench.plar
-rwxr-xr-x 1 david users 3.4M Sep 11  2023 geekbench6
-rwxr-xr-x 1 david users  88M Sep 11  2023 geekbench_avx2
-rwxr-xr-x 1 david users  80M Apr 23 22:53 geekbench_x86_64

$ md5sum *
fc758366e0dd1457875c0c97222365b4  geekbench-workload.plar
f67c40302b064de7c06e3fb9567a7ba0  geekbench.plar
1680c4f456ece1c9661bb6f26991fdb9  geekbench6
00e848cea509532ebf103f215c3db949  geekbench_avx2
f4d9d9b019f052e8fad0b59fccfd1e2f  geekbench_x86_64

The geekbench6 acts as a 'dispatcher', meaning it selects the appropriate executable based on the CPU's supported features. The other two binaries likely represent a version with AVX2 and a generic x86_64, presumably without SIMD code, correct?

Ultimately, what matters is that I should also be able to manually execute geekbench_x86_64 on both machines to have an identical comparison environment (since my i7 only supports up to AVX1).

Here is the new result from my i5:

Image 2: i5 7300HQ running geekbench_x86_64 with Geekbench v6.2.0

Indeed, there was a slight reduction in the difference, but it's still significantly large compared to the other benchmarks shown earlier, even when compared to Geekbench v5 itself!

I also considered the operating system, libraries, faulty hardware, thermal throttling, etc., but the official results for the i5 and i7 also follow the same pattern... why?

Image 3: Official i5 and i7 results for Geekbench v6

Is the execution truly identical on both machines?

Initial Hypothesis: Execution Path Analysis

If a significant difference persists, it may indicate a potential variation in the execution flow between the two machines, suggesting that the executed code is not identical... although that should be impossible, right?

How do we examine the execution path? Enter perf!

The initial idea is quite straightforward:

Sample the Geekbench run using perf on both systems
Compare the reports
Profit

Let's get started:

$ time perf record --call-graph dwarf ./geekbench_x86_64
Geekbench 6.2.0 : https://www.geekbench.com/

Geekbench 6 requires an active internet connection and automatically uploads
benchmark results to the Geekbench Browser.

Upgrade to Geekbench 6 Pro to enable offline use and unlock other features:

  https://store.primatelabs.com/v6

Enter your Geekbench 6 Pro license using the following command line:

  ./geekbench_x86_64 --unlock <email> <key>

System Information
  Operating System              Slackware 14.2 x86_64 (post 14.2 -current)
  Kernel                        Linux 5.4.186 x86_64
  Model                         Acer Nitro AN515-51
  Motherboard                   KBL Freed_KLS
  BIOS                          Insyde Corp. V1.22

CPU Information
  Name                          Intel Core i5-7300HQ
  Topology                      1 Processor, 4 Cores
  Identifier                    GenuineIntel Family 6 Model 158 Stepping 9
  Base Frequency                3.50 GHz
  L1 Instruction Cache          32.0 KB x 2
  L1 Data Cache                 32.0 KB x 2
  L2 Cache                      256 KB x 2
  L3 Cache                      6.00 MB

Memory Information
  Size                          23.4 GB


Single-Core
  Running File Compression
  Running Navigation
^C[ perf record: Woken up 3194 times to write data ]
Warning:
Processed 107281 events and lost 38 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 799.249 MB perf.data (100206 samples) ]


real  0m46.667s
user  0m26.732s
sys 0m1.794s

$ ls -lah perf.data
-rw------- 1 david users 800M Apr 24 21:14 perf.data

This isn't going well... 800MB for a brief test span... I B(genuinely) will not have enough disk space for extended minutes of execution.

A New Approach...

Let's examine the site's report for a benchmark with a significant discrepancy and run only that particular test instead of the entire suite. This way, the test will finish fast, and we won't have an exponentially large perf.data output.

Reviewing the reports, the Object Detection test stands out as an excellent candidate: scoring 66 versus 545 points on the i7 and i5, respectively.

Now, let's determine which flag is needed for Geekbench:

$ ./geekbench_x86_64 --help
Geekbench 6.2.0 : https://www.geekbench.com/

Usage:

  ./geekbench_x86_64 [ options ]

Options:

  -h, --help                  print this message
  --unlock EMAIL KEY          unlock Geekbench using EMAIL and KEY

  --cpu                       run the CPU benchmark
  --sysinfo                   display system information and exit
INTEL-MESA: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0


  --gpu [API]                 run the GPU benchmark
                                API can be one of: OpenCL (default), Vulkan
  --gpu-list                  list available GPU platforms, devices and exit
  --gpu-platform-id ID        use GPU platform ID (default is 0)
  --gpu-device-id ID          use GPU device ID (default is 0)

  If no options are given, the default action is to run the CPU benchmark.

Hmm... there isn't one...? So, what's next? I'm unaware of any perf flag that could assist with this.

Let's Hack?

The previous idea is good and indeed seems promising, but without an official way to run a single benchmark, let's add our own method of doing so!

Since the binary is not stripped, it wasn't exactly difficult to obtain the methods for all the benchmarks: I simply paused the running benchmark and analyzed the backtrace for potential candidate functions. After some time, I arrived at this:

+-------------------------------------------------------------------------------------------------------+
+ File offset  |  Mangled C++ symbol                        |  Demangled C++ symbol/function            +
+-------------------------------------------------------------------------------------------------------+
+   0x4b9370   |  _ZN23FileCompressionWorkload6workerEi     |  FileCompressionWorkload::worker(int)     +
+   0x3ccd80   |  _ZN18NavigationWorkload6workerEi          |  NavigationWorkload::worker(int)          +
+   0x4c1a20   |  _ZN20HTML5BrowserWorkload6workerEi        |  HTML5BrowserWorkload::worker(int)        +
+   0x3ee400   |  _ZN20PDFRenderingWorkload6workerEi        |  PDFRenderingWorkload::worker(int)        +
+   0x3f1d70   |  _ZN13PhotoWorkload6workerEi               |  PhotoWorkload::worker(int)               +
+   0x4b12c0   |  _ZN16ClangTBBWorkload6workerEi            |  ClangTBBWorkload::worker(int)            +
+   0x3f8810   |  _ZN14PythonWorkload6workerEi              |  PythonWorkload::worker(int)              +
+   0x496810   |  _ZN24AssetCompressionWorkload6workerEi    |  AssetCompressionWorkload::worker(int)    +
+   0x3e4ed0   |  _ZN23ObjectDetectionWorkload6workerEi     |  ObjectDetectionWorkload::worker(int)     +
+   0x499400   |  _ZN25BackgroundBlurTBBWorkload6workerEi   |  BackgroundBlurTBBWorkload::worker(int)   +
+   0x4bdcd0   |  _ZN27HorizonDetectionTBBWorkload6workerEi |  HorizonDetectionTBBWorkload::worker(int) +
+   0x4c3690   |  _ZN18InpaintTBBWorkload6workerEi          |  InpaintTBBWorkload::worker(int)          +
+   0x4bd560   |  _ZN14HDRTBBWorkload6workerEi              |  HDRTBBWorkload::worker(int)              +
+   0x4a72e0   |  _ZN14CameraWorkload6workerEi              |  CameraWorkload::worker(int)              +
+   0x3f97a0   |  _ZN19RaytraceTBBWorkload6workerEi         |  RaytraceTBBWorkload::worker(int)         +
+   0x3fb420   |  _ZN14SfMTBBWorkload6workerEi              |  SfMTBBWorkload::worker(int)              +
+-------------------------------------------------------------------------------------------------------+

The list above contains all the benchmark functions, but that alone is not sufficient. It's not as if I can invoke any of these functions from any point in the code.

Further analyzing the backtrace, I found these functions (mangled):

_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions
_ZN14WorkloadDriver3runE11SectionType12WorkloadTypePK15WorkloadOptions

or (demangled):

SectionDriver::run(SectionType, std::set<WorkloadType, std::less<WorkloadType>, std::allocator<WorkloadType> >, WorkloadOptions const*)
WorkloadDriver::run(SectionType, WorkloadType, WorkloadOptions const*)

SectionDriver::run() iterates through the list of benchmarks and, for each one, invokes WorkloadDriver::run() with the appropriate benchmark code, specifically at:

_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions:
[snip]
  3bb5f0: /----> 49 89 2e               mov    %rbp,(%r14)
  3bb5f3: |      49 83 47 18 08         addq   $0x8,0x18(%r15)
  3bb5f8: |      4c 8b 74 24 20         mov    0x20(%rsp),%r14
  3bb5fd: |      48 8d 3d dc 82 e4 02   lea    0x2e482dc(%rip),%rdi # 32038e0 <_ZTS15BrowserDelegate+0x7e>
  3bb604: |      31 f6                  xor    %esi,%esi
  3bb606: |      31 d2                  xor    %edx,%edx
  3bb608: |      31 c9                  xor    %ecx,%ecx
  3bb60a: |      45 31 c0               xor    %r8d,%r8d
  3bb60d: |      e8 7e 67 37 00         call   731d90 <je_mallctl>
  3bb612: |      41 8b 7e 14            mov    0x14(%r14),%edi
  3bb616: |      e8 95 54 1f 00         call   5b0ab0 <_ZN4base5sleepEj>
  3bb61b: |      49 83 c4 04            add    $0x4,%r12         ; increase benchmark pointer
  3bb61f: |      4d 39 ec               cmp    %r13,%r12         ; should end?
  3bb622: |      48 8b 6c 24 30         mov    0x30(%rsp),%rbp
  3bb627: |      0f 84 6b 01 00 00      je     3bb798 <_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions+0x4b8>
  3bb62d: |      48 83 7d 20 00         cmpq   $0x0,0x20(%rbp)
  3bb632: |      0f 84 94 01 00 00      je     3bb7cc <_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions+0x4ec>
  3bb638: |      41 8b 1c 24            mov    (%r12),%ebx       ; ebx = benchmark number
  3bb63c: |      48 8b 7c 24 28         mov    0x28(%rsp),%rdi
  3bb641: |      ff 55 28               call   *0x28(%rbp)
  3bb644: |      a8 01                  test   $0x1,%al
  3bb646: |      0f 85 4c 01 00 00      jne    3bb798 <_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions+0x4b8>
  3bb64c: |      8b 7c 24 04            mov    0x4(%rsp),%edi
  3bb650: |      89 de                  mov    %ebx,%esi
  3bb652: |      e8 89 de 00 00         call   3c94e0 <_ZN8Metadata16workload_factoryE11SectionType12WorkloadType>
  3bb657: |      48 85 c0               test   %rax,%rax
  3bb65a: |      74 bf                  je     3bb61b <_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions+0x33b>
          |
          |
          | Benchmark call:
  3bb65c: |      48 8b 7c 24 40         mov    0x40(%rsp),%rdi
  3bb661: |      8b 74 24 04            mov    0x4(%rsp),%esi
  3bb665: |      89 da                  mov    %ebx,%edx         ; edx = benchmark number
  3bb667: |      4c 89 f1               mov    %r14,%rcx
  3bb66a: |      e8 31 4f 00 00         call   3c05a0 <_ZN14WorkloadDriver3runE11SectionType12WorkloadTypePK15WorkloadOptions>
          |
          |
  3bb66f: |      48 89 c5               mov    %rax,%rbp
  3bb672: |      48 8b 44 24 30         mov    0x30(%rsp),%rax
  3bb677: |      48 8b 78 08            mov    0x8(%rax),%rdi
  3bb67b: |      48 8b 07               mov    (%rdi),%rax
  3bb67e: |      0f 57 c0               xorps  %xmm0,%xmm0
  3bb681: |      ff 50 10               call   *0x10(%rax)
  3bb684: |      4d 8b 77 18            mov    0x18(%r15),%r14
  3bb688: |      4d 3b 77 20            cmp    0x20(%r15),%r14
  3bb68c: \----- 0f 85 5e ff ff ff      jne    3bb5f0 <_ZN13SectionDriver3runE11SectionTypeSt3setI12WorkloadTypeSt4lessIS2_ESaIS2_EEPK15WorkloadOptions+0x310>

Once WorkloadDriver::run() is invoked (at 0x3bb66a), there are other function calls that prepare the environment until the benchmark is actually invoked. However, the key point to note is the content in edx: it represents the benchmark number to be executed!

The table below illustrates all the numbers for each benchmark:

+----------------------------------------------------------+
+ Benchmark                                |  Bench number |
+----------------------------------------------------------+
+ FileCompressionWorkload::worker(int)     |   0x65        +
+ NavigationWorkload::worker(int)          |     66        +
+ HTML5BrowserWorkload::worker(int)        |     67        +
+ PDFRenderingWorkload::worker(int)        |     68        +
+ PhotoWorkload::worker(int)               |     69        +
+ ClangTBBWorkload::worker(int)            |     c9        +
+ PythonWorkload::worker(int)              |     ca        +
+ AssetCompressionWorkload::worker(int)    |     cb        +
+ ObjectDetectionWorkload::worker(int)     |  0x12d  <<<<  +
+ BackgroundBlurTBBWorkload::worker(int)   |  0x12e        +
+ HorizonDetectionTBBWorkload::worker(int) |  0x191        +
+ InpaintTBBWorkload::worker(int)          |  0x192        +
+ HDRTBBWorkload::worker(int)              |  0x193        +
+ CameraWorkload::worker(int)              |  0x194        +
+ RaytraceTBBWorkload::worker(int)         |  0x1f5        +
+ SfMTBBWorkload::worker(int)              |  0x1f6        +
+----------------------------------------------------------+

That said, since we only want to run the Object Detection (due to significant discrepancies in results), we simply need to patch the value of edx to 0x12d just before the call to WorkloadDriver::run(). Upon return, we can terminate our program, like so:

3bb65c: 48 8b 7c 24 40  mov    0x40(%rsp),%rdi
3bb661: 8b 74 24 04     mov    0x4(%rsp),%esi
3bb665: 89 da           mov    %ebx,%edx
3bb667: 4c 89 f1        mov    %r14,%rcx
3bb66a: ba 2d 01 00 00  mov    $0x12d,%edx   ; <<< Set our benchmark here!

; Call the function
3bb66f: e8 2c 4f 00 00  callq  3c05a0 <_ZN14WorkloadDriver3runE11SectionType12WorkloadTypePK15WorkloadOptions>

; Exit 0
3bb674: b8 3c 00 00 00  mov    $0x3c,%eax
3bb679: 48 31 ff        xor    %rdi,%rdi
3bb67c: 0f 05           syscall

Note that, since we've added 5 bytes before BC(callq), the address offset needs to be subtracted by 5. In other words, change e8 31 4f 00 00 to e8 2c 4f 00 00.

You can now easily patch your Geekbench v6.2.0:

$ printf "\xba\x2d\x01\x00\x00\xe8\x2c\x4f\x00\x00\xb8\x3c\x00\x00\x00\x48\x31\xff\x0f\x05" | dd of=geekbench_x86_64 bs=1 seek=$((0x3bb66a)) conv=notrunc

Simply replace \x2d\x01 (little-endian) with the benchmark you want to execute.

Perf-ing them all!

Before diving into perf, let's ensure everything is running smoothly. We'll run Geekbench as usual to see what happens now (click to enlarge):

Image 4: Patched Geekbench v6.2.0 running only Object Detection test

It worked! I must admit, I was quite skeptical about this.

What's even better: we can already notice a big initial difference between the i5 and the i7 in these preliminary results... what insights will perf provide us?

Running perf with perf record --call-graph dwarf ./gb6_obj_detec_only yields a perf.data file of 228M, a significant leap compared to before. On the desktop, it reached around \~1.2GB.

Image 5: Perf record

Reports!

Let's dive into the final results. Below, you'll find perf reports for the i7 and i5 processors, respectively. A quick reminder: these reports pertain to the execution of Geekbench v6.2.0 (also applicable to v6.3.0) using the geekbench_x86_64 binary (click on the images to enlarge):

Image 6: Perf report for i7 2600

Image 7: Perf report for i5 7300HQ

This can't be happening... let's examine the instruction annotations:

Image 8: Perf report for i7 2600, most used instructions

Image 9: Perf report for i5 7300HQ, most used instructions

Observing the Object Detection test, we find the invocation of the function ml::cpu::gemmNT_lowp(). However, there are at least two versions of this function: the one mentioned earlier and also ml::cpu::gemmNT_lowp_avx2().

Furthermore, the usage of instructions such as vpmovsxbw, vpbroadcastw, and so on clearly indicates the evident use of AVX2. On the i7 side, instructions like punpcklbw, psrad, paddd, and so forth are used, all from SSE2!.

If it's not clear yet, there are two major issues with this code:

AVX2 Usage in a Supposedly Generic x86_64 Binary: The binary should ideally support any x64 CPU. However, it's evident that there's a runtime dispatcher selecting the best code path based on the CPU.
Lack of Support for Instruction Sets Below AVX2: Between SSE2 (supported even by Pentium 4!) and AVX1 (supported by my i7), there's a plethora of SIMD instructions, including SSE3, SSSE3, SSE4, SSE4.1 + SSE4.2, and AVX. This benchmark (and possibly others) is extremely binary: it either supports AVX2 or SSE2. This severely limits precise benchmark evaluations of everything a CPU has to offer. My i7 was effectively using the same instruction set as a Pentium 4!

I understand that supporting multiple SIMD code variations is challenging, and writing code for AVX1 is more complex than for AVX2. However, this evaluation remains unfair.

If they want to implement support solely for AVX2, that's fine. But please ensure that geekbench_x86_64 runs exclusively with SSE2. Otherwise, we're comparing apples to oranges.

Is There a Fix?

Surprisingly, yes!

Dissatisfied with the results, I decided to investigate the geekbench_x86_64 binary once again and found the following function: 5e8a70 <_Z17is_avx2_availablev>, invoked from the following backtrace:

#0  0x0000555555b3ca70 in is_avx2_available() ()
#1  0x0000555555a4ad7e in ml::cpu::convolution_2d_prepare(ml::Node*) ()
#2  0x0000555555a47de7 in ml::Backend::prepare() ()
#3  0x0000555555936c3e in ObjectDetectionWorkload::ObjectDetectionWorkload(SectionType, WorkloadOptions const*) ()

It's a simple function that returns 1 if available and 0 if not, which can be easily patched with:

0x00000000005e8a70 <_Z17is_avx2_availablev>:
  0x5e8a70:   48 31 c0                xor    %rax,%rax
  0x5e8a73:   c3                      retq

If you want to 'fix' your Geekbench v6.2.0:

printf "\x48\x31\xc0\xc3" | dd of=geekbench_x86_64 bs=1 seek=$((0x5e8a70)) conv=notrunc

And finally, I was able to obtain the following final result:

Image 10: Fixed Geekbench result

which you can also check on the Geekbench website.

Final Thoughts

I am a big fan of Geekbench and have been using it for several years. However, I was somewhat surprised by these results, which lead me to not recommend version 6 for mixed CPU comparisons involving support for AVX2 and non-AVX2 instructions. The results will be inaccurate, and under these circumstances, I can only recommend using Geekbench v5.

Given its closed-source nature, what would certainly take me a day to analyze, I spent several days, which is precisely why I will always advocate for the use of FOSS.

On a positive note, the binaries for Geekbench v6.2.0 were not stripped (fortunately! I hope this doesn't change). The project appears to be a well-written C++ code, which posed me an extra challenge to understand the generated asm code. The use of Intel TBB for multi-threading is also noteworthy.

ℹ️ Info: Although all the analysis discussed here was done in v6.2.0, the same issues also occur in the latest version, v6.3.0.

This post is licensed under CC BY 4.0 by the author.