Does V8's JIT Engine Truly Compete with GCC?
JavaScript is a language that has never particularly piqued my interest, as I prefer working at the binary, CPU, operating system level, etc. However, something that has truly caught my attention recently is V8, or more precisely, its ability to perform Just-in-time Compilation (JIT).
V8’s JIT is highly acclaimed for its speed and consistently ranks not far from the top in various benchmarks. While it might not always be the fastest, it is orders of magnitude ahead of purely interpreted languages such as Python (CPython), PHP, and Perl.
This article aims to address some of my questions: 1) Is it truly as remarkable as claimed? 2) What is the ASM (x86_64) generated by it? Is it close to that of a compiled language? 3) Can you debug this JIT-generated code?
The text reflects my personal opinion, with no affiliation to V8. Everything presented here is the result of my exploration over the past week. Please avoid drawing hasty conclusions or taking the content too seriously. I am still learning, and any assistance on the subject is highly appreciated.
My Environment
The environment for conducting my tests is as follows:
- Slackware 14.2-current, w/ Linux v5.4.186
- GCC v9.3.0 / Clang v14.0.6
- V8 and V8-debug v12.3.127, obtained from jsvu
- wasm2wat (git~1.0.34-36-gef851559), obtained from WABT: The WebAssembly Binary Toolkit
Let’s JIT!
All analyses conducted here will be based on the following code snippet, specifically the mul function:
mul.js:
const SIZE = 10000000
function mul(a,b,c) {
for (let i = 0; i < SIZE; i++)
a[i] = b[i] * c[i];
}
var a = Array(SIZE).fill(0)
var b = Array(SIZE).fill(50)
var c = Array(SIZE).fill(40)
const iter = Number(arguments[0]);
t0 = performance.now()
for (let i = 0; i < iter; i++)
mul(a,b,c)
t1 = performance.now()
console.log("Time: " + (t1-t0) + " ms");
The code is quite simple: it repeatedly multiplies two vectors, b and c, element by element, and stores the result in a. Two reasons led me to choose such simple code: 1) It is a straightforward and potentially optimizable piece of code. 2) I really intend to read the ASM generated by the JIT, and a small codebase potentially generates concise ASM code.
Basic and Optimizing Compilers
V8 does not jump straight to its best compiler: code first runs in cheap, non-optimizing tiers (the Ignition bytecode interpreter and the Sparkplug baseline compiler). If, during execution, V8 detects ‘hot’ functions (functions that run often enough to stress the CPU), the code is then recompiled with an Optimizing Compiler. One such compiler is Turbofan, and that’s what we are going to explore here.
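A rough way to see this warm-up from the outside is to time successive calls of the same function. The sketch below is only an illustration: the helper name timeCalls, the array size, and the number of calls are my own choices, and the timings it prints depend on the machine and the V8 version, so no particular pattern is guaranteed.

```javascript
// Sketch: time successive calls of a hot function to watch it warm up.
// `timeCalls`, the array size and the call count are illustrative choices.
function mul(a, b, c) {
  for (let i = 0; i < a.length; i++)
    a[i] = b[i] * c[i];
}

function timeCalls(calls, size) {
  const a = Array(size).fill(0);
  const b = Array(size).fill(50);
  const c = Array(size).fill(40);
  const timings = [];
  for (let n = 0; n < calls; n++) {
    const t0 = performance.now();
    mul(a, b, c);
    timings.push(performance.now() - t0);   // elapsed milliseconds for this call
  }
  return timings;
}

const timings = timeCalls(20, 100000);
timings.forEach((ms, n) => console.log("call " + n + ": " + ms.toFixed(3) + " ms"));
```

If the first few calls come out slower than the later ones, that is consistent with the function being promoted to a faster tier, although timer resolution and garbage collection can blur the picture.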
GDB Enters the Scene!
Assuming the tools are properly installed, it is possible to dump the machine instructions generated by Turbofan with:
$ v8-debug --print-opt-code mul.js -- 100
producing an output like:
--- Raw source ---
(a,b,c) {
for (let i = 0; i < SIZE; i++)
a[i] = b[i] * c[i];
}
--- Optimized code ---
optimization_id = 1
source_position = 36
kind = TURBOFAN
name = mul
stack_slots = 21
compiler = turbofan
address = 0x304a00002549
Instructions (size = 1208)
0x7fffa8005840 0 488d1df9ffffff REX.W leaq rbx,[rip+0xfffffff9]
0x7fffa8005847 7 483bd9 REX.W cmpq rbx,rcx
0x7fffa800584a a 740d jz 0x7fffa8005859 <+0x19>
[...]
(huge wall of 284 instructions, for 2 lines of code... good luck)
But that’s not very exciting:
- The code displayed on the screen is extensive and lacks much useful information about what is happening where.
- The line address is not very useful in a system that uses ASLR.
We truly need debugging capability to decipher all this code, and we will have it!
To use GDB within the JIT, we need to be a little clever: the instruction dump above is done before code execution, so we need to interrupt the v8 execution before that happens and then add breakpoints wherever we want.
In version v12.3.127 of v8, this happens inside the Disassemble() function in src/objects/code.cc, specifically at line 189, just after the DisassembleCodeRange() function call (you can check this on v8’s src). Additionally, the installed versions of v8 and v8-debug provided by jsvu are actually shell scripts for the real program.
In summary, what you really need to load your script and break just after the instruction dump is:
$ gdb \
-ex "set confirm off" \
-ex "b _start" \
-ex "r" \
-ex "b code.cc:189" \
-ex "c" \
--args $HOME/.jsvu/engines/v8-debug/v8-debug \
--snapshot_blob="$HOME/.jsvu/engines/v8-debug/snapshot_blob.bin" \
--print-opt-code \
mul.js -- 100
After that, simply set a breakpoint for some point of interest within your JITed function and analyze.
Below is the heavily commented code of the mul() function:
--- Raw source ---
(a,b,c) {
for (let i = 0; i < SIZE; i++)
a[i] = b[i] * c[i];
}
--- Optimized code ---
optimization_id = 1
source_position = 36
kind = TURBOFAN
name = mul
stack_slots = 21
compiler = turbofan
address = 0x304a00002549
Instructions (size = 1208)
[snip]
initial count' = 56651
mul:
count = rax
/-> 0x7fffa8005a80 240 488bc1 REX.W movq rax,rcx
|
| SIZE = mem[rbp-0x60]/2
| 0x7fffa8005a83 243 448b4da0 movl r9,[rbp-0x60]
| 0x7fffa8005a87 247 41d1f9 sarl r9, 1
|
| if (count < SIZE)
| 0x7fffa8005a8a 24a 413bc1 cmpl rax,r9
| 0x7fffa8005a8d 24d 0f8dc0000000 jge 0x7fffa8005b53 <+0x313> (NT)
| 0x7fffa8005a93 253 3bc6 cmpl rax,rsi
| 0x7fffa8005a95 255 0f8328020000 jnc 0x7fffa8005cc3 <+0x483> (NT)
|
| r9 = b[idx]
| 0x7fffa8005a9b 25b 458b4c8307 movl r9,[r11+rax*4+0x7]
| 0x7fffa8005aa0 260 41baffffffff movl r10,0xffffffff
| 0x7fffa8005aa6 266 4d3bca REX.W cmpq r9,r10
| 0x7fffa8005aa9 269 760d jna 0x7fffa8005ab8 <+0x278> (T) -\
| 0x7fffa8005aab 26b ba02000000 movl rdx,0x2 |
| 0x7fffa8005ab0 270 41ff95c0530000 call [r13+0x53c0] |
| 0x7fffa8005ab7 277 cc int3l |
| 0x7fffa8005ab8 278 3bc2 cmpl rax,rdx <-/
| 0x7fffa8005aba 27a 0f8307020000 jnc 0x7fffa8005cc7 <+0x487> (NT)
|
| r15 = a[idx]
| 0x7fffa8005ac0 280 458b7c8007 movl r15,[r8+rax*4+0x7]
| 0x7fffa8005ac5 285 41baffffffff movl r10,0xffffffff
| 0x7fffa8005acb 28b 4d3bfa REX.W cmpq r15,r10
| 0x7fffa8005ace 28e 760d jna 0x7fffa8005add <+0x29d> (T) -\
| 0x7fffa8005ad0 290 ba02000000 movl rdx,0x2 |
| 0x7fffa8005ad5 295 41ff95c0530000 call [r13+0x53c0] |
| 0x7fffa8005adc 29c cc int3l |
| 0x7fffa8005add 29d 41f6c101 testb r9,0x1 <-/
| 0x7fffa8005ae1 2a1 0f85e4010000 jnz 0x7fffa8005ccb <+0x48b> (NT)
|
| r9 /= 2
| 0x7fffa8005ae7 2a7 41d1f9 sarl r9, 1
| 0x7fffa8005aea 2aa 41f6c701 testb r15,0x1
| 0x7fffa8005aee 2ae 0f85db010000 jnz 0x7fffa8005ccf <+0x48f> (NT)
|
| r15 /= 2
| 0x7fffa8005af4 2b4 41d1ff sarl r15, 1
| 0x7fffa8005af7 2b7 418bd9 movl rbx,r9
| 0x7fffa8005afa 2ba 33c9 xorl rcx,rcx
|
| >>>> tmp = rbx*r15 <<<< (40 insns, 4 jumps)
| 0x7fffa8005afc 2bc 410fafdf imull rbx,r15
|
| if (!overflow(tmp))
| 0x7fffa8005b00 2c0 0f90c1 setol cl
| 0x7fffa8005b03 2c3 85c9 testl rcx,rcx
| 0x7fffa8005b05 2c5 0f85c8010000 jnz 0x7fffa8005cd3 <+0x493> (NT)
| 0x7fffa8005b0b 2cb 85db testl rbx,rbx
| 0x7fffa8005b0d 2cd 0f850c000000 jnz 0x7fffa8005b1f <+0x2df> (T) -\
| 0x7fffa8005b13 2d3 450bf9 orl r15,r9 |
| 0x7fffa8005b16 2d6 4585ff testl r15,r15 |
| 0x7fffa8005b19 2d9 0f8cb8010000 jl 0x7fffa8005cd7 <+0x497> |
| |
| if (count < SIZE) |
| 0x7fffa8005b1f 2df 4439e0 cmpl rax,r12 <-/
| 0x7fffa8005b22 2e2 0f83b3010000 jnc 0x7fffa8005cdb <+0x49b> (NT)
|
| tmp2 = tmp*2
| 0x7fffa8005b28 2e8 488bcb REX.W movq rcx,rbx
| 0x7fffa8005b2b 2eb 03cb addl rcx,rbx
|
| if (!overflow(tmp2))
| 0x7fffa8005b2d 2ed 0f80ac010000 jo 0x7fffa8005cdf <+0x49f> (NT)
|
| a[count] = tmp2
| 0x7fffa8005b33 2f3 894c8707 movl [rdi+rax*4+0x7],rcx
|
| count = rcx+1 (count++)
| 0x7fffa8005b37 2f7 488bc8 REX.W movq rcx,rax
| 0x7fffa8005b3a 2fa 83c101 addl rcx,0x1
|
| if (!overflow(count))
| 0x7fffa8005b3d 2fd 0f80a0010000 jo 0x7fffa8005ce3 <+0x4a3> (NT)
|
| if (!should_not_interrupt)
| StackGuard::address_of_interrupt_request
| 0x7fffa8005b43 303 41807db100 cmpb [r13-0x4f],0x0
|
\-- 0x7fffa8005b48 308 0f8432ffffff jz 0x7fffa8005a80 <+0x240> (T)
0x7fffa8005b4e 30e e99c000000 jmp 0x7fffa8005bef <+0x3af>
[snip]
The generated code may look complicated, but it’s actually quite simple. However, there are some interesting points worth noting: 1) Turbofan was indeed used for code optimization, just as expected. Note that this was intentional: the loop invoking the mul() function makes it ‘hot,’ prompting v8 to optimize it. 2) The loop starts from index 56651 instead of 0, and the reason for this is straightforward: the optimized code starts from where the non-optimized code left off, which makes sense, doesn’t it? 3) Notice that the read values are divided by 2, and when saved, multiplied by 2. In memory, the vector b is stored as a sequence of 100 (instead of 50), and c as a sequence of 80 (instead of 40)… don’t ask me why. Any ideas?
Apart from that, traditional overflow checks are performed, and the loop proceeds, performing one multiplication per iteration, with 40 instructions and 4 branches taken between each imull.
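One likely explanation for the doubled values is V8's small-integer ("Smi") representation: a small integer is stored shifted left by one bit, so that the lowest bit can tell integers (0) apart from heap pointers (1). That would make 50 show up as 100 in memory, and it matches the testb/sarl pairs before the imull and the addl after it. The sketch below models the loop body under that assumption. The function names and the use of exceptions to stand in for deoptimization are mine, not V8's.

```javascript
// Model of the checks around the imull, assuming 31-bit Smi tagging.
// "Deopt" is represented by throwing; real V8 jumps to a deoptimization exit.
const INT32_MIN = -(2 ** 31);
const INT32_MAX = 2 ** 31 - 1;

function smiTag(value) {            // addl rcx,rbx ; jo <deopt>
  const tagged = value * 2;
  if (tagged < INT32_MIN || tagged > INT32_MAX)
    throw new Error("deopt: value does not fit in a Smi");
  return tagged;
}

function smiUntag(word) {           // testb reg,0x1 ; jnz <deopt> ; sarl reg,1
  if (word & 1)
    throw new Error("deopt: heap pointer, not a Smi");
  return word >> 1;
}

function checkedMul(taggedB, taggedC) {
  const x = smiUntag(taggedB);
  const y = smiUntag(taggedC);
  const product = x * y;            // imull rbx,r15
  if (product < INT32_MIN || product > INT32_MAX)
    throw new Error("deopt: int32 overflow");        // seto ; jnz <deopt>
  if (product === 0 && (x | y) < 0)
    throw new Error("deopt: result would be -0");    // orl r15,r9 ; jl <deopt>
  return smiTag(product);
}

console.log(smiTag(50), smiTag(40));   // 100 80
console.log(checkedMul(100, 80));      // 4000, the tagged form of 2000
```

Read this way, the "divide by 2" and "multiply by 2" steps are just untagging and retagging, and most of the remaining branches are guards that bail out when an element is not a small integer.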
In the end, with this being the best that Turbofan can do for this code, I was slightly disappointed. I expected loop unrolling, perhaps SIMD, and so on. Well, we’ll have surprises later, don’t be sad =).
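For readers unfamiliar with the term, loop unrolling means doing several elements' worth of work per trip through the loop, so the counter update and the branch are paid less often. A hand-written sketch of the idea follows. The name mulUnrolled4 and the explicit size parameter are made up for the example, and this is not code that Turbofan produced.

```javascript
// Hand-unrolled variant of mul(): four elements per loop iteration.
// `mulUnrolled4` is a made-up name; this is not what Turbofan emitted.
function mulUnrolled4(a, b, c, size) {
  let i = 0;
  const limit = size - (size % 4);     // largest multiple of 4 not above size
  for (; i < limit; i += 4) {
    a[i]     = b[i]     * c[i];
    a[i + 1] = b[i + 1] * c[i + 1];
    a[i + 2] = b[i + 2] * c[i + 2];
    a[i + 3] = b[i + 3] * c[i + 3];
  }
  for (; i < size; i++)                // leftover elements when size % 4 != 0
    a[i] = b[i] * c[i];
}

const a = Array(10).fill(0);
mulUnrolled4(a, Array(10).fill(50), Array(10).fill(40), 10);
console.log(a.join(","));
```

Whether unrolling by hand actually helps under V8 is a separate question that would need measuring; the sketch only shows the shape of the transformation.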
GCC Enters the Chat!
Having seen the previous JIT code, what code does GCC produce? Is it somewhat equivalent? Are there optimizations? Let’s find out.
The equivalent C code looks like this:
mul.c:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <inttypes.h>
#include <time.h>
#define SIZE 10000000
static int64_t difftimespec_us(
const struct timespec end, const struct timespec start)
{
return ((int64_t)end.tv_sec - (int64_t)start.tv_sec) * (int64_t)1000000
+ ((int64_t)end.tv_nsec - (int64_t)start.tv_nsec) / 1000;
}
void mul(int *restrict a, int *restrict b, int *restrict c)
{
for (size_t i = 0; i < SIZE; i++)
a[i] = b[i] * c[i];
}
int main(int argc, char **argv)
{
int64_t diff;
struct timespec t0, t1;
int *a = malloc(sizeof(*a) * SIZE);
int *b = malloc(sizeof(*b) * SIZE);
int *c = malloc(sizeof(*c) * SIZE);
memset(a, 0, sizeof(*a) * SIZE);
for (size_t i = 0; i < SIZE; i++) {
b[i] = 50;
c[i] = 40;
}
int iter = atoi(argv[1]);
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < iter; i++)
mul(a,b,c);
clock_gettime(CLOCK_MONOTONIC, &t1);
diff = difftimespec_us(t1, t0);
printf("Time: %f ms\n", diff/1000.0);
return (0);
}
which produces the following asm when built with -O0 on GCC v9.3.0:
void mul(int *restrict a, int *restrict b, int *restrict c)
{
4011be: 55 push %rbp
4011bf: 48 89 e5 mov %rsp,%rbp
4011c2: 48 89 7d e8 mov %rdi,-0x18(%rbp)
4011c6: 48 89 75 e0 mov %rsi,-0x20(%rbp)
4011ca: 48 89 55 d8 mov %rdx,-0x28(%rbp)
for (size_t i = 0; i < SIZE; i++)
4011ce: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
4011d5: 00
4011d6: /-- eb 47 jmp 40121f <mul+0x61>
a[i] = b[i] * c[i];
4011d8: /--|-> 48 8b 45 f8 mov -0x8(%rbp),%rax
4011dc: | | 48 8d 14 85 00 00 00 lea 0x0(,%rax,4),%rdx
4011e3: | | 00
4011e4: | | 48 8b 45 e0 mov -0x20(%rbp),%rax
4011e8: | | 48 01 d0 add %rdx,%rax
4011eb: | | 8b 08 mov (%rax),%ecx ecx = b[i]
4011ed: | | 48 8b 45 f8 mov -0x8(%rbp),%rax
4011f1: | | 48 8d 14 85 00 00 00 lea 0x0(,%rax,4),%rdx
4011f8: | | 00
4011f9: | | 48 8b 45 d8 mov -0x28(%rbp),%rax
4011fd: | | 48 01 d0 add %rdx,%rax
401200: | | 8b 00 mov (%rax),%eax eax = c[i]
401202: | | 48 8b 55 f8 mov -0x8(%rbp),%rdx
401206: | | 48 8d 34 95 00 00 00 lea 0x0(,%rdx,4),%rsi
40120d: | | 00
40120e: | | 48 8b 55 e8 mov -0x18(%rbp),%rdx
401212: | | 48 01 f2 add %rsi,%rdx
401215: | | 0f af c1 imul %ecx,%eax eax = eax*ecx
401218: | | 89 02 mov %eax,(%rdx) a[i] = eax
for (size_t i = 0; i < SIZE; i++)
40121a: | | 48 83 45 f8 01 addq $0x1,-0x8(%rbp) i++;
40121f: | \-> 48 81 7d f8 7f 96 98 cmpq $0x98967f,-0x8(%rbp) if (i < count)
401226: | 00
401227: \----- 76 af jbe 4011d8 <mul+0x1a> (T)
}
401229: 90 nop
40122a: 90 nop
40122b: 5d pop %rbp
40122c: c3 ret
The result is similar to the one generated by the JIT, but much simpler and more straightforward, with only 18 instructions and a jump taken between each imul.
Can GCC go beyond this? Certainly: with -O3 and -march=native, GCC easily makes use of AVX2:
clock_gettime(CLOCK_MONOTONIC, &t0);
40111d: 48 8d 75 b0 lea -0x50(%rbp),%rsi
401121: bf 01 00 00 00 mov $0x1,%edi
401126: 41 89 c6 mov %eax,%r14d
401129: e8 02 ff ff ff call 401030 <clock_gettime@plt>
for (int i = 0; i < iter; i++)
40112e: 45 85 ff test %r15d,%r15d
401131: /-------- 7e 36 jle 401169 <main+0xe9>
401133: | 31 d2 xor %edx,%edx
401135: | /----> 31 c0 xor %eax,%eax
401137: | | 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40113e: | | 00 00
a[i] = b[i] * c[i];
401140: | | /-> c4 c1 7e 6f 14 04 vmovdqu (%r12,%rax,1),%ymm2
401146: | | | c4 e2 6d 40 04 03 vpmulld (%rbx,%rax,1),%ymm2,%ymm0
40114c: | | | c4 c1 7e 7f 44 05 00 vmovdqu %ymm0,0x0(%r13,%rax,1)
for (size_t i = 0; i < SIZE; i++)
401153: | | | 48 83 c0 20 add $0x20,%rax
401157: | | | 48 3d 00 5a 62 02 cmp $0x2625a00,%rax
40115d: | | \-- 75 e1 jne 401140 <main+0xc0>
for (int i = 0; i < iter; i++)
40115f: | | ff c2 inc %edx
401161: | | 44 39 f2 cmp %r14d,%edx
401164: | \----- 75 cf jne 401135 <main+0xb5>
401166: | c5 f8 77 vzeroupper
mul(a,b,c);
clock_gettime(CLOCK_MONOTONIC, &t1);
401169: \-------> 48 8d 75 c0 lea -0x40(%rbp),%rsi
40116d: bf 01 00 00 00 mov $0x1,%edi
401172: e8 b9 fe ff ff call 401030 <clock_gettime@plt>
The mul() function was inlined, and now 8 numbers are read and multiplied at a time!
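The "8 at a time" figure can be read straight off the constants in the inner loop. The snippet below just redoes that arithmetic. The constant names are mine, and it assumes a 4-byte int, which holds for this x86_64 build.

```javascript
// Decode the constants of the -O3 inner loop (values copied from the listing).
const SIZE = 10000000;       // elements per array, from mul.c
const INT_BYTES = 4;         // sizeof(int) on x86_64 (assumed)
const YMM_BYTES = 0x20;      // add $0x20,%rax : one 256-bit ymm register

const lanes = YMM_BYTES / INT_BYTES;         // ints handled by one vpmulld
const endOffset = SIZE * INT_BYTES;          // cmp $0x2625a00,%rax
const iterations = endOffset / YMM_BYTES;    // trips through the inner loop

console.log("ints per iteration:", lanes);                  // 8
console.log("end offset: 0x" + endOffset.toString(16));     // 0x2625a00
console.log("inner-loop iterations:", iterations);          // 1250000
```

So %rax is a byte offset rather than an element index: it advances 32 bytes per iteration and the loop stops once it has covered all 40,000,000 bytes of each array.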
Regarding performance: A comparative analysis will be conducted later, don’t worry.
WebAssembly Comes to the Rescue
Fair enough, it’s understandable that JIT-compiled code won’t really match up to GCC with -O3, let alone with -march=native: analyzing whether code is ‘hot’ or not, compiling and executing at runtime, and still finding a balance between compilation time and performance seems quite complicated. Moreover, JS being a dynamically typed language imposes more constraints on the code to be compiled and executed (a problem that Java, for example, doesn’t have).
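To make the dynamic-typing point concrete, here is what a single source-level multiplication has to be prepared for. The helper mulOne is only for illustration. The results follow from the language semantics, while the mapping from these cases to specific guards in the JIT dump is my own reading.

```javascript
// The same source-level `*` must cope with very different operands.
function mulOne(x, y) {
  return x * y;
}

console.log(mulOne(50, 40));                 // 2000: small integers, the fast path
console.log(mulOne(0.5, 4));                 // 2: doubles
console.log(mulOne("3", "4"));               // 12: strings are converted to numbers
console.log(mulOne(2 ** 30, 4));             // 4294967296: no longer fits in 32 bits
console.log(Object.is(mulOne(-1, 0), -0));   // true: the result is -0, not 0
```

An optimizing compiler that wants to emit a plain integer multiply has to rule out every case except the first, which is where the type, overflow, and zero checks come from.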
On the other hand, WebAssembly is compiled ‘ahead-of-time’, similar to Java or C. It has static typing (which avoids the compile-recompile cycle of JS code), its bytecode is smaller than plain-text JS, and it even skips the parsing stage of the source code. All of this creates a very favorable scenario for optimization by Clang.
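As a small illustration of what static typing buys, the sketch below builds a tiny WebAssembly module by hand and calls it from JS. The byte array is written out manually for the example and is an assumption of mine, unrelated to the mul.wasm used later; it exports a single function mul that takes two i32 values and returns an i32.

```javascript
// A hand-assembled WebAssembly module exporting mul(i32, i32) -> i32.
// The bytes are written by hand for illustration, not produced by clang.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // "\0asm", version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,  // type 0: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                // function 0 has type 0
  0x07, 0x07, 0x01, 0x03, 0x6d, 0x75, 0x6c, 0x00, 0x00,  // export "mul" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                          // code: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6c, 0x0b,                    // local.get 0, local.get 1, i32.mul, end
]);

const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes), {});
const wasmMul = instance.exports.mul;

console.log(wasmMul(50, 40));       // 2000
console.log(wasmMul(2 ** 30, 4));   // 0: i32 arithmetic wraps around
```

Inside the module every operand is known to be a 32-bit integer, so i32.mul needs no type check and overflow simply wraps instead of promoting the result to a double.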
Having said that, there are high expectations for WebAssembly. Let’s see what we can achieve.
The new code for this test is as follows:
load.js:
const buf = read('mul.wasm', 'binary');
const mod = new WebAssembly.Module(buf);
var imports = {
env: {
log_time: function(arg) {
console.log("Time elapsed: " + arg + " ms");
},
performance_now: function() {
return performance.now();
},
}
};
const instance = new WebAssembly.Instance(mod, imports);
const { do_mul } = instance.exports;
do_mul(100);
mul_wasm.c:
#include <stdint.h>
#include <stddef.h>
extern double performance_now();
extern void log_time(double);
#define SIZE 10000000
int a[SIZE];
int b[SIZE];
int c[SIZE];
void mul(int *restrict a, int *restrict b, int *restrict c)
{
for (size_t i = 0; i < SIZE; i++)
a[i] = b[i] * c[i];
}
void do_mul(int iter)
{
double t0, t1;
for (size_t i = 0; i < SIZE; i++) {
b[i] = 50;
c[i] = 40;
}
t0 = performance_now();
for (int i = 0; i < iter; i++)
mul(a,b,c);
t1 = performance_now();
log_time(t1-t0);
}
The C code is basically identical to the previous one, with the only difference being the JS function calls, necessary to obtain timings.
The above example can be compiled with:
$ clang \
--target=wasm32 \
--no-standard-libraries \
-Wl,--no-entry \
-Wl,--export=do_mul \
-Wl,--allow-undefined \
-g -o mul.wasm mul_wasm.c -O3
This time, I’ll be using -O3 to see what WebAssembly can do best…
Extracting WebAssembly’s JIT Code & Analyzing
Running under GDB, dumping the JITed ASM for WebAssembly, and debugging it all follow a procedure similar to the previous one, with slight adaptations. This time, the v8 function responsible for dumping WebAssembly instructions is WasmCode::Disassemble() from the file wasm-code-manager.cc. Specifically, we need to add a break at line 436.
In summary, something like:
$ gdb \
-ex "set confirm off" \
-ex "b _start" \
-ex "r" \
-ex "b wasm-code-manager.cc:436" \
-ex "c" \
--args $HOME/.jsvu/engines/v8-debug/v8-debug \
--snapshot_blob="$HOME/.jsvu/engines/v8-debug/snapshot_blob.bin" \
--allow-natives-syntax \
--print-wasm-code \
--no-liftoff \
load.js
Note some new flags: --no-liftoff (to skip Liftoff, the baseline WebAssembly compiler, and force the use of Turbofan) and --print-wasm-code instead of --print-opt-code.
Below is the heavily commented code of the mul() function:
--- WebAssembly code ---
name: do_mul
index: 2
kind: wasm function
compiler: TurboFan
Body (size = 3712 = 3680 + 32 padding)
Instructions (size = 3660)
r9 = count
r9 = starts at 0xFFFF_FFFF - 40_000_000
mul:
[snip]
/-> 0x1c4750719e00 600 493b65a0 REX.W cmpq rsp,[r13-0x60]
| 0x1c4750719e04 604 0f860f080000 jna 0x1c475071a619 <+0xe19> (NT)
| 0x1c4750719e0a 60a 458d9900b8c404 leal r11,[r9+0x4c4b800]
| 0x1c4750719e11 611 41baffffffff movl r10,0xffffffff
| 0x1c4750719e17 617 4d3bda REX.W cmpq r11,r10
| 0x1c4750719e1a 61a 761d jna 0x1c4750719e39 <+0x639> (T) -\
| 0x1c4750719e1c 61c bf01000000 movl rdi,0x1 |
| 0x1c4750719e21 621 4989e2 REX.W movq r10,rsp |
| 0x1c4750719e24 624 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719e28 628 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719e2c 62c 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719e30 630 488b050dfaffff REX.W movq rax,[rip+0xfffffa0d] |
| 0x1c4750719e37 637 ffd0 call rax |
| 0x1c4750719e39 639 458da1005e6202 leal r12,[r9+0x2625e00] <-/
| 0x1c4750719e40 640 41baffffffff movl r10,0xffffffff
| 0x1c4750719e46 646 4d3be2 REX.W cmpq r12,r10
| 0x1c4750719e49 649 761d jna 0x1c4750719e68 <+0x668> (T) -\
| 0x1c4750719e4b 64b bf01000000 movl rdi,0x1 |
| 0x1c4750719e50 650 4989e2 REX.W movq r10,rsp |
| 0x1c4750719e53 653 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719e57 657 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719e5b 65b 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719e5f 65f 488b05def9ffff REX.W movq rax,[rip+0xfffff9de] |
| 0x1c4750719e66 666 ffd0 call rax |
| |
| r11 = c[idx] |
| 0x1c4750719e68 668 468b1c1a movl r11,[rdx+r11*1] <-/
| 0x1c4750719e6c 66c 41baffffffff movl r10,0xffffffff
|
| if (r11 < 32bit)
| 0x1c4750719e72 672 4d3bda REX.W cmpq r11,r10
| 0x1c4750719e75 675 761d jna 0x1c4750719e94 <+0x694> (T) -\
| 0x1c4750719e77 677 bf01000000 movl rdi,0x1 |
| 0x1c4750719e7c 67c 4989e2 REX.W movq r10,rsp |
| 0x1c4750719e7f 67f 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719e83 683 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719e87 687 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719e8b 68b 488b05b2f9ffff REX.W movq rax,[rip+0xfffff9b2] |
| 0x1c4750719e92 692 ffd0 call rax |
| |
| r12 = b[idx] |
| 0x1c4750719e94 694 468b2422 movl r12,[rdx+r12*1] <-/
| 0x1c4750719e98 698 41baffffffff movl r10,0xffffffff
|
| if (r12 < 32bit)
| 0x1c4750719e9e 69e 4d3be2 REX.W cmpq r12,r10
| 0x1c4750719ea1 6a1 761d jna 0x1c4750719ec0 <+0x6c0> (T) -\
| 0x1c4750719ea3 6a3 bf01000000 movl rdi,0x1 |
| 0x1c4750719ea8 6a8 4989e2 REX.W movq r10,rsp |
| 0x1c4750719eab 6ab 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719eaf 6af 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719eb3 6b3 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719eb7 6b7 488b0586f9ffff REX.W movq rax,[rip+0xfffff986] |
| 0x1c4750719ebe 6be ffd0 call rax |
| 0x1c4750719ec0 6c0 458db900122707 leal r15,[r9+0x7271200] <-/
| 0x1c4750719ec7 6c7 41baffffffff movl r10,0xffffffff
| 0x1c4750719ecd 6cd 4d3bfa REX.W cmpq r15,r10
| 0x1c4750719ed0 6d0 761d jna 0x1c4750719eef <+0x6ef> (T) -\
| 0x1c4750719ed2 6d2 bf01000000 movl rdi,0x1 |
| 0x1c4750719ed7 6d7 4989e2 REX.W movq r10,rsp |
| 0x1c4750719eda 6da 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719ede 6de 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719ee2 6e2 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719ee6 6e6 488b0557f9ffff REX.W movq rax,[rip+0xfffff957] |
| 0x1c4750719eed 6ed ffd0 call rax |
| |
| |
| >>>> r11 = r11*r12 <<<< |
| 0x1c4750719eef 6ef 450fafdc imull r11,r12 <-/
|
| 0x1c4750719ef3 6f3 458da104b8c404 leal r12,[r9+0x4c4b804]
| 0x1c4750719efa 6fa 41baffffffff movl r10,0xffffffff
| 0x1c4750719f00 700 4d3be2 REX.W cmpq r12,r10
| 0x1c4750719f03 703 761d jna 0x1c4750719f22 <+0x722> (T) -\
| 0x1c4750719f05 705 bf01000000 movl rdi,0x1 |
| 0x1c4750719f0a 70a 4989e2 REX.W movq r10,rsp |
| 0x1c4750719f0d 70d 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719f11 711 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719f15 715 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719f19 719 488b0524f9ffff REX.W movq rax,[rip+0xfffff924] |
| 0x1c4750719f20 720 ffd0 call rax |
| |
| a[idx] = r11 |
| 0x1c4750719f22 722 46891c3a movl [rdx+r15*1],r11 <-/
|
| 0x1c4750719f26 726 458d99045e6202 leal r11,[r9+0x2625e04]
| 0x1c4750719f2d 72d 41baffffffff movl r10,0xffffffff
| 0x1c4750719f33 733 4d3bda REX.W cmpq r11,r10
| 0x1c4750719f36 736 761d jna 0x1c4750719f55 <+0x755> (T) -\
| 0x1c4750719f38 738 bf01000000 movl rdi,0x1 |
| 0x1c4750719f3d 73d 4989e2 REX.W movq r10,rsp |
| 0x1c4750719f40 740 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719f44 744 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719f48 748 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719f4c 74c 488b05f1f8ffff REX.W movq rax,[rip+0xfffff8f1] |
| 0x1c4750719f53 753 ffd0 call rax |
| |
| r12 = c[idx] |
| 0x1c4750719f55 755 468b2422 movl r12,[rdx+r12*1] <-/
| 0x1c4750719f59 759 41baffffffff movl r10,0xffffffff
| 0x1c4750719f5f 75f 4d3be2 REX.W cmpq r12,r10
| 0x1c4750719f62 762 761d jna 0x1c4750719f81 <+0x781> (T) -\
| 0x1c4750719f64 764 bf01000000 movl rdi,0x1 |
| 0x1c4750719f69 769 4989e2 REX.W movq r10,rsp |
| 0x1c4750719f6c 76c 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719f70 770 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719f74 774 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719f78 778 488b05c5f8ffff REX.W movq rax,[rip+0xfffff8c5] |
| 0x1c4750719f7f 77f ffd0 call rax |
| 0x1c4750719f81 781 468b1c1a movl r11,[rdx+r11*1] <-/
| 0x1c4750719f85 785 41baffffffff movl r10,0xffffffff
| 0x1c4750719f8b 78b 4d3bda REX.W cmpq r11,r10
| 0x1c4750719f8e 78e 761d jna 0x1c4750719fad <+0x7ad> (T) -\
| 0x1c4750719f90 790 bf01000000 movl rdi,0x1 |
| 0x1c4750719f95 795 4989e2 REX.W movq r10,rsp |
| 0x1c4750719f98 798 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719f9c 79c 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719fa0 7a0 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719fa4 7a4 488b0599f8ffff REX.W movq rax,[rip+0xfffff899] |
| 0x1c4750719fab 7ab ffd0 call rax |
| 0x1c4750719fad 7ad 458db904122707 leal r15,[r9+0x7271204] <-/
| 0x1c4750719fb4 7b4 41baffffffff movl r10,0xffffffff
| 0x1c4750719fba 7ba 4d3bfa REX.W cmpq r15,r10
| 0x1c4750719fbd 7bd 761d jna 0x1c4750719fdc <+0x7dc> (T) -\
| 0x1c4750719fbf 7bf bf01000000 movl rdi,0x1 |
| 0x1c4750719fc4 7c4 4989e2 REX.W movq r10,rsp |
| 0x1c4750719fc7 7c7 4883ec08 REX.W subq rsp,0x8 |
| 0x1c4750719fcb 7cb 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c4750719fcf 7cf 4c891424 REX.W movq [rsp],r10 |
| 0x1c4750719fd3 7d3 488b056af8ffff REX.W movq rax,[rip+0xfffff86a] |
| 0x1c4750719fda 7da ffd0 call rax |
| |
| >>>> r12 = r12*r11 <<<< |
| 0x1c4750719fdc 7dc 450fafe3 imull r12,r11 <-/
|
| a[idx] = r12
| 0x1c4750719fe0 7e0 4689243a movl [rdx+r15*1],r12
|
| count += 8
| 0x1c4750719fe4 7e4 4183c108 addl r9,0x8
| if (count == 0) (wraparound)
| 0x1c4750719fe8 7e8 0f84c8030000 jz 0x1c475071a3b6 <+0xbb6> (NT)
|
| 0x1c4750719fee 7ee 458d9900b8c404 leal r11,[r9+0x4c4b800]
| 0x1c4750719ff5 7f5 41baffffffff movl r10,0xffffffff
| 0x1c4750719ffb 7fb 4d3bda REX.W cmpq r11,r10
| 0x1c4750719ffe 7fe 761d jna 0x1c475071a01d <+0x81d> (T) -\
| 0x1c475071a000 800 bf01000000 movl rdi,0x1 |
| 0x1c475071a005 805 4989e2 REX.W movq r10,rsp |
| 0x1c475071a008 808 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a00c 80c 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a010 810 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a014 814 488b0529f8ffff REX.W movq rax,[rip+0xfffff829] |
| 0x1c475071a01b 81b ffd0 call rax |
| 0x1c475071a01d 81d 458da1005e6202 leal r12,[r9+0x2625e00] <-/
| 0x1c475071a024 824 41baffffffff movl r10,0xffffffff
| 0x1c475071a02a 82a 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a02d 82d 761d jna 0x1c475071a04c <+0x84c> (T) -\
| 0x1c475071a02f 82f bf01000000 movl rdi,0x1 |
| 0x1c475071a034 834 4989e2 REX.W movq r10,rsp |
| 0x1c475071a037 837 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a03b 83b 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a03f 83f 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a043 843 488b05faf7ffff REX.W movq rax,[rip+0xfffff7fa] |
| 0x1c475071a04a 84a ffd0 call rax |
| |
| r11 = c[idx] |
| 0x1c475071a04c 84c 468b1c1a movl r11,[rdx+r11*1] <-/
| 0x1c475071a050 850 41baffffffff movl r10,0xffffffff
| 0x1c475071a056 856 4d3bda REX.W cmpq r11,r10
| 0x1c475071a059 859 761d jna 0x1c475071a078 <+0x878> (T) -\
| 0x1c475071a05b 85b bf01000000 movl rdi,0x1 |
| 0x1c475071a060 860 4989e2 REX.W movq r10,rsp |
| 0x1c475071a063 863 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a067 867 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a06b 86b 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a06f 86f 488b05cef7ffff REX.W movq rax,[rip+0xfffff7ce] |
| 0x1c475071a076 876 ffd0 call rax |
| |
| r12 = b[idx] |
| 0x1c475071a078 878 468b2422 movl r12,[rdx+r12*1] <-/
|
| 0x1c475071a07c 87c 41baffffffff movl r10,0xffffffff
| 0x1c475071a082 882 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a085 885 761d jna 0x1c475071a0a4 <+0x8a4> (T) -\
| 0x1c475071a087 887 bf01000000 movl rdi,0x1 |
| 0x1c475071a08c 88c 4989e2 REX.W movq r10,rsp |
| 0x1c475071a08f 88f 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a093 893 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a097 897 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a09b 89b 488b05a2f7ffff REX.W movq rax,[rip+0xfffff7a2] |
| 0x1c475071a0a2 8a2 ffd0 call rax |
| 0x1c475071a0a4 8a4 458db900122707 leal r15,[r9+0x7271200] <-/
| 0x1c475071a0ab 8ab 41baffffffff movl r10,0xffffffff
| 0x1c475071a0b1 8b1 4d3bfa REX.W cmpq r15,r10
| 0x1c475071a0b4 8b4 761d jna 0x1c475071a0d3 <+0x8d3> (T) -\
| 0x1c475071a0b6 8b6 bf01000000 movl rdi,0x1 |
| 0x1c475071a0bb 8bb 4989e2 REX.W movq r10,rsp |
| 0x1c475071a0be 8be 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a0c2 8c2 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a0c6 8c6 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a0ca 8ca 488b0573f7ffff REX.W movq rax,[rip+0xfffff773] |
| 0x1c475071a0d1 8d1 ffd0 call rax |
| |
| >>>> r11 = r11*r12 <<<< |
| 0x1c475071a0d3 8d3 450fafdc imull r11,r12 <-/
| 0x1c475071a0d7 8d7 458da104b8c404 leal r12,[r9+0x4c4b804]
| 0x1c475071a0de 8de 41baffffffff movl r10,0xffffffff
| 0x1c475071a0e4 8e4 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a0e7 8e7 761d jna 0x1c475071a106 <+0x906> (T) -\
| 0x1c475071a0e9 8e9 bf01000000 movl rdi,0x1 |
| 0x1c475071a0ee 8ee 4989e2 REX.W movq r10,rsp |
| 0x1c475071a0f1 8f1 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a0f5 8f5 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a0f9 8f9 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a0fd 8fd 488b0540f7ffff REX.W movq rax,[rip+0xfffff740] |
| 0x1c475071a104 904 ffd0 call rax |
| |
| a[idx] = r11 |
| 0x1c475071a106 906 46891c3a movl [rdx+r15*1],r11 <-/
|
| 0x1c475071a10a 90a 458d99045e6202 leal r11,[r9+0x2625e04]
| 0x1c475071a111 911 41baffffffff movl r10,0xffffffff
| 0x1c475071a117 917 4d3bda REX.W cmpq r11,r10
| 0x1c475071a11a 91a 761d jna 0x1c475071a139 <+0x939> (T) -\
| 0x1c475071a11c 91c bf01000000 movl rdi,0x1 |
| 0x1c475071a121 921 4989e2 REX.W movq r10,rsp |
| 0x1c475071a124 924 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a128 928 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a12c 92c 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a130 930 488b050df7ffff REX.W movq rax,[rip+0xfffff70d] |
| 0x1c475071a137 937 ffd0 call rax |
| |
| r12 = c[idx] |
| 0x1c475071a139 939 468b2422 movl r12,[rdx+r12*1] <-/
|
| 0x1c475071a13d 93d 41baffffffff movl r10,0xffffffff
| 0x1c475071a143 943 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a146 946 761d jna 0x1c475071a165 <+0x965> (T) -\
| 0x1c475071a148 948 bf01000000 movl rdi,0x1 |
| 0x1c475071a14d 94d 4989e2 REX.W movq r10,rsp |
| 0x1c475071a150 950 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a154 954 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a158 958 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a15c 95c 488b05e1f6ffff REX.W movq rax,[rip+0xfffff6e1] |
| 0x1c475071a163 963 ffd0 call rax |
| |
| r11 = b[idx] |
| 0x1c475071a165 965 468b1c1a movl r11,[rdx+r11*1] <-/
|
| 0x1c475071a169 969 41baffffffff movl r10,0xffffffff
| 0x1c475071a16f 96f 4d3bda REX.W cmpq r11,r10
| 0x1c475071a172 972 761d jna 0x1c475071a191 <+0x991> (T) -\
| 0x1c475071a174 974 bf01000000 movl rdi,0x1 |
| 0x1c475071a179 979 4989e2 REX.W movq r10,rsp |
| 0x1c475071a17c 97c 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a180 980 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a184 984 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a188 988 488b05b5f6ffff REX.W movq rax,[rip+0xfffff6b5] |
| 0x1c475071a18f 98f ffd0 call rax |
| 0x1c475071a191 991 458db904122707 leal r15,[r9+0x7271204] <-/
| 0x1c475071a198 998 41baffffffff movl r10,0xffffffff
| 0x1c475071a19e 99e 4d3bfa REX.W cmpq r15,r10
| 0x1c475071a1a1 9a1 761d jna 0x1c475071a1c0 <+0x9c0> (T) -\
| 0x1c475071a1a3 9a3 bf01000000 movl rdi,0x1 |
| 0x1c475071a1a8 9a8 4989e2 REX.W movq r10,rsp |
| 0x1c475071a1ab 9ab 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a1af 9af 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a1b3 9b3 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a1b7 9b7 488b0586f6ffff REX.W movq rax,[rip+0xfffff686] |
| 0x1c475071a1be 9be ffd0 call rax |
| |
| >>>> r12 = r12*r11 <<<< |
| 0x1c475071a1c0 9c0 450fafe3 imull r12,r11 <-/
| a[idx] = r12
| 0x1c475071a1c4 9c4 4689243a movl [rdx+r15*1],r12
|
| count += 8
| 0x1c475071a1c8 9c8 4183c108 addl r9,0x8
| if (count == 0) (wraparound)
| 0x1c475071a1cc 9cc 0f84e4010000 jz 0x1c475071a3b6 <+0xbb6> (NT)
|
| 0x1c475071a1d2 9d2 458d9900b8c404 leal r11,[r9+0x4c4b800]
| 0x1c475071a1d9 9d9 41baffffffff movl r10,0xffffffff
| 0x1c475071a1df 9df 4d3bda REX.W cmpq r11,r10
| 0x1c475071a1e2 9e2 761d jna 0x1c475071a201 <+0xa01> (T) -\
| 0x1c475071a1e4 9e4 bf01000000 movl rdi,0x1 |
| 0x1c475071a1e9 9e9 4989e2 REX.W movq r10,rsp |
| 0x1c475071a1ec 9ec 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a1f0 9f0 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a1f4 9f4 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a1f8 9f8 488b0545f6ffff REX.W movq rax,[rip+0xfffff645] |
| 0x1c475071a1ff 9ff ffd0 call rax |
| 0x1c475071a201 a01 458da1005e6202 leal r12,[r9+0x2625e00] <-/
| 0x1c475071a208 a08 41baffffffff movl r10,0xffffffff
| 0x1c475071a20e a0e 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a211 a11 761d jna 0x1c475071a230 <+0xa30> (T) -\
| 0x1c475071a213 a13 bf01000000 movl rdi,0x1 |
| 0x1c475071a218 a18 4989e2 REX.W movq r10,rsp |
| 0x1c475071a21b a1b 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a21f a1f 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a223 a23 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a227 a27 488b0516f6ffff REX.W movq rax,[rip+0xfffff616] |
| 0x1c475071a22e a2e ffd0 call rax |
| |
| r11 = c[idx] |
| 0x1c475071a230 a30 468b1c1a movl r11,[rdx+r11*1] <-/
|
| 0x1c475071a234 a34 41baffffffff movl r10,0xffffffff
| 0x1c475071a23a a3a 4d3bda REX.W cmpq r11,r10
| 0x1c475071a23d a3d 761d jna 0x1c475071a25c <+0xa5c> (T) -\
| 0x1c475071a23f a3f bf01000000 movl rdi,0x1 |
| 0x1c475071a244 a44 4989e2 REX.W movq r10,rsp |
| 0x1c475071a247 a47 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a24b a4b 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a24f a4f 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a253 a53 488b05eaf5ffff REX.W movq rax,[rip+0xfffff5ea] |
| 0x1c475071a25a a5a ffd0 call rax |
| |
| r12 = b[idx] |
| 0x1c475071a25c a5c 468b2422 movl r12,[rdx+r12*1] <-/
|
| 0x1c475071a260 a60 41baffffffff movl r10,0xffffffff
| 0x1c475071a266 a66 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a269 a69 761d jna 0x1c475071a288 <+0xa88> (T) -\
| 0x1c475071a26b a6b bf01000000 movl rdi,0x1 |
| 0x1c475071a270 a70 4989e2 REX.W movq r10,rsp |
| 0x1c475071a273 a73 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a277 a77 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a27b a7b 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a27f a7f 488b05bef5ffff REX.W movq rax,[rip+0xfffff5be] |
| 0x1c475071a286 a86 ffd0 call rax |
| 0x1c475071a288 a88 458db900122707 leal r15,[r9+0x7271200] <-/
| 0x1c475071a28f a8f 41baffffffff movl r10,0xffffffff
| 0x1c475071a295 a95 4d3bfa REX.W cmpq r15,r10
| 0x1c475071a298 a98 761d jna 0x1c475071a2b7 <+0xab7> (T) -\
| 0x1c475071a29a a9a bf01000000 movl rdi,0x1 |
| 0x1c475071a29f a9f 4989e2 REX.W movq r10,rsp |
| 0x1c475071a2a2 aa2 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a2a6 aa6 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a2aa aaa 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a2ae aae 488b058ff5ffff REX.W movq rax,[rip+0xfffff58f] |
| 0x1c475071a2b5 ab5 ffd0 call rax |
| |
| >>>> r11 = r11*r12 <<<< |
| 0x1c475071a2b7 ab7 450fafdc imull r11,r12 <-/
|
| 0x1c475071a2bb abb 458da104b8c404 leal r12,[r9+0x4c4b804]
| 0x1c475071a2c2 ac2 41baffffffff movl r10,0xffffffff
| 0x1c475071a2c8 ac8 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a2cb acb 761d jna 0x1c475071a2ea <+0xaea> (T) -\
| 0x1c475071a2cd acd bf01000000 movl rdi,0x1 |
| 0x1c475071a2d2 ad2 4989e2 REX.W movq r10,rsp |
| 0x1c475071a2d5 ad5 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a2d9 ad9 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a2dd add 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a2e1 ae1 488b055cf5ffff REX.W movq rax,[rip+0xfffff55c] |
| 0x1c475071a2e8 ae8 ffd0 call rax |
| |
| a[idx] = r11 |
| 0x1c475071a2ea aea 46891c3a movl [rdx+r15*1],r11 <-/
|
| 0x1c475071a2ee aee 458d99045e6202 leal r11,[r9+0x2625e04]
| 0x1c475071a2f5 af5 41baffffffff movl r10,0xffffffff
| 0x1c475071a2fb afb 4d3bda REX.W cmpq r11,r10
| 0x1c475071a2fe afe 761d jna 0x1c475071a31d <+0xb1d> (T) -\
| 0x1c475071a300 b00 bf01000000 movl rdi,0x1 |
| 0x1c475071a305 b05 4989e2 REX.W movq r10,rsp |
| 0x1c475071a308 b08 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a30c b0c 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a310 b10 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a314 b14 488b0529f5ffff REX.W movq rax,[rip+0xfffff529] |
| 0x1c475071a31b b1b ffd0 call rax |
| |
| r12 = c[idx] |
| 0x1c475071a31d b1d 468b2422 movl r12,[rdx+r12*1] <-/
|
| 0x1c475071a321 b21 41baffffffff movl r10,0xffffffff
| 0x1c475071a327 b27 4d3be2 REX.W cmpq r12,r10
| 0x1c475071a32a b2a 761d jna 0x1c475071a349 <+0xb49> (T) -\
| 0x1c475071a32c b2c bf01000000 movl rdi,0x1 |
| 0x1c475071a331 b31 4989e2 REX.W movq r10,rsp |
| 0x1c475071a334 b34 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a338 b38 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a33c b3c 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a340 b40 488b05fdf4ffff REX.W movq rax,[rip+0xfffff4fd] |
| 0x1c475071a347 b47 ffd0 call rax |
| |
| r11 = b[idx] |
| 0x1c475071a349 b49 468b1c1a movl r11,[rdx+r11*1] <-/
|
| 0x1c475071a34d b4d 41baffffffff movl r10,0xffffffff
| 0x1c475071a353 b53 4d3bda REX.W cmpq r11,r10
| 0x1c475071a356 b56 761d jna 0x1c475071a375 <+0xb75> (T) -\
| 0x1c475071a358 b58 bf01000000 movl rdi,0x1 |
| 0x1c475071a35d b5d 4989e2 REX.W movq r10,rsp |
| 0x1c475071a360 b60 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a364 b64 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a368 b68 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a36c b6c 488b05d1f4ffff REX.W movq rax,[rip+0xfffff4d1] |
| 0x1c475071a373 b73 ffd0 call rax |
| 0x1c475071a375 b75 458db904122707 leal r15,[r9+0x7271204] <-/
| 0x1c475071a37c b7c 41baffffffff movl r10,0xffffffff
| 0x1c475071a382 b82 4d3bfa REX.W cmpq r15,r10
| 0x1c475071a385 b85 761d jna 0x1c475071a3a4 <+0xba4> (T) -\
| 0x1c475071a387 b87 bf01000000 movl rdi,0x1 |
| 0x1c475071a38c b8c 4989e2 REX.W movq r10,rsp |
| 0x1c475071a38f b8f 4883ec08 REX.W subq rsp,0x8 |
| 0x1c475071a393 b93 4883e4f0 REX.W andq rsp,0xf0 |
| 0x1c475071a397 b97 4c891424 REX.W movq [rsp],r10 |
| 0x1c475071a39b b9b 488b05a2f4ffff REX.W movq rax,[rip+0xfffff4a2] |
| 0x1c475071a3a2 ba2 ffd0 call rax |
| |
| >>>> r12 = r12*r11 <<<< |
| 0x1c475071a3a4 ba4 450fafe3 imull r12,r11 <-/
| a[idx] = r12
| 0x1c475071a3a8 ba8 4689243a movl [rdx+r15*1],r12
|
| count += 8
| 0x1c475071a3ac bac 4183c108 addl r9,0x8
| if (count != 0)
\- 0x1c475071a3b0 bb0 0f854afaffff jnz 0x1c4750719e00 <+0x600> (T)
[snip]
Although the produced code is extensive, its operation is simple: the function was inlined and the loop was unrolled, performing 6 multiplications per iteration, with an average of ~22 instructions and 5 jumps between each imul. In practice, unrolling might provide some benefit, but apart from that it is basically the same ASM produced for the JITed JS, and it certainly wouldn’t yield significant gains when compared to the best code emitted by GCC!
Is it the end? Have our options run out? No!
If WebAssembly Supported SIMD… Oh Wait, It Does!
WebAssembly does support SIMD instructions, and the hope is that vector instructions in the wasm will lead Turbofan to emit vectorized x86_64 code as well.
Clang does not emit SIMD instructions by default, even in a build with -O3; that requires an extra flag, -msimd128. With it, Clang will emit vectorized code in your wasm.
Rebuild your code with:
$ clang \
--target=wasm32 \
--no-standard-libraries \
-Wl,--no-entry \
-Wl,--export=do_mul \
-Wl,--allow-undefined \
-g -o mul.wasm mul_wasm.c -O3 -msimd128
Which produces WebAssembly code like:
$ wasm2wat mul.wasm
(module
(type (;0;) (func (result f64)))
(type (;1;) (func (param f64)))
(type (;2;) (func (param i32)))
(import "env" "performance_now" (func $performance_now (type 0)))
(import "env" "log_time" (func $log_time (type 1)))
(func $do_mul (type 2) (param i32)
(local i32 v128 v128 f64 i32)
[snip]
call $performance_now
local.set 4
block ;; label = @1
local.get 0
i32.const 1
i32.lt_s
br_if 0 (;@1;)
i32.const 0
local.set 5
loop ;; label = @2
i32.const -40000000
local.set 1
loop ;; label = @3
local.get 1
i32.const 120001024
i32.add
local.get 1
i32.const 80001024
i32.add
v128.load <<<< SIMD
local.get 1
i32.const 40001024
i32.add
v128.load <<<< SIMD
i32x4.mul <<<< SIMD
v128.store <<<< SIMD
local.get 1
i32.const 120001040
i32.add
local.get 1
i32.const 80001040
i32.add
v128.load <<<< SIMD
local.get 1
i32.const 40001040
i32.add
v128.load <<<< SIMD
i32x4.mul <<<< SIMD
v128.store <<<< SIMD
local.get 1
i32.const 32
i32.add
local.tee 1
br_if 0 (;@3;)
end
local.get 5
i32.const 1
i32.add
local.tee 5
local.get 0
i32.ne
br_if 0 (;@2;)
end
end
call $performance_now
[snip]
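One detail of the listing above is easy to miss: the inner loop counter starts at -40000000 (the size of one array in bytes) and climbs to 0 in steps of 32, while each address is formed by adding the counter to a constant that points at the end of an array. The following is only a scaled-down sketch of that loop shape in plain JavaScript, not the author's code. The names (`mulCountingUp`, `B_END`, and so on) are made up for illustration, and the b, c, a ordering in memory is an assumption taken from the annotations in the x86_64 listings.

```javascript
// Scaled-down model of the count-up-to-zero loop; names and sizes are illustrative.
const N = 16;                    // elements per array (the article uses 10,000,000)
const BYTES = N * 4;             // each element is a 32-bit integer
const BASE = 1024;               // the first array starts at offset 1024, as in the listing

// Assumed layout: b, then c, then a. Each constant points at the END of an array.
const B_END = BASE + BYTES;      // 40001024 in the real module
const C_END = BASE + 2 * BYTES;  // 80001024 in the real module
const A_END = BASE + 3 * BYTES;  // 120001024 in the real module

function mulCountingUp(mem) {
  // i starts at -BYTES and climbs to 0, so "i !== 0" is the only loop test needed.
  for (let i = -BYTES; i !== 0; i += 32) {
    // 32 bytes per iteration = two 128-bit vectors = eight 32-bit lanes.
    for (let lane = 0; lane < 32; lane += 4) {
      const b = mem.getInt32(B_END + i + lane, true);
      const c = mem.getInt32(C_END + i + lane, true);
      mem.setInt32(A_END + i + lane, Math.imul(b, c), true);
    }
  }
}

// Fill b with 50 and c with 40, like mul.js does, then run the loop.
const mem = new DataView(new ArrayBuffer(A_END));
for (let off = 0; off < BYTES; off += 4) {
  mem.setInt32(BASE + off, 50, true);
  mem.setInt32(BASE + BYTES + off, 40, true);
}
mulCountingUp(mem);
console.log(mem.getInt32(BASE + 2 * BYTES, true)); // first element of a
```

Counting up to zero lets the compiler reuse the flags set by the add as the loop condition, which is why the listings end each block with a plain add followed by a branch on zero.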
Of course, SIMD instructions in the wasm are no guarantee that the V8 JIT will emit SIMD for x86_64, nor do they tell us what the x86_64 equivalent would look like. So, let’s see what this actually generates:
--- WebAssembly code ---
name: do_mul
index: 2
kind: wasm function
compiler: TurboFan
Body (size = 2880 = 2868 + 12 padding)
Instructions (size = 2848)
r9 = count
r9 = starts at 0xFFFF_FFFF - 40_000_000
mul:
[snip]
/-> 0x3ce639a41c80 480 493b65a0 REX.W cmpq rsp,[r13-0x60]
| 0x3ce639a41c84 484 0f8666060000 jna 0x3ce639a422f0 <+0xaf0> (NT)
| 0x3ce639a41c8a 48a 458d9900b8c404 leal r11,[r9+0x4c4b800]
| 0x3ce639a41c91 491 41baffffffff movl r10,0xffffffff
| 0x3ce639a41c97 497 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41c9a 49a 761d jna 0x3ce639a41cb9 <+0x4b9> (T) -\
| 0x3ce639a41c9c 49c bf01000000 movl rdi,0x1 |
| 0x3ce639a41ca1 4a1 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41ca4 4a4 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41ca8 4a8 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41cac 4ac 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41cb0 4b0 488b058dfbffff REX.W movq rax,[rip+0xfffffb8d] |
| 0x3ce639a41cb7 4b7 ffd0 call rax <-/
| 0x3ce639a41cb9 4b9 458da1005e6202 leal r12,[r9+0x2625e00]
| 0x3ce639a41cc0 4c0 41baffffffff movl r10,0xffffffff
| 0x3ce639a41cc6 4c6 4d3be2 REX.W cmpq r12,r10
| 0x3ce639a41cc9 4c9 761d jna 0x3ce639a41ce8 <+0x4e8> (T) -\
| 0x3ce639a41ccb 4cb bf01000000 movl rdi,0x1 |
| 0x3ce639a41cd0 4d0 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41cd3 4d3 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41cd7 4d7 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41cdb 4db 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41cdf 4df 488b055efbffff REX.W movq rax,[rip+0xfffffb5e] |
| 0x3ce639a41ce6 4e6 ffd0 call rax |
| |
| xmm0 = c[index] |
| 0x3ce639a41ce8 4e8 c4a17a6f041a vmovdqu xmm0,[rdx+r11*1] <-/
| xmm2 = b[index]
| 0x3ce639a41cee 4ee c4a17a6f1422 vmovdqu xmm2,[rdx+r12*1]
|
| 0x3ce639a41cf4 4f4 458d9900122707 leal r11,[r9+0x7271200]
| 0x3ce639a41cfb 4fb 41baffffffff movl r10,0xffffffff
| 0x3ce639a41d01 501 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41d04 504 761d jna 0x3ce639a41d23 <+0x523> (T) -\
| 0x3ce639a41d06 506 bf01000000 movl rdi,0x1 |
| 0x3ce639a41d0b 50b 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41d0e 50e 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41d12 512 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41d16 516 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41d1a 51a 488b0523fbffff REX.W movq rax,[rip+0xfffffb23] |
| 0x3ce639a41d21 521 ffd0 call rax |
| |
| xmm0 = xmm0*xmm2 |
| 0x3ce639a41d23 523 c4e27940c2 vpmulld xmm0,xmm0,xmm2 <-/
| 0x3ce639a41d28 528 458da110b8c404 leal r12,[r9+0x4c4b810]
| 0x3ce639a41d2f 52f 41baffffffff movl r10,0xffffffff
| 0x3ce639a41d35 535 4d3be2 REX.W cmpq r12,r10
| 0x3ce639a41d38 538 761d jna 0x3ce639a41d57 <+0x557> (T) -\
| 0x3ce639a41d3a 53a bf01000000 movl rdi,0x1 |
| 0x3ce639a41d3f 53f 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41d42 542 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41d46 546 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41d4a 54a 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41d4e 54e 488b05effaffff REX.W movq rax,[rip+0xfffffaef] |
| 0x3ce639a41d55 555 ffd0 call rax |
| |
| a[idx] = xmm0 |
| 0x3ce639a41d57 557 c4a17a7f041a vmovdqu [rdx+r11*1],xmm0 <-/
|
| 0x3ce639a41d5d 55d 458d99105e6202 leal r11,[r9+0x2625e10]
| 0x3ce639a41d64 564 41baffffffff movl r10,0xffffffff
| 0x3ce639a41d6a 56a 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41d6d 56d 761d jna 0x3ce639a41d8c <+0x58c> (T) -\
| 0x3ce639a41d6f 56f bf01000000 movl rdi,0x1 |
| 0x3ce639a41d74 574 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41d77 577 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41d7b 57b 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41d7f 57f 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41d83 583 488b05bafaffff REX.W movq rax,[rip+0xfffffaba] |
| 0x3ce639a41d8a 58a ffd0 call rax |
| |
| xmm0 = c[idx] |
| 0x3ce639a41d8c 58c c4a17a6f0422 vmovdqu xmm0,[rdx+r12*1] <-/
| xmm2 = b[idx]
| 0x3ce639a41d92 592 c4a17a6f141a vmovdqu xmm2,[rdx+r11*1]
|
| 0x3ce639a41d98 598 458d9910122707 leal r11,[r9+0x7271210]
| 0x3ce639a41d9f 59f 41baffffffff movl r10,0xffffffff
| 0x3ce639a41da5 5a5 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41da8 5a8 761d jna 0x3ce639a41dc7 <+0x5c7> (T) -\
| 0x3ce639a41daa 5aa bf01000000 movl rdi,0x1 |
| 0x3ce639a41daf 5af 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41db2 5b2 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41db6 5b6 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41dba 5ba 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41dbe 5be 488b057ffaffff REX.W movq rax,[rip+0xfffffa7f] |
| 0x3ce639a41dc5 5c5 ffd0 call rax |
| |
| xmm0 = xmm0*xmm2 |
| 0x3ce639a41dc7 5c7 c4e27940c2 vpmulld xmm0,xmm0,xmm2 <-/
| a[idx] = xmm0
| 0x3ce639a41dcc 5cc c4a17a7f041a vmovdqu [rdx+r11*1],xmm0
|
| count += 32
| 0x3ce639a41dd2 5d2 4183c120 addl r9,0x20
| if (count == 0) (wraparound)
| 0x3ce639a41dd6 5d6 0f84a4020000 jz 0x3ce639a42080 <+0x880> (NT)
|
| 0x3ce639a41ddc 5dc 458d9900b8c404 leal r11,[r9+0x4c4b800]
| 0x3ce639a41de3 5e3 41baffffffff movl r10,0xffffffff
| 0x3ce639a41de9 5e9 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41dec 5ec 761d jna 0x3ce639a41e0b <+0x60b> (T) -\
| 0x3ce639a41dee 5ee bf01000000 movl rdi,0x1 |
| 0x3ce639a41df3 5f3 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41df6 5f6 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41dfa 5fa 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41dfe 5fe 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41e02 602 488b053bfaffff REX.W movq rax,[rip+0xfffffa3b] |
| 0x3ce639a41e09 609 ffd0 call rax |
| 0x3ce639a41e0b 60b 458da1005e6202 leal r12,[r9+0x2625e00] <-/
| 0x3ce639a41e12 612 41baffffffff movl r10,0xffffffff
| 0x3ce639a41e18 618 4d3be2 REX.W cmpq r12,r10
| 0x3ce639a41e1b 61b 761d jna 0x3ce639a41e3a <+0x63a> (T) -\
| 0x3ce639a41e1d 61d bf01000000 movl rdi,0x1 |
| 0x3ce639a41e22 622 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41e25 625 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41e29 629 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41e2d 62d 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41e31 631 488b050cfaffff REX.W movq rax,[rip+0xfffffa0c] |
| 0x3ce639a41e38 638 ffd0 call rax |
| |
| xmm0 = c[idx] |
| 0x3ce639a41e3a 63a c4a17a6f041a vmovdqu xmm0,[rdx+r11*1] <-/
| xmm2 = b[idx]
| 0x3ce639a41e40 640 c4a17a6f1422 vmovdqu xmm2,[rdx+r12*1]
|
| 0x3ce639a41e46 646 458d9900122707 leal r11,[r9+0x7271200]
| 0x3ce639a41e4d 64d 41baffffffff movl r10,0xffffffff
| 0x3ce639a41e53 653 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41e56 656 761d jna 0x3ce639a41e75 <+0x675> (T) -\
| 0x3ce639a41e58 658 bf01000000 movl rdi,0x1 |
| 0x3ce639a41e5d 65d 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41e60 660 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41e64 664 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41e68 668 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41e6c 66c 488b05d1f9ffff REX.W movq rax,[rip+0xfffff9d1] |
| 0x3ce639a41e73 673 ffd0 call rax |
| |
| xmm0 = xmm0*xmm2 |
| 0x3ce639a41e75 675 c4e27940c2 vpmulld xmm0,xmm0,xmm2 <-/
|
| 0x3ce639a41e7a 67a 458da110b8c404 leal r12,[r9+0x4c4b810]
| 0x3ce639a41e81 681 41baffffffff movl r10,0xffffffff
| 0x3ce639a41e87 687 4d3be2 REX.W cmpq r12,r10
| 0x3ce639a41e8a 68a 761d jna 0x3ce639a41ea9 <+0x6a9> (T) -\
| 0x3ce639a41e8c 68c bf01000000 movl rdi,0x1 |
| 0x3ce639a41e91 691 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41e94 694 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41e98 698 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41e9c 69c 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41ea0 6a0 488b059df9ffff REX.W movq rax,[rip+0xfffff99d] |
| 0x3ce639a41ea7 6a7 ffd0 call rax |
| |
| a[idx] = xmm0 |
| 0x3ce639a41ea9 6a9 c4a17a7f041a vmovdqu [rdx+r11*1],xmm0 <-/
|
| 0x3ce639a41eaf 6af 458d99105e6202 leal r11,[r9+0x2625e10]
| 0x3ce639a41eb6 6b6 41baffffffff movl r10,0xffffffff
| 0x3ce639a41ebc 6bc 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41ebf 6bf 761d jna 0x3ce639a41ede <+0x6de> (T) -\
| 0x3ce639a41ec1 6c1 bf01000000 movl rdi,0x1 |
| 0x3ce639a41ec6 6c6 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41ec9 6c9 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41ecd 6cd 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41ed1 6d1 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41ed5 6d5 488b0568f9ffff REX.W movq rax,[rip+0xfffff968] |
| 0x3ce639a41edc 6dc ffd0 call rax |
| |
| xmm0 = c[idx] |
| 0x3ce639a41ede 6de c4a17a6f0422 vmovdqu xmm0,[rdx+r12*1] <-/
| xmm2 = b[idx]
| 0x3ce639a41ee4 6e4 c4a17a6f141a vmovdqu xmm2,[rdx+r11*1]
|
| 0x3ce639a41eea 6ea 458d9910122707 leal r11,[r9+0x7271210]
| 0x3ce639a41ef1 6f1 41baffffffff movl r10,0xffffffff
| 0x3ce639a41ef7 6f7 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41efa 6fa 761d jna 0x3ce639a41f19 <+0x719> (T) -\
| 0x3ce639a41efc 6fc bf01000000 movl rdi,0x1 |
| 0x3ce639a41f01 701 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41f04 704 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41f08 708 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41f0c 70c 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41f10 710 488b052df9ffff REX.W movq rax,[rip+0xfffff92d] |
| 0x3ce639a41f17 717 ffd0 call rax |
| |
| xmm0 = xmm0*xmm2 |
| 0x3ce639a41f19 719 c4e27940c2 vpmulld xmm0,xmm0,xmm2 <-/
| a[idx] = xmm0
| 0x3ce639a41f1e 71e c4a17a7f041a vmovdqu [rdx+r11*1],xmm0
|
| count += 32
| 0x3ce639a41f24 724 4183c120 addl r9,0x20
| if (count == 0) (wraparound)
| 0x3ce639a41f28 728 0f8452010000 jz 0x3ce639a42080 <+0x880> (NT)
|
| 0x3ce639a41f2e 72e 458d9900b8c404 leal r11,[r9+0x4c4b800]
| 0x3ce639a41f35 735 41baffffffff movl r10,0xffffffff
| 0x3ce639a41f3b 73b 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41f3e 73e 761d jna 0x3ce639a41f5d <+0x75d> (T) -\
| 0x3ce639a41f40 740 bf01000000 movl rdi,0x1 |
| 0x3ce639a41f45 745 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41f48 748 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41f4c 74c 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41f50 750 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41f54 754 488b05e9f8ffff REX.W movq rax,[rip+0xfffff8e9] |
| 0x3ce639a41f5b 75b ffd0 call rax |
| 0x3ce639a41f5d 75d 458da1005e6202 leal r12,[r9+0x2625e00] <-/
| 0x3ce639a41f64 764 41baffffffff movl r10,0xffffffff
| 0x3ce639a41f6a 76a 4d3be2 REX.W cmpq r12,r10
| 0x3ce639a41f6d 76d 761d jna 0x3ce639a41f8c <+0x78c> (T) -\
| 0x3ce639a41f6f 76f bf01000000 movl rdi,0x1 |
| 0x3ce639a41f74 774 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41f77 777 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41f7b 77b 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41f7f 77f 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41f83 783 488b05baf8ffff REX.W movq rax,[rip+0xfffff8ba] |
| 0x3ce639a41f8a 78a ffd0 call rax |
| |
| xmm0 = c[idx] |
| 0x3ce639a41f8c 78c c4a17a6f041a vmovdqu xmm0,[rdx+r11*1] <-/
| xmm2 = b[idx]
| 0x3ce639a41f92 792 c4a17a6f1422 vmovdqu xmm2,[rdx+r12*1]
|
| 0x3ce639a41f98 798 458d9900122707 leal r11,[r9+0x7271200]
| 0x3ce639a41f9f 79f 41baffffffff movl r10,0xffffffff
| 0x3ce639a41fa5 7a5 4d3bda REX.W cmpq r11,r10
| 0x3ce639a41fa8 7a8 761d jna 0x3ce639a41fc7 <+0x7c7> (T) -\
| 0x3ce639a41faa 7aa bf01000000 movl rdi,0x1 |
| 0x3ce639a41faf 7af 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41fb2 7b2 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41fb6 7b6 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41fba 7ba 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41fbe 7be 488b057ff8ffff REX.W movq rax,[rip+0xfffff87f] |
| 0x3ce639a41fc5 7c5 ffd0 call rax |
| |
| xmm0 = xmm0*xmm2 |
| 0x3ce639a41fc7 7c7 c4e27940c2 vpmulld xmm0,xmm0,xmm2 <-/
|
| 0x3ce639a41fcc 7cc 458da110b8c404 leal r12,[r9+0x4c4b810]
| 0x3ce639a41fd3 7d3 41baffffffff movl r10,0xffffffff
| 0x3ce639a41fd9 7d9 4d3be2 REX.W cmpq r12,r10
| 0x3ce639a41fdc 7dc 761d jna 0x3ce639a41ffb <+0x7fb> (T) -\
| 0x3ce639a41fde 7de bf01000000 movl rdi,0x1 |
| 0x3ce639a41fe3 7e3 4989e2 REX.W movq r10,rsp |
| 0x3ce639a41fe6 7e6 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a41fea 7ea 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a41fee 7ee 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a41ff2 7f2 488b054bf8ffff REX.W movq rax,[rip+0xfffff84b] |
| 0x3ce639a41ff9 7f9 ffd0 call rax |
| |
| a[idx] = xmm0 |
| 0x3ce639a41ffb 7fb c4a17a7f041a vmovdqu [rdx+r11*1],xmm0 <-/
|
| 0x3ce639a42001 801 458d99105e6202 leal r11,[r9+0x2625e10]
| 0x3ce639a42008 808 41baffffffff movl r10,0xffffffff
| 0x3ce639a4200e 80e 4d3bda REX.W cmpq r11,r10
| 0x3ce639a42011 811 761d jna 0x3ce639a42030 <+0x830> (T) -\
| 0x3ce639a42013 813 bf01000000 movl rdi,0x1 |
| 0x3ce639a42018 818 4989e2 REX.W movq r10,rsp |
| 0x3ce639a4201b 81b 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a4201f 81f 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a42023 823 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a42027 827 488b0516f8ffff REX.W movq rax,[rip+0xfffff816] |
| 0x3ce639a4202e 82e ffd0 call rax |
| |
| xmm0 = c[idx] |
| 0x3ce639a42030 830 c4a17a6f0422 vmovdqu xmm0,[rdx+r12*1] <-/
| xmm2 = b[idx]
| 0x3ce639a42036 836 c4a17a6f141a vmovdqu xmm2,[rdx+r11*1]
|
| 0x3ce639a4203c 83c 458d9910122707 leal r11,[r9+0x7271210]
| 0x3ce639a42043 843 41baffffffff movl r10,0xffffffff
| 0x3ce639a42049 849 4d3bda REX.W cmpq r11,r10
| 0x3ce639a4204c 84c 761d jna 0x3ce639a4206b <+0x86b> (T) -\
| 0x3ce639a4204e 84e bf01000000 movl rdi,0x1 |
| 0x3ce639a42053 853 4989e2 REX.W movq r10,rsp |
| 0x3ce639a42056 856 4883ec08 REX.W subq rsp,0x8 |
| 0x3ce639a4205a 85a 4883e4f0 REX.W andq rsp,0xf0 |
| 0x3ce639a4205e 85e 4c891424 REX.W movq [rsp],r10 |
| 0x3ce639a42062 862 488b05dbf7ffff REX.W movq rax,[rip+0xfffff7db] |
| 0x3ce639a42069 869 ffd0 call rax |
| |
| xmm0 = xmm0*xmm2 |
| 0x3ce639a4206b 86b c4e27940c2 vpmulld xmm0,xmm0,xmm2 <-/
| a[idx] = xmm0
| 0x3ce639a42070 870 c4a17a7f041a vmovdqu [rdx+r11*1],xmm0
|
| count += 32
| 0x3ce639a42076 876 4183c120 addl r9,0x20
| if (count != 0)
\-- 0x3ce639a4207a 87a 0f8500fcffff jnz 0x3ce639a41c80 <+0x480> (T)
[snip]
Surprisingly (or not), the code follows exactly the same ‘shape’ as before: function inlined, loop unrolled with 6 multiplications, but with an important difference: SIMD finally! Note that this code uses AVX-encoded instructions on 128-bit xmm registers and therefore performs 4 32-bit multiplications at a time, or 24 multiplications per iteration of the loop. This should certainly bring some performance gain… right?
Interestingly, the x86_64 code is not an identical copy of the WASM version: the WebAssembly loop performs only 2 vector multiplications (i32x4.mul) per iteration, against 6 (vpmulld) in the x86_64 loop, so Turbofan unrolled it a further 3 times!
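To make the lane arithmetic concrete, here is a small sketch of what a single i32x4.mul (or vpmulld on xmm registers) computes, plus the per-iteration element counts. It is a plain JavaScript model, not real SIMD, and the helper names `i32x4Mul` and `elementsPerIteration` are hypothetical.

```javascript
const LANES = 4; // a 128-bit vector holds four 32-bit integers

// Model of i32x4.mul: multiply lane by lane, keeping the low 32 bits of each product.
function i32x4Mul(x, y) {
  const out = new Int32Array(LANES);
  for (let lane = 0; lane < LANES; lane++) {
    out[lane] = Math.imul(x[lane], y[lane]);
  }
  return out;
}

// Elements processed per loop iteration, given the number of vector multiplies in the loop body.
function elementsPerIteration(vectorMuls) {
  return vectorMuls * LANES;
}

const b = Int32Array.of(50, 50, 50, 50);
const c = Int32Array.of(40, 40, 40, 40);
console.log(Array.from(i32x4Mul(b, c)));

console.log("wasm loop:  ", elementsPerIteration(2), "elements per iteration");
console.log("x86_64 loop:", elementsPerIteration(6), "elements per iteration");
```

The model only captures the arithmetic; the real instructions do all four lanes at once, which is where the speedup comes from.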
Some Numbers…
I’m sure you’re tired of reading x86_64 assembly, and most of you probably want numbers. After all, can the v8 JIT really compete (or come close) to compiled languages?
Description | Time (ms) |
---|---|
add.js (JIT disabled, --jitless) | 39719.07 |
GCC/mul.c + -O0 | 2495.13 |
add.js (with JIT) | 2248.63 |
load.js+mul_wasm.c (WebAsm) + -O0 | 2298.35 |
load.js+mul_wasm.c (WebAsm) + -O3 | 1825.97 |
GCC/mul.c + -O3 | 1083.70 |
GCC/mul.c + -O3 + -march=native | 1065.57 |
load.js+mul_wasm.c (WebAsm) + -O3 + -msimd128 | 1064.27 |
First of all, the difference between JIT and non-JIT times is impressive. Google’s efforts to make v8 fast really seem to pay off, making browsers capable of things previously only possible in desktop environments.
Second, JITed JS (and WebAssembly) times closely approach those of unoptimized C code, which is… curious. I honestly expected a bit more, considering most developers write only in JS, but okay.
Third, enabling ‘-O3’ in WebAssembly really starts to show some potential, surpassing unoptimized C code. Finally, I had a pleasant surprise when enabling auto-vectorization with the -msimd128 flag: performance was practically identical to that of the code produced by GCC at the highest optimization level. Clearly, Clang+Turbofan did an excellent job in this last scenario.
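For readers who prefer ratios to raw times, the short script below turns the table into relative speedups. The timings are copied from the table; the key names and the `speedup` helper are just labels invented for this sketch.

```javascript
// Timings in milliseconds, copied from the table above.
const timings = {
  jsJitless: 39719.07,
  gccO0: 2495.13,
  jsJit: 2248.63,
  wasmO0: 2298.35,
  wasmO3: 1825.97,
  gccO3: 1083.70,
  gccO3Native: 1065.57,
  wasmO3Simd: 1064.27,
};

// How many times faster the `fast` run is than the `slow` run.
function speedup(slow, fast) {
  return slow / fast;
}

// Every configuration relative to the interpreter-only run.
for (const [name, ms] of Object.entries(timings)) {
  console.log(name.padEnd(12), speedup(timings.jsJitless, ms).toFixed(2) + "x vs. jitless");
}

// A few pairwise comparisons discussed in the text.
console.log("JIT vs. wasm -O3 -msimd128:", speedup(timings.jsJit, timings.wasmO3Simd).toFixed(2) + "x");
console.log("wasm -O3 vs. wasm -O3 -msimd128:", speedup(timings.wasmO3, timings.wasmO3Simd).toFixed(2) + "x");
```

On these numbers the JIT is roughly 17 to 18 times faster than the jitless run, and the SIMD WebAssembly build is about twice as fast as the JITed JS. These are single measurements on one machine, so treat the ratios as indicative only.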
Final Thoughts
Honestly, I am quite happy with the results and everything I learned during the process. It was both fun and exhausting, and I never thought I would be able to set breakpoints and single-step debug JIT-generated code in a runtime engine! There is still much to learn, as I cannot fully understand the JIT code generated by v8.
I also want to emphasize not taking this text too seriously—it’s just a quick dive into two lines of JS code, and no definitive conclusions can be drawn from it. There are countless scenarios where the generated code could be entirely different, and I’m not looking to set any hard conclusions here. What I’m trying to bring to the table is a different angle on debugging and analyzing the ASM code generated by the JIT, something I honestly haven’t come across much in benchmarks that usually stick to tables and graphs.
If you know of other interesting materials about v8 and comparative analyses, feel free to let me know =).