; Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows
; FullOpts code, optimized code
; No PGO data; 3 single block inlinees
;
; NOTE(review): captured RyuJIT disassembly (DOTNET_JitDisasm output) pasted as
; evidence for the issue below — the instructions themselves must not be edited.
; Annotations only. "gword ptr" is RyuJIT notation for a GC-tracked pointer slot.
; ABI: Windows x64 (rcx = 'this', rdx = return buffer for the 16-byte struct).
G_M000_IG01:
sub rsp, 40                              ; prologue: locals + padding, keeps rsp 16-aligned
vmovaps xmmword ptr [rsp+0x10], xmm6     ; save callee-saved xmm6 (Windows x64 ABI)
xor eax, eax
mov qword ptr [rsp+0x08], rax            ; zero-init the two GC-tracked stack slots
mov qword ptr [rsp], rax
G_M000_IG02:
vxorps xmm0, xmm0, xmm0                  ; xmm0 = 0.0 — first double accumulator
vxorps xmm1, xmm1, xmm1                  ; xmm1 = 0.0 — second double accumulator
vbroadcastss ymm2, dword ptr [reloc @RWD00]  ; splat a float constant to all 8 lanes
mov rax, gword ptr [rcx+0x08]            ; load first array reference from 'this'
mov gword ptr [rsp+0x08], rax            ; spill to tracked slot so the GC sees it
test rax, rax
je SHORT G_M000_IG04                     ; null array -> use null data pointer
G_M000_IG03:
mov r8d, dword ptr [rax+0x08]            ; array Length (object header offset 8)
test r8d, r8d
je SHORT G_M000_IG04                     ; empty array -> null data pointer
add rax, 16                              ; rax = &array[0] (skip header + length)
jmp SHORT G_M000_IG05
G_M000_IG04:
xor eax, eax                             ; data pointer = null
G_M000_IG05:
mov r8, gword ptr [rcx+0x10]             ; second array reference — same guard pattern
mov gword ptr [rsp], r8
test r8, r8
je SHORT G_M000_IG07
G_M000_IG06:
mov r10d, dword ptr [r8+0x08]            ; second array Length
test r10d, r10d
je SHORT G_M000_IG07
add r8, 16                               ; r8 = &array2[0]
jmp SHORT G_M000_IG08
G_M000_IG07:
xor r8d, r8d                             ; data pointer = null
G_M000_IG08:
xor r10d, r10d                           ; r10d = loop counter i = 0
mov ecx, dword ptr [rcx+0x18]            ; int field at this+0x18 — element count (per issue text)
mov r9d, ecx                             ; next 4 instructions: signed ecx = count / 8
sar r9d, 31                              ;   sign mask of count
and r9d, 7                               ;   rounding bias for negative counts
add ecx, r9d
sar ecx, 3                               ;   ecx = count / 8 = number of vector iterations
cmp r10d, ecx
jge G_M000_IG13                          ; no full vector iterations -> skip loop
G_M000_IG09: ;; HOT LOOP START
vmovups ymm3, ymmword ptr [rax]          ; load 8 floats from array 1
vcmpeqps ymm4, ymm2, ymm3                ; mask: lanes equal to the broadcast constant
vmovups ymm5, ymmword ptr [r8]           ; load 8 floats from array 2
vcmpeqps ymm6, ymm2, ymm5
vpternlogd ymm4, ymm6, ymm4, 17          ; combine the masks (imm8 = 3-input truth table 0x11)
vxorps ymm16, ymm16, ymm16               ; NOTE(review): ymm16 forces EVEX encoding — see issue below
vcmpeqps k1, ymm16, ymm4                 ; k1 set for lanes where the combined mask == 0.0
kortestb k1, k1
jb G_M000_IG12                           ; jb == jc; kortest sets CF when k1 is all-ones -> no lane selected, skip body
G_M000_IG10:
vxorps ymm16, ymm16, ymm16
vpternlogd ymm16, ymm5, ymm4, -40        ; 3-input bitwise op (imm8 0xD8): mask-select from ymm5
vxorps ymm5, ymm5, ymm5
vpternlogd ymm4, ymm5, ymm3, -84         ; 3-input bitwise op (imm8 0xAC): mask-select from ymm3
vsubps ymm3, ymm16, ymm4                 ; per-lane difference of the two selected vectors
vcmpgtps ymm4, ymm3, ymmword ptr [reloc @RWD32]  ; lanes where diff > constant @RWD32
vmovmskps r9, ymm4                       ; r9 = the 8 lane sign bits
test r9d, r9d
je SHORT G_M000_IG11                     ; no lane above threshold -> skip this sum
vxorps ymm5, ymm5, ymm5
vpternlogd ymm4, ymm5, ymm3, -84         ; zero the non-selected lanes (imm8 0xAC)
vmovaps ymm5, ymm4
vpermilps xmm16, xmm5, -79               ; 0xB1: swap adjacent pairs — horizontal-sum shuffle 1
vaddps xmm5, xmm16, xmm5
vpermilps xmm16, xmm5, 78                ; 0x4E: swap 64-bit halves — shuffle 2
vaddps xmm5, xmm16, xmm5
vextractf128 xmm4, ymm4                  ; NOTE(review): vextractf128 requires an imm8 (likely ", 1") — lost in paste
vpermilps xmm16, xmm4, -79               ; same reduction on the upper 128-bit half
vaddps xmm4, xmm16, xmm4
vpermilps xmm16, xmm4, 78
vaddps xmm4, xmm16, xmm4
vaddss xmm4, xmm5, xmm4                  ; scalar sum of both halves
vcvtss2sd xmm4, xmm4, xmm4               ; widen float -> double
vaddsd xmm1, xmm4, xmm1                  ; accumulate into second total (xmm1)
G_M000_IG11:
vcmpltps ymm4, ymm3, ymmword ptr [reloc @RWD64]  ; lanes where diff < constant @RWD64
vmovmskps r9, ymm4
test r9d, r9d
je SHORT G_M000_IG12                     ; no lane below threshold -> skip this sum
vxorps ymm5, ymm5, ymm5
vpternlogd ymm4, ymm5, ymm3, -84         ; zero the non-selected lanes (imm8 0xAC)
vmovaps ymm3, ymm4
vpermilps xmm5, xmm3, -79                ; identical 10-instruction Sum() pattern, duplicated per call site
vaddps xmm3, xmm5, xmm3
vpermilps xmm5, xmm3, 78
vaddps xmm3, xmm5, xmm3
vextractf128 xmm4, ymm4                  ; NOTE(review): missing imm8 here as well
vpermilps xmm5, xmm4, -79
vaddps xmm4, xmm5, xmm4
vpermilps xmm5, xmm4, 78
vaddps xmm4, xmm5, xmm4
vaddss xmm3, xmm3, xmm4
vcvtss2sd xmm3, xmm3, xmm3
vaddsd xmm0, xmm3, xmm0                  ; accumulate into first total (xmm0)
G_M000_IG12: ;; HOT LOOP END
inc r10d                                 ; i++
add rax, 32                              ; advance both data pointers by 8 floats (32 bytes)
add r8, 32
cmp r10d, ecx
jl G_M000_IG09
G_M000_IG13:
xor rax, rax
mov gword ptr [rsp+0x08], rax            ; clear the GC-tracked slots before returning
mov gword ptr [rsp], rax
vmovsd qword ptr [rdx], xmm0             ; write both doubles into the return buffer (rdx)
vmovsd qword ptr [rdx+0x08], xmm1
mov rax, rdx                             ; struct-return convention: rax = buffer pointer
G_M000_IG15:
vzeroupper                               ; avoid AVX->SSE transition penalty in the caller
vmovaps xmm6, xmmword ptr [rsp+0x10]     ; restore callee-saved xmm6
add rsp, 40                              ; epilogue
ret
; Total bytes of code 462
Labels: area-CodeGen-coreclr, tenet-performance

Title
Vector256 explicit intrinsics regress ~71% on .NET 10 vs .NET 8 on AVX-512 hardware (Tiger Lake)
Description
Using `System.Runtime.Intrinsics.Vector256<float>` operations in a tight loop shows a 71-73% performance regression on .NET 10 GA compared to .NET 8 when running on AVX-512-capable hardware (Intel Tiger Lake). The equivalent portable `System.Numerics.Vector<float>` code (which also operates at 256-bit width on this hardware) shows a 3-6% improvement on .NET 10 over .NET 8, confirming this is specific to the explicit `Vector256<T>` intrinsics codepath. `Vector512<float>` explicit intrinsics also regress, but only by 12-13%.

Environment
Benchmark Results
Aggregation loop processing two float arrays (null-check, conditional select, subtract, compare, masked horizontal sum). Each benchmark method processes the same data at the same vector width.
Cross-runtime comparison (4096 elements, confirmed across two independent runs):
`Vector<float>` portable (256-bit) · `Vector256<float>` explicit (256-bit) · `Vector512<float>` explicit (512-bit)

Cross-runtime comparison (1024 elements):
`Vector<float>` portable (256-bit) · `Vector256<float>` explicit (256-bit) · `Vector512<float>` explicit (512-bit)

Within-runtime ratios (4096 elements):
`Vector<float>` portable (baseline) · `Vector256<float>` explicit · `Vector512<float>` explicit

On .NET 8, `Vector256` explicit is 39% faster than portable. On .NET 10, it's 8% slower.

Reproduction
Benchmark code
Project file
Run
dotnet run -c Release -- --filter "*Vector256Regression*"

Root Cause Analysis via JIT Disassembly
Disassembly captured with
`DOTNET_TieredCompilation=0` + `DOTNET_JitDisasm=ExplicitVector256`.

JIT header change
.NET 8:  `Emitting BLENDED_CODE for X64 with AVX512 - Windows`
.NET 10: `Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows`
This reflects the EVEX/AVX-512 rework (likely #115983).
Three key codegen differences
1. Extended register usage forces EVEX everywhere
.NET 10 uses `ymm16`, `xmm16`, `xmm17` throughout the hot loop. These registers require EVEX encoding (4-byte prefix vs 2-3 byte VEX). .NET 8 stays within `ymm0`–`ymm7`.

.NET 10:
.NET 8:
2. Horizontal sum strategy changed:
`vhaddps` → `vpermilps + vaddps`

This is the largest codegen change. The `Vector256.Sum()` lowering switched from compact horizontal adds to a much longer shuffle+add sequence.

.NET 8 (4 instructions, ~9 µops, ~20 bytes):
.NET 10 (10 instructions, ~13 µops, ~50 bytes):
This pattern is emitted twice per iteration (once for fill, once for cut), adding ~60 bytes of code to the loop body.
3. Resulting code size and loop body growth
The 28% larger loop body may cause micro-op cache (DSB) pressure on Tiger Lake, forcing fallback to the slower legacy instruction decoder.
Full optimized disassembly
ExplicitVector256 — .NET 10 (FullOpts, 462 bytes)
ExplicitVector256 — .NET 8 (FullOpts, 433 bytes)
Possibly Related
.NET 10 JIT header changed from "X64 with AVX512" to "generic X64 + VEX + EVEX", suggesting the instruction encoding selection was significantly reworked.

Summary
On .NET 10 GA running on AVX-512-capable Tiger Lake hardware:
- `Vector256.Sum()` lowering changed from `vhaddps` (4 instructions, ~20 bytes) to `vpermilps + vaddps` (10 instructions, ~50 bytes), duplicated for every call site in the loop
- The hot loop uses `ymm16`+/`xmm16`+, forcing EVEX encoding on instructions that could be VEX-encoded
- ~71% regression for `Vector256<float>` explicit intrinsics vs .NET 8