
Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware #126250

@raymondpwilson


Labels: area-CodeGen-coreclr, tenet-performance

Title

Vector256 explicit intrinsics regress ~71% on .NET 10 vs .NET 8 on AVX-512 hardware (Tiger Lake)

Description

Using System.Runtime.Intrinsics.Vector256<float> operations in a tight loop shows a 71-73% performance regression on .NET 10 GA compared to .NET 8 when running on AVX-512-capable hardware (Intel Tiger Lake). The equivalent portable System.Numerics.Vector<float> code (which also operates at 256-bit width on this hardware) shows a 3-6% improvement on .NET 10 over .NET 8, confirming this is specific to the explicit Vector256<T> intrinsics codepath.

Vector512<float> explicit intrinsics also regress, but only by 12-13%.

Environment

BenchmarkDotNet v0.15.8
Windows 11 (10.0.26200.8037/25H2)
11th Gen Intel Core i9-11950H @ 2.60GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.201
  .NET 10.0.5 (10.0.526.15411), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  .NET 8.0.25 (8.0.2526.11203), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
HardwareIntrinsics: AVX-512 (F+BW+CD+DQ+VL, IFMA+VBMI, BITALG+VBMI2+VNNI+VPOPCNTDQ)

Benchmark Results

Aggregation loop processing two float arrays (null-check, conditional select, subtract, compare, masked horizontal sum). Each benchmark method processes the same data at the same vector width.

Cross-runtime comparison (4096 elements, confirmed across two independent runs):

| Method | .NET 8.0 | .NET 10.0 | Regression |
|---|---|---|---|
| `Vector<float>` portable (256-bit) | 3,482 ns | 3,386 ns | -3% (faster) |
| `Vector256<float>` explicit (256-bit) | 2,133 ns | 3,640 ns | +71% |
| `Vector512<float>` explicit (512-bit) | 2,419 ns | 2,707 ns | +12% |

Cross-runtime comparison (1024 elements):

| Method | .NET 8.0 | .NET 10.0 | Regression |
|---|---|---|---|
| `Vector<float>` portable (256-bit) | 867 ns | 811 ns | -6% (faster) |
| `Vector256<float>` explicit (256-bit) | 528 ns | 916 ns | +73% |
| `Vector512<float>` explicit (512-bit) | 595 ns | 670 ns | +13% |

Within-runtime ratios (4096 elements):

| Method | .NET 8 Ratio | .NET 10 Ratio |
|---|---|---|
| `Vector<float>` portable (baseline) | 1.00 | 1.00 |
| `Vector256<float>` explicit | 0.61 | 1.08 |
| `Vector512<float>` explicit | 0.70 | 0.80 |

On .NET 8, Vector256 explicit is 39% faster than portable. On .NET 10, it's 8% slower.

Reproduction

Benchmark code

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.Net10_0)]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}

// Entry point so the `dotnet run -- --filter ...` command below works.
public class Program
{
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Vector256RegressionBenchmark).Assembly).Run(args);
}

Project file

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net10.0;net8.0</TargetFrameworks>
    <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.15.8" />
  </ItemGroup>
</Project>

Run

dotnet run -c Release -- --filter "*Vector256Regression*"

Root Cause Analysis via JIT Disassembly

Disassembly captured with DOTNET_TieredCompilation=0 + DOTNET_JitDisasm=ExplicitVector256.
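For reference, a sketch of the capture setup (Linux/macOS syntax shown; on Windows use `set` instead of `export`; this assumes the benchmarked child process inherits the variables — BenchmarkDotNet may instead need them configured via its job/EnvironmentVariables settings):

```shell
# Disable tiering so FullOpts code is emitted immediately,
# then dump the disassembly of the method of interest.
export DOTNET_TieredCompilation=0
export DOTNET_JitDisasm=ExplicitVector256
dotnet run -c Release -f net10.0 -- --filter "*Vector256Regression*"
```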

JIT header change

  • .NET 8: Emitting BLENDED_CODE for X64 with AVX512 - Windows
  • .NET 10: Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows

This reflects the EVEX/AVX-512 rework (likely #115983).

Three key codegen differences

1. Extended register usage forces EVEX everywhere

.NET 10 uses ymm16, xmm16, xmm17 throughout the hot loop. These registers require EVEX encoding (4-byte prefix vs 2-3 byte VEX). .NET 8 stays within ymm0–ymm7.

.NET 10:

vxorps   ymm16, ymm16, ymm16          ; EVEX (ymm16)
vpternlogd ymm16, ymm5, ymm4, -40     ; EVEX
vpermilps xmm16, xmm5, -79            ; EVEX (xmm16 dest)
vaddps   xmm5, xmm16, xmm5           ; EVEX (xmm16 source)

.NET 8:

vxorps   ymm3, ymm3, ymm3             ; VEX (ymm3)
vmovaps  ymm5, ymm2                    ; VEX
vpternlogd ymm5, ymm3, ymm4, -54      ; EVEX (inherently AVX-512)

2. Horizontal sum strategy changed: vhaddps → vpermilps + vaddps

This is the largest codegen change. The Vector256.Sum() lowering switched from compact horizontal adds to a much longer shuffle+add sequence.

.NET 8 (4 instructions, ~9 µops, ~20 bytes):

vhaddps  ymm3, ymm3, ymm3             ; 3 µops
vhaddps  ymm3, ymm3, ymm3             ; 3 µops
vextractf128 xmm4, ymm3, 1            ; 1 µop
vaddps   xmm3, xmm4, xmm3            ; 1 µop

.NET 10 (10 instructions, ~13 µops, ~50 bytes):

vmovaps  ymm5, ymm4                    ; copy
vpermilps xmm16, xmm5, -79            ; shuffle low half
vaddps   xmm5, xmm16, xmm5
vpermilps xmm16, xmm5, 78
vaddps   xmm5, xmm16, xmm5
vextractf128 xmm4, ymm4               ; extract high half
vpermilps xmm16, xmm4, -79            ; shuffle high half
vaddps   xmm4, xmm16, xmm4
vpermilps xmm16, xmm4, 78
vaddps   xmm4, xmm16, xmm4
vaddss   xmm4, xmm5, xmm4            ; combine

This pattern is emitted twice per iteration (once for fill, once for cut), adding ~60 bytes of code to the loop body.

3. Resulting code size and loop body growth

| | .NET 8 | .NET 10 | Delta |
|---|---|---|---|
| Total method (FullOpts) | 433 B | 462 B | +29 B (+7%) |
| Hot loop body (IG11–IG14 / IG09–IG12) | ~213 B | ~273 B | +60 B (+28%) |

The 28% larger loop body may cause micro-op cache (DSB) pressure on Tiger Lake, forcing fallback to the slower legacy instruction decoder.
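The DSB hypothesis could be checked directly with hardware counters. A sketch using Linux `perf` (the report's environment is Windows, where VTune or `tracelog` would be the equivalent; the `idq.dsb_uops` / `idq.mite_uops` event names exist on recent Intel cores but availability varies by model):

```shell
# Compare uops delivered from the DSB (µop cache) vs. the MITE legacy
# decoder while the benchmark runs; a large MITE share would support
# the µop-cache-pressure explanation.
perf stat -e idq.dsb_uops,idq.mite_uops -- \
  dotnet run -c Release -f net10.0 -- --filter "*Vector256Regression*"
```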

Full optimized disassembly

ExplicitVector256 — .NET 10 (FullOpts, 462 bytes)
; Emitting BLENDED_CODE for generic X64 + VEX + EVEX on Windows
; FullOpts code, optimized code
; No PGO data; 3 single block inlinees

G_M000_IG01:
       sub      rsp, 40
       vmovaps  xmmword ptr [rsp+0x10], xmm6
       xor      eax, eax
       mov      qword ptr [rsp+0x08], rax
       mov      qword ptr [rsp], rax

G_M000_IG02:
       vxorps   xmm0, xmm0, xmm0
       vxorps   xmm1, xmm1, xmm1
       vbroadcastss ymm2, dword ptr [reloc @RWD00]
       mov      rax, gword ptr [rcx+0x08]
       mov      gword ptr [rsp+0x08], rax
       test     rax, rax
       je       SHORT G_M000_IG04

G_M000_IG03:
       mov      r8d, dword ptr [rax+0x08]
       test     r8d, r8d
       je       SHORT G_M000_IG04
       add      rax, 16
       jmp      SHORT G_M000_IG05

G_M000_IG04:
       xor      eax, eax

G_M000_IG05:
       mov      r8, gword ptr [rcx+0x10]
       mov      gword ptr [rsp], r8
       test     r8, r8
       je       SHORT G_M000_IG07

G_M000_IG06:
       mov      r10d, dword ptr [r8+0x08]
       test     r10d, r10d
       je       SHORT G_M000_IG07
       add      r8, 16
       jmp      SHORT G_M000_IG08

G_M000_IG07:
       xor      r8d, r8d

G_M000_IG08:
       xor      r10d, r10d
       mov      ecx, dword ptr [rcx+0x18]
       mov      r9d, ecx
       sar      r9d, 31
       and      r9d, 7
       add      ecx, r9d
       sar      ecx, 3
       cmp      r10d, ecx
       jge      G_M000_IG13

G_M000_IG09:                              ;; HOT LOOP START
       vmovups  ymm3, ymmword ptr [rax]
       vcmpeqps ymm4, ymm2, ymm3
       vmovups  ymm5, ymmword ptr [r8]
       vcmpeqps ymm6, ymm2, ymm5
       vpternlogd ymm4, ymm6, ymm4, 17
       vxorps   ymm16, ymm16, ymm16
       vcmpeqps k1, ymm16, ymm4
       kortestb k1, k1
       jb       G_M000_IG12

G_M000_IG10:
       vxorps   ymm16, ymm16, ymm16
       vpternlogd ymm16, ymm5, ymm4, -40
       vxorps   ymm5, ymm5, ymm5
       vpternlogd ymm4, ymm5, ymm3, -84
       vsubps   ymm3, ymm16, ymm4
       vcmpgtps ymm4, ymm3, ymmword ptr [reloc @RWD32]
       vmovmskps r9, ymm4
       test     r9d, r9d
       je       SHORT G_M000_IG11
       vxorps   ymm5, ymm5, ymm5
       vpternlogd ymm4, ymm5, ymm3, -84
       vmovaps  ymm5, ymm4
       vpermilps xmm16, xmm5, -79
       vaddps   xmm5, xmm16, xmm5
       vpermilps xmm16, xmm5, 78
       vaddps   xmm5, xmm16, xmm5
       vextractf128 xmm4, ymm4
       vpermilps xmm16, xmm4, -79
       vaddps   xmm4, xmm16, xmm4
       vpermilps xmm16, xmm4, 78
       vaddps   xmm4, xmm16, xmm4
       vaddss   xmm4, xmm5, xmm4
       vcvtss2sd xmm4, xmm4, xmm4
       vaddsd   xmm1, xmm4, xmm1

G_M000_IG11:
       vcmpltps ymm4, ymm3, ymmword ptr [reloc @RWD64]
       vmovmskps r9, ymm4
       test     r9d, r9d
       je       SHORT G_M000_IG12
       vxorps   ymm5, ymm5, ymm5
       vpternlogd ymm4, ymm5, ymm3, -84
       vmovaps  ymm3, ymm4
       vpermilps xmm5, xmm3, -79
       vaddps   xmm3, xmm5, xmm3
       vpermilps xmm5, xmm3, 78
       vaddps   xmm3, xmm5, xmm3
       vextractf128 xmm4, ymm4
       vpermilps xmm5, xmm4, -79
       vaddps   xmm4, xmm5, xmm4
       vpermilps xmm5, xmm4, 78
       vaddps   xmm4, xmm5, xmm4
       vaddss   xmm3, xmm3, xmm4
       vcvtss2sd xmm3, xmm3, xmm3
       vaddsd   xmm0, xmm3, xmm0

G_M000_IG12:                              ;; HOT LOOP END
       inc      r10d
       add      rax, 32
       add      r8, 32
       cmp      r10d, ecx
       jl       G_M000_IG09

G_M000_IG13:
       xor      rax, rax
       mov      gword ptr [rsp+0x08], rax
       mov      gword ptr [rsp], rax
       vmovsd   qword ptr [rdx], xmm0
       vmovsd   qword ptr [rdx+0x08], xmm1
       mov      rax, rdx

G_M000_IG15:
       vzeroupper
       vmovaps  xmm6, xmmword ptr [rsp+0x10]
       add      rsp, 40
       ret

; Total bytes of code 462
ExplicitVector256 — .NET 8 (FullOpts, 433 bytes)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code, optimized code
; No PGO data; 2 single block inlinees

G_M000_IG01:
       sub      rsp, 56
       vzeroupper
       xor      eax, eax
       mov      qword ptr [rsp+0x30], rax
       mov      qword ptr [rsp+0x28], rax

G_M000_IG02:
       vxorps   xmm0, xmm0, xmm0
       vxorps   xmm1, xmm1, xmm1
       mov      rax, gword ptr [rcx+0x08]
       mov      gword ptr [rsp+0x30], rax
       test     rax, rax
       je       SHORT G_M000_IG04

G_M000_IG03:
       mov      rax, gword ptr [rsp+0x30]
       cmp      dword ptr [rax+0x08], 0
       jne      SHORT G_M000_IG05

G_M000_IG04:
       xor      eax, eax
       jmp      SHORT G_M000_IG06

G_M000_IG05:
       mov      rax, gword ptr [rsp+0x30]
       cmp      dword ptr [rax+0x08], 0
       jbe      G_M000_IG18
       mov      rax, gword ptr [rsp+0x30]
       add      rax, 16

G_M000_IG06:
       mov      r8, gword ptr [rcx+0x10]
       mov      gword ptr [rsp+0x28], r8
       test     r8, r8
       je       SHORT G_M000_IG08

G_M000_IG07:
       mov      r8, gword ptr [rsp+0x28]
       cmp      dword ptr [r8+0x08], 0
       jne      SHORT G_M000_IG09

G_M000_IG08:
       xor      r8d, r8d
       jmp      SHORT G_M000_IG10

G_M000_IG09:
       mov      r8, gword ptr [rsp+0x28]
       cmp      dword ptr [r8+0x08], 0
       jbe      G_M000_IG18
       mov      r8, gword ptr [rsp+0x28]
       add      r8, 16

G_M000_IG10:
       xor      r10d, r10d
       mov      ecx, dword ptr [rcx+0x18]
       mov      r9d, ecx
       sar      r9d, 31
       and      r9d, 7
       add      ecx, r9d
       sar      ecx, 3
       test     ecx, ecx
       jle      G_M000_IG15

G_M000_IG11:                              ;; HOT LOOP START
       vmovups  ymm2, ymmword ptr [rax]
       vcmpps   ymm2, ymm2, ymmword ptr [reloc @RWD00], 0
       vmovups  ymm3, ymmword ptr [r8]
       vcmpps   ymm3, ymm3, ymmword ptr [reloc @RWD00], 0
       vorps    ymm2, ymm2, ymm3
       vpternlogd ymm2, ymm2, ymm2, 85
       vxorps   ymm3, ymm3, ymm3
       vcmpps   k1, ymm2, ymm3, 0
       kortestb k1, k1
       jb       G_M000_IG14

G_M000_IG12:
       vmovups  ymm3, ymmword ptr [r8]
       vxorps   ymm4, ymm4, ymm4
       vmovaps  ymm5, ymm2
       vpternlogd ymm5, ymm3, ymm4, -54
       vmovups  ymm3, ymmword ptr [rax]
       vxorps   ymm4, ymm4, ymm4
       vpternlogd ymm2, ymm3, ymm4, -54
       vsubps   ymm2, ymm5, ymm2
       vcmpps   ymm3, ymm2, ymmword ptr [reloc @RWD32], 14
       vmovmskps r9, ymm3
       test     r9d, r9d
       je       SHORT G_M000_IG13
       vxorps   ymm4, ymm4, ymm4
       vpternlogd ymm3, ymm2, ymm4, -54
       vhaddps  ymm3, ymm3, ymm3
       vhaddps  ymm3, ymm3, ymm3
       vextractf128 xmm4, ymm3, 1
       vaddps   xmm3, xmm4, xmm3
       vcvtss2sd xmm3, xmm3, xmm3
       vaddsd   xmm1, xmm3, xmm1

G_M000_IG13:
       vcmpps   ymm3, ymm2, ymmword ptr [reloc @RWD64], 1
       vmovmskps r9, ymm3
       test     r9d, r9d
       je       SHORT G_M000_IG14
       vxorps   ymm4, ymm4, ymm4
       vpternlogd ymm3, ymm2, ymm4, -54
       vhaddps  ymm2, ymm3, ymm3
       vhaddps  ymm2, ymm2, ymm2
       vextractf128 xmm3, ymm2, 1
       vaddps   xmm2, xmm3, xmm2
       vcvtss2sd xmm2, xmm2, xmm2
       vaddsd   xmm0, xmm2, xmm0

G_M000_IG14:                              ;; HOT LOOP END
       inc      r10d
       add      rax, 32
       add      r8, 32
       cmp      r10d, ecx
       jl       G_M000_IG11

G_M000_IG15:
       xor      rax, rax
       mov      gword ptr [rsp+0x30], rax
       mov      gword ptr [rsp+0x28], rax
       vmovsd   qword ptr [rdx], xmm0
       vmovsd   qword ptr [rdx+0x08], xmm1
       mov      rax, rdx

G_M000_IG17:
       vzeroupper
       add      rsp, 56
       ret

G_M000_IG18:
       call     CORINFO_HELP_RNGCHKFAIL
       int3

; Total bytes of code 433

Possibly Related

  • #115983 — EVEX/AVX-512/AVX-10 rework merged in .NET 10
  • The .NET 10 JIT header changed from "X64 with AVX512" to "generic X64 + VEX + EVEX", suggesting the instruction encoding selection was significantly reworked

Summary

On .NET 10 GA running on AVX-512-capable Tiger Lake hardware:

  1. Vector256.Sum() lowering changed from vhaddps (4 instructions, ~20 bytes) to vpermilps + vaddps (10 instructions, ~50 bytes), duplicated for every call site in the loop
  2. The register allocator places temporaries in ymm16+ / xmm16+, forcing EVEX encoding on instructions that could be VEX-encoded
  3. The hot loop body grew 28% (213 B → 273 B), potentially exceeding micro-op cache capacity
  4. Net result: 71-73% regression for Vector256<float> explicit intrinsics vs .NET 8
