Skip to content

Commit e429373

Browse files
lewingCopilot
andauthored
Fix wasm Playwright TargetClosedException caused by wrong Helix queue (#125548)
## Summary Fixes #121195 Wasm Playwright browser tests fail intermittently with `TargetClosedException`. Investigation revealed **two distinct failure modes**: ### Root Cause 1: Wrong Helix queue for internal builds In `eng/pipelines/libraries/helix-queues-setup.yml`, the Android queue selection used an `or` condition: ```yaml - ${{ if or(eq(variables['System.TeamProject'], 'internal'), in(parameters.platform, 'android_x86', ...)) }}: - Ubuntu.2204.Amd64.Android.29.Open ``` For internal builds, `System.TeamProject == 'internal'` is always true, so the Android queue was added for **all platforms** — including `browser_wasm`. This caused wasm browser tests to run on both: - ✅ `ubuntu-22.04-helix-webassembly` Docker container (has Chrome shared lib deps → tests pass) - ❌ `ubuntu.2204.amd64.android.29` bare metal (missing `libgbm.so.1` → Chrome can't start → `TargetClosedException`) **Evidence:** Internal build 2920781 log: `Using Queues: ubuntu.2204.amd64.android.29+(ubuntu.2204.amd64)ubuntu.2204.amd64@mcr.microsoft.com/...ubuntu-22.04-helix-webassembly` ### Root Cause 2: Intermittent Chrome OOM crash on public builds Public builds use only the correct Docker queue (with Chrome deps), but Chrome still crashes intermittently. Investigation of build 1332440 showed: - Chrome launches successfully, then silently dies during `GotoAsync` navigation - No missing library errors, no crash output — Chrome is OOM-killed - xunit runs test classes **in parallel** (default behavior, `CollectionPerAssembly` is commented out) - Concurrent Chrome instances + `wasm-opt` builds exhaust Docker container memory - The existing retry in `SpawnBrowserAsync` only covers `LaunchAsync`, not navigation **Evidence:** Build 1332440, job ff40a660 — `SatelliteLoadingTests` Chrome crashes while `AssetCachingTests` runs wasm-opt concurrently (takes 132s total). 24 of 25 Chrome launches in the work item succeed; the crash is timing-dependent. ### Changes 1. **`eng/pipelines/libraries/helix-queues-setup.yml`**: Fix the Android queue condition to only add the queue for Android/bionic platforms (not all internal builds). 2. **`src/mono/wasm/Wasm.Build.Tests/BrowserRunner.cs`**: - Add `CheckBrowserDependencies()` — uses `ldd` on Linux to detect missing Chrome shared libraries before launch, providing a clear error message instead of cryptic `TargetClosedException` - Add `PlaywrightException` to `SpawnBrowserAsync` retry (previously only `TimeoutException`) - **Add session-level retry in `RunAsync`** — wraps the full browser session (launch + navigate) with retry logic, so when Chrome crashes during `GotoAsync`, a fresh browser instance is created and navigation is retried. This is the fix for the public build failures. - Preserve `lastException` as `InnerException` when launch retry is exhausted ### Reproduction - **Internal build failure**: Reproduced 100% on codespace without Chrome system deps. After installing deps, tests pass 100%. - **Public build failure**: Observed in build 1332440 — Chrome crash during GotoAsync with concurrent wasm-opt, no missing library errors. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 0d6db56 commit e429373

2 files changed

Lines changed: 53 additions & 9 deletions

File tree

eng/pipelines/libraries/helix-queues-setup.yml

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -85,9 +85,11 @@ jobs:
8585
- $(helix_macos_x64)
8686

8787
# Android
88-
# Always use the Ubuntu-based Android queue for internal validation as there is no internal equivalent of
89-
# the Windows.11.Amd64.Android.Open queue.
90-
- ${{ if or(eq(variables['System.TeamProject'], 'internal'), in(parameters.platform, 'android_x86', 'android_x64', 'linux_bionic_x64')) }}:
88+
# Use the Ubuntu-based Android queue for x86/x64/bionic_x64 on all projects,
89+
# and also for arm/arm64/bionic_arm/bionic_arm64 on non-public projects (no internal Windows Android queue).
90+
- ${{ if in(parameters.platform, 'android_x86', 'android_x64', 'linux_bionic_x64') }}:
91+
- Ubuntu.2204.Amd64.Android.29.Open
92+
- ${{ if and(ne(variables['System.TeamProject'], 'public'), in(parameters.platform, 'android_arm', 'android_arm64', 'linux_bionic_arm', 'linux_bionic_arm64')) }}:
9193
- Ubuntu.2204.Amd64.Android.29.Open
9294
- ${{ if and(eq(variables['System.TeamProject'], 'public'), in(parameters.platform, 'android_arm', 'android_arm64', 'linux_bionic_arm', 'linux_bionic_arm64')) }}:
9395
- Windows.11.Amd64.Android.Open

src/mono/wasm/Wasm.Build.Tests/BrowserRunner.cs

Lines changed: 48 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -4,10 +4,10 @@
44
#nullable enable
55

66
using System;
7-
using System.Linq;
8-
using System.IO;
97
using System.Collections.Generic;
8+
using System.Linq;
109
using System.Text.RegularExpressions;
10+
using System.IO;
1111
using System.Threading.Tasks;
1212
using Microsoft.Playwright;
1313
using Wasm.Tests.Internal;
@@ -124,6 +124,7 @@ public async Task<IBrowser> SpawnBrowserAsync(
124124
chromeArgs = chromeArgs.Append("--headless").ToArray();
125125
_testOutput.WriteLine($"Launching chrome ('{s_chromePath.Value}') via playwright with args = {string.Join(',', chromeArgs)}");
126126

127+
Exception? lastException = null;
127128
int attempt = 0;
128129
while (attempt < maxRetries)
129130
{
@@ -143,12 +144,19 @@ public async Task<IBrowser> SpawnBrowserAsync(
143144
}
144145
catch (System.TimeoutException ex)
145146
{
147+
lastException = ex;
146148
attempt++;
147149
_testOutput.WriteLine($"Attempt {attempt} failed with TimeoutException: {ex.Message}");
148150
}
151+
catch (PlaywrightException ex)
152+
{
153+
lastException = ex;
154+
attempt++;
155+
_testOutput.WriteLine($"Attempt {attempt} failed with PlaywrightException: {ex.Message}");
156+
}
149157
}
150158
if (attempt == maxRetries)
151-
throw new Exception($"Failed to launch browser after {maxRetries} attempts");
159+
throw new InvalidOperationException($"Failed to launch browser after {maxRetries} attempts", lastException);
152160
return Browser!;
153161
}
154162

@@ -164,9 +172,43 @@ public async Task<IPage> RunAsync(
164172
Func<string, string>? modifyBrowserUrl = null)
165173
{
166174
var urlString = await StartServerAndGetUrlAsync(cmd, args, onServerMessage);
167-
var browser = await SpawnBrowserAsync(urlString, headless, locale: locale);
168-
var context = await browser.NewContextAsync(new BrowserNewContextOptions { Locale = locale });
169-
return await RunAsync(context, urlString, headless, onConsoleMessage, onError, modifyBrowserUrl);
175+
176+
// Retry the full browser session (launch + navigate) to handle
177+
// intermittent Chrome crashes in Docker containers under memory pressure.
178+
// Chrome can silently die (OOM killed) during navigation when concurrent
179+
// test classes run wasm-opt builds alongside browser tests.
180+
const int maxSessionRetries = 2;
181+
for (int attempt = 0; ; attempt++)
182+
{
183+
try
184+
{
185+
// On retries, only try launching once since SpawnBrowserAsync has its own retry loop
186+
int launchRetries = attempt == 0 ? 3 : 1;
187+
var browser = await SpawnBrowserAsync(urlString, headless, maxRetries: launchRetries, locale: locale);
188+
var context = await browser.NewContextAsync(new BrowserNewContextOptions { Locale = locale });
189+
return await RunAsync(context, urlString, headless, onConsoleMessage, onError, modifyBrowserUrl);
190+
}
191+
catch (Exception ex) when (attempt + 1 < maxSessionRetries &&
192+
ex is PlaywrightException)
193+
{
194+
_testOutput.WriteLine($"Browser session attempt {attempt + 1} failed with {ex.GetType().Name}: {ex.Message}");
195+
_testOutput.WriteLine("Retrying with a fresh browser instance...");
196+
try
197+
{
198+
if (Browser is not null)
199+
{
200+
await Browser.DisposeAsync();
201+
Browser = null;
202+
}
203+
Playwright?.Dispose();
204+
Playwright = null;
205+
}
206+
catch (Exception disposeEx)
207+
{
208+
_testOutput.WriteLine($"Browser cleanup failed: {disposeEx.Message}");
209+
}
210+
}
211+
}
170212
}
171213

172214
public async Task<IPage> RunAsync(

0 commit comments

Comments
 (0)