
Comments

[TRTLLM-11614][feat] Fixing multigpu tests #11615

Open
greg-kwasniewski1 wants to merge 7 commits into NVIDIA:main from
nv-auto-deploy:gk/fix_test_launches

Conversation

@greg-kwasniewski1
Collaborator

@greg-kwasniewski1 greg-kwasniewski1 commented Feb 21, 2026

Fixes #11614

Fixing Silent Test Passes in Multi-GPU Sharding Tests

Issue

Running any multi-GPU test via spawn_multiprocess_job (e.g.
test_sharding[MLP-torch_dist_all_reduce-False-False-2]) reports 1 passed
even though both child processes crash during initialization with:

File "/usr/lib/python3.12/multiprocessing/synchronize.py", line 115, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

The test body never executes — the processes die before reaching
init_and_run_process — yet pytest shows a green result.

Root Cause

Three independent bugs in tensorrt_llm/_torch/auto_deploy/distributed/common.py:

1. isinstance check uses the wrong class — exit codes are never verified

_join_multiprocess_job guards the exit-code assertion with:

if isinstance(p, mp.Process):          # always False for spawn-context processes
    assert not p.exitcode, ...

Processes created via mp.get_context("spawn").Process() are SpawnProcess
instances. In Python's multiprocessing, SpawnProcess and
multiprocessing.Process are siblings under BaseProcess; SpawnProcess
does not inherit from multiprocessing.Process:

BaseProcess
├── multiprocessing.context.Process   ← what mp.Process resolves to
├── SpawnProcess                      ← what ctx.Process() creates
├── ForkProcess
└── ForkServerProcess

Because the isinstance check always returns False, the assertion is
silently skipped for every spawned process, regardless of exit code.
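The class hierarchy is easy to verify directly. The following standalone sketch (not the PR's code) demonstrates both the broken guard and the attribute the fix relies on:

```python
import multiprocessing as mp

ctx = mp.get_context("spawn")
p = ctx.Process(target=print, args=("child",))

print(type(p).__name__)                       # SpawnProcess
print(isinstance(p, mp.Process))              # False: the guard never fires
print(isinstance(p, mp.process.BaseProcess))  # True: the common ancestor
print(hasattr(p, "exitcode"))                 # True: what the fix checks instead
```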

2. Value/Barrier use POSIX named semaphores that fail on Lustre

_start_multiprocess_job created shared state for port synchronization:

shared_port = ctx.Value("i", port)       # uses a named semaphore internally
port_ready_barrier = ctx.Barrier(size)    # uses a named semaphore internally

These objects are pickled and sent to the child processes. During unpickling,
the child calls SemLock._rebuild(name) to reopen the semaphore by its
filesystem path (typically under /dev/shm/). On this Lustre-backed HPC
system, the semaphore files cannot be found by the child processes, causing the
FileNotFoundError before init_and_run_process ever runs.
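The named-semaphore dependence can be observed without spawning any child process. A standalone sketch for POSIX systems (the semaphore name format is an implementation detail and varies by platform):

```python
import multiprocessing as mp

ctx = mp.get_context("spawn")
shared_port = ctx.Value("i", 0)

# The Value's lock wraps a POSIX named semaphore. Pickling ships only the
# semaphore's name; the child re-opens it by name via SemLock._rebuild(),
# which fails if the backing file cannot be found.
semlock = shared_port.get_lock()._semlock
print(type(semlock).__name__)  # SemLock
print(semlock.name)            # e.g. a '/mp-…' name, backed by /dev/shm on Linux
```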

3. Latent deadlock in initialize — rank 0 blocks all ranks

Even if bugs #1 and #2 were fixed, the port-retry logic in initialize has a
deadlock:

  1. Rank 0 calls _try_init_process_group, which invokes
    dist.init_process_group("nccl", world_size=2) — a collective that blocks
    until all ranks join
  2. Rank 1 is waiting for rank 0 to signal the port (via Barrier or pipe
    recv())
  3. Neither can proceed → deadlock

The deadlock was never hit with the old code because bug #2 killed both
children before they reached init_and_run_process.

Fix

All three bugs are fixed in common.py:

1. Use hasattr instead of isinstance for the exit-code check

# Before (broken — SpawnProcess is not a subclass of mp.Process)
if isinstance(p, mp.Process):
    assert not p.exitcode, ...

# After (works for all process types)
if hasattr(p, "exitcode"):
    assert p.exitcode == 0, ...

2. Replace Value/Barrier with Pipe-based port synchronization

multiprocessing.Pipe uses socket pairs internally, not POSIX named
semaphores, so it works reliably on all filesystems.

  • Create size - 1 unidirectional pipes (one per non-rank-0 process).
  • Rank 0 receives all write-ends (port_send_conns). After finding a working
    port, it sends the port integer through each pipe.
  • Each non-rank-0 process receives one read-end (port_recv_conn) and blocks
    on recv() until rank 0 writes.

This naturally replaces both the Value (port sharing) and the Barrier
(synchronization) in a single mechanism: recv() blocks until send()
writes, so non-rank-0 processes cannot proceed until rank 0 has resolved the
port. On error, rank 0 sends -1 so other ranks raise immediately instead of
hanging.
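A minimal sketch of the mechanism, simplified to a single worker (the names follow the description above, not necessarily the exact PR code):

```python
import multiprocessing as mp

def worker(port_recv_conn):
    # Non-rank-0 path: block on recv() until rank 0 resolves the port.
    port = port_recv_conn.recv()
    if port == -1:
        raise RuntimeError("rank 0 failed to resolve a port")
    print(f"worker received port {port}")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    # Unidirectional pipe: read end goes to the worker, write end stays with rank 0.
    port_recv_conn, port_send_conn = ctx.Pipe(duplex=False)
    p = ctx.Process(target=worker, args=(port_recv_conn,))
    p.start()
    port_send_conn.send(29500)  # rank 0 broadcasts the resolved port
    p.join()
    assert p.exitcode == 0      # the exit-code check from fix #1
```

Connection objects pickle cleanly into spawned children because they carry a file descriptor (duplicated at spawn time), not a filesystem-named semaphore.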

3. Replace _try_init_process_group with a lightweight socket-bind check

The old retry loop called dist.init_process_group (a collective) to test port
availability — this is what caused the deadlock. Replaced with
_is_port_available(port) which does a non-blocking socket.bind() check.
After the port is resolved and broadcast, all ranks call
init_process_group together in a common code path — no rank waits on another
before the collective.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added tensor parallelism sharding configurations for Qwen 3.5 MOE 35B and 400B models.
    • Introduced support for MLA and Delta layer sharding strategies.
  • Improvements

    • Enhanced distributed training port coordination and synchronization mechanism.
    • Improved layer type detection capabilities.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 21, 2026

📝 Walkthrough

Walkthrough

The PR replaces multiprocessing shared memory port synchronization with pipe-based coordination in distributed initialization, adds "mla" and "delta" sharding modes to the sharding configuration library, updates test configurations for TP sharding strategies, and extends the LayerSubgraph data structure with layer type information.

Changes

  • Configuration Updates (examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b.yaml, examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml):
    Added detect_sharding configurations with manual TP sharding plans defining per-operator sharding strategies for projection layers (in_proj_qkv, q_proj, k_proj, v_proj, o_proj).
  • Distributed Coordination Refactoring (tensorrt_llm/_torch/auto_deploy/distributed/common.py):
    Replaced multiprocessing Value/Barrier-based port synchronization with pipe-based inter-process communication. Updated signatures of initialize_or_skip, initialize, init_and_run_process, _start_multiprocess_job, and _join_multiprocess_job to use port_recv_conn and port_send_conns. Added _is_port_available helper for socket-based port availability checks.
  • Sharding Library Extension (tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py):
    Added support for "mla" and "delta" sharding modes in configuration validation and detection flow. Introduced num_mla_shards and num_delta_shards counters with corresponding shard processors and summary logging.
  • Data Structure Enhancement (tensorrt_llm/_torch/auto_deploy/utils/node_utils.py):
    Extended LayerSubgraph class with layer_type: LayerType and subgraph_nodes: List[Node] public fields.
  • Test Configuration Updates (tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py):
    Modified base_model_tp_plan to replace projection sharding entries (removed gather/colwise strategies, added "mla" for q_a_proj and "delta" for in_proj_qkv). Updated test flows to use unified sharding_source field and inject manual_config into detect_sharding configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title '[TRTLLM-11614][feat] Fixing multigpu tests' clearly identifies the ticket and describes the main objective of fixing multi-GPU test issues.
  • Linked Issues check ✅ Passed: The PR fully addresses all objectives from issue #11614: it fixes silent test passes by validating exit codes [common.py], eliminates POSIX semaphore failures via pipe-based synchronization [common.py], and removes the deadlock by using non-blocking port checks [common.py].
  • Out of Scope Changes check ✅ Passed: All changes are scoped to fixing multi-GPU test issues: core fixes in common.py, supporting configuration files for sharding tests, test utilities, and test configuration updates, all directly related to the linked issue.
  • Description check ✅ Passed: The PR description comprehensively addresses issue #11614 with detailed root cause analysis and fixes for three distinct bugs in multiprocess synchronization and exit-code verification.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1, The
file-level copyright header is missing; add the Apache-2.0 NVIDIA copyright
header (with the latest modification year) at the top of
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py above the existing module
docstring ("""Common utils for torch fx graph transformation."""), replacing or
preceding the docstring so the file begins with the standard NVIDIA Apache-2.0
header block.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`
at line 1, This file lacks the required NVIDIA Apache-2.0 copyright header;
add/update the Apache 2.0 NVIDIA copyright header with the latest year at the
top of the file (above the existing module docstring "Tests for basic graph
sharding.") so the modified Python source carries the correct license header per
guidelines; ensure the header matches the project's standard NVIDIA Apache-2.0
template and includes the current year of latest modification.
tensorrt_llm/_torch/auto_deploy/distributed/common.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py` at line 1, Add the
Apache‑2.0 NVIDIA copyright header at the top of the module (before the existing
module docstring) for tensorrt_llm._torch.auto_deploy.distributed.common, using
the latest modification year (2026) and the standard NVIDIA/Apache‑2.0 header
format; ensure the header text mentions NVIDIA CORPORATION and the Apache
License, Version 2.0, and place it above the existing triple‑quoted file
docstring so the header is the first thing in the file.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1, The
module-level docstring (the first line """Transformations to support graph
sharding.) is missing the required NVIDIA Apache-2.0 copyright header; add or
update the standard NVIDIA Apache License 2.0 header with the latest year as a
comment block at the very top of the file (before the module docstring) so the
file carries the proper NVIDIA copyright and license notice.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py`:
- Around line 194-209: The current try/except around the port probing loop is
too broad; narrow the except to only expected exception types (e.g., OSError and
RuntimeError) so unexpected errors bubble up. Change the except Exception as e
to something like except (OSError, RuntimeError) as e, preserve assigning
init_error = e, and let other exceptions propagate; keep using the existing
symbols (_is_port_available, get_free_port, ad_logger, init_error, max_retries)
so the behavior and logging remain unchanged.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 46-52: Add inline attribute docstrings for the LayerSubgraph
Pydantic model attributes using the triple-quoted `"""<type>: Description"""`
syntax so they render under the class docstring; update attributes model_config,
layer_type, opening_nodes, terminating_node, min_local_shape, and subgraph_nodes
in the LayerSubgraph class to each include a one-line type and brief description
(e.g., model_config: ConfigDict for Pydantic settings; layer_type: LayerType,
opening_nodes: List[Node] of entry nodes; terminating_node: Union[Node,None]
final node in subgraph; min_local_shape: int minimum local tensor shape;
subgraph_nodes: List[Node] nodes in the subgraph). Ensure the docstrings sit
immediately after each attribute declaration following the project's inline
attribute docstring convention.

---

Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py`:
- Line 1: Add the Apache‑2.0 NVIDIA copyright header at the top of the module
(before the existing module docstring) for
tensorrt_llm._torch.auto_deploy.distributed.common, using the latest
modification year (2026) and the standard NVIDIA/Apache‑2.0 header format;
ensure the header text mentions NVIDIA CORPORATION and the Apache License,
Version 2.0, and place it above the existing triple‑quoted file docstring so the
header is the first thing in the file.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py`:
- Line 1: The module-level docstring (the first line """Transformations to
support graph sharding.) is missing the required NVIDIA Apache-2.0 copyright
header; add or update the standard NVIDIA Apache License 2.0 header with the
latest year as a comment block at the very top of the file (before the module
docstring) so the file carries the proper NVIDIA copyright and license notice.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Line 1: The file-level copyright header is missing; add the Apache-2.0 NVIDIA
copyright header (with the latest modification year) at the top of
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py above the existing module
docstring ("""Common utils for torch fx graph transformation."""), replacing or
preceding the docstring so the file begins with the standard NVIDIA Apache-2.0
header block.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`:
- Line 1: This file lacks the required NVIDIA Apache-2.0 copyright header;
add/update the Apache 2.0 NVIDIA copyright header with the latest year at the
top of the file (above the existing module docstring "Tests for basic graph
sharding.") so the modified Python source carries the correct license header per
guidelines; ensure the header matches the project's standard NVIDIA Apache-2.0
template and includes the current year of latest modification.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #36390 [ run ] triggered by Bot. Commit: 017260d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36390 [ run ] completed with state SUCCESS. Commit: 017260d
/LLM/main/L0_MergeRequest_PR pipeline #28150 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #36402 [ run ] triggered by Bot. Commit: 69503ac Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36402 [ run ] completed with state SUCCESS. Commit: 69503ac
/LLM/main/L0_MergeRequest_PR pipeline #28161 completed with status: 'SUCCESS'

Link to invocation

Collaborator

@galagam galagam left a comment


Added some small comments. Approved.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #36508 [ run ] triggered by Bot. Commit: d3cbd7f Link to invocation



Development

Successfully merging this pull request may close these issues.

[Bug]: Multi GPU tests failing silently

3 participants