
Comments

[TRTLLM-11614][feat] Fixing multigpu tests #11615

Open
greg-kwasniewski1 wants to merge 7 commits into NVIDIA:main from
nv-auto-deploy:gk/fix_test_launches

Conversation

@greg-kwasniewski1
Collaborator

@greg-kwasniewski1 greg-kwasniewski1 commented Feb 21, 2026

Fixes #11614

Fixing Silent Test Passes in Multi-GPU Sharding Tests

Issue

Running any multi-GPU test via spawn_multiprocess_job (e.g.
test_sharding[MLP-torch_dist_all_reduce-False-False-2]) reports 1 passed
even though both child processes crash during initialization with:

File "/usr/lib/python3.12/multiprocessing/synchronize.py", line 115, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

The test body never executes — the processes die before reaching
init_and_run_process — yet pytest shows a green result.

Root Cause

Three independent bugs in tensorrt_llm/_torch/auto_deploy/distributed/common.py:

1. isinstance check uses the wrong class — exit codes are never verified

_join_multiprocess_job guards the exit-code assertion with:

if isinstance(p, mp.Process):          # always False for spawn-context processes
    assert not p.exitcode, ...

Processes created via mp.get_context("spawn").Process() are SpawnProcess
instances. In Python's multiprocessing, SpawnProcess and
multiprocessing.Process are siblings under BaseProcess; SpawnProcess
does not inherit from multiprocessing.Process:

BaseProcess
├── multiprocessing.context.Process   ← what mp.Process resolves to
├── SpawnProcess                      ← what ctx.Process() creates
├── ForkProcess
└── ForkServerProcess

Because the isinstance check always returns False, the assertion is
silently skipped for every spawned process, regardless of exit code.
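The class hierarchy is easy to verify directly. The following standalone sketch (not the PR's code) demonstrates both the broken guard and the attribute the fix relies on:

```python
import multiprocessing as mp

ctx = mp.get_context("spawn")
p = ctx.Process(target=print, args=("child",))

print(type(p).__name__)                       # SpawnProcess
print(isinstance(p, mp.Process))              # False: the guard never fires
print(isinstance(p, mp.process.BaseProcess))  # True: the common ancestor
print(hasattr(p, "exitcode"))                 # True: what the fix checks instead
```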

2. Value/Barrier use POSIX named semaphores that fail on Lustre

_start_multiprocess_job created shared state for port synchronization:

shared_port = ctx.Value("i", port)       # uses a named semaphore internally
port_ready_barrier = ctx.Barrier(size)    # uses a named semaphore internally

These objects are pickled and sent to the child processes. During unpickling,
the child calls SemLock._rebuild(name) to reopen the semaphore by its
filesystem path (typically under /dev/shm/). On this Lustre-backed HPC
system, the semaphore files cannot be found by the child processes, causing the
FileNotFoundError before init_and_run_process ever runs.
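The named-semaphore dependence can be observed without spawning any child process. A standalone sketch for POSIX systems (the semaphore name format is an implementation detail and varies by platform):

```python
import multiprocessing as mp

ctx = mp.get_context("spawn")
shared_port = ctx.Value("i", 0)

# The Value's lock wraps a POSIX named semaphore. Pickling ships only the
# semaphore's name; the child re-opens it by name via SemLock._rebuild(),
# which fails if the backing file cannot be found.
semlock = shared_port.get_lock()._semlock
print(type(semlock).__name__)  # SemLock
print(semlock.name)            # e.g. a '/mp-…' name, backed by /dev/shm on Linux
```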

3. Latent deadlock in initialize — rank 0 blocks all ranks

Even if bugs #1 and #2 were fixed, the port-retry logic in initialize has a
deadlock:

  1. Rank 0 calls _try_init_process_group, which invokes
    dist.init_process_group("nccl", world_size=2) — a collective that blocks
    until all ranks join
  2. Rank 1 is waiting for rank 0 to signal the port (via Barrier or pipe
    recv())
  3. Neither can proceed → deadlock

The deadlock was never hit with the old code because bug #2 killed both
children before they reached init_and_run_process.

Fix

All three bugs are fixed in common.py:

1. Use hasattr instead of isinstance for the exit-code check

# Before (broken — SpawnProcess is not a subclass of mp.Process)
if isinstance(p, mp.Process):
    assert not p.exitcode, ...

# After (works for all process types)
if hasattr(p, "exitcode"):
    assert p.exitcode == 0, ...

2. Replace Value/Barrier with Pipe-based port synchronization

multiprocessing.Pipe uses socket pairs internally, not POSIX named
semaphores, so it works reliably on all filesystems.

  • Create size - 1 unidirectional pipes (one per non-rank-0 process).
  • Rank 0 receives all write-ends (port_send_conns). After finding a working
    port, it sends the port integer through each pipe.
  • Each non-rank-0 process receives one read-end (port_recv_conn) and blocks
    on recv() until rank 0 writes.

This naturally replaces both the Value (port sharing) and the Barrier
(synchronization) in a single mechanism: recv() blocks until send()
writes, so non-rank-0 processes cannot proceed until rank 0 has resolved the
port. On error, rank 0 sends -1 so other ranks raise immediately instead of
hanging.
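A minimal sketch of the mechanism, simplified to a single worker (the names follow the description above, not necessarily the exact PR code):

```python
import multiprocessing as mp

def worker(port_recv_conn):
    # Non-rank-0 path: block on recv() until rank 0 resolves the port.
    port = port_recv_conn.recv()
    if port == -1:
        raise RuntimeError("rank 0 failed to resolve a port")
    print(f"worker received port {port}")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    # Unidirectional pipe: read end goes to the worker, write end stays with rank 0.
    port_recv_conn, port_send_conn = ctx.Pipe(duplex=False)
    p = ctx.Process(target=worker, args=(port_recv_conn,))
    p.start()
    port_send_conn.send(29500)  # rank 0 broadcasts the resolved port
    p.join()
    assert p.exitcode == 0      # the exit-code check from fix #1
```

Connection objects pickle cleanly into spawned children because they carry a file descriptor (duplicated at spawn time), not a filesystem-named semaphore.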

3. Replace _try_init_process_group with a lightweight socket-bind check

The old retry loop called dist.init_process_group (a collective) to test port
availability — this is what caused the deadlock. Replaced with
_is_port_available(port) which does a non-blocking socket.bind() check.
After the port is resolved and broadcast, all ranks call
init_process_group together in a common code path — no rank waits on another
before the collective.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added tensor parallelism sharding configurations for Qwen 3.5 MOE 35B and 400B models.
    • Introduced support for MLA and Delta layer sharding strategies.
  • Improvements

    • Enhanced distributed training port coordination and synchronization mechanism.
    • Improved layer type detection capabilities.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 21, 2026

📝 Walkthrough

Walkthrough

The PR replaces multiprocessing shared memory port synchronization with pipe-based coordination in distributed initialization, adds "mla" and "delta" sharding modes to the sharding configuration library, updates test configurations for TP sharding strategies, and extends the LayerSubgraph data structure with layer type information.

Changes

  • Configuration Updates (examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b.yaml, examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml):
    Added detect_sharding configurations with manual TP sharding plans defining per-operator sharding strategies for projection layers (in_proj_qkv, q_proj, k_proj, v_proj, o_proj).
  • Distributed Coordination Refactoring (tensorrt_llm/_torch/auto_deploy/distributed/common.py):
    Replaced multiprocessing Value/Barrier-based port synchronization with pipe-based inter-process communication. Updated signatures of initialize_or_skip, initialize, init_and_run_process, _start_multiprocess_job, and _join_multiprocess_job to use port_recv_conn and port_send_conns. Added _is_port_available helper for socket-based port availability checks.
  • Sharding Library Extension (tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py):
    Added support for "mla" and "delta" sharding modes in configuration validation and detection flow. Introduced num_mla_shards and num_delta_shards counters with corresponding shard processors and summary logging.
  • Data Structure Enhancement (tensorrt_llm/_torch/auto_deploy/utils/node_utils.py):
    Extended LayerSubgraph class with layer_type: LayerType and subgraph_nodes: List[Node] public fields.
  • Test Configuration Updates (tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py):
    Modified base_model_tp_plan to replace projection sharding entries (removed gather/colwise strategies, added "mla" for q_a_proj and "delta" for in_proj_qkv). Updated test flows to use unified sharding_source field and inject manual_config into detect_sharding configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title '[TRTLLM-11614][feat] Fixing multigpu tests' clearly identifies the ticket and describes the main objective of fixing multi-GPU test issues.
  • Linked Issues check ✅ Passed: The PR fully addresses all objectives from issue #11614: it fixes silent test passes by validating exit codes [common.py], eliminates POSIX semaphore failures via pipe-based synchronization [common.py], and removes the deadlock by using non-blocking port checks [common.py].
  • Out of Scope Changes check ✅ Passed: All changes are scoped to fixing multi-GPU test issues: core fixes in common.py, supporting configuration files for sharding tests, test utilities, and test configuration updates, all directly related to the linked issue.
  • Description check ✅ Passed: The PR description comprehensively addresses issue #11614 with detailed root cause analysis and fixes for three distinct bugs in multiprocess synchronization and exit-code verification.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1, The
file-level copyright header is missing; add the Apache-2.0 NVIDIA copyright
header (with the latest modification year) at the top of
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py above the existing module
docstring ("""Common utils for torch fx graph transformation."""), replacing or
preceding the docstring so the file begins with the standard NVIDIA Apache-2.0
header block.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`
at line 1, This file lacks the required NVIDIA Apache-2.0 copyright header;
add/update the Apache 2.0 NVIDIA copyright header with the latest year at the
top of the file (above the existing module docstring "Tests for basic graph
sharding.") so the modified Python source carries the correct license header per
guidelines; ensure the header matches the project's standard NVIDIA Apache-2.0
template and includes the current year of latest modification.
tensorrt_llm/_torch/auto_deploy/distributed/common.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py` at line 1, Add the
Apache‑2.0 NVIDIA copyright header at the top of the module (before the existing
module docstring) for tensorrt_llm._torch.auto_deploy.distributed.common, using
the latest modification year (2026) and the standard NVIDIA/Apache‑2.0 header
format; ensure the header text mentions NVIDIA CORPORATION and the Apache
License, Version 2.0, and place it above the existing triple‑quoted file
docstring so the header is the first thing in the file.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.
As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1, The
module-level docstring (the first line """Transformations to support graph
sharding.) is missing the required NVIDIA Apache-2.0 copyright header; add or
update the standard NVIDIA Apache License 2.0 header with the latest year as a
comment block at the very top of the file (before the module docstring) so the
file carries the proper NVIDIA copyright and license notice.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py`:
- Around line 194-209: The current try/except around the port probing loop is
too broad; narrow the except to only expected exception types (e.g., OSError and
RuntimeError) so unexpected errors bubble up. Change the except Exception as e
to something like except (OSError, RuntimeError) as e, preserve assigning
init_error = e, and let other exceptions propagate; keep using the existing
symbols (_is_port_available, get_free_port, ad_logger, init_error, max_retries)
so the behavior and logging remain unchanged.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 46-52: Add inline attribute docstrings for the LayerSubgraph
Pydantic model attributes using the triple-quoted `"""<type>: Description"""`
syntax so they render under the class docstring; update attributes model_config,
layer_type, opening_nodes, terminating_node, min_local_shape, and subgraph_nodes
in the LayerSubgraph class to each include a one-line type and brief description
(e.g., model_config: ConfigDict for Pydantic settings; layer_type: LayerType,
opening_nodes: List[Node] of entry nodes; terminating_node: Union[Node,None]
final node in subgraph; min_local_shape: int minimum local tensor shape;
subgraph_nodes: List[Node] nodes in the subgraph). Ensure the docstrings sit
immediately after each attribute declaration following the project's inline
attribute docstring convention.

---

Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py`:
- Line 1: Add the Apache‑2.0 NVIDIA copyright header at the top of the module
(before the existing module docstring) for
tensorrt_llm._torch.auto_deploy.distributed.common, using the latest
modification year (2026) and the standard NVIDIA/Apache‑2.0 header format;
ensure the header text mentions NVIDIA CORPORATION and the Apache License,
Version 2.0, and place it above the existing triple‑quoted file docstring so the
header is the first thing in the file.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py`:
- Line 1: The module-level docstring (the first line """Transformations to
support graph sharding.) is missing the required NVIDIA Apache-2.0 copyright
header; add or update the standard NVIDIA Apache License 2.0 header with the
latest year as a comment block at the very top of the file (before the module
docstring) so the file carries the proper NVIDIA copyright and license notice.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Line 1: The file-level copyright header is missing; add the Apache-2.0 NVIDIA
copyright header (with the latest modification year) at the top of
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py above the existing module
docstring ("""Common utils for torch fx graph transformation."""), replacing or
preceding the docstring so the file begins with the standard NVIDIA Apache-2.0
header block.

In
`@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py`:
- Line 1: This file lacks the required NVIDIA Apache-2.0 copyright header;
add/update the Apache 2.0 NVIDIA copyright header with the latest year at the
top of the file (above the existing module docstring "Tests for basic graph
sharding.") so the modified Python source carries the correct license header per
guidelines; ensure the header matches the project's standard NVIDIA Apache-2.0
template and includes the current year of latest modification.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #36390 [ run ] triggered by Bot. Commit: 017260d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36390 [ run ] completed with state SUCCESS. Commit: 017260d
/LLM/main/L0_MergeRequest_PR pipeline #28150 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #36402 [ run ] triggered by Bot. Commit: 69503ac Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36402 [ run ] completed with state SUCCESS. Commit: 69503ac
/LLM/main/L0_MergeRequest_PR pipeline #28161 completed with status: 'SUCCESS'

Link to invocation

Collaborator

@galagam galagam left a comment


Added some small comments. Approved.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #36508 [ run ] triggered by Bot. Commit: d3cbd7f Link to invocation



Development

Successfully merging this pull request may close these issues.

[Bug]: Multi GPU tests failing silently

3 participants