[TRTLLM-11614][feat] Fixing multigpu tests #11615
greg-kwasniewski1 wants to merge 7 commits into NVIDIA:main
Conversation
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
📝 Walkthrough

The PR replaces multiprocessing shared-memory port synchronization with pipe-based coordination in distributed initialization, adds "mla" and "delta" sharding modes to the sharding configuration library, updates test configurations for TP sharding strategies, and extends the LayerSubgraph data structure with layer type information.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1: The file-level copyright header is missing; add the Apache-2.0 NVIDIA copyright header (with the latest modification year) at the top of tensorrt_llm/_torch/auto_deploy/utils/node_utils.py, above the existing module docstring ("""Common utils for torch fx graph transformation."""), replacing or preceding the docstring so the file begins with the standard NVIDIA Apache-2.0 header block.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py` at line 1: This file lacks the required NVIDIA Apache-2.0 copyright header; add/update the Apache 2.0 NVIDIA copyright header with the latest year at the top of the file (above the existing module docstring "Tests for basic graph sharding.") so the modified Python source carries the correct license header per guidelines; ensure the header matches the project's standard NVIDIA Apache-2.0 template and includes the current year of latest modification.

tensorrt_llm/_torch/auto_deploy/distributed/common.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py` at line 1: Add the Apache-2.0 NVIDIA copyright header at the top of the module (before the existing module docstring), using the latest modification year (2026) and the standard NVIDIA/Apache-2.0 header format; ensure the header text mentions NVIDIA CORPORATION and the Apache License, Version 2.0, and place it above the existing triple-quoted file docstring so the header is the first thing in the file.

tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add/update the NVIDIA copyright header for this modified source file.

This file is a modified Python source and should carry the Apache 2.0 NVIDIA copyright header with the latest year.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1: The module-level docstring ("""Transformations to support graph sharding.""") is missing the required NVIDIA Apache-2.0 copyright header; add or update the standard NVIDIA Apache License 2.0 header with the latest year as a comment block at the very top of the file (before the module docstring) so the file carries the proper NVIDIA copyright and license notice.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/distributed/common.py`:
- Around line 194-209: The current try/except around the port probing loop is
too broad; narrow the except to only expected exception types (e.g., OSError and
RuntimeError) so unexpected errors bubble up. Change the except Exception as e
to something like except (OSError, RuntimeError) as e, preserve assigning
init_error = e, and let other exceptions propagate; keep using the existing
symbols (_is_port_available, get_free_port, ad_logger, init_error, max_retries)
so the behavior and logging remain unchanged.
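A sketch of the narrowed handler the comment asks for. The loop and the `get_free_port`, `init_error`, and `max_retries` names mirror the symbols named in the review comment; the surrounding logic here is illustrative, not the repo's actual code:

```python
import socket

# Illustrative stand-in for the get_free_port symbol named in the comment:
# ask the OS for an ephemeral port by binding to port 0.
def get_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

max_retries = 3
init_error = None
port = None

for _ in range(max_retries):
    try:
        port = get_free_port()
        break
    # Narrowed except: only expected probing failures are recorded;
    # anything else (KeyboardInterrupt, programming bugs) propagates.
    except (OSError, RuntimeError) as e:
        init_error = e
else:
    raise RuntimeError("could not find a free port") from init_error
```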
In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 46-52: Add inline attribute docstrings for the LayerSubgraph
Pydantic model attributes using the triple-quoted `"""<type>: Description"""`
syntax so they render under the class docstring; update attributes model_config,
layer_type, opening_nodes, terminating_node, min_local_shape, and subgraph_nodes
in the LayerSubgraph class to each include a one-line type and brief description
(e.g., model_config: ConfigDict for Pydantic settings; layer_type: LayerType,
opening_nodes: List[Node] of entry nodes; terminating_node: Union[Node,None]
final node in subgraph; min_local_shape: int minimum local tensor shape;
subgraph_nodes: List[Node] nodes in the subgraph). Ensure the docstrings sit
immediately after each attribute declaration following the project's inline
attribute docstring convention.
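The suggested inline attribute docstring convention can be illustrated with a plain annotated class (field names come from the comment above; `Any` placeholders stand in for the repo's `LayerType` and torch.fx `Node` types, so this is a sketch, not the actual model):

```python
from typing import Any, List, Optional

class LayerSubgraph:
    """Subgraph of the traced graph corresponding to one layer (illustrative)."""

    layer_type: Any
    """LayerType: Kind of layer this subgraph represents."""

    opening_nodes: List[Any]
    """List[Node]: Entry nodes of the subgraph."""

    terminating_node: Optional[Any]
    """Optional[Node]: Final node of the subgraph, or None."""

    min_local_shape: int
    """int: Minimum local tensor shape allowed when sharding."""

    subgraph_nodes: List[Any]
    """List[Node]: All nodes contained in the subgraph."""
```

At runtime the string literals after each annotation are discarded, but documentation tools (and IDEs) pick them up and render them under the class docstring, which is the behavior the review comment relies on.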
/bot run

PR_Github #36390 [ run ] triggered by Bot. Commit:

PR_Github #36390 [ run ] completed with state

/bot run

PR_Github #36402 [ run ] triggered by Bot. Commit:

PR_Github #36402 [ run ] completed with state
galagam left a comment:
Added some small comments. Approved.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py (Outdated; thread resolved)
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (thread resolved)
/bot run --reuse-test

PR_Github #36508 [ run ] triggered by Bot. Commit:
Fixes #11614
Fixing Silent Test Passes in Multi-GPU Sharding Tests

Issue

Running any multi-GPU test via `spawn_multiprocess_job` (e.g. `test_sharding[MLP-torch_dist_all_reduce-False-False-2]`) reports `1 passed` even though both child processes crash during initialization.

The test body never executes — the processes die before reaching `init_and_run_process` — yet pytest shows a green result.

Root Cause
Three independent bugs in `tensorrt_llm/_torch/auto_deploy/distributed/common.py`:

1. `isinstance` check uses the wrong class — exit codes are never verified

`_join_multiprocess_job` guards the exit-code assertion with an `isinstance(..., multiprocessing.Process)` check.

Processes created via `mp.get_context("spawn").Process()` are `SpawnProcess` instances. In Python's `multiprocessing`, `SpawnProcess` and `multiprocessing.Process` are siblings under `BaseProcess` — `SpawnProcess` does not inherit from `multiprocessing.Process`.

Because the `isinstance` check always returns `False`, the assertion is silently skipped for every spawned process, regardless of exit code.
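This class-hierarchy quirk can be reproduced in a few lines, independent of the repo code:

```python
import multiprocessing as mp
from multiprocessing.process import BaseProcess

# Construct (but do not start) a spawn-context process.
p = mp.get_context("spawn").Process(target=print)

# SpawnProcess is a sibling of multiprocessing.Process under BaseProcess,
# not a subclass of it:
print(type(p).__name__)           # SpawnProcess
print(isinstance(p, mp.Process))  # False -> the guarded assert is skipped
print(isinstance(p, BaseProcess)) # True
print(hasattr(p, "exitcode"))     # True -> a duck-typed check still works
```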
2. `Value`/`Barrier` use POSIX named semaphores that fail on Lustre

`_start_multiprocess_job` created shared `multiprocessing.Value` and `multiprocessing.Barrier` state for port synchronization.

These objects are pickled and sent to the child processes. During unpickling, the child calls `SemLock._rebuild(name)` to reopen the semaphore by its filesystem path (typically under `/dev/shm/`). On this Lustre-backed HPC system, the semaphore files cannot be found by the child processes, causing the `FileNotFoundError` before `init_and_run_process` ever runs.

3. Latent deadlock in `initialize` — rank 0 blocks all ranks

Even if bugs #1 and #2 were fixed, the port-retry logic in `initialize` has a deadlock:

- Rank 0 calls `_try_init_process_group` → `dist.init_process_group(nccl, world_size=2)` — a collective that blocks until all ranks join.
- The other ranks never reach the collective; they are still blocked waiting for the port (`Barrier` or pipe `recv()`).

The deadlock was never hit with the old code because bug #2 killed both children before they reached `init_and_run_process`.

Fix
common.py:1. Use
hasattrinstead ofisinstancefor the exit-code check2. Replace
Value/BarrierwithPipe-based port synchronizationmultiprocessing.Pipeuses socket pairs internally, not POSIX namedsemaphores, so it works reliably on all filesystems.
size - 1unidirectional pipes (one per non-rank-0 process).port_send_conns). After finding a workingport, it sends the port integer through each pipe.
port_recv_conn) and blockson
recv()until rank 0 writes.This naturally replaces both the
Value(port sharing) and theBarrier(synchronization) in a single mechanism:
recv()blocks untilsend()writes, so non-rank-0 processes cannot proceed until rank 0 has resolved the
port. On error, rank 0 sends
-1so other ranks raise immediately instead ofhanging.
3. Replace `_try_init_process_group` with a lightweight socket-bind check

The old retry loop called `dist.init_process_group` (a collective) to test port availability — this is what caused the deadlock. Replaced with `_is_port_available(port)`, which does a non-blocking `socket.bind()` check.

After the port is resolved and broadcast, all ranks call `init_process_group` together in a common code path — no rank waits on another before the collective.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`: Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`: Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`: Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.