[None][fix] Nemotron H fp4 and MTP #11601
Conversation
📝 Walkthrough

These changes modify MoE weight remapping to handle quantizer weights, adjust the input source for latent projection in Nemotron routing, extend layer masking for MTP hybrid decoding during speculative decoding, and correct backend assertion logic in the Mamba cache manager.
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)
Line 1: ⚠️ Potential issue | 🟡 Minor — Missing NVIDIA copyright header.

The file has no copyright header. As per coding guidelines, all `.py` source files must include an NVIDIA copyright header with the Apache 2.0 license block and the year of latest meaningful modification.

📝 Add copyright header at top of file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import torch
```

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification… Include NVIDIA copyright header on ALL new files and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py` at line 1, add the NVIDIA copyright header (Apache-2.0 license block with the year of latest meaningful modification) to the very top of the file before any imports; ensure the full NVIDIA header text used across the repo is copied exactly and the year is updated appropriately, so the file starts with the standard copyright/license block followed by the existing `import torch` line.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
Line 1: ⚠️ Potential issue | 🟡 Minor — Update copyright year.

The file's copyright reads `2022-2024` but the file is being modified in this PR. Per coding guidelines, the year of latest meaningful modification must be reflected.

📝 Proposed fix:

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines: "update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` at line 1, update the SPDX copyright header at the top of the file: replace the current year range "2022-2024" in the SPDX copyright comment with the correct latest modification year (e.g., "2022-2025") so the header reflects the file's most recent change.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Lines 729-734: Precompute pattern counts outside the outer loop.

`mtp_pattern.count("*")` and `mtp_pattern.count("M")` are O(len(pattern)) string scans invoked on every iteration of the `range(num_mtp)` loop, making the work O(num_mtp × len(pattern)) when O(len(pattern)) suffices.

♻️ Proposed refactor:

```diff
+    mtp_attn_count = mtp_pattern.count("*")
+    mtp_mamba_count = mtp_pattern.count("M")
     for _ in range(num_mtp):
         for char in mtp_pattern:
             hybrid_layer_mask.append(char == "*")
             mamba_layer_mask.append(char == "M")
-        num_layers += mtp_pattern.count("*")
-        mamba_num_layers += mtp_pattern.count("M")
+        num_layers += mtp_attn_count
+        mamba_num_layers += mtp_mamba_count
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 729 - 734, Precompute mtp_pattern.count("*") and mtp_pattern.count("M") (and optionally the boolean list of characters for mtp_pattern) before the outer loop in the code that populates hybrid_layer_mask and mamba_layer_mask so you don't re-scan the string each iteration; then inside the for _ in range(num_mtp) loop append the precomputed booleans to hybrid_layer_mask and mamba_layer_mask and update num_layers and mamba_num_layers by adding the precomputed counts instead of calling mtp_pattern.count(...) repeatedly (refer to variables/functions mtp_pattern, num_mtp, hybrid_layer_mask, mamba_layer_mask, num_layers, mamba_num_layers).
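As a runnable sketch of the refactor this nitpick suggests: the counts are hoisted out of the loop and reused each iteration. Function and variable names follow the review comment, not necessarily the actual `_util.py` code, so treat this as an illustration rather than the real implementation.

```python
def extend_masks_for_mtp(mtp_pattern, num_mtp):
    """Extend hybrid/Mamba layer masks for num_mtp repetitions of mtp_pattern.

    '*' marks an attention (hybrid) layer, 'M' marks a Mamba layer, mirroring
    the pattern convention described in the review comment.
    """
    hybrid_layer_mask, mamba_layer_mask = [], []
    num_layers = mamba_num_layers = 0
    # Count once, outside the loop: O(len(pattern)) total instead of
    # O(num_mtp * len(pattern)) repeated string scans.
    mtp_attn_count = mtp_pattern.count("*")
    mtp_mamba_count = mtp_pattern.count("M")
    for _ in range(num_mtp):
        for char in mtp_pattern:
            hybrid_layer_mask.append(char == "*")
            mamba_layer_mask.append(char == "M")
        num_layers += mtp_attn_count
        mamba_num_layers += mtp_mamba_count
    return hybrid_layer_mask, mamba_layer_mask, num_layers, mamba_num_layers
```

The behavior is unchanged; only the redundant `str.count` calls inside the loop are removed.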
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`:
- Around line 126-129: Update the stale comment above the conditional that
checks for "input_scale", "weight_scale_2", "input_quantizer", and
"weight_quantizer" to explain why the quantizer keys are treated the same as the
scalar scales for fp8 and nvfp4 models; specifically, mention that
input_quantizer and weight_quantizer are scalar/shared parameters and therefore
can be copied into both new_weights[w3_key] and new_weights[w1_key] using
weights[name] (referencing the conditional that uses keys "input_scale",
"weight_scale_2", "input_quantizer", "weight_quantizer" and the variables
new_weights, w3_key, w1_key, weights, name).
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 724-734: The MTP mask extension currently reads num_mtp from the
checkpoint config (`getattr(config, 'num_nextn_predict_layers', 0)`) which can
differ from the model spec; change it to read the MTP count from spec_config
(e.g., use `spec_config.num_nextn_predict_layers` or `getattr(spec_config,
'num_nextn_predict_layers', 0)`) so that the loop that appends to mtp_pattern,
and updates hybrid_layer_mask, mamba_layer_mask, num_layers, and
mamba_num_layers, uses the same MTP layer count the model actually uses.
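A minimal illustration of the suggested change: read the MTP depth from `spec_config` so the extended layer masks match the number of MTP layers the model actually runs with. The attribute name `num_nextn_predict_layers` and the helper itself follow the review comment and may differ from the real classes.

```python
def resolve_num_mtp(config, spec_config):
    # Prefer the model spec's MTP layer count over the checkpoint config,
    # since the two can disagree (the bug this comment describes).
    if spec_config is not None:
        return getattr(spec_config, "num_nextn_predict_layers", 0)
    return getattr(config, "num_nextn_predict_layers", 0)
```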
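The shared-scalar remapping described in the inline comment on `nemotron_h_weight_mapper.py` can be sketched as follows. The helper and the key names (`w1_key`, `w3_key`) are hypothetical stand-ins for the real MoE remap loop; the point it illustrates is that scalar scale/quantizer entries are shared across the fused gate/up projection, so one value can back both split keys.

```python
# Suffixes treated as shared scalars per the review comment; fp8 and nvfp4
# checkpoints store these as per-tensor values rather than per-channel ones.
SHARED_SCALAR_SUFFIXES = ("input_scale", "weight_scale_2",
                          "input_quantizer", "weight_quantizer")

def remap_shared_scalar(weights, name, w1_key, w3_key, new_weights):
    """Copy a shared scalar entry into both the w1 (gate) and w3 (up) slots.

    Returns True when `name` matched a shared-scalar suffix and was remapped.
    """
    if any(name.endswith(suffix) for suffix in SHARED_SCALAR_SUFFIXES):
        # Scalar applies to the whole fused projection, so duplicating the
        # same value into both split keys is safe.
        new_weights[w3_key] = weights[name]
        new_weights[w1_key] = weights[name]
        return True
    return False
```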
need to rebase on this PR #11425
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

Force-pushed from 503ddf9 to fd539a0
/bot run
PR_Github #36354 [ run ] triggered by Bot. Commit:
to consolidate if you can add the top line of this to nemotron_h (for fp4)
@IzzyPutterman doesn't the current impl already do that?
PR_Github #36354 [ run ] completed with state
/bot run

PR_Github #36367 [ run ] triggered by Bot. Commit:

PR_Github #36367 [ run ] completed with state
litaotju left a comment:
LGTM. Small focused fix, CI green.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill` : Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT` : Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.