[None][fix] Nemotron H fp4 and MTP #11601
Conversation
📝 Walkthrough

These changes modify MoE weight remapping to handle quantizer weights, adjust the input source for latent projection in Nemotron routing, extend layer masking for MTP hybrid decoding during speculative decoding, and correct backend assertion logic in the Mamba cache manager.
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)
Line 1: ⚠️ Potential issue | 🟡 Minor — Missing NVIDIA copyright header.

The file has no copyright header. As per coding guidelines, all `.py` source files must include an NVIDIA copyright header with the Apache 2.0 license block and the year of latest meaningful modification.

📝 Add copyright header at top of file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import torch
```

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification… Include NVIDIA copyright header on ALL new files and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py` at line 1, add the NVIDIA copyright header (Apache-2.0 license block with the year of latest meaningful modification) to the very top of the file before any imports; ensure the full NVIDIA header text used across the repo is copied exactly and the year is updated appropriately, so the file starts with the standard copyright/license block followed by the existing `import torch` line.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
Line 1: ⚠️ Potential issue | 🟡 Minor — Update copyright year.

The file's copyright reads `2022-2024` but the file is being modified in this PR. Per coding guidelines, the year of latest meaningful modification must be reflected.

📝 Proposed fix:

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines: "update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` at line 1, update the SPDX copyright header at the top of the file: replace the current year range "2022-2024" in the SPDX copyright comment with the correct latest modification year (e.g., "2022-2025") so the header reflects the file's most recent change.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Lines 729-734: Precompute pattern counts outside the outer loop.

`mtp_pattern.count("*")` and `mtp_pattern.count("M")` are O(len(pattern)) string scans invoked on every iteration of the `range(num_mtp)` loop, making the work O(num_mtp × len(pattern)) when O(len(pattern)) suffices.

♻️ Proposed refactor:

```diff
+    mtp_attn_count = mtp_pattern.count("*")
+    mtp_mamba_count = mtp_pattern.count("M")
     for _ in range(num_mtp):
         for char in mtp_pattern:
             hybrid_layer_mask.append(char == "*")
             mamba_layer_mask.append(char == "M")
-        num_layers += mtp_pattern.count("*")
-        mamba_num_layers += mtp_pattern.count("M")
+        num_layers += mtp_attn_count
+        mamba_num_layers += mtp_mamba_count
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 729 - 734, Precompute mtp_pattern.count("*") and mtp_pattern.count("M") (and optionally the boolean list of characters for mtp_pattern) before the outer loop in the code that populates hybrid_layer_mask and mamba_layer_mask so you don't re-scan the string each iteration; then inside the for _ in range(num_mtp) loop append the precomputed booleans to hybrid_layer_mask and mamba_layer_mask and update num_layers and mamba_num_layers by adding the precomputed counts instead of calling mtp_pattern.count(...) repeatedly (refer to variables/functions mtp_pattern, num_mtp, hybrid_layer_mask, mamba_layer_mask, num_layers, mamba_num_layers).
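As a runnable sketch of the refactor this nitpick suggests: the counts are hoisted out of the loop and reused each iteration. Function and variable names follow the review comment, not necessarily the actual `_util.py` code, so treat this as an illustration rather than the real implementation.

```python
def extend_masks_for_mtp(mtp_pattern, num_mtp):
    """Extend hybrid/Mamba layer masks for num_mtp repetitions of mtp_pattern.

    '*' marks an attention (hybrid) layer, 'M' marks a Mamba layer, mirroring
    the pattern convention described in the review comment.
    """
    hybrid_layer_mask, mamba_layer_mask = [], []
    num_layers = mamba_num_layers = 0
    # Count once, outside the loop: O(len(pattern)) total instead of
    # O(num_mtp * len(pattern)) repeated string scans.
    mtp_attn_count = mtp_pattern.count("*")
    mtp_mamba_count = mtp_pattern.count("M")
    for _ in range(num_mtp):
        for char in mtp_pattern:
            hybrid_layer_mask.append(char == "*")
            mamba_layer_mask.append(char == "M")
        num_layers += mtp_attn_count
        mamba_num_layers += mtp_mamba_count
    return hybrid_layer_mask, mamba_layer_mask, num_layers, mamba_num_layers
```

The behavior is unchanged; only the redundant `str.count` calls inside the loop are removed.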
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`:
- Around line 126-129: Update the stale comment above the conditional that
checks for "input_scale", "weight_scale_2", "input_quantizer", and
"weight_quantizer" to explain why the quantizer keys are treated the same as the
scalar scales for fp8 and nvfp4 models; specifically, mention that
input_quantizer and weight_quantizer are scalar/shared parameters and therefore
can be copied into both new_weights[w3_key] and new_weights[w1_key] using
weights[name] (referencing the conditional that uses keys "input_scale",
"weight_scale_2", "input_quantizer", "weight_quantizer" and the variables
new_weights, w3_key, w1_key, weights, name).
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 724-734: The MTP mask extension currently reads num_mtp from the
checkpoint config (`getattr(config, 'num_nextn_predict_layers', 0)`) which can
differ from the model spec; change it to read the MTP count from spec_config
(e.g., use `spec_config.num_nextn_predict_layers` or `getattr(spec_config,
'num_nextn_predict_layers', 0)`) so that the loop that appends to mtp_pattern,
and updates hybrid_layer_mask, mamba_layer_mask, num_layers, and
mamba_num_layers, uses the same MTP layer count the model actually uses.
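A minimal illustration of the suggested change: read the MTP depth from `spec_config` so the extended layer masks match the number of MTP layers the model actually runs with. The attribute name `num_nextn_predict_layers` and the helper itself follow the review comment and may differ from the real classes.

```python
def resolve_num_mtp(config, spec_config):
    # Prefer the model spec's MTP layer count over the checkpoint config,
    # since the two can disagree (the bug this comment describes).
    if spec_config is not None:
        return getattr(spec_config, "num_nextn_predict_layers", 0)
    return getattr(config, "num_nextn_predict_layers", 0)
```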
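The shared-scalar remapping described in the inline comment on `nemotron_h_weight_mapper.py` can be sketched as follows. The helper and the key names (`w1_key`, `w3_key`) are hypothetical stand-ins for the real MoE remap loop; the point it illustrates is that scalar scale/quantizer entries are shared across the fused gate/up projection, so one value can back both split keys.

```python
# Suffixes treated as shared scalars per the review comment; fp8 and nvfp4
# checkpoints store these as per-tensor values rather than per-channel ones.
SHARED_SCALAR_SUFFIXES = ("input_scale", "weight_scale_2",
                          "input_quantizer", "weight_quantizer")

def remap_shared_scalar(weights, name, w1_key, w3_key, new_weights):
    """Copy a shared scalar entry into both the w1 (gate) and w3 (up) slots.

    Returns True when `name` matched a shared-scalar suffix and was remapped.
    """
    if any(name.endswith(suffix) for suffix in SHARED_SCALAR_SUFFIXES):
        # Scalar applies to the whole fused projection, so duplicating the
        # same value into both split keys is safe.
        new_weights[w3_key] = weights[name]
        new_weights[w1_key] = weights[name]
        return True
    return False
```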
need to rebase on this PR #11425
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

Force-pushed from 503ddf9 to fd539a0
/bot run
PR_Github #36354 [ run ] triggered by Bot. Commit:
to consolidate if you can add the top line of this to nemotron_h (for fp4)
@IzzyPutterman doesn't the current impl already do that?
PR_Github #36354 [ run ] completed with state
/bot run

PR_Github #36367 [ run ] triggered by Bot. Commit:

PR_Github #36367 [ run ] completed with state
litaotju left a comment:
LGTM. Small focused fix, CI green.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill` : Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT` : Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.