[None][fix] Nemotron H fp4 and MTP#11601

Merged
tijyojwad merged 3 commits into NVIDIA:main from NVShreyas:user/shreyasm/nemotron-h-fix
Feb 23, 2026
Conversation

@NVShreyas (Collaborator) commented Feb 20, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed cache manager assertion logic for proper backend compatibility.
  • Improvements

    • Enhanced NemotronH MoE weight remapping to handle quantizer-related weights.
    • Modified latent projection input source in MoE routing.
    • Extended layer mask support for MTP/hybrid decoding during speculative decoding.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai bot (Contributor) commented Feb 20, 2026

📝 Walkthrough

These changes modify MoE weight remapping to handle quantizer weights, adjust the input source for latent projection in Nemotron routing, extend layer masking for MTP hybrid decoding during speculative decoding, and correct backend assertion logic in Mamba cache manager.

Changes

Cohort / File(s) Summary
Nemotron MoE Weight Handling
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py, tensorrt_llm/_torch/models/modeling_nemotron_h.py
Updated MoE weight remapping to include quantizer-related weights alongside scale weights for dual w1/w3 mapping; shifted latent projection input from routed_hidden_states to hidden_states_hp in MoERoute path.
Speculative Decoding & Cache Management
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
Extended layer mask handling to include MTP layers when speculative decoding is enabled; inverted update_mamba_states assertion to allow Python implementation path and disallow C++ path.
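
The layer-mask extension described above can be sketched roughly as follows. The function name, signature, and the pattern convention ('*' for attention/hybrid layers, 'M' for Mamba layers) are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical sketch of extending per-layer masks with MTP layers when
# speculative decoding is enabled. '*' marks attention/hybrid layers and
# 'M' marks Mamba layers; these conventions are illustrative only.
def extend_layer_masks(hybrid_layer_mask, mamba_layer_mask, mtp_pattern, num_mtp):
    num_layers = 0
    mamba_num_layers = 0
    # Count each layer type once, instead of re-scanning the pattern per MTP module.
    attn_count = mtp_pattern.count("*")
    mamba_count = mtp_pattern.count("M")
    for _ in range(num_mtp):
        for char in mtp_pattern:
            hybrid_layer_mask.append(char == "*")
            mamba_layer_mask.append(char == "M")
        num_layers += attn_count
        mamba_num_layers += mamba_count
    return num_layers, mamba_num_layers
```

Appending per-MTP-module copies of the base pattern keeps the masks aligned with the extra layers that the speculative-decoding path instantiates.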

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: The PR description is entirely empty; only the template structure remains, with no explanation of the issue, solution, test coverage, or checklist items. Resolution: provide a detailed description explaining what issue is being fixed, how the changes address it, and what tests validate the fix, and confirm the checklist items are complete.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title is directly related to the main changes in the changeset, which address Nemotron H model fixes for fp4 (quantization) and MTP (multi-token prediction) functionality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

The file has no copyright header. As per coding guidelines, all .py source files must include an NVIDIA copyright header with the Apache 2.0 license block and the year of latest meaningful modification.

📝 Add copyright header at top of file
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import torch

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification… Include NVIDIA copyright header on ALL new files and update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py` at
line 1, Add the NVIDIA copyright header (Apache-2.0 license block with the year
of latest meaningful modification) to the very top of the file
nemotron_h_weight_mapper.py before any imports; ensure the full NVIDIA header
text used across the repo is copied exactly and the year is updated
appropriately so the file starts with the standard copyright/license block
followed by the existing "import torch" line.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year.

The file's copyright reads 2022-2024 but is being modified in this PR. Per coding guidelines, the year of latest meaningful modification must be reflected.

📝 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines: "update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` at line 1, Update the SPDX
copyright header at the top of
tensorrt_llm/_torch/models/modeling_nemotron_h.py: replace the current year
range "2022-2024" in the SPDX/ copyright comment with the correct latest
modification year (e.g., "2022-2026") so the header reflects the file's most
recent change.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

729-734: Precompute pattern counts outside the outer loop.

mtp_pattern.count("*") and mtp_pattern.count("M") are O(len(pattern)) string scans invoked on every iteration of the range(num_mtp) loop, making the work O(num_mtp × len(pattern)) when O(len(pattern)) suffices.

♻️ Proposed refactor
+            mtp_attn_count = mtp_pattern.count("*")
+            mtp_mamba_count = mtp_pattern.count("M")
             for _ in range(num_mtp):
                 for char in mtp_pattern:
                     hybrid_layer_mask.append(char == "*")
                     mamba_layer_mask.append(char == "M")
-                num_layers += mtp_pattern.count("*")
-                mamba_num_layers += mtp_pattern.count("M")
+                num_layers += mtp_attn_count
+                mamba_num_layers += mtp_mamba_count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 729 - 734, Precompute
mtp_pattern.count("*") and mtp_pattern.count("M") (and optionally the boolean
list of characters for mtp_pattern) before the outer loop in the code that
populates hybrid_layer_mask and mamba_layer_mask so you don't re-scan the string
each iteration; then inside the for _ in range(num_mtp) loop append the
precomputed booleans to hybrid_layer_mask and mamba_layer_mask and update
num_layers and mamba_num_layers by adding the precomputed counts instead of
calling mtp_pattern.count(...) repeatedly (refer to variables/functions
mtp_pattern, num_mtp, hybrid_layer_mask, mamba_layer_mask, num_layers,
mamba_num_layers).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`:
- Around line 126-129: Update the stale comment above the conditional that
checks for "input_scale", "weight_scale_2", "input_quantizer", and
"weight_quantizer" to explain why the quantizer keys are treated the same as the
scalar scales for fp8 and nvfp4 models; specifically, mention that
input_quantizer and weight_quantizer are scalar/shared parameters and therefore
can be copied into both new_weights[w3_key] and new_weights[w1_key] using
weights[name] (referencing the conditional that uses keys "input_scale",
"weight_scale_2", "input_quantizer", "weight_quantizer" and the variables
new_weights, w3_key, w1_key, weights, name).
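
A minimal sketch of the dual w1/w3 copy for scalar quantizer parameters might look like the following. The suffix list, key names, and the up_proj-to-w1/w3 rename scheme are assumptions for illustration, not the mapper's actual code:

```python
# Hypothetical sketch: scalar/shared quantizer parameters (like the scale
# values) are copied into both the w1 and w3 destinations. Key names and
# the up_proj -> w1/w3 rename are illustrative, not the exact repo logic.
SCALAR_SUFFIXES = ("input_scale", "weight_scale_2",
                   "input_quantizer", "weight_quantizer")

def remap_scalar_quantizer_weights(name, weights, new_weights):
    if not name.endswith(SCALAR_SUFFIXES):
        return False
    w3_key = name.replace("up_proj", "w3")
    w1_key = name.replace("up_proj", "w1")
    # Scalar parameters are shared, so the same value can back both entries.
    new_weights[w3_key] = weights[name]
    new_weights[w1_key] = weights[name]
    return True
```

Because the quantizer entries are scalar or shared parameters, duplicating them into both destinations is safe, which is the rationale the review comment asks to document.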

In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 724-734: The MTP mask extension currently reads num_mtp from the
checkpoint config (`getattr(config, 'num_nextn_predict_layers', 0)`) which can
differ from the model spec; change it to read the MTP count from spec_config
(e.g., use `spec_config.num_nextn_predict_layers` or `getattr(spec_config,
'num_nextn_predict_layers', 0)`) so that the loop that appends to mtp_pattern,
and updates hybrid_layer_mask, mamba_layer_mask, num_layers, and
mamba_num_layers, uses the same MTP layer count the model actually uses.


@NVShreyas (Collaborator, Author): need to rebase on this PR #11425

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas force-pushed the user/shreyasm/nemotron-h-fix branch from 503ddf9 to fd539a0 on February 20, 2026 at 19:48
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):
PR_Github #36354 [ run ] triggered by Bot. Commit: 81b1b4a Link to invocation

@IzzyPutterman (Collaborator):
To consolidate, could you add the top line of this to nemotron_h (for fp4)?

routed_hidden_states = hidden_states_hp if self.use_latent_moe else hidden_states
if self.use_latent_moe:
    routed_hidden_states = self.fc1_latent_proj(routed_hidden_states)

@NVShreyas (Collaborator, Author):
to consolidate if you can add the top line of this to nemotron_h (for fp4)

routed_hidden_states = hidden_states_hp if self.use_latent_moe else hidden_states
if self.use_latent_moe:
    routed_hidden_states = self.fc1_latent_proj(routed_hidden_states)

@IzzyPutterman doesn't the current impl already do that?

if isinstance(hidden_states, tuple):
    hidden_states, hidden_states_hp = hidden_states
else:
    hidden_states_hp = hidden_states

routed_hidden_states = self.fc1_latent_proj(
    hidden_states_hp) if self.use_latent_moe else hidden_states

@tensorrt-cicd (Collaborator):
PR_Github #36354 [ run ] completed with state SUCCESS. Commit: 81b1b4a
/LLM/main/L0_MergeRequest_PR pipeline #28121 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):
PR_Github #36367 [ run ] triggered by Bot. Commit: 9061976 Link to invocation

@tensorrt-cicd (Collaborator):
PR_Github #36367 [ run ] completed with state SUCCESS. Commit: 9061976
/LLM/main/L0_MergeRequest_PR pipeline #28131 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Link to invocation

@litaotju (Collaborator) left a comment:

LGTM. Small focused fix, CI green.

@tijyojwad tijyojwad merged commit c53b8fc into NVIDIA:main Feb 23, 2026
5 checks passed