
About the Provider

NVIDIA is a global leader in AI computing and accelerated hardware, known for its GPUs and enterprise AI platforms. Through its NeMo and research initiatives, NVIDIA develops advanced models to enable reasoning, tool orchestration, and scalable AI workflows for developers and enterprises.

Model Quickstart

This section helps you quickly get started with the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 model on the Qubrid AI inference platform. To use this model, you need:
  • A valid Qubrid API key
  • Access to the Qubrid inference API
  • Basic knowledge of making API requests in your preferred language
Once authenticated with your API key, you can send inference requests to the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 model and receive responses based on your input prompts. The example below shows how to access the model from Python using the OpenAI-compatible client; adapt it to whichever environment best fits your workflow.
from openai import OpenAI

# Initialize the OpenAI client with Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",  # replace with your actual Qubrid API key
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    max_tokens=16000,
    temperature=1,
    top_p=0.95,
    stream=True
)

# Stream the response as it arrives (stream=True above)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# If you set stream=False above, replace the loop with:
# print(stream.choices[0].message.content)
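If you prefer not to use the OpenAI client, the same request can be issued over raw HTTP. This is a minimal sketch assuming the endpoint follows the standard OpenAI-compatible chat completions schema at `/v1/chat/completions`; confirm the exact path against the Qubrid API documentation. The network call is left commented out.

```python
import json
import urllib.request

API_KEY = "QUBRID_API_KEY"  # replace with your actual Qubrid API key


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming chat completion request for the Qubrid API."""
    payload = {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16000,
        "temperature": 1,
        "top_p": 0.95,
        "stream": False,
    }
    return urllib.request.Request(
        "https://platform.qubrid.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("Explain quantum computing in simple terms")
# Uncomment to send the request with a valid API key:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```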

Model Overview

NVIDIA Nemotron-3-Super-120B-A12B is an open-weight LLM built for agentic reasoning and high-volume workloads.
  • Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B.
  • With a native 1M-token context window and configurable thinking mode, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages.

Model at a Glance

| Feature | Details |
| --- | --- |
| Model ID | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Provider | NVIDIA |
| Architecture | LatentMoE: Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP); 512 experts, 22 active per token |
| Model Size | 120B params (12B active) |
| Context Length | 256K tokens (up to 1M) |
| Release Date | March 11, 2026 |
| License | NVIDIA Nemotron Open Model License |
| Training Data | 25T token corpus (NVFP4 native pretraining): web, code, math, science, multilingual; post-training cutoff February 2026 |

When to use?

You should consider using Nemotron-3-Super-120B-A12B if:
  • You need agentic workflows and multi-agent collaboration
  • Your application requires long-context reasoning up to 1M tokens
  • You are building IT ticket automation and high-volume enterprise workloads
  • Your use case involves complex tool use and multi-step function calling
  • You need RAG (Retrieval-Augmented Generation) pipelines
  • Your workflow involves software engineering and cybersecurity triaging
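For the tool-use and multi-step function-calling cases above, a single round trip looks like the sketch below. It assumes the endpoint supports the standard OpenAI-compatible `tools` parameter; the `get_ticket_status` tool is purely hypothetical, standing in for your own backend.

```python
import json


# Hypothetical tool for illustration: look up the status of an IT ticket.
def get_ticket_status(ticket_id: str) -> dict:
    # A real implementation would query your ticketing backend here.
    return {"ticket_id": ticket_id, "status": "resolved"}


# Tool schema advertised to the model (OpenAI-compatible format).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_ticket_status",
            "description": "Look up the current status of an IT ticket.",
            "parameters": {
                "type": "object",
                "properties": {"ticket_id": {"type": "string"}},
                "required": ["ticket_id"],
            },
        },
    }
]


def run_tool_call(tool_call: dict) -> dict:
    """Dispatch one model-issued tool call and format the tool-role reply."""
    args = json.loads(tool_call["function"]["arguments"])
    result = get_ticket_status(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


# The full loop (network calls omitted):
# 1. client.chat.completions.create(model=..., messages=msgs, tools=TOOLS)
# 2. For each tool_call in the assistant message, append run_tool_call(...)
#    to the message list.
# 3. Call the API again with the extended messages to get the final answer.
```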

Inference Parameters

| Parameter Name | Type | Default | Description |
| --- | --- | --- | --- |
| Streaming | boolean | true | Enable streaming responses for real-time output. |
| Temperature | number | 1 | Controls randomness in output. Recommended: 1.0 for all tasks. |
| Max Tokens | number | 16000 | Maximum number of tokens to generate. |
| Top P | number | 0.95 | Controls nucleus sampling. Recommended: 0.95 for all tasks. |
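The defaults in this table can be collected once and reused across requests. This is a small convenience sketch of our own (the helper name is not part of any API), shown with the OpenAI client from the quickstart.

```python
def recommended_params(**overrides) -> dict:
    """Default sampling parameters from the table above; override per call."""
    params = {
        "max_tokens": 16000,   # maximum number of tokens to generate
        "temperature": 1.0,    # recommended for all tasks
        "top_p": 0.95,         # nucleus sampling, recommended for all tasks
        "stream": True,        # stream responses for real-time output
    }
    params.update(overrides)
    return params


# Usage with the OpenAI client shown in the quickstart:
# client.chat.completions.create(
#     model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
#     messages=[{"role": "user", "content": "..."}],
#     **recommended_params(stream=False),
# )
```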

Key Features

  • LatentMoE Architecture: 512 experts with 22 active per token — same compute cost as standard MoE with higher capacity.
  • 2.2x Throughput vs GPT-OSS-120B: Delivers 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B.
  • Native 1M Token Context: 91.75% on RULER @ 1M vs GPT-OSS-120B’s 22.30% — purpose-built for long-horizon reasoning.
  • MTP Speculative Decoding: 3.45 avg acceptance length enabling up to 3x wall-clock speedup.
  • Configurable Thinking Mode: Enable or disable reasoning via enable_thinking=True/False in the chat template.
  • Strong Benchmark Results: 60.47% SWE-Bench Verified, 83.73% MMLU-Pro, and 79.23% GPQA across reasoning and software engineering benchmarks.
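One way the `enable_thinking` toggle might be passed through the OpenAI client is via `extra_body` with `chat_template_kwargs`, a pattern used by vLLM-style OpenAI-compatible servers. This is an assumption about the serving stack, not a documented Qubrid parameter; verify against the platform docs before relying on it.

```python
def thinking_extra_body(enable: bool) -> dict:
    """Build an extra_body payload that toggles the model's thinking mode.

    Assumes the serving stack forwards chat_template_kwargs to the chat
    template, as vLLM-style OpenAI-compatible servers commonly do.
    """
    return {"chat_template_kwargs": {"enable_thinking": enable}}


# Usage with the OpenAI client from the quickstart:
# client.chat.completions.create(
#     model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
#     messages=[{"role": "user", "content": "Plan a rollout in 3 steps"}],
#     extra_body=thinking_extra_body(False),  # disable reasoning traces
# )
```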

Summary

NVIDIA Nemotron-3-Super-120B-A12B is NVIDIA’s open-weight agentic reasoning model built for high-throughput enterprise workloads.
  • It uses a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with 512 experts, 22 active per token, pretrained on 25T tokens.
  • It delivers 2.2x throughput over GPT-OSS-120B, 91.75% on RULER @ 1M context, and 60.47% on SWE-Bench Verified.
  • The model supports a native 1M token context window, configurable thinking mode, and MTP speculative decoding for up to 3x wall-clock speedup.
  • Licensed under the NVIDIA Nemotron Open Model License.