Close Menu
    Facebook X (Twitter) Instagram
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Facebook X (Twitter) Instagram
    Tech Chain Daily
    • Home
    • Crypto News
      • Bitcoin
      • Ethereum
      • Altcoins
      • Blockchain
      • DeFi
    • AI News
    • Stock News
    • Learn
      • AI for Beginners
      • AI Tips
      • Make Money with AI
    • Reviews
    • Tools
      • Best AI Tools
      • Crypto Market Cap List
      • Stock Market Overview
      • Market Heatmap
    • Contact
    Tech Chain Daily
    Home»AI News»NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX
    NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX
    AI News

    NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

    May 10, 20267 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email
    changelly


    Step 01 of 09  ·  Prerequisites

    What You Need Before You Start

    cuda-oxide has specific version requirements for each dependency. Before installing anything, verify your system meets all of these. The project is currently Linux-only (tested on Ubuntu 24.04).

    Linux (Ubuntu 24.04)
    Rust nightly
    CUDA Toolkit 12.x+
    LLVM 21+
    Clang 21 / libclang-common-21-dev
    Git

    ⓘ Why LLVM 21?
    Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell — TMA, tcgen05, WGMMA — requires llc from LLVM 21 or later. This is a hard requirement, not a recommendation.

    synthesia

    Check your current CUDA version to confirm compatibility:

    nvcc –version

    Step 02 of 09  ·  Install Rust Nightly

    Set Up the Rust Nightly Toolchain

    cuda-oxide requires Rust nightly with two additional components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository — it will be installed automatically when you first run a build inside the repo.

    If you need to install it manually:

    # Install the pinned nightly toolchain
    rustup toolchain install nightly-2026-04-03

    # Add required components
    rustup component add rust-src rustc-dev \
    –toolchain nightly-2026-04-03

    # Confirm the toolchain is active
    rustup show

    ⓘ Why these components?
    rustc-dev exposes the internal compiler APIs that the custom codegen backend hooks into. rust-src is needed so the compiler can find and compile its own standard library sources for the device target.

    Step 03 of 09  ·  Install LLVM 21

    Install LLVM 21 with the NVPTX Backend

    The cuda-oxide pipeline emits textual LLVM IR (.ll files) and hands them to the external llc binary to produce PTX. You need LLVM 21 or later with the NVPTX backend enabled.

    # Ubuntu/Debian
    sudo apt install llvm-21

    # Verify the NVPTX backend is present
    llc-21 –version | grep nvptx

    The pipeline auto-discovers llc-22 and llc-21 on your PATH in that order. To pin a specific binary, set the environment variable:

    # Pin to a specific llc binary
    export CUDA_OXIDE_LLC=/usr/bin/llc-21

    ⚠ Common Failure
    If NVPTX does not appear in the output of llc-21 –version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro’s default packages, which may omit GPU backends.

    Step 04 of 09  ·  Install Clang

    Install Clang 21 for the cuda-bindings Crate

    The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen needs libclang — and specifically, it needs Clang’s own resource directory (which includes stddef.h). A bare libclang1-* runtime package is not enough.

    # Install the full clang-21 package (includes resource headers)
    sudo apt install clang-21

    # Alternatively, the -dev header package also works
    sudo apt install libclang-common-21-dev

    ⚠ Symptom of Missing Clang
    If you only install the runtime but not the headers, the host build will fail with a cryptic ‘stddef.h’ file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.

    Step 05 of 09  ·  Install cargo-oxide

    Clone the Repo and Install cargo-oxide

    cargo-oxide is a Cargo subcommand that drives the entire build pipeline — running cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.

    Inside the repo (for trying examples):

    git clone https://github.com/NVlabs/cuda-oxide.git
    cd cuda-oxide

    # cargo oxide works out of the box via a workspace alias
    cargo oxide run vecadd

    Outside the repo (for your own projects):

    # Install globally from the git source
    cargo install \
    –git https://github.com/NVlabs/cuda-oxide.git \
    cargo-oxide

    # On first run, cargo-oxide fetches and builds the codegen backend

    Then verify all prerequisites are in place with the built-in health check:

    cargo oxide doctor

    ⓘ What doctor checks
    It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and the codegen backend binary. Fix any red items before proceeding.

    Step 06 of 09  ·  Run Your First Kernel

    Build and Run the vecadd Example

    The canonical first example is vecadd — a vector addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.

    # Build and run end-to-end
    cargo oxide run vecadd

    If everything is configured correctly, you will see:

    ✓ SUCCESS: All 1024 elements correct!

    To see the full compilation pipeline — from Rust MIR through each Pliron dialect down to PTX — run:

    # Print the full Rust MIR — dialect-mir — mem2reg — dialect-llvm — LLVM IR — PTX trace
    cargo oxide pipeline vecadd

    To debug with cuda-gdb:

    cargo oxide debug vecadd –tui

    ⓘ Output artifacts
    A successful build produces two files: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host binary loads the PTX file via the CUDA driver at runtime.

    Step 07 of 09  ·  Write a Kernel

    Writing Your Own #[kernel] Function

    A kernel function is annotated with #[kernel]. Use DisjointSlice<T> for mutable outputs and &[T] for read-only inputs. Access the thread’s unique hardware index with thread::index_1d().

    use cuda_device::{kernel, thread, DisjointSlice};

    // Tier 1 safety: race-free by construction, no `unsafe` needed.
    // DisjointSlice::get_mut() only accepts a ThreadIndex —
    // a hardware-derived opaque type guaranteeing unique writes per thread.

    #[kernel]
    pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
    *elem = input[idx.get()] * factor;
    }
    }

    ⓘ Tier 1 Safety — how it works
    ThreadIndex is an opaque newtype around usize that can only be created from hardware built-in registers (threadIdx, blockIdx, blockDim). Since each thread gets a unique value, and DisjointSlice::get_mut() only accepts a ThreadIndex, writes are race-free by construction — no unsafe anywhere in the kernel.

    Step 08 of 09  ·  Launch from Host

    Launching the Kernel from Host Code

    Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and dispatch.

    use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
    use cuda_host::{cuda_launch, load_kernel_module};

    fn main() {
    // Initialize GPU context on device 0
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, “scale_example”).unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel — LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
    kernel: scale,
    stream: stream,
    module: module,
    config: LaunchConfig::for_num_elems(1024),
    args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Download result back to host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] – 2.5).abs() < 1e-5);
    println!(“✓ Kernel ran successfully!”);
    }

    ⓘ What cuda_launch! does
    It scalarizes the argument list — flattening slices, scalars, and captured closures — into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.

    Step 09 of 09  ·  Next Steps

    What to Explore Next

    You have a working cuda-oxide setup. Here are the high-value paths forward, ordered by complexity:

    • Generic kernels with monomorphization — try the generic example (cargo oxide run generic) to see how fn scale<T: Copy> compiles to separate PTX kernels per type.
    • Closures with captures — the host_closure example shows how a move |x: f32| x * factor closure is scalarized and passed as PTX kernel parameters automatically.
    • Async GPU execution — cuda_launch_async! returns a lazy DeviceOperation that executes on .sync() or .await. See the async_mlp and async_vecadd examples.
    • Shared memory and warp intrinsics — these require scoped unsafe blocks with documented safety contracts. See Tier 2 in the safety model documentation.
    • GEMM at Speed-of-Light — the gemm_sol example achieves 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
    • Blackwell tensor cores — the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.

    ⓘ Known Limitation in v0.1.0
    index_2d(stride) is documented as currently unsound — if threads in the same kernel use different stride values, two threads can get &mut T to the same element with no unsafe in sight. Until the fix lands (lifting stride into a type parameter), bind stride to a single let binding and reuse it at every call site.

    Full documentation: nvlabs.github.io/cuda-oxide  ·  Source: github.com/NVlabs/cuda-oxide



    Source link

    aistudios
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    CryptoExpert
    • Website

    Related Posts

    Pinterest cut AI costs 90% by gutting a frontier model's vision layer

    May 29, 2026

    NBA plans AI system for automatic out-of-bounds calls

    May 28, 2026

    Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

    May 27, 2026

    Building AI models that understand chemical principles | MIT News

    May 26, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    changelly
    Latest Posts

    Strategy Moves 411 BTC to Coinbase Prime as Polymarket Sell Odds Hit 84%

    May 29, 2026

    Memecoin Platform DxSale Drained for $7.3M Across 1,400 LPs

    May 29, 2026

    Stock Indexes Rally to Record Highs on Peace Deal Hopes and AI Spending

    May 29, 2026

    Pinterest cut AI costs 90% by gutting a frontier model's vision layer

    May 29, 2026

    If you’re trying to get rich with AI, you need to hear this…

    May 29, 2026
    kraken
    LEGAL INFORMATION
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Top Insights

    Bitcoin Retail Sentiment Still Matters, Says Swan Bitcoin CEO

    May 30, 2026

    Trezor Launches USDC, USDT Yield in Trezor Suite Through Morpho

    May 30, 2026
    changelly
    Facebook X (Twitter) Instagram Pinterest
    © 2026 TechChainDaily.com - All rights reserved.

    Type above and press Enter to search. Press Esc to cancel.