Neural networks news
Intel NN News
Confidential AI with GPU Acceleration: Bounce Buffers Offer a Solution Today
by Mike Ferron-Jones (Intel) and Dan Middleton (NVIDIA)
As AI workloads increasingly process sensitive and regulated data, enterprises face a growing challenge: how to combine the performance of GPU acceleration with strong confidentiality guarantees. Confidential AI aims to meet this need by protecting data actively in use, not just at rest or in transit. While Intel® Xeon® CPUs and NVIDIA GPUs both now support Trusted Execution Environments (TEEs), securely connecting these isolated domains was a critical architectural hurdle. Addressing that challenge is where the “bounce buffer” architecture comes into play.
Why GPU-Accelerated Confidential AI Matters
Many modern AI use cases, including healthcare analytics, financial modeling, and personalized recommendation systems, depend on highly sensitive inputs and proprietary models, a trend that will go into overdrive with agentic AI. AI workloads often require GPUs to meet performance requirements for training and inference, but traditional GPU passthrough across PCIe exposes data to system software and firmware outside the trusted boundary. This creates an inherent trust and privacy problem: organizations need assurance that data, model weights, and intermediate results remain confidential and unaltered throughout execution, even in shared or cloud environments.
The Trust Gap Between CPU and GPU TEEs
Both Intel and NVIDIA provide TEEs—Intel® Trust Domain Extensions (Intel® TDX) for CPUs and NVIDIA Confidential Computing modes for GPUs. However, data must still traverse the PCIe interconnect between these two domains. Without additional protection, DMA operations or other transfers could expose plaintext data on an unencrypted channel. The challenge is not a lack of TEEs but connecting them securely, without breaking confidentiality or incurring unacceptable performance degradation.
What Is a Bounce Buffer?
A bounce buffer is an intermediary memory region used to securely stage data transfers between CPU and GPU TEEs. In the NVIDIA Confidential Computing deployment architecture, GPU DMA operations are redirected through a host-managed, encrypted bounce buffer. Data is decrypted only inside the CPU TEE, processed, and then re-encrypted before being staged in the bounce buffer for GPU consumption. This approach ensures that neither the hypervisor nor the device path ever sees plaintext data.
Figure 1. Visualization of CPU and GPU TEE with encrypted bounce buffer.
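To make the data flow concrete, here is a minimal conceptual sketch in Python of the staging pattern described above. It assumes an AES-GCM session key already negotiated between the two TEEs; the function names and key handling are illustrative only and do not reflect the NVIDIA driver's actual encrypted-DMA implementation.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

session_key = AESGCM.generate_key(bit_length=256)  # stands in for the TEE-negotiated key
aead = AESGCM(session_key)

def stage_for_gpu(plaintext: bytes) -> bytes:
    """Encrypt inside the CPU TEE before placing data in the shared bounce buffer."""
    nonce = os.urandom(12)
    return nonce + aead.encrypt(nonce, plaintext, None)

def read_in_gpu_tee(bounce_buffer: bytes) -> bytes:
    """Decrypt only after the data has crossed PCIe into the GPU TEE."""
    nonce, ciphertext = bounce_buffer[:12], bounce_buffer[12:]
    return aead.decrypt(nonce, ciphertext, None)

# The hypervisor and the PCIe path only ever observe `staged`, which is ciphertext.
staged = stage_for_gpu(b"model weights / activations")
assert read_in_gpu_tee(staged) == b"model weights / activations"
```

The key point the sketch captures is that plaintext exists only inside a TEE; everything that crosses the untrusted host and interconnect is encrypted and integrity-protected.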
Reference Architecture and Implementation
Intel and NVIDIA collaborated closely on solution engineering and validation of the bounce buffer architecture, working with Canonical to enable a production-ready software stack. The reference implementation combines Intel TDX-enabled Xeon platforms; NVIDIA H100 and H200 GPUs, along with NVIDIA Blackwell B200 and B300 GPUs, operating in Confidential Computing modes; and an Ubuntu Linux virtualization stack capable of enforcing memory isolation and encrypted DMA paths over PCIe. The reference architecture and deployment guide are publicly available today here.
Solution Ingredients
The reference architecture hardware uses 5th Gen Intel Xeon Scalable CPUs (code-named "Emerald Rapids") with NVIDIA Hopper, NVIDIA Blackwell, and the RTX PRO Server GPU family of offerings. The host OS and virtualization layer are provided by Ubuntu 25.10, and the guest OS is Ubuntu 24.04 LTS. This stack enables the establishment of TEEs on both the CPU and multiple GPUs, as well as OS support to manage bounce buffer mappings.
While the bounce buffer introduces additional copy and encryption steps, observed performance remains suitable for real-world AI inference scenarios, especially when weighed against the security, privacy, and compliance benefits provided.
Remote attestation is a critical part of Confidential Computing, providing cryptographic assurance and verification that the CPU and GPU TEEs launched correctly and are running as expected. In addition to bounce buffers, Intel and NVIDIA worked together to synchronize CPU and GPU attestation through Intel Trust Authority, enabling customers to receive attestations via a single service rather than using separate services.
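As a rough illustration of the single-service flow, the sketch below gathers evidence from both TEEs and submits it to one verification endpoint. Every helper, URL, and claim name here is a placeholder assumption; Intel Trust Authority's actual client SDK, endpoints, and token schema differ and should be taken from its documentation.

```python
import requests
import jwt  # PyJWT

def collect_tdx_quote() -> bytes:
    """Placeholder for fetching the Intel TDX quote from inside the CPU TEE."""
    return b"tdx-quote-bytes"

def collect_gpu_evidence() -> bytes:
    """Placeholder for fetching the GPU's Confidential Computing attestation evidence."""
    return b"gpu-evidence-bytes"

# Submit both pieces of evidence to a single (placeholder) verification service.
resp = requests.post(
    "https://attestation.example.com/appraise",
    json={
        "tdx_quote": collect_tdx_quote().hex(),
        "gpu_evidence": collect_gpu_evidence().hex(),
    },
    timeout=30,
)
token = resp.json()["attestation_token"]

# In production, verify the token signature against the service's published keys;
# the claim names below are hypothetical.
claims = jwt.decode(token, options={"verify_signature": False})
assert claims.get("cpu_tee_verified") and claims.get("gpu_tee_verified")
```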
The Road Ahead: TEE-IO and Intel TDX Connect
To address the remaining architectural gaps, there is a broader industry push to secure data in use through open, interoperable confidential computing primitives rather than siloed, vendor-specific solutions. In that spirit, the solution aligns with the community models emerging in the Confidential Computing Consortium, where hardware vendors, cloud providers, and software developers collaborate on common TEE building blocks and deployment patterns.
Bounce buffers provide a practical solution today; the industry is moving toward standards-based TEE-IO, where the CPU and attached devices can effectively establish a single logical TEE, with faster direct memory access and end-to-end encrypted communications. Intel TDX Connect is Intel's framework for securely binding CPU and device TEEs with hardware-level PCIe link encryption, reducing overhead and improving efficiency. NVIDIA Accelerated Confidential Computing, along with Intel Xeon 6 processors (code-named "Granite Rapids"), is already architecturally prepared for Intel TDX Connect adoption as the ecosystem software matures.
Production Ready Today
The bounce buffer architecture is not theoretical. Confidential AI solutions using this technology are already in production at major cloud service providers, including Alibaba, ByteDance, Google, and Oracle, with additional providers expected to follow. Customers can also work with their preferred Linux distribution vendors to deploy select inference workloads on-premises. These deployments demonstrate that Confidential Computing and GPU acceleration can coexist at scale. We invite anyone interested to take them out for a test drive today.
Resources and Further Reading
NVIDIA Deployment Guide for Secure AI
Intel Confidential Computing Homepage
NVIDIA Confidential Computing Homepage
Intel TDX Connect Architectural Specification
Intel NVIDIA Seamless Attestation Whitepaper
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No product or component can be absolutely secure.
Edge AI
AI That Moves the World Starts at the Edge
Edge AI for Smart Cities
Cities That Sense, Decide, and Respond as One: Edge AI Turns Urban Infrastructure into Autonomous Systems
From Factory Precision to Intelligent Robots: How Real-World Performance Shapes Industrial Edge AI
AI often looks impressive in demos. We’ve all seen them—robots performing miraculous feats of dexterity, thanks largely to the human with a gaming controller behind the stage.
To be fair, a proof of concept often requires this. However, it's an important reminder that what looks good on stage doesn't automatically translate to the real world. And not because the technology isn't fantastic. Rather, it's because the edge is so complex.
The real question is whether a PoC works within real operational constraints and delivers tangible outcomes safely, consistently, and efficiently. That’s the true readiness indicator for scale. And that’s the lens through which Intel approaches industrial edge AI: not what’s technically possible in a lab, but what’s production-ready on a factory floor, inside a robotic arm, or aboard an autonomous mobile robot navigating a warehouse.
At Embedded World 2026, we’re expanding Intel’s edge portfolio with silicon built for exactly this reality—processors optimized for precise machine control, and processors purpose-built for autonomous, intelligent systems. The right compute for the right workload is not a compromise. It’s how production-scale industrial AI actually gets built.
Manufacturing Precision and Autonomy Depend on Real-World Performance
A robotic arm on an assembly line handles different tasks across the production sequence: picking, placing, inspecting, packaging. Every motion must stay perfectly synchronized with every other arm on the line. If one moves a fraction of a millisecond faster and another a fraction slower, the result is misalignment, defects, and wasted product. In manufacturing, precision is everything. The compute that drives these systems needs to be fast, but more importantly, it needs to be consistent.
Precision, Real-World Performance
Intel® Core™ Series 2 processors are purpose-built for this requirement. The all-P-core architecture delivers higher performance compared to prior generations, with 10-year support and options for environmental hardening. But the real differentiator isn’t raw throughput; it’s determinism.
Intel® Time Coordinated Computing (Intel® TCC) and Time-Sensitive Networking (TSN) enable the precise timing and predictable execution that are essential for industrial control, machine vision, and automation. In benchmark comparisons against AMD's 9700X at equivalent power levels, Core Series 2 delivers more deterministic scheduling behavior, more predictable performance under load, and lower maximum PCIe latency.
The all-P-core architecture also simplifies CPU management and scheduling, reducing development complexity and maintenance overhead. For industrial automation engineers, this matters in a specific way: every core behaves the same way, every time. Developers don’t have to account for heterogeneous core behavior when writing real-time control logic. For systems that need to run identically for years, that predictability is not a nice-to-have. It’s the foundation.
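As a simple illustration of why uniform core behavior matters, here is a minimal sketch, assuming a Linux host with a core reserved for the control task (e.g., via isolcpus), that pins a fixed-period control loop to one core with FIFO scheduling. Intel TCC and TSN tuning happens in platform firmware and tooling and is not shown.

```python
import os
import time

CONTROL_CORE = 3            # hypothetical core reserved for the control loop
PERIOD_NS = 1_000_000       # 1 ms control period

os.sched_setaffinity(0, {CONTROL_CORE})                       # stay on one dedicated core
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(80))   # real-time priority (requires root)

next_deadline = time.monotonic_ns()
while True:
    # read sensors, compute the next actuator command, write outputs ...
    next_deadline += PERIOD_NS
    sleep_s = (next_deadline - time.monotonic_ns()) / 1e9
    if sleep_s > 0:
        time.sleep(sleep_s)
```

On an all-P-core part, this loop sees the same core behavior wherever it lands, which is exactly the property the control logic is relying on.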
The proof of any industrial platform is what customers do with it:
Neurocle, a vision AI solution provider, is delivering faster, more responsive defect detection on manufacturing lines. Their system identifies issues earlier and keeps operations flowing smoothly—a direct result of the consistent, low-latency inference that Core Series 2 enables.
In warehouse automation, XYZ Robotics is improving overall productivity by reducing compute-related delays, shortening planning cycles, and minimizing idle time. The result is smoother operation, fewer late waves, and faster payback on automation investments.
Codesys, a leader in industrial control software, is helping customers consolidate more virtual PLCs onto fewer systems, enabling more compact, cost-efficient designs and simpler infrastructure.
These are not proofs of concept. They are production deployments running on Intel silicon, delivering measurable outcomes that justify continued investment. And they point to something important: when the control workload is well-defined and the performance requirements are deterministic, a processor optimized specifically for that job outperforms a generalist one. Core Series 2 is that processor.
What Happens When Robots Need to Think
The deployments above represent industrial AI at scale for control-dominant workloads. But a different class of deployment is emerging—one where the robot doesn’t just execute a sequence, but observes, reasons, and adapts. And that requires a fundamentally different approach to compute.
Traditional computer vision models for factory robots were small, typically under 50 million parameters, and focused on narrow tasks: is the part present, is the weld aligned, is the worker wearing a hard hat. These models worked well within tight constraints but broke when conditions changed. If the safety gear changed color or the packaging was redesigned, the model stopped recognizing what it was seeing.
Vision Language Models (VLMs) and Vision Language Action Models (VLAs) change this equation. These transformer-based architectures, ranging from 500 million to 5 billion parameters and larger, combine computer vision with generative AI to understand context, not just detect objects.
A VLM-equipped robot recognizes that a hard hat is still safety gear even when the color or design changes. A VLA model goes further: it can observe a human performing a task, learn the sequence, and execute it autonomously. This is imitation learning, and it’s the core capability driving humanoid robotics forward.
Running these models alongside real-time control requires simultaneous execution of workloads with very different timing and compute profiles. Vision inference, LLM-based reasoning, and sub-millisecond motor control cannot compete for the same resources without compromising the integrity of at least one of them. The architecture has to support them concurrently and independently.
What Real-World Robotics Deployments Are Teaching Us
The early wave of advanced robotics deployments—the ones pushing into humanoid robots and agentic AI—were built on multi-subsystem architectures. A dedicated processor for real-time controls, a separate one for AI inference. That approach made sense at the time: it allowed developers to target each function with purpose-built hardware and get the first applications to market.
But real-world deployment experience is revealing the limits of that path. Two processors mean two boards, two software stacks, separate thermal management, and compounded integration risk. Every additional component adds cost, adds failure points, and adds friction between the prototype stage and production at scale. The developers who have lived through that complexity are the ones now asking whether there is a more optimized architecture—one that preserves the functional separation between control and AI inference without requiring separate silicon to achieve it.
That is the problem Core Ultra Series 3 is built to solve.
Precision with Integrated Acceleration
Intel® Core™ Ultra Series 3 is the first Intel processor to combine AI acceleration and real-time control in a single SoC. It brings nearly 180 TOPS of integrated AI acceleration, the ability to operate in rugged environments, and a low power envelope that fits existing industrial form factors—alongside the same Intel® TCC, discrete TSN, Functional Safety (FuSa) readiness, and In-Band ECC memory support that industrial and mission-critical applications require.
The key architectural insight is that integration does not mean consolidation of resources. Core Ultra Series 3’s CPU, GPU, and dedicated NPU run independently on isolated silicon. Vision runs on the NPU. LLM-based reasoning runs on the GPU. Real-time control runs on the CPU. They execute concurrently without competing for resources—which is exactly what the two-processor architecture was trying to achieve, without the hardware complexity.
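A minimal OpenVINO sketch of that separation might look like the following; the model file names are hypothetical, and a real pipeline would add pre/post-processing, asynchronous scheduling, and (for an LLM) the OpenVINO GenAI runtime rather than a bare compiled model.

```python
import openvino as ov

core = ov.Core()

# Pin each workload to the engine it is meant to own (model paths are hypothetical).
vision = core.compile_model("defect_detector.xml", "NPU")      # perception on the NPU
reasoning = core.compile_model("planner_llm.xml", "GPU")       # language/VLM reasoning on the GPU
estimator = core.compile_model("state_estimator.xml", "CPU")   # control-adjacent logic on CPU cores

# Separate infer requests let the three engines run concurrently
# instead of contending for a single accelerator.
vision_req = vision.create_infer_request()
reasoning_req = reasoning.create_infer_request()
```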
Independent benchmarks by Circulus, a robotics partner, demonstrated this in practice. When running concurrent vision, LLM reasoning, and speech synthesis workloads, Core Ultra Series 3’s dedicated NPU maintained vision performance with only a 17 percent drop under full cognitive load, while competitive GPU-shared architectures showed a 56 percent drop.
For a humanoid robot working alongside humans on a factory floor, that difference determines whether the robot detects a falling object in time to react. The platform's deterministic perception, independent of cognitive load, makes it fundamentally easier to certify for personal care and industrial robots under the relevant ISO safety requirements.
The Economics of Convergence
The TCO case follows directly from the architectural one. Customers who have moved from multi-processor to single-SoC deployments on Core Ultra Series 3 have achieved 39 to 67 percent TCO savings compared to higher-cost, higher-power alternatives. For on-device fine-tuning—one of the most intensive AI workloads typically reserved for expensive discrete GPUs—Core Ultra Series 3 achieved 87 percent of the performance of a discrete solution at 5.8x the savings.
That is the kind of economics that determines whether a robotics deployment scales from 10 units to 10,000.
Circulus is already seeing the results in practice: smoother motion, better scene understanding, and more natural interactions from humanoid robots running on Core Ultra Series 3. The improvement isn’t attributable to any single benchmark advantage. It’s the result of running perception, reasoning, and control on one tightly integrated platform—without the coordination overhead that separate subsystems inevitably introduce.
Open Software and AI Suites Compress the Development Cycle
Hardware alone doesn’t solve the deployment gap. Intel’s Manufacturing AI Suite and Robotics AI Suite provide the software tools, sample applications, and benchmarked reference implementations that industrial developers need to move from concept to production.
The Manufacturing AI Suite covers predictive maintenance, process optimization, anomaly detection, quality inspection, worker safety, and vision-guided robotics—all built on modular, open-source components with IoT protocol support for MQTT and OPC UA.
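For example, a quality-inspection result can be pushed to the rest of the plant over MQTT; the sketch below assumes the paho-mqtt 2.x client, and the broker address and topic names are hypothetical.

```python
import json
import paho.mqtt.client as mqtt

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("edge-broker.local", 1883)  # hypothetical on-prem broker

# Publish one inspection verdict for downstream dashboards and MES integration.
result = {"station": "weld-03", "defect": False, "confidence": 0.97}
client.publish("factory/line1/inspection", json.dumps(result), qos=1)
client.disconnect()
```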
The Robotics AI Suite, launched this year, targets three distinct robot classes: stationary robot arms with real-time control and pick-and-place applications; autonomous mobile robots with multi-camera perception and SLAM capabilities; and humanoid robots with Action Chunking with Transformers (ACT) pipelines, LLM-driven movement control, and Diffusion Transformer support for manipulation tasks. All are built on ROS 2 and open standards, designed for long-lasting industrial deployment and modular upgrades, and deployable across multiple generations of Intel® Core™ Ultra processors (see the sketch below for the ROS 2 node model the suite builds on).
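Because the suite builds on ROS 2, applications slot into the standard node model; the minimal rclpy sketch below (node and topic names are hypothetical) publishes a pick-and-place status message once per second.

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class PickPlaceStatus(Node):
    """Hypothetical node reporting the state of a stationary robot arm."""

    def __init__(self):
        super().__init__('pick_place_status')
        self.pub = self.create_publisher(String, 'arm/status', 10)
        self.timer = self.create_timer(1.0, self.tick)  # report once per second

    def tick(self):
        msg = String()
        msg.data = 'cycle complete'
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = PickPlaceStatus()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```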
OpenVINO™ underpins the entire software stack, optimizing and scaling AI across CPU, GPU, and NPU to maximize performance and portability while protecting R&D investment across hardware generations. Models developed on any x86 workstation or cloud server deploy to Intel edge platforms with minimal modification, and Docker containers run unmodified. The entire development environment is available through Intel's GitHub.
The Industrial Edge Is Entering Its Next Wave of Growth
The trajectory is clear. Edge AI started with fixed-function embedded controllers decades ago, evolved through IoT connectivity and software-defined infrastructure, and matured with computer vision for defect detection and quality management. Now, generative, agentic, and physical AI are moving to the edge, driven by VLM and VLA models that combine vision with reasoning to deliver resilience and contextual understanding that traditional models cannot match.
Intel’s portfolio is built for this next wave.
What’s shaping the path forward isn’t what’s technically possible in the lab. It’s what the first wave of real deployments has revealed about what works at scale. The developers and integrators who have navigated the complexity of multi-subsystem robotics architectures are the ones driving demand for a more optimized approach. The customers running deterministic control workloads at production scale are the ones validating that a processor purpose-built for that job outperforms a generalist one.
Intel’s portfolio is built on what that real-world experience is teaching us.
At Embedded World 2026, we’re showing what this looks like in practice: real workloads, real customer deployments, real TCO savings. Not because the demos are impressive, but because the results are.
That’s the power of Intel Inside®.
_________________________________________________________________________
For notices, disclaimers, and details about certain performance claims, visit www.intel.com/PerformanceIndex
Unleash Fast and Optimized AI Inference with Intel® AI for Enterprise Inference
Intel® AI for Enterprise Inference reduces infrastructure complexity with a one-click packaged solution to deploy all the necessary components for optimized hardware-specific model inference.
Building Production AI Agents on Intel® Xeon® Processors with Flowise
Inference workloads are growing faster than any other, even outpacing training, and within them one category is growing fastest of all: agentic AI.
Give Your RAG a Voice: Building an Audio Q&A Experience with Intel® AI for Enterprise RAG
Turn your RAG into a voice-powered assistant with Intel® AI for Enterprise RAG.
Reduce Downtime Up To 50% by Utilizing AI-Ready RAS Features of Intel® Xeon® Processors
As generative and agentic AI use cases proliferate across nearly every industry, improving the reliability, availability, and serviceability (RAS) of AI clusters is becoming increasingly important. Intel® Xeon® 6 processors offer an impressive set of RAS features that can help improve the stability and performance of AI computing clusters. Intel's collaboration with Internet technology company ByteDance demonstrated that using the RAS features of Intel Xeon CPUs reduced server downtime by up to 50%.
How to Fine-Tune an LLM on Intel® GPUs With Unsloth
Fine-tuning an LLM doesn’t have to require massive infrastructure. With Unsloth now supporting Intel® GPUs, developers can efficiently customize models like Llama 3 and Qwen across Intel Core Ultra–based AI PCs, Intel Arc graphics, and the Intel Data Center GPU Max Series.
This blog walks through key techniques like SFT, PEFT, and RLHF—and shows how Intel-optimized libraries such as oneDNN and Triton accelerate training while reducing memory use. Build faster, smarter, and more personalized AI—all within the Intel ecosystem.