Neural networks news
Intel NN News
Confidential AI with GPU Acceleration: Bounce Buffers Offer a Solution Today
by Mike Ferron-Jones (Intel) and Dan Middleton (NVIDIA)
As AI workloads increasingly process sensitive and regulated data, enterprises face a growing challenge: how to combine the performance of GPU acceleration with strong confidentiality guarantees. Confidential AI aims to meet this need by protecting data actively in use, not just at rest or in transit. While Intel® Xeon® CPUs and NVIDIA GPUs both now support Trusted Execution Environments (TEEs), securely connecting these isolated domains was a critical architectural hurdle. Addressing that challenge is where the “bounce buffer” architecture comes into play.
Why GPU-Accelerated Confidential AI Matters
Many modern AI use cases, including healthcare analytics, financial modeling, and personalized recommendation systems, depend on highly sensitive inputs and proprietary models, a trend that will go into overdrive with agentic AI. AI workloads often require GPUs to meet performance requirements for training and inference, but traditional GPU passthrough across PCIe exposes data to system software and firmware outside the trusted boundary. This creates an inherent trust and privacy problem: organizations need assurance that data, model weights, and intermediate results remain confidential and unaltered throughout execution, even in shared or cloud environments.
The Trust Gap Between CPU and GPU TEEs
Both Intel and NVIDIA provide TEEs—Intel® Trust Domain Extensions (Intel® TDX) for CPUs and NVIDIA Confidential Computing modes for GPUs. However, data must still traverse the PCIe interconnect between these two domains. Without additional protection, DMA operations or other transfers could expose plaintext data on an unencrypted channel. The challenge is not a lack of TEEs but connecting them securely, without breaking confidentiality or incurring unacceptable performance degradation.
What Is a Bounce Buffer?
A bounce buffer is an intermediary memory region used to securely stage data transfers between CPU and GPU TEEs. In the NVIDIA Confidential Computing deployment architecture, GPU DMA operations are redirected through a host-managed, encrypted bounce buffer. Data is decrypted only inside the CPU TEE, processed, and then re-encrypted before being staged in the bounce buffer for GPU consumption. This approach ensures that neither the hypervisor nor the device path ever sees plaintext data.
Figure 1. Visualization of CPU and GPU TEE with encrypted bounce buffer.
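To make the data flow concrete, here is a minimal conceptual sketch in Python of the staging pattern described above. It assumes an AES-GCM session key already negotiated between the two TEEs; the function names and key handling are illustrative only and do not reflect the NVIDIA driver's actual encrypted-DMA implementation.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

session_key = AESGCM.generate_key(bit_length=256)  # stands in for the TEE-negotiated key
aead = AESGCM(session_key)

def stage_for_gpu(plaintext: bytes) -> bytes:
    """Encrypt inside the CPU TEE before placing data in the shared bounce buffer."""
    nonce = os.urandom(12)
    return nonce + aead.encrypt(nonce, plaintext, None)

def read_in_gpu_tee(bounce_buffer: bytes) -> bytes:
    """Decrypt only after the data has crossed PCIe into the GPU TEE."""
    nonce, ciphertext = bounce_buffer[:12], bounce_buffer[12:]
    return aead.decrypt(nonce, ciphertext, None)

# The hypervisor and the PCIe path only ever observe `staged`, which is ciphertext.
staged = stage_for_gpu(b"model weights / activations")
assert read_in_gpu_tee(staged) == b"model weights / activations"
```

The key point the sketch captures is that plaintext exists only inside a TEE; everything that crosses the untrusted host and interconnect is encrypted and integrity-protected.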
Reference Architecture and Implementation
Intel and NVIDIA collaborated closely on solution engineering and validation of the bounce buffer architecture, working with Canonical to enable a production-ready software stack. The reference implementation combines Intel TDX-enabled Xeon platforms; NVIDIA H100 and H200 GPUs, along with NVIDIA Blackwell B200 and B300 GPUs, operating in Confidential Computing modes; and an Ubuntu Linux virtualization stack capable of enforcing memory isolation and encrypted DMA paths over PCIe. The reference architecture and deployment guide are publicly available today here.
Solution Ingredients
The reference architecture hardware uses 5th Gen Intel Xeon Scalable CPUs (code-named "Emerald Rapids") with NVIDIA Hopper, NVIDIA Blackwell, and the RTX PRO Server GPU family of offerings. The host OS and virtualization layer are provided by Ubuntu 25.10, and the guest OS is Ubuntu 24.04 LTS. This stack enables the establishment of TEEs on both the CPU and multiple GPUs, as well as OS support to manage bounce buffer mappings.
While the bounce buffer introduces additional copy and encryption steps, observed performance remains suitable for real-world AI inference scenarios, especially when weighed against the security, privacy, and compliance benefits provided.
Remote attestation is a critical part of Confidential Computing, providing cryptographic assurance and verification that the CPU and GPU TEEs launched correctly and are running as expected. In addition to bounce buffers, Intel and NVIDIA worked together to synchronize CPU and GPU attestation through Intel Trust Authority, enabling customers to receive attestations via a single service rather than using separate services.
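As a rough illustration of the single-service flow, the sketch below gathers evidence from both TEEs and submits it to one verification endpoint. Every helper, URL, and claim name here is a placeholder assumption; Intel Trust Authority's actual client SDK, endpoints, and token schema differ and should be taken from its documentation.

```python
import requests
import jwt  # PyJWT

def collect_tdx_quote() -> bytes:
    """Placeholder for fetching the Intel TDX quote from inside the CPU TEE."""
    return b"tdx-quote-bytes"

def collect_gpu_evidence() -> bytes:
    """Placeholder for fetching the GPU's Confidential Computing attestation evidence."""
    return b"gpu-evidence-bytes"

# Submit both pieces of evidence to a single (placeholder) verification service.
resp = requests.post(
    "https://attestation.example.com/appraise",
    json={
        "tdx_quote": collect_tdx_quote().hex(),
        "gpu_evidence": collect_gpu_evidence().hex(),
    },
    timeout=30,
)
token = resp.json()["attestation_token"]

# In production, verify the token signature against the service's published keys;
# the claim names below are hypothetical.
claims = jwt.decode(token, options={"verify_signature": False})
assert claims.get("cpu_tee_verified") and claims.get("gpu_tee_verified")
```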
The Road Ahead: TEE-IO and Intel TDX Connect
To address the remaining architectural gaps, there is a broader industry push to secure data in use through open, interoperable confidential computing primitives rather than siloed, vendor-specific solutions. In that spirit, the solution aligns with the community models emerging in the Confidential Computing Consortium, where hardware vendors, cloud providers, and software developers collaborate on common TEE building blocks and deployment patterns.
Bounce buffers provide a practical solution today; the industry is moving toward standards-based TEE-IO, where the CPU and attached devices can effectively establish a single logical TEE, with faster direct memory access and end-to-end encrypted communications. Intel TDX Connect is Intel's framework for securely binding CPU and device TEEs with hardware-level PCIe link encryption, reducing overhead and improving efficiency. NVIDIA Accelerated Confidential Computing, along with Intel Xeon 6 processors (code-named "Granite Rapids"), is already architecturally prepared for Intel TDX Connect adoption as the ecosystem software matures.
Production Ready Today
The bounce buffer architecture is not theoretical. Confidential AI solutions using this technology are already in production at major cloud service providers, including Alibaba, ByteDance, Google, and Oracle, with additional providers expected to follow. Customers can also work with their preferred Linux distribution vendors to deploy select inference workloads on-premises. These deployments demonstrate that Confidential Computing and GPU acceleration can coexist at scale. We invite anyone interested to take them out for a test drive today.
Resources and Further Reading
NVIDIA Deployment Guide for Secure AI
Intel Confidential Computing Homepage
NVIDIA Confidential Computing Homepage
Intel TDX Connect Architectural Specification
Intel NVIDIA Seamless Attestation Whitepaper
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No product or component can be absolutely secure.
Edge AI
AI That Moves the World Starts at the Edge
Edge AI for Smart Cities
Cities That Sense, Decide, and Respond as One: Edge AI Turns Urban Infrastructure into Autonomous Systems
From Factory Precision to Intelligent Robots: How Real-World Performance Shapes Industrial Edge AI
AI often looks impressive in demos. We’ve all seen them—robots performing miraculous feats of dexterity, thanks largely to the human with a gaming controller behind the stage.
To be fair, a proof of concept often requires this. However, it's an important reminder that what looks good on stage doesn't automatically translate to the real world. And not because the technology isn't fantastic. Rather, it's because the edge is so complex.
The real question is whether a PoC works within real operational constraints and delivers tangible outcomes safely, consistently, and efficiently. That’s the true readiness indicator for scale. And that’s the lens through which Intel approaches industrial edge AI: not what’s technically possible in a lab, but what’s production-ready on a factory floor, inside a robotic arm, or aboard an autonomous mobile robot navigating a warehouse.
At Embedded World 2026, we’re expanding Intel’s edge portfolio with silicon built for exactly this reality—processors optimized for precise machine control, and processors purpose-built for autonomous, intelligent systems. The right compute for the right workload is not a compromise. It’s how production-scale industrial AI actually gets built.
Manufacturing Precision and Autonomy Depend on Real-World Performance
A robotic arm on an assembly line handles different tasks across the production sequence: picking, placing, inspecting, packaging. Every motion must stay perfectly synchronized with every other arm on the line. If one moves a fraction of a millisecond faster and another a fraction slower, the result is misalignment, defects, and wasted product. In manufacturing, precision is everything. The compute that drives these systems needs to be fast, but more importantly, it needs to be consistent.
Precision, Real-World Performance
Intel® Core™ Series 2 processors are purpose-built for this requirement. The all-P-core architecture delivers higher performance compared to prior generations, with 10-year support and options for environmental hardening. But the real differentiator isn’t raw throughput; it’s determinism.
Intel® Time Coordinated Computing (Intel® TCC) and Time-Sensitive Networking (TSN) enable the precise timing and predictable execution that are essential for industrial control, machine vision, and automation. In benchmark comparisons against AMD's 9700X at equivalent power levels, Core Series 2 delivers more deterministic scheduling behavior, more predictable performance under load, and lower maximum PCIe latency.
The all-P-core architecture also simplifies CPU management and scheduling, reducing development complexity and maintenance overhead. For industrial automation engineers, this matters in a specific way: every core behaves the same way, every time. Developers don’t have to account for heterogeneous core behavior when writing real-time control logic. For systems that need to run identically for years, that predictability is not a nice-to-have. It’s the foundation.
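As a simple illustration of why uniform core behavior matters, here is a minimal sketch, assuming a Linux host with a core reserved for the control task (e.g., via isolcpus), that pins a fixed-period control loop to one core with FIFO scheduling. Intel TCC and TSN tuning happens in platform firmware and tooling and is not shown.

```python
import os
import time

CONTROL_CORE = 3            # hypothetical core reserved for the control loop
PERIOD_NS = 1_000_000       # 1 ms control period

os.sched_setaffinity(0, {CONTROL_CORE})                       # stay on one dedicated core
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(80))   # real-time priority (requires root)

next_deadline = time.monotonic_ns()
while True:
    # read sensors, compute the next actuator command, write outputs ...
    next_deadline += PERIOD_NS
    sleep_s = (next_deadline - time.monotonic_ns()) / 1e9
    if sleep_s > 0:
        time.sleep(sleep_s)
```

On an all-P-core part, this loop sees the same core behavior wherever it lands, which is exactly the property the control logic is relying on.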
The proof of any industrial platform is what customers do with it:
Neurocle, a vision AI solution provider, is delivering faster, more responsive defect detection on manufacturing lines. Their system identifies issues earlier and keeps operations flowing smoothly—a direct result of the consistent, low-latency inference that Core Series 2 enables.
In warehouse automation, XYZ Robotics is improving overall productivity by reducing compute-related delays, shortening planning cycles, and minimizing idle time. The result is smoother operation, fewer late waves, and faster payback on automation investments.
Codesys, a leader in industrial control software, is helping customers consolidate more virtual PLCs onto fewer systems, enabling more compact, cost-efficient designs and simpler infrastructure.
These are not proofs of concept. They are production deployments running on Intel silicon, delivering measurable outcomes that justify continued investment. And they point to something important: when the control workload is well-defined and the performance requirements are deterministic, a processor optimized specifically for that job outperforms a generalist one. Core Series 2 is that processor.
What Happens When Robots Need to Think
The deployments above represent industrial AI at scale for control-dominant workloads. But a different class of deployment is emerging—one where the robot doesn’t just execute a sequence, but observes, reasons, and adapts. And that requires a fundamentally different approach to compute.
Traditional computer vision models for factory robots were small, typically under 50 million parameters, and focused on narrow tasks: is the part present, is the weld aligned, is the worker wearing a hard hat. These models worked well within tight constraints but broke when conditions changed. If the safety gear changed color or the packaging was redesigned, the model stopped recognizing what it was seeing.
Vision Language Models (VLMs) and Vision Language Action Models (VLAs) change this equation. These transformer-based architectures, ranging from 500 million to 5 billion parameters and larger, combine computer vision with generative AI to understand context, not just detect objects.
A VLM-equipped robot recognizes that a hard hat is still safety gear even when the color or design changes. A VLA model goes further: it can observe a human performing a task, learn the sequence, and execute it autonomously. This is imitation learning, and it’s the core capability driving humanoid robotics forward.
Running these models alongside real-time control requires simultaneous execution of workloads with very different timing and compute profiles. Vision inference, LLM-based reasoning, and sub-millisecond motor control cannot compete for the same resources without compromising the integrity of at least one of them. The architecture has to support them concurrently and independently.
What Real-World Robotics Deployments Are Teaching Us
The early wave of advanced robotics deployments—the ones pushing into humanoid robots and agentic AI—were built on multi-subsystem architectures. A dedicated processor for real-time controls, a separate one for AI inference. That approach made sense at the time: it allowed developers to target each function with purpose-built hardware and get the first applications to market.
But real-world deployment experience is revealing the limits of that path. Two processors mean two boards, two software stacks, separate thermal management, and compounded integration risk. Every additional component adds cost, adds failure points, and adds friction between the prototype stage and production at scale. The developers who have lived through that complexity are the ones now asking whether there is a more optimized architecture—one that preserves the functional separation between control and AI inference without requiring separate silicon to achieve it.
That is the problem Core Ultra Series 3 is built to solve.
Precision with Integrated Acceleration
Intel® Core™ Ultra Series 3 is the first Intel processor to combine AI acceleration and real-time control in a single SoC. It brings nearly 180 TOPS of integrated AI acceleration, the ability to operate in rugged environments, and a low power envelope that fits existing industrial form factors—alongside the same Intel® TCC, discrete TSN, Functional Safety (FuSa) readiness, and In-Band ECC memory support that industrial and mission-critical applications require.
The key architectural insight is that integration does not mean consolidation of resources. Core Ultra Series 3’s CPU, GPU, and dedicated NPU run independently on isolated silicon. Vision runs on the NPU. LLM-based reasoning runs on the GPU. Real-time control runs on the CPU. They execute concurrently without competing for resources—which is exactly what the two-processor architecture was trying to achieve, without the hardware complexity.
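A minimal OpenVINO sketch of that separation might look like the following; the model file names are hypothetical, and a real pipeline would add pre/post-processing, asynchronous scheduling, and (for an LLM) the OpenVINO GenAI runtime rather than a bare compiled model.

```python
import openvino as ov

core = ov.Core()

# Pin each workload to the engine it is meant to own (model paths are hypothetical).
vision = core.compile_model("defect_detector.xml", "NPU")      # perception on the NPU
reasoning = core.compile_model("planner_llm.xml", "GPU")       # language/VLM reasoning on the GPU
estimator = core.compile_model("state_estimator.xml", "CPU")   # control-adjacent logic on CPU cores

# Separate infer requests let the three engines run concurrently
# instead of contending for a single accelerator.
vision_req = vision.create_infer_request()
reasoning_req = reasoning.create_infer_request()
```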
Independent benchmarks by Circulus, a robotics partner, demonstrated this in practice. When running concurrent vision, LLM reasoning, and speech synthesis workloads, Core Ultra Series 3’s dedicated NPU maintained vision performance with only a 17 percent drop under full cognitive load, while competitive GPU-shared architectures showed a 56 percent drop.
For a humanoid robot working alongside humans on a factory floor, that difference determines whether the robot detects a falling object in time to react. The platform's deterministic perception, independent of cognitive load, makes it fundamentally easier to certify for personal care and industrial robots under the relevant ISO safety requirements.
The Economics of Convergence
The TCO case follows directly from the architectural one. Customers who have moved from multi-processor to single-SoC deployments on Core Ultra Series 3 have achieved 39 to 67 percent TCO savings compared to higher-cost, higher-power alternatives. For on-device fine-tuning—one of the most intensive AI workloads typically reserved for expensive discrete GPUs—Core Ultra Series 3 achieved 87 percent of the performance of a discrete solution at 5.8x the savings.
That is the kind of economics that determines whether a robotics deployment scales from 10 units to 10,000.
Circulus is already seeing the results in practice: smoother motion, better scene understanding, and more natural interactions from humanoid robots running on Core Ultra Series 3. The improvement isn’t attributable to any single benchmark advantage. It’s the result of running perception, reasoning, and control on one tightly integrated platform—without the coordination overhead that separate subsystems inevitably introduce.
Open Software and AI Suites Compress the Development Cycle
Hardware alone doesn’t solve the deployment gap. Intel’s Manufacturing AI Suite and Robotics AI Suite provide the software tools, sample applications, and benchmarked reference implementations that industrial developers need to move from concept to production.
The Manufacturing AI Suite covers predictive maintenance, process optimization, anomaly detection, quality inspection, worker safety, and vision-guided robotics—all built on modular, open-source components with IoT protocol support for MQTT and OPC UA.
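For example, a quality-inspection result can be pushed to the rest of the plant over MQTT; the sketch below assumes the paho-mqtt 2.x client, and the broker address and topic names are hypothetical.

```python
import json
import paho.mqtt.client as mqtt

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("edge-broker.local", 1883)  # hypothetical on-prem broker

# Publish one inspection verdict for downstream dashboards and MES integration.
result = {"station": "weld-03", "defect": False, "confidence": 0.97}
client.publish("factory/line1/inspection", json.dumps(result), qos=1)
client.disconnect()
```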
The Robotics AI Suite, launched this year, targets three distinct robot classes: stationary robot arms with real-time control and pick-and-place applications; autonomous mobile robots with multi-camera perception and SLAM capabilities; and humanoid robots with Action Chunking with Transformers (ACT) pipelines, LLM-driven movement control, and Diffusion Transformer support for manipulation tasks. All are built on ROS 2 and open standards, designed for long-lasting industrial deployment and modular upgrades, and deployable across multiple generations of Intel® Core™ Ultra processors (see the sketch below for the ROS 2 node model the suite builds on).
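Because the suite builds on ROS 2, applications slot into the standard node model; the minimal rclpy sketch below (node and topic names are hypothetical) publishes a pick-and-place status message once per second.

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class PickPlaceStatus(Node):
    """Hypothetical node reporting the state of a stationary robot arm."""

    def __init__(self):
        super().__init__('pick_place_status')
        self.pub = self.create_publisher(String, 'arm/status', 10)
        self.timer = self.create_timer(1.0, self.tick)  # report once per second

    def tick(self):
        msg = String()
        msg.data = 'cycle complete'
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = PickPlaceStatus()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```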
OpenVINO™ underpins the entire software stack, optimizing and scaling AI across CPU, GPU, and NPU to maximize performance and portability while protecting R&D investment across hardware generations. Models developed on any x86 workstation or cloud server deploy to Intel edge platforms with minimal modification, and Docker containers run unmodified. The entire development environment is available through Intel's GitHub.
The Industrial Edge Is Entering Its Next Wave of Growth
The trajectory is clear. Edge AI started with fixed-function embedded controllers decades ago, evolved through IoT connectivity and software-defined infrastructure, and matured with computer vision for defect detection and quality management. Now, generative, agentic, and physical AI are moving to the edge, driven by VLM and VLA models that combine vision with reasoning to deliver resilience and contextual understanding that traditional models cannot match.
Intel’s portfolio is built for this next wave.
What’s shaping the path forward isn’t what’s technically possible in the lab. It’s what the first wave of real deployments has revealed about what works at scale. The developers and integrators who have navigated the complexity of multi-subsystem robotics architectures are the ones driving demand for a more optimized approach. The customers running deterministic control workloads at production scale are the ones validating that a processor purpose-built for that job outperforms a generalist one.
Intel’s portfolio is built on what that real-world experience is teaching us.
At Embedded World 2026, we’re showing what this looks like in practice: real workloads, real customer deployments, real TCO savings. Not because the demos are impressive, but because the results are.
That’s the power of Intel Inside®.
_________________________________________________________________________
For notices, disclaimers, and details about certain performance claims, visit www.intel.com/PerformanceIndex
Unleash Fast and Optimized AI Inference with Intel® AI for Enterprise Inference
Intel® AI for Enterprise Inference reduces infrastructure complexity with a one-click packaged solution to deploy all the necessary components for optimized hardware-specific model inference.
Building Production AI Agents on Intel® Xeon® Processors with Flowise
Inference workloads are growing faster than any other, even outpacing training, and within them one category is growing fastest of all: agentic AI.
Give Your RAG a Voice: Building an Audio Q&A Experience with Intel® AI for Enterprise RAG
Turn your RAG into a voice-powered assistant with Intel® AI for Enterprise RAG.
Reduce Downtime Up To 50% by Utilizing AI-Ready RAS Features of Intel® Xeon® Processors
As generative and agentic AI use cases proliferate across nearly every industry, improving the reliability, availability, and serviceability (RAS) of AI clusters is becoming increasingly important. Intel® Xeon® 6 processors offer an impressive set of RAS features that can help improve the stability and performance of AI computing clusters. Intel's collaboration with Internet technology company ByteDance demonstrated that using the RAS features of Intel Xeon CPUs reduced server downtime by up to 50%.
How to Fine-Tune an LLM on Intel® GPUs With Unsloth
Fine-tuning an LLM doesn’t have to require massive infrastructure. With Unsloth now supporting Intel® GPUs, developers can efficiently customize models like Llama 3 and Qwen across Intel Core Ultra–based AI PCs, Intel Arc graphics, and the Intel Data Center GPU Max Series.
This blog walks through key techniques like SFT, PEFT, and RLHF—and shows how Intel-optimized libraries such as oneDNN and Triton accelerate training while reducing memory use. Build faster, smarter, and more personalized AI—all within the Intel ecosystem.