Issue 01 — Spring 2026
The European magazine on private AI

Guide

How to Run Your Own AI Instead of ChatGPT (Self-Hosted Guide 2026)

Run ChatGPT-level AI on your own servers with open-source models. Options, hardware requirements, and step-by-step setup guide.

You can run a large language model as capable as ChatGPT on a server in your office — or in a European data center you control. The model weights are free. The inference software is free. Your data never touches anyone else’s infrastructure.

That is no longer a theoretical possibility. In 2026, the gap between a self-hosted deployment of open-source models and OpenAI’s API is narrow enough that most business tasks produce equivalent results. The question is not whether self-hosted AI is viable. It is whether you have the team, the hardware, and the appetite for operational complexity.

This guide gives you everything you need to decide — and, if you decide to proceed, everything you need to build.

Quick start

Easiest path: Ollama + Open WebUI — running in 30 minutes
Best general model: Llama 3.3 70B — strong across all business tasks
Best reasoning model: DeepSeek R1 67B — rivals GPT-4 on complex analysis
Minimum hardware: 16 GB RAM, GPU recommended (Apple Silicon or NVIDIA)
Production-ready enterprise: ORCA on PRISMA ★ — managed, multi-model, compliant

ORCA is developed by HT-X S.r.l., publisher of this site.

Why self-host AI

Three forces are pushing European companies toward self-hosted AI:

Data sovereignty is non-negotiable. When you use ChatGPT, your prompts travel to OpenAI’s servers in the United States. For companies processing personal data, health records, financial information, or trade secrets, that is a GDPR liability. The Italian Data Protection Authority fined OpenAI EUR 15 million in 2024. The maximum penalty under GDPR is EUR 20 million or 4% of global turnover. Self-hosting eliminates the transfer entirely.

Shadow AI is already in your company. Research from Gartner found that 77% of employees use AI tools their IT department has not sanctioned. Banning ChatGPT does not stop usage — it just removes visibility. Self-hosting gives employees a sanctioned tool that is as easy to use as ChatGPT, with the data staying inside your perimeter.

Cost predictability. OpenAI and Anthropic charge per token or per seat. As usage grows, so does the bill — linearly and indefinitely. A self-hosted deployment has a fixed infrastructure cost. Whether 10 people use it or 200, the hardware cost is the same. For companies with more than 20-30 active users, self-hosting is often cheaper within the first year.
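
To make the break-even claim concrete, here is a minimal sketch comparing a fixed self-hosted cost against per-seat API pricing. All euro figures (hardware price, operating cost, per-seat rate) are illustrative assumptions, not quotes.

```python
# Sketch: break-even point for a fixed self-hosted cost vs. a per-seat
# API bill. All euro figures are illustrative assumptions, not quotes.

def breakeven_months(hardware_eur: float, ops_eur_month: float,
                     api_eur_per_user_month: float, users: int) -> float:
    """Months until the one-off hardware spend is paid back by savings."""
    monthly_saving = api_eur_per_user_month * users - ops_eur_month
    if monthly_saving <= 0:
        return float("inf")  # the API stays cheaper at this scale
    return hardware_eur / monthly_saving

# Example: EUR 18,000 server, EUR 500/month power and maintenance,
# vs. an assumed EUR 60/user/month API subscription for 50 users.
print(breakeven_months(18_000, 500, 60, 50))  # 7.2 months
```

At those assumed rates the hardware pays for itself in well under a year; at 10 users it never does, which is why the 20-30 user threshold matters.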

The models: what to run in 2026

The open-source model landscape has matured dramatically. Here is what is available for business deployment:

Model | Developer | Parameters | Strengths | GPU requirement
Llama 3.3 | Meta | 8B, 70B, 405B | Best general-purpose; strong multilingual | 8B: 8 GB VRAM; 70B: 48 GB+
Mistral / Mixtral | Mistral AI (Paris) | 7B, 22B, 8x22B | European languages; efficiency; MoE architecture | 7B: 8 GB; 8x22B: 80 GB+
DeepSeek R1 | DeepSeek | 7B, 67B, 671B | Reasoning, coding, complex analysis | 67B: 48 GB+; 671B: multi-GPU
Qwen 3.5 | Alibaba | 7B, 72B, 235B | Multimodal, multilingual, strong reasoning | 72B: 48 GB+
GLM 5 | Zhipu AI | 9B, 32B | Reasoning, coding, compact efficiency | 32B: 24 GB
Kimi 2.5 | Moonshot AI | 70B+ | Long-context (128K+), agent capabilities | 48 GB+
Gemma 2 | Google | 2B, 9B, 27B | Compact, efficient, good for edge deployment | 9B: 12 GB; 27B: 24 GB

Which model should you start with? For most business use cases — document analysis, email drafting, report summarization, knowledge base Q&A — Llama 3.3 70B offers the best balance of capability and resource requirements. If your primary need is code generation or complex analytical reasoning, DeepSeek R1 67B is the strongest option. For companies needing a smaller, faster model that still performs well, Mistral 7B or Gemma 9B are excellent choices that run on modest hardware.

The key advantage of self-hosting: you are not locked into any single model. You can run Llama for general tasks, DeepSeek for reasoning, and Mistral for multilingual correspondence — all on the same infrastructure, switching models per use case.
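
A minimal sketch of that per-use-case switching. The model tags are Ollama-style names and an assumption about which models your deployment has pulled:

```python
# Sketch: per-use-case model routing on one inference server. The
# model tags are Ollama-style names and an assumption about which
# models your deployment has pulled.

ROUTES = {
    "general":      "llama3.3:70b",
    "reasoning":    "deepseek-r1:70b",
    "multilingual": "mistral:7b",
}

def pick_model(task: str) -> str:
    """Map a task type to a model tag, defaulting to the generalist."""
    return ROUTES.get(task, ROUTES["general"])

# pick_model("reasoning") is then passed as the "model" field of the
# request sent to the local inference API.
```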

The tools: how to serve models

The model is just the brain. You need software to load it, serve requests, and provide a user interface. Here are the main options:

Ollama

The simplest path from zero to running LLM. Ollama packages model management, quantization, and a local API server into a single command-line tool. Install it, run ollama pull llama3.3, and you have a working AI endpoint. Pair with Open WebUI for a browser-based chat interface that supports conversations, file uploads, and multi-model selection.

Best for: Getting started, small team deployments, developer workstations.

vLLM

The production-grade inference engine. vLLM uses PagedAttention and continuous batching to maximize GPU throughput — serving 2-4x more concurrent users per GPU than naive implementations. It exposes an OpenAI-compatible API, making it a drop-in replacement for applications built against OpenAI’s endpoint.

Best for: Production deployments serving 20+ concurrent users. When throughput and latency matter.
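
A sketch of what “drop-in replacement” means in practice, using only the Python standard library. The port and model name are assumptions about your deployment:

```python
# Sketch: calling a local vLLM server through its OpenAI-compatible
# chat endpoint. The URL and model name are deployment assumptions.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> bytes:
    """Build the same JSON body an OpenAI client library would send."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()

def ask(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        VLLM_URL, data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# ask("meta-llama/Llama-3.3-70B-Instruct", "Summarize our Q3 report.")
```

Because the request and response shapes match OpenAI’s, existing applications only need the base URL changed.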

Hugging Face TGI (Text Generation Inference)

Hugging Face’s production inference server. Supports tensor parallelism (multi-GPU), quantization, Flash Attention, and watermarking. Slightly more complex than vLLM but offers more fine-grained control over serving configuration.

Best for: Organizations already in the Hugging Face ecosystem, or those needing advanced serving features like tensor parallelism across multiple GPUs.

ORCA ★ by HT-X — publisher of this site

Not a toolkit but a complete platform. ORCA wraps the inference engine, model management, RAG pipeline, user interface, authentication, and audit trail into a managed solution that HT-X installs on your infrastructure. You do not manage the serving layer — you use the AI.

Best for: Companies that want the benefits of self-hosted AI without the operational burden. ORCA handles model selection, updates, and compliance; the business focuses on using AI productively.

Hardware requirements

Hardware is the single biggest investment in self-hosted AI. Here is what you actually need:

Minimum: getting started (development / small team)

  • CPU: Modern multi-core (Apple M2+ or AMD/Intel with AVX2)
  • RAM: 16 GB minimum, 32 GB recommended
  • GPU: Apple Silicon (M2/M3/M4 with 16+ GB unified memory) or NVIDIA GPU with 8+ GB VRAM
  • Storage: 50-100 GB SSD for model weights
  • Models: 7B-13B parameter models (Mistral 7B, Llama 3.3 8B, Gemma 9B)
  • Users: 1-5 concurrent

This setup runs comfortably on a modern MacBook Pro or a mid-range workstation. Response times are acceptable for individual use but not for serving a team.

Production: serving a department (10-50 users)

  • GPU: NVIDIA A100 40GB or L40S 48GB (or equivalent)
  • RAM: 64-128 GB system RAM
  • CPU: 16+ cores
  • Storage: 500 GB NVMe SSD
  • Network: 10 GbE to internal network
  • Models: 70B parameter models (Llama 3.3 70B, DeepSeek R1 67B, Qwen 3.5 72B)
  • Users: 10-50 concurrent with acceptable latency

Budget: EUR 10,000-25,000 for hardware, depending on whether you buy or lease. Cloud GPU instances (e.g., on Hetzner, OVH, or other European providers) run EUR 1,500-3,000/month for equivalent capability.

Enterprise: serving the whole company (50-200+ users)

  • GPU: 2-4x NVIDIA A100 80GB or H100 (multi-GPU with NVLink)
  • RAM: 256+ GB system RAM
  • CPU: 32+ cores
  • Storage: 1+ TB NVMe
  • Models: 70B-405B parameter models, multiple models simultaneously
  • Users: 50-200+ concurrent

Budget: EUR 50,000-150,000 for hardware. At this scale, a managed solution like ORCA on dedicated infrastructure becomes significantly more cost-effective than building and operating the stack yourself.
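
The VRAM figures in the tiers above follow a simple rule of thumb: bytes per weight at a given quantization, times a safety margin. A sketch, where the 20% overhead factor for KV cache and activations is an assumption, not a measured figure:

```python
# Sketch: rule-of-thumb VRAM estimate. The 20% overhead factor for
# KV cache and activations is an assumption, not a measured figure.

BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_billions: float, quant: str = "int4") -> float:
    """Approximate GPU memory needed to serve a model at a quantization."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return round(weights_gb * 1.2, 1)  # +20% for KV cache / activations

print(vram_gb(70))         # 70B at 4-bit: ~42 GB, inside the 48 GB+ tier
print(vram_gb(7, "fp16"))  # 7B unquantized: ~17 GB
```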

Step-by-step: from zero to running AI

Here is the practical path for a company that wants to self-host AI, starting from nothing.

Step 1: Install Ollama (30 minutes)

Download Ollama from ollama.com. It runs on macOS, Linux, and Windows. On macOS, install the desktop app (or use Homebrew: brew install ollama); on Linux, one command installs it:

curl -fsSL https://ollama.com/install.sh | sh

Then pull and run a model:

ollama pull llama3.3
ollama run llama3.3

You now have a working LLM on your machine. Test it with business-relevant prompts — summarize a document, draft an email, answer a question about your industry.

Step 2: Add a web interface (1 hour)

Install Open WebUI to give your AI a browser-based chat interface:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Navigate to localhost:3000. You have a ChatGPT-like interface running entirely on your hardware.

Step 3: Evaluate models (1-2 weeks)

Download several models and test them against your actual use cases. Create a benchmark document set — contracts, reports, emails, code — and evaluate each model’s output quality. Track:

  • Accuracy on domain-specific questions
  • Quality of generated text in your company’s languages
  • Response time at expected concurrent usage
  • Resource consumption (GPU memory, CPU)
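
The evaluation loop can be sketched in a few lines. This assumes an Ollama server on localhost:11434; the model tags and prompts are placeholders for your own benchmark set:

```python
# Sketch of a minimal Step 3 evaluation harness. Assumes an Ollama
# server on localhost:11434; model tags and prompts are placeholders.
import json, time, urllib.request

MODELS = ["llama3.3:70b", "deepseek-r1:70b", "mistral:7b"]
PROMPTS = ["Summarize: ...", "Draft a reply to: ..."]

def query(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt to Ollama's /api/generate and time the response."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate", data=body,
        headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        text = json.loads(resp.read())["response"]
    return text, time.perf_counter() - t0

def summarize(latencies: list[float]) -> dict:
    """Aggregate per-model latencies into the numbers worth tracking."""
    return {"mean_s": round(sum(latencies) / len(latencies), 2),
            "worst_s": round(max(latencies), 2)}
```

Output quality still needs human review against your benchmark documents; this only automates the latency and consistency side.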

Step 4: Set up production infrastructure (2-4 weeks)

If your evaluation confirms self-hosted AI is the right path:

  1. Provision a dedicated server with appropriate GPU (see hardware requirements above)
  2. Switch from Ollama to vLLM or TGI for production serving
  3. Implement user authentication (LDAP/SSO integration)
  4. Set up a RAG pipeline to connect the model to company documents
  5. Configure monitoring and alerting (GPU utilization, response latency, error rates)
  6. Implement audit logging for AI Act compliance
  7. Write an internal AI usage policy
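
Item 6 (audit logging) can start as simply as an append-only JSON-lines file. A sketch; the field names are assumptions to adapt to your AI Act documentation needs:

```python
# Sketch: one record per AI request in an append-only JSON-lines log.
# Field names are assumptions; adapt them to your compliance needs.
import json, hashlib, datetime

def audit_record(user: str, model: str, prompt: str, response: str) -> str:
    return json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "model": model,
        # Hash the contents so the log proves *that* a request happened
        # without duplicating sensitive text into a second store.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
    })

# with open("audit.jsonl", "a") as f:
#     f.write(audit_record("anna", "llama3.3:70b", p, r) + "\n")
```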

Step 5: Deploy and train users (1-2 weeks)

Roll out to a pilot department. Provide brief training — the interface is intuitive, but users benefit from understanding what the AI can and cannot do, and how to write effective prompts. Collect feedback, tune the RAG pipeline, and iterate.

When self-hosting is not enough

Self-hosting is powerful but demanding. Here is when it stops making sense:

You do not have ML engineering capacity. A production self-hosted deployment requires ongoing attention: model updates, security patches, performance tuning, RAG pipeline maintenance, user management. If you do not have at least 0.5 FTE of engineering time to dedicate, the system will degrade over time.

You need guaranteed SLAs. If AI availability is business-critical — customer-facing chatbots, real-time document processing, production workflows — you need monitoring, failover, and incident response that a self-managed deployment cannot easily provide without significant investment.

Compliance is complex. The GDPR and AI Act require documentation, traceability, and audit readiness. Self-hosted open-source tools do not generate compliance documentation automatically. You need to build that layer yourself — or spend on consultants.

You want to focus on using AI, not operating it. Every hour spent debugging GPU drivers or optimizing batch sizes is an hour not spent using AI to improve your business.

For all of these cases, a managed on-premise solution bridges the gap. You keep the data sovereignty and cost benefits of self-hosting, while the vendor handles operational complexity. ORCA by HT-X is built for exactly this scenario: it runs on your hardware, uses the same open-source models, but HT-X manages the platform so you can focus on the business value.

The choice between DIY and managed is not about capability — both paths give you private ChatGPT. It is about where you want to spend your engineering hours.

Frequently asked questions

What are on-premise LLMs?

On-premise LLMs (Large Language Models) are AI models installed directly on company servers rather than accessed through cloud services. Data never leaves the company infrastructure, providing total privacy and GDPR compliance.

Which open-source models should I consider in 2026?

The leading open-source models in 2026 are Llama 3.3 (Meta) for general use, Mistral for efficiency and European-language performance, DeepSeek R1 for advanced reasoning, Qwen 3.5 (Alibaba) for multimodal and multilingual tasks, GLM 5 (Zhipu AI) for reasoning and coding, and Kimi 2.5 (Moonshot AI) for long-context tasks. ORCA supports all of these models.

What hardware do I need?

It depends on the model and the number of users. 7-13B-parameter models run on a single mid-range GPU; 70B-class models need roughly 48 GB of VRAM (an NVIDIA A100 or L40S class card); larger models require multi-GPU configurations. HT-X sizes hardware based on specific requirements.

How do open-source models compare to GPT-4?

Modern open-source models (Llama 3.3, Mistral, DeepSeek R1, Qwen 3.5) achieve performance comparable to GPT-4 on most business tasks. For activities like document analysis, text generation, customer support, and coding, the differences are minimal. The advantage is total data privacy.

Self-hosting too complex?

ORCA gives you private AI without managing infrastructure. Same open-source models, same data sovereignty — but HT-X handles the setup, updates, and support.

Request a pilot