
EULLM: Why We Built a European LLM Platform
Ninety-five percent of the AI infrastructure used in Europe depends on American or Chinese companies, and every API call transmits data outside EU borders. With the EU AI Act becoming fully applicable in August 2026, we built EULLM: local Rust-based inference, model specialization, and a model registry running on European infrastructure.

The Problem: Digital Sovereignty and AI in Europe
We've worked with European companies since 2015. When a client asks us for a RAG system or an AI assistant, the first question is always the same: "where does my data go?" The honest answer, until recently, was complicated. Even the most widespread "on-premise" solutions depend on American components: Meta's Llama models require mandatory branding, Ollama transmits telemetry, and most model registries run on AWS us-east.

But the more pressing issue is regulatory: the EU AI Act becomes fully applicable on August 2, 2026. Articles 53-55 impose specific obligations for general-purpose AI models: technical documentation, training transparency, risk assessment. Companies using models via API don't have the control needed to meet these requirements; neither do those running models locally with tools that produce no audit trail.

We started building EULLM from a concrete need: RAG Enterprise PRO, our document retrieval system, needed an inference runtime that was fast, controllable and compliance-ready. Ollama worked well for development, but in production we had three problems: no native audit logging, sequential request processing (one user at a time), and dependency on non-EU infrastructure for models. EULLM was born to solve these three problems.
Engine: Rust-Based Inference with Continuous Batching
The heart of EULLM is the Engine: a single Rust binary that wraps llama.cpp and adds what's missing for enterprise use. No Python, no mandatory Docker, no runtime dependencies: download the binary, run it, and it works.

The most important difference from Ollama is continuous batching. Ollama processes requests in a FIFO queue: if 16 users request a response simultaneously, the second waits for the first to finish, the third waits for the second, and so on. EULLM's Engine decodes all requests in parallel within a single GPU pass. The result: with 16 concurrent requests on an RTX 5070 Ti, EULLM reaches 259 tokens/second versus Ollama's 102. Total throughput is roughly 2.5 times higher, and crucially, all users receive streaming tokens from the start, with no queue.

The Engine exposes two API families: Ollama-compatible (same port 11434, same endpoints) and OpenAI-compatible (/v1/chat/completions). This means any application using LangChain, LlamaIndex or Open WebUI works without modifications: just change the server URL. For RAG Enterprise, migrating from Ollama to EULLM Engine was literally an environment-variable change.

Every inference request is automatically logged to ~/.eullm/audit/audit.jsonl with timestamp, request ID, model used, input/output token counts and user ID. This audit trail is exactly what's needed to demonstrate EU AI Act compliance: complete traceability of every model interaction, without adding custom middleware.
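Because the audit trail is plain JSONL, turning it into a compliance report takes a few lines of ordinary tooling. The sketch below aggregates token usage per user and model; the exact JSON key names are an assumption for illustration (the article only lists the logged fields, not their names).

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical records mirroring the fields EULLM logs (timestamp,
# request ID, model, input/output token counts, user ID). The key
# names are assumptions, not taken from EULLM's actual schema.
SAMPLE_RECORDS = [
    {"timestamp": "2026-01-15T09:30:00Z", "request_id": "r-001",
     "model": "qwen3-7b-legal-it", "input_tokens": 512,
     "output_tokens": 128, "user_id": "alice"},
    {"timestamp": "2026-01-15T09:31:10Z", "request_id": "r-002",
     "model": "qwen3-7b-legal-it", "input_tokens": 300,
     "output_tokens": 90, "user_id": "bob"},
]

def summarize_audit(path: Path) -> dict:
    """Aggregate request and token totals per (user, model) from a JSONL audit trail."""
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0})
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            key = (rec["user_id"], rec["model"])
            totals[key]["requests"] += 1
            totals[key]["tokens"] += rec["input_tokens"] + rec["output_tokens"]
    return dict(totals)

# Write a small sample log, then summarize it.
# In production the path would be ~/.eullm/audit/audit.jsonl.
log = Path("audit.jsonl")
log.write_text("\n".join(json.dumps(r) for r in SAMPLE_RECORDS) + "\n")
report = summarize_audit(log)
print(report[("alice", "qwen3-7b-legal-it")])  # {'requests': 1, 'tokens': 640}
```

The same traversal can feed per-user quotas or the periodic usage reports an auditor might request, without touching the Engine itself.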
Forge: From Generic Model to Domain Expert
A generic 14-billion-parameter model weighs about 28 GB and requires a datacenter GPU to run. For a company wanting a legal or medical assistant, it's overkill: it contains knowledge about cooking, astronomy, poetry, all useless for analyzing contracts. Forge is EULLM's verticalization pipeline: it takes a generic model and transforms it into a domain expert that runs on consumer hardware.

The process has five phases:

1. Structural pruning: identify the neurons and attention heads least important for the target domain and remove them, reducing a 14B model to 7B parameters.
2. Knowledge distillation: the original model (teacher) trains the compressed one (student) to replicate its outputs on the specific domain, recovering quality lost in pruning.
3. Quantization: compress weights from float16 to int4 (Q4_K_M format), taking the file from ~14 GB to ~4.5 GB.
4. Identity fine-tuning with LoRA: embed the desired identity (name, tone, knowledge boundaries) into the model without the cost of full fine-tuning.
5. Export to GGUF format, ready for the Engine.

We've already defined three verticalization profiles: legal-it (Italian law), medical-de (German medicine) and finance-fr (French finance). Each profile is a YAML file specifying the domain dataset, the pruning and distillation hyperparameters, and minimum quality thresholds. A company can create its own profile for any domain: all it takes is a dataset of domain-specific texts and an A100 GPU for 2-3 days.

Why not simply use a quantized generic model? Because a specialized 7B-parameter model beats a generic 14B on its domain tasks. Pruning removes noise, distillation concentrates knowledge, and identity fine-tuning ensures consistent, brandable responses. The result is a model a company can call "its own AI", not "a wrapper on Llama".
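The size reductions along the pipeline can be sanity-checked with back-of-the-envelope arithmetic. The ~4.85 bits-per-weight figure used below for Q4_K_M is our own approximation, not a number from the article; real GGUF files vary slightly with architecture and metadata.

```python
# Approximate on-disk sizes at each Forge stage:
# 14B float16 -> prune to 7B -> quantize to Q4_K_M.

def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate on-disk model size in gigabytes (decimal GB)."""
    return params * bits_per_weight / 8 / 1e9

generic_fp16 = model_size_gb(14e9, 16)   # generic 14B model, float16
pruned_fp16  = model_size_gb(7e9, 16)    # after structural pruning to 7B
quantized    = model_size_gb(7e9, 4.85)  # after Q4_K_M quantization (assumed ~4.85 bpw)

print(f"14B fp16:  {generic_fp16:.1f} GB")  # 28.0 GB
print(f"7B fp16:   {pruned_fp16:.1f} GB")   # 14.0 GB
print(f"7B Q4_K_M: {quantized:.1f} GB")     # ~4 GB
```

The numbers line up with the article's figures: ~28 GB for the generic model, ~14 GB after pruning, and a final file in the 4-4.5 GB range that fits on consumer hardware.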
Hub and Compliance: Models on European Infrastructure
The last component is the Hub: a REST registry for publishing, discovering and downloading specialized models. The difference from Hugging Face? EULLM's Hub runs exclusively on European servers (Hetzner in Germany, OVH in France), with zero telemetry to non-EU endpoints.

Every model in the registry ships with two documents: a technical model card (architecture, training dataset, benchmarks, license) and a compliance card aligned with EU Regulation 2024/1689 (the EU AI Act). The compliance card specifies risk classification, transparency measures, GDPR alignment, technical specifications, human oversight mechanisms and infrastructure details confirming EU data residency.

Why does this matter? Because the EU AI Act doesn't just ask you to "use AI responsibly": it requires specific, verifiable documentation. A company using GPT-4 via API cannot produce this documentation: they don't know how the model was trained, don't control where data runs, and have no audit trail. With EULLM, every piece of the chain is documented, local and verifiable.

One deliberate choice: EULLM excludes Meta's Llama models. The Llama license requires "Built with Llama" branding in every derivative product, which is incompatible with white-label deployments where a company wants to brand the model as its own. We use exclusively Apache 2.0 or MIT licensed models: Qwen 3, Mistral, DeepSeek, Falcon 3. No branding constraints, no commercial-use restrictions.

EULLM is open-source under the Apache 2.0 license. The code is on GitHub and the official site is eullm.eu. Contributions are welcome. For us at i3k, EULLM is the missing piece: an inference infrastructure we can offer clients knowing that data stays in Europe, the model is specialized for their domain, and compliance is built in, not an afterthought.
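To make the compliance card concrete, here is a minimal sketch of validating one before publication. The field names and card structure are hypothetical; only the required topics (risk classification, transparency, GDPR alignment, technical specifications, human oversight, EU data residency) come from the article.

```python
# Hypothetical compliance-card check: field names are illustrative,
# not EULLM Hub's actual schema.
REQUIRED_FIELDS = {
    "risk_classification",
    "transparency_measures",
    "gdpr_alignment",
    "technical_specifications",
    "human_oversight",
    "infrastructure",
}

def validate_compliance_card(card: dict) -> list[str]:
    """Return the list of missing requirements (empty list = card is complete)."""
    missing = sorted(REQUIRED_FIELDS - card.keys())
    # EU data residency is a hard requirement for the Hub.
    if "infrastructure" in card and card["infrastructure"].get("data_residency") != "EU":
        missing.append("infrastructure.data_residency=EU")
    return missing

card = {
    "risk_classification": "limited",
    "transparency_measures": "model card + training data summary",
    "gdpr_alignment": "no personal data in training set",
    "technical_specifications": {"params": "7B", "format": "GGUF"},
    "human_oversight": "human-in-the-loop review required",
    "infrastructure": {"provider": "Hetzner (DE)", "data_residency": "EU"},
}
print(validate_compliance_card(card))  # [] -> card is complete
```

A check like this could gate publication to the registry, so an incomplete or non-EU-resident card is rejected before the model ever becomes discoverable.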
Securvita S.r.l. — i3k.eu