
PyTorch: The Inference Runtime Behind Our AI Systems
Every time RAG Enterprise answers a question, CRM81 classifies a ticket, or LetsAI generates an image, PyTorch runs under the hood. You don't see it, but it's what turns models into results. Here's why we chose it and how we use it in production with NVIDIA GPUs.

PyTorch vs TensorFlow: Why We Chose PyTorch
When we started building our AI stack in 2023, TensorFlow was still the most widespread production framework. But the research community had already shifted massively to PyTorch, and with it the entire modern model ecosystem. Hugging Face, our primary model source, was born around PyTorch. Nearly all state-of-the-art models — BERT, GPT, LLaMA, Stable Diffusion — are developed in PyTorch first.

The main technical reason is PyTorch's eager execution paradigm. When prototyping new pipelines for RAG Enterprise, we needed to debug tensors step by step, inspect intermediate dimensions, insert breakpoints. With PyTorch, code runs like normal Python: you use pdb, print tensors, and see exactly what's happening. With TensorFlow 1.x you had to build a static graph and then run it — a debugging nightmare. TensorFlow 2.x introduced eager execution, but the ecosystem had already migrated. 78% of academic papers on arXiv in 2025 use PyTorch. When you search for a reference implementation of a new algorithm, you find it in PyTorch. Period.

There's a practical aspect few mention: recruiting. Junior ML developers come out of universities having used PyTorch. Training someone on TensorFlow requires extra time we can't afford.
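To make the eager-execution point concrete, here is a minimal sketch of the kind of step-by-step debugging described above. The two-layer toy pipeline is purely illustrative (it stands in for an embedding step, not our production code):

```python
import torch
import torch.nn as nn

# Toy two-stage pipeline; layer sizes are illustrative only.
encoder = nn.Linear(16, 8)
scorer = nn.Linear(8, 1)

x = torch.randn(4, 16)           # a batch of 4 "documents"
h = encoder(x)
print(h.shape)                   # inspect intermediate dimensions directly
# import pdb; pdb.set_trace()    # or drop into the debugger at any point
scores = scorer(torch.relu(h))
print(scores.shape)
```

Because every line executes immediately, there is no separate graph-compilation step between writing a tensor operation and seeing its result.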
CUDA Acceleration: On-Premise GPUs for Our Clients
One of the reasons our enterprise clients choose RAG Enterprise PRO is on-premise deployment: data never leaves the company. This means inference must run on local hardware, and for acceptable response times you need NVIDIA GPUs.

PyTorch makes the switch from CPU to GPU almost trivial. The code is the same; only the device where tensors are allocated changes. In practice, our inference code has a single configuration variable: DEVICE=cuda:0 or DEVICE=cpu. The rest of the pipeline stays the same.

The numbers are impressive. On our standard benchmark with 50,000 documents and the BGE-M3 model, embedding an average document takes 180 ms on CPU (Xeon E5-2680) and 12 ms on GPU (RTX 4090): a 15x speedup that makes the difference between a usable system and a frustrating one. For clients with tighter budgets, we offer a configuration with an RTX 3060 (12 GB VRAM) that's sufficient for loads of up to 30 concurrent users. For larger enterprise clients, we use servers with 2x A100 and PyTorch's DataParallel to distribute the load.

A critical lesson we learned concerns VRAM management. PyTorch's caching allocator doesn't automatically return GPU memory to the driver after inference. We implemented a custom memory pool with torch.cuda.empty_cache() and batch processing to avoid out-of-memory (OOM) errors under heavy load.
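The device switch and the batching pattern above can be sketched as follows. This is a simplified illustration, not our production memory pool; the DEVICE environment variable and the tiny model are stand-ins:

```python
import os
import torch

# Single configuration variable, as described above (illustrative name).
DEVICE = os.environ.get("DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(16, 8).to(DEVICE).eval()

def embed_batched(inputs: torch.Tensor, batch_size: int = 32) -> torch.Tensor:
    """Run inference in fixed-size batches to bound peak VRAM usage."""
    outputs = []
    with torch.no_grad():
        for i in range(0, len(inputs), batch_size):
            batch = inputs[i:i + batch_size].to(DEVICE)
            outputs.append(model(batch).cpu())  # move results off the GPU
    if DEVICE.startswith("cuda"):
        torch.cuda.empty_cache()  # return cached blocks to the driver
    return torch.cat(outputs)

emb = embed_batched(torch.randn(100, 16))
print(emb.shape)
```

Note that empty_cache() doesn't free tensors that are still referenced; it only releases blocks the caching allocator is holding in reserve, which is why moving results to CPU before calling it matters.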
ONNX Export: Optimizing Models for Production
PyTorch is excellent for development and training, but in production every millisecond counts. That's why, once a model is validated, we export it to ONNX (Open Neural Network Exchange) format using torch.onnx.export(). The ONNX model can then be optimized with ONNX Runtime, which applies operator fusion, quantization, and hardware-specific optimizations.

The results are consistent. Our classification model for CRM81 has an average latency of 45 ms per inference in native PyTorch. After ONNX export with O2 optimizations, latency drops to 18 ms: a 60% improvement with zero accuracy loss. For a system classifying hundreds of tickets per minute, this difference is significant.

ONNX export also gives us deployment flexibility. A client wanted the classification model on a Windows server without a GPU. With native PyTorch we would have needed to install the entire CUDA stack; with ONNX Runtime, 50 MB of dependencies is enough and the model runs on CPU with acceptable performance.

For RAG Enterprise, we don't export the embedding model to ONNX because sentence-transformers already handles optimization internally. But for custom models we train in-house, ONNX export has become a standard step in our CI/CD pipeline: train with PyTorch, validate, export to ONNX, benchmark, deploy.

A lesson learned: not all models export cleanly to ONNX. Models with complex conditional logic or dynamic operations may require code changes. We recommend testing ONNX export early in the development cycle, not at the end.
Securvita S.r.l. — i3k.eu