
Forty percent of documents our clients upload to RAG Enterprise PRO are scanned PDFs, contract images, or Excel sheets with complex formatting. Without a robust extraction system, those documents would be invisible to the AI. Here's how Apache Tika and OCR solve the problem.

The Format Problem: Why a Single Parser Isn't Enough
When a company entrusts us with their document archives, everything arrives: native PDFs, scanned PDFs, DOCX with tables, XLSX with macros, PowerPoint presentations, emails in EML format, and even TXT files with mysterious encodings.

The first RAG Enterprise prototype used PyPDF2 for PDFs and python-docx for Word files. It worked for clean documents. Then the first real client arrived: a law firm handed us 15,000 documents. Thirty-five percent were PDFs scanned on 2000s-era copiers, twenty percent were DOCX files with complex tables and embedded images, and ten percent were Excel sheets with structured data that needed to be AI-readable. PyPDF2 returned empty strings on the scanned PDFs; python-docx lost the tables. We needed a solution that handled every format behind a single interface.

Apache Tika is exactly that: a universal extraction framework supporting over 1,000 file formats. We integrated it into our pipeline through tika-python, a widely used Python wrapper that communicates with a local Tika server via REST. Tika's main advantage over a patchwork of pypdf, python-docx, and openpyxl is consistency: a single endpoint, a single output format, a single error-handling layer. Our document ingestion code went from 400 lines of manual format handling to 50 lines with Tika.
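The single-interface idea can be sketched with tika-python, assuming a local Tika server is available; the `normalize` helper and `extract` wrapper are illustrative names of ours, not part of the library:

```python
try:
    from tika import parser  # tika-python wrapper; talks to a local Tika server via REST
except ImportError:          # guard so the pure normalization helper works standalone
    parser = None

def normalize(parsed: dict) -> dict:
    """Collapse Tika's raw response into one uniform shape for the pipeline."""
    return {
        "text": (parsed.get("content") or "").strip(),
        "metadata": parsed.get("metadata") or {},
        "status": parsed.get("status", 200),
    }

def extract(path: str) -> dict:
    """One call for any format: PDF, DOCX, XLSX, EML, TXT, ..."""
    if parser is None:
        raise RuntimeError("tika-python is not installed")
    return normalize(parser.from_file(path))
```

Whichever internal parser handled the file, downstream code always sees the same keys — which is exactly what replaced the per-format branching.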
OCR for Scanned Documents: Tika + Tesseract
The real challenge isn't native PDFs; those already contain text in digital form. The problem is scanned PDFs: page images with no extractable text. This is where OCR (Optical Character Recognition) comes in.

One might ask: why not use pytesseract directly? We tried. pytesseract works well on single clean images, but handling a multi-page PDF means first converting it to images (with pdf2image), then processing each page while handling rotation, resolution, and deskewing. With Tika all of this is integrated: you pass it the PDF and it internally invokes Tesseract for the pages that need it, combining OCR and native extraction in a single flow.

We configured Tika with Tesseract 5.x and language packs for Italian and English. The configuration lives in tika-config.xml, where we specify the OCR strategy: OCR is applied only to pages with no extractable text (the AUTO strategy), avoiding needless reprocessing of native pages. This halves ingestion times for mixed documents.

OCR quality depends heavily on the quality of the original scan. For the most problematic documents, we added a pre-processing layer with OpenCV: adaptive binarization, deskewing, and noise removal. The character error rate (CER) dropped from 12% to 3.8% on low-resolution scanned documents.
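That OpenCV pre-processing layer can be sketched roughly as below; the function names are ours and the parameters (block size, threshold constant, denoising strength) are illustrative placeholders, not our tuned production values:

```python
import numpy as np

try:
    import cv2  # opencv-python
except ImportError:
    cv2 = None

def deskew(binary: np.ndarray) -> np.ndarray:
    """Estimate the page's rotation from the ink pixels and rotate it back."""
    coords = np.column_stack(np.where(binary == 0))  # dark (ink) pixel coordinates
    if coords.size == 0:
        return binary
    angle = cv2.minAreaRect(coords.astype(np.float32))[-1]
    if angle > 45:  # minAreaRect reports angles in (0, 90]; fold to a small correction
        angle -= 90
    h, w = binary.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, m, (w, h),
                          flags=cv2.INTER_NEAREST, borderValue=255)

def preprocess_scan(gray: np.ndarray) -> np.ndarray:
    """Denoise -> adaptive binarization -> deskew, for low-quality scans."""
    denoised = cv2.fastNlMeansDenoising(gray, h=10)
    binary = cv2.adaptiveThreshold(denoised, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    return deskew(binary)
```

Adaptive (rather than global) thresholding is what helps most on uneven copier scans: each neighborhood gets its own binarization threshold, so shadows and faded regions don't wipe out entire lines of text.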
Integration in the RAG Pipeline: From Document to Vector
Text extraction is just the first step. The complete flow in the RAG Enterprise pipeline:

1. The document arrives via upload or API.
2. Tika extracts text and metadata (author, creation date, page count).
3. The text is cleaned and normalized (repeated headers/footers removed, whitespace normalized).
4. The text is segmented into semantic chunks with LangChain.
5. Each chunk is transformed into an embedding with sentence-transformers.
6. The vectors are indexed in Qdrant.

Tika also provides us valuable metadata used for filtering. When a user asks 'show me contracts from 2024', the system can filter by document creation date before even performing semantic search. This improves both precision and speed.

One problem we had to solve: Excel documents. Tika extracts the text from cells but loses the tabular structure. For RAG Enterprise we built a specialized parser that converts Excel tables to markdown format before passing them to chunking. The AI can then understand that 'Q1 Revenue: 1.2M' is data in a specific row, not free text.

Our production ingestion numbers: we process an average of 500 documents per hour on a server with 16 GB of RAM and 4 cores. Native PDFs are processed at 200 pages/minute, scanned PDFs at 15 pages/minute (OCR is the bottleneck). For higher loads, we implemented a Celery queue that distributes work across multiple Tika workers.
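The Excel-to-markdown idea can be sketched as follows. This is a simplified version under our own naming (`rows_to_markdown` is illustrative); in practice the rows would come from something like openpyxl's `worksheet.iter_rows(values_only=True)`:

```python
def rows_to_markdown(rows) -> str:
    """Render sheet rows (first row = header) as a markdown table so that
    chunking and embedding preserve the row/column structure."""
    cells = [["" if c is None else str(c) for c in row] for row in rows]
    if not cells:
        return ""
    header, body = cells[0], cells[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

A value like '1.2M' then reaches the chunker inside a table row whose header names the columns, so the resulting embedding carries the structural context instead of free-floating text.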
Related Services
See how we apply these technologies in our enterprise projects.
Interested?
Contact us to receive a personalized quote.
Securvita S.r.l. — i3k.eu