| ▲ | polishdude20 2 days ago | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Hey in really interested in your pipeline techniques. I've got some pdfs I need to get processed but processing them in the cloud with big providers requires redaction. Wondering if a local model or a self hosted one would work just as well. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | evilelectron 2 days ago | parent | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I run llama.cpp with Qwen3-VL-8B-Instruct-Q4_K_S.gguf with mmproj-F16.gguf for OCR and translation. I also run llama.cpp with Qwen3-Embedding-0.6B-GGUF for embeddings. Drupal 11 with ai_provider_ollama and custom provider ai_provider_llama (heavily derived from ai_provider_ollama) with PostreSQL and pgvector. People on site scan the documents and upload them for archival. The directory monitor looks for new files in the archive directories and once a new file is available, it is uploaded to Drupal. Once a new content is created in Drupal, Drupal triggers the translation and embedding process through llama.cpp. Qwen3-VL-8B is also used for chat and RAG. Client is familiar with Drupal and CMS in general and wanted to stay in a similar environment. If you are starting new I would recommend looking at docling. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | chrisweekly 2 days ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Disclaimer: I'm an AI novice relative to many here. FWIW last wknd I spent a couple hours setting up self-hosted n8n with ollama and gemma3:4b [EDIT: not Qwen-3.5], using PDF content extraction for my PoC. 100% local workflow, no runtime dependency on cloud providers. I doubt it'd scale very well (macbook air m4, measly 16GB RAM), but it works as intended. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | tehologist a day ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Python pdftools to convert to images and tesseract to ocr them to text files. Fast free and can run on CPU. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | jorl17 2 days ago | parent | prev [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Seconded, would also love to hear your story if you would be willing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||