As I mentioned yesterday - I recently needed to process hundreds of low quality images of invoices (for a construction project). I had a script that had used pil/opencv, pytesseract, and open ai as a fallback. It still has a staggering number of failures.

Today I tried a handful of the really poor quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is it gave me the bounding boxes to improve tesseract.

▲

benterix 3 minutes ago | parent | next [-]

I wonder why you chose Qwen specifically - Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from 80s and 90s).

▲

iamflimflam1 2 hours ago | parent | prev | next [-]

I would recommend taking a look at this service: https://learn.microsoft.com/en-us/rest/api/computervision/re...

▲

wiz21c 3 hours ago | parent | prev | next [-]

I like to test these models on reading the contents of 80's Apple ][ games screenshots. These are very low resolution, very dense. All (free to use) models struggle on that task...

▲

unixhero 10 hours ago | parent | prev | next [-]

So where did you load up Qwen and how did you supply the pdf or photo files? I don't know how to use these models, but want to learn

▲

baby_souffle 10 hours ago | parent | next [-]

LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.

If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.

[0]: https://lmstudio.ai/

	▲	unixhero 4 hours ago \| parent \| next [-]
		Thank you! I will give it a try and see if I can get that 4090 working a bit.
	▲	4 hours ago \| parent \| prev [-]
		[deleted]

▲

Alifatisk 3 hours ago | parent | prev [-]

You can use their models here chat.qwenlm.ai, its their official website

▲

VladVladikoff 12 hours ago | parent | prev | next [-]

Interesting. I have in the past tried to get bounding boxes of property boundaries on satellite maps estimated by VLLM models but had no success. Do you have any tips on how to improve the results?

▲

Workaccount2 11 hours ago | parent | next [-]

Gemini has purpose post training for bounding boxes if you haven't tried it.

The latest update on Gemini live does real time bounding boxes on objects it's talking about, it's pretty neat.

▲

richardlblair 12 hours ago | parent | prev | next [-]

With Qwen I went as stupid as I could: please provide the bounding box metadata for pytesseract for the above image.

And it spat it out.

▲

VladVladikoff 11 hours ago | parent [-]

It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.

▲

wongarsu 11 hours ago | parent [-]

Depends on the model, but e.g. [1] found many models perform better if you are more polite. Though interestingly being rude can also sometimes improve performance at the cost of higher bias

Intuitively it makes sense. The best sources tend to be either of moderately high politeness (professional language) or 4chan-like (rude, biased but honest)

1: https://arxiv.org/pdf/2402.14531

▲

arcanemachiner 10 hours ago | parent | next [-]

When I want an LLM to be be brief, I will say things like "be brief", "don't ramble", etc.

When that fails, "shut the fuck up" always seems to do the trick.

	▲	richardlblair 9 hours ago \| parent [-]
		I ripped into cursor today. It didn't change anything but I felt better lmao

▲

entropie 9 hours ago | parent | prev [-]

Bevore GPT5 was released I already had the feeling like the webui response was declining and I started to try to get more out of the responses and dissing it and saying how useless their response was did actually improve the output (I think).

▲

mh- 12 hours ago | parent | prev [-]

Do you have some example images and the prompt you tried?

	▲	BOOSTERHIDROGEN 9 hours ago \| parent [-]
		also documented stack setup if could.

▲

netdur 11 hours ago | parent | prev [-]

I’ve tried that too, trying to detect the scan layout to get better OCR, but it didn’t really beat a fine-tuned Qwen 2.5 VLM 7B. I’d say fine-tuning is the way to go

	▲	rexreed 13 minutes ago \| parent [-]
		what fine tuning approach did you use?