Remix clone Hacker News

That's not what [1] says, though? Quoth: "As of March 1, 2023, data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt-in to share data with us, such as by providing feedback in the Playground). "

"Traditional methods (PDF parsers with OCR support) are cheaper, more reliable"

Not sure on the reliability - the ones I'm using all fail at structured data. You want a table extracted from a PDF, LLMs are your friend. (Recommendations welcome)

▲

niklasd 7 days ago | parent | next [-]

We found that for extracting tables, OpenAIs LLMs aren't great. What is working well for us is Docling (https://github.com/DS4SD/docling/)

	▲	emmanueloga_ 7 days ago \| parent \| next [-]
		Haven't seen Docling before, it looks great! Thanks for sharing.
	▲	soci 7 days ago \| parent \| prev [-]
		agreed, extracting tables in pdfs using any of the available openAI models has been a waste of prompting time here too.

▲

emmanueloga_ 7 days ago | parent | prev [-]

> That's not what [1] says, though?

Documind is using https://api.openai.com/v1/chat/completions, check the docs at the end of the long API table [1]:

> * Chat Completions:

> Image inputs via the gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention."

1: https://platform.openai.com/docs/models#how-we-use-your-data

▲

groby_b 6 days ago | parent [-]

Thanks for pointing there!

It's still not used for training, though, and the retention period is 30 days. It's... a livable compromise for some(many) use cases.

I kind of get the abuse policy reason for image inputs. It makes sense for multi-turn conversations to require a 1h audio retention, too. I'm just incredibly puzzled why schemas for structured outputs aren't eligible for zero-retention.

	▲	pconstantine 2 days ago \| parent \| next [-]
		It takes >50 seconds to generate these schemas for some pretty simple use-cases with large enums, for example. Imagine that latency added to each request...
	▲	emmanueloga_ 6 days ago \| parent \| prev [-]
		Gotcha, from what I could find online I think you are right. I was conflating data not under zero-retention-policy with data-for-training.