raw_anon_1111 7 hours ago

From what I have found, text -> structured text works well. I do a lot of call-center-based projects where I need to extract intents (which API I need to call to fulfill the user’s request) and slots (the variable parts of the message, like addresses).

Even Amazon’s cheapest and fastest model, Nova Lite, does that well.
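
To make the intent/slot idea concrete, here is a rough sketch of how the model's JSON reply could be checked before calling an API. The intent names, the required slots, and the JSON shape are all made up for illustration; they are not from any real project or framework.

```python
import json
from dataclasses import dataclass, field

# Hypothetical intents and the slots each one requires (illustration only).
INTENT_SLOTS = {
    "update_address": {"address"},
    "check_order_status": {"order_id"},
}


@dataclass
class ParsedRequest:
    intent: str
    slots: dict = field(default_factory=dict)


def parse_model_output(raw: str) -> ParsedRequest:
    """Validate a model reply assumed to look like {"intent": ..., "slots": {...}}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    intent = data.get("intent")
    if intent not in INTENT_SLOTS:
        raise ValueError(f"unknown intent: {intent!r}")

    slots = data.get("slots") or {}
    if not isinstance(slots, dict):
        raise ValueError("slots must be a JSON object")

    # Reject replies that are missing a slot the intent needs.
    missing = INTENT_SLOTS[intent] - set(slots)
    if missing:
        raise ValueError(f"missing slots for {intent}: {sorted(missing)}")

    return ParsedRequest(intent=intent, slots=dict(slots))
```

Rejecting unknown intents and missing slots up front means a bad model reply fails loudly instead of turning into a malformed API call.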

But even without using his framework, he did give me an obvious-in-hindsight method of handling image understanding.

I should have used a more advanced model to describe the image as free text and then used a cheap model to convert text to JSON.
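
Roughly, the two-stage approach looks like the sketch below. The two model calls are passed in as plain callables because the actual client code depends on the provider; `describe` and `structure` are hypothetical names, not functions from any SDK.

```python
import json
from typing import Callable


def image_to_json(
    image_bytes: bytes,
    describe: Callable[[bytes], str],   # stand-in for the stronger vision model
    structure: Callable[[str], str],    # stand-in for the cheap text -> JSON model
) -> dict:
    """Stage 1: image -> free text. Stage 2: free text -> JSON object."""
    description = describe(image_bytes)
    if not description.strip():
        raise ValueError("vision model returned an empty description")

    result = json.loads(structure(description))
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object from the structuring step")
    return result


# Usage with stub callables standing in for the real model calls.
fake_describe = lambda img: "A receipt from a coffee shop totaling 4.50 USD."
fake_structure = lambda text: '{"merchant_type": "coffee shop", "total": 4.5}'
record = image_to_json(b"...image bytes...", fake_describe, fake_structure)
```

Keeping the stages separate also makes it easy to log the intermediate description, which helps when figuring out whether a bad result came from the vision step or the JSON step.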

I also had the problem that my process hallucinated that it understood the “image” contained in a Mac .DS_Store file.
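
One way to guard against that, as a sketch: check the file's leading bytes against known image signatures before sending anything to a model. The function name and the set of formats covered here are my own choices for illustration, and the list is not exhaustive.

```python
from typing import Optional

# Well-known file signatures ("magic bytes") for a few common image formats.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a format name if the bytes start like a known image, else None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def should_send_to_model(data: bytes) -> bool:
    # Skip anything that does not look like an image, e.g. a stray .DS_Store.
    return sniff_image_type(data) is not None
```

Filtering by file name (skipping dotfiles) would also catch .DS_Store, but checking the content covers other non-image files that end up in the same folder.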