litenboll 4 hours ago

Great idea, simple and effective. Tiny bit of feedback: seems like some listings use "unit count" for the number of balls, look at the most expensive listing for an example. Annoyingly the second most expensive balls have the number of dozens in the unit count instead.

rockdiesel 4 hours ago | parent [-]

Yeah. I'm trying to figure out how to combat these inconsistencies. Right now, I have some manual overrides, but not sure it's sustainable to keep manually overriding inconsistent listings.

Any thoughts? Should I default to what's in the product title instead of the unit count? Not sure the best way to combat this.

Propelloni 4 hours ago | parent | next [-]

Maybe you could build a heuristic around shipping weight? A single golf ball weighs about 45 to 50 g, so divide the shipping weight by, say, 50 g to account for boxing and so on and you get a rough estimate of the balls in the package.
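A minimal sketch of that weight heuristic, assuming the shipping weight is available in grams. The 50 g per-ball allowance comes from the comment above; the function names and the list of common pack sizes are made up for illustration and would need tuning against real listings.

```python
# Rough ball-count estimate from shipping weight.
# Assumption: ~50 g per ball, which leaves some slack for boxes and sleeves.
PER_BALL_G = 50.0

# Hypothetical list of pack sizes that listings tend to come in.
COMMON_PACK_SIZES = (1, 3, 6, 12, 15, 24, 36, 48)


def estimate_ball_count(shipping_weight_g: float, per_ball_g: float = PER_BALL_G) -> int:
    """Divide the shipping weight by a per-ball allowance and round."""
    if shipping_weight_g <= 0 or per_ball_g <= 0:
        raise ValueError("weights must be positive")
    return round(shipping_weight_g / per_ball_g)


def nearest_pack_size(estimate: int, pack_sizes=COMMON_PACK_SIZES) -> int:
    """Snap a noisy estimate to the closest plausible pack size."""
    return min(pack_sizes, key=lambda size: abs(size - estimate))
```

Snapping to a pack size is optional, but it keeps a slightly heavy box from turning a dozen into 13 balls.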

rockdiesel 4 hours ago | parent [-]

Oh wow, that's an interesting approach. That would've never crossed my mind without posting this on HN. Appreciate the suggestion.

fultonn 3 hours ago | parent | prev | next [-]

What I've done for a similar script in the past:

    answer_initial = llm(prompt=prompt, site=site) # JSON with answer and any stuff needed to do heuristic checks.
    heuristic_results = heuristics(answer_initial) # rule based.
    answer_final = llm(prompt=prompt, site=site, answer=answer_initial)
    mark_for_review = ... # basically just a bunch of hard-coded checks I add to flag possible failures for review.

You can use an extremely small/cheap model for something like this -- granite 4.0 micro works fine for me, 3.3 8b did as well, both run on my macbook. YMMV / try different models and see how it goes.

datsci_est_2015 3 hours ago | parent | prev | next [-]

The funny thing is, if your method becomes the dominant way of price discovery, a bad actor will simply try to circumvent the system to get their product ordered first, and you’ll be embattled in a Cold War.

See also: toilet paper sheet count comparisons.

tonygrue 4 hours ago | parent | prev | next [-]

You could make a list of all the metadata and pass it through an LLM to determine the quantity. You’ll need some sanity checking but if you prompt it with some examples it will do well. (Done something very similar myself.)

hluska 4 hours ago | parent | prev [-]

I’m not the person you replied to but I took a look at the data and this is an interesting one. You found a really cool data set and this will be fun.

Consider the top four most expensive golf balls on your current list:

TaylorMade 2021 TP5x (3+1 Box) 4DZ Golf Ball Pack, White — uses 4DZ in title, 48.0 in unit count in product specs.

Bridgestone Golf Tour B RXS Quadfecta - nothing in the title, unit count in product specs is 4.0. This one shows 4 dozen in a different spot than other balls.

TaylorMade Golf 2024 TP5 Golf Balls 3+1 Box Four Dozen — Four dozen in the title, unit count in product specs is 1.0 but it has 4.0 dozen in the same div as the Bridgestone balls.

Srixon Z Star Yellow Golf Balls - Buy 2 DZ Get 1 DZ Free — Title shows buy 2 DZ get 1 free. That’s represented as 2+1 or 3+1 in other data. In product specs it shows a unit count of 1.0.

In that extremely limited sample, product weight is a pretty good signal that the unit count is flawed, though it only works in comparison to other listings. I wonder if you could do a multi-pass approach: sort the data first, then do a unit count versus weight check to find outliers, and then start rocking through the titles? You'll still end up digging through a lot of edge cases, and that won't be much fun, but a multi-pass approach would at least give you some insight into those weird edge cases.
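One way the multi-pass idea could look, as a rough sketch only. The `Listing` fields, the 25% tolerance, and the 50 g per-ball figure are assumptions, and the title patterns cover only the four example titles quoted above, not the full data set.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Spelled-out numbers we bother to recognize in titles (illustrative, not exhaustive).
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}


@dataclass
class Listing:
    title: str
    unit_count: float  # value from the product specs
    weight_g: float    # product or shipping weight in grams


def dozens_from_title(title: str) -> Optional[int]:
    """Return the number of dozens a title mentions, or None if it names none."""
    text = title.lower()
    # "Buy 2 DZ Get 1 DZ Free" means 2 + 1 dozen in the box.
    promo = re.search(r"buy\s+(\d+)\s*(?:dz|dozen)\s+get\s+(\d+)", text)
    if promo:
        return int(promo.group(1)) + int(promo.group(2))
    # "4DZ" or "4 dozen"
    digits = re.search(r"(\d+)\s*(?:dz|dozen)\b", text)
    if digits:
        return int(digits.group(1))
    # "Four Dozen"
    words = re.search(r"\b(" + "|".join(WORD_NUMBERS) + r")\s+dozen\b", text)
    if words:
        return WORD_NUMBERS[words.group(1)]
    return None


def flag_outliers(
    listings: List[Listing], per_ball_g: float = 50.0, tolerance: float = 0.25
) -> List[Tuple[Listing, Optional[int]]]:
    """Flag listings whose unit count disagrees with the weight-based estimate.

    Each flagged listing is paired with a ball count suggested by its title,
    or None when the title gives no hint and a human has to look.
    """
    flagged = []
    for item in listings:
        by_weight = item.weight_g / per_ball_g
        if by_weight <= 0:
            continue  # no usable weight, nothing to compare against
        if abs(item.unit_count - by_weight) > tolerance * by_weight:
            dozens = dozens_from_title(item.title)
            flagged.append((item, dozens * 12 if dozens is not None else None))
    return flagged
```

Listings that get flagged with no title hint, like the Bridgestone one, are the ones left over for manual review or a later pass over the rest of the page.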

rockdiesel 3 hours ago | parent [-]

I appreciate you taking a look. This product weight approach has me intrigued, and it's something I'll look into.

I'm thinking I could just start with any listing where unit count = 1 and take a pass at those first. I haven't looked yet, but I'm guessing single unit counts are almost always inconsistent with the actual number of golf balls.
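A small sketch of that triage order, assuming each listing is a dict with hypothetical `unit_count` and `price` keys. Whether single unit counts really are the least reliable is still only a guess at this point, so this just decides what to look at first.

```python
def review_order(listings):
    """Put listings with unit_count == 1 first, most expensive first within each group."""
    return sorted(listings, key=lambda item: (item["unit_count"] != 1, -item["price"]))
```

Sorting by price inside each group is an arbitrary choice; it just surfaces the listings where a wrong count distorts the per-ball price the most.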