rockdiesel 4 hours ago
Yeah. I'm trying to figure out how to combat these inconsistencies. Right now I have some manual overrides, but I'm not sure it's sustainable to keep manually overriding inconsistent listings. Any thoughts? Should I default to what's in the product title instead of the unit count? I'm not sure of the best way to handle this.
Propelloni 4 hours ago
Maybe you could build a heuristic around shipping weight? A single golf ball weighs about 45 to 50 g, so if you divide the shipping weight by, say, 50 g (to allow for boxing and so on), you get a rough estimate of the number of balls in the package.
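A minimal sketch of that weight heuristic, assuming the shipping weight is already available in grams. The function names and the list of common pack sizes are hypothetical, chosen only for illustration; the 50 g per ball figure is the padded estimate from the comment above.

```python
def estimate_ball_count(shipping_weight_g: float, grams_per_ball: float = 50.0) -> int:
    """Rough ball count: shipping weight divided by a padded per-ball weight."""
    if shipping_weight_g <= 0 or grams_per_ball <= 0:
        raise ValueError("weights must be positive")
    return round(shipping_weight_g / grams_per_ball)


def snap_to_pack_size(estimate: int, pack_sizes=(3, 6, 12, 15, 18, 24, 36, 48)) -> int:
    """Snap a rough estimate to the nearest pack size (the sizes are an assumption)."""
    return min(pack_sizes, key=lambda size: abs(size - estimate))
```

Snapping to a known pack size absorbs some of the noise from packaging weight, but the estimate is only as good as the weight data in the listing.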
fultonn 3 hours ago
What I've done for a similar script in the past:
You can use an extremely small/cheap model for something like this. granite 4.0 micro works fine for me, and 3.3 8b did as well; both run on my MacBook. YMMV, so try different models and see how it goes.
datsci_est_2015 3 hours ago
The funny thing is, if your method becomes the dominant way of price discovery, a bad actor will simply try to game the system to get their product ranked first, and you’ll be embroiled in a cold war. See also: toilet paper sheet count comparisons.
tonygrue 4 hours ago
You could make a list of all the metadata and pass it through an LLM to determine the quantity. You’ll need some sanity checking, but if you prompt it with some examples it will do well. (Done something very similar myself.)
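A rough sketch of the prompt-plus-sanity-check idea, with no model call included, since the thread does not say which model or API is used. The function names, the `max_balls` bound, and the metadata keys are hypothetical. The two few-shot examples reuse listing titles quoted elsewhere in this thread, and the "36" answer assumes "Buy 2 DZ Get 1 DZ Free" means three dozen in the box.

```python
import json
import re
from typing import Optional

# Few-shot examples: (listing metadata, expected ball count). Answers are assumptions.
FEW_SHOT_EXAMPLES = [
    ({"title": "TaylorMade 2021 TP5x (3+1 Box) 4DZ Golf Ball Pack, White",
      "unit_count": "48.0"}, 48),
    ({"title": "Srixon Z Star Yellow Golf Balls - Buy 2 DZ Get 1 DZ Free",
      "unit_count": "1.0"}, 36),
]


def build_prompt(metadata: dict) -> str:
    """Build a few-shot prompt asking for a single integer ball count."""
    lines = ["How many golf balls are in this listing? Answer with a single integer."]
    for example, answer in FEW_SHOT_EXAMPLES:
        lines.append(f"Listing: {json.dumps(example)}")
        lines.append(f"Answer: {answer}")
    lines.append(f"Listing: {json.dumps(metadata)}")
    lines.append("Answer:")
    return "\n".join(lines)


def parse_and_check(reply: str, max_balls: int = 144) -> Optional[int]:
    """Sanity check: take the first integer in the reply, reject implausible counts."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    count = int(match.group())
    return count if 1 <= count <= max_balls else None
```

Anything that comes back as `None` can fall through to a manual override, so the model only replaces the easy cases.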
hluska 4 hours ago
I’m not the person you replied to but I took a look at the data and this is an interesting one. You found a really cool data set and this will be fun. Consider the top four most expensive golf balls on your current list:

TaylorMade 2021 TP5x (3+1 Box) 4DZ Golf Ball Pack, White: uses 4DZ in title, 48.0 in unit count in product specs.

Bridgestone Golf Tour B RXS Quadfecta: nothing in the title, unit count in product specs is 4.0. This one shows 4 dozen in a different spot than other balls.

TaylorMade Golf 2024 TP5 Golf Balls 3+1 Box Four Dozen: Four dozen in the title, unit count in product specs is 1.0 but it has 4.0 dozen in the same div as the Bridgestone balls.

Srixon Z Star Yellow Golf Balls - Buy 2 DZ Get 1 DZ Free: Title shows buy 2 DZ get 1 free. That’s represented as 2+1 or 3+1 in other data. In product specs it shows a unit count of 1.0.

In that extremely limited sample, the product weight is a pretty good metric to show that the unit count is flawed, though that only works in comparison to others. I wonder if you could do a multi-pass approach, where you sort the data first, then do a unit count versus weight check to find outliers, and then start rocking through the titles? You’ll still end up digging through a lot of edge cases and that won’t be much fun, but a multi-pass would at least give you some insight into those weird edge cases.
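One way the multi-pass idea might look, as a sketch only. The record keys (`title`, `unit_count`, `weight_g`), the 50 g per ball figure, and the 50% tolerance are all assumptions, and the title parser is deliberately crude: it sums every "N dozen" or "N DZ" mention, which suits "Buy 2 DZ Get 1 DZ Free" but would double count a title that states the total and the breakdown.

```python
import re

WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
DOZEN_RE = re.compile(
    r"\b(\d+|one|two|three|four|five|six)\s*(?:dozen|doz|dz)\b", re.IGNORECASE
)


def balls_from_title(title):
    """Sum all dozen mentions in the title; None if the title says nothing."""
    matches = DOZEN_RE.findall(title)
    if not matches:
        return None
    dozens = sum(int(m) if m.isdigit() else WORD_NUMBERS[m.lower()] for m in matches)
    return dozens * 12


def flag_outliers(listings, grams_per_ball=50.0, tolerance=0.5):
    """Pass 1: keep listings whose unit count disagrees with the weight-implied count."""
    flagged = []
    for item in listings:
        implied = item["weight_g"] / grams_per_ball
        if abs(item["unit_count"] - implied) > tolerance * implied:
            flagged.append(item)
    return flagged


def resolve_counts(listings):
    """Pass 2: for flagged listings only, try to recover the count from the title."""
    return {item["title"]: balls_from_title(item["title"])
            for item in flag_outliers(listings)}
```

Listings that come back as `None` from the second pass (the Bridgestone title, for instance) are the ones left for manual overrides or a later pass over other page fields.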