hdvr 5 days ago

It seems predicting the score directly (regression) is almost impossible without considering the associated domain. E.g. headlines containing the letters "GPT" from openai.com get an order of magnitude more votes than similar headlines from other sites.

PaulHoule 5 days ago

To go into more detail.

My best model was developed about two years ago and hasn't been updated. It uses bag-of-words features as input to logistic regression. I tried a lot of things, like BERT+pooling, and they didn't help. A model that only considers the domain is not as good as the bag-of-words model.
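
For concreteness, a minimal sketch of that kind of setup using scikit-learn. The toy titles, labels, and success criterion below are invented for illustration, not the actual training data or pipeline:

    # Bag-of-words over titles fed into logistic regression, predicting
    # whether a submission does well. The predict_proba output is the
    # probability referred to later in this comment.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in data: (title, did_well) -- hypothetical, for illustration only
    titles = [
        "New GPT model released by OpenAI",
        "Study finds sleep improves memory",
        "My weekend project: a toy Lisp",
        "Physicists observe rare particle decay",
    ]
    labels = [1, 0, 1, 0]

    model = make_pipeline(
        CountVectorizer(lowercase=True),          # bag-of-words features
        LogisticRegression(max_iter=1000),
    )
    model.fit(titles, labels)

    # Probability that a new title does well
    print(model.predict_proba(["GPT beats benchmark on reasoning tasks"])[0, 1])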

This kind of model reaches a plateau once it has seen about 10,000-20,000 samples, so for any domain (e.g. nytimes.com, phys.org) that has more than a few thousand submissions it would make sense to train a model just for that domain.
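
A rough sketch of that per-domain idea, assuming the same bag-of-words setup and a (domain, title, label) data layout, with a global model as the fallback for small domains. The threshold and data layout are assumptions:

    from collections import defaultdict
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    MIN_SAMPLES = 2000   # "more than a few thousand submissions"

    def make_model():
        # same bag-of-words + logistic regression setup as above
        return make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

    def fit_models(rows):
        """rows: iterable of (domain, title, label) tuples (assumed layout)."""
        by_domain = defaultdict(list)
        for domain, title, label in rows:
            by_domain[domain].append((title, label))

        # one global fallback model trained on everything
        all_pairs = [p for pairs in by_domain.values() for p in pairs]
        global_model = make_model()
        global_model.fit([t for t, _ in all_pairs], [y for _, y in all_pairs])

        # plus a dedicated model for each domain that clears the sample threshold
        domain_models = {}
        for domain, pairs in by_domain.items():
            if len(pairs) >= MIN_SAMPLES:
                m = make_model()
                m.fit([t for t, _ in pairs], [y for _, y in pairs])
                domain_models[domain] = m
        return global_model, domain_models

    def score(title, domain, global_model, domain_models):
        model = domain_models.get(domain, global_model)
        return model.predict_proba([title])[0, 1]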

YOShiNoN and I have also submitted so many articles in the last two years that it would be worth it for me personally to make a model based on our own submissions, because ultimately I'm drawing them from a different probability distribution. (I have no idea to what extent submissions behave differently depending on whether or not I submit them; I know I have both fans and haters.)

I see recommendation problems as involving two questions: "is the topic relevant?" and "is the article good quality?" The title is good for the first but very limited for the second. The domain is probably more indicative of the second, but my own concept of quality is nuanced and has a bit of "the dose makes the poison" thinking in it. For instance, I think phys.org articles draw out conclusions from a scientific paper that you might not get from a superficial read (good), but they also have obnoxious ads (bad). So I feel like I only want to post a certain fraction of those.

So far as regression goes, this is what bothers me. An article that has the potential to get 800 votes might get submitted 10 times and get

1, 50, 4, 800, 1, 200, 1, 35, 105, 8

votes, or something like that. The ultimate predictor would show me the probability distribution, but maybe that's asking too much, and all I can really expect is the mean, which is about 120 in that case. That's not a bad estimate on some level, but if I were using the L2 norm I'd get a very high loss in every case except the one where it got 105. The loss is going to be high no matter what prediction I make, so it's not like a better model can cut my loss in half; rather, a better model might reduce my loss by 0.1%, which doesn't seem like too great a victory -- though on some level it is an honest account of the fact that it's a crap shoot, a real uncertainty in the problem that will never go away. On the other hand, the logistic regression model gives a probability, which is a very direct expression of that uncertainty.
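
To put numbers on that example sequence (plain arithmetic, nothing model-specific): the mean is about 120, but the squared error of even that best constant guess is huge, and the single 800-vote run accounts for most of it.

    import statistics

    votes = [1, 50, 4, 800, 1, 200, 1, 35, 105, 8]
    mean = statistics.mean(votes)                    # 120.5, "about 120"
    sq_errors = [(v - mean) ** 2 for v in votes]
    mse = statistics.mean(sq_errors)                 # ~55,000
    outlier_share = max(sq_errors) / sum(sq_errors)  # the 800-vote run alone is ~84% of it

    print(mean, mse, outlier_share)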

hdvr 5 days ago

It's an interesting problem. If most of the votes concentrate on the first submission, I wouldn't bother including subsequent submissions in the model. However, if this is not the case (as in your example), you could actually include the past voting sequence, submission times, and domain as predictors. In your example, the 800 votes might then (ideally) correspond to a better time slot and source/domain than the submission that got a single vote.
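
A sketch of what those predictors might look like as a feature vector for the n-th submission of the same article. The field names and example values are illustrative assumptions, not a worked-out feature set:

    from datetime import datetime

    def resubmission_features(domain, past_scores, submit_time: datetime):
        return {
            "domain": domain,                               # categorical, to be one-hot encoded
            "n_prior_submissions": len(past_scores),
            "best_prior_score": max(past_scores, default=0),
            "last_prior_score": past_scores[-1] if past_scores else 0,
            "hour_utc": submit_time.hour,                   # crude time-slot feature
            "weekday": submit_time.weekday(),
        }

    # e.g. the 4th submission in the example sequence, the one that went on to get 800 votes
    print(resubmission_features("openai.com", [1, 50, 4], datetime(2024, 5, 14, 15, 0)))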