PaulHoule 5 days ago

Re: “HN is very fickle”

I have a model that, given a headline, predicts whether the story will get >10 votes. It’s a terrible model, for a few reasons. The most fundamental is that if the same article were submitted 10 times it could get wildly different scores; that’s just the way it goes. The tail end of the model [1] is logistic regression because it deals gracefully with this kind of situation. I wish I knew how to treat this as a regression problem (predict the score); there is probably a better loss function than the one I use, but when I treat it as a regression problem I get an even worse model.
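
One candidate for a count-valued target would be Poisson deviance, which models the log of the expected vote count. A minimal sketch of that framing, with placeholder features and data rather than my actual setup:

    # Sketch: Poisson regression as a candidate loss for vote counts.
    # X and y are placeholders; the real model's features would go here.
    import numpy as np
    from sklearn.linear_model import PoissonRegressor

    rng = np.random.default_rng(0)
    X = rng.random((100, 8))            # stand-in headline features
    y = rng.poisson(5.0, size=100)      # stand-in vote counts

    reg = PoissonRegressor(alpha=1e-3).fit(X, y)
    expected_votes = reg.predict(X)     # predicted mean vote count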

The highest score this model ever gives is 70% for something like “Richard Stallman is dead”.

I have another model that predicts whether the comment/score ratio is > 0.5, which is about the average for the site. This is a much better model, close to the first recommender models I made. Trained on articles with score > 10, the input is less noisy, for one thing. It’s how I learned y’all like to talk about cars.
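
The label is roughly this (a sketch using the HN API’s field names):

    # Sketch: label = does the story draw more discussion than average?
    def high_discussion(story: dict) -> bool:
        # "descendants" is the HN API's total comment count for a story
        return story["descendants"] / max(story["score"], 1) > 0.5

    high_discussion({"score": 40, "descendants": 35})  # True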

[1] what attention folks call the “head”

mooreds 5 days ago | parent | next [-]

> I have a model that, given a headline, predicts whether the story will get >10 votes.

Do you incorporate post time into this model?

This is pure anecdata, but I've found that certain posting times lead to more upvotes for what feel like the same kinds of stories.

hunter2_ 5 days ago | parent [-]

There is surely a combination of location and sleep schedule contributing to this, and each has cultural implications: maybe British users gravitate toward certain topics, maybe the best developers tend to be night owls, etc. -- and then you've got folks who use these sites while they work, while they commute, while they fall asleep...

It would be interesting to see some sort of personas-over-a-day graph.

PaulHoule 5 days ago | parent | next [-]

I've thought about it. I think of the people who are active in the 11pm-8am EST window, when I'm not active, as "the night shift," and I imagine that they're mainly geographically different from me. I imagine it skews towards Asians and Europeans.

Submission time is not such a good indicator of who interacts with a post, because a post could be active for 12-24 hours, and considering most people are awake 16 hours a day, you're going to get people from all time zones. It's probably fair to say that comments written around 2am EST were not likely written by North Americans, and the same is true for submissions.

I've done some experiments that involved looking at submissions on individual days of the week, or at dayparts (say the 5am-6am EST slot), or at daypart+day-of-week, and never felt I got a much better model as a result.
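
The time features for those experiments looked roughly like this (an illustrative sketch, not the exact code):

    # Sketch: daypart / day-of-week features from a submission timestamp.
    from datetime import datetime, timezone, timedelta

    EST = timezone(timedelta(hours=-5))

    def time_features(unix_ts: int) -> dict:
        t = datetime.fromtimestamp(unix_ts, tz=EST)
        return {
            f"hour_{t.hour}": 1,                         # daypart, e.g. the 5am-6am slot
            f"dow_{t.strftime('%a')}": 1,                # day of week
            f"dow_hour_{t.strftime('%a')}_{t.hour}": 1,  # interaction term
        }

    # Feed these through sklearn's DictVectorizer alongside the title features.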

mooreds 5 days ago | parent [-]

> Submission time is not such a good indicator of who interacts with a post, because a post could be active for 12-24 hours

Posts that make it to the front page are good for 12-24 hours, I agree.

But if a post gets less than 4 upvotes in the first 15ish minutes, my experience is that 99% of the time it won't get to the front page.

PaulHoule 5 days ago | parent [-]

Every few days I submit one that doesn’t get a lot of traction, but I come back in 2-3 days and it is on the home page. I think these are responsible for maybe 1/3 of my karma but well under 5% of my actual submissions. I dunno if these got resubmitted but still attributed to me, or if it is part of dang’s “a good submission got missed” program.

mooreds 5 days ago | parent [-]

My guess is that those links are added to the second chance pool. https://news.ycombinator.com/pool

But the ways of HN are deep and mysterious, so that's just a guess.

This is the best guide I've found: https://github.com/minimaxir/hacker-news-undocumented

mooreds 5 days ago | parent | prev [-]

That's a good point, that different types of users gravitate toward certain types of content. Certainly passes the sniff test.

Would be interesting to see if you could back things out.

I also think there's an effect based on just how fast the new page turns over. I sometimes post in the early morning (US MT) and stuff can hang on there for a while (an hour or two). By mid-day, it's more like a 30-minute lifetime on that page.

hdvr 5 days ago | parent | prev | next [-]

It seems predicting the score directly (regression) is almost impossible without considering the associated domain. E.g., headlines with the letters "GPT" in them from openai.com get an order of magnitude more votes than similar headlines from other sites.

PaulHoule 5 days ago | parent [-]

To go into more detail:

My best model was developed about two years ago and hasn't been updated. It uses bag-of-words features as input to a logistic regression. I tried a lot of things, like BERT+pooling, and they didn't help. A model that only considers the domain is not as good as the bag-of-words.
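
In scikit-learn terms, the shape of it is something like this (a sketch; the hyperparameters are stand-ins, not my real settings):

    # Sketch: bag-of-words features feeding a logistic regression "head".
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    model = make_pipeline(
        CountVectorizer(min_df=5),          # bag-of-words over titles
        LogisticRegression(max_iter=1000),  # predicts P(score > 10)
    )
    # model.fit(titles, [s > 10 for s in scores])
    # p = model.predict_proba(["Richard Stallman is dead"])[:, 1]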

This kind of model reaches a plateau when it has seen about 10,000-20,000 samples, so for any domain (e.g. nytimes.com, phys.org) that has more than a few thousand submissions it would make sense to train a model just for that domain.

YOShiNoN and I have also submitted so many articles in the last two years that it would be worth it for me personally to make a model based on our own submissions, because ultimately I'm drawing them from a different probability distribution. (I have no idea to what extent submissions behave differently depending on whether or not I submit them; I know I have both fans and haters.)

I see recommendation problems as involving two questions: "is the topic relevant?" and "is the article good quality?" The title is good for the first but very limited for the second. The domain is probably more indicative of the second, but my own concept of quality is nuanced and has a bit of "dose makes the poison" thinking in it. For instance, I think phys.org articles draw out a conclusion in a scientific paper that you might not get from a superficial read (good), but they also have obnoxious ads (bad). So I feel like I only want to post a certain fraction of those.

So far as regression goes, this is what bothers me. An article that has the potential to get 800 votes might get submitted 10 times and get

1, 50, 4, 800, 1, 200, 1, 35, 105, 8

votes or something like that. The ultimate predictor would show me the probability distribution, but maybe that's asking too much, and all I can really expect is the mean, which is about 120 in that case. That's not a bad estimate on some level, but if I was using the L2 norm I'd get a very high loss on every submission except the one that got 105. The loss is going to be high no matter what prediction I make, so it's not like a better model can cut my loss in half; rather, a better model might reduce my loss by 0.1%, which doesn't seem like much of a victory -- though on some level it's an honest account of the fact that it's a crap shoot, a real uncertainty in the problem that will never go away. The logistic regression model, on the other hand, gives a probability, which is a very direct expression of that uncertainty.
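
Concretely, with the sequence above:

    # Even the loss-minimizing constant prediction (the mean) is way off
    # for almost every individual submission.
    votes = [1, 50, 4, 800, 1, 200, 1, 35, 105, 8]
    mean = sum(votes) / len(votes)                          # 120.5
    mse = sum((v - mean) ** 2 for v in votes) / len(votes)  # ~54,963
    print(f"RMSE = {mse ** 0.5:.0f}")                       # RMSE = 234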

hdvr 5 days ago | parent [-]

It's an interesting problem. If most of the votes concentrate on the first submission, I wouldn't bother including subsequent submissions in the model. However, if this is not the case (as in your example), you could actually include the past voting sequence, submission times, and domain as predictors. In your example, the 800 votes might then (ideally) correspond to a better time slot and source/domain than the first single vote.
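
Something like this, say (the feature names are just illustrative):

    # Sketch: predictors for the next submission of the same article,
    # built from domain, time slot, and the votes of prior submissions.
    def resubmission_features(domain: str, hour_est: int,
                              prior_votes: list) -> dict:
        return {
            f"domain_{domain}": 1,
            f"hour_{hour_est}": 1,
            "n_prior": len(prior_votes),
            "best_prior": max(prior_votes, default=0),
        }

    resubmission_features("openai.com", 9, [1, 50, 4])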

cantor_S_drug 5 days ago | parent | prev [-]

[flagged]