7d7n 20 hours ago

Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.

This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.

srik 19 hours ago | parent | next [-]

[deleted]

paxys 19 hours ago | parent | next [-]

"Personal data" that was voluntarily published on a public microblogging platform with the explicit intention to share it with the world?

danparsonson 19 hours ago | parent | next [-]

Would it make a difference if we were talking about articles on a news website? I'm kind of on the fence on this one but I can see the point of view that just posting something online doesn't necessarily grant the end user an unlimited license to use the data. Source code is another example; open-sourcing a project doesn't automatically give someone else the right to use that code in their own projects.

Does Bluesky explicitly state the license the user will be publishing under (Creative Commons or whatever), or allow them to choose one?

paxys 19 hours ago | parent [-]

> Would it make a difference if we were talking about articles on a news website?

News articles are pretty explicitly copyrighted and published for a commercial purpose. The websites make their terms clear when you visit. I don't think anyone can argue that it is legal to copy and distribute these articles, same as a book or movie or song.

Data posted on Bluesky on the other hand is meant to be broadly shared using the AT protocol. It is quite literally a feature. If you create your own Bluesky client, for example, you aren't committing copyright violation by downloading someone else's posts on there. Similarly, you aren't going against any terms of service by consuming a firehose of data from an AT relay.

danparsonson 18 hours ago | parent [-]

Right, that's why I asked about Bluesky's content license; just because it's not in your face when you visit doesn't mean you don't have to abide by it.

You understand that categories of usage are important, right? No-one is breaking the GPL by reading source code, but incorporating it into your own codebase can be problematic if not done correctly. Similarly, human beings reading the data posted by a Bluesky user is not the same as aggregating and analysing the data of thousands of users. As I said, I'm on the fence with this, but I do understand why someone might have a problem with it.

srik 19 hours ago | parent | prev [-]

[deleted]

loeber 19 hours ago | parent | next [-]

The data is public by default. You know this when you sign up and use the service. This should inform your expectations of how the data will be used.

ronsor 19 hours ago | parent | prev [-]

Don't make it public then.

srik 19 hours ago | parent [-]

[deleted]

ronsor 19 hours ago | parent | next [-]

Is it more entitled to observe public data than to willingly put data in the public and then expect to control the actions of others?

paxys 19 hours ago | parent | prev [-]

Is you reading my comment on HN also entitlement? I certainly didn't give you permission to do it. It may have some personal details that I don't want you to see. Why do you think that is okay?

19 hours ago | parent [-]

[deleted]

19 hours ago | parent | prev [-]

[deleted]

yawnxyz 19 hours ago | parent | prev [-]

I wonder how much time it takes to run this, what the script is, and how resource-intensive it is? Bsky is public, right? So do you get rate limited? Do you scrape or use an official API? So many questions.

Also, I feel like only recently there's been an influx of people who actually have interesting things to say, so I'd love to see next year's dataset.

viccis 19 hours ago | parent | next [-]

Not sure about bulk export but you can set up a full stream of all activity without even registering an account.

unshavedyak 18 hours ago | parent [-]

Blows my mind that they can send that much for free.

ks2048 19 hours ago | parent | prev [-]

I was checking out the Python API today (the "firehose" via the "atproto" package) and got 5000 posts in 7.5 seconds.
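
For anyone who wants to try the same kind of measurement, here is a rough sketch, not the exact script I ran. It assumes the `atproto` package's `FirehoseSubscribeReposClient`, `parse_subscribe_repos_message`, and `models.ComAtprotoSyncSubscribeRepos.Commit`, and it assumes that calling `client.stop()` from inside the message handler ends the stream cleanly. `PostCounter` and `run_firehose` are names made up for this example.

```python
import time
from dataclasses import dataclass, field

# Record path prefix used for posts in AT Protocol repositories.
POST_PREFIX = "app.bsky.feed.post/"


@dataclass
class PostCounter:
    """Hypothetical helper: counts newly created posts and tracks elapsed time."""
    target: int = 5000
    count: int = 0
    started: float = field(default_factory=time.monotonic)

    def add_ops(self, ops) -> bool:
        """Count 'create' ops on post records; return True once the target is reached."""
        for op in ops:
            if op.action == "create" and op.path.startswith(POST_PREFIX):
                self.count += 1
        return self.count >= self.target

    def rate(self, elapsed=None) -> float:
        """Posts per second; pass `elapsed` to reproduce a known measurement."""
        if elapsed is None:
            elapsed = time.monotonic() - self.started
        return self.count / elapsed if elapsed > 0 else 0.0


def run_firehose(target: int = 5000) -> PostCounter:
    """Connect to the public firehose and stop after `target` posts (needs network)."""
    # Imported lazily so PostCounter works without the third-party package installed.
    from atproto import (
        FirehoseSubscribeReposClient,
        models,
        parse_subscribe_repos_message,
    )

    client = FirehoseSubscribeReposClient()
    counter = PostCounter(target=target)

    def on_message(message) -> None:
        commit = parse_subscribe_repos_message(message)
        # Only commit events carry record operations; skip everything else.
        if not isinstance(commit, models.ComAtprotoSyncSubscribeRepos.Commit):
            return
        if counter.add_ops(commit.ops):
            client.stop()  # assumption: stopping from inside the handler is supported

    client.start(on_message)
    return counter
```

Calling `run_firehose()` and then `counter.rate()` gives posts per second; 5000 posts in 7.5 seconds works out to roughly 667 posts per second.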

verdverm 13 hours ago | parent [-]

I believe they are enabling (or have already enabled?) filters so you can control how much and what you actually get from the firehose.