| ▲ | perfmode 12 hours ago |
| > We believe in training our models using diverse and high-quality data. This includes data that we've licensed from publishers, curated from publicly available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot.

> We do not use our users' private personal data or user interactions when training our foundation models. Additionally, we take steps to apply filters to remove certain categories of personally identifiable information and to exclude profanity and unsafe material.

> Further, we continue to follow best practices for ethical web crawling, including following widely-adopted robots.txt protocols to allow web publishers to opt out of their content being used to train Apple's generative foundation models. Web publishers have fine-grained controls over which pages Applebot can see and how they are used while still appearing in search results within Siri and Spotlight.

Respect. |
|
| ▲ | bitpush 12 hours ago | parent | next [-] |
| When Apple inevitably partners with OpenAI or Anthropic, which by their own definition aren't doing "ethical crawling", I wonder how I should be reading that. |
| |
| ▲ | jhickok 12 hours ago | parent | next [-] | | They already partnered with OpenAI, right? | | | |
| ▲ | wmf 11 hours ago | parent | prev | next [-] | | In theory Apple could provide their training data to be used by OpenAI/Anthropic. | | |
| ▲ | bitpush 11 hours ago | parent [-] | | It isn't "Apple proprietary" data for them to give to OpenAI. The bigger problem is that you can't train a good model with a smaller dataset. The model would be subpar. |
| |
| ▲ | brookst 9 hours ago | parent | prev | next [-] | | I mean they also buy from companies with less ethical supply chain practices than their own. I don’t know that I need to feel anything about that beyond recognizing there’s a big difference between exercising good practices and refusing to deal with anyone who does less. | |
| ▲ | bigyabai 10 hours ago | parent | prev | next [-] | | "Good artists copy; great artists steal" - Famous Dead Person | |
| ▲ | napierzaza 6 hours ago | parent | prev | next [-] | | [dead] | |
| ▲ | fridder 11 hours ago | parent | prev [-] | | Same way as the other parts of their supply chain I suppose. |
|
|
| ▲ | darkoob12 2 hours ago | parent | prev | next [-] |
| You shouldn't believe Big Tech's PR statements. They are decades behind in AI. I have been following AI research for a long time: you can find top papers published by Microsoft, Google, and Facebook over the past 15 years, but not by Apple. I don't know why, but they didn't care about AI at all. I would say this is PR to justify the state of their AI. |
| |
| ▲ | ACCount36 37 minutes ago | parent [-] | | Apple used to be at the edge of AI. They shipped Siri before "AI assistant" went mainstream, they were one of the first to ship an actual NPU in consumer hardware and put neural networks into features people use. They were spearheading computational photography. They didn't publish research, they're fucking Apple, but they did do the work. And then they just... gave up? I don't know what happened to them. When the AI breakthrough happened, I expected them to put up a fight. They never did. |
|
|
| ▲ | simonw 11 hours ago | parent | prev | next [-] |
| One problem with Apple's approach here is that they were scraping the web for training data long before they published the details of their activities and told people how to exclude them using robots.txt. |
| |
| ▲ | dijit 11 hours ago | parent | next [-] | | Uncharitable. Robots.txt is already the understood mechanism for getting robots to avoid scraping a website. | | |
| ▲ | simonw 11 hours ago | parent [-] | | People often use specific user agents in there, which is hard if you don't know what the user agents are in advance! | | |
| ▲ | lxgr 7 hours ago | parent | next [-] | | That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories. Wanting to disallow LLM training (or optionally only that of closed-weight models), but encouraging search indexing or even LLM retrieval in response to user queries, seems popular enough. | |
| ▲ | 6 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | wat10000 11 hours ago | parent | prev [-] | | If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard. | | |
| ▲ | lxgr 7 hours ago | parent | next [-] | | Yes, but given the lack of generic "robot types" (e.g. "allow algorithmic search crawlers, allow archival, deny LLM training crawlers"), neither opt-in nor opt-out seems like a particularly great option in an age where new crawlers are appearing rapidly (and often, such as here, are announced only after the fact). | |
| ▲ | simonw 10 hours ago | parent | prev [-] | | Sure, but I still think it's OK to look at Apple with a raised eyebrow when they say "and our previously secret training data crawler obeys robots.txt so you can always opt out!" |
|
|
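The wildcard-vs-specific-agent point above can be sketched as a robots.txt fragment. This is an illustrative sketch: `Applebot-Extended` is the token Apple documents for opting out of foundation-model training while staying in Siri/Spotlight search, but verify current token names against each vendor's crawler docs.

```
# Allow Apple's search crawler, but opt out of AI training.
# (Applebot-Extended is Apple's documented training opt-out token.)
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /

# Wildcard fallback: any bot not matched by name above is denied.
# This is the only rule that covers crawlers announced after the fact.
User-agent: *
Disallow: /
```

Note that a bot matching a specific `User-agent` group ignores the `*` group entirely, which is exactly why naming agents only works when you know them in advance.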
| |
| ▲ | conradev 6 hours ago | parent | prev [-] | | They documented it in 2015: https://www.macrumors.com/2015/05/06/applebot-web-crawler-si... |
|
|
| ▲ | aydyn 2 hours ago | parent | prev | next [-] |
| Respect, but it's going to be terrible compared to every other company's. You can only hamstring yourself so much. |
|
| ▲ | astrange 10 hours ago | parent | prev | next [-] |
| > Using our web crawling strategy, we sourced pairs of images with corresponding alt-texts.

An issue for anti-AI people, as seen on Bluesky, is that they're often "insisting you write alt text for all images" people as well. But this is probably the main use for alt text at this point, so they're essentially doing annotation work for free. |
| |
| ▲ | simonw 10 hours ago | parent | next [-] | | I think it is entirely morally consistent to provide alt text for accessibility even if you personally dislike it being used to train AI models. | | |
| ▲ | astrange 9 hours ago | parent [-] | | It's fine if you want to, but I think they should consider that basically nobody is reading it. If it was important for society, photo apps would prompt you to embed it in the image like EXIF. Computer vision is getting good enough to generate it; it has to be, because real-world objects don't have alt text. | | |
| ▲ | simonw 9 hours ago | parent | next [-] | | I actually use Claude to generate the first draft of most of my alt text, but I still do a manual review of it because LLMs usually don't have enough context to fully understand the message I'm trying to convey with an image: https://simonwillison.net/2025/Mar/2/accessibility-and-gen-a... | | | |
| ▲ | lxgr 8 hours ago | parent | prev [-] | | Why would photo apps do what's "important for society"? Annotating photos takes time/effort, and I could totally imagine photo apps being resistant to prompting their users for that, some of which would undoubtedly find it annoying, and many more confusing. Yet I don't think that one can conclude from that that annotations aren't helpful/important to vision impaired users (at least until very recently, i.e. before the widespread availability of high quality automatic image annotations). In other words, the primary user base of photo editors isn't the set of people that would most benefit from it, which is probably why we started seeing "alt text nudging" first appear on social media, which has both producer and consumer in mind (at least more than photo editors). | | |
| ▲ | astrange 4 hours ago | parent [-] | | > Why would photo apps do what's "important for society"? One would hope they're responsive to user demands. I should say Lightroom does have an alt text field, but phone camera apps, for example, don't. Apple is genuinely obsessed with accessibility (but bad at social media) and I think has never once advocated for people to describe their photos to each other. |
|
|
| |
| ▲ | ACCount36 27 minutes ago | parent | prev | next [-] | | > Bluesky Bluesky is where insane Twitter people go when they get too insane for Twitter. | |
| ▲ | barbazoo 7 hours ago | parent | prev [-] | | > An issue for anti-AI people, as seen on Bluesky, is that they're often "insisting you write alt text for all images" people as well. But this is probably the main use for alt text at this point, so they're essentially doing annotation work for free.

How did you come to the conclusion that those two groups overlap so significantly? | | |
|
|
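The alt-text harvesting described in this subthread (pairing image URLs with their `alt` attributes) can be sketched with Python's stdlib `html.parser`. This is an illustrative sketch of the general technique, not Apple's actual pipeline:

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags, skipping images
    whose alt text is missing or empty."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            alt = attr_map.get("alt")
            if alt:  # empty alt ("") marks decorative images; skip them
                self.pairs.append((attr_map.get("src"), alt))


collector = AltTextCollector()
collector.feed(
    '<p><img src="cat.jpg" alt="A cat on a sofa">'
    '<img src="spacer.gif" alt=""></p>'
)
print(collector.pairs)  # [('cat.jpg', 'A cat on a sofa')]
```

The empty-`alt` check matters: the accessibility convention is that `alt=""` means "decorative, nothing to describe", so a crawler building caption pairs would discard those.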
| ▲ | epolanski 7 hours ago | parent | prev | next [-] |
| Respect actions, not words and PR. |
|
| ▲ | bigyabai 10 hours ago | parent | prev [-] |
| Gotta polish that fig-leaf to hide Apple's real stance towards user privacy: arstechnica.com/tech-policy/2023/12/apple-admits-to-secretly-giving-governments-push-notification-data/

> Apple has since confirmed in a statement provided to Ars that the US federal government "prohibited" the company "from sharing any information," |
| |
| ▲ | brookst 9 hours ago | parent [-] | | I mean if you throw out all contrary examples, I suppose you are left with the simple lack of nuance you want to believe | | |
| ▲ | bigyabai 9 hours ago | parent [-] | | All examples contrary to what? Admitting to being muzzled by the feds? Take all the space you need to lay out your contrary case. Did the San Bernardino shooter predict this? |
|
|