nickpsecurity 2 days ago

I enjoyed Robert Heinlein's work. I'd probably keep it in my training set if copyright allowed.

What I might drop are the many articles with little content that strictly reiterate racist and sexist claims from intersectionality. They embed various narratives, like how black people had less of X, in so many news reports. It is usually jarring, too, since the story isn't even about that. They keep forcing certain topics and talking points into everything, hoping people will believe and repeat them if they hear them enough. The right-wing people do this on some topics, too.

I'd let most things people wrote, even some political works on many topics, into the training set. The political samples would usually be the best examples of those ideologies, like Adam Smith or Karl Marx. Pages where they force those redundant political narratives into non-political articles would get deleted. If possible, I'd just delete the sections containing the random tangent. For political news, I'd try to include a curated sample with roughly equal amounts of left and right reports, with some independents thrown in.

So, only manipulative content that constantly repeats the same things would get suppressed. Maybe highly debated topics, too, so that I would include only a small number of exemplars. Then, I'd reduce the domination of certain groups in whatever politics remained. Then, I'd align the model to be honest and polite but with no specific politics.

I'm very curious what a GPT-3-level AI would say about many topics if trained that way instead of with Progressive-heavy training like OpenAI's, etc.