> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but what the autoencoder outputs is nothing like their examples with Claude. Gemma is similarly bad.

fredericoluz 4 hours ago | parent | next [-]

it seems that the examples they showed off with haiku work. i'd guess llama is just too bad

fredericoluz 4 hours ago | parent | prev [-]

same. i'm trying to trigger the 'mom is in the next room' russian thing but the model thinks the sentence is from american reddit.

zozbot234 2 hours ago | parent [-]

AIUI the paper's examples are from a version of Claude, not Llama? The thinking process is going to be extremely model-specific.