oersted 4 days ago
Impressive work, but I'm confused on a number of fronts:

- You are serving closed models like Claude with your CTGT policy applied, yet the method, as you described it, involves modifying internal model activations. Am I misunderstanding something here?
- Could you bake the activation interventions into the model itself rather than applying them as a runtime mechanism?
- Could you share the publications from the research associated with this? You stated it comes from UCSD.
- What exactly are you serving in the API? Did you select a whitelist of features to suppress that you thought would be good? Which ones? Is it just the "hallucination" direction that you showcase in the benchmark? I see some vague personas, but no further control beyond that.

It's quite black-boxy the way you present it right now. I don't mean this as a criticism; this looks great. I just want to understand what it is a bit better.
cgorlla 4 days ago
> yet, the way you described your method, it involves modifying internal model activations

It's a subtlety, but part of it works on API-based models. From the post: "we combine this with a graph verification pipeline (which works on closed weight models)". The graph-based policy adjudication doesn't need access to the model weights.

> Could you bake the activation interventions into the model itself rather than it being a runtime mechanism?

You could, via RFT or similar on the outputs. The runtime approach functions as a layer on top of the model without affecting the underlying weights, so the benefit is that it does not create another artifact for each customization.

> What exactly are you serving in the API?

It's the base policy configuration that produced the benchmark results, along with various personas to give users an idea of how uploading a custom policy would work. For industry-specific deployments, we have additional base policies that we deploy for that vertical, so this is meant to simulate that aspect of the platform.