Sounds like AlphaDev [1] might be a better approach for a problem like this.
[1] https://github.com/google-deepmind/alphadev