paradite 3 days ago
Hey. I like your roast of benchmarks. I also publish my own evals of new models (using coding tasks that I curated myself, without tools, human-rated with rubrics). Would love for you to check them out and share your thoughts. Example recent one on GPT-5: https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under... All results: