[2410.21071] Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks