## [[2305.01210] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation](https://arxiv.org/abs/2305.01210)

Code evaluation datasets are used to evaluate how well LLMs perform at generating code. They contain pairs of coding problems and test cases. Apparently, these test cases are not comprehensive enough: HUMANEVAL, a popular code evaluation dataset, routinely missed bugs in generated code.

> Code evaluation datasets, containing curated synthesis problems with input/output test-cases, are used to measure the performance of various LLMs on code synthesis. However, test-cases in these datasets can be limited in both quantity and quality [...] In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis benchmarking framework to rigorously evaluate the functional correctness of LLM-synthesized code. [...] We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x additionally generated tests. **Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on average!**

## [Red Pen Reviews: Anti-Diet: Reclaim Your Time, Money, Well-Being, and Happiness Through Intuitive Eating](https://www.redpenreviews.org/reviews/anti-diet/)

Red Pen Reviews evaluates popular diet books for scientific accuracy, healthfulness, and reference accuracy (whether the book accurately summarizes its citations). Here they evaluate a book written by a proponent of [Health At Every Size](https://en.wikipedia.org/wiki/Health_at_Every_Size). HAES has become popular lately and makes some eyebrow-raising claims, so I was interested to read this review.

This book argues that:

> 1. Dieting does not lead to long-term weight loss.
> 2. Obesity does not directly harm health.
> 3. Weight stigma and weight cycling explain why obesity is linked to health problems.

How do these claims fare?

> For scientific accuracy, AD received an overall score of 1.3 out of 4, meaning that its claims are poorly supported.
>
> The first claim, that dieting does not lead to long-term weight loss, received a score of 2.3 out of 4, meaning it’s weakly supported by evidence.
>
> The second claim, that obesity does not directly harm health, received a score of 0.7 out of 4, indicating that it is opposed by evidence.
>
> The third claim, that weight stigma and (diet-induced) weight cycling explain why obesity is linked to negative health outcomes, received a score of 1.3 out of 4, indicating it is poorly supported by evidence.

I think HAES proponents are correct that the diet industry oversells the power of diets (transformation isn't the norm, but a 5-10% loss is achievable and improves health), but I think denying the negative health effects of obesity is a bridge too far. Check out the review for a deep dive on the evidence.

I recently [[Maintenance Phase Podcast Review|Argued On The Internet]] about this, and I was relieved to see that real experts in nutrition reached a conclusion similar to mine, even citing the same studies.