I. Introduction
The release of ChatGPT stirred excitement among students and worry among educators. Indeed, a survey at Stanford University found that 17% of respondents had used ChatGPT on their final exams, and another survey found that more than 40% of students used ChatGPT for coursework multiple times each week [1]. Some faculty members turned to AI content detectors to identify AI-generated text in student submissions. However, many students have complained of being wrongfully accused of cheating with AI because of faulty detector results. One such incident at the University of California, Davis, where the accused student was eventually exonerated, was reported by USA Today [2]. Still, many other cases never receive media attention. A search on Reddit, a social media platform whose users are roughly two-thirds in their twenties and thirties [3], using the keywords “accused” and “AI” yielded many posts about wrongful accusations of AI use in assignments, including two from May 2023 that each drew over 3,000 comments [4], [5]. These posts underscore the considerable stress and frustration that students experience in such situations.

The volume of student complaints across social media platforms raises the question: is the error rate of these detectors high enough to cast doubt on their reliability? This study uses statistical methods to evaluate how accurately AI content detectors distinguish between human-written and AI-generated text.