Abstract
A new generation of AI-driven coding assistants, such as GitHub Copilot and other tools built on large language models (LLMs), has disrupted large parts of the established software development workflow. One such model, ChatGPT, has attracted particular attention for its ability to generate executable code and to assist with debugging and other software engineering tasks. As industry reliance on the technology grows, many are asking whether code written by ChatGPT is trustworthy and of sufficient quality for production-grade software systems. Previous work focused mainly on functional correctness and therefore offered little insight into overall software quality. This paper provides a thorough empirical characterization of the reliability of ChatGPT-generated code from a broader software engineering perspective. The proposed multidimensional reliability framework comprises accuracy, maintainability (not efficiency), cleanliness, security, readability, testability, and uniformity. We conduct controlled empirical studies on 500 code artifacts drawn from five software engineering tasks and two programming languages. The generated code was analyzed with a range of scanners, automated static analysis tools, statistical significance tests, and common software quality measures. The results suggest that, for fast coding assistance, ChatGPT is a suitable programming assistant in terms of readability and basic functional correctness. However, they also reveal significant reliability problems. In particular, the quality of directly generated code suffers from inconsistent results across repeated runs, reduced simplicity for non-toy programs, and easily exploited security vulnerabilities. The evidence also shows that reliability varies considerably across task types and quality characteristics.
The paper concludes that although ChatGPT brings productivity benefits, it is not yet a dependable software development assistant. Instead, its output should be validated through a quality assurance process. We believe that the proposed evaluation framework, together with the empirical evidence presented in this paper, can guide researchers and practitioners who want to adopt large language models in existing software engineering practices.