This is an article about experimenting with the impressive ChatGPT (https://chat.openai.com/chat) and an attempt at fully automated code reviews: priming the prompt enough to get useful answers.
If you like this type of content, follow me on Twitter @luyben.
If you know what ChatGPT is, continue below the first break for the results:
As the ChatGPT hype continues, there is a growing interest in its potential applications in the world of software development. One area where ChatGPT has shown promise is in the realm of automated code reviews.
Automated code reviews are a type of code analysis that uses specialized tools to automatically check code for errors, security vulnerabilities, and adherence to coding standards. This can be a time-consuming and tedious process for developers, who often have to manually review large amounts of code to ensure its quality and correctness.
Enter ChatGPT, the state-of-the-art language model developed by OpenAI. By using natural language processing (NLP) techniques, ChatGPT is able to understand the meaning and context of code, making it an ideal candidate for automating code reviews.
One of the key benefits of using ChatGPT for automated code reviews is its ability to provide instant feedback to developers. Instead of having to wait for a human reviewer to manually inspect their code, developers can simply ask ChatGPT for an analysis, and receive an immediate response. This not only saves time, but also allows developers to fix errors and issues as they work, rather than having to wait until the end of a project to address them.
Another advantage of using ChatGPT for automated code reviews is its ability to provide more thorough and accurate analysis. Because ChatGPT uses NLP techniques to understand the meaning and context of code, it can provide a more comprehensive analysis than traditional code review tools, which often rely on simple keyword matching. This means that developers can be confident that ChatGPT is catching all potential errors and issues, and providing helpful feedback for how to fix them.
Of course, ChatGPT is not a replacement for human code reviewers. Its analysis is based on the code it is given, and it is not capable of understanding the broader context or goals of a project. However, as a tool for helping developers quickly and accurately identify issues in their code, ChatGPT has the potential to be a valuable addition to any software development team.
Disclaimer: the above blurb was mostly written by ChatGPT; below are the results, by me.
One part to point out is that ChatGPT tells us:
Another advantage of using ChatGPT for automated code reviews is its ability to provide more thorough and accurate analysis.
And while this is mostly true, the output needs a lot of checking: much of the analysis is accurate, but a lot of it is plainly wrong, as will be described below.
The strategy we used is to parse the PHP code with a PHP parser:
and the experimental GPT API & server:
The project in question is a 300k-line, 1,000-file PHP project, created in 2021/2022 and running in production. The first full review took slightly more than an hour to generate; along the way I had to change the scripts I used to format the information and tweak the prompts. The final version can do a complete run in less than one hour.
Note that ChatGPT is overloaded most of the time, so I ran the analysis at night, European time, which is when it’s fastest. During the US daytime it is practically unworkable, giving various errors that basically mean: ‘cannot handle the stress!’.
The way I created the analysis is by parsing all .php files in a given directory, descending into the AST until I have a chunk that’s small enough for ChatGPT to munch on, and then sending it off with a (substantial) prompt asking it for an analysis of the code.
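To give an idea of the chunking step, here is a much-simplified sketch. The real pipeline walks the AST produced by a PHP parser; this hypothetical version uses PHP’s built-in tokenizer instead and cuts chunks at function boundaries once a token budget is exceeded (the function name and the budget are my own invention, not the actual script):

```php
<?php
// Hypothetical sketch of the chunking step: split a PHP source file into
// pieces small enough for the model's context window. The real pipeline
// walked the AST from a PHP parser; this simpler version cuts at
// top-level function boundaries using the built-in tokenizer.

function chunkPhpSource(string $source, int $maxTokens = 1500): array
{
    $tokens = token_get_all($source);
    $chunks = [];
    $current = [];

    foreach ($tokens as $token) {
        $current[] = $token;
        // Once the chunk is large enough, cut just before the next
        // function declaration so functions are not split in half.
        if (count($current) >= $maxTokens
            && is_array($token) && $token[0] === T_FUNCTION) {
            array_pop($current);          // the T_FUNCTION token starts
            $chunks[] = $current;         // the next chunk instead
            $current = [$token];
        }
    }
    if ($current !== []) {
        $chunks[] = $current;
    }

    // Re-render each chunk back to source text for the prompt.
    return array_map(function (array $chunk): string {
        return implode('', array_map(
            fn($t) => is_array($t) ? $t[1] : $t,
            $chunk
        ));
    }, $chunks);
}
```

Since the chunks are just a partition of the token stream, concatenating them gives back the original file, which makes the step easy to sanity-check.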
Results of the review are saved in a ./results directory and merged into a single PDF with https://github.com/BlueHatbRit/mdpdf, driven by this script: https://github.com/tluyben/md2pdf.
Most files were models and controllers; there were a lot of views, but those were copy/paste templates.
A quick review of this codebase by myself came to the following conclusions:
- many 2000s-style PHP issues, like building SQL statements from strings, including $_GET/$_POST values concatenated directly into them; it is a nightmare in that regard
- no use of MySQL LIMIT anywhere, so all database queries return unbounded result sets
- mixing proper PHP templating with HTML built by concatenating strings together; unreadable and error prone
- XSS and CSRF opportunities everywhere
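The XSS opportunities mostly come from the string-concatenated HTML mentioned above. Because of the NDA I can’t show real code, so here is an invented minimal illustration of the pattern and the standard fix:

```php
<?php
// Illustrative only — not code from the reviewed codebase. Shows the
// XSS pattern the review kept flagging: request input concatenated
// straight into HTML, versus escaping it first.

function renderGreetingUnsafe(string $name): string
{
    // Vulnerable: $name may contain markup, e.g. "<script>...</script>".
    return '<p>Hello, ' . $name . '</p>';
}

function renderGreetingSafe(string $name): string
{
    // htmlspecialchars() neutralises <, >, & and quotes.
    return '<p>Hello, ' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>';
}
```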
Generally, I found it strange that prepared statements were not used anywhere; that is simply not done in 2022. ChatGPT agreed with that.
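For completeness, this is the kind of fix the review kept recommending, sketched with an in-memory SQLite database so it is self-contained (the real codebase uses MySQL; the table and values are invented):

```php
<?php
// Sketch of the recommended fix: a PDO prepared statement with bound
// parameters and an explicit LIMIT, instead of concatenating $_GET
// values into the SQL string. In-memory SQLite keeps it runnable.

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');
$pdo->exec("INSERT INTO users (name) VALUES ('alice'), ('bob')");

// The vulnerable pattern found all over the codebase (do NOT do this):
//   $sql = "SELECT * FROM users WHERE name = '" . $_GET['name'] . "'";

// The safe pattern: placeholders, plus a LIMIT so the result set is
// never unbounded.
$stmt = $pdo->prepare(
    'SELECT id, name FROM users WHERE name = :name LIMIT :limit'
);
$stmt->bindValue(':name', "alice' OR '1'='1", PDO::PARAM_STR); // injection attempt
$stmt->bindValue(':limit', 10, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
// $rows is empty: the injection attempt is treated as a literal string,
// not as SQL.
```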
What surprised me, though that’s because I keep having human associations with this GPT as it lies, fabricates and bluffs
(which is lovely, in my opinion), is that it tries to explain the code even though it was not asked to:
This is usually not very helpful, and it is very unhuman-like: a human reviewer would generally not do that, and I could not find a prompt capable of stopping it.
Because of the NDA I have with the company that created this codebase, I cannot insert recognisable code. But there were quite a few instances where ChatGPT created ‘better code’ which was just a copy of the original code…
I changed my code to detect that.
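The kind of check I added can be sketched roughly like this (hypothetical, not the actual script): normalise both snippets before comparing, so that a cosmetic reshuffle of the original does not count as an improvement.

```php
<?php
// Hypothetical version of the copy-detection check: flag ChatGPT's
// "improved" snippet when it is just the original code with cosmetic
// changes (whitespace, casing).

function isJustACopy(string $original, string $suggestion): bool
{
    $normalise = function (string $code): string {
        // Collapse runs of whitespace and lowercase everything.
        return strtolower(trim(preg_replace('/\s+/', ' ', $code)));
    };
    return $normalise($original) === $normalise($suggestion);
}
```

A real implementation might also strip comments or compare token streams, but even this crude version catches the most blatant cases.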
Some examples of generic code comments ChatGPT came up with:
And more detailed but still generic:
It also concludes things that I would normally read past unless I am bug hunting, like this:
Although I saw these things as well, it seems ChatGPT takes a good look at all possible security issues:
These are indeed present all over this codebase.
Another very nice thing is that it sometimes indeed reasons correctly about how the code will execute. In this case, there is a call:
$today = date('Y-m-d', time());
in an O(n^2) loop. While there can be a correct use for this (time() is dynamic, so the value could change mid-loop), that is clearly not what the programmer intended here; and precisely because time() is dynamic, the runtime cannot safely optimize the call away:
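The hoisting fix is trivial to sketch. This is simplified to a single loop with invented data, not the NDA-covered original:

```php
<?php
// date('Y-m-d', time()) only needs to be evaluated once per run, so
// hoist it out of the loop. (If a run could legitimately cross midnight
// you might want it inside; here the repeated call was an oversight.)

// Before: one date() call per iteration.
function countDueBefore(array $items): int
{
    $due = 0;
    foreach ($items as $item) {
        if ($item['due'] <= date('Y-m-d', time())) {
            $due++;
        }
    }
    return $due;
}

// After: the value is computed once, outside the loop.
function countDueAfter(array $items): int
{
    $today = date('Y-m-d', time());
    $due = 0;
    foreach ($items as $item) {
        if ($item['due'] <= $today) {
            $due++;
        }
    }
    return $due;
}
```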
Sometimes it comes to very strange conclusions that you might easily read past; here is an example:
While ChatGPT provides a nice enough analysis in many cases, the same result can be had with static analysis. However, static-analysis output would be less readable, perhaps completely unreadable, for management. ChatGPT’s style of explaining, while technical, is quite readable for people who don’t choke up at the sight of a little code (which they can read around).
Performance (execution speed) is not great for a machine, but compared to a human it is blazingly fast, churning out a review every 2 seconds on average. Static analysis finds more, and faster, but usually doesn’t explain things in a nice, readable form.
Because the developers consistently made the same mistakes, the feedback was quite repetitive. This is not a bad thing; it happens with human reviews too.
The most annoying thing about GPT (2, 3, 3.5, etc.) is that it cuts off responses; this is a property of these transformers that needs to be addressed. For non-IT/AI people it is completely alien that this can happen. More annoyingly, OpenAI somehow decided to remove the ‘Next’ prompt (https://twitter.com/VFDraganescu/status/1600563871504388096), which was handy in that case, if you knew the response had been cut off.
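One way a pipeline like this could at least guess that a response was cut off, so the script knows to re-ask, is a crude heuristic like the following. This is my own hypothetical workaround, not anything the API provides:

```php
<?php
// Heuristic guess at whether a model response was truncated mid-stream.
// Hypothetical helper, not part of any OpenAI API.

function looksTruncated(string $response): bool
{
    $trimmed = rtrim($response);
    // An odd number of ``` fences means a code block was left open.
    if (substr_count($trimmed, '```') % 2 === 1) {
        return true;
    }
    // No sentence- or block-ending character at the end is suspicious.
    return !preg_match('/[.!?:;`)\]]$/u', $trimmed);
}
```

It is fallible in both directions, but cheap enough to run on every response before saving it.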
All things considered, it did a good, albeit repetitive, job. There were very strange mistakes (giving away that it does not actually understand), but those instances are easy to skip over while picking up the actually important bugs and issues.
A combination of static analysis to find the issues, followed by the AI explaining their effects and fixes, seems to be the way to go here.