- AI-Assisted Coding: The promises
- Uncovering the AI productivity claim
- Human motivation: are we outsourcing the fun and adding to the mundane
- Are we focusing AI on the right problem
- No need to settle for 55% faster, 10x ourselves
- The big win: optimise for software maintenance
- Research: AI refactoring
- Can AI help us refactor existing code
- Fact checking the AI refactorings: Can we separate the good from the bad refactorings
- Outcome: elevated to the level of human experts with a fact-checking model
- References
AI-Assisted Coding: The promises
- Travelling back in time to the 1970s
- Setting the scene: in 2024, began hobby programming on the Atari 2600
- No operating system, 128 bytes of RAM
- Compensated for lack of RAM with 4K of ROM for game logic, graphics and sound
- GitHub Copilot advertised being "55% faster" on this kind of code
- Raised existential questions: do I want to be 55% faster at a hobby? At what?
Uncovering the AI productivity claim
- "The impact of AI on developer productivity: Evidence from GitHub Copilot"
- The researchers found it was faster, but with no guarantee that this transfers to real-world tasks
- The study was done in an experimental setting, almost like a classroom
- They also point out that the study suggested less experienced developers would benefit most
- "this study does not examine the effects of AI on code quality"
- What are the implications? Will we make the world a better place?
Human motivation: are we outsourcing the fun and adding to the mundane
- We are basically turning ourselves into "maintenance programmers"
Are we focusing AI on the right problem
- Writing new code is a small part (around 5%) of what we do
- A 55% speed-up on that 5% saves roughly one hour a week (a rough calculation is sketched after this list)
- That is neither disruptive nor groundbreaking
- The big potential win for AI is in understanding code
- Understanding the existing system is what tells us what to change
- What if we refocus AI so that existing code becomes easier to understand?
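- A rough back-of-the-envelope sketch of the arithmetic above, assuming a 40-hour work week (my assumption; the 5% and 55% figures are the talk's):

```python
# Back-of-the-envelope check of the "55% faster on 5% of our time" argument.
# Assumption (not from the talk): a 40-hour work week.
hours_per_week = 40
share_writing_new_code = 0.05  # ~5% of our time goes to writing new code
hours_writing_new_code = hours_per_week * share_writing_new_code  # 2.0 hours

advertised_speedup = 0.55  # "55% faster", read loosely as 55% of that time saved
hours_saved = hours_writing_new_code * advertised_speedup
print(f"Time saved per week: ~{hours_saved:.1f} hours")  # ~1.1 hours, i.e. roughly one hour
```

- Read strictly, "55% faster" saves a bit less (the same work takes 1/1.55 of the time), but either way the saving is on the order of one hour per week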
No need to settle for 55% faster, 10x ourselves
- Supporting data: 'Code Red: The Business Impact of Code Quality' (arXiv), Tornhill and Borg (2022)
- Categorises code as:
- 'green' (easy to work with)
- 'yellow' (problematic: more complicated than the problem calls for)
- 'red' (the worst category)
- Red code makes you 10x slower, even when the problem is of a similar scale
- If we can use AI to turn red code green, we can indeed make ourselves 10x faster
- We want to ensure AI generates green code, else it just generates maintenance burden
- A follow-up to the 'Code Red' study looked at onboarding costs
- New programmers need extra time to adapt to a codebase, especially when code quality is low
- Borg, Tornhill & Mones (2023): 'U Owns the Code...'
The big win: optimise for software maintenance
- Refactoring is defined as improving the design of existing code without changing its behaviour
- It's not a refactoring unless we improve the design; we need a gold standard to judge whether we did
- It's not a refactoring if we fail to preserve the behaviour of the original code, e.g. by introducing a bug (a minimal behaviour check is sketched after this list)
- "Refuctoring": when we fail to keep these requirements
Research: AI refactoring
- 'Refactoring vs. Refuctoring'
- Measured code quality improvement using the 'Code Health' metric
- The only code-level metric shown to correlate with business outcomes
- An aggregated metric, since no single metric captures a multifaceted problem
- A file-level metric covering three categories of code smells:
- Module/class-level smells, e.g. low cohesion, God classes (illustrated after this list)
- Low cohesion: a class with too many unrelated business rules, which makes the code hard to understand
- A low-cohesion class that keeps growing becomes a God class
- Function-level smells, e.g. copy-pasted logic, God functions, primitive obsession
- Implementation smells, e.g. deeply nested logic, complex conditionals
- Roughly 20% of all programmer mistakes are due to things like deeply nested logic
- It just doesn't play well with how the human brain works
- Code Health:
- green = healthy code with low defect risk
- yellow = maintenance risk
- red = the worst category
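- To illustrate the module-level smell mentioned above: a hypothetical low-cohesion class and a more cohesive split; the class and method names are my own examples, not CodeScene's Code Health rules

```python
# Hypothetical illustration of low cohesion: one class mixing unrelated business rules.
# Classes like this tend to keep growing until they become God classes.
class OrderManager:
    def calculate_price(self, items): ...
    def apply_member_discount(self, customer): ...
    def send_confirmation_email(self, customer): ...  # notification concern
    def export_to_accounting_csv(self, order): ...    # reporting concern
    def retry_failed_payment(self, payment): ...      # payment concern

# A more cohesive split: each class owns one concern, so someone changing the
# pricing rules no longer has to understand emails, reports and payments too.
class PriceCalculator:
    def calculate_price(self, items): ...
    def apply_member_discount(self, customer): ...

class OrderNotifier:
    def send_confirmation_email(self, customer): ...

class AccountingExporter:
    def export_to_accounting_csv(self, order): ...

class PaymentRetrier:
    def retry_failed_payment(self, payment): ...
```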
Can AI help us refactor existing code
- Benchmarking on 100k refactorings of real-world code
- 99%+ produced valid code, around 68% improved code health, but only 18-37% counted as valid refactorings
- Imagine a coworker who broke the code in 70-80% of cases; we would never accept that
- Yet when it comes to a machine, we accept it
Fact checking the AI refactorings: Can we separate the good from the bad refactorings
- How do we know which is refactoring vs. refuctoring?
- CodeScene ACE: automatically refactors the code
- Each refactoring first goes to a 'model selector'
- It analyses the code and selects the best AI service for the job; the AI services have different strengths
- Bad refactorings are thrown away; make another attempt or ask another AI
- Demo: refactoring nested conditionals and naming things (expressions built from logical operators); a sketch of this kind of transformation follows
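- A sketch of the kind of transformation the demo shows: guard clauses remove the deep nesting, and the compound condition gets an explanatory name; the Order type and the shipping rule are hypothetical, not the demo's actual code

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Order:  # hypothetical example type, just enough to exercise the logic
    items: list
    paid: bool
    cancelled: bool
    address: Optional[str]

# Before: deeply nested conditionals with an anonymous compound condition.
def can_ship_before(order: Optional[Order]) -> bool:
    if order is not None:
        if order.items:
            if order.paid and not order.cancelled and order.address is not None:
                return True
    return False

# After: guard clauses flatten the nesting and the condition gets a name.
def can_ship_after(order: Optional[Order]) -> bool:
    if order is None or not order.items:
        return False
    is_ready_for_dispatch = (
        order.paid and not order.cancelled and order.address is not None
    )
    return is_ready_for_dispatch

# Spot-check that the rewrite preserved behaviour on a few samples.
samples = [
    None,
    Order(items=[], paid=True, cancelled=False, address="Main St"),
    Order(items=["book"], paid=True, cancelled=False, address="Main St"),
    Order(items=["book"], paid=True, cancelled=True, address="Main St"),
    Order(items=["book"], paid=False, cancelled=False, address=None),
]
assert all(can_ship_before(o) == can_ship_after(o) for o in samples)
```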
Outcome: elevated to the level of human experts with a fact-checking model
- Correctness improved from the ~30% range to 98%
- Focus on comprehending code over mere writing
- Understanding existing code is a very human-intensive aspect
References
- Tornhill & Borg: 'Code Red: The Business Impact of Code Quality' (2022)
- 'Refactoring vs. Refuctoring'
- Tornhill: 'Your Code As A Crime Scene' (2023)