How do we see the four levels of AI SWE?

The SAE (Society of Automotive Engineers) developed the L0‒L5 levels for autonomous driving in its J3016 standard for classifying levels of driving automation. This standard categorizes driving automation into six levels, clearly defining the differences in driving responsibility, control, system capabilities, and driving environment at each level.

To align automated software development with autonomous driving, it's important to first discuss the complexity of both tasks. The difficulties and complexity of software development were systematically laid out in Fred Brooks's 1975 book The Mythical Man-Month. Software development is not only a technical challenge but also a systems engineering problem involving communication, collaboration, and complexity management. Brooks's famous "No Silver Bullet" essay (first published in 1986 and included in the book's 1995 anniversary edition) is still cited today to describe the inherent complexity of software development. In it, he divides the difficulties of software development into two categories: inherent difficulties (Brooks's own term is "essential") and accidental difficulties.

Inherent Difficulties

These are the complexities that are unavoidable and irremovable in software development, regardless of how advanced the tools or processes may be. They can only be managed or alleviated, not eliminated. Brooks identifies the following sources of inherent difficulty:

Complexity: Software systems are highly complex, with intricate relationships between modules, making understanding and maintenance difficult. Unlike physical engineering, software lacks visible structures and is more abstract.
Conformity: Software must conform to hardware, other systems, user requirements, and standards. These external constraints are often designed by humans, irregular, and inconsistent, making them difficult to predict or control.
Changeability: Software is not fixed like hardware; it often needs continuous modifications to adapt to new requirements, policies, technologies, or business environments.
Invisibility: Unlike buildings or machinery, software lacks a visible structure. Its design and implementation are abstract, making it hard to fully grasp the system's structure, which limits management and communication.

These inherent properties mean that software development is not simply a matter of piling on labor but an intellectually intensive and cognitively demanding creative activity. This is why "having AI write code" is in some respects much trickier than "having AI drive a car": driving, while physically challenging, has a relatively closed state space with clear goals. Writing code, in contrast, involves abstract needs, vague specifications, changing environments, and ambiguous semantics, which makes it extremely difficult to standardize.

Accidental Difficulties

These difficulties are not inherent to software itself but arise from imperfect tools, methods, or organizational structures:

Communication and collaboration problems: Communication barriers between team members can lead to misunderstandings, misinterpretations of requirements, and design disagreements. The larger the project, the higher the communication cost.
Difficulty in progress estimation: It is difficult to accurately estimate software development progress, often leading to underestimation of required time, especially in later stages of integration and debugging.
The "Mythical Man-Month" fallacy: One of Brooks's famous points is that assigning more people to a project doesn't always speed it up. Brooks's law goes further: "Adding manpower to a late software project makes it later," because communication and coordination costs rise with every new person.
Lack of reuse: In software development, many features are repeatedly developed from scratch, lacking effective modularization and component reuse, leading to inefficiency.
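
The communication-cost argument can be made concrete with a small calculation. If every pair of team members may need to coordinate, a team of n people has n(n-1)/2 pairwise channels, so coordination overhead grows much faster than headcount. The following is a minimal sketch of that arithmetic; the function name is illustrative and not taken from Brooks.

```python
def communication_channels(team_size: int) -> int:
    """Number of pairwise communication channels in a team of the given size."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    # Each unordered pair of people forms one potential channel: n * (n - 1) / 2.
    return team_size * (team_size - 1) // 2


if __name__ == "__main__":
    for size in (2, 5, 10, 50):
        print(f"{size:>3} people -> {communication_channels(size):>5} channels")
```

Going from 5 to 10 people doubles the headcount but raises the number of potential channels from 10 to 45.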

Over the past few decades, much progress in software engineering has focused on reducing these accidental difficulties. Modern IDEs, version control systems (such as Git), testing frameworks, continuous integration tools, containerized deployments, and even AI programming assistants (like GitHub Copilot or ChatGPT) keep reducing them, allowing developers to focus more on the inherent complexities.

In the real world, especially in teams with a fixed organization, these two types of difficulties often intertwine and affect each other. Managers are also "people," so they face significant accidental difficulties too. We often see large companies undergo "failed" organizational transformations, and "negligent" managers (those who lack technical or business understanding) fall "victim" to the mythical man-month fallacy. Superfluous or negligent managers can slow progress.

Additionally, in driving, once you have a driver's license you can typically drive on the road. While there is a gap between novice and experienced drivers, it is not that large. For software developers, however, the technical ladders at major companies show that the skill tree and growth path are both vast and long. Software development is therefore complex along more dimensions than autonomous driving. The most direct help AI offers today addresses the accidental difficulties mentioned earlier: GitHub Copilot, intelligent autocompletion in Cursor, or using GPT to generate parts of the code to help resolve coding challenges. Clearly, this can be considered L1.

L0: No Assistance

The control subject is human, and humans participate throughout the process, completing all tasks manually.

L1: Assisted Coding

The control subject is human, with AI assisting in code completion, scaffolding code, or even generating individual functions. The generated code requires human oversight, and a human decides whether to adopt it. This is considered programming assistance.

While AI may have seen more code than anyone else, in terms of responsibility an L1 AI can hardly take on the duties expected of a human intern; it remains a tool. Interns are typically expected to complete simple development tasks independently. These tasks might not require deep system understanding or changes to unrelated parts, and they do not increase the system's complexity. Interns need to understand the existing system and the given requirement, then complete the code development, unit testing, and integration testing in a well-documented and easy-to-use development environment. This leads us to define L2.

L2: Local Structured Task Automation

At this stage, AI can assist in completing one or more local task chains and can produce reasonably complete code with clear instructions and good contextual support. For example, it might automatically generate functions, test cases, interface definitions, or even part of the documentation from natural language specifications. The AI can iterate and fix issues based on feedback from compiling or testing results. This means AI has initial automation capabilities in localized, structured, repetitive, and well-defined development tasks.

However, the premise for L2 is that AI works in a "well-set, controllable development environment" with clear boundaries and full support systems, including but not limited to:

Defined module responsibilities and interface specifications (e.g., API specifications, class design)
Predefined test frameworks and CI processes (e.g., using JUnit, pytest templates)
Mature runtime environments and debugging tools (e.g., providing sample data, mock frameworks, log capture)
Weak or isolated dependencies between tasks and the overall system (no need to understand system-level architecture or cross-module impacts)
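
The iterate-on-feedback behavior described for L2 can be sketched as a small loop: generate a candidate, run it against a check, and feed any failure message back into the next attempt. Everything named here is an assumption for illustration: `fake_model` stands in for a real code-generating model by replaying canned attempts, `run_checks` stands in for a real compile-and-test pipeline by using `exec` on trusted strings, and `refine` is a hypothetical driver, not any particular tool's API.

```python
from typing import Callable, Optional


def run_checks(source: str) -> Optional[str]:
    """Run the candidate and one small test; return an error message, or None on success."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # stand-in for "compile" (trusted input only)
        if namespace["add"](2, 3) != 5:  # stand-in for the predefined test suite
            raise AssertionError("add(2, 3) should be 5")
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def refine(generate: Callable[[Optional[str]], str], max_rounds: int = 3) -> Optional[str]:
    """Ask for candidates until one passes the checks or the round budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(feedback)  # feedback from the last failure, if any
        feedback = run_checks(candidate)
        if feedback is None:
            return candidate  # passes checks; a human still decides on delivery
    return None


def fake_model() -> Callable[[Optional[str]], str]:
    """Hypothetical model stub: the first attempt is buggy, the second is correct."""
    attempts = iter([
        "def add(a, b):\n    return a - b\n",
        "def add(a, b):\n    return a + b\n",
    ])

    def generate(feedback: Optional[str]) -> str:
        return next(attempts)

    return generate


if __name__ == "__main__":
    print(refine(fake_model()))
```

The loop only returns a candidate that passes the predefined checks, which is why the quality of the test framework and environment listed above bounds what L2 automation can deliver.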

Final delivery decisions still require human intervention, and humans step in multiple times during the intermediate steps, so what is achieved is conditional, partial automation. In some ways, L2 can mitigate the communication, progress estimation, and code reuse issues listed among the accidental difficulties.

Moving toward the next AI level in line with a human programmer's growth path, the AI should be capable of module-level development and maintenance tasks. To do so, the AI must understand not just the functionality, responsibilities, and implementation of a single module but also its role in the overall system, including external dependencies like databases and how upstream callers use it. It must also address the "changeability" challenge by considering how to implement complex new requirements in a way that keeps the module's complexity manageable. This requires knowledge of the module's implementation details and the corresponding business domain, as well as expertise in code abstraction (OOP, FP). Additionally, AI must tackle the "complexity" challenge by independently managing testing and deployment and by setting up test and deployment environments, including mocking external dependencies.

The underlying infrastructure (Infra) required to support AI provides system-level support for closed-loop delivery, including code management (e.g., GitHub), sandbox systems, external dependency mocking, and DevOps (CI/CD) capabilities.
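
One piece of this infrastructure, mocking external dependencies, can be illustrated with Python's standard `unittest.mock`. In this minimal sketch, `OrderService` and its `fetch_orders` database call are hypothetical names invented for the example; the point is only that a module can be tested in isolation once its database dependency is replaced by a mock.

```python
from unittest.mock import Mock


class OrderService:
    """Hypothetical module under development; `db` is its external dependency."""

    def __init__(self, db):
        self.db = db

    def total_for_customer(self, customer_id: int) -> float:
        # In production this would hit a real database through `db`.
        rows = self.db.fetch_orders(customer_id)
        return sum(row["amount"] for row in rows)


def test_total_for_customer() -> None:
    # Replace the database with a mock that returns canned rows.
    db = Mock()
    db.fetch_orders.return_value = [{"amount": 10.0}, {"amount": 5.5}]

    service = OrderService(db)

    assert service.total_for_customer(42) == 15.5
    # Verify the module used its dependency the way we expect.
    db.fetch_orders.assert_called_once_with(42)


if __name__ == "__main__":
    test_total_for_customer()
    print("ok")
```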

L3: Module-Level Task Conditional Automation

At this stage, AI begins to approach the role of a junior developer, capable of handling module-level development and maintenance tasks, with an initial "system understanding." The AI is no longer confined to completing individual functions but can work continuously around a module, combining knowledge of both code and the business domain. It also has end-to-end capabilities within the system-level infrastructure for that module.

L4: System Architecture-Level Automation

For L4 SWE-AI, the goal is not to replace creative technological breakthroughs but to assist in designing system architectures and automating their implementation. This greatly reduces the effort architects spend on execution, allowing them to focus more on inherent challenges such as module splitting, architecture evolution, and the design of system collaboration models. Key capabilities include:

Walking through the design and implementation of each module and ensuring the evolution paths align with system design
Understanding the relationships between modules within the overall system, including interaction methods and dependencies
Completing automation of testing and deployment for each module
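
Understanding module relationships and automating per-module testing and deployment both rest on some representation of the dependency graph. The sketch below assumes a hypothetical four-module system (`api`, `orders`, `auth`, `database`) and uses the standard library's `graphlib` (Python 3.9+) to derive an order in which modules can be built or deployed, plus a simple helper that finds which modules are affected by a change. It is an illustration of the idea, not a description of how any existing L4 system works.

```python
from graphlib import TopologicalSorter

# Hypothetical module graph: each key depends on the modules in its set.
MODULE_DEPENDENCIES = {
    "api": {"orders", "auth"},
    "orders": {"database"},
    "auth": {"database"},
    "database": set(),
}


def build_order(dependencies: dict) -> list:
    """Return an order in which every module comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def affected_by(changed: str, dependencies: dict) -> set:
    """Modules that depend on `changed`, directly or transitively."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for module, deps in dependencies.items():
            if current in deps and module not in affected:
                affected.add(module)
                frontier.append(module)
    return affected


if __name__ == "__main__":
    print("deploy order:", build_order(MODULE_DEPENDENCIES))
    print("retest after changing database:", sorted(affected_by("database", MODULE_DEPENDENCIES)))
```

With a graph like this, a change to `database` flags every other module for retesting, while a change to `api` affects nothing downstream.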