Back in 2017, Andrej Karpathy wrote an essay titled Software 2.0 about how Deep Neural Networks enable a shift from building software with code to building software with data, a shift now widely referred to as Software 2.0. It applies in particular to giving meaningful semantic interpretations to raw data from the real world. Inspirational problems include:
These simple semantic evaluation tasks are powerful on their own and can drive quality control and efficiency in many important industries. However, that is not the limit of Software 2.0, as it can be enhanced with traditional algorithms for more complex planning and control tasks like:
These tasks have traditionally been extremely challenging for machine learning professionals, whose professional histories often register more failed or underperforming projects than successes. The aligned industries tend to be somewhat risk averse, attracting risk-averse professionals and operating in a conservative, incremental manner. The result is an overall under-utilization of the Software 2.0 stack, despite the hype, and perhaps over-utilization in the more risk-seeking “Tech” industry.
Software 2.0 offers a solution to this professional anxiety: I, as a computer scientist, don’t have to run the show. I don’t have to reason about every detail of the algorithm. I don’t have to customize or tweak everything to the specifics of my problem. Rather, I can take a supportive secondary role, building the platform and infrastructure that ingests the data and trains the model, while someone else worries about the details by developing the data and ensuring that the data is actually sufficient to solve the problem.
It’s very similar to how a hardware developer, after the Software 1.0 transformation brought about by the microchip, no longer has to drive every detail of the system’s behavior. Instead, when confronted with a novel problem, they focus on assembling a platform, mostly from ready-made components, and simply ensure that the specs and configuration allow the software to do what it needs to do, without worrying about the details of how the software works or how it is developed.
Industry leaders like Andrew Ng have championed this concept and founded startups hoping to power Software 2.0 development with powerful training/inference platforms. However, the philosophy of this workflow is still poorly understood, and teams default to better-understood, but slower and more wasteful, Software 1.0 workflows based on static dataset collection.
Software 2.0 has several barriers to implementation, the biggest being:
The second issue tends to get easier over time as the rest of the system matures. But the first issue is not solved by time; rather, it is perpetuated by computer scientists running the show. Below, I’ll explain the principles involved in making someone else the center of Software 2.0 workflows.
In 2022, I joined a fascinating mid-stage startup that checked all the boxes: it did automated analysis of raw real-world data, it had a real business model, and it was run by good, pragmatic people.
They built a remarkably human Software 2.0 workflow centered on the humans labeling the data: the Data Managers. This is not an inflated title; they really did manage a lot, and much of a project’s success depended on their skill. Unlike Software 1.0 workflows, where the labelers simply hand off the data at some point, at this company the Data Managers iterated with product advisors to determine the best way to label the data, actually labeled the data, trained the model, used the model to find good labels to add, evaluated the model, built out metrics, advised studies, and more. The remarkable scope of their responsibilities meant that relatively little time was spent actually labeling or reviewing data, maybe 40% of their time on average.
The lead Data Manager, a very talented individual with no machine learning background or formal training, developed over time a strong philosophy of the true human demands of data management, demands that transcend whatever particular labeling process a given application uses:
These demanding requirements pave the way for a whole new skilled craft—the Software 2.0 data developer.
While all of the above intellectual demands have an important place, there is a much greater degree of mind-numbing grind in Software 2.0 development than in Software 1.0 development. Combined with the high degree of responsibility and understanding concentrated in the Data Manager, second-guessing and decision exhaustion become huge problems.
To minimize this second-guessing and allow for smooth flow in the labeling, targeted processes can be established that commit the team to workflows known to work and see changes through to the point where their success can be determined.
These processes will be a huge help to workflow efficiency and project timelines if well designed, and can stall a project if poorly designed. Thus, these processes should loom large in the minds of anyone leading or directing such a project.
However, processes must take a second-place role behind the human Data Manager and others heavily involved in the project. The quality of the people matters more than the quality of the processes. Junior Data Managers can work with more senior ones to help work through tricky issues and to get ideas for effective processes to solve those issues. However, we found that bringing in outside expertise more distantly connected to the project to try to establish processes consistently caused churn and poor results. Instead, regular review of the process with those heavily involved yields the most consistent results.
Software 2.0 is built on top of Software 1.0, similar to how Software 1.0 is built on top of hardware. This means that Software 2.0 still needs regular software developers to build the system that powers the data management workflow above, help collect raw data from edge sources, and build out any inference pipelines in the application. The Software 2.0 developer fills a supportive engineering/data-operations role that takes a back seat to the Data Manager. As this was my role, I identified a few principles that helped me be effective in this position:
Another key role is domain specialist, a true expert in the domain. This role is key to building a robust product that is useful in practice and trustworthy. This role helps gauge the scope of the model, offers criticism of the model’s behavior and performance, and helps find difficult edge cases and identify failures.
Generally speaking, this role needs to have the following philosophy in mind:
The success of any project is most closely related to the quality and dedication of the team that does the work, including both individual qualities and how well the team meshes. However, there is still benefit to learning from past successes and failures. In particular, properly reflective experiences yield principles for evaluating future strategic and tactical approaches, and converging on particular strategies, tactics, and individual roles can boost team cohesion. This section is dedicated to the principles we have learned that generally succeed in a Software 2.0 approach. This analysis is neither complete nor free of errors, as it depends on the particular mistakes and choices we made in our approach.
To start, let’s examine all the differences we found from best practices in Software 1.0 style machine learning.
To emphasize the differences and changes one can expect from a Software 1.0 process to a Software 2.0 process, let’s review the basic Software 1.0 development process.
Meanwhile, the Software 2.0 process emphasizes data iteration over model iteration: the main iteration loop runs on the data rather than the model, which requires a bit of reordering of the steps to support that data iteration.
The key insights that were discovered upon implementing this process in practice are:
The most surprising finding for an ML engineer is how unimportant metrics end up being in a Software 2.0 workflow. In Software 1.0, no solid progress on models can be made without solid metrics, so a dogmatic assertion of the field is that metrics development must come before model development, and any model development is constrained by the metrics’ ability to measure true performance; otherwise, any changes are shots in the dark, unlikely to hit their mark except by random chance.
However, in Software 2.0, metrics are secondary. The reason is that the Data Manager is reviewing model results all day, every day, and they develop an intuition for model behavior independently of any metrics there may be.
Furthermore, this experience-driven model understanding can often be superior in finesse to any metric one might design for complex practical cases. If there is a downside, it is that metrics look at a lot of data at once, whereas a data management process typically subsamples the data heavily during review. And of course, good metrics are easily communicable to outside groups, while experience-based intuitions are not. But at a strategic level, metrics can take a second tier of importance in a Software 2.0 world; they can come later in development and can be of lower quality than the training data.
The second surprising finding is how training data size becomes a mixed bag in human-centric Software 2.0 workflows, and keeping datasets small and highly curated can be the way to go. In Software 1.0 workflows, more data is always better. Models can get more, better feedback from more data, even if its quality is relatively low. Especially in cross-modal datasets, more data allows models to find rare associations that simply wouldn’t be present in smaller datasets.
However, in human-centric Software 2.0, more data means more data to review for errors, more data to balance, and more data to refactor if the labeling strategy changes slightly. Just as in Software 1.0 all code is a liability and simplicity is key, in Software 2.0 all data is a liability and simplicity is equally key.
How to reconcile this concept of data liabilities with the success of huge datasets, such as the 400 million image/text pairs to train CLIP, or the 1 billion masks used to train the SAM model? Or the hundreds of terabytes of raw text used to train LLMs?
It’s simple: those models were trained in a human-exploiting regime, rather than a human-centric regime. Human-exploiting regimes have an entirely distinct set of guiding principles that focus on the importance of a good architect to make good decisions at project inception, high-quality “from the wild” data collection/filtering, and labeler arbitration routines to ensure consistency and cleanliness.
The fundamental issue with human-exploiting training strategies, and the real-world failure of these large datasets in building good semantic analysis engines, is the loss of control implied by handing dataset construction over to a particular process. The process takes control and starts pushing the dataset in unexpected directions wherever there are fundamental ambiguities or data-domain discrepancies. Interestingly, the datasets meant for “general computer vision” (ImageNet), “general object detection” (COCO), and “general segmentation” (Segment Anything) all fail with even slightly out-of-domain data, and have much less value in the real world than first thought.
Interestingly, modern LLMs take an increasingly mixed approach, with an initial large-scale pretraining step focused on architecture and large-scale data, followed by an increasingly human-centric RLHF fine-tuning stage, where much of the innovation in the field lies in identifying and fixing model biases with careful human expertise and attention rather than with more data. While the exact methods behind state-of-the-art models are kept closed, it seems that these reinforcement learning datasets are kept relatively small and agile, with high-skill labelers, and dependency on hordes of unskilled data labelers is limited.
This hybrid fine-tuning approach looks increasingly like the trend for real-world machine learning in the next 5-10 years. However, it’s important to note that other domains have yet to build out the very clean fine-tuning workflows and concepts that have been built for LLMs, and attempts in other domains have not had as much success so far. Innovation will be important in proving the value of hybrid systems in practice.
As datasets become larger, the Data Manager’s idea of “holding the dataset in their head” is no longer realistic. Rather, advanced tool use is required to analyze, understand, and control the larger set of artifacts. Some of the visualizations which proved exceptionally valuable are:
Many more advanced and domain-specific tools are possible; these are just the essentials for any Software 2.0 workflow. A minimal sketch of one such visualization follows.
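As an example of what one of these essential tools can look like, here is a hedged sketch of a 2-D embedding map of a dataset, using t-SNE from scikit-learn. The `embeddings` and `labels` arrays are assumed to come from whatever feature extractor and label store the platform already has; this is an illustration, not any particular product's implementation.

```python
# Minimal sketch: project dataset embeddings to 2-D so a Data Manager can
# spot clusters, outliers, and label confusion at a glance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_map(embeddings: np.ndarray, labels: np.ndarray) -> None:
    """embeddings: (N, D) feature vectors from any trained model backbone."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for label in np.unique(labels):
        mask = labels == label
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=str(label))
    plt.legend(markerscale=3)
    plt.title("Dataset embedding map (t-SNE)")
    plt.show()
```

Colored by class label (as here), the map surfaces suspicious label clusters; colored by data source or capture date instead, it surfaces domain gaps.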
Especially as tooling grows more advanced, substantial training and adaptation to the tools are needed. Ultimately, the humans and the tools they use should greatly surpass the capabilities of either a fully automated or a fully manual system, but this requires significant investment into true mastery of the technology, adapting to it as necessary. The ideal is a cyborg-like hybrid workflow, where sometimes the system prompts the human, and sometimes the human prompts the system. The human provides precise judgment and broad vision, and the computer provides broad analysis and precise memory.
A quote from an OpenAI blog post about a reinforcement learning project (which I can no longer find) said “How did we improve performance? By fixing bugs. And how did we improve performance even more? By fixing more bugs”.
Almost every problem in a principled, general-purpose machine learning system looks like a bug, not a missing feature. It looks like a bug when the system can’t differentiate between uncertainty (missing data) and ambiguity (contradicting data). It looks like a bug when the system loses good generalization during sequential fine-tuning due to parameter collapse. It looks like a bug when the scenes fed to the neural network during training center every object perfectly, hurting inference performance on non-centered objects. It looks like a bug when training data is broken apart into sentences while the inference data is a continuous stream. These types of “bugs” appear with similar frequency in Software 1.0 and 2.0 systems.
The reason these issues all end up looking like bugs (at least at first) is operational. Model behavior is analyzed, a behavior hurting performance is found, and a fix is demanded on a short schedule. The bad behavior is isolated, a fix is found and implemented, and things get better. Isolating and resolving these sorts of “bugs” ended up being much of the value we ML Platform developers added once our platforms reached a certain level of maturity. There seemed to be no real end to them.
The reason these continue to appear indefinitely is that this is just what normal ML development looks like in a Software 2.0 world. Each “fix” extends the system’s capabilities, allowing it to work in more situations with more types of datasets. Data developers, now that their project success is more consistent and reliable, grow more ambitious: they try new ways of labeling data, take on more challenging business problems, and put more pressure on the underlying ML capabilities, which in turn requires a new emergency “fix” to get things actually working.
The long-term advantages gained from this beneficial loop of ever more capable ML technology are why excellence and ambition in the ML Platform team are so valuable. System integrators who can merely pull ML research projects together into a working system will not be able to reliably identify root causes of system misbehavior or apply general-purpose fixes.
The first step in an ML process is to identify the semantic label schema. However, this labeling schema is too important and too challenging to rely on a one-time judgment at inception; course correction later in the project should be expected, and it enables consistent improvements in performance and utility. That said, continuous change is not necessarily beneficial either. Shorter-term commitments to particular labeling schemes are critical to enable labeling consistency and reduce second-guessing anxiety. Making the best judgments possible at key junctures, and backing those commitments until the next juncture, is a helpful pattern.
The importance of the careful creation of labeling schemas cannot be overstated. Here are some examples where a subtle change in label definition radically changed the outcome of a project:
Thus, even in modern machine learning with large, synthetic datasets, labeling and modeling choices are key drivers of end-project success. These decisions are also critical in smaller, data-scarce projects. In our experience with data scarcity, the following labeling characteristics are very helpful:
The central process of Software 2.0 (the bulk of the middle part of the project) is an “active learning” labeling/training loop. Active learning is when the current iteration of the model is used to more efficiently label and curate data for the next iteration of the model. This process can start immediately after an initial labeling strategy is determined.
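A minimal sketch of such an active-learning loop is shown below. `train`, `predict_with_confidence`, and `human_review` are hypothetical stand-ins for the platform's real training, inference, and labeling-UI interfaces, and the round/batch sizes are illustrative.

```python
# Minimal active-learning loop sketch: the current model proposes labels,
# a human verifies or corrects the least-confident ones, and only those
# reviewed examples are added to the (deliberately small) training set.
def active_learning_loop(unlabeled_pool, labeled_set, rounds=5, batch_size=200):
    model = train(labeled_set)  # hypothetical trainer interface
    for _ in range(rounds):
        # predict_with_confidence is assumed to return (predicted_label, confidence)
        scored = [(x, *predict_with_confidence(model, x)) for x in unlabeled_pool]
        # Review the least-confident predictions first; confident ones are
        # likely already handled well and would only bloat the dataset.
        scored.sort(key=lambda item: item[2])
        for x, predicted_label, _ in scored[:batch_size]:
            labeled_set.append((x, human_review(x, predicted_label)))  # hypothetical labeling UI
            unlabeled_pool.remove(x)
        model = train(labeled_set)
    return model, labeled_set
```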
At a basic level, this loop improves labeling speed, because model predictions are often accurate and can simply be verified. However, we have found that the active learning loop can be much more than a labeling-speed improvement: model predictions allow Data Managers to understand what the model is already good at and avoid adding data the model already handles with high accuracy, resulting in smaller, easier-to-modify datasets. To explain in more detail, here is a high-level description of this training process:
To maximize the value of active learning processes, the following objectives should be kept in mind:
In Deep Learning, much hype is made about “End to End” learning, that is, a model that takes in raw data and outputs actions that drive a fully automated system, like a self-driving car. The benefit is that humans do not need to design the intermediate representations; these representations can be learned via backpropagation, and thus these representations can be much more informative and higher-dimensional than a human can visualize or review for accuracy.
In practice, end-to-end training is still impractical in many concrete applications, due to a lack of data, computational limitations, and optimization challenges (overfitting, instability, etc). In these systems, the low-dimensional, highly structured, human reviewable intermediate representations become an advantage that allows humans to identify and fix problems with the system. This results in a system with one or more independently trained components, such as a semantic vision system, audio processing, and perhaps certain complex control components, with the rest handled with more handcrafted logic.
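To make the shape of such a mixed system concrete, here is a hedged sketch: a trained Software 2.0 component emits a small, structured, human-reviewable representation, and handcrafted Software 1.0 logic makes the final call. `detect_objects` is a stand-in for whatever trained model a project actually uses, and the class names and thresholds are illustrative.

```python
# Sketch of a mixed Software 1.0 / 2.0 pipeline: the learned component emits
# a small, structured, reviewable representation; plain code decides what to do.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # human-readable class name
    confidence: float   # 0.0 - 1.0
    bbox: tuple         # (x, y, w, h) in pixels

def quality_control_decision(image) -> str:
    detections = detect_objects(image)  # hypothetical trained Software 2.0 component
    # Handcrafted Software 1.0 logic on top of the reviewable representation.
    defects = [d for d in detections if d.label == "defect" and d.confidence > 0.5]
    if any(d.confidence > 0.9 for d in defects):
        return "reject"
    if defects:
        return "flag_for_human_review"
    return "accept"
```

Because the intermediate `Detection` list is low-dimensional and human-readable, a Data Manager can inspect exactly where the learned component went wrong when the final decision is bad.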
In such a mixed Software 1.0/2.0 system, it becomes important to evaluate whole system performance as best as possible. Whole system evaluation allows discovery of compounding failure cases, de-prioritization of self-correcting issues (when failures in one sub-system are corrected by another), and other cross-component concerns.
End to end evaluation is also practical much more often than end-to-end training, requiring much less data. It is also very valuable, as good understanding of these cross-component concerns ends up being invaluable in making good decisions at key points during the development process, and so building out test sets is valuable, even if expensive.
For ML inference systems that expect a human to review results comprehensively, for example in medical diagnosis, true “end to end” evaluation might involve a human re-reviewing the results. While this is important to evaluate, running such a test has a high marginal cost, as opposed to the high fixed cost of setting up a fully automated test.
Keeping permanent employees available to run these tests for the entire project duration is a good option, and is almost certainly necessary in the most complex projects (the AlphaGo project had a professional Go player on staff to probe the AI for weaknesses, for example).
However, another option that can allow for cheaper but imperfect end-to-end testing on simpler projects is to simulate the human with another model that is trained on human expert actions. While not perfect, and not always advisable, general trends in results are likely to correlate between real human experts and simulated human experts.
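As a rough illustration, the simulated reviewer can be as simple as a classifier fit on logged expert accept/override decisions over the system's past outputs. The sketch below uses scikit-learn with hypothetical logged feature vectors; it only aims to capture trend-level agreement, not to replace real expert review.

```python
# Sketch: approximate the human expert with a simple model trained on logged
# expert accept/override decisions, for cheap (if imperfect) end-to-end tests.
from sklearn.linear_model import LogisticRegression

def fit_simulated_reviewer(logged_features, logged_expert_decisions):
    """logged_features: per-case summaries of system output (e.g. confidences, counts);
    logged_expert_decisions: 1 if the expert accepted the system's result, else 0."""
    reviewer = LogisticRegression(max_iter=1000)
    reviewer.fit(logged_features, logged_expert_decisions)
    return reviewer

def simulated_accept_rate(reviewer, candidate_features) -> float:
    # Trend-level signal only: how often would the simulated expert accept
    # the new system's outputs, compared against the old system's rate?
    return reviewer.predict(candidate_features).mean()
```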
While metrics are less important in Software 2.0, as mentioned earlier they can still provide significant value when they can be used to improve the dataset or make key decisions. The key characteristics that make a good metric include:
The benefits of these principles apply in both end-to-end evaluation and single-component evaluation; however, tracing and acting on errors is harder in end-to-end evaluation, which is where single-component evaluation shines.
Those experienced with ML know that people and project management aren’t everything. Good people can create an awesome model, and still ultimately fail to provide anything of durable value.
These failures can usually be tracked to two parts of projects: The very beginning (project choice) and the very end (long-term maintenance).
Developing Software 2.0 can be a difficult, expensive, and slow operation to undertake, and the resulting product must be well targeted and highly valuable. In the end, all ML projects without a strong business case eventually fail. ML projects require significant maintenance, computer infrastructure, and support, and need durable sources of income to support operations. Given these costs, driving projects from the business side, and working backwards, can be a much smoother process than trying to find applications for novel ML technologies. The entire Software 2.0 paradigm is about paving smoother, more consistent paths to business automation success, depending more on team quality and commitment, and less on luck and timing with experimental technologies.
While no complete guide on Software 2.0 products can be assembled, as there is limitless room for creativity and innovation, some decent principles for those just starting out in Software 2.0 are:
As you might have guessed, “data-centric” software development requires a fair bit of data. This means that project choice needs to include the prevalence of data as a variable. However, the data question is more nuanced than simply “lots” versus “little”.
Ultimately, it is very hard to train a model that automatically generalizes across domains or wide gaps in image specification. Some principles:
Even with decent data domain equivalence, sometimes you are trying to find true edge cases. A literal needle in a haystack, in industrial quality control. Or a screw. Or a piece of plastic. Or any other imaginable object. In pathology, the main one or two types of cancer cause 95%+ of cases, but there are hundreds of types of cancer, any of which can conceivably be present, each occurring with exponentially decreasing frequency. Do you have access to decent coverage of the most important edge cases? If not, then you should not attempt to build a data-centric product.
However, it might not be all bad. There are several techniques for maximizing the use of the data you do have in order to cover the rarest edge cases.
There are two main approaches to this:
Machine learning products can also fail at the end of the project lifecycle—long term maintenance. In particular, the famous data-drift problem is a major issue.
For example, let’s say you are making a street car vision system. In a year, there will be new models of cars on the road that your system has never seen. In two years, the cameras will be replaced with new ones that will inevitably have different settings and resolutions. These changes are “data drifts” which will degrade the performance of any deep learning model. How will your product keep up?
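One common, label-free way to catch this kind of drift early is to compare the distribution of the model's confidence scores in production against a reference window captured around release time. A minimal sketch, assuming you already log per-prediction confidence scores (the function name and threshold are illustrative):

```python
# Minimal drift check: compare production confidence scores against a
# reference window with a two-sample Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp

def confidence_drift_detected(reference_scores, production_scores, alpha=0.01) -> bool:
    """Both inputs are 1-D arrays of model confidence scores."""
    statistic, p_value = ks_2samp(reference_scores, production_scores)
    return p_value < alpha  # a small p-value suggests the distributions differ
```

The same comparison can be run on embedding statistics or on input properties (resolution, exposure) to localize which part of the pipeline drifted.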
This issue is particularly important since most real-world AI is built in a very fast-moving hardware landscape. Think about how much the hardware on these devices has changed in the last 10 years:
If this data-drift problem is not addressed, your product will lose value very fast. The increasingly ML-aware business community knows this and will demand a strategy up-front.
Before a data drift can be mitigated, it must be detected. This is the idea of “Continuous Monitoring”, where the performance of the system is tracked in production after release. However, unlike dev-ops, where errors can simply be tracked, it is not always easy to monitor the performance of an ML system: an “error” is not always well specified, and even when it is, it is often the case that no one is going around entering errors into the system in production. Some ideas for implementing continuous monitoring in these label-scarce production scenarios include:
Complementary to Continuous Monitoring is continuous data collection, for training/test sets. The need for new data to solve new problems is why it is important to try to get the raw data that triggered the error, and not just the dissatisfaction report.
However, just accumulating data is not enough. This data must be reviewed, curated, and incorporated into the main training dataset to effectively and continuously improve the model’s quality.
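As a rough illustration of the "capture the raw data, not just the complaint" idea, an error report can carry a pointer to the triggering input and the model output, so it can later be routed into the Data Manager's review queue. The helper below is a hypothetical sketch, not any particular platform's API.

```python
# Sketch: attach the raw input and model output to every dissatisfaction
# report so it can be reviewed and curated into the training set later.
import json, os, time, uuid

def file_error_report(raw_input_path: str, model_output: dict, user_comment: str,
                      queue_dir: str = "review_queue") -> str:
    os.makedirs(queue_dir, exist_ok=True)
    report = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "raw_input_path": raw_input_path,  # keep the data, not just the complaint
        "model_output": model_output,
        "user_comment": user_comment,
        "status": "pending_review",
    }
    path = os.path.join(queue_dir, f"{report['id']}.json")
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return path
```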
Once an issue has been identified and fixed, it needs to be redeployed. Crucially, users will want assurance that the new model will be strictly better, and not worse in certain situations. The typical way to prevent these regressions is to have a continuously growing automated test set that checks all known data distributions, past and present, for regressions. Maintaining a good regression test suite can have the following difficulties:
One challenge here is simply categorizing the different data distributions in this test set. This is not a trivial matter. If these distributions are all lumped together, improvements in new datasets can outweigh serious regressions in old data distributions that might still be used by certain customers. If the distributions are separated out too finely, then they might be noisy, and not cover important cases. A reasonable separation strategy is key.
Duplicating data examples across all the test classes they could plausibly represent is a good strategy for keeping test sets both large and highly stratified. For example, if you have a new camera and it is taking photos in a new country, its data can belong to two test sets: the one for the new country and the one for the new camera. This eliminates the need for a combinatorial explosion of test classes.
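A small sketch of how this multi-tag scheme can look in evaluation code, reporting metrics per tag so that gains on one distribution cannot hide regressions on another; the tag names and helper signature are illustrative.

```python
# Sketch: evaluate per distribution tag so a big win on new data cannot
# hide a regression on an older distribution a customer still relies on.
from collections import defaultdict

def per_tag_accuracy(examples, predict):
    """examples: iterable of (input, ground_truth, tags), where tags is a list
    like ["camera_v2", "country_br"]; predict: the model's inference function."""
    correct, total = defaultdict(int), defaultdict(int)
    for x, truth, tags in examples:
        hit = predict(x) == truth
        for tag in tags:  # one example can count toward many tags
            total[tag] += 1
            correct[tag] += int(hit)
    return {tag: correct[tag] / total[tag] for tag in total}
```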
Sometimes, a model change is not purely an accuracy improvement. It comes with a significant semantic change in how objects are interpreted by the model, in an effort to improve the end-product. These changes usually come in the form of finer-grained distinctions, more classes, higher-dimensional grading, etc. Under the new semantic definitions, old ground truth labels in the regression test data might be incorrect or incoherent, leaving old regression tests useless for evaluating new models.
As painful as it might sound, this semantically misaligned data should really be re-labeled to the new standards of truth. Keeping test sets small and balanced is key to reducing the pain experienced in this situation.
In higher-risk and/or regulated industries, clients will want to have a very good grasp of the risk profile of any change. They might not trust your regression test set and might insist on running their own testing before accepting any change to something as hard to analyze as a deep learning model. Even worse, regulators and watchdogs might also want to see formal third-party studies done on any significant changes. Interestingly, even if no such process exists, “standard” validation practices will sprout up very fast.
As most AI businesses get off the ground, surrounded by technical challenges, staffing challenges, and relationship challenges with partners and clients, they will likely trend towards accepting the rules and restrictions imposed on them. Unfortunately, this is not a good long-term strategy. The winners in any competitive AI market will be those who improve the fastest. If model releases are delayed (think years of delay in the worst cases), key feedback from production will be missing, slowing improvements. Repeated model mistakes that are not rapidly fixed will decrease trust in your AI. The fact that the same person holding back your release might also be the one demanding a fix to a problem is ultimately irrelevant to this loss of trust.
The key principles involved in building trust in your release process include: