How to solve AI’s reproducibility crisis

Society will never fully trust artificial intelligence unless its actions can be shown to produce reasonably repeatable results in line with what its developers claim


Reproducibility is often trampled underfoot in AI’s rush to results. And the movement to agile methodologies may only exacerbate AI’s reproducibility crisis. Without reproducibility, you can’t really know what your AI system is doing or will do, and that’s a huge risk when you use AI for any critical work, from diagnosing medical conditions to driving trucks to screening for security threats to managing just-in-time production flows.

Data scientists’ natural inclination is to skimp on documentation in the interest of speed when developing, training, and iterating machine learning, deep learning, and other AI models. But reproducibility depends on knowing the sequence of steps that produced a specific data-driven AI model, process, or decision.

Reproducibility falls apart if the data scientists who built an AI model failed to follow a repeatable approach to their work or to document what they actually did in precise detail. In those scenarios, neither the model’s original developers nor anyone else can be confident that its findings can be reproduced at a later date.

The reproducibility issues multiply as the underlying AI pipeline platforms (including modeling frameworks, hardware accelerators, and distributed data lakes) evolve on every level, thereby reducing the feasibility of standing up a precise replica of the original platform for later cross-validation.

Shared devops platforms can ensure reproducibility

To ensure that reproducibility isn’t undermined by agile methods, data science teams should perform all their work on shared devops platforms. Those platforms, now offered by dozens of vendors, enable AI development teams to maintain trustworthy audit trails of the specific processes data science professionals used to develop their AI deliverables. Data-science devops tools maintain rich repositories of associated metadata and log precisely which data, models, code, and other artifacts were used in the context of a particular process or decision. They also automate the following AI pipeline functions (a minimal sketch of the kind of per-run record involved appears after the list):

  • Logging of every step in the process of acquiring, manipulating, modeling, and deploying data-driven analytics.
  • Versioning of all data, models, hyperparameters, metadata, and other artifacts at all stages in the development pipeline.
  • Retention of archival copies of all data sets, plots, scripts, tools, random seeds, and other artifacts used in every iteration of the modeling process.
  • Generation of detailed narratives that spell out how each step contributed to analytical results achieved in every iteration.
  • Accessibility and introspection at the project level by independent parties of every script, run, and result.
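
Platforms differ in the details, but the heart of such an audit trail is a per-run manifest that ties data fingerprints, code versions, hyperparameters, and random seeds to each result. Here is a minimal sketch in plain Python; it is not tied to any vendor’s tool, and the function and file names (log_run, run_manifest.json) are illustrative assumptions rather than a standard interface.

    # Minimal sketch of the per-run record a data-science devops platform automates.
    # Names such as log_run and run_manifest.json are illustrative, not a vendor API.
    import hashlib, json, platform, subprocess, sys
    from datetime import datetime, timezone

    def sha256_of(path):
        """Fingerprint a data set or model artifact so later runs can verify it."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def log_run(data_path, hyperparams, seed, out="run_manifest.json"):
        """Write one auditable, versionable record per training run."""
        manifest = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "git_commit": subprocess.run(
                ["git", "rev-parse", "HEAD"], capture_output=True, text=True
            ).stdout.strip(),
            "python_version": sys.version,
            "platform": platform.platform(),
            "data_sha256": sha256_of(data_path),
            "hyperparams": hyperparams,
            "random_seed": seed,
        }
        with open(out, "w") as f:
            json.dump(manifest, f, indent=2)
        return manifest

    # Example: log_run("train.csv", {"lr": 0.01, "epochs": 20}, seed=42)

Captured routinely, a record like this is what lets an independent reviewer stand up the same data, code, and seed months later.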

Why you also need to watch standard neural-net practices

But even if they rigorously adopt devops platforms and practices in their work, data scientists may still see results vary from run to run. Even when two data scientists work from the same training data and the same machine-learning model versions, the following standard neural-net modeling practices can introduce run-to-run variability and thus compromise reproducibility. (A sketch of how to pin down these sources of randomness follows the list.)

  • Initialization: Models approach a useful result when their outputs converge on a specific target value. Accelerating that convergence often requires setting a model’s initial weights by sampling from a particular distribution rather than initializing them all to zero. But that sampling introduces randomness into the initialization of each run, so researchers may inadvertently deviate from strict reproducibility when they try to verify someone else’s findings.
  • Ordering: Models’ ability to learn the underlying patterns in a data set may be degraded by the order of the observations to which they’re exposed during training. Thus, data scientists have learned that it’s often a good practice to randomly shuffle a data set before each training run. But that practice also deviates from strict reproducibility of findings across runs.
  • Layering: Models may include layers and nodes with inherent randomness, such as dropout, which reduces overfitting by excluding some neural-network nodes, both hidden and visible, from any particular training pass. This typically results in the same input sample producing different layer activations from run to run, which deviates from strict reproducibility.
  • Migration: Model libraries can have subtly different behaviors from version to version, even if the front-end API and the training data set remain unchanged when migrating a model from one framework to another. This can compromise reproducibility if subsequent machine-learning developers can’t roll back to the back-end modeling frameworks used to generate a particular result.
  • Execution: Model code may not produce consistent computations across runs unless it is executed on the same versions and configurations of back-end hardware accelerators, such as GPUs, because the hardware implementation of a given operation, such as convolution, can change between versions. This undermines reproducibility for subsequent researchers who aren’t aware of these hardware dependencies and thus can’t disable the changed operations without compromising the scope of their testing.
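
Many of these sources of run-to-run variability can be pinned down explicitly, at some cost in training speed. The sketch below assumes PyTorch and NumPy; the exact controls vary by framework and version, and full GPU determinism may also require extra environment settings on some CUDA releases, so treat it as a starting point rather than a guarantee.

    # Sketch of pinning the sources of randomness listed above; assumes PyTorch and NumPy.
    import random
    import numpy as np
    import torch

    SEED = 42

    # Initialization and layering: seed every RNG that feeds weight sampling and dropout.
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # Ordering: shuffle the training data with a seeded generator so the shuffle repeats.
    g = torch.Generator()
    g.manual_seed(SEED)
    # loader = torch.utils.data.DataLoader(dataset, shuffle=True, generator=g)

    # Execution: prefer deterministic GPU kernels over faster nondeterministic ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)  # errors if an op lacks a deterministic path

    # Layering at inference time: model.eval() disables dropout, so the same input
    # produces the same activations from run to run.

Even with every seed pinned, migrating to a different framework, driver, or accelerator version can still shift results, which is why recording those versions in the run manifest described earlier matters as much as the seeds themselves.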

Don’t skimp on the audit trail

If data scientists ensure that their work meets consistent standards of transparency, auditability, and accessibility, they make it easier for others to spot these reproducibility issues. For all these reasons, data scientists should always ensure that agile methods leave a sufficient audit trail for independent verification, even if their peers (or compliance specialists) never choose to take them up on that challenge.

However, as I’ve just spelled out, some of those issues may be due to arcane nuances of modeling, training, cross-framework portability, and back-end hardware dependency. As a result, they may not be easily spotted in peer reviews by other data scientists, unless those reviews include the geekiest algorithm, framework, and hardware specialists you can find. And it will be increasingly difficult to find such people when you need them, at any price, in a world of democratized AI development.

Reproducibility will remain a daunting challenge for AI professionals. But it will keep resurfacing and growing in importance for all AI stakeholders, because society will never fully trust the automated actions taken by these embedded intelligent assets unless they can be shown to produce reasonably repeatable results in line with what their developers claim.

Copyright © 2018 IDG Communications, Inc.