5 Python distributions for mastering machine learning

From bare-bones to full-blown, learn which edition of Python is best for your machine learning projects


If you’re doing work in statistics, data science, or machine learning, the odds are high you’re using Python. And for good reason, too: The rich ecosystem of libraries and tooling, and the convenience of the language itself, make Python an excellent choice.

But which Python? There are a number of distributions of the language, and each one has been created along different lines and for different audiences. Here we’ve detailed five Python incarnations, from the most generic to the most specific, with details about how they stack up for handling machine learning jobs.

Anaconda Python

Anaconda has come to prominence as a major Python distribution, not just for data science and machine learning but for general-purpose Python development as well. Anaconda is backed by a commercial provider of the same name (formerly Continuum Analytics) that offers support plans for enterprises.

The Anaconda distro provides, first and foremost, a Python distribution outfitted with easy access to the packages often used in data science: NumPy, Pandas, Matplotlib, and so on. They’re not simply bundled with Anaconda, but available via a custom package management system called Conda. Conda-installed packages can include tricky external binary dependencies that couldn’t be managed through Python’s own Pip. (Note that you can still use Pip if you want to, but you won’t get the benefits that Conda provides for those packages.) Each package is kept up to date by Anaconda, and many of them are compiled with the Intel MKL extensions for speed.
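Whichever installer you use, you can confirm from inside Python which of these packages are actually present. The following is a minimal sketch that relies only on the standard library. The helper names and the package list are illustrative choices rather than anything Anaconda ships, and the conda-meta check is a common heuristic for spotting a Conda environment, not an official API.

```python
import os
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def looks_like_conda_env(prefix=sys.prefix):
    """Heuristic (an assumption, not an official API): Conda environments
    keep their package records in a conda-meta directory under the prefix."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if __name__ == "__main__":
    print("Conda-style environment:", looks_like_conda_env())
    # Illustrative package list; swap in whatever your project needs.
    for name in ("numpy", "pandas", "matplotlib"):
        print(name, installed_version(name) or "not installed")
```

The output depends entirely on the environment the script runs in, so treat it as a quick sanity check rather than a substitute for the package manager's own listing.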

The other major advantage Anaconda confers is a graphical environment, the Anaconda Navigator. The Navigator isn’t an IDE, but rather a convenient GUI front end for Anaconda features including the Conda package manager and user-configured virtual environments. You can also use Navigator to manage third-party applications such as Jupyter notebooks and the Visual Studio Code IDE.

A minimal install of Anaconda, called Miniconda, installs only enough of the Anaconda base to get you started, but can be expanded with other Conda- or Pip-installed packages as you need them. This is useful if you want to take advantage of Anaconda’s rich gamut of libraries, but need to keep things lean.

ActivePython

Data science is just one of the use cases for ActivePython, which was designed to serve as a professionally supported edition of the language with consistent implementations across architectures and platforms. This helps if you’re using Python for data science on platforms like AIX, HP-UX, and Solaris, as well as Windows, Linux, and macOS.

ActivePython tries to stick as closely as possible to Python’s original reference incarnation. Instead of a special installer for complex math-and-stats packages (the Anaconda approach), ActivePython pre-compiles many of those packages, using the Intel MKL extensions where needed, and provides them as pack-ins with the default installation of ActivePython. They don’t have to be formally installed; they’re available right out of the box.

However, if you want to upgrade to a newer version of those precompiled packages, you will need to wait until the next build of ActivePython itself comes out. This makes ActivePython more consistent as a whole—a valuable thing to have when the reproducibility of results matters—but also less flexible.

CPython

If you want to begin your machine learning work from scratch, using nothing but the official, plain-vanilla version of Python, pick CPython. So named because it is the reference edition of the Python runtime written in C, CPython is available from the Python Software Foundation website, and provides only the tools needed to run Python scripts and manage packages.

CPython makes sense if you want to custom-build a Python environment for a machine learning or data science project, you trust yourself to do it right, and you don’t want any third-party alterations getting in the way. The source for CPython is readily available, so you can even custom-compile any alterations you might want to make for the sake of speed or project needs.

On the other hand, using CPython means you will have to deal with the ins and outs of installing and configuring packages like NumPy, with all of their dependencies—some of which have to be hunted down and added manually.

Some of this work has become less burdensome over the past few years, especially now that Python’s Pip package manager elegantly installs precompiled binaries of the kind used in many data science packages. But there are still many cases, especially on Microsoft Windows, where you’ll have to fit all the pieces together by hand—for instance, by manually installing a C/C++ compiler.
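The precompiled binaries Pip installs are distributed as wheel files, and a wheel's filename indicates whether it is pure Python or built for a specific platform. Here is a rough sketch that splits a wheel filename into its tags, following the standard wheel naming convention (name-version[-build]-python-abi-platform.whl). The function names are hypothetical, and the sketch assumes that a platform tag other than "any" typically signals compiled code.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Five fields normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name: {filename}")
    # The last three fields are always the Python, ABI, and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


def is_platform_specific(filename):
    """True if the wheel targets a particular platform rather than 'any'."""
    return parse_wheel_filename(filename).platform_tag != "any"


if __name__ == "__main__":
    # Illustrative filenames in the standard format.
    for name in ("numpy-1.26.4-cp312-cp312-win_amd64.whl",
                 "six-1.16.0-py2.py3-none-any.whl"):
        print(name, "->", is_platform_specific(name))
```

If no wheel matches your interpreter and platform, Pip falls back to a source distribution, which is where the need for a local compiler comes in.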

Another drawback to using CPython is that it does not use any of the performance-accelerating options useful in machine learning and data science, such as Intel’s Math Kernel Library (MKL) extensions. You’d have to build the NumPy and SciPy libraries to use Intel MKL all on your own.
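To find out which math libraries a given NumPy build was linked against, you can inspect the build configuration that NumPy itself reports. The sketch below captures the text printed by numpy.show_config() and searches it for the string "mkl". The helper names are made up for illustration, and a plain substring search is a crude check that assumes MKL-linked builds mention the library by name.

```python
import contextlib
import io


def mentions_mkl(config_text):
    """Return True if a build-configuration dump refers to Intel MKL."""
    return "mkl" in config_text.lower()


def numpy_build_config():
    """Capture what numpy.show_config() prints, or return None if NumPy is missing."""
    try:
        import numpy
    except ImportError:
        return None
    buffer = io.StringIO()
    # show_config() prints to standard output, so redirect it into a buffer.
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    config = numpy_build_config()
    if config is None:
        print("NumPy is not installed")
    else:
        print("Build configuration mentions MKL:", mentions_mkl(config))
```

The same check works on any distribution covered here, which makes it a simple way to compare what each one actually delivers.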

Enthought Canopy

The Enthought Canopy distribution of Python resembles Anaconda in many ways. It is constructed with data science and machine learning as its primary use cases, comes with its own curated package index, and provides both graphical front ends and command-line tools for managing the whole setup. Enterprise users can also purchase the Enthought Deployment Server, a behind-the-firewall package management system. Machine learning packages built for Canopy use the Intel MKL extensions.

The main difference between Anaconda and Canopy is scope. Canopy is more modest, Anaconda more comprehensive. For instance, whereas Canopy includes command-line tools for creating and managing Python virtual environments (useful when dealing with different sets of packages for different machine learning workflows), Anaconda provides a GUI for that job. On the other hand, Canopy also includes a handy built-in IDE—a combination file browser, Jupyter notebook, and code editor—that is useful for jumping right in and getting to work without fuss.
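For a sense of what these environment tools do under the hood, here is a minimal sketch that creates a virtual environment with Python's standard-library venv module. It does not use Canopy's or Anaconda's own tooling, and the helper name and the directory name ml-project-env are hypothetical.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(target_dir):
    """Create a bare virtual environment and return the path to its config file."""
    # with_pip=False keeps creation fast and offline; pass True to bootstrap Pip.
    venv.create(target_dir, with_pip=False)
    return Path(target_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    # Build the environment in a scratch directory that is removed afterward.
    with tempfile.TemporaryDirectory() as scratch:
        config_file = create_environment(Path(scratch) / "ml-project-env")
        print(config_file.read_text())
```

Each environment gets its own site-packages directory, which is what lets different machine learning workflows keep different sets of packages side by side.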

WinPython

The original mission behind WinPython was to provide an edition of Python built specifically for Microsoft Windows. Back when CPython builds for Windows were not especially robust, WinPython filled a useful niche. Today, CPython’s Windows edition is quite good, and WinPython has turned toward filling cracks still not paved over by CPython—especially for data science and machine learning applications.

By default, WinPython is portable. The entire WinPython distribution fits into a single directory that can be placed anywhere and run anywhere. A WinPython installation can be delivered as an archive or on a USB drive, preinstalled with all the environment variables, packages, and scripts needed for a given job. It’s a useful way to pack up all that’s needed to train a particular model or reproduce a specific data experiment. Or you can register a WinPython installation with Windows and run it as if it had been natively installed (and unregister it later, if you wish).

Many of the trickier elements of a machine-learning-centric Python distribution are also covered. Most of the key libraries—NumPy, Pandas, Jupyter, and interfaces to the R and Julia languages—are included by default and built against the Intel MKL extensions where relevant. The Mingw64 C/C++ compiler also comes packaged with NumPy in WinPython, so that binary Python extensions can be built from source (for instance, by way of Cython) without having to install a compiler.

WinPython has its own package installer, WPPM, which handles packages that come with prebuilt binaries as well as pure-Python packages. And for those who just want a bare-bones version of WinPython with no packages included by default, WinPython offers a “zero version,” along the same lines as Anaconda’s Miniconda.

Copyright © 2018 IDG Communications, Inc.