Setting Up Python for Machine Learning on Windows

This Post Was Originally Published on Real Python on Oct 31st, 2018 by Renato Candido.

Python has been widely used for numerical and scientific applications in recent years. However, to perform numerical computations efficiently, Python relies on external libraries that are often implemented in other languages. NumPy, for example, is implemented largely in C and can link against numerical libraries such as BLAS and LAPACK, which were originally written in Fortran.

Due to these dependencies, sometimes it isn’t trivial to set up an environment for numerical computations, linking all the necessary libraries. It’s common for people to struggle to get things working in workshops involving the use of Python for machine learning, especially when they are using an operating system that lacks a package management system, such as Windows.

In this article, you’ll:

  • Walk through the details for setting up a Python environment for numerical computations on a Windows operating system
  • Be introduced to Anaconda, a Python distribution proposed to circumvent these setup problems
  • See how to install the distribution on a Windows machine and use its tools to manage packages and environments
  • Use the installed Python stack to build a neural network and train it to solve a classic classification problem

Introducing Anaconda and Conda

Python’s standard package manager is pip, which installs and manages software packages written in Python. It was first released in 2008 and has shipped with Python itself since version 3.4. However, for numerical computations, there are several dependencies that are not written in Python, so the initial releases of pip could not solve the problem by themselves.

To circumvent this problem, Continuum Analytics released Anaconda, a Python distribution focused on scientific applications, and Conda, a package and environment management system used by the Anaconda distribution. It’s worth noting that more recent versions of pip can handle external dependencies using wheels, but, by using Anaconda, you’ll be able to install critical libraries for data science more smoothly. (You can read more on this discussion here.)

Although Conda is tightly coupled to the Anaconda Python Distribution, the two are distinct projects with different goals:

  • Anaconda is a full distribution of the software in the PyData ecosystem, including Python itself along with binaries for several third-party open-source projects. Besides Anaconda, there’s also Miniconda, which is a minimal Python distribution including basically Conda and its dependencies so that you can install only the packages you need, from scratch.

  • Conda is a package, dependency, and environment management system that could be installed without the Anaconda or Miniconda distribution. It runs on Windows, macOS, and Linux and was created for Python programs, but it can package and distribute software for any language. The main purpose is to solve external dependencies issues in an easy way, by downloading pre-compiled versions of software.

    In this sense, it is more like a cross-platform version of a general-purpose package manager such as APT or YUM, which helps to find and install packages in a language-agnostic way. Conda is also an environment manager, so if you need a package that requires a different version of Python, you can use Conda to set up a separate environment with a totally different version of Python while keeping your usual version of Python in your default environment.

There’s a lot of discussion regarding the creation of another package management system for the Python ecosystem. It’s worth mentioning that Conda’s creators pushed Python standard packaging to the limit and only created a second tool when it was clear that it was the only reasonable way forward.

Curiously, even Guido van Rossum, in his talk at the inaugural PyData meetup in 2012, said that, when it comes to packaging, “it really sounds like your needs are so unusual compared to the larger Python community that you’re just better off building your own.” (You can watch a video of this discussion.) More information about this discussion can be found here and here.

Anaconda and Miniconda have become the most popular Python distributions, widely used for data science and machine learning in various companies and research laboratories. They are free and open source projects and currently include 1400+ packages in the repository. In the following section, we’ll go through the installation of the Miniconda Python distribution on a Windows machine.

Installing the Miniconda Python Distribution

In this section, you’ll see step-by-step how to set up a data science Python environment on Windows. Instead of the full Anaconda distribution, you’ll be using Miniconda to set up a minimal environment containing only Conda and its dependencies, and you’ll use that to install the necessary packages.

The installation processes for Miniconda and Anaconda are very similar. The basic difference is that Anaconda provides an environment with a lot of pre-installed packages, many of which are never used. (You can check the list here.) Miniconda is minimalist and clean, and it allows you to easily install any of Anaconda’s packages.

In this article, the focus will be on using the command line interface (CLI) to set up the packages and environments. However, it’s possible to use Conda to install Anaconda Navigator, a graphical user interface (GUI), if you wish.

Miniconda can be installed using an installer available here. You’ll notice there are installers for Windows, macOS, and Linux, and for 32-bit or 64-bit operating systems. You should consider the appropriate architecture according to your Windows installation and download the Python 3.x version (at the time of writing this article, 3.7).

There’s no reason to use Python 2 on a fresh project anymore, and if you do need Python 2 on some project you’re working on, due to some library that has not been updated, it is possible to set up a Python 2 environment using Conda, even if you installed the Miniconda Python 3.x distribution, as you will see in the next section.

After the download finishes, you just have to run the installer and follow the installation steps:

  • Click on Next on the welcome screen.
  • Click on I Agree to agree to the license terms.
  • Choose the installation type and click Next. Another advantage of using Anaconda or Miniconda is that it is possible to install the distribution using a local account. (It isn’t necessary to have an administrator account.) If this is the case, choose Just Me. Otherwise, if you have an administrator account, you may choose All Users.
  • Choose the install location and click Next. If you’ve chosen to install just for you, the default location will be the folder Miniconda3 under your user’s personal folder. It’s important not to use spaces in the folder names in the path to Miniconda, since many Python packages have problems when spaces are used in folder names.
  • In Advanced Installation Options, the suggestion is to use the default choices, which are to not add Anaconda to the PATH environment variable and to register Anaconda as the default Python. Click Install to begin installation.
  • Wait while the installer copies the files.
  • When the installation completes, click on Next.
  • Click on Finish to finish the installation and close the installer.

As Anaconda was not included in the PATH environment variable, its commands won’t work in the Windows default command prompt. To use the distribution, you should start its own command prompt, which can be done by clicking on the Start button and on Anaconda Prompt under Anaconda3 (64 bit).

When the prompt opens, you can check whether Conda is available by running conda --version, which prints the installed Conda version.

To get more information about the installation, you can run conda info.
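If you’d rather run this check from a script, the following minimal sketch looks for the conda executable on the PATH and asks it for its version. The helper name conda_version is made up for this illustration, and the sketch assumes it is started from the Anaconda Prompt, since conda is not on the PATH of the default command prompt.

```python
import shutil
import subprocess


def conda_version():
    """Return the text printed by `conda --version`, or None if conda is not on PATH."""
    conda = shutil.which("conda")  # hypothetical helper; relies on PATH lookup only
    if conda is None:
        return None
    result = subprocess.run([conda, "--version"], capture_output=True, text=True)
    # Some versions print to stderr instead of stdout, so accept either stream.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    print(conda_version() or "conda not found; open the Anaconda Prompt first")
```

A result of None simply means the script was not started from a prompt where Conda is available.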

Now that you have Miniconda installed, let’s see how Conda environments work.

Understanding Conda Environments

When you start developing a project from scratch, it’s recommended that you use the latest versions of the libraries you need. However, when working with someone else’s project, such as when running an example from Kaggle or Github, you may need to install specific versions of packages or even another version of Python due to compatibility issues.

This problem may also occur when you try to run an application you’ve developed long ago, which uses a particular library version that does not work with your application anymore due to updates.

Virtual environments are a solution to this kind of problem. By using them, it is possible to create multiple environments, each one with different versions of packages. A typical Python setup includes Virtualenv, a tool to create isolated Python virtual environments, widely used in the Python community.

Conda includes its own environment manager and presents some advantages over Virtualenv, especially concerning numerical applications, such as the ability to manage non-Python dependencies and the ability to manage different versions of Python, which is not possible with Virtualenv. Besides that, Conda environments are entirely compatible with default Python packages that may be installed using pip.

Miniconda installation provides Conda and a root environment with a version of Python and some basic packages installed. Besides this root environment, it is possible to set up additional environments including different versions of Python and packages.

Using the Anaconda prompt, it is possible to check the available Conda environments by running conda env list.

Right after installation, the only environment listed is base, the root environment created by the Miniconda installer. It is possible to create another environment, named otherenv, by running conda create --name otherenv.

As Conda reports when the environment creation process finishes, it is possible to activate the otherenv environment by running conda activate otherenv. You’ll notice the environment has changed because its name appears between parentheses at the beginning of the prompt.

You can open the Python interpreter within this environment by running python.

The environment includes Python 3.7.0, the same version included in the root base environment. To exit the Python interpreter, just run quit().

To deactivate the otherenv environment and go back to the root base environment, you should run conda deactivate. (Older Conda releases on Windows used plain deactivate.)

As mentioned earlier, Conda allows you to easily create environments with different versions of Python, which is not straightforward with Virtualenv. To include a different Python version within an environment, you have to specify it by using python=<version> when running conda create. For example, to create an environment named py2 with Python 2.7, you have to run conda create --name py2 python=2.7.

As shown by the output of conda create, this time some new packages were installed, since the new environment uses Python 2. You can check that the new environment indeed uses Python 2 by activating it with conda activate py2 and running the Python interpreter.
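Besides reading the interpreter banner, you can ask Python itself which environment it is running from. The sketch below is only an illustration: the function name environment_summary is made up, and it relies on the CONDA_DEFAULT_ENV variable, which Conda sets when an environment is activated, so the value is None outside an activated environment.

```python
import os
import sys


def environment_summary():
    """Describe the interpreter that is running this code."""
    return {
        # Major and minor version, for example "2.7" or "3.7".
        "python": "%d.%d" % sys.version_info[:2],
        # Folder of the environment this interpreter belongs to.
        "prefix": sys.prefix,
        # Name of the active Conda environment, if any (None otherwise).
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


print(environment_summary())
```

The code avoids version-specific syntax, so it should run unchanged in both a Python 2.7 and a Python 3 environment.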

Now, if you run conda env list, you should see the two environments that were created, besides the root base environment.

In the list, the asterisk indicates the activated environment. It is possible to remove an environment by running conda remove --name <environment name> --all. Since it is not possible to remove an activated environment, you should first deactivate the py2 environment and then run conda remove --name py2 --all.
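To make the role of the asterisk concrete, here is a small sketch that parses text shaped like the output of conda env list. The sample text and its paths are illustrative assumptions, not captured output, and parse_env_list is a hypothetical helper written for this example.

```python
# Illustrative sample only: names and paths are assumptions, not real output.
SAMPLE_OUTPUT = """\
# conda environments:
#
base                     C:\\Users\\IEUser\\Miniconda3
otherenv                 C:\\Users\\IEUser\\Miniconda3\\envs\\otherenv
py2                   *  C:\\Users\\IEUser\\Miniconda3\\envs\\py2
"""


def parse_env_list(text):
    """Return (environment names, active environment) from `conda env list` text."""
    names, active = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        parts = line.split()
        names.append(parts[0])
        if "*" in parts[1:]:
            active = parts[0]  # the asterisk marks the activated environment
    return names, active


names, active = parse_env_list(SAMPLE_OUTPUT)
print(names)
print(active)
```

In this sample, py2 is the activated environment, so it would have to be deactivated before it could be removed.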

Now that you’ve covered the basics of managing environments with Conda, let’s see how to manage packages within the environments.

Understanding Basic Package Management With Conda

Within each environment, packages of software can be installed using the Conda package manager. The root base environment created by the Miniconda installer includes some packages by default that are not part of Python standard library.

The default installation includes the minimum packages necessary to use Conda. To check the list of installed packages in an environment, you just have to make sure it is activated and run conda list. Run in the root environment, this command shows the packages that are installed by default.

To manage the packages, you should also use Conda. Next, let’s see how to search, install, update, and remove packages using Conda.

Searching and Installing Packages

Packages are installed from repositories called channels by Conda, and some default channels are configured by the installer. To search for a specific package, you can run conda search <package name>. For example, to search for the keras package (a machine learning library), you run conda search keras.

The output lists the different versions of the package and the different builds for each version, such as for Python 3.5 and 3.6.

This search shows only exact matches for packages named keras. To perform a broader search, including all packages containing keras in their names, you should use the wildcard *, as in conda search *keras*.

The broader search shows that there are some other keras-related packages in the default channels.

To install a package, you should run conda install <package name>. By default, the newest version of the package will be installed in the active environment. So, let’s install the package keras in the environment otherenv that you’ve already created, by activating it with conda activate otherenv and running conda install keras.

Conda manages the necessary dependencies for a package when it is installed. Since the package keras has a lot of dependencies, Conda installs the whole list of required packages along with it.

It’s worth noticing that, since the keras package’s newest build uses Python 3.6 and the otherenv environment was created using Python 3.7, the package python version 3.6.6 was included as a dependency. After confirming the installation, you can check that the Python version for the otherenv environment is downgraded to the 3.6.6 version.

Sometimes, you don’t want packages to be downgraded, and it would be better to just create a new environment with the necessary version of Python. To check the list of new packages, updates, and downgrades necessary for a package without installing it, you should use the parameter --dry-run. For example, to check the packages that will be changed by the installation of the package keras, you should run conda install keras --dry-run.

However, if necessary, it is possible to change the default Python of a Conda environment by installing a specific version of the package python. To demonstrate that, let’s create a new environment called envpython by running conda create --name envpython.

As you saw before, since the root base environment uses Python 3.7, envpython is created including this same version of Python.

To install a specific version of a package, you can run conda install <package name>=<version>. For example, to install Python 3.6 in the envpython environment, you activate it and run conda install python=3.6.

In case you need to install more than one package in an environment, it is possible to run conda install only once, passing the names of the packages. To illustrate that, let’s install numpy, scipy, and matplotlib, basic packages for numerical computation, in the root base environment by running conda install numpy scipy matplotlib.
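The command patterns used so far can be summarized in a short sketch that only builds the argument lists, without running anything. The helper names spec, conda_create, and conda_install are hypothetical. The flags are the ones shown in this article. A list like these could be passed to subprocess.run from the Anaconda Prompt.

```python
def spec(name, version=None):
    """Build a package spec such as 'python=3.6' (or just 'keras')."""
    return name if version is None else "%s=%s" % (name, version)


def conda_create(env_name, *specs):
    """Arguments for `conda create --name <env> [<spec> ...]`."""
    return ["conda", "create", "--name", env_name] + list(specs)


def conda_install(*specs, dry_run=False):
    """Arguments for `conda install [--dry-run] <spec> [<spec> ...]`."""
    command = ["conda", "install"]
    if dry_run:
        command.append("--dry-run")  # report changes without installing
    return command + list(specs)


print(" ".join(conda_create("py2", spec("python", "2.7"))))
print(" ".join(conda_install("keras", dry_run=True)))
print(" ".join(conda_install("numpy", "scipy", "matplotlib")))
```

Building the list first makes it easy to review exactly what would be run, which is the same idea behind checking a change with --dry-run before confirming it.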

Now that you’ve covered how to search and install packages, let’s see how to update and remove them using Conda.

Updating and Removing Packages

Sometimes, when new packages are released, you need to update them. To do so, you may run conda update <package name>. In case you wish to update all the packages within one environment, you should activate the environment and run conda update --all.

To remove a package, you can run conda remove <package name>. For example, to remove numpy from the root base environment, you run conda remove numpy.

It’s worth noting that when you remove a package, all packages that depend on it are also removed.

Using Channels

Sometimes, you won’t find the packages you want to install on the default channels configured by the installer. For example, if you try to install pytorch, another machine learning package, by running conda install pytorch, Conda may report that the package is not available from the current channels.

In this case, you may search for the package here, for example by searching for pytorch.

The channel pytorch has a package named pytorch with version 0.4.1. To install a package from a specific channel, you can use the -c <channel> parameter with conda install, for example conda install -c pytorch pytorch.

Alternatively, you can add the channel, so that Conda uses it to search for packages to install. To list the current channels used, you can run conda config --get channels.

The Miniconda installer includes only the defaults channels. When more channels are included, it is necessary to set their priority to determine from which channel a package will be installed in case it is available from more than one channel.

To add a channel with the lowest priority to the list, you should run conda config --append channels <channel name>. To add a channel with the highest priority to the list, you should run conda config --prepend channels <channel name>. It is recommended to add new channels with low priority, to keep using the default channels prior to the others. So, alternatively, you can install pytorch by adding the pytorch channel with conda config --append channels pytorch and then running conda install pytorch.
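The effect of appending versus prepending a channel can be illustrated with a deliberately simplified model in which the first channel in the list that offers a package wins. This is only a sketch: the function names and the package sets are made up, and the real Conda solver also takes package versions and its channel priority setting into account.

```python
def append_channel(channels, name):
    """Model of `conda config --append channels <name>`: lowest priority."""
    return [c for c in channels if c != name] + [name]


def prepend_channel(channels, name):
    """Model of `conda config --prepend channels <name>`: highest priority."""
    return [name] + [c for c in channels if c != name]


def pick_channel(channels, available, package):
    """Return the first channel, in priority order, that offers the package."""
    for channel in channels:
        if package in available.get(channel, set()):
            return channel
    return None


# Hypothetical channel contents, for illustration only.
available = {
    "defaults": {"numpy", "keras"},
    "pytorch": {"pytorch", "numpy"},
}

channels = append_channel(["defaults"], "pytorch")
print(channels)
print(pick_channel(channels, available, "numpy"))
print(pick_channel(channels, available, "pytorch"))
```

With the new channel appended, a package offered by both channels still comes from defaults, while a package that only the new channel offers becomes installable.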

Not all packages are available on Conda channels. However, this is not a problem, since you also can use pip to install packages inside Conda environments. Let’s see how to do this.

Using pip Inside Conda Environments

Sometimes, you may need pure Python packages and, generally, these packages are not available on Conda’s channels. For example, if you search for unipath, a package to deal with file paths in Python, Conda won’t be able to find it.

You could search for the package here and use another channel to install it. However, since unipath is a pure Python package, you could use pip to install it, as you would do on a regular Python setup. The only difference is that you should use the pip installed by the Conda package pip. To illustrate that, let’s create a new environment called newproject. As mentioned before, you can do this by running conda create --name newproject.

Next, to have pip installed, you should activate the environment with conda activate newproject and install the Conda package pip with conda install pip.

Finally, use pip to install the package unipath by running pip install unipath.

After installation, you can list the installed packages with conda list and check that Unipath appears among them as a package installed using pip.
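One way to be sure that you’re using the pip that belongs to the active environment is to call it through the interpreter of that environment with python -m pip. The sketch below builds such a command. The helper name pip_install_command is an assumption made for this example, and nothing is installed when you run it.

```python
import sys


def pip_install_command(package):
    """Build a pip command tied to the interpreter running this code."""
    # sys.executable is the python of the active environment, so
    # `-m pip` uses the pip installed in that same environment.
    return [sys.executable, "-m", "pip", "install", package]


print(" ".join(pip_install_command("unipath")))
```

The printed path shows which environment would receive the package, which is useful when several environments exist side by side.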

It’s also possible to install packages from a version control system (VCS) using pip. For example, let’s install supervisor, version 4.0.0dev0, available in a Git repository. As Git is not installed in the newproject environment, you should install it first by running conda install git.

Then, use pip to install supervisor from the Git repository.

After the installation finishes, you can see that supervisor is listed in the installed packages list.

Now that you know the basics of using environments and managing packages with Conda, let’s create a simple machine learning example to solve a classic problem using a neural network.

A Simple Machine Learning Example

In this section, you’ll set up the environment using Conda and train a neural network to function like an XOR gate.

An XOR gate implements the digital logic exclusive OR operation, which is widely used in digital systems. It takes two digital inputs, each of which can be 0, representing a digital false value, or 1, representing a digital true value. It outputs 1 (true) if the inputs are different and 0 (false) if the inputs are equal. The following table (referred to as a truth table in digital systems terminology) summarizes the XOR gate operation:

Input A    Input B    Output: A XOR B
0          0          0
0          1          1
1          0          1
1          1          0

The XOR operation can be interpreted as a classification problem, given that it takes two inputs and should classify them in one of two classes represented by 0 or 1, depending on whether the inputs are equal to each other or different from one another.
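As a quick illustration, the same truth table can be generated with Python’s built-in ^ operator, which computes the exclusive OR of two integers. The variable name truth_table is chosen just for this sketch.

```python
from itertools import product

# Each row is (input A, input B, A XOR B).
truth_table = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

for a, b, output in truth_table:
    print(a, b, output)
```

The first two columns are the inputs that the neural network will receive, and the last column is the class it should learn to predict.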

It is commonly used as a first example to train a neural network because it is simple and, at the same time, demands a nonlinear classifier, such as a neural network. The neural network will use only the data from the truth table, without knowledge about where it came from, to “learn” the operation performed by the XOR gate.

To implement the neural network, let’s create a new Conda environment, named nnxor, by running conda create --name nnxor.

Then, let’s activate it with conda activate nnxor and install the package keras with conda install keras.

keras is a high-level API that makes it easy to implement neural networks on top of well-known machine learning libraries, such as TensorFlow.

You’ll train the following neural network to act as an XOR gate:

Neural Network to Implement XOR

The network takes two inputs, A and B, and feeds them to two neurons, represented by the big circles. Then, it takes the outputs of these two neurons and feeds them to an output neuron, which should provide the classification according to the XOR truth table.

In brief, the training process consists of adjusting the values of the weights w_1 through w_6 so that the output is consistent with the XOR truth table. To do so, input examples are fed one at a time, the output is calculated according to the current values of the weights, and, by comparing the output with the desired output given by the truth table, the values of the weights are adjusted in a step-by-step process.
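To make the calculation of the output concrete, here is a minimal sketch of how such a network could turn two inputs into an output using six weights and sigmoid activations. The way w_1 through w_6 are assigned to the connections is an assumption made for this illustration, and bias terms are left out to keep it short.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(a, b, weights):
    """Compute the network output for inputs a and b.

    Assumed layout: w1, w2 feed the first hidden neuron, w3, w4 feed the
    second one, and w5, w6 connect the hidden neurons to the output.
    """
    w1, w2, w3, w4, w5, w6 = weights
    h1 = sigmoid(w1 * a + w2 * b)
    h2 = sigmoid(w3 * a + w4 * b)
    return sigmoid(w5 * h1 + w6 * h2)


# Arbitrary, untrained weights: the outputs are not expected to match XOR yet.
weights = [0.5, -0.5, 0.3, 0.8, -1.0, 1.0]
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, forward(a, b, weights))
```

Training consists of nudging the six numbers in weights until these outputs get close to the values in the truth table.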

To organize the project, you’ll create a folder named nnxor within your Windows user folder (C:\Users\IEUser) with a file named nnxor.py to store the Python program that implements the neural network.

In the nnxor.py file, you’ll define the network, perform the training, and test it, as described step by step in the following paragraphs.

First, you import numpy, initialize a random seed, so that you can reproduce the same results when running the program again, and import the keras objects you’ll use to build the neural network.

Then, you define an X array, containing the 4 possible A-B sets of inputs for the XOR operation and a y array, containing the outputs for each of the sets of inputs defined in X.

The next five lines define the neural network. The Sequential() model is one of the models provided by keras to define a neural network, in which the layers of the network are defined in a sequential way. Then you define the first layer of neurons, composed of two neurons fed by two inputs, and set their activation function to a sigmoid function. Finally, you define the output layer, composed of one neuron with the same activation function.

The following two lines define the details about the training of the network. To adjust the weights of the network, you’ll use the Stochastic Gradient Descent (SGD) with the learning rate equal to 0.1, and you’ll use the mean squared error as a loss function to be minimized.

Finally, you perform the training by running the fit() method, using X and y as training examples and updating the weights after every training example is fed into the network (batch_size=1). The number of epochs represents the number of times the whole training set will be used to train the neural network.

In this case, you’re repeating the training 5000 times using a training set containing 4 input-output examples. By default, each time the training set is used, the training examples are shuffled.
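To see what this training procedure amounts to, the following sketch mimics it in plain Python, without keras: a network with two hidden sigmoid neurons and one sigmoid output, trained with stochastic gradient descent on the squared error, one example at a time, with the examples shuffled on every epoch. It is not the nnxor.py script. The function names, the weight initialization, the bias terms, and the seed are assumptions made for this illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, params):
    """Return (hidden activations, output) for one input pair x."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def train(samples, epochs=5000, learning_rate=0.1, seed=0):
    """Plain SGD on the squared error, updating after every example."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    b_hidden = [0.0, 0.0]
    w_out = [rng.uniform(-1, 1) for _ in range(2)]
    b_out = 0.0
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)  # examples are shuffled on every epoch
        for x, target in samples:
            hidden, output = predict(x, (w_hidden, b_hidden, w_out, b_out))
            # Gradient of (output - target)**2 through the output sigmoid.
            delta_out = 2.0 * (output - target) * output * (1.0 - output)
            # Backpropagate to the hidden neurons before changing w_out.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(2)]
            for j in range(2):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                b_hidden[j] -= learning_rate * delta_hidden[j]
                for i in range(2):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
            b_out -= learning_rate * delta_out
    return w_hidden, b_hidden, w_out, b_out


XOR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
params = train(XOR_SAMPLES)
for x, target in XOR_SAMPLES:
    _, output = predict(x, params)
    print(x, target, round(output, 3))
```

Whether the printed outputs end up close to the targets depends on the initial weights. A network this small can stall with outputs near 0.5, in which case trying another seed or more epochs may help.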

On the last line, after the training process has finished, you print the predicted values for the 4 possible input examples.

By running this script, you’ll see the evolution of the training process and the performance improvement as new training examples are fed into the network.

After the training finishes, you can check the predictions the network gives for the possible input values.

As you defined X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), the expected output values are 0, 1, 1, and 0, which is consistent with the predicted outputs of the network once you round them to obtain binary values.

Where To Go From Here

Data science and machine learning applications are emerging in the most diverse areas, attracting more people. However, setting up an environment for numerical computation can be a complicated task, and it’s common to find users having trouble in data science workshops, especially when using Windows.

In this article, you’ve covered the basics of setting up a Python numerical computation environment on a Windows machine using the Anaconda Python distribution.

Now that you have a working environment, it’s time to start working with some applications. Python is one of the most used languages for data science and machine learning, and Anaconda is one of the most popular distributions, used in various companies and research laboratories. It provides several packages to install libraries that Python relies on for data acquisition, wrangling, processing, and visualization.

Fortunately there are a lot of tutorials about these libraries available at Real Python, including the following:

Also, if you’d like a deeper understanding of Anaconda and Conda, check out the following links:
