# Python machine learning software

Today there is a wide range of software tools for building machine learning models. The first such tools emerged among scientists and statisticians, where the R and Python languages are popular and where ecosystems for processing, analyzing, and visualizing data have historically developed, although machine learning libraries also exist for Java, Lua, and C++. Since interpreted languages are much slower than compiled ones, data preparation and model structure are typically described in the interpreted language, while the heavy computation is carried out in compiled code.

In this post, we will talk mainly about libraries that have an implementation in Python, since this language has a large number of packages for integration into various kinds of services and systems, as well as for writing various information systems. The material contains a general description of well-known libraries and will be useful primarily to those who are beginning to study the field of ML and want to roughly understand where to look for the implementation of certain methods.

When choosing specific packages for solving a problem, first decide whether they provide the machinery your problem needs. For image analysis, for example, you will most likely have to deal with neural networks, and for working with text, with recurrent ones; with a small amount of data, neural networks will most likely have to be abandoned altogether.

## Python general purpose libraries

All the packages described in this section are somehow used to solve almost any machine learning task. Often they are enough to build a model entirely, at least as a first approximation.

#### NumPy

An open source library for performing linear algebra operations and numerical transformations. As a rule, such operations are needed to transform datasets, which can be represented as matrices. The library offers a large number of operations for working with multidimensional arrays, Fourier transforms, and random number generation. NumPy's storage formats are the de facto standard for numeric data in many other libraries (for example, Pandas, Scikit-learn, SciPy).

**Website**: www.numpy.org
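A minimal sketch of the kind of operations described above: representing a dataset as a matrix, centering it via broadcasting, multiplying matrices, and generating random numbers. The data here is purely illustrative.

```python
import numpy as np

# A toy dataset as a 2-D array: 3 objects, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Linear-algebra operations on the whole matrix at once
col_means = X.mean(axis=0)        # per-feature means
X_centered = X - col_means        # broadcasting subtracts row-wise
gram = X_centered.T @ X_centered  # matrix product

# Random number generation, e.g. for synthetic noise
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0.0, scale=0.1, size=X.shape)
```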

#### Pandas

A library for data processing. With its help, you can load data from almost any source (it integrates with the main data storage formats used in machine learning), compute various functions and create new features, and construct queries over the data using aggregate functions akin to SQL. In addition, there are various matrix transformation functions, a sliding window method, and other ways of extracting information from data.

**Website**: pandas.pydata.org
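A small sketch of the capabilities mentioned above, on a made-up sales table: SQL-like aggregation with `groupby`, deriving a new feature from existing columns, and a sliding (rolling) window.

```python
import pandas as pd

# Toy dataset: daily sales for two stores
df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "sales": [10, 12, 11, 20, 18, 22],
})

# SQL-like aggregation: average sales per store
per_store = df.groupby("store")["sales"].mean()

# Creating a new feature from existing columns
df["sales_share"] = df["sales"] / df["sales"].sum()

# Sliding window: mean over the last 2 observations
df["rolling_mean"] = df["sales"].rolling(window=2).mean()
```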

#### Scikit-learn

A software library with more than a decade of history, it contains implementations of almost every common transformation and algorithm, and often it alone is enough to build a model end to end, at least as a first approximation. When programming almost any model in Python, some transformations from this library are nearly always present.

Scikit-learn contains methods for splitting datasets into training and test sets, computing basic metrics over datasets, and cross-validation. The library also provides the basic machine learning algorithms: linear regression (and its Lasso and ridge modifications), support vector machines, decision trees and forests, and so on. There are also implementations of the basic clustering methods. In addition, the library contains methods for working with features that researchers use constantly: for example, dimensionality reduction via principal component analysis. The companion imbalanced-learn (imblearn) library, built on the same interface, allows you to work with unbalanced samples and generate new values.

**Website**: www.scikit-learn.org
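A minimal sketch of the workflow described above, using a synthetic dataset in place of real data: a train/test split, fitting a linear model, computing a basic metric, and cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic dataset standing in for real data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A basic linear model and an accuracy metric on the held-out set
model = LogisticRegression().fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))

# 5-fold cross-validation on the full dataset
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
```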

#### Scipy

Quite an extensive library designed for scientific computing. It includes a large set of functions from mathematical analysis, including numerical integration, the search for maxima and minima, and signal and image processing functions. In many ways, this library can be considered an analogue of the MATLAB package for Python developers. With its help, you can solve systems of equations, use genetic algorithms, and perform many optimization tasks.

**Website**: www.scipy.org
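Two of the tasks mentioned above in miniature: finding the minimum of a function and computing a definite integral. Both the function and the integrand are made-up examples.

```python
from scipy import optimize, integrate

# Minimizing f(x) = (x - 3)^2 + 1; the minimum is at x = 3
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2 + 1.0, x0=[0.0])

# Numerical integration: integral of x^2 over [0, 1] equals 1/3
value, error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)
```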

## Specific Libraries

This section covers libraries that either target a specific area of application or are popular with a narrower audience.

#### TensorFlow

Developed by Google for working with tensors, this library is used to build neural networks. It supports computation on graphics cards and has a C++ version. Higher-level libraries for working with neural networks at the level of entire layers are built on top of it: some time ago, the popular Keras library adopted TensorFlow as its main computation backend in place of the similar Theano library. On NVIDIA graphics cards, computation relies on the cuDNN library. If you work with images (with convolutional neural networks), you will most likely end up using this library.

**Website**: www.tensorflow.org
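A tiny sketch of the tensor machinery underneath the neural network layers: automatic differentiation of an expression over a tensor variable, which is what training ultimately relies on.

```python
import tensorflow as tf

# A scalar tensor variable and automatic differentiation
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 3.0 * x       # y = x^2 + 3x
grad = tape.gradient(y, x)     # dy/dx = 2x + 3, i.e. 7 at x = 2
```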

#### Keras

A library for building neural networks that supports the main types of layers and structural elements. It handles both recurrent and convolutional neural networks and includes implementations of well-known architectures (for example, VGG16). Some time ago, layers from this library became available inside TensorFlow itself. There are ready-made functions for working with images and text (word embeddings, etc.). It integrates with Apache Spark via the dist-keras distribution.

**Website**: www.keras.io
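A minimal sketch of building a network layer by layer, as described above: a small fully connected model trained for one epoch on random data, just to show the workflow (the layer sizes and data are arbitrary).

```python
import numpy as np
from tensorflow import keras

# A small fully connected network assembled from layers
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# One training step on random data, purely for illustration
X = np.random.rand(32, 4).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit(X, y, epochs=1, verbose=0)
preds = model.predict(X, verbose=0)
```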

#### Caffe

A framework for training neural networks from the University of California, Berkeley. Like TensorFlow, it uses cuDNN to work with NVIDIA graphics cards. It contains implementations of a large number of well-known neural networks and was one of the first frameworks integrated into Apache Spark (CaffeOnSpark).

**Website**: caffe.berkeleyvision.org

#### PyTorch

A Python counterpart of the Torch library originally written for Lua. It contains implementations of algorithms for working with images, statistical operations, and tools for working with neural networks, and a separate package provides optimization algorithms (in particular, stochastic gradient descent).

**Website**: pytorch.org
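A minimal sketch of the pieces named above working together: a single linear layer, automatic differentiation, and the stochastic gradient descent optimizer, fitted to a toy regression target.

```python
import torch
from torch import nn, optim

# One linear layer trained with stochastic gradient descent
model = nn.Linear(in_features=3, out_features=1)
opt = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

X = torch.randn(16, 3)
y = X.sum(dim=1, keepdim=True)   # toy target: the sum of the features

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()              # autograd computes the gradients
    opt.step()                   # SGD parameter update
```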

## Implementations of gradient boosting over decision trees

Such algorithms invariably attract heightened interest, since they often outperform neural networks. This is especially evident when the datasets at your disposal are not very large (a very rough estimate: thousands or tens of thousands of rows, but not tens of millions). Gradient boosting over decision trees is quite common among the winning models on the Kaggle competition platform.

As a rule, implementations of such algorithms can be found in general-purpose machine learning libraries (for example, Scikit-learn). However, there are specialized implementations of this algorithm, which can often be found among the winners of various competitions. The following are worth highlighting.

#### XGBoost

The most widespread implementation of gradient boosting. Released in 2014, it had gained considerable popularity by 2016. To select splits, it uses presorted data and histogram-based methods.

**Website**: github.com/dmlc/xgboost

#### LightGBM

Microsoft's gradient boosting variant, released in 2017. Gradient-based One-Side Sampling (GOSS) is used to select split criteria. It provides methods for working with categorical features, i.e. features that are not naturally expressed as a number (for example, an author's name or a car make). It is part of Microsoft's DMTK (Distributed Machine Learning Toolkit) project.

**Website**: www.dmtk.io

#### CatBoost

Developed by Yandex and, like LightGBM, released in 2017. It implements a special approach to handling categorical features (based on target encoding, i.e. replacing categorical features with statistics computed from the predicted value). In addition, the algorithm uses a special approach to building trees that has shown strong results. Our comparison showed that this algorithm works best "out of the box", i.e. without tuning any parameters.

**Website**: catboost.yandex

#### Microsoft Cognitive Toolkit (CNTK)

This Microsoft framework has a C++ interface and provides implementations of various neural network architectures. Its integration with .NET may be of particular interest.

**Website**: www.microsoft.com/en-us/cognitive-toolkit

## Other development resources

As machine learning grew in popularity, projects repeatedly appeared that aim to simplify development, present it in graphical form, and provide access online. Here are a few of them.

#### Azure ML

A machine learning service on the Microsoft Azure platform, where you can build data processing pipelines as graphs and run computations on remote servers, with the ability to include Python code, among other things.

**Website**: azure.microsoft.com/ru-ru/services/machine-learning-studio

#### IBM DataScience experience (IBM DSX)

A service for working in the Jupyter Notebook environment with the ability to run computations in Python and other languages. It supports integration with well-known datasets and Spark, and is part of the IBM Watson project.

**Website**: ibm.com/cloud/watson-studio

## Social Science Packages

Among them is IBM SPSS (Statistical Package for the Social Sciences), IBM software for statistical processing in the social sciences, which provides a graphical interface for defining data processing. Some time ago it became possible to embed machine learning algorithms into the overall execution flow. In general, limited support for machine learning algorithms is becoming popular among packages for statisticians that already include statistical functions and visualization methods (for example, Tableau and SAS).

## Conclusion

The choice of software package on the basis of which the task will be solved is usually determined by the following conditions.

- The environment in which the model will be used: whether Spark support is needed, which services need to be integrated.
- The nature of the data: are they images, text, or a set of numbers, and what kind of processing do they require?
- How well model families suit the problem type: image data is usually processed with convolutional neural networks, while decision-tree-based algorithms are used for small datasets.
- Restrictions on computing power, both in training and in use.

As a rule, when developing in Python, the use of general-purpose libraries (Pandas, Scikit-learn, NumPy) cannot be avoided. As a result, their interfaces are supported by most specialized libraries; when they are not, be aware that you will have to write connectors yourself or choose another library.

You can build a first model with a relatively small number of libraries; after that, you must decide whether to spend time on building features (feature engineering), on choosing the optimal library and algorithm, or to pursue both in parallel.

Now a few recommendations for choosing. If you need an algorithm that works best out of the box, it is CatBoost. If you intend to work with images, you can use Keras with TensorFlow, or Caffe. When working with text, decide whether you are going to build a neural network that takes context into account. If so, the same recommendations as for images apply; if a "bag of words" (frequency counts of each word's occurrences) is enough, gradient boosting algorithms will do. With small datasets, you can use Scikit-learn's algorithms for generating new data and the linear methods implemented in the same library.
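The bag-of-words route mentioned above can be sketched in a few lines: each document becomes a vector of word-occurrence counts, which is then fed to a gradient boosting classifier (here Scikit-learn's own implementation, on a made-up four-document corpus).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Tiny made-up corpus: positive (1) vs. negative (0) reviews
texts = ["good film great acting", "great story good cast",
         "bad plot boring film", "boring bad acting"]
labels = [1, 1, 0, 0]

# "Bag of words": each document becomes word-occurrence counts
vec = CountVectorizer()
X = vec.fit_transform(texts)

# Gradient boosting over the frequency features
model = GradientBoostingClassifier(n_estimators=20).fit(X.toarray(), labels)
pred = model.predict(vec.transform(["good great film"]).toarray())
```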

As a rule, the libraries described here are enough for most tasks, even for winning a competition. The field of machine learning is developing very quickly: new frameworks have surely appeared even as this post was being written.

*Nikolay Knyazev, the head of the Jet Infosystems machine learning group*