Silver BlogTop 20 Python Libraries for Data Science in 2018

Our selection actually contains more than 20 libraries, as some of them are alternatives to each other and solve the same problem. Therefore we have grouped them as it's difficult to distinguish one particular leader at the moment.



By ActiveWizards

Image

Python continues to take leading positions in solving data science tasks and challenges. Last year we made a blog post overviewing the Python’s libraries that proved to be the most helpful at that moment. This year, we expanded our list with new libraries and gave a fresh look to the ones we already talked about, focusing on the updates that have been made during the year.

Our selection actually contains more than 20 libraries, as some of them are alternatives to each other and solve the same problem. Therefore we have grouped them as it's difficult to distinguish one particular leader at the moment.

 

Core Libraries & Statistics

 

1. NumPy (Commits: 17911, Contributors: 641)

Traditionally, we start our list with the libraries for scientific applications, and NumPy is one of the principal packages in this area. It is intended for processing large multidimensional arrays and matrices, and an extensive collection of high-level mathematical functions and implemented methods makes it possible to perform various operations with these objects.

During the year, a large number of improvements have been made to the library. In addition to bug fixes and compatibility issues, the crucial changes regard styling possibilities, namely the printing format of NumPy objects. Also, some functions can now handle files of any encoding that is available in Python.

 
2. SciPy (Commits: 19150, Contributors: 608)

Another core library for scientific computing is SciPy. It is based on NumPy and therefore extends its capabilities. SciPy main data structure is again a multidimensional array, implemented by Numpy. The package contains tools that help with solving linear algebra, probability theory, integral calculus and many more tasks.

SciPy faced major build improvements in the form of continuous integration into different operating systems, new functions and methods and, what is especially important - the updated optimizers. Also, many new BLAS and LAPACK functions were wrapped.

 
3. Pandas (Commits: 17144, Contributors: 1165)

Pandas is a Python library that provides high-level data structures and a vast variety of tools for analysis. The great feature of this package is the ability to translate rather complex operations with data into one or two commands. Pandas contains many built-in methods for grouping, filtering, and combining data, as well as the time-series functionality. All of this is followed by impressive speed indicators.

There have been a few new releases of the pandas library, including hundreds of new features, enhancements, bug fixes, and API changes. The improvements regard pandas abilities for grouping and sorting data, more suitable output for the apply method, and the support in performing custom types operations.

 
4. StatsModels (Commits: 10067, Contributors: 153)

Statsmodels is a Python module that provides many opportunities for statistical data analysis, such as statistical models estimation, performing statistical tests, etc. With its help, you can implement many machine learning methods and explore different plotting possibilities.

The library is continuously developing, enriching new and new opportunities. Thus, this year brought time series improvements and new count models, namely GeneralizedPoisson, zero inflated models, and NegativeBinomialP, and new multivariate methods - factor analysis, MANOVA, and repeated measures within ANOVA.

 

Visualization

 
5. Matplotlib (Commits: 25747, Contributors: 725)

Matplotlib is a low-level library for creating two-dimensional diagrams and graphs. With its help, you can build diverse charts, from histograms and scatterplots to non-Cartesian coordinates graphs. Moreover, many popular plotting libraries are designed to work in conjunction with matplotlib.

There have been style changes in colors, sizes, fonts, legends, etc. As an example of an appearance improvements are an automatic alignment of axes legends and among significant colors improvements is a new colorblind-friendly color cycle.

 
6. Seaborn (Commits: 2044, Contributors: 83)

Seaborn is essentially a higher-level API based on the matplotlib library. It contains more suitable default settings for processing charts. Also, there is a rich gallery of visualizations including some complex types like time series, jointplots, and violin diagrams.

The seaborn updates mostly cover bug fixes. However, there were improvements in compatibility between FacetGrid or PairGrid and enhanced interactive matplotlib backends, adding parameters and options to visualizations.

 
7. Plotly (Commits: 2906, Contributors: 48)

Plotly is a popular library that allows you to build sophisticated graphics easily. The package is adapted to work in interactive web applications. Among its remarkable visualizations are contour graphics, ternary plots, and 3D charts.

The continuous enhancements of the library with new graphics and features brought the support for "multiple linked views" as well as animation, and crosstalk integration.

 
8. Bokeh (Commits: 16983, Contributors: 294)

The Bokeh library creates interactive and scalable visualizations in a browser using JavaScript widgets. The library provides a versatile collection of graphs, styling possibilities, interaction abilities in the form of linking plots, adding widgets, and defining callbacks, and many more useful features.
Bokeh can boast with improved interactive abilities, like a rotation of categorical tick labels, as well as small zoom tool and customized tooltip fields enhancements.

 
9. Pydot (Commits: 169, Contributors: 12)

Pydot is a library for generating complex oriented and non-oriented graphs. It is an interface to Graphviz, written in pure Python. With its help, it is possible to show the structure of graphs, which are very often needed when building neural networks and decision trees based algorithms.