DEV Community

Phylis Jepchumba, MSc

Python Libraries Every Data Scientist Must Know.

Pandas.

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

Pandas uses the following data structures:

  • DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.

  • Series is a one-dimensional labeled data structure, similar to an array. Each column of a DataFrame is a Series.
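
As a minimal sketch of the two data structures described above, the snippet below builds a Series and a DataFrame from made-up values. The column names and data are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([29, 35, 41], index=["ann", "bob", "cy"], name="age")

# A DataFrame stores columns of different types side by side.
# All names and values here are made up for illustration.
people = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],               # text
        "age": [29, 35, 41],                        # integers
        "height_m": [1.62, 1.80, 1.75],             # floating point values
        "team": pd.Categorical(["a", "b", "a"]),    # categorical data
    }
)

# Selecting a single column returns a Series.
age_column = people["age"]

print(people.dtypes)
print(age_column.mean())
```

Selecting one column gives back a Series, which is why most Series methods (such as `mean`) also work on individual DataFrame columns.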

Applications

  • General data wrangling and data cleaning
  • ETL (extract, transform, load) jobs for data transformation and data storage, as it has excellent support for loading CSV files into its data frame format
  • Used in academic and commercial areas, including statistics, finance and neuroscience.
  • Time-series-specific functionality, such as date range generation, moving window statistics and date shifting.
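
The applications above can be sketched in a few lines. This example assumes a tiny, made-up CSV (held in a string instead of a real file) and shows loading, a simple cleaning step, date range generation and date shifting, plus a rolling mean as one more time-series helper.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """date,sales
2024-01-01,100
2024-01-02,
2024-01-03,130
2024-01-04,90
"""

# Extract: load the CSV into a DataFrame and parse the date column as the index.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# Clean: fill the missing value with the previous day's figure.
df["sales"] = df["sales"].ffill()

# Time-series helpers: date range generation, date shifting and a rolling mean.
days = pd.date_range("2024-01-01", periods=4, freq="D")
df["previous_day"] = df["sales"].shift(1)
df["rolling_mean_2d"] = df["sales"].rolling(window=2).mean()

print(df)
```

Forward filling is only one of several cleaning strategies; dropping the row with `dropna` or interpolating may suit other data better.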

Read more about pandas

NumPy.

NumPy stands for Numerical Python.
It is a Python library that provides a multidimensional array object and an assortment of routines for fast operations on arrays, including mathematical, logical, sorting, selecting, discrete Fourier transforms, basic linear algebra and many others.

Applications

  • Extensively used in data analysis
  • Provides a powerful N-dimensional array object
  • Forms the base of other libraries, such as SciPy and scikit-learn
  • Can serve as an alternative to MATLAB when used with SciPy and matplotlib
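
To make the array routines mentioned above concrete, here is a short sketch on a small, made-up matrix. It touches mathematical, logical, sorting, selecting, linear algebra and Fourier transform operations.

```python
import numpy as np

# A 2-D array built from nested lists (values are arbitrary).
a = np.array([[3.0, 1.0], [2.0, 4.0]])

# Element-wise mathematical and logical operations.
doubled = a * 2
mask = a > 2.0

# Sorting each row, and selecting the elements where the mask is True.
sorted_rows = np.sort(a, axis=1)
selected = a[mask]

# Basic linear algebra: solve the system a @ x = b.
b = np.array([5.0, 10.0])
x = np.linalg.solve(a, b)

# Discrete Fourier transform of a short signal.
spectrum = np.fft.fft(np.array([1.0, 0.0, -1.0, 0.0]))

print(x)
print(spectrum)
```

Every operation here works on whole arrays at once, with no explicit Python loops, which is where most of NumPy's speed comes from.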

Read more about numpy

Scikit-learn.

Scikit-learn is one of the most widely used libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling.

Applications

  • clustering
  • classification
  • regression
  • model selection
  • dimensionality reduction
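
Several of these tasks can be combined in one short pipeline. The sketch below uses the iris dataset bundled with scikit-learn and chains dimensionality reduction (PCA), classification (logistic regression) and model selection (cross-validation). The choice of model and parameters is an illustrative assumption, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small bundled dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, reduce them to 2 dimensions, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# Model selection: estimate performance with 5-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training set and evaluate on the held-out data.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(cv_scores.mean(), test_accuracy)
```

Wrapping the steps in a pipeline keeps the scaler and PCA fitted only on training folds during cross-validation, which avoids leaking information from the validation data.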

Read more about Scikit-learn

Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Applications

  • Correlation analysis of variables
  • Outlier detection, for example with a scatter plot
  • Visualize the distribution of data to gain instant insights
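
As a rough sketch of the uses listed above, the following draws a scatter plot (where a planted outlier stands out) next to a histogram of the same values. The data is randomly generated, and the figure is written to an in-memory buffer so the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated data with one artificial outlier appended.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
x = np.append(x, 0.0)
y = np.append(y, 10.0)

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot: shows the correlation and makes the outlier visible.
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

# Histogram: shows the distribution of y.
ax_hist.hist(y, bins=20)
ax_hist.set_xlabel("y")
ax_hist.set_title("Distribution")

fig.tight_layout()

# Save to an in-memory PNG; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a notebook or interactive session, calling `plt.show()` instead of saving would display the figure directly.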

Seaborn

Seaborn is a Python data visualization library based on matplotlib.
It provides a high-level interface for drawing attractive and informative statistical graphics.

Seaborn's key features include:

  • Built-in themes for styling matplotlib graphics
  • Visualizing univariate and bivariate data
  • Fitting and visualizing linear regression models
  • Plotting statistical time series data

Read more about Seaborn

TensorFlow

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets developers easily build and deploy ML-powered applications.

Applications

  • Speech and image recognition
  • Text-based applications
  • Time-series analysis
  • Video detection

Read more about TensorFlow

Keras

Like TensorFlow, Keras is a popular library used extensively for deep learning and for building neural networks.
Early versions of Keras supported both the TensorFlow and Theano backends; Keras 3 supports TensorFlow, JAX and PyTorch.

Applications

  • For developing and evaluating deep learning models.

Read more about Keras

SciPy

SciPy is an open-source Python library used for solving mathematical, scientific, engineering, and technical problems.

It lets users manipulate and visualize data with a wide range of high-level Python commands.
SciPy is built on top of NumPy.

Applications

  • Solving differential equations and computing Fourier transforms
  • Optimization algorithms
  • Linear algebra
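
Each of these applications maps to a SciPy submodule. The sketch below uses small toy problems chosen so that the answers are easy to check by hand; the specific equations are assumptions made for illustration.

```python
import numpy as np
from scipy import fft, integrate, linalg, optimize

# Differential equation: dy/dt = -2y with y(0) = 1, whose exact
# solution is y(t) = exp(-2t).
solution = integrate.solve_ivp(
    lambda t, y: -2.0 * y,
    t_span=(0.0, 1.0),
    y0=[1.0],
    rtol=1e-8,
    atol=1e-10,
)
y_end = solution.y[0, -1]

# Optimization: find the minimum of (x - 3)^2, which lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Linear algebra: determinant of a 2x2 matrix (3*4 - 1*2 = 10).
det = linalg.det(np.array([[3.0, 1.0], [2.0, 4.0]]))

# Fourier transform of a short signal.
spectrum = fft.fft(np.array([1.0, 0.0, -1.0, 0.0]))

print(y_end, np.exp(-2.0))
print(result.x, det)
```

Comparing `y_end` with `np.exp(-2.0)` is a quick sanity check that the numerical solver agrees with the known analytic solution.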

Read more about SciPy

🥳🥳

Top comments (2)

BrandonKMLee

Some curious recommendations: NetworkX, igraph, NetworKit or graph-tool for graph ML, CDlib for community detection, and Karate Club for structural node clustering.

Phylis Jepchumba, MSc

Thank you