Monday, March 30, 2015

Python for Data Scientists - SciPy


This article continues the Python for Data Scientists series with SciPy. It is built on top of NumPy, which we covered in the previous article. SciPy provides many user-friendly and efficient numerical routines for standard problem domains in scientific computing: integration, differential equation solvers and sparse linear system solvers, optimizers and root-finding algorithms, Fourier transforms, a range of standard continuous and discrete probability distributions, and many more. Together, NumPy and SciPy form a reasonably complete computational replacement for much of MATLAB and some of its add-on toolboxes.


Installing SciPy is straightforward. In many cases it already ships with your Python distribution; otherwise, install it manually with the Python package manager:
pip install scipy
Depending on your OS, you may need to install gfortran before installing SciPy, for example when pip has to build SciPy from source instead of using a prebuilt package.
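
A quick way to confirm the installation worked is to import the package and print its version. This is only a minimal check; the version numbers you see depend on your environment.

```python
import numpy as np
import scipy

# Read the installed versions; the exact values depend on your environment.
scipy_version = scipy.__version__
numpy_version = np.__version__

print("SciPy", scipy_version)
print("NumPy", numpy_version)
```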



Very often we want to find a maximum or minimum of a function, that is, to solve an optimization problem. Let's see how to do this with SciPy by finding a maximum of a Bessel function. Since optimize.minimize searches for a minimum, we negate the function:
from scipy import special, optimize
f = lambda x: -special.jv(3, x) # define a function
sol = optimize.minimize(f, 1.0) # optimize the function
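
For a function of a single variable, one alternative is `optimize.minimize_scalar`, which can restrict the search to an interval. The sketch below is not the approach used above; the name `neg_bessel` and the bounds (0, 6) are assumptions chosen for illustration. It also shows how to read the answer out of the returned result object.

```python
from scipy import optimize, special

# Negated Bessel function of the first kind, order 3 (same objective as above).
def neg_bessel(x):
    return -special.jv(3, x)

# Search only inside an interval; the bounds (0, 6) are an illustrative
# assumption, picked so that the interval contains a single maximum.
res = optimize.minimize_scalar(neg_bessel, bounds=(0.0, 6.0), method="bounded")

x_max = res.x      # location of the maximum of jv(3, x) on the interval
f_max = -res.fun   # undo the negation to recover the maximum value

print(res.success, x_max, f_max)
```

The result object returned by `optimize.minimize` exposes the same `x`, `fun` and `success` attributes, so `sol.x` and `-sol.fun` can be read in the same way.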


Statistics is hard to avoid today. From generating random variables to emitting events with a known probability, it has become a firm part of any developer's toolset.
from scipy.stats import percentileofscore
scores = [1, 2, 3, 4] # named `scores` so we don't shadow the built-in `list`
percentileofscore(scores, 3) # percentile rank of 3 within scores => 75.0
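
The two tasks mentioned above, generating random variables and emitting events with a known probability, can be sketched with the distribution objects in `scipy.stats`. The distribution parameters, the probability 0.3 and the seed are arbitrary choices for illustration.

```python
from scipy import stats

# A normal distribution with mean 0 and standard deviation 1
# (parameters chosen only for illustration).
dist = stats.norm(loc=0.0, scale=1.0)

samples = dist.rvs(size=1000, random_state=42)  # draw random variates
p_below = dist.cdf(1.0)   # probability of a value at or below 1.0
median = dist.ppf(0.5)    # inverse CDF: the 50th percentile

# Ten Bernoulli trials, each "firing" (value 1) with probability 0.3
events = stats.bernoulli.rvs(0.3, size=10, random_state=42)

print(samples.mean(), p_below, median)
print(events)
```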

Singular Value Decomposition

SVD, or Singular Value Decomposition, has many useful applications in signal processing and statistics. As a data scientist you will meet it a lot: dimension reduction, collaborative filtering, you name it. Let's see how to calculate one:
import numpy as np
from scipy import linalg
a = np.random.randn(9, 6) + 1.j*np.random.randn(9, 6)
U, s, Vh = linalg.svd(a)
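
To see what the three returned pieces mean, the sketch below rebuilds the original matrix from them and then forms a rank-k approximation, the basic step behind dimension reduction. The seed and the choice k = 2 are assumptions made for illustration.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
a = rng.standard_normal((9, 6)) + 1j * rng.standard_normal((9, 6))

U, s, Vh = linalg.svd(a)  # shapes: U (9, 9), s (6,), Vh (6, 6)

# Place the singular values on the diagonal of a 9x6 matrix,
# then check that a equals U @ S @ Vh up to rounding error.
S = linalg.diagsvd(s, 9, 6)
reconstructed = U @ S @ Vh
print(np.allclose(a, reconstructed))

# Rank-k approximation: keep only the k largest singular values.
k = 2
a_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print(a_k.shape)
```

The singular values in `s` come back sorted from largest to smallest, which is why slicing the first k entries keeps the most significant components.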


We'll finish our overview with an example of interpolation. Very often we want to approximate a continuous function from values sampled at regular intervals. SciPy provides a handful of functions for doing so in one or more dimensions.
from scipy.interpolate import interp1d
import numpy as np
x = np.linspace(0, 10, 10)
y = np.cos(-x**2/8.0)
f = interp1d(x, y)
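
The object returned by `interp1d` is callable, so the interpolant can be evaluated at new points. The sketch below uses the same samples and compares the default linear interpolation with a cubic one; the names `f_linear`, `f_cubic` and `x_new` are introduced here for illustration.

```python
import numpy as np
from scipy.interpolate import interp1d

x = np.linspace(0, 10, 10)
y = np.cos(-x**2 / 8.0)

f_linear = interp1d(x, y)                # default: piecewise linear
f_cubic = interp1d(x, y, kind="cubic")   # smoother cubic spline

# New points must stay inside [x.min(), x.max()]; by default interp1d
# raises an error when asked to extrapolate.
x_new = np.linspace(0, 10, 50)
y_linear = f_linear(x_new)
y_cubic = f_cubic(x_new)

# Both interpolants pass through the original samples.
print(np.allclose(f_linear(x), y), np.allclose(f_cubic(x), y))
```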

SciPy contains numerous functions from various domains of science. Be sure to browse the documentation, as your next task may well already be implemented, tested and optimized by one of the functions in this wonderful package.