Monday, March 30, 2015

Python for Data Scientists - SciPy


This article continues the Python for Data Scientists series by talking about SciPy. It is built on top of NumPy, which we covered in the previous article. SciPy provides many user-friendly and efficient numerical routines addressing a number of standard problem domains in scientific computing, such as integration, differential equation and sparse linear system solvers, optimizers and root-finding algorithms, Fourier transforms, various standard continuous and discrete probability distributions, and many more. Together, NumPy and SciPy form a reasonably complete computational replacement for much of MATLAB along with some of its add-on toolboxes.


Installing SciPy is trivial. In many cases it will already be included with your Python distribution; otherwise, as usual, it can be installed with the Python package manager:
pip install scipy
Depending on your OS, you may need to install gfortran before installing SciPy.



Very often we want to find the maximum or minimum of a function, that is, solve an optimization problem. Let's see how to do this with SciPy by finding a maximum of a Bessel function. Since optimize.minimize searches for a minimum, we negate the function:
from scipy import special, optimize
f = lambda x: -special.jv(3, x)  # negated Bessel function of the first kind, order 3
sol = optimize.minimize(f, 1.0)  # minimize f starting from the initial guess x = 1.0
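The returned OptimizeResult object carries the solution and convergence diagnostics; a minimal sketch of inspecting it:

```python
from scipy import special, optimize

# Maximize the order-3 Bessel function by minimizing its negation
f = lambda x: -special.jv(3, x)
sol = optimize.minimize(f, 1.0)

sol.x        # location of the maximum, near x = 4.2
-sol.fun     # value of the Bessel function at that point
sol.success  # True if the optimizer converged
```

The solution is returned as an array in sol.x even for a one-dimensional problem.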


Today we cannot imagine ourselves without statistics. From generating random variables to estimating the probability of events, statistics has deeply ingrained itself into any developer's toolset.
from scipy.stats import percentileofscore
scores = [1, 2, 3, 4]
percentileofscore(scores, 3) # what percentage of scores lies at or below 3 => 75.0
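scipy.stats also ships the standard continuous and discrete distributions mentioned in the introduction. A small sketch using the normal distribution and the standard "frozen distribution" API:

```python
from scipy.stats import norm

dist = norm(loc=0, scale=1)  # a "frozen" standard normal distribution
dist.pdf(0)                  # density at 0, about 0.3989
dist.cdf(0)                  # P(X <= 0) = 0.5
dist.ppf(0.975)              # 97.5th percentile, about 1.96
samples = dist.rvs(size=5)   # draw five random variates
```

Freezing fixes the parameters once, so you can pass the distribution object around and call its methods without repeating loc and scale.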

Singular Value Decomposition

SVD, or Singular Value Decomposition, has many useful applications in signal processing and statistics. As a data scientist you will encounter it often: dimensionality reduction, collaborative filtering, you name it, it is always there. Let's see how to calculate one:
import numpy as np
from scipy import linalg
a = np.random.randn(9, 6) + 1.j*np.random.randn(9, 6)  # random complex 9x6 matrix
U, s, Vh = linalg.svd(a)  # U, the singular values s, and V conjugate-transposed
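To check the factorization, you can rebuild a from the three factors. A sketch: linalg.svd returns the singular values as a 1-D vector, so they are placed on a rectangular diagonal matrix first:

```python
import numpy as np
from scipy import linalg

a = np.random.randn(9, 6) + 1.j*np.random.randn(9, 6)
U, s, Vh = linalg.svd(a)

S = np.zeros((9, 6), dtype=complex)  # rectangular diagonal matrix of singular values
S[:6, :6] = np.diag(s)

np.allclose(a, U @ S @ Vh)  # True: the product reproduces a within tolerance
```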


We'll finish our overview with an example of interpolation. Very often we want to approximate a continuous function from points sampled at a constant rate. SciPy provides a handful of functions to do so in multiple dimensions.
import numpy as np
from scipy.interpolate import interp1d
x = np.linspace(0, 10, 10)  # 10 evenly spaced sample points in [0, 10]
y = np.cos(-x**2/8.0)       # the function values at those points
f = interp1d(x, y)          # a callable interpolant built from the samples
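The object returned by interp1d is callable, so evaluating the interpolant at new points is an ordinary function call. A sketch comparing the default linear interpolation with a cubic spline (kind='cubic' is a standard interp1d option):

```python
import numpy as np
from scipy.interpolate import interp1d

x = np.linspace(0, 10, 10)
y = np.cos(-x**2 / 8.0)

f_linear = interp1d(x, y)               # piecewise-linear interpolant (default)
f_cubic = interp1d(x, y, kind='cubic')  # cubic-spline interpolant

xnew = np.linspace(0, 10, 40)  # a denser grid inside the original interval
f_linear(xnew)                 # interpolated values on the new grid
f_cubic(xnew)
```

Both interpolants pass exactly through the original sample points; they differ only between them.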
SciPy contains numerous functions from various domains of science. Be sure to browse them all in the documentation, as most probably your next task has already been fully implemented, tested, and optimized by one of the functions of this wonderful package.
Tuesday, March 10, 2015

Python for Data Scientists - NumPy


We'll start our Python for Data Scientists series with NumPy, short for Numerical Python, which is the foundational package for scientific computing in Python. One of its main purposes with regard to data analysis is to serve as the primary container for data passed between algorithms. For numerical data, NumPy arrays are a much more efficient way of storing and manipulating data than the other built-in Python data structures. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any data. Here are some of the things it provides:
  • A fast and efficient multidimensional array object ndarray
  • Functions for performing element-wise computations on arrays
  • Tools for reading and writing array-based data sets to disk
  • Linear algebra operations, Fourier transform, and random number generation
  • Tools for connecting C, C++, and Fortran code to Python
Knowing NumPy is fundamental, and while by itself it does not provide very much high-level data analytical functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools like pandas much more effectively.


Since everyone uses Python for different applications, there is no single solution for setting up Python and required add-on packages. Personally I recommend using one of the following base Python distributions:
  • Enthought Python Distribution: a scientific-oriented Python distribution from Enthought. This includes Canopy Express, a free base scientific distribution (with NumPy, SciPy, matplotlib, Chaco, and IPython) and Canopy Full, a comprehensive suite of more than 300 scientific packages across many domains.
  • Python(x,y): A free scientific-oriented Python distribution for Windows.
If you'd rather install your packages by yourself, then the following code will do the trick:
pip install numpy


ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. This is important because they enable you to express batch operations on data without writing any for loops. This is usually called vectorization. Consider the next snippet:
import numpy as np
arr = np.arange(1, 16)    # an array of the numbers 1 through 15
arr[5:8] = 12             # assign 12 to the items at indices 5 through 7
arr.sort()                # sorts the array in place
arr = 1 / arr             # element-wise division of 1 by each item
arr = arr.reshape((3, 5)) # reshape returns a new 3x5 array, so capture the result
arr[arr < 0.5] = 0        # zeroes elements smaller than 0.5
The code is self-explanatory and gives you a little taste of what you can do with NumPy. Let us take a step further.
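To see why vectorization matters, compare an explicit Python loop with the equivalent array expression. A quick sketch; exact timings vary by machine, but the vectorized form is typically orders of magnitude faster:

```python
import numpy as np

arr = np.arange(100000)

# Explicit loop: one interpreted Python operation per element
squares_loop = np.empty_like(arr)
for i in range(arr.size):
    squares_loop[i] = arr[i] ** 2

# Vectorized: a single batch operation executed in compiled code
squares_vec = arr ** 2

np.array_equal(squares_loop, squares_vec)  # True, identical results
```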

Universal functions

A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results. Look at the next examples of some of them. For more details, have a look at the ufunc documentation page.
x = np.sqrt(np.arange(8))               # element-wise square root
y = np.floor(np.random.randn(8) * 100)  # scales, then floors each element
np.maximum(x, y)                        # element-wise maximum of two equal-length arrays
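Some ufuncs return more than one array, which is the "one or more scalar results" part above. np.modf, for example, splits each element into its fractional and integral parts:

```python
import numpy as np

arr = np.array([3.7, -1.2, 0.5])
frac, whole = np.modf(arr)  # fractional and integral parts, element-wise
frac + whole                # adding them back recovers the original values
```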

Storing Arrays on Disk in Binary Format

np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy.
arr1 = np.arange(10)        # the array to persist
np.save('some_array', arr1) # writes some_array.npy to disk
arr2 = np.load('some_array.npy')
np.array_equal(arr1, arr2)
Loading text from files is a fairly standard task. It will at times be useful to load data into vanilla NumPy arrays using np.loadtxt or the more specialized np.genfromtxt. These functions have many options allowing you to specify different delimiters, converter functions for certain columns, skipping rows, and other things.
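A minimal sketch of np.loadtxt on a comma-delimited file; the file name and contents here are made up for illustration:

```python
import numpy as np

# Write a small CSV file so there is something to load back
with open('data.csv', 'w') as fh:
    fh.write('1.0,2.0,3.0\n4.0,5.0,6.0\n')

arr = np.loadtxt('data.csv', delimiter=',')  # parsed into a 2x3 float array
arr.shape                                    # (2, 3)
```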

Linear Algebra

Linear algebra operations, like matrix multiplication, decompositions, and determinants, are the building blocks of nearly every data algorithm. numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or the Intel MKL.
import numpy as np
from numpy.random import randn
from numpy.linalg import svd

a = randn(9, 6)
b = randn(9, 6)
c = a + 1j*b                         # build a complex 9x6 matrix
U, s, V = svd(c, full_matrices=True) # perform SVD decomposition
S = np.zeros((9, 6), dtype=complex)  # 9x6 complex zero matrix
S[:6, :6] = np.diag(s)               # place the singular values on the diagonal
np.allclose(c, np.dot(U, np.dot(S, V)))  # True: equal within a tolerance
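The inverse and determinant mentioned above live in numpy.linalg as well. A short sketch solving a small linear system; np.linalg.solve is generally preferred over multiplying by an explicit inverse:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

np.linalg.det(A)           # determinant: 3*2 - 1*1 = 5
np.linalg.inv(A)           # explicit matrix inverse
x = np.linalg.solve(A, b)  # solve A x = b directly; here x = [2, 3]
np.allclose(A.dot(x), b)   # True: the solution satisfies the system
```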
This concludes the tutorial on NumPy; feel free to explore its documentation in depth. Next time we'll take a deeper look into the Python data science toolkit with an overview of SciPy.