DataScience Deep Dive: Python for Data Scientists

Introduction

We'll start our Python for Data Scientists series with NumPy, short for Numerical Python, which is the foundational package for scientific computing in Python. One of its primary purposes with regards to data analysis is as the primary container for data to be passed between algorithms. For numerical data, NumPy arrays are a much more efficient way of storing and manipulating data than the other built-in Python data structures. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any data. Here are some of the things it provides:

A fast and efficient multidimensional array object ndarray
Functions for performing element-wise computations arrays
Tools for reading and writing array-based data sets to disk
Linear algebra operations, Fourier transform, and random number generation
Tools for integrating connecting C, C++, and Fortran code to Python

Knowing Numpy is fundamental and while by itself it does not provide very much high-level data analytical functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools like pandas much more effectively.

Installation

Since everyone uses Python for different applications, there is no single solution for setting up Python and required add-on packages. Personally I recommend using one of the following base Python distributions:

Enthought Python Distribution: a scientific-oriented Python distribution from Enthought. This includes Canopy Express, a free base scientific distribution (with NumPy, SciPy, matplotlib, Chaco, and IPython) and Canopy Full, a comprehensive suite of more than 300 scientific packages across many domains.
Python(x,y): A free scientific-oriented Python distribution for Windows.

If you'd rather install your packages by yourself, then the following code will do the trick:

pip install numpy

Features

ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. This is important because they enable you to express batch operations on data without writing any for loops. This is usually called vectorization. Consider the next snippet:

import numpy as np
arr = np.arange(15) # returns numbers from 0 to 15, but as an array
arr[5:8] = 12       # assign 12 to items indexed from 5 to 8
arr.sort()          # sorts the array
arr = 1 / arr       # self assignment of 1 divided by each array item
arr.reshape((3, 5)) # reshapes array into 3x5 matrix
arr[arr < 5] = 0    # zeroes elements greater than 5

The code is self explanatory and gives you a little taste of what you can do with NumPy. Let us take a step further.

Universal functions

A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results. Look at the next examples of some them. For more details, have a look at it's page.

x = np.sqrt(arr)    # element-wise square root
y = np.random.randn(8) * 100
y = np.floor(y)     # floors each element of the array
np.maximum(x, y)    # element-wise maximum

Storing Arrays on Disk in Binary Format

np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy.

arr1 = np.arange(10)
np.save('some_array', arr2)
arr2 = np.load('some_array.npy')
np.array_equal(arr1, arr2)

Loading text from files is a fairly standard task. It will at times be useful to load data into vanilla NumPy arrays using np.loadtxt or the more specialized np.genfromtxt. These functions have many options allowing you to specify different delimiters, converter functions for certain columns, skipping rows, and other things.

Linear Algebra

Linear algebra, like matrix multiplication, decompositions, determinants, and others are the building block of nearly every data algorithm. numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as like BLAS, LAPACK, or the Intel MKL.

import dot from np, allclose
import randn from np.random
import svd from np.linalg

a = randn(9, 6)
b = randn(9, 6)
c = a + 1j*b                         # initiate complex matrix
U, s, V = svd(a, full_matrices=True) # perform svd decomposition
S = np.zeros((9, 6), dtype=complex)  # 9x6 complex zero matrix
S[:6, :6] = np.diag(s)               # swap diagonals
allclose(a, dot(U, dot(S, V)))       # equal within a tolerance

This will conclude the tutorial about NumPy and feel free to check it's documentation in depth. Next time we'll be taking a deeper look into Python Data Science tool kit with an overview about SciPy.

Python for Data Scientists - NumPy