Monday, April 13, 2015

Python for Data Scientists - Pandas


Introduction

Having covered NumPy and SciPy in previous articles, let's move on to our next package: pandas.
Pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment.

Pandas combines the high-performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). Its central object is the DataFrame, a two-dimensional, tabular, column-oriented data structure with both row and column labels. Pandas provides sophisticated indexing functionality that makes it easy to reshape, slice and dice, aggregate, and select subsets of data.

For users of the R language for statistical computing, the DataFrame name will be familiar, as the object was named after the similar R data.frame object. However, the functionality provided by R's data.frame is merely a subset of that provided by the pandas DataFrame.

Installation

Installing pandas is, like everything in the Python ecosystem, a piece of cake. If you work with a scientific Python distribution, it comes pre-packaged for you. To install it manually with pip, the Python package manager, run:

pip install pandas

Data Structures

To get started with pandas, you will need to get comfortable with the two data structures used throughout the library: Series and DataFrame. I won't go into detail about Panel and Panel4D, two less commonly used containers that you can read about on the pandas site (both have been removed in later pandas releases).

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

import pandas as pd
import numpy as np
pd.Series([1, 3, 5, np.nan, 6, 8])

# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64
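
To make the two parts of a Series more concrete, here is a minimal sketch that builds one with explicit labels and then looks at its values and its index separately. The labels and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series pairs an array of values with an array of labels (the index).
s = pd.Series([4.0, 7.5, np.nan], index=['x', 'y', 'z'])

values = s.to_numpy()      # the underlying NumPy array
labels = list(s.index)     # the labels as a plain Python list

print(values.dtype)        # float64
print(labels)              # ['x', 'y', 'z']
print(s['y'])              # look up a value by its label
```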

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

dates = pd.date_range('20130101',periods=6)
pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
#                    A         B         C         D
# 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
# 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
# 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
# 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
# 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
# 2013-01-06 -0.673690  0.113648 -1.478427  0.524988

As you can see, a DataFrame has both a row and a column index; it can be thought of as a dict of Series that share the same index.
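
The "dict of Series" view can be sketched directly: the example below builds a DataFrame from two Series and pulls one column back out. The column names and values are hypothetical, chosen only to show the idea.

```python
import pandas as pd

# Two Series with partly different labels (made-up data).
price = pd.Series([1.5, 2.0, 3.25], index=['a', 'b', 'c'])
stock = pd.Series([10, 20], index=['a', 'b'])

# Each dict key becomes a column; the row index is the union of the labels.
df = pd.DataFrame({'price': price, 'stock': stock})

col = df['price']                  # selecting a column gives back a Series
print(type(col).__name__)          # Series
print(df.index.tolist())           # ['a', 'b', 'c']
print(df['stock'].isna().sum())    # 1 (there is no stock value for label 'c')
```

Note that the missing stock value for 'c' turns that column into floats, since NaN is a floating-point value.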

Examples

Index Objects

As we've seen in the DataFrame example, we can provide a custom index to both DataFrame and Series objects and then use its labels to reference items.

s = pd.Series(range(3), index=['a', 'b', 'c'], dtype=float)   # float, so 16.5 fits below
s['a'] # 0.0
s['b'] = 16.5   # assign a new value by label
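
A few more label-based lookups, as a small sketch: selecting several labels at once, testing whether a label exists, and using the explicit .loc accessor. The Series here is a stand-in with made-up values.

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

subset = s[['c', 'a']]      # select several labels, in the order given
has_b = 'b' in s            # membership tests look at the index, not the values
by_label = s.loc['b']       # explicit label-based lookup

print(subset.tolist())      # [2.0, 0.0]
print(has_b, by_label)      # True 1.0
```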

Reindexing

Consider a case where you'd like to change the index labels of your data, which results in the alteration, addition or removal of entries.

index = ['a', 'c', 'd']
columns = ['GBP', 'USD', 'EUR']
data = np.arange(9).reshape((3, 3))      # reshape 9 values into a 3x3 array
frame = pd.DataFrame(data, index=index, columns=columns)
#    GBP  USD  EUR
# a    0    1    2
# c    3    4    5
# d    6    7    8

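reindexed = frame.reindex(columns=['GBP', 'JPY', 'EUR'])   # USD is dropped, JPY is new
#    GBP  JPY  EUR
# a    0  NaN    2
# c    3  NaN    5
# d    6  NaN    8

reindexed.drop('GBP', axis=1)   # axis=1 drops a column; the default drops a row label
#    JPY  EUR
# a  NaN    2
# c  NaN    5
# d  NaN    8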

As you can see, both adding and removing indices is as easy as breathing. Both methods accept lists of labels, so bulk data alteration is also possible and even advisable for performance reasons.
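
Bulk changes of this kind might look like the sketch below, which rebuilds the frame from the example, reindexes rows and columns in one call, and drops several labels at once. The extra row label 'b' is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['GBP', 'USD', 'EUR'])

# Reindex rows and columns in one call; new labels are filled with NaN.
wide = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['GBP', 'JPY', 'EUR'])

# Drop several row labels at once, then several column labels.
trimmed = frame.drop(['a', 'd']).drop(['USD', 'EUR'], axis=1)

print(wide.shape)       # (4, 3)
print(trimmed)          # a single cell: row 'c', column 'GBP'
```

Neither call changes frame itself; both return new objects.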

Arithmetic and data alignment

Another important pandas feature is the behavior of arithmetic between objects with different indexes. When you add two objects whose indexes are not the same, the index of the result is the union of the two indexes.

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 + s2
# a 5.2
# c 1.1
# d NaN
# e 0.0
# f NaN
# g NaN

Here the entries labelled d, f and g became NaN because each label appears in only one of the two series. The same applies to DataFrames, of course. One thing you might find useful is filling these gaps with a default value. This can be achieved with the fill_value argument, which is supported by the corresponding arithmetic methods: add, sub, div and mul.
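
As a minimal sketch of the filling behavior, the example below repeats the addition with the add method and a fill_value of 0, so a label missing from one series counts as zero instead of producing NaN.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Treat a label that is missing from one Series as 0 before adding.
total = s1.add(s2, fill_value=0)
print(total)
# d, f and g now keep the value from the one Series that has them
```

The fill value only applies when a label is missing on one side; a label that is missing (or NaN) on both sides still gives NaN.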

Merge

What about combining two data objects? No worries, pandas comes to the rescue. It provides various facilities for easily combining objects, with several kinds of set logic for the indexes and relational algebra functionality for join and merge operations.

key = ['foo', 'foo']
left = pd.DataFrame({'key': key, 'lval': [1, 2]})
#    key  lval
# 0  foo     1
# 1  foo     2

right = pd.DataFrame({'key': key, 'rval': [4, 5]})
#    key  rval
# 0  foo     4
# 1  foo     5

pd.concat([left, right])   # stacks the rows; columns missing from one frame become NaN
#    key  lval  rval
# 0  foo   1.0   NaN
# 1  foo   2.0   NaN
# 0  foo   NaN   4.0
# 1  foo   NaN   5.0

merged = pd.merge(left, right, on='key')
#    key  lval  rval
# 0  foo     1     4
# 1  foo     1     5
# 2  foo     2     4
# 3  foo     2     5

merged.groupby('key').sum()
#      lval  rval
# key            
# foo     6    18
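
The set logic mentioned above is easier to see when the keys differ between the two frames. The sketch below uses the how argument of merge for that; the keys 'bar' and 'baz' are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                   # keys present in both frames
outer = pd.merge(left, right, on='key', how='outer')      # keys present in either frame
left_join = pd.merge(left, right, on='key', how='left')   # every key from the left frame

print(inner)
print(outer)
print(left_join)
```

The default is an inner join, which is why the merge in the example above only needed the on argument.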

Handling Missing Data

Very often, if not always, we deal with incomplete data, either by its nature (sensor data, for example) or as a result of human error (as in spreadsheets). Pandas provides various tools for dealing with such situations.

from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna() # drop missing entries; similar to data[data.notnull()]
# 0 1.0
# 2 3.5
# 4 7.0

data.fillna(0)
# 0 1.0
# 1 0.0
# 2 3.5
# 3 0.0
# 4 7.0
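
The same methods work on a DataFrame, where there are a few more choices to make. This is a rough sketch with made-up sensor readings; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [21.5, np.nan, 19.0, np.nan],
                   'humidity': [0.40, 0.45, np.nan, np.nan]})

complete = df.dropna()              # keep only rows with no missing values
not_empty = df.dropna(how='all')    # drop only rows where every value is missing

# Fill each column with its own default: the column mean for temp, 0.0 for humidity.
filled = df.fillna({'temp': df['temp'].mean(), 'humidity': 0.0})

print(len(complete), len(not_empty))   # 1 3
print(filled)
```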

With its ever-evolving API, pandas provides a wealth of functionality that will make any data scientist's life easier. Make sure you keep up with the features added in each release.