DataScience Deep Dive: April 2015

Introduction

Having learned some basic Python packages, you have probably started to feel that working in the plain Python console is not very productive. In R we have RStudio, which we covered in the first articles. The good folks of the Python community have developed IPython, an interactive Python console and web environment.

Installation

As usual, we use pip, the Python package manager, to install the package:

pip install ipython

Usage

Once the package is installed, you can launch the console version by simply typing its name in the console:

ipython

Once the application has started, you can type Python commands and observe the results. Although it does not look very different from the ordinary Python console, it provides auto-quoting, code completion, search through previously executed commands, output caching and much more.
As mentioned above, IPython can run in two modes: console and web. To launch the web interface, add the "notebook" argument:

ipython notebook

This will open the http://localhost:8888/tree URL in your default browser. You can then create new IPython files, which have the .ipynb extension, or upload one from the local file system.

The graphical interface is much more comfortable to use, and it adds numerous editing and flow-control features on top of those supported by the console mode.

If you have chosen Python for your data project, IPython is a must-have tool. Make sure it is in your toolkit.

Introduction

Having learnt NumPy and SciPy in previous articles, let's move on to our next package, pandas.
Pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment.

Pandas combines the high-performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). Its central structure is the DataFrame, a two-dimensional, tabular, column-oriented data structure with both row and column labels. Pandas provides sophisticated indexing functionality that makes it easy to reshape, slice and dice, perform aggregations, and select subsets of data.

For users of the R language for statistical computing, the DataFrame name will be familiar, as the object was named after the similar R data.frame object. However, the functionality provided by R's data.frame is merely a subset of that provided by the pandas DataFrame.

Installation

Installing pandas is, like everything in the Python ecosystem, a piece of cake. If you work with a bundled Python distribution, it may already be pre-packaged for you. To install it manually using the Python package manager:

pip install pandas

Data Structures

To get started with pandas, you will need to get comfortable with the two data structures used throughout the library: Series and DataFrame. I won't go into detail about Panel and Panel4D, somewhat less-used containers that you can read about on the pandas site.

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

import pandas as pd
import numpy as np
pd.Series([1,3,5,np.nan,6,8])

# 0     1
# 1     3
# 2     5
# 3   NaN
# 4     6
# 5     8

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

dates = pd.date_range('20130101',periods=6)
pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
#                    A         B         C         D
# 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
# 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
# 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
# 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
# 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
# 2013-01-06 -0.673690  0.113648 -1.478427  0.524988

As you can see, a DataFrame has both a row and a column index; it can be thought of as a dict of Series.
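
To make the "dict of Series" view concrete, here is a minimal sketch. The column names and row labels are invented for illustration and do not come from the example above.

```python
import pandas as pd

# Two Series that share the same row labels (labels are made up for this sketch).
prices = pd.Series([1.5, 2.0, 3.25], index=['x', 'y', 'z'])
counts = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Each dict key becomes a column name; the shared labels become the row index.
df = pd.DataFrame({'price': prices, 'count': counts})

# Selecting a single column hands back a Series with the same row labels.
price_column = df['price']

print(type(price_column).__name__)   # Series
print(sorted(df.columns))            # ['count', 'price']
print(list(df.index))                # ['x', 'y', 'z']
```

Building the frame from Series rather than plain lists means the row labels travel with the data, so columns are lined up by label rather than by position.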

Examples

Index Objects

As we've seen in the DataFrame example, we can provide a custom index to both DataFrame and Series objects and then use it to reference items.

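# dtype=float so that a float value can be assigned later without changing the dtype
s = pd.Series(range(3), index=['a', 'b', 'c'], dtype=float)
s['a'] # 0.0
s['b'] = 16.5   # assign by label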
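
A few more label-based operations, sketched here on a fresh Series so the snippet is self-contained: inspecting the index, testing for a label, and selecting several labels at once.

```python
import pandas as pd

s = pd.Series(range(3), index=['a', 'b', 'c'])

# The index is an object in its own right and can be inspected.
labels = list(s.index)

# Membership tests look at the labels, not at the values.
has_b = 'b' in s
has_z = 'z' in s

# A list of labels selects several items at once, in the order requested.
subset = s[['c', 'a']]

print(labels)            # ['a', 'b', 'c']
print(has_b, has_z)      # True False
print(subset.tolist())   # [2, 0]
```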

Reindexing

Consider a case where you'd like to change the indices of your data, resulting in the alteration, addition or removal of entries.

index = ['a', 'c', 'd']
columns = ['GBP', 'USD', 'EUR']
data = np.arange(9).reshape((3, 3))      # reshape the 9 values 0..8 into a 3x3 array
frame = pd.DataFrame(data, index=index, columns=columns)
#   GBP USD EUR
# a 0   1   2 
# c 3   4   5
# d 6   7   8

frame.reindex(columns=['GBP', 'JPY', 'EUR'])
#   GBP JPY EUR
# a 0   NaN   2 
# c 3   NaN   5
# d 6   NaN   8

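# reindex returns a new frame, so chain the drop onto it; axis=1 means "drop a column"
frame.reindex(columns=['GBP', 'JPY', 'EUR']).drop('GBP', axis=1)
#   JPY EUR
# a NaN   2
# c NaN   5
# d NaN   8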

As you can see, both adding and removing indices is as easy as breathing. Both methods accept lists of labels, so bulk changes are also possible and even advisable for optimization purposes.
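
As a sketch of such bulk changes, the snippet below rebuilds the same frame, then reindexes rows and columns in one call and drops several labels at once. The extra row label 'b' is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['GBP', 'USD', 'EUR'])

# Reindex rows and columns together; 'b' and 'JPY' are new, so they are filled with NaN.
wider = frame.reindex(index=['a', 'b', 'c', 'd'], columns=['GBP', 'JPY', 'EUR'])

# Drop several columns at once by passing a list; axis=1 targets columns.
trimmed = wider.drop(['GBP', 'JPY'], axis=1)

# Drop several rows at once; the default axis=0 targets row labels.
fewer_rows = frame.drop(['a', 'd'])

print(wider.shape)              # (4, 3)
print(list(trimmed.columns))    # ['EUR']
print(list(fewer_rows.index))   # ['c']
```

Neither reindex nor drop modifies the frame in place; each returns a new object, which is why the results are assigned to new names.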

Arithmetic and data alignment

Another important pandas feature is how arithmetic behaves between objects with different indexes. When objects are added together and their indexes are not the same, the index of the result will be the union of the two indexes.

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 + s2
# a 5.2
# c 1.1
# d NaN
# e 0.0
# f NaN
# g NaN

Here the labels d, f and g were converted to NaN because they did not have a match in both Series. The same applies to DataFrames, of course. One thing you might find useful is filling the missing values with a default before the operation. This can be achieved with the fill_value argument, supported by the corresponding arithmetic methods: add, sub, div and mul.
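
A minimal sketch of the filling behavior, reusing the two Series from above and assuming 0 is a sensible default for this data:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# A label missing from one side is treated as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)

print(total['d'])   # 3.4 (only present in s1)
print(total['f'])   # 4.0 (only present in s2)
print(total['g'])   # 3.1 (only present in s2)
```

Note that fill_value only covers labels missing from one of the two operands; a label missing from both would still come out as NaN.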

Merge

What about combining two data objects? No worries, pandas comes to the rescue. It provides various facilities for easily combining objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

key = ['foo', 'foo']
left = pd.DataFrame({'key': key, 'lval': [1, 2]})
#    key  lval
# 0  foo     1
# 1  foo     2

right = pd.DataFrame({'key': key, 'rval': [4, 5]})
#    key  rval
# 0  foo     4
# 1  foo     5

pd.concat([left,right])
#    key  lval  rval
# 0  foo     1   NaN
# 1  foo     2   NaN
# 0  foo   NaN     4
# 1  foo   NaN     5

merged = pd.merge(left, right, on='key')
#    key  lval  rval
# 0  foo     1     4
# 1  foo     1     5
# 2  foo     2     4
# 3  foo     2     5

merged.groupby('key').sum()
#      lval  rval
# key            
# foo     6    18
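
The merge above is an inner join on identical keys. As a hedged sketch of the relational side, the snippet below uses keys that only partly overlap (the 'bar' and 'baz' values are invented for illustration) and the how argument of pd.merge to choose the join type.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Default is an inner join: only keys present in both frames survive.
inner = pd.merge(left, right, on='key')

# A left join keeps every row of `left`; unmatched rval entries become NaN.
left_join = pd.merge(left, right, on='key', how='left')

# An outer join keeps the union of keys from both frames.
outer = pd.merge(left, right, on='key', how='outer')

print(len(inner), len(left_join), len(outer))   # 1 2 3
```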

Handling Missing Data

Very often, if not always, we deal with incomplete data, either by its nature, as with sensor data, or as a result of human error, as with spreadsheets. Pandas provides various functions to deal with such situations.

from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna() # similar to data[data.notnull()]
# 0 1.0
# 2 3.5
# 4 7.0

data.fillna(0)
# 0 1.0
# 1 0.0
# 2 3.5
# 3 0.0
# 4 7.0
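
The same methods work on DataFrames, where you can also control how strict the dropping is and fill each column differently. The sketch below uses a made-up sensor-style table; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [20.5, np.nan, 22.0, np.nan],
                   'humidity': [0.4, np.nan, np.nan, 0.5]})

# Default: drop every row that has at least one missing value.
complete_rows = df.dropna()

# how='all': drop only rows where every value is missing.
not_empty_rows = df.dropna(how='all')

# A dict fills each column with its own default value.
filled = df.fillna({'temp': 0.0, 'humidity': df['humidity'].mean()})

print(list(complete_rows.index))    # [0]
print(list(not_empty_rows.index))   # [0, 2, 3]
```

Whether to drop or to fill depends on the data; filling with a mean, as done for the hypothetical humidity column here, is just one possible choice.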

With its ever-evolving API, pandas provides a wealth of functionality that will ease any data scientist's life. Make sure you keep up to date with the features of every release.

Python for Data Scientists - IPython

Python for Data Scientists - Pandas
