Humanity produces enormous amounts of data every second. We were already producing sizeable amounts of data 50 to 60 years ago, but back then aggregating it was a major problem. Today we have excellent data aggregation mechanisms in place, and many libraries and programming languages have been created to analyse this data, whether it is stored or in motion. In this post I cover the data analysis toolkit from the Python world, called pandas.
Python itself is a sleek, simple, and powerful language. Other powerful tools and programming languages can do data analysis as well, e.g. the R programming language. Both R and Python have vibrant communities, and it's not about competition but about usage. So let's look at pandas first.
Pandas is a powerful, impressive, to-the-point, and simple data analysis toolkit. By the way, the last time I looked at its manual it was 2045 pages long, but don't be put off by the size of the documentation; think about how powerful it becomes if you just keep learning incrementally.
It provides data conversion (e.g. from CSV to other formats), indexing, I/O, plotting, reshaping, categorizing, and much more. The main concepts you need to wrap your head around are probably the groupby, resample, and rolling techniques. It is also evolving at a high pace, and I see features added almost daily; recent additions include a new API for DataFrames/Series, groupby enhancements, better support for URLs, SciPy sparse matrix enhancements, Excel output for styled DataFrames, IntervalIndex, and much more.
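To make the groupby, resample, and rolling ideas a bit more concrete, here is a minimal sketch. The sales data, the column names (date, store, sales), and the variable names are all made up for illustration; the CSV text is inlined so the snippet runs on its own, and the last step shows one possible conversion from CSV to another format (JSON).

```python
import io

import pandas as pd

# Hypothetical sales data, inlined as CSV text so the example is self-contained.
csv_text = """date,store,sales
2017-01-01,A,10
2017-01-02,A,20
2017-01-03,B,30
2017-01-08,A,40
2017-01-09,B,50
"""

# Read the CSV and parse the "date" column into real timestamps.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# groupby: split the rows by store, then sum the sales of each group.
per_store = df.groupby("store")["sales"].sum()

# resample: bucket a time-indexed Series into weekly bins and sum each bin.
daily = df.set_index("date")["sales"]
weekly = daily.resample("W").sum()

# rolling: mean over a sliding window of 2 consecutive rows.
# The first value is NaN because its window is not yet full.
rolling_mean = daily.rolling(window=2).mean()

# Conversion: write the same frame out in another format (JSON here).
json_text = df.to_json(orient="records", date_format="iso")

print(per_store)
print(weekly)
print(rolling_mean)
print(json_text)
```

In short: groupby answers "per category" questions, resample answers "per time period" questions, and rolling answers "over the last N observations" questions.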
You can install pandas with Anaconda, Miniconda, PyPI, your Linux distribution's package manager, from source, and in other ways. For example, here is how I install it from source on Linux:
# install dependencies
sudo apt-get install python-numpy cython
# test numpy and Cython (run these lines inside the Python interpreter)
>>> import numpy
>>> print(numpy.__version__)
>>> import Cython
>>> print(Cython.__version__)
# download pandas
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/pydata/pandas.git
# build pandas
cd pandas
python setup.py build_ext --inplace
# test pandas (start Python from inside the pandas directory)
>>> import pandas
>>> print(pandas.__version__)
# enjoy your pet panda
You can also contribute to pandas using git on GitHub: fork the repository and do all the funky things we developers do there. In any case, you need to follow some standards, e.g. cpplint, PEP8, and backward compatibility rules. Don't worry, just follow the community.
What to know beforehand
You should know what data structures are and why they are used. You should also understand the mutability of data, and how you can move, copy, remove, or manipulate basic data.
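If mutability and copying are new to you, the following small sketch may help. It is only an illustration with made-up variable names: it shows that assigning a list to a second name does not copy it, that tuples cannot be changed in place, and that a pandas Series can be copied with its copy() method before you modify it.

```python
import copy

import pandas as pd

# Lists are mutable, and assignment only creates a second name
# for the same underlying object.
original = [1, 2, 3]
alias = original                    # no copy is made
independent = copy.copy(original)   # a new, separate list

alias.append(4)                     # changes the list that both names share

print(original)      # [1, 2, 3, 4]
print(independent)   # [1, 2, 3]

# Tuples are immutable: trying to change one raises TypeError.
point = (1, 2)
try:
    point[0] = 10
except TypeError as exc:
    print("tuples cannot be modified:", exc)

# The same idea applies to pandas objects: copy() gives you an
# independent Series, so changing the copy leaves the original alone.
s = pd.Series([1, 2, 3])
s_copy = s.copy()
s_copy.iloc[0] = 99

print(s.iloc[0])       # 1
print(s_copy.iloc[0])  # 99
```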
Basic object creation
You can create a pandas object called a Series using the following lines of code:
import pandas as pd
import numpy as np
s = pd.Series([2, 4, 6, np.nan, 8, 10])
Above I created a Series, but you can create a DataFrame as well. Once these objects are created, you can manipulate them in multiple ways, e.g. you can subset, plot, or apply different functions (the community has already written a lot of them, so it is better to look through the references than to build your own).
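As a rough sketch of what that manipulation can look like, the snippet below builds a DataFrame from the same Series and then subsets it and applies a few functions. The column names (value, label) and the variable names are my own choices for illustration, not anything pandas requires.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 6, np.nan, 8, 10])

# Build a DataFrame from the Series plus a hypothetical label column.
df = pd.DataFrame({
    "value": s,
    "label": ["a", "b", "c", "d", "e", "f"],
})

# Subset with a boolean condition. The NaN row is left out because
# comparing NaN with a number evaluates to False.
large = df[df["value"] > 4]

# Built-in reductions skip NaN by default: (2 + 4 + 6 + 8 + 10) / 5.
mean_value = s.mean()

# Apply your own function element by element; NaN stays NaN.
doubled = s.apply(lambda x: x * 2)

# Replace missing values before further processing.
filled = s.fillna(0)

print(large)
print(mean_value)
print(doubled)
print(filled)
```

Plotting follows the same pattern: if matplotlib is installed, calling s.plot() draws the Series as a line chart.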
Pandas is an exciting toolkit. If you are a Python developer, learning it won't be difficult, and rest assured that you can do most of the data analytics tasks that many other industry-leading software tools can do.