Extracting and Transforming Data in Python
To get insights from data, you have to play with it a little.
It is important to be able to extract, filter, and transform data from DataFrames in order to drill down into the data that really matters. The pandas library offers many techniques that make this process efficient and intuitive. In this article I will walk through those techniques with code samples and explanations. Let's get started.
For this article I've created a sample DataFrame filled with random numbers. We will use it as the running example throughout the article.
import pandas as pd
import numpy as np

cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data, columns=cols, index=rows)
df.head()

Out[2]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95
Indexing DataFrames
To extract data from a pandas DataFrame we can use direct indexing or accessors. We can select the rows and columns we need using their labels:
df['col1']['row1']
Out[3]: 4
Please note the order in this type of indexing: first you specify the column label and then the row. But the truth is, this style only suits rare, small lookups, while in real life we work with much heavier machinery. It is much better to select data using the accessors .loc and .iloc. The difference between them is that .loc accepts labels while .iloc accepts integer positions. Also, with accessors we specify rows first and then columns. I had a difficult time getting used to it in the beginning (a SQL background, what else can you say).
So, to select a single value using accessors you’d do the following:
df.loc['row4', 'col2']
Out[4]: 35

df.iloc[4, 2]
Out[5]: 35
Using indexing we can select a single value, a Series, or a DataFrame from a DataFrame. Above I have shown how to select a single value.
To subselect a few columns, just pass a list of their labels and a DataFrame will be returned:
df_new = df[['col1', 'col2']]
df_new.head(3)
Out[6]:
      col1  col2
row0    78    42
row1     4    80
row2    17    80
If you also want to select specific rows, add a row range and you will get a DataFrame again. This technique is called slicing, and it is covered in more detail below.
df_new = df[['col1', 'col2']][1:4]
df_new.head(3)
Out[7]:
      col1  col2
row1     4    80
row2    17    80
row3    68    58
To select a Series, select a single column with all rows or a range of rows. Each line of code below produces the same output:
df['col0']
df.loc[:, 'col0']
df.iloc[:, 0]
Out[8]:
row0    24
row1    40
row2    83
row3    92
row4    78
Name: col0, dtype: int32
The colon means that we want to select all rows or all columns: df.loc[:, :] or df.iloc[:, :] will return all values. And that slowly brings us to slicing, which means selecting specific ranges from our data. To slice a Series, just add the range of rows you want to select using their indexes:
df['col3'][2:5]
Out[12]:
row2    26
row3    93
row4    70
Name: col3, dtype: int32
And don't forget how ranges work in Python: the first element is included, the second is excluded, and indexes start from 0. So the code above returns the rows with indexes 2, 3, and 4.
Slicing DataFrames works the same way, with just one nuance: when using .loc (labels), both borders are included. For example, select rows from label 'row1' to label 'row4', or from row index 1 to index 4, and all columns:
df.loc['row1':'row4', :]
Out[20]:
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df.iloc[1:4, :]
Out[21]:
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
The first line of code above selected row1, row2, row3, and row4, while the second selected only row1, row2, and row3. A few more examples are below.
Select columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4 and all rows:
df.loc[:, 'col1':'col4']
Out[22]:
      col1  col2  col3  col4
row0    78    42     7    96
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[:, 1:4]
Out[23]:
      col1  col2  col3
row0    78    42     7
row1     4    80    12
row2    17    80    26
row3    68    58    93
row4    63    35    70
Select rows from label ‘row1’ to label ‘row4’ or from row index 1 to index 4 and columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4:
df.loc['row1':'row4', 'col1':'col4']
Out[24]:
      col1  col2  col3  col4
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[1:4, 1:4]
Out[25]:
      col1  col2  col3
row1     4    80    12
row2    17    80    26
row3    68    58    93
Use a list to select specific columns or rows that are not in a range.
df.loc['row2':'row4', ['col1', 'col3']]
Out[28]:
      col1  col3
row2    17    26
row3    68    93
row4    63    70

df.iloc[[2, 4], 0:4]
Out[30]:
      col0  col1  col2  col3
row2    83    17    80    26
row4    78    63    35    70
Filtering DataFrames
Filtering is a more general tool that selects parts of the data based on properties of the data itself, not on indexes or labels. DataFrames have several methods for filtering; the underlying idea for all of them is a Boolean Series. The expression df['col1'] > 20 (we assume col1 is of integer type) returns a Boolean Series that is True wherever the condition holds. I will repeat the output of the .head() method here, so you don't need to scroll up to match the numbers.
Out[2]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95
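To see what the mask itself looks like, here is a minimal, self-contained sketch (it builds a tiny DataFrame inline with the same col1 values as above, so the result is reproducible, unlike our random example):

```python
import pandas as pd

# Tiny DataFrame with the same col1 values as in the example output above
df = pd.DataFrame({'col1': [78, 4, 17, 68, 63]},
                  index=['row0', 'row1', 'row2', 'row3', 'row4'])

mask = df['col1'] > 20   # a Boolean Series: True where the condition holds
print(mask.tolist())     # [True, False, False, True, True]
```

Passing such a mask back into df[...] keeps only the rows where it is True, which is exactly what the next example does.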
So, to select part of a DataFrame where values of col1 are bigger than 20 we will use the following code:
df[df['col1'] > 20]

# assigning to a variable also works
condition = df['col1'] > 20
df[condition]

Out[31]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row3    92    68    58    93    33
row4    78    63    35    70    95
We can combine those filters using the element-wise logical operators (and: &, or: |, not: ~). Notice the parentheses around each condition; they are required.
df[(df['col1'] > 25) & (df['col3'] < 30)]  # logical and
Out[33]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96

df[(df['col1'] > 25) | (df['col3'] < 30)]  # logical or
Out[34]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df[~(df['col1'] > 25)]  # logical not
Out[35]:
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
Dealing with 0 and NaN values
Almost every dataset has zero or NaN values, and we definitely want to know where they are. Ours is a special case, so we will modify it a little:
df.iloc[3, 3] = 0
df.iloc[1, 2] = np.nan
df.iloc[4, 0] = np.nan
df['col5'] = 0
df['col6'] = np.nan
df.head()
Out[57]:
      col0  col1  col2  col3  col4  col5  col6
row0  24.0    78  42.0     7    96     0   NaN
row1  40.0     4   NaN    12    84     0   NaN
row2  83.0    17  80.0    26    15     0   NaN
row3  92.0    68  58.0     0    33     0   NaN
row4   NaN    63  35.0    70    95     0   NaN
To select columns that don't have any zero values, we can use the .all() method (all data is present):
df.loc[:, df.all()]
Out[43]:
      col0  col1  col2  col4  col6
row0  24.0    78  42.0    96   NaN
row1  40.0     4   NaN    84   NaN
row2  83.0    17  80.0    15   NaN
row3  92.0    68  58.0    33   NaN
row4   NaN    63  35.0    95   NaN
If we want to find columns that have at least one nonzero value, .any() will help:
df.loc[:, df.any()]
Out[47]:
      col0  col1  col2  col3  col4
row0  24.0    78  42.0     7    96
row1  40.0     4   NaN    12    84
row2  83.0    17  80.0    26    15
row3  92.0    68  58.0     0    33
row4   NaN    63  35.0    70    95
To select columns with any NaN:
df.loc[:, df.isnull().any()]
Out[48]:
      col0  col2  col6
row0  24.0  42.0   NaN
row1  40.0   NaN   NaN
row2  83.0  80.0   NaN
row3  92.0  58.0   NaN
row4   NaN  35.0   NaN
Select columns without NaNs:
df.loc[:, df.notnull().all()]
Out[49]:
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0
We can drop rows or columns containing NaNs, but it's a dangerous game: dropping usually isn't a solution. You have to understand your data and deal with such values wisely. I warned you.
Are you sure you want to know? OK, here it is.
df.dropna(how='all', axis=1)  # a column is dropped only if all of its values are NaN
Out[69]:
      col0  col1  col2  col3  col4  col5
row0  24.0    78  42.0     7    96     0
row1  40.0     4   NaN    12    84     0
row2  83.0    17  80.0    26    15     0
row3  92.0    68  58.0     0    33     0
row4   NaN    63  35.0    70    95     0

df.dropna(how='any', axis=1)  # a column is dropped if any of its values is NaN
Out[71]:
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0
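A gentler alternative worth knowing (it goes beyond the article's examples, so treat this as a sketch) is .fillna(), which replaces NaNs with a value of your choosing, for example the column mean:

```python
import numpy as np
import pandas as pd

# A small column with one missing value, defined inline for reproducibility
df = pd.DataFrame({'col0': [24.0, 40.0, 83.0, 92.0, np.nan]})

# Replace the NaN with the mean of the non-missing values instead of dropping
df['col0'] = df['col0'].fillna(df['col0'].mean())
print(df['col0'].tolist())  # [24.0, 40.0, 83.0, 92.0, 59.75]
```

Whether a mean, a median, or a domain-specific default is appropriate depends entirely on your data, which is exactly the point of the warning above.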
These methods do not modify the original DataFrame, so to continue working with the filtered data you have to assign the result to a new DataFrame or reassign it to the existing one:
df = df.dropna(how='any', axis=1)
The beauty of filtering is that we can actually select or modify values of one column based on another. For example, we can select values from col1 where col2 is greater than 35 and update them by adding 5 to each:
# Select values in one column based on another
df['col1'][df['col2'] > 35]
Out[74]:
row0    78
row2    17
row3    68
Name: col1, dtype: int32

# Modify one column based on another
df['col1'][df['col2'] > 35] += 5
df['col1']
Out[77]:
row0    83
row1     4
row2    22
row3    73
row4    63
Name: col1, dtype: int32
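A caveat: chained indexing like df['col1'][...] += 5 can trigger pandas' SettingWithCopyWarning and may fail to update the original in newer pandas versions. The single-call .loc form is the safer equivalent; here is a sketch with a tiny inline DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [78, 4, 17],
                   'col2': [42, 80, 35]})

# One indexer call: rows where col2 > 35, column 'col1', updated in place
df.loc[df['col2'] > 35, 'col1'] += 5
print(df['col1'].tolist())  # [83, 9, 17]
```

Because the row mask and the column label go into a single .loc call, pandas knows it is writing to the original DataFrame rather than to a possible copy.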
And this brings us to the next part:
Transforming DataFrames
Once we have selected or filtered our data, we want to transform it somehow. The best way to do this is with the methods built into DataFrames or with NumPy universal functions, which transform an entire column of data element-wise. Examples would be pandas' .floordiv() method (from the documentation: 'Integer division of dataframe and other, element-wise') or NumPy's .floor_divide() (from the docs: 'Return the largest integer smaller or equal to the division of the inputs.').
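A quick sketch of both, on a small DataFrame defined inline (the values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [24, 40, 83],
                   'b': [7, 12, 26]})

# pandas method: integer-divide every element by 10
print(df.floordiv(10))

# NumPy ufunc: the same element-wise operation on the DataFrame
print(np.floor_divide(df, 10))
```

Both calls produce the same element-wise result here; the pandas method additionally aligns on labels when the other operand is a Series or DataFrame.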
If those functions were not available, we could write our own and use it with the .apply() method.
def some_func(x):
    return x * 2

df.apply(some_func)  # updates each entry of the DataFrame without any loops

# a lambda also works
df.apply(lambda n: n * 2)  # the same
These methods return a new object rather than modifying the DataFrame in place, so we have to store the result explicitly:
df['new_col'] = df['col4'].apply(lambda n: n * 2)
df.head()
Out[82]:
      col0  col1  col2  col3  col4  col5  col6  new_col
row0  24.0    83  42.0     7    96     0   NaN      192
row1  40.0     4   NaN    12    84     0   NaN      168
row2  83.0    22  80.0    26    15     0   NaN       30
row3  92.0    73  58.0     0    33     0   NaN       66
row4   NaN    63  35.0    70    95     0   NaN      190
If the index consists of strings, it has a .str accessor that lets us modify the entire index at once:
df.index.str.upper()
Out[83]: Index(['ROW0', 'ROW1', 'ROW2', 'ROW3', 'ROW4'], dtype='object')
Also, we cannot use the .apply() method on an index; the alternative is .map():
df.index = df.index.map(str.lower)
df.index
Out[85]: Index(['row0', 'row1', 'row2', 'row3', 'row4'], dtype='object')
But .map() can be used on columns as well. For example:
# Create the dictionary: red_vs_blue
red_vs_blue = {0: 'blue', 12: 'red'}

# Use the dictionary to map the 'col3' column to the new column df['color']
df['color'] = df['col3'].map(red_vs_blue)
df.head()
Out[92]:
      col0  col1  col2  col3  col4  col5  col6  new_col color
row0  24.0    83  42.0     7    96     0   NaN      192   NaN
row1  40.0     4   NaN    12    84     0   NaN      168   red
row2  83.0    22  80.0    26    15     0   NaN       30   NaN
row3  92.0    73  58.0     0    33     0   NaN       66  blue
row4   NaN    63  35.0    70    95     0   NaN      190   NaN
Arithmetic operations on Series and DataFrames work directly. The expression below creates a new column 'col7' where each value at index n is the sum of the values at index n from 'col3' and 'col4'.
df['col7'] = df['col3'] + df['col4']
df.head()
Out[94]:
      col0  col1  col2  col3  col4  col5  col6  new_col color  col7
row0  24.0    83  42.0     7    96     0   NaN      192   NaN   103
row1  40.0     4   NaN    12    84     0   NaN      168   red    96
row2  83.0    22  80.0    26    15     0   NaN       30   NaN    41
row3  92.0    73  58.0     0    33     0   NaN       66  blue    33
row4   NaN    63  35.0    70    95     0   NaN      190   NaN   165
This is the second version of the article, because the first one was a complete mess: errors in the code, no examples, and a few other problems. Thanks to the feedback, I went through the article one more time and I think it looks much better now. I have covered the basics of extracting and transforming data in Python with code snippets and examples, and hopefully it will be useful for people who are just starting their path in this field.
Meanwhile, love data science and smile more. We have to stay positive, as we have the sexiest job of the 21st century!