Data Aggregation using Pandas

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can I summarise the data in a data frame?

Objectives
  • Access and summarize data stored in a Data Frame

  • Perform basic mathematical operations and summary statistics on data in a Pandas Data Frame

  • Understand missing data

  • Changing to and from ‘NaN’ values

Using Pandas functions to summarise data in a Data Frame

For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variables we are often interested in how many of each unique values are present in the dataset.

We shall use the SAFI_clean.csv dataset to demonstrate how we can obtain these pieces of information

import pandas as pd
df_SAFI = pd.read_csv("SAFI_clean.csv")
df_SAFI

For numeric variables we can obtain a variety of basic statistical information by using the describe() method.

df_SAFI.describe()
           key_ID  no_membrs   years_liv       rooms   liv_count    no_meals
count  131.000000  131.00000  131.000000  131.000000  131.000000  131.000000
mean    85.473282    7.19084   23.053435    1.740458    2.366412    2.603053
std     63.151628    3.17227   16.913041    1.092547    1.082775    0.491143
min      1.000000    2.00000    1.000000    1.000000    1.000000    2.000000
25%     32.500000    5.00000   12.000000    1.000000    1.000000    2.000000
50%     66.000000    7.00000   20.000000    1.000000    2.000000    3.000000
75%    138.000000    9.00000   27.500000    2.000000    3.000000    3.000000
max    202.000000   19.00000   96.000000    8.000000    5.000000    3.000000

This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as NaN you will get a warning message.

You can also .describe() on a single variable basis.

df_SAFI['no_membrs'].describe()

There are also a set of methods which allow us to obtain individual values.

print(df_SAFI['no_membrs'].min())
print(df_SAFI['no_membrs'].max())
print(df_SAFI['no_membrs'].mean())
print(df_SAFI['no_membrs'].std())
print(df_SAFI['no_membrs'].count())
print(df_SAFI['no_membrs'].sum())
2
19
7.190839694656488
3.1722704895263734
131
942

Unlike the describe() method which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed.

Categorical variables

For categorical variables, numerical statistics don’t make any sense. For a categorical variable we can obtain a list of unique values used by the variable by using the unique() method.

pd.unique(df_SAFI['respondent_wall_type'])
array(['muddaub', ' muddaub', ' burntbricks', 'burntbricks', 'sunbricks', 'cement'], dtype=object)

Do you notice how ‘muddaub’ and ‘burntbricks’ appear twice, once with and once without a leading whitespace? We obviously overlooked to remove these whitespaces correctly in OpenRefine. There are lots of ways to do that in Python, one is to use the argument skipinitialspace=True in the pd.read_csv() function. You can read more about this and other arguments for pd.read_csv() in the pandas documentation.

SAFI_df = pd.read_csv('SAFI_clean.csv', skipinitialspace=True)
pd.unique(df_SAFI['respondent_wall_type'])
array(['muddaub', 'burntbricks', 'sunbricks', 'cement'], dtype=object)

We can also count the number of values in categorial variables in the same way as for numeric variables using count().

df_SAFI['respondent_wall_type'].count()

Exercise

  1. Get the unique entries for the affect_conflicts variable.

  2. Get the count value for affect_conflicts and compare it with the count for respondent_wall_type.

  3. Why do you think they are different?

Solution

print(pd.unique(df_SAFI['affect_conflicts']))
print(df_SAFI['affect_conflicts'].count())

We know from when we originally displayed the contents of the df_SAFI Dataframe that there are 131 rows in it. This matches the value for the respondent_wall_type count. The count for affect_conflicts however is only 92. If you look at the values in the affect_conflicts column using

df_SAFI['affect_conflicts']

you will see that there are several NaN values. NaN stands for Not a Number, i.e. the value is missing. There are only 92 non-missing values and this is what is reported by the count() method. For numeric variables, this value is also used in the calculation of the mean and std values.

Dealing with missing values

We can find out how many variables in our Dataframe contain any NaN values with the code

df_SAFI.isna().sum()
key_ID                   0
village                  0
interview_date           0
no_membrs                0
years_liv                0
respondent_wall_type     0
rooms                    0
memb_assoc              39
affect_conflicts        39
liv_count                0
items_owned             10
no_meals                 0
months_lack_food         0
instanceID               0
dtype: int64

or for a specific variable

df_SAFI['affect_conflicts'].isna().sum()
39

Data from most sources have the potential to include missing values. Whether or not this presents a problem at all depends on what you are planning to do.

The SAFI dataset we are using comes from a project called ‘Studying African Farmer-led Irrigation’. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of ‘NaN’ values in a record (one complete questionnaire) where you would still consider the record to be complete.

We have already seen how we can check for missing values. There are three other actions we need to be able to do:

  1. Remove complete rows which contain NaN
  2. Replace NaN with a value of our choice
  3. Replace specific values with NaN

With these options we can ensure that the data is suitable for the further processing we have planned.

Completely remove rows with NaNs

The dropna() method will delete all rows if any of the variables contain an NaN. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.

print(df_SAFI.shape)
df_SAFI.dropna(inplace=True)
print(df_SAFI.shape)
(131, 14)
(88, 14)

Note about the inplace=True argument: When inplace=True is passed to a Dataframe method, the resulting (changed) data are put in place of the original data and nothing is returned, i.e. the orginal Dataframe is changed accordingly. If using inplace=False instead, the default setting, the original Dataframe is not changed and a copy with the changed data is returned.

If we only want to remove the rows where a specific variable has NaN values, we can use the subset argument in dropna().

df_SAFI = pd.read_csv("SAFI_clean.csv")
print(df_SAFI.shape)
df_SAFI.dropna(subset = ['items_owned'], inplace=True)
print(df_SAFI.shape)
(131, 14)
(121, 14)

Replace NaN with a value of our choice

The affect_conflicts variable answers the question: “Have you been affected by conflicts with other irrigators in the area ?”. There are 39 NaN values probably originating from interviewees not answering the question. We might want to have something like ‘no_response’ for certain analyses. We can replace values in a Dataframe with the replace() method; to create NaN values as the to_replace value we require the numpy module.

import numpy as np
df_SAFI.affect_conflicts.replace(to_replace=np.NaN, value='no_response', inplace=True)
df_SAFI

By using replace() with value=np.NaN we can replace a specific values with NaN:

df_SAFI.affect_conflicts.replace(to_replace='no_response', value=np.NaN, inplace=True)
df_SAFI

Aggregating data

Knowing all of the unique values is useful but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the groupby method.

Having performed the groupby() we can then describe() the results. The format is similar to that which we have seen before except that the ‘grouped by’ variable appears to the left and there is a set of statistics for each unique value of the variable.

grouped_data = df_SAFI.groupby('respondent_wall_type')
grouped_data.describe()

You can group by more than one variable at a time by providing them as a list.

grouped_data = df_SAFI.groupby(['respondent_wall_type', 'village'])
grouped_data.describe()

You can also obtain individual statistics if you want.

no_membrs = df_SAFI.groupby(['village', 'respondent_wall_type'])['no_membrs'].mean()
no_membrs
village   respondent_wall_type
Chirodzo  burntbricks             8.181818
          muddaub                 5.625000
          sunbricks               6.000000
God       burntbricks             7.473684
          muddaub                 5.466667
          sunbricks               7.888889
Ruaca     burntbricks             7.730769
          cement                  7.000000
          muddaub                 7.000000
          sunbricks               8.285714
Name: no_membrs, dtype: float64

Exercise

  1. Read in the SAFI_clean.csv dataset.
  2. Get a list of the different rooms values.
  3. Groupby rooms and describe the results.
  4. Remove all rows with NaN values.
  5. repeat steps 2 & 3 and compare the results.

Solution

# Steps 1 and 2
import pandas as pd
df_SAFI = pd.read_csv("SAFI_clean.csv")
print(df_SAFI.shape)
print(pd.unique(df_SAFI['rooms']))
# Step 3
grouped_data = df_SAFI.groupby('rooms')
grouped_data.describe()
# steps 4 and 5
df_SAFI.dropna(inplace=True)
grouped_data = df_SAFI.groupby('rooms')
print(df_SAFI.shape)
print(pd.unique(df_SAFI['rooms']))
grouped_data.describe()

Key Points

  • Summarising numerical and categorical variables is a very common requirement

  • Missing data can interfere with how statistical summaries are calculated

  • Missing data can be replaced or created depending on requirement

  • Summarising or aggregation can be done over single or multiple variables at the same time