Python for Social Science Data: Glossary

Key Points

Introduction to Python
  • Python is an interpreted language

  • The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments

  • Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared

  • The Jupyter notebook environment is a complete IDE (Integrated Development Environment)

Python basics
  • The Jupyter environment can be used to write code segments and display results

  • Data types in Python are implicit: the type of a variable is determined by the value assigned to it

  • Basic data types are Integer, Float, String and Boolean

  • Lists and Dictionaries are structured data types

  • Arithmetic uses the standard arithmetic operators; precedence can be changed using brackets

  • Help is available for built-in functions using the help() function; further help and code examples are available online

  • In Jupyter you can get help on function parameters using Shift+Tab

  • Many functions are in fact methods associated with specific object types
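
The points above can be illustrated with a minimal sketch. The variable names and values are invented for illustration only.

```python
# Types are inferred from the assigned values; no declarations are needed.
count = 3            # int
mean_age = 41.5      # float
region = "North"     # str
is_valid = True      # bool

type_names = [type(v).__name__ for v in (count, mean_age, region, is_valid)]
print(type_names)

# Lists hold ordered items; dictionaries map keys to values.
ages = [23, 35, 41]
respondent = {"id": 1, "region": region}

# Brackets change the order in which arithmetic is evaluated.
without_brackets = 2 + 3 * 4      # multiplication first: 14
with_brackets = (2 + 3) * 4       # addition first: 20

# upper() is a method attached to string objects, not a standalone function.
shouted = region.upper()
print(without_brackets, with_brackets, shouted)
```

Calling help(len) or help(str.upper) in the same session prints the documentation for those built-ins.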

Python control structures
  • Most programs will require ‘Loops’ and ‘Branching’ constructs.

  • The if, elif, else statements allow for branching in code.

  • The for and while statements allow for looping through sections of code

  • The programmer must provide a condition to end a while loop.
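
A short sketch of branching and looping, using made-up scores and thresholds:

```python
scores = [35, 62, 88]   # illustrative values
grades = []

# The for loop visits each score; if/elif/else picks one branch per score.
for score in scores:
    if score >= 70:
        grades.append("high")
    elif score >= 50:
        grades.append("medium")
    else:
        grades.append("low")

# A while loop repeats until its condition becomes False, so the body
# must change something the condition depends on.
countdown = 3
steps = 0
while countdown > 0:
    countdown -= 1
    steps += 1

print(grades, steps)
```

If the line countdown -= 1 were left out, the condition would never become False and the loop would not end.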

Creating re-usable code
  • Functions are used to create re-usable sections of code

  • Using parameters with functions makes them more flexible

  • You can use functions written by others by importing the libraries containing them into your code
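
As a small illustration, the sketch below defines a hypothetical summarise() function with a default parameter and borrows mean() from the standard statistics library. The function name and its arguments are assumptions made for this example.

```python
from statistics import mean   # a function written by others, made available by importing


def summarise(values, decimals=1):
    """Return the mean of a list of numbers, rounded to `decimals` places."""
    return round(mean(values), decimals)


# The same function can be reused with different parameter values.
average = summarise([2, 3, 5])               # uses the default of 1 decimal place
precise = summarise([2, 3, 5], decimals=3)   # overrides the default
print(average, precise)
```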

Processing data from a file
  • Reading data from files is far more common than program ‘input’ requests or hard coding values

  • Python provides simple means of reading from a text file and writing to a text file

  • Tabular data is commonly recorded in a ‘csv’ file

  • Text files such as csv files can be thought of as a list of strings, where each string is a complete record

  • You can read and write a file one record at a time

  • Python has built-in functions to parse (split up) records into individual tokens
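
The following sketch writes a few records to a text file and reads them back one at a time. The file name, column names and values are invented, and a temporary directory is used only so that the example runs anywhere.

```python
import os
import tempfile

# Hypothetical sample records; a real script would normally open an existing file.
records = ["id,region,age", "1,North,34", "2,South,51"]
path = os.path.join(tempfile.mkdtemp(), "survey.csv")

# Write the file one record (line) at a time.
with open(path, "w") as outfile:
    for record in records:
        outfile.write(record + "\n")

# Read it back one record at a time and split each record into tokens.
ages = []
with open(path) as infile:
    header = infile.readline().strip().split(",")
    age_position = header.index("age")
    for line in infile:
        tokens = line.strip().split(",")
        ages.append(int(tokens[age_position]))   # tokens are strings until converted

print(header, ages)
```

Splitting on a comma is enough for simple records like these; fields that themselves contain commas or quotes are better handled by the standard csv module.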

Dates and Time
  • Date and Time functions in Python come from the datetime library, which needs to be imported

  • You can use format strings to have dates/times displayed in any representation you like

  • Internally, dates and times are stored in special data structures which allow you to access the component parts of dates and times
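
A minimal sketch of parsing, inspecting and re-formatting a date with the datetime library. The date string and its layout are assumptions chosen for the example.

```python
from datetime import datetime

# Parse a text value into a datetime object using a format string.
interview = datetime.strptime("2017-03-15 14:30", "%Y-%m-%d %H:%M")

# The component parts are available as attributes.
year, month, hour = interview.year, interview.month, interview.hour

# A different format string gives a different representation of the same moment.
formatted = interview.strftime("%d/%m/%Y")
print(year, month, hour, formatted)
```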

Processing JSON data
  • JSON is a popular format for transferring data and is used by a great many web-based APIs

  • The JSON data format is very similar to the Python Dictionary structure.

  • The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data

  • We can use Python code to extract values of interest and place them in a csv file
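
To make this concrete, the sketch below parses a small, invented JSON document, picks out a few values and writes them as csv rows. The field names (id, user, name, tags) are hypothetical, and the csv is written to an in-memory buffer rather than a file so the example is self-contained.

```python
import csv
import io
import json

# Invented sample data with nested objects and a list, as is typical of API output.
raw = """
[
  {"id": 1, "user": {"name": "Ann", "location": "Leeds"}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "Bob", "location": "York"}, "tags": []}
]
"""

documents = json.loads(raw)   # JSON objects become Python dictionaries

output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["id", "name", "n_tags"])
for doc in documents:
    # Reach into the nested structure for just the values of interest.
    writer.writerow([doc["id"], doc["user"]["name"], len(doc["tags"])])

csv_text = output.getvalue()
print(csv_text)
```

Passing an open file object to csv.writer instead of the StringIO buffer would write the same rows to disk.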

Reading data from a file using Pandas
  • pandas is a Python library containing functions and data structures to assist in data analysis

  • pandas data structures are the Series (like a vector) and the Dataframe (like a table)

  • The pandas read_csv function allows you to read an entire csv file into a Dataframe

Extracting rows and columns
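
A minimal sketch of read_csv, assuming pandas is installed. An in-memory string with invented columns stands in for a csv file on disk; in practice a file path would be passed instead.

```python
import io

import pandas as pd

# Invented csv content standing in for a real file.
csv_text = "id,region,age\n1,North,34\n2,South,51\n3,North,27\n"

df = pd.read_csv(io.StringIO(csv_text))   # the whole file becomes one Dataframe
ages = df["age"]                          # a single column is a Series

print(df.shape)
print(type(ages).__name__)
```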
  • First key point.

Data Aggregation using Pandas
  • Summarising numerical and categorical variables is a very common requirement

  • Missing data can interfere with how statistical summaries are calculated

  • Missing data can be replaced or created depending on requirements

  • Summarising or aggregating can be done over a single variable or over multiple variables at the same time
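
The effect of missing data on a summary can be sketched as follows, assuming pandas is installed. The column names and values are invented, and replacing a missing value with 0 is shown only to demonstrate how the result changes, not as a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "age": [30, float("nan"), 40, 50],      # one missing value
    "income": [100, 200, 300, 500],
})

# By default the missing value is skipped: (30 + 40 + 50) / 3
mean_age = df["age"].mean()

# Replacing it with 0 changes the result: (30 + 0 + 40 + 50) / 4
filled_mean = df["age"].fillna(0).mean()

# Aggregate several variables at once, grouped by a categorical variable.
by_region = df.groupby("region")[["age", "income"]].mean()

print(mean_age, filled_mean)
print(by_region)
```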

Joining Pandas Dataframes
  • You can join pandas Dataframes in much the same way as you join tables in SQL

  • The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other.

  • concat() can also combine Dataframes by columns, but the merge() function is the preferred way

  • The merge() function is equivalent to the SQL JOIN clause. ‘left’, ‘right’ and ‘inner’ joins are all possible.
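
A brief sketch of concat() and merge(), assuming pandas is installed. The three small Dataframes and their column names are invented for illustration.

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
more_people = pd.DataFrame({"id": [4], "name": ["Di"]})
answers = pd.DataFrame({"id": [2, 3, 5], "answer": ["yes", "no", "yes"]})

# concat() stacks the rows of one Dataframe under the other.
all_people = pd.concat([people, more_people], ignore_index=True)

# An inner join keeps only ids present in both Dataframes.
inner = pd.merge(people, answers, on="id", how="inner")

# A left join keeps every row of `people`; unmatched rows get a missing answer.
left = pd.merge(people, answers, on="id", how="left")

print(len(all_people), inner["id"].tolist(), len(left))
```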

Wide and long data formats
  • The melt() method can be used to change from wide to long format

  • The pivot() method can be used to change from long to wide format

  • Aggregations are best done from data in the long format.
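
The round trip between the two formats can be sketched like this, assuming pandas is installed. The question columns q1 and q2 and the scores are invented.

```python
import pandas as pd

# Wide format: one row per respondent, one column per question.
wide = pd.DataFrame({"id": [1, 2], "q1": [3, 4], "q2": [5, 1]})

# Long format: one row per respondent-question pair.
long = wide.melt(id_vars="id", var_name="question", value_name="score")

# pivot() reverses the change, giving one column per question again.
back = long.pivot(index="id", columns="question", values="score")

# In long format, one groupby covers every question.
mean_by_question = long.groupby("question")["score"].mean()

print(long)
print(mean_by_question)
```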

Data visualisation using Matplotlib
  • Graphs can be drawn directly from pandas, but pandas still uses matplotlib to draw them

  • Different graph types have different data requirements

  • Graphs are created from a variety of discrete components placed on a ‘canvas’; you don’t have to use them all

  • Plotting multiple graphs on a single ‘canvas’ is possible
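
A rough sketch of two graphs on one ‘canvas’, assuming pandas and matplotlib are installed. The data is invented, and the Agg backend is selected only so that the example runs without a display; in a Jupyter notebook that line would normally be left out.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 52, 29], "income": [18, 30, 42, 50, 25]})

# One figure (the canvas) holding two axes side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# A histogram needs a single numerical variable.
df["age"].plot(kind="hist", ax=ax1, bins=5)
ax1.set_title("Age distribution")
ax1.set_xlabel("Age")

# A scatter plot needs two numerical variables.
df.plot(kind="scatter", x="age", y="income", ax=ax2)
ax2.set_title("Income by age")

fig.tight_layout()
```

Titles and axis labels are optional components; the plots are drawn by matplotlib even though they are requested through pandas.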

Accessing SQLite Databases
  • The SQLite database system is directly available from within Python

  • A database table and a pandas Dataframe can be considered similar structures

  • Using pandas to return all of the results from a query is simpler than using sqlite3 alone
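
The two approaches can be compared in a small sketch, assuming pandas is installed. An in-memory database with an invented survey table is used so that the example does not depend on an existing database file.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database; a real script would connect to a file instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE survey (id INTEGER, region TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [(1, "North", 34), (2, "South", 51)],
)
con.commit()

query = "SELECT id, region, age FROM survey ORDER BY id"

# With sqlite3 alone the result is a list of tuples without column names.
rows = con.execute(query).fetchall()

# With pandas the same query returns a Dataframe with named columns.
df = pd.read_sql_query(query, con)

con.close()
print(rows)
print(df)
```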

Glossary

FIXME