Python for Social Science Data: Glossary

Key Points

Introduction to Python	Python is an interpreted language. The REPL (Read-Eval-Print Loop) allows rapid development and testing of code segments. Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared. Jupyter notebooks are a complete IDE (Integrated Development Environment).
Python basics	The Jupyter environment can be used to write code segments and display results. Data types in Python are implicit, based on variable values. Basic data types are Integer, Float, String and Boolean. Lists and Dictionaries are structured data types. Arithmetic uses the standard arithmetic operators; precedence can be changed using brackets. Help is available for built-in functions using the `help()` function; further help and code examples are available online. In Jupyter you can get help on function parameters using `shift`+`tab`. Many functions are in fact methods associated with specific object types.
Python control structures	Most programs will require ‘Loops’ and ‘Branching’ constructs. The `if`, `elif`, `else` statements allow for branching in code. The `for` and `while` statements allow for looping through sections of code. The programmer must provide a condition to end a `while` loop.
Creating re-usable code	Functions are used to create re-usable sections of code. Using parameters with functions makes them more flexible. You can use functions written by others by importing the libraries containing them into your code.
Processing data from a file	Reading data from files is far more common than program ‘input’ requests or hard-coding values. Python provides simple means of reading from a text file and writing to a text file. Tabular data is commonly recorded in a ‘csv’ file. Text files like csv files can be thought of as being a list of strings; each string is a complete record. You can read and write a file one record at a time. Python has built-in functions to parse (split up) records into individual tokens.
Dates and Time	Date and time functions in Python come from the `datetime` library, which needs to be imported. You can use format strings to have dates/times displayed in any representation you like. Internally, dates and times are stored in special data structures which allow you to access the component parts of dates and times.
Processing JSON data	JSON is a popular data format for transferring data, used by a great many Web-based APIs. The JSON data format is very similar to the Python Dictionary structure. The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data. We can use Python code to extract values of interest and place them in a csv file.
Reading data from a file using Pandas	pandas is a Python library containing functions and data structures to assist in data analysis. pandas data structures are the Series (like a vector) and the Dataframe (like a table). The pandas `read_csv` function allows you to read an entire `csv` file into a Dataframe.
Extracting row and columns	First key point.
Data Aggregation using Pandas	Summarising numerical and categorical variables is a very common requirement. Missing data can interfere with how statistical summaries are calculated. Missing data can be replaced or created depending on requirements. Summarising or aggregation can be done over single or multiple variables at the same time.
Joining Pandas Dataframes	You can join pandas Dataframes in much the same way as you join tables in SQL. The `concat()` function can be used to concatenate two Dataframes by adding the rows of one to the other. `concat()` can also combine Dataframes by columns, but the `merge()` function is the preferred way. The `merge()` function is equivalent to the SQL JOIN clause; ‘left’, ‘right’ and ‘inner’ joins are all possible.
Wide and long data formats	The `melt()` method can be used to change from wide to long format. The `pivot()` method can be used to change from long to wide format. Aggregations are best done from data in the long format.
Data visualisation using Matplotlib	Graphs can be drawn directly from pandas, but pandas still uses matplotlib to draw them. Different graph types have different data requirements. Graphs are created from a variety of discrete components placed on a ‘canvas’; you don’t have to use them all. Plotting multiple graphs on a single ‘canvas’ is possible.
Accessing SQLite Databases	The SQLite database system is directly available from within Python. A database table and a pandas Dataframe can be considered similar structures. Using pandas to return all of the results from a query is simpler than using sqlite3 alone.
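
As a minimal sketch of the ‘Processing data from a file’ points, the following writes a small text file and then reads it back one record at a time, splitting each record into tokens with the built-in `str.split()` method. The file name, column names and values are invented for illustration.

```python
import os
import tempfile

# Hypothetical sample data: a header record followed by two data records.
records = ["id,age,region", "1,34,North", "2,51,South"]

# Write the records to a temporary csv file, one record per line.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as outfile:
    for record in records:
        outfile.write(record + "\n")

# Read the file back one record at a time and split each record into tokens.
ages = []
with open(path) as infile:
    header = infile.readline().rstrip("\n").split(",")
    age_index = header.index("age")
    for line in infile:
        tokens = line.rstrip("\n").split(",")
        ages.append(int(tokens[age_index]))

mean_age = sum(ages) / len(ages)
print(header, ages, mean_age)
```

Splitting on commas is enough for simple records like these. Fields that themselves contain commas or quotes are better handled by the standard `csv` module.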
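
The ‘Dates and Time’ points can be illustrated with a short sketch that parses a date string, reads its component parts and displays it in two different representations. The date used here is an arbitrary example value.

```python
from datetime import datetime

# Parse a date/time string; the format string describes its layout.
interview = datetime.strptime("2017-03-25 14:30", "%Y-%m-%d %H:%M")

# The component parts are available as attributes of the datetime object.
year, month, hour = interview.year, interview.month, interview.hour

# Format strings control how the value is displayed.
uk_style = interview.strftime("%d/%m/%Y")
iso_style = interview.strftime("%Y-%m-%dT%H:%M")

print(year, month, hour, uk_style, iso_style)
```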
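
For the ‘Processing JSON data’ points, one possible approach is sketched below: load the JSON into Python dictionaries and lists, pick out the values of interest, and write them out in csv form. The document structure and field names are hypothetical, and `io.StringIO` stands in for a real output file.

```python
import csv
import io
import json

# Hypothetical JSON document with a nested structure.
document = """
[
  {"id": 1, "respondent": {"age": 34, "region": "North"}, "answers": ["yes", "no"]},
  {"id": 2, "respondent": {"age": 51, "region": "South"}, "answers": ["no"]}
]
"""

# json.loads turns JSON objects into dictionaries and JSON arrays into lists.
data = json.loads(document)

# Extract only the values of interest, giving one flat row per record.
rows = [
    [item["id"], item["respondent"]["age"], item["respondent"]["region"], len(item["answers"])]
    for item in data
]

# Write a header and the rows in csv form.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "age", "region", "n_answers"])
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Note that the list of answers is reduced to a count here; deciding how to represent nested values is the part that makes ‘flattening’ JSON awkward.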
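
The ‘Joining Pandas Dataframes’ points might look like the following in practice. This is a sketch with made-up tables and column names, assuming pandas is installed; it shows `concat()` adding rows and `merge()` performing ‘inner’ and ‘left’ joins.

```python
import pandas as pd

# Hypothetical tables: respondents and the households they belong to.
people = pd.DataFrame({"person_id": [1, 2, 3], "household_id": [10, 10, 30]})
more_people = pd.DataFrame({"person_id": [4], "household_id": [20]})
households = pd.DataFrame({"household_id": [10, 20], "region": ["North", "South"]})

# concat() adds the rows of one Dataframe to the other.
all_people = pd.concat([people, more_people], ignore_index=True)

# merge() behaves like the SQL JOIN clause; `how` selects the join type.
inner = pd.merge(all_people, households, on="household_id", how="inner")
left = pd.merge(all_people, households, on="household_id", how="left")

print(inner)
print(left)
```

The inner join drops the person whose household has no match, while the left join keeps that person and leaves `region` as a missing value.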
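
To illustrate the ‘Wide and long data formats’ points, the sketch below reshapes a small invented survey table with `melt()`, aggregates it in the long format, and reshapes it back with `pivot()`. The column names are assumptions made for the example, and pandas is assumed to be installed.

```python
import pandas as pd

# Hypothetical wide-format data: one row per respondent, one column per question.
wide_df = pd.DataFrame({"id": [1, 2], "q1": [3, 5], "q2": [4, 2]})

# melt() turns the question columns into (question, score) pairs.
long_df = wide_df.melt(id_vars="id", var_name="question", value_name="score")

# Aggregation is straightforward in the long format.
mean_by_question = long_df.groupby("question")["score"].mean()

# pivot() reverses the reshaping, giving one column per question again.
wide_again = long_df.pivot(index="id", columns="question", values="score")

print(long_df)
print(mean_by_question)
print(wide_again)
```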
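
For the ‘Accessing SQLite Databases’ points, the following sketch uses only the `sqlite3` module from the standard library. It works on an in-memory database, and the table and its contents are created purely for illustration.

```python
import sqlite3

# An in-memory database avoids needing a file on disk.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows, created only for this example.
cursor.execute("CREATE TABLE respondents (id INTEGER, age INTEGER, region TEXT)")
cursor.executemany(
    "INSERT INTO respondents VALUES (?, ?, ?)",
    [(1, 34, "North"), (2, 51, "South"), (3, 29, "North")],
)
connection.commit()

# Run a query; fetchall() returns the results as a list of tuples.
cursor.execute(
    "SELECT region, COUNT(*), AVG(age) FROM respondents GROUP BY region ORDER BY region"
)
results = cursor.fetchall()
connection.close()

print(results)
```

With pandas available, passing the same query and an open connection to `pandas.read_sql_query()` would return the results as a Dataframe instead of a list of tuples.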

Glossary

FIXME