Python for data analysis: Glossary

Key Points

Before we start
  • Python is an open source and platform independent programming language

  • the SciPy ecosystem for Python provides the tools necessary for scientific computing

  • Spyder is a great IDE to code in and interact with Python

  • with its large community it is easy to find help on the internet

Introduction to Python
  • the console works like a fancy calculator

  • naming variables in Python should be consistent; style guides such as PEP 8 can be followed

  • assigning a value to one variable does not change the values of other variables

  • functions take zero or more arguments, some of which can be optional

  • modules increase the functionality of Python

  • Python knows numerical, text and logical data types

  • lists and NumPy arrays are versatile data structures in Python

  • subsetting uses square brackets and indexing starts at 0
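
The points above can be tried out in the console. The following is a minimal sketch; the variable names and values are made up for illustration, and the standard-library math module stands in for any module.

```python
import math  # a module adds functionality beyond the built-ins

# Assigning to one variable does not change the value of another variable.
weight_kg = 55
weight_lb = 2.2 * weight_kg
weight_kg = 60                 # weight_lb keeps the value computed earlier

# round() has one required argument and one optional argument (ndigits).
rounded_default = round(3.14159)     # ndigits omitted
rounded_two = round(3.14159, 2)      # ndigits given

# Numerical, text and logical data types.
types_seen = [type(1), type(1.5), type("plot"), type(True)]

# A function provided by a module.
root = math.sqrt(16)

# Subsetting a list uses square brackets, and indexing starts at 0.
numbers = [10, 20, 30, 40]
first = numbers[0]             # the first element
middle = numbers[1:3]          # positions 1 and 2; the stop bound is excluded
```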

Starting With Data
  • pd.read_csv() is used to import tabular data into Python

  • attributes and methods to inspect dataframes are .dtypes, .shape, .head(), .tail()

  • individual columns from a dataframe can be chosen using ['ColumnName']

  • methods for basic statistics on dataframes are .describe(), .min(), .max(), .mean(), …

  • .groupby() can be used to group a dataframe by categories

  • the datetime module offers functionality for datetime objects
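
As an illustration of these points, here is a small sketch. The table, its column names and its values are hypothetical and are read from an in-memory string so that the example runs without a file on disk.

```python
import io
from datetime import datetime

import pandas as pd

# Hypothetical survey data, inlined so that no file is needed.
csv_text = """record_id,species,sex,weight,date
1,DM,M,40.0,2001-07-16
2,DM,F,42.0,2001-07-16
3,PF,F,8.0,2001-08-19
4,PF,M,7.0,2001-08-19
"""

surveys = pd.read_csv(io.StringIO(csv_text))   # a file path works the same way

n_rows, n_cols = surveys.shape     # .shape is an attribute: no parentheses
column_types = surveys.dtypes      # .dtypes is an attribute as well
first_rows = surveys.head(2)       # .head() and .tail() are methods

weights = surveys["weight"]        # choose a single column by its label
mean_weight = weights.mean()
summary = weights.describe()       # count, mean, std, min, quartiles, max

# Mean weight within each species.
mean_by_species = surveys.groupby("species")["weight"].mean()

# Turn one date string into a datetime object.
first_date = datetime.strptime(surveys["date"].iloc[0], "%Y-%m-%d")
```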

Indexing, Slicing and Subsetting DataFrames in Python
  • use column labels in [] to access individual columns

  • indexing starts at 0; when choosing a range, the stop bound is one step BEYOND the last row you want to select

  • using the = operator, as in y = x, does not create a copy of x; instead, y refers to the same object as x. The .copy() method creates a true copy

  • use label-based loc and position-based iloc for subsetting rows and columns in dataframes

  • we can also use criteria with ==, >, <, != etc. in subsetting

  • missing values in form of NaN can be dropped with .dropna()

  • .to_csv() saves a dataframe as a csv file
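
A short sketch of these points, using a hypothetical dataframe whose row labels are deliberately different from the row positions. Note that a loc slice includes both bounds, whereas an iloc slice excludes the stop bound.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the row labels (10..13) differ from the positions (0..3).
df = pd.DataFrame(
    {"species": ["DM", "PF", "DM", "NL"],
     "weight": [40.0, np.nan, 42.0, 180.0]},
    index=[10, 11, 12, 13],
)

alias = df                # no copy: both names refer to the same object
true_copy = df.copy()     # an independent copy
df.loc[10, "weight"] = 41.0    # visible through alias, not through true_copy

by_label = df.loc[10:12, "weight"]    # label based: both bounds are included
by_position = df.iloc[0:2, 1]         # position based: the stop bound is excluded

heavy = df[df["weight"] > 41.5]       # keep rows where the criterion is True
complete = df.dropna()                # drop rows that contain NaN

# Without a path, .to_csv() returns the text; pass a file name to write to disk.
csv_text = complete.to_csv(index=False)
```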

Manipulating DataFrames with pandas
  • the method .pivot_table() creates pivot tables or can just be used to reshape data from long to wide format

  • the method .melt() brings data back to long format

  • the function pd.concat() can be used to concatenate/stack two DataFrames

  • axis = 0 will stack vertically and axis = 1 horizontally

  • the function pd.merge() can be used to join two DataFrames and requires joining keys

  • pandas can perform inner joins (the default in merge()), left joins, right joins and full (outer) joins
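
The reshaping and joining functions listed above can be sketched as follows. The two small tables and their column names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (plot, species) observation.
long_df = pd.DataFrame({
    "plot": [1, 1, 2, 2],
    "species": ["DM", "PF", "DM", "PF"],
    "weight": [40.0, 8.0, 44.0, 7.0],
})

# Long to wide: one row per plot, one column per species.
wide_df = long_df.pivot_table(index="plot", columns="species", values="weight")

# Wide back to long: reset_index() first turns the plot index into a column.
back_to_long = wide_df.reset_index().melt(
    id_vars="plot", var_name="species", value_name="weight"
)

# Stack vertically (axis=0) or side by side (axis=1).
stacked = pd.concat([long_df, long_df], axis=0, ignore_index=True)
side_by_side = pd.concat([long_df, long_df], axis=1)

# Join on a key column; how="inner" is the default.
species_names = pd.DataFrame({
    "species": ["DM", "NL"],
    "genus": ["Dipodomys", "Neotoma"],
})
inner = pd.merge(long_df, species_names, on="species")
left = pd.merge(long_df, species_names, on="species", how="left")
outer = pd.merge(long_df, species_names, on="species", how="outer")
```

In this sketch the inner join keeps only the rows whose species appears in both tables, the left join keeps every row of long_df, and the outer join additionally keeps the species that appears only in species_names.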

Data workflows and automation
Making Plots With Matplotlib
Accessing SQLite Databases Using Python & Pandas

Glossary

0-based indexing
is a way of assigning indices to elements in a sequential, ordered data structure starting from 0, i.e. where the first element of the sequence has index 0.
CSV (file)
is an acronym which stands for Comma-Separated Values file. CSV files store tabular data, either numbers, strings, or a combination of the two, in plain text with columns separated by commas and rows separated by line breaks.
database
is an organized collection of data.
dataframe
is a two-dimensional labeled data structure with columns of (potentially) different type.
data structure
is a particular way of organizing data in memory.
data type
is a particular kind of item that can be assigned to a variable, defined by the values it can take, the programming language in use and the operations that can be performed on it.
dictionary
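is a Python data structure designed to contain key-value pairs. Keys must be hashable objects such as integers, floats, strings or tuples; values can be of any type. Since Python 3.7 a dictionary keeps its pairs in insertion order, but elements are accessed by their key rather than by position. Elements of a dictionary can be modified.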
docstring
is an optional documentation string to describe what a Python function does.
faceting
is the act of plotting relationships between a set of variables in multiple subsets of the data, with the results appearing as different panels in the same figure.
float
is a Python data type designed to store positive and negative decimal numbers by means of a floating point representation.
function
is a group of related statements that perform a specific task.
integer
is a Python data type designed to store positive and negative integer numbers.
interactive mode
is an online mode of operation in which the user writes commands directly on the command line one by one and executes each of them immediately by pressing a key on the keyboard, usually Return.
join key
is a variable or an array representing the column names over which pandas.DataFrame.join() merges together columns of different data sets.
library
is a set of functions and methods grouped together to perform some specific sort of tasks.
list
is a Python data structure designed to contain sequences of integers, floats, strings and any combination of the previous. The sequence is ordered and indexed by integers, starting from 0. Elements of a list can be accessed by their index and can be modified.
loop
is a sequence of instructions that is continually repeated until a condition is satisfied.
NaN
is an acronym for Not-a-Number and represents that either a value is missing or the calculation cannot output any meaningful result.
None
is an object that represents no value.
scripting mode
is an offline mode of operation in which the user writes the commands to be executed in a text file (with .py extension for Python) which is then compiled or interpreted to run the program. Note that Python compiles a script to bytecode at run time and then interprets that bytecode; for imported modules it caches the compiled bytecode to speed up later runs.
sequential (data structure)
is an ordered group of objects stored in memory which can be accessed specifying their index, i.e. their position, in the structure.
SQL
or Structured Query Language, is a domain-specific language for managing data stored in a relational database management system (RDBMS).
SQLite
is a self-contained, public domain SQL database engine.
string
is a Python data type designed to store sequences of characters.
tuple
is a Python data structure designed to contain sequences of integers, floats, strings and any combination of the previous. The sequence is ordered and indexed by integers, starting from 0. Elements of a tuple can be accessed by their index but cannot be modified.
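
Several of the terms defined in this glossary (docstring, function, loop, list, tuple, dictionary, NaN, None) can be seen together in a short sketch. The function name and all values below are hypothetical and chosen only for illustration.

```python
import math

def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    total = 0.0
    for value in values:       # loop: the body runs once per element
        total += value
    return total / len(values)

weights = [40.0, 42.0, 8.0]    # list: ordered and modifiable
weights[0] = 41.0

point = (3, 4)                 # tuple: ordered but not modifiable
try:
    point[0] = 5
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

counts = {"DM": 2, "PF": 1}    # dictionary: values are looked up by key
counts["DM"] += 1

missing = float("nan")         # NaN is a float and is not equal to itself
is_missing = math.isnan(missing)
nothing = None                 # None is an object, tested with `is`
```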