SQL for Data Analysis: Instructor Notes

Learning objectives

Understand concept of relational database
Be able to perform simple queries in SQL from a single table
Understand how to filter and sort results from a query
Use aggregate functions to combine data
Perform queries across tables

Data

As for all of the Data Carpentry ecology lessons, this lesson uses the Portal Project Teaching Database. The data is available at http://dx.doi.org/10.6084/m9.figshare.1314459 and the download includes a SQLite database file (portal_mammals.sqlite) as well as three .csv files (species.csv, plots.csv, surveys.csv) that can be imported into SQLite.

Note that the figshare download is an archive (.zip) file that rudely explodes all of the files into your current directory.

Instructor’s setup notes

By default SQLite Manager opens in a separate window and it is not possible to zoom in to enlarge the font so that it is more readable, especially for students in the back rows.

The way to fix this is to:

Open the SQLite Manager
Click on the Options button .
Chose Start SQLite Manager: in a new tab.

You can then use Ctrl - + to zoom just like any other web page.

Motivation and Framing

See this slide deck as a sample intro for the lesson: SQL Intro Deck

Key points:

Want to query data, instead of editing directly
Need a solution that is scalable and reproducible
Introduce the typical ways of dealing with rectangular data (subsetting, split-apply-combine)

If you’ve written up a diagram of the data analysis pipeline (raw data -> clean data -> import and analyze -> results -> visualization), it can be helpful to identify that you’re now somewhere between clean data and analysis.

Lesson outline

00-sql-introduction

Introduce relational databases, database management systems, the SQLite Firefox plugin, and the Portal dataset
Import .csv files into sqlite
Structuring data for database import
Discuss the different SQL data types

Tips

Changing data type: There is a step on import where you have to select the data type for the field. Some people select the wrong type or nothing at all and wonder what they can do to change it. If you right click or Control click on the column name in the Structure tab, then there’s an Edit Column option and you can type in a new data type. When you go to save it, it does warn you that this is dangerous behavior, so people are being warned, but it might be nice to show this option and also to say why it’s not generally a good idea to change data types.
Importing data: Note how cleanly the csv files import into SQL. If you have also taught the spreadsheet lesson, it would be a good idea to compare the format of the csv files with the messy spreadsheet and ask “Remember that messy spreadsheet? What would have happened if we tried to load that in to SQL?”

00-supplement-database-design.md

(optional) The first lesson includes a brief introduction to data design and choosing database systems. This material expands on the database design in the first section.

01-sql-basic-queries

Write basic queries using SELECT and FROM
Filter results using DISTINCT, WHERE
Change how results are displayed using ORDER BY and by doing calculations on results

02-sql-aggregation

Combine results using COUNT and GROUP BY
Filtering based on aggregation with HAVING
Saving queries using VIEW

03-sql-joins-aliases

combining data from two tables using JOIN, ON, USING
depending on time, could introduce different types of JOINs
using aliases with AS to simplify queries

Alternative activities

Queries on the board

As you teach the lesson, it can be helpful to pause and write up the query keywords on the board. This could look like this:

After 01-sql-basic queries ~~~ SELECT column FUNCTION(column)

FROM table

WHERE (conditional statement, applies to row values) (AND/OR)

ORDER BY column/FUNCTION(column) (ASC/DESC)


* After 02-sql-aggregation

SELECT column FUNCTION(column) AGGREGATE_FUNCTION(column) FROM table

WHERE (conditional statement, applies to row values) (AND/OR) (IS (NOT) NULL) GROUP BY column HAVING (conditional statement, applies to group) ORDER BY column/FUNCTION(column) (ASC/DESC)


* After 03-sql-joins-aliases

SELECT column FUNCTION(column) AGGREGATE_FUNCTION(column) FROM table JOIN table ON table.col = table.col WHERE (conditional statement, applies to row values) (AND/OR) (IS (NOT) NULL) GROUP BY column HAVING (conditional statement, applies to group) ORDER BY column/FUNCTION(column) (ASC/DESC)

As a bonus, if you can leave this on the board, it translates nicely into 
the `dplyr` portion of the `R` lesson, i.e.: 

SQL: dplyr:

SELECT column select(col) FUNCTION(column) mutate(col = fcn(col)) AGGREGATE_FUNCTION(column) summarize(col = fcn(col)) FROM table JOIN table ON table.col = table.col WHERE (conditional statement, applies to row values) filter(condition) (AND/OR) (IS (NOT) NULL) is.na() GROUP BY column group_by(col) HAVING (conditional statement, applies to group) ORDER BY column/FUNCTION(column) (ASC/DESC) arrange() ~~~

“Interactive” database

If you want to try something more active (esp. if you’re teaching SQL in the afternoon!), this is a an interactive activity to try.

Give each student six cards, with the following labels:
- name
- name
- height
- DoC
- height*2.54
- dept
Have students fill out their cards:
- Name: first name
- Height: height in INCHES
- Dept (department): close enough (just pick one if you don’t have a home department)
- DoC (Dog or Cat): Dog, Cat, Both, Neither, or leave blank
Each student is now a record in an interactive “students” database, where each of the cards they hold is a field in that database.
At various points in the lesson, stop and “query” the student database. To do this:
- On a slide (or in a text editor), show or type in a sample query. Something like:
```
  SELECT name, name FROM students WHERE height > 66
```
- If the query applies to a record (student), that student should stand, and display (hold up) the appropriate field (card)
- See the following slide deck for a list of sample queries. Sample queries