Accessing SQLite Databases Using Python & Pandas

Overview

Teaching: 20 min
Exercises: 25 min
Questions
  • How can we query databases from within Python?

Objectives
  • Use the sqlite3 module to interact with a SQL database.

  • Access data stored in SQLite using Python.

  • Describe the difference in interacting with data stored as a CSV file versus in SQLite.

  • Describe the benefits of accessing data using a database compared to a CSV file.

Python and SQL

When you open a CSV in python, and assign it to a variable name, you are using your computers memory to save that variable. Accessing data from a database like SQL is not only more efficient, but also it allows you to subset and import only the parts of the data that you need.

In the following lesson, we’ll see some approaches that can be taken to do so.

The sqlite3 module

The sqlite3 module provides a straightforward interface for interacting with SQLite databases. A connection object is created using sqlite3.connect(); the connection must be closed at the end of the session with the .close() command. While the connection is open, any interactions with the database require you to make a cursor object with the .cursor() command. The cursor is then ready to perform all kinds of operations with .execute().

import sqlite3

# Create a SQL connection to our SQLite database
con = sqlite3.connect("data/rainfall_durban.sqlite")

cur = con.cursor()

# the result of a "cursor.execute" can be iterated over by row
for row in cur.execute('SELECT * FROM raingauge_data;'):
    print(row)

#Be sure to close the connection.
con.close()

Queries

One of the most common ways to interact with a database is by querying: retrieving data based on some search parameters. Use a SELECT statement string. The query is returned as a single tuple or a tuple of tuples. Add a WHERE statement to filter your results based on some parameter.

import sqlite3

# Create a SQL connection to our SQLite database
con = sqlite3.connect("data/rainfall_durban.sqlite")

cur = con.cursor()

# Return all results of query
cur.execute('SELECT id FROM raingauges WHERE region_id=2')
cur.fetchall()

# Return first result of query
cur.execute('SELECT data FROM raingauge_data WHERE raingauges_id=5')
cur.fetchone()

#Be sure to close the connection.
con.close()

Accessing data stored in SQLite using Python and Pandas

Using pandas, we can import results of a SQLite query into a dataframe. Note that you can use the same SQL commands / syntax that we used in the SQLite lesson. An example of using pandas together with sqlite is below:

import pandas as pd
import sqlite3

# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("data/rainfall_durban.sqlite")
df = pd.read_sql_query("SELECT * from raingauge_data", con)

# verify that result of SQL query is stored in the dataframe
print(df.head())

con.close()

Storing data: CSV vs SQLite

Storing your data in an SQLite database can provide substantial performance improvements when reading/writing compared to CSV. The difference in performance becomes more noticable as the size of the dataset grows (see for example these benchmarks).

Challenge - SQL

  1. Create a query that contains rainfall data collected on 5th December for observations from regions “Northern” and “Southern” that includes observation’s raingauges_name and ward_name for the sample. How many records are returned?

  2. Create a dataframe that contains the total number of observations (count) made, and sum of rainfall for each day, ordered by ward_id.

Storing data: Create new tables using Pandas

We can also us pandas to create new tables within an SQLite database. Here, we run an excercise similar to one we did before with CSV files using our SQLite database. We first read in our rainfall data, then select only rainfall results for raingauge 1, and then save it out to its own table so we can work with it on its own later.

import pandas as pd
import sqlite3

con = sqlite3.connect("data/rainfall_durban.sqlite")

# Load the data into a DataFrame
rainfall_df = pd.read_sql_query("SELECT * from raingauge_data", con)

# Select only data for 2002
rainfall_g1 = rainfall_df[rainfall_df.raingauges_id == 1]

# Write the new DataFrame to a new SQLite table
rainfall_g1.to_sql("rainfall_g1", con, if_exists="replace")

con.close()

Challenge - Saving your work

  1. For each of the challenges in the previous challenge block, modify your code to save the results to their own tables in the rainfall database.

  2. What are some of the reasons you might want to save the results of your queries back into the database? What are some of the reasons you might avoid doing this.

Key Points