An Example of Using SQLite From R
Note: These are lecture notes from my undergraduate economic forecasting class, so this might not be a self-contained post.
We had an introduction to SQL in a previous lecture:
- We talked about what a database is, how it’s used, and why databases are so popular.
- SQLite is an SQL database engine that runs embedded inside R. Once the supporting packages are installed, R has a full SQL database inside it, with no separate server to set up.
- We saw how to do the basic operations to create and query data. Queries are typically done with calls to the `SELECT` statement.
In this lecture, we’re going to go beyond the basics a little bit, and we’ll use something closer to economic data.
Create a Database
Most of the SQL operations we’ll discuss in this lecture are not specific in any way to SQLite. The information should apply to other SQL databases such as Microsoft’s SQL Server, MariaDB, or PostgreSQL. The key difference from SQLite is that they’re not built into R. Since you’d have to install and set up a database server to work with the other databases, we’ll stick to SQLite. The data files can be found here. Save both of them in the directory you’re using to run this code.
library(sqldf)
This loads the sqldf package, which also loads the RSQLite and DBI packages that provide the database functions used below. You’ll need to install it if it’s not already installed.
sales1 <- read.csv("sales1.csv")
sales2 <- read.csv("sales2.csv")
You now have two data frames loaded into R. A point of confusion for some students is that they think the above code has somehow created tables in an SQLite database. This is incorrect. They are regular `read.csv` calls that create data frames in R.
db.sales <- dbConnect(SQLite(), dbname="sales.sqlite")
`dbConnect` will create a new database connection. The database connection tells R how to connect to the database to execute queries. The first argument, `SQLite()`, identifies the connection as an SQLite database rather than, say, a MariaDB database. The second argument gives the name of the database file that is stored on your disk. If no database by that name exists, it will be created as needed. Once that information is processed, the variable `db.sales` holds the information necessary to query `sales.sqlite` as an SQLite database.
Next, create the tables `salesdb1` and `salesdb2` inside the database, holding the data in `sales1` and `sales2`, respectively.
dbWriteTable(conn=db.sales, name="salesdb1", sales1, overwrite=TRUE,
row.names=FALSE)
dbWriteTable(conn=db.sales, name="salesdb2", sales2, overwrite=TRUE,
row.names=FALSE)
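If you want to confirm that the tables were actually created, DBI provides `dbListTables` to list the tables in a database. Here’s a minimal, self-contained sketch using an in-memory database (the toy data here is made up for illustration):

```r
library(DBI)
library(RSQLite)

# Use an in-memory database so the sketch doesn't touch any files
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, name = "salesdb1",
             data.frame(month = c(202006, 202007), amount = c(100, 200)),
             row.names = FALSE)
tabs <- dbListTables(con)  # character vector of table names
tabs
dbDisconnect(con)
```

`dbListTables(db.sales)` works the same way on the `sales.sqlite` database created above.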
The Simplest Query
The simplest possible query is to pull all of the data from a particular table. Queries from R return a data frame, which is the standard data structure in R, allowing you to run regressions or plot the data directly.
Query everything in `salesdb1` and `salesdb2`:
dbGetQuery(db.sales, "select * from salesdb1")
dbGetQuery(db.sales, "select * from salesdb2")
This prints all the data from both tables to the screen. Most of the time, you’ll call the function `dbGetQuery` to do your query. The first argument, `db.sales`, is the database connection. The second argument is the SQL query to be sent to SQLite. `select * from salesdb1` returns all data from table `salesdb1`. This is the same pattern you see over and over again with SQL queries: `select [WHAT] from [WHERE]`. In this example, the `[WHAT]` you want to retrieve is `*`, which means everything. The `[WHERE]` says where to get the data. It will typically be a table.
Queries return a data frame
When calling SQLite from R, the output will be converted to an R data frame. This is another point on which some students have been confused. It is true that SQLite itself does not create an R data frame. However, when you call `dbGetQuery`, R takes the result set that SQLite returns and converts it into a data frame. That means you can store the output and work with it like any other variable in R. Let’s save the output of the second query in a variable called `s2`:
s2 <- dbGetQuery(db.sales, "select * from salesdb2")
You can then do standard R operations on `s2`:
class(s2)
s2
plot(s2$price, type="l")
Restricted Queries
You wouldn’t typically query everything in a table. It’s a pointless operation unless you want to inspect the entire dataset, which only makes sense if you have a small amount of data. You’ll almost always want to restrict your query in some way. You do that by adding a `WHERE` clause to your query.
Suppose you want to view only the sales that were made in July 2020:
july.sales <- dbGetQuery(db.sales, "select * from salesdb1 where
month = '202007'")
july.sales
The query is `select * from salesdb1 where month = '202007'`. As similar as this looks to the previous query, it is in practice rather different. Think of it as taking the form

select * from [TABLE]

Previously `TABLE` was `salesdb1`. Now `TABLE` is `salesdb1 where month = '202007'`. You’ve reduced that table to only observations from July.
Important Operators For the WHERE Clause
If you want `TABLE` to include all observations from June and July, all you have to do is add an `OR` to your `WHERE` clause.
jun_jul <- dbGetQuery(db.sales, "select * from salesdb1 where
month = '202006' or month = '202007'")
jun_jul
Alternatively, since the month is a number, you can specify the equivalent of an `OR` with `<=`:
jun_jul <- dbGetQuery(db.sales, "select * from salesdb1 where
month <= 202007")
jun_jul
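A natural question is why both versions work: the earlier `WHERE` clause compared `month` to the quoted string `'202007'`, while this one compares it to the bare number `202007`. SQLite’s type affinity rules convert the text literal to a number when the column is numeric, so both comparisons match. A small in-memory sketch (the toy table and data are made up for illustration):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "t", data.frame(month = c(202006, 202007, 202008)))

# The column is numeric; SQLite converts the quoted literal before comparing
m1 <- dbGetQuery(con, "select * from t where month = '202007'")
# A plain numeric comparison
m2 <- dbGetQuery(con, "select * from t where month <= 202007")
dbDisconnect(con)

m1  # one row: 202007
m2  # two rows: 202006 and 202007
```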
Selecting Variables
It is common for database tables to contain many variables that are irrelevant to your query. You may want only the customer name, sales amount, and date of purchase. The sales database may have 850 different variables related to the location of the sale, time of day, weather, and so on. Rather than a `select *` query, to limit yourself to three variables, you can do this:
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1")
subset
Sorting
What if you’re doing this query so you can identify the largest sales you made? You could
- Manually identify the largest sale. You’d get fired for lack of productivity if you do that too often.
- Have R identify the largest sale.
- Add an `ORDER BY` clause to your query to tell SQLite to do the work for you. This is the simplest way to handle it.
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1 order by amount")
subset
If you ran that code, you’ll notice that the purchase amount is sorted smallest to largest (ascending). The default is to do an ascending sort. If you want to identify the largest sales, you want the largest first (a descending sort). Add `DESC` after the `ORDER BY` clause:
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1 order by amount desc")
subset
Limits
Upon further thought, if all you want is the largest few sales, you don’t need the clutter of reporting all the other sales. You can add a `LIMIT` clause to the end of your query to report only the first four results (for example):
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1 order by amount desc limit 4")
subset
Note: SQLite applies the `limit` clause after everything else. It evaluates the query like this:
select month, customer, amount from salesdb1
order by amount desc
limit 4
Group by
Upon even further thought, this still isn’t what we want. We don’t want the largest individual sales. We want the largest customer over this time period. We need to aggregate the data in some way. Let’s see how much each customer has purchased in total this year, sorted from biggest to smallest customer:
totals <- dbGetQuery(db.sales, "select customer, sum(amount)
from salesdb1 group by customer")
totals
This query adds `group by customer` to do the aggregation. How does it know what to aggregate? Note that the `select` includes `sum(amount)`. It will apply the `sum` function to each group, where the group is defined by the `customer` variable.
That raises another question. What if we do this without a `group by` clause?
totals <- dbGetQuery(db.sales, "select customer, sum(amount)
from salesdb1")
totals
If you don’t have a `group by` clause, you get a single sum over all records, paired with just one customer name. This isn’t a sensible query because you have only one sum and several customers.
Finally, we might want to clean it up a little bit, by sorting the data so the largest overall customer is first:
totals <- dbGetQuery(db.sales, "select customer, sum(amount)
from salesdb1 group by customer order by sum(amount) desc")
totals
Joins
This is where SQL gets interesting. Advanced joins are beyond the scope of this class, so I’m going to give you one simple (but relevant!) example to help you understand what’s going on. You can go a little deeper with this tutorial.
The problem: Each salesperson gets a flat 6% commission on all sales. It’s your job to calculate the total commission due to each salesperson for the three months covered in the tables. You need to compute the sum of sales grouped by salesperson. In principle, this is trivial. Sum up the dollar value of sales by employee, and multiply by 6%.
To see why this isn’t trivial, let’s look closer at the data. Here’s `sales1`:
month, employee, customer, amount
202006, Mary, Acme, 2400
202006, Mary, Courier, 1756
202006, Phil, Courier, 243
202007, Mary, Felding, 4500
202007, Ted, Courier, 1200
202007, Ted, Courier, 2800
202007, Ted, Acme, 765
202008, Mary, Felding, 800
202008, Mary, Felding, 1750
202008, Ted, Acme, 4150
The `amount` variable is the quantity sold. The price data is in `sales2`:
month, price, salestax
202006, 45, 0.072
202007, 47, 0.072
202008, 47, 0.080
As you can see, the price changed in July. Since the dataset is so small, you might be tempted to open the original text file by hand and manually add the price to `sales1`. In real life, where there can be millions of transactions, you need an automated approach. You need the following operation:
For each row in table1, find the corresponding price in table2.
That’s not sufficiently precise. What do we mean by “find the corresponding price”? It means
- Create a new table, call it `table3`, with no data.

Then for each row in `table1`:

- Find the row with the same month in `table2`. This works because the month in `table2` is unique. No month appears in more than one row of `table2`. That means every row in `table1` has a unique match to a row in `table2`.
- Combine the information from that row of `table1` with the information from that row of `table2`.
- Create a new row of `table3` holding the combined information from the previous step.
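The matching steps above are the same operation that base R’s `merge` function performs on data frames. Here’s a sketch with small stand-in data frames (the values are a subset of the data shown earlier):

```r
# Stand-ins for table1 and table2
table1 <- data.frame(month    = c(202006, 202006, 202007),
                     employee = c("Mary", "Phil", "Ted"),
                     amount   = c(2400, 243, 1200))
table2 <- data.frame(month = c(202006, 202007),
                     price = c(45, 47))

# merge() matches each row of table1 to the row of table2 with the same month
table3 <- merge(table1, table2, by = "month")
table3
```

Each row of `table1` picks up the `price` from the matching row of `table2`, exactly as described in the steps above.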
Here’s the code to create the merger of `sales1` and `sales2`:
sales3 <- dbGetQuery(db.sales, "select * from salesdb1
join salesdb2 on salesdb1.month = salesdb2.month")
sales3
It’s important to understand that `sales3` is a data frame in R, not a table in your SQLite database. SQLite performs the `join`, `dbGetQuery` captures the result, and the assignment `sales3 <-` stores it in an R data frame.
If you want to create a new table inside your database holding the output of your query, you can use the `CREATE TABLE AS` statement, like this:
dbExecute(db.sales, "create table salesdb3 as select * from salesdb1
join salesdb2 on salesdb1.month = salesdb2.month")
We had to switch from `dbGetQuery` to `dbExecute`. The reason is that `dbGetQuery` is for capturing the output of a query as an R variable, while `dbExecute` is for executing an SQL statement and ignoring the output. You can read more about these functions, and many more, in the documentation for package DBI.
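One difference worth knowing: `dbExecute` returns the number of rows affected by the statement, while `dbGetQuery` returns a data frame. A minimal in-memory sketch (the toy table is made up for illustration):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = ":memory:")
n0 <- dbExecute(con, "create table t (x integer)")     # statements return a row count
n1 <- dbExecute(con, "insert into t values (1), (2)")  # 2 rows inserted
df <- dbGetQuery(con, "select * from t")               # queries return a data frame
dbDisconnect(con)

n1  # 2
df  # a data frame with two rows
```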
Let’s confirm that we have a new table in our database:
dbGetQuery(db.sales, "select * from salesdb3")
Why would you want to create a new table in the database rather than returning the output to R? In this example, if you’re going to do further SQL queries on the merged data, the merged data has to be a table inside your database, not a data frame in R.
Transforming and Aggregating the Merged Data
The merged table `salesdb3` provides us with the information needed to calculate the commission for each salesperson:
comm <- dbGetQuery(db.sales, "select employee, sum(0.06*amount*price)
as commission from salesdb3 group by employee")
comm
Note the order of operations: the expression inside the parentheses is evaluated first, so the multiplication `0.06*amount*price` is done row by row. Then, because there is a `group by` clause, the `sum` is taken within each employee’s group.
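You can sanity-check one employee by hand. Phil appears only once in `sales1`: 243 units in 202006, when the price was 45. His commission should be:

```r
# 6% commission on 243 units at a price of 45
phil <- 0.06 * 243 * 45
phil  # 656.1
```

Comparing this number against the `comm` output for Phil is a quick check that the join and the aggregation both did what you expected.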
The last task is to calculate our sales tax obligation for those months. The sales tax rate also changed from 7.2% to 8.0% in the middle of our sample. As with the change in the price of the product, that doesn’t matter if we’re working with the merged data.
salestax <- dbGetQuery(db.sales, "select sum(amount*price*salestax)
from salesdb3")
salestax
If you don’t specify a group, you get the sum over the whole table.
The power of a good join
`join` queries are a powerful feature of an SQL database. Merging data from multiple tables is a very important skill. Most importantly, the database will do the merge efficiently. This is one area where you don’t want to mess around writing your own code. Merge operations that take hours or days with code you write yourself will complete in a few seconds with a properly written SQL query. Especially frustrating is when your self-written merge operation runs for 90 minutes and then you find out you did it wrong. (Ask me how I know.)
Learning More
There are many references out there for learning SQL. I taught myself SQL as I needed it, and that was many years back, so I’m not in any position to recommend resources. The library has a number of online books that you can read for free. Check out a few to see which you like.
I have never taken the course, but you can search for “Stanford database course” and there are materials for a free online course that goes pretty deep into the topic. It’s popular, and I’ve read good things about it, but I have no idea if it’s the right level for you.
If you just want to practice, you can use LibreOffice Base (the database in the LibreOffice suite). You can create simple tables and run SQL queries on them. I’ve even heard of some people using it for personal project management.