An Example of Using SQLite From R
Note: These are lecture notes from my undergraduate economic forecasting class, so this might not be a self-contained post.
We had an introduction to SQL in a previous lecture:
- We talked about what a database is, how it’s used, and why databases are so popular.
- SQLite is an SQL database engine that runs embedded inside R. Once the supporting packages are installed, R has a full SQL database inside it, with no separate server to set up.
- We saw how to do the basic operations to create and query data. Queries are typically done with calls to the `SELECT` statement.
In this lecture, we’re going to go beyond the basics a little bit, and we’ll use something closer to economic data.
Create a Database
Most of the SQL operations we’ll discuss in this lecture are not specific in any way to SQLite. The information should apply to other SQL databases such as Microsoft’s SQL Server, MariaDB, or PostgreSQL. The key difference from SQLite is that they’re not built into R. Since you’d have to install and set up a database server to work with the other databases, we’ll stick to SQLite. The data files can be found here. Save both of them in the directory you’re using to run this code.
library(sqldf)
This loads the sqldf package, which also loads the RSQLite and DBI packages that provide the database functions used below. You’ll need to install it if it’s not already installed.
sales1 <- read.csv("sales1.csv")
sales2 <- read.csv("sales2.csv")
You now have two data frames loaded into R. A point of confusion for some students is that they think the above code has somehow created tables in an SQLite database. This is incorrect. They are regular `read.csv` calls that create data frames in R.
db.sales <- dbConnect(SQLite(), dbname="sales.sqlite")
`dbConnect` will create a new database connection. The database connection tells R how to connect to the database to execute queries. The first argument, `SQLite()`, identifies the connection as an SQLite database rather than, say, a MariaDB database. The second argument gives the name of the database file that is stored on your disk. If no database by that name exists, it will be created as needed. Once that information is processed, the variable `db.sales` holds the information necessary to query `sales.sqlite` as an SQLite database.
Next, create the tables `salesdb1` and `salesdb2` inside the database, holding the data in `sales1` and `sales2`, respectively.
dbWriteTable(conn=db.sales, name="salesdb1", sales1, overwrite=TRUE,
row.names=FALSE)
dbWriteTable(conn=db.sales, name="salesdb2", sales2, overwrite=TRUE,
row.names=FALSE)
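If you want to confirm that the tables were actually created, DBI provides `dbListTables` to list the tables in a database. Here’s a minimal, self-contained sketch using an in-memory database (the toy data here is made up for illustration):

```r
library(DBI)
library(RSQLite)

# Use an in-memory database so the sketch doesn't touch any files
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, name = "salesdb1",
             data.frame(month = c(202006, 202007), amount = c(100, 200)),
             row.names = FALSE)
tabs <- dbListTables(con)  # character vector of table names
tabs
dbDisconnect(con)
```

`dbListTables(db.sales)` works the same way on the `sales.sqlite` database created above.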
The Simplest Query
The simplest possible query is to pull all of the data from a particular table. Queries from R return a data frame, which is the standard data structure in R, allowing you to run regressions or plot the data directly.
Query everything in `salesdb1` and `salesdb2`:
dbGetQuery(db.sales, "select * from salesdb1")
dbGetQuery(db.sales, "select * from salesdb2")
This prints all the data from both tables to the screen. Most of the time, you’ll call the function `dbGetQuery` to do your query. The first argument, `db.sales`, is the database connection. The second argument is the SQL query to be sent to SQLite. `select * from salesdb1` returns all data from table `salesdb1`. This is the same pattern you see over and over again with SQL queries: `select [WHAT] from [WHERE]`. In this example, the `[WHAT]` you want to retrieve is `*`, which means everything. The `[WHERE]` says where to get the data. It will typically be a table.
Queries return a data frame
When calling SQLite from R, the output will be converted to an R data frame. This is another point on which some students have been confused. It is true that SQLite itself does not create an R data frame. However, when you call `dbGetQuery`, R takes the result set that SQLite returns and converts it into a data frame. That means you can store the output and work with it like any other variable in R. Let’s save the output of the second query in a variable called `s2`:
s2 <- dbGetQuery(db.sales, "select * from salesdb2")
You can then do standard R operations on `s2`:
class(s2)
s2
plot(s2$price, type="l")
Restricted Queries
You wouldn’t typically query everything in a table. It’s a pointless operation unless you want to inspect the entire dataset, which only makes sense if you have a small amount of data. You’ll almost always want to restrict your query in some way. You do that by adding a `WHERE` clause to your query.
Suppose you want to view only the sales that were made in July 2020:
july.sales <- dbGetQuery(db.sales, "select * from salesdb1 where
month = '202007'")
july.sales
The query is `select * from salesdb1 where month = '202007'`. As similar as this looks to the previous query, it is in practice rather different. Think of it as taking the form

select * from [TABLE]

Previously `TABLE` was `salesdb1`. Now `TABLE` is `salesdb1 where month = '202007'`. You’ve reduced that table to only observations from July.
Important Operators For the WHERE Clause
If you want `TABLE` to include all observations from June and July, all you have to do is add an `OR` to your `WHERE` clause.
jun_jul <- dbGetQuery(db.sales, "select * from salesdb1 where
month = '202006' or month = '202007'")
jun_jul
Alternatively, since the month is a number, you can specify the equivalent of an `OR` with `<=`:
jun_jul <- dbGetQuery(db.sales, "select * from salesdb1 where
month <= 202007")
jun_jul
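A natural question is why both versions work: the earlier `WHERE` clause compared `month` to the quoted string `'202007'`, while this one compares it to the bare number `202007`. SQLite’s type affinity rules convert the text literal to a number when the column is numeric, so both comparisons match. A small in-memory sketch (the toy table and data are made up for illustration):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "t", data.frame(month = c(202006, 202007, 202008)))

# The column is numeric; SQLite converts the quoted literal before comparing
m1 <- dbGetQuery(con, "select * from t where month = '202007'")
# A plain numeric comparison
m2 <- dbGetQuery(con, "select * from t where month <= 202007")
dbDisconnect(con)

m1  # one row: 202007
m2  # two rows: 202006 and 202007
```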
Selecting Variables
It is common for database tables to contain many variables that are irrelevant to your query. You may want only the customer name, sales amount, and date of purchase. The sales database may have 850 different variables related to the location of the sale, time of day, weather, and so on. Rather than a `select *` query, to limit yourself to three variables, you can do this:
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1")
subset
Sorting
What if you’re doing this query so you can identify the largest sales you made? You could
- Manually identify the largest sale. You’d get fired for lack of productivity if you do that too often.
- Have R identify the largest sale.
- Add an `ORDER BY` clause to your query to tell SQLite to do the work for you. This is the simplest way to handle it.
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1 order by amount")
subset
If you ran that code, you’ll notice that the purchase amount is sorted smallest to largest (ascending). The default is to do an ascending sort. If you want to identify the largest sales, you want the largest first (a descending sort). Add `DESC` after the `ORDER BY` clause:
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1 order by amount desc")
subset
Limits
Upon further thought, if all you want is the largest few sales, you don’t need the clutter of reporting all the other sales. You can add a `LIMIT` clause to the end of your query to report only the first four results (for example):
subset <- dbGetQuery(db.sales, "select month, customer, amount
from salesdb1 order by amount desc limit 4")
subset
Note: SQLite applies the `limit` clause after everything else. It evaluates the query like this:
select month, customer, amount from salesdb1
order by amount desc
limit 4
Group by
Upon even further thought, this still isn’t what we want. We don’t want the largest individual sales. We want the largest customer over this time period. We need to aggregate the data in some way. Let’s see how much each customer has purchased in total this year, sorted from biggest to smallest customer:
totals <- dbGetQuery(db.sales, "select customer, sum(amount)
from salesdb1 group by customer")
totals
This query adds `group by customer` to do the aggregation. How does it know what to aggregate? Note that the `select` includes `sum(amount)`. It will apply the `sum` function to each group, where the group is defined by the `customer` variable.
That raises another question. What if we do this without a `group by` clause?
totals <- dbGetQuery(db.sales, "select customer, sum(amount)
from salesdb1")
totals
If you don’t have a `group by` clause, you get a single sum over all records, paired with just one customer name. This isn’t a sensible query because you have only one sum and several customers.
Finally, we might want to clean it up a little bit, by sorting the data so the largest overall customer is first:
totals <- dbGetQuery(db.sales, "select customer, sum(amount)
from salesdb1 group by customer order by sum(amount) desc")
totals
Joins
This is where SQL gets interesting. Advanced joins are beyond the scope of this class, so I’m going to give you one simple (but relevant!) example to help you understand what’s going on. You can go a little deeper with this tutorial.
The problem: Each salesperson gets a flat 6% commission on all sales. It’s your job to calculate the total commission due to each salesperson for the three months covered in the tables. You need to compute the sum of sales grouped by salesperson. In principle, this is trivial. Sum up the dollar value of sales by employee, and multiply by 6%.
To see why this isn’t trivial, let’s look closer at the data. Here’s `sales1`:
month, employee, customer, amount
202006, Mary, Acme, 2400
202006, Mary, Courier, 1756
202006, Phil, Courier, 243
202007, Mary, Felding, 4500
202007, Ted, Courier, 1200
202007, Ted, Courier, 2800
202007, Ted, Acme, 765
202008, Mary, Felding, 800
202008, Mary, Felding, 1750
202008, Ted, Acme, 4150
The `amount` variable is the quantity sold. The price data is in `sales2`:
month, price, salestax
202006, 45, 0.072
202007, 47, 0.072
202008, 47, 0.080
As you can see, the price changed in July. Since the dataset is so small, you might be tempted to open the original text file by hand and manually add the price to `sales1`. In real life, where there can be millions of transactions, you need an automated approach. You need the following operation:
For each row in table1, find the corresponding price in table2.
That’s not sufficiently precise. What do we mean by “find the corresponding price”? It means
- Create a new table, call it `table3`, with no data.

Then for each row in `table1`:

- Find the row with the same month in `table2`. This works because the month in `table2` is unique. No month appears in more than one row of `table2`. That means every row in `table1` has a unique match to a row in `table2`.
- Combine the information from that row of `table1` with the information from that row of `table2`.
- Create a new row of `table3` holding the combined information from the previous step.
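The matching steps above are the same operation that base R’s `merge` function performs on data frames. Here’s a sketch with small stand-in data frames (the values are a subset of the data shown earlier):

```r
# Stand-ins for table1 and table2
table1 <- data.frame(month    = c(202006, 202006, 202007),
                     employee = c("Mary", "Phil", "Ted"),
                     amount   = c(2400, 243, 1200))
table2 <- data.frame(month = c(202006, 202007),
                     price = c(45, 47))

# merge() matches each row of table1 to the row of table2 with the same month
table3 <- merge(table1, table2, by = "month")
table3
```

Each row of `table1` picks up the `price` from the matching row of `table2`, exactly as described in the steps above.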
Here’s the code to create the merger of `sales1` and `sales2`:
sales3 <- dbGetQuery(db.sales, "select * from salesdb1
join salesdb2 on salesdb1.month = salesdb2.month")
sales3
It’s important to understand that `sales3` is a data frame in R, not a table in your SQLite database. SQLite performs the `join`, `dbGetQuery` captures the result, and the assignment `sales3 <-` stores it in an R data frame.
If you want to create a new table inside your database holding the output of your query, you can use the `CREATE TABLE AS` statement, like this:
dbExecute(db.sales, "create table salesdb3 as select * from salesdb1
join salesdb2 on salesdb1.month = salesdb2.month")
We had to switch from `dbGetQuery` to `dbExecute`. The reason is that `dbGetQuery` is for capturing the output of a query as an R variable, while `dbExecute` is for executing an SQL statement and ignoring the output. You can read more about these functions, and many more, in the documentation for package DBI.
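One difference worth knowing: `dbExecute` returns the number of rows affected by the statement, while `dbGetQuery` returns a data frame. A minimal in-memory sketch (the toy table is made up for illustration):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = ":memory:")
n0 <- dbExecute(con, "create table t (x integer)")     # statements return a row count
n1 <- dbExecute(con, "insert into t values (1), (2)")  # 2 rows inserted
df <- dbGetQuery(con, "select * from t")               # queries return a data frame
dbDisconnect(con)

n1  # 2
df  # a data frame with two rows
```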
Let’s confirm that we have a new table in our database:
dbGetQuery(db.sales, "select * from salesdb3")
Why would you want to create a new table in the database rather than returning the output to R? In this example, if you’re going to do further SQL queries on the merged data, the merged data has to be a table inside your database, not a data frame in R.
Transforming and Aggregating the Merged Data
The merged table `salesdb3` provides us with the information needed to calculate the commission for each salesperson:
comm <- dbGetQuery(db.sales, "select employee, sum(0.06*amount*price)
as commission from salesdb3 group by employee")
comm
Note the order of operations: the expression inside the parentheses is evaluated first, so the multiplication `0.06*amount*price` is done row by row. Then, because there is a `group by` clause, the `sum` is taken within each employee’s group.
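You can sanity-check one employee by hand. Phil appears only once in `sales1`: 243 units in 202006, when the price was 45. His commission should be:

```r
# 6% commission on 243 units at a price of 45
phil <- 0.06 * 243 * 45
phil  # 656.1
```

Comparing this number against the `comm` output for Phil is a quick check that the join and the aggregation both did what you expected.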
The last task is to calculate our sales tax obligation for those months. The sales tax rate also changed from 7.2% to 8.0% in the middle of our sample. As with the change in the price of the product, that doesn’t matter if we’re working with the merged data.
salestax <- dbGetQuery(db.sales, "select sum(amount*price*salestax)
from salesdb3")
salestax
If you don’t specify a group, you get the sum over the whole table.
The power of a good join
`join` queries are a powerful feature of an SQL database. Merging data from multiple tables is a very important skill. Most importantly, the database will do the merge efficiently. This is one area where you don’t want to mess around writing your own code. Merge operations that take hours or days with code you write yourself will complete in a few seconds with a properly written SQL query. Especially frustrating is when your self-written merge operation runs for 90 minutes and then you find out you did it wrong. (Ask me how I know.)
Learning More
There are many references out there for learning SQL. I taught myself SQL as I needed it, and that was many years back, so I’m not in any position to recommend resources. The library has a number of online books that you can read for free. Check out a few to see which you like.
I have never taken the course, but you can search for “Stanford database course” and there are materials for a free online course that goes pretty deep into the topic. It’s popular, and I’ve read good things about it, but I have no idea if it’s the right level for you.
If you just want to practice, you can use LibreOffice Base (the database in the LibreOffice suite). You can create simple tables and run SQL queries on them. I’ve even heard of some people using it for personal project management.