Connect to Microsoft Access Database in R using RODBC

10 min read
09 January 2020

When we talk about databases, some of the most popular names include MySQL, PostgreSQL, MongoDB and so on. These databases are excellent for large scale projects involving a high volume of users and processes. In fact, they are heavy duty commercial databases utilized by many large corporations. However, not everyone use them. Many smaller organizations, especially in academia and research, do not use any of the databases mentioned above. Given the smaller scale of the projects and resources being a limiting factor, simpler alternatives could be used instead such as Microsoft Access.

In specific situations, MS Access would indeed be a better suited option. Initial set-up is quite straight-forward and its user-friendly interface makes it accessible to a larger audience. The database itself can be shared as an attachment in an email or can simply be hosted on a server for users to access. If we disregard all the security, scalability and performance issues, MS Access can be the right tool for smaller projects. For now, we will save the in-depth analysis of the pros and cons for another time, and focus on MS Access.

How do we access the data?

Let's suppose we are tasked to do some analysis using R on some data stored in a MS Access database. Our first step would involve finding a way to load the data into R. There are two main options:

  1. Manually export the data from MS Access as a CSV file, save it locally and load it into R.
  2. Connect to MS Access through R, and load the data directly.

The first option is the simplest one. All we need to do is click a few buttons via the MS Access user-interface and we're done! This might be the best approach if and only if we need to load the data a single time. But what if we need to re-run our analysis with updated data a week later? What if we constantly need to update the data on a regular basis? On the long-run, this becomes time consuming and troublesome.

This is why the second option is better knowing that we will have to reload the data in the future. Once we set up the script connecting R to the MS Access database, we can avoid the hassle of manually extracting and loading the data. Even better, we can come up with a generic function that can be used for any future project. Not only is this a huge time saver, but it also minimizes human error that inevitably plague manual processes.

Install MS Access Drivers

Before we get down to the nitty-gritty looking at the code, we must first make sure the proper drivers are installed. Essentially, we need a specific set of components that will allow for the transfer of data between MS Office software and other data sources. Here's a link to the Microsoft Access Database Engine 2010 Redistributable2 (older version). Once downloaded, simply go through the installation process we're all set!

RODBC Library

We will be using RODBC1, a package allowing R to connect to ODBC APIs and access database management systems. Given a specific set of arguments and settings, these R functions are able to directly talk to the ODBC interfaces, such as MS Access.

NOTE: Depending on the configurations of the MS Access drivers, RODBC might only work with the 32-bit version of R.

										
											
## Import RODBC package
install.packages("RODBC")
library(RODBC)

## Set up driver info and database path
DRIVERINFO <- "Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
MDBPATH <- "C:/Users/leo/student-dummy.accdb"
PATH <- paste0(DRIVERINFO, "DBQ=", MDBPATH)

## Establish connection
channel <- odbcDriverConnect(PATH)

## Load data into R dataframe
df <- sqlQuery(channel,
"SELECT [student_id], [first_name], [last_name],
FROM [tbl-students]
ORDER BY [first_name];",
stringsAsFactors = FALSE)

## Close and remove channel
close(channel)
rm(channel)

The first step is to install the library and then load it in R. Then, we need to define some settings. DRIVERINFO contains the driver specifications. If we are working with older MS Access databases, then they usually end with the extension .mdb, whereas the newer versions end with .accdb. The variable MDBPATH specifies the absolute path where the database is found. PATH is essentially the concatenation of both into a single string. We will come back to the reason why we partition these elements with separate variables when we generalize our function. Once the configurations are defined, we use the odbcDriverConnect function to establish a connection between the MS Access database and R. Finally, the sqlQuery function takes as input the channel and SQL code in order to load the data into R. At this point, df should be a dataframe containing the columns student_id, first_name and last_name from the table tbl-students.

Password Protected Database

MS Access databases are equipped with password protection functionality. Basically, the software will prompt the user to enter a passcode before allowing access to the data. While interacting with a password protected database through R, the code shown above would fail. In order to successfully establish a connection with the MS Access database, we need to set the password in the configuration parameters.

										
											
## Specify password
PWD <- "password"
PATH <- paste0(DRIVERINFO, "DBQ=", MDBPATH, ";PWD=", PWD)

Generic Function to Load Data

After implementing the script above, we have successfully automated the process of loading data from a specific table in a given database. However, what if we want to load multiple tables from a database? Or data from different databases? We would have to repeat every step in the script above, which could become cumbersome and error-prone.

One solution to avoid the need to rewrite the code every time we load a different table is to wrap the script inside a function to which we pass a specific set of parameters. Doing so will not only allow us to load data more efficiently, it will also keep the code centralized in such way that we only need to make updates in a single location.

										
											
## Generic function for loading data
fn.importData <- function(MDBPATH, TABLES, DROP_VARS=c(), ORDER_BY=c(), PWD="") {

## Set up driver info and database path
library(RODBC)
DRIVERINFO <- "Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
PATH <- paste0(DRIVERINFO, "DBQ=", MDBPATH, ";PWD=", PWD)

## Establish connection
channel <- odbcDriverConnect(PATH)

## Iterate through list of tables
for (tbl in TABLES) {

## Retrieve all variable names from table tbl
tbl_vars <- sqlColumns(channel, tbl)["COLUMN_NAME"]
## Exclude variables based on input parameters
tbl_vars <- subset(tbl_vars, !(tbl_vars$COLUMN_NAME %in% DROP_VARS))
## Add brackets to each variable (ie. [variable]) to maintain ACCESS syntax
tbl_vars$COLUMN_NAME <- paste0("[", tbl_vars$COLUMN_NAME, "]")
## Transform dataframe column into string separated by comma
cols <- paste0(tbl_vars[1:nrow(tbl_vars), ], collapse=",")
## Create ORDER BY string
if (length(ORDER_BY) > 0) {
order <- paste0("ORDER BY", paste0(paste0("[", ORDER_BY, "]"), collapse=", "))
}
order <- ""
}

## Extract table of interest as dataframe
df <- sqlQuery(channel,
paste0("SELECT ", cols, " FROM [", tbl, "]", order, ";"),
stringsAsFactors=FALSE)

## Replace dash with underscore
new_tbl_name <- gsub("-", "_", tbl)

## Assign dataframe to environment
assign(new_tbl_name, df, envir=.GlobalEnv)
}

## Close and remove channel
close(channel)
rm(channel)
}

The function takes as input 5 arguments:

  • MDBPATH: absolute path specifying location of database.
  • TABLES: vector of table names to be loaded from Access.
  • DROP_VARS: vector of variables to exclude. Empty vector by default.
  • ORDER_BY: vector of variables determining how to sort table. Empty vector by default.
  • PWD: password string required to access protected database. Empty string by default.

Some might be wondering why we need to extract the names of each column instead of simply doing SELECT *. However, from my past experience using RODBC, it can sometimes lead to strange results. Also, some might have noticed that the function does not return anything. Instead, the dataframes are directly assigned to the global environment. In some situations, this might not be practical, since we might want to contain the objects within a certain scope and not let them leak out. The function shown above is just an example of what type of customization can be included (e.g., dropping columns, ordering by multiple variables, etc.) in the function generalization process. Much more can be added or modified, depending on the specific needs.

Here is an example of how to call the function:

										
											
## Function call to load data
fn.importData(MDBPATH="C:/Users/leo/db/student-dummy.accdb",
TABLES=c("tbl-schools", "tbl-students"),
DROP_VARS=c("timestamp"),
ORDER_BY=c("ID"),
PWD="admin")

Conclusion

We end up with a generic function which, given the path and the table name(s) of interest, can extract data seamlessly from any MS Access database. The RODBC library offers an array of functions allowing R to talk with any ODBC APIs. Writing our own customized function can not only help us automate the data loading process, but also reduce human errors. There is so much more than can be done with RODBC, this is only an example of its applications.