Thursday, 10 January 2013

Getting Access data into R


1. Introduction

These notes give the steps to configure a Windows machine so that R can communicate with Microsoft Access databases. It turns out that the same mechanism can be used to connect with Microsoft Excel workbooks, so the notes include R to Excel communication as well. In R, there are two main ways to connect with Access databases: using the ODBC (Open DataBase Connectivity) facility available on many computers; and using the DBI (DataBase Interface) package in R. These notes deal with ODBC only. The notes also include some details on how the set-up is different on a Mac, but none of the steps have been tested on a Mac. (It is worth mentioning a commercial product, Stat/Transfer (www.stattransfer.com), that simplifies the task of transferring data between data formats, including Access to R.)

ODBC allows a connection to a database to be opened, but that is only half the process. The second half of the process requires the use of SQL (Structured Query Language) to import database tables into R. Thus the notes also provide a brief introduction to SQL, and show how to formulate SQL requests within R and then to send the request through the open connection to the database.

For convenience, all of the R code and the data files are available on github.

These notes draw on Spector (2008), Ripley (2012), and Kabacoff (2011). See the accompanying R code for a convenient way to obtain Ripley (2012).

2. What are ODBC and SQL

ODBC (Open DataBase Connectivity) allows access to a variety of database systems via a common interface. ODBC is independent of the database system and the computer's operating system, but it does require a specific driver to be installed. The driver serves as a translation layer between ODBC and the database system. Usually, the driver is installed with the database system, but they are also available as separate installations.

SQL (Structured Query Language) is a language designed for managing data in a database system. Once the required package and driver are installed, and a connection to the database is opened, SQL queries can be sent through the open connection.

ODBC treats the various tables in an Excel workbook as database tables. Thus the ODBC connection and SQL queries can be used to access data from Excel workbooks in the same way as data are accessed from a database system.

3. Installing software and drivers

If R is not already installed, it is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org. The easiest way to install R is to download precompiled binaries for one of the platforms (Windows, Mac, Linux). To install R, double click on the downloaded executable file. During installation, accept all the defaults except one: on Windows, it is better to install R in the C directory (C:\R), not in the default location (C:\Program Files\R).

The RODBC package (Ripley & Lapsley, 2012) is used to access ODBC's capabilities from within R. RODBC is not part of the base installation of R, and therefore it needs to be installed separately. To install RODBC, type the following command at the command prompt in the R console:


 install.packages("RODBC")


The next step is to ensure that the relevant ODBC drivers are installed. ODBC drivers for Access and Excel are usually packaged with Microsoft Office and thus they are installed when Office is installed. Nevertheless, the drivers can be installed separately if Office (or the latest version of Office) has not been installed. This means too that Access itself does not need to be installed. On Windows, the ODBC driver for Excel 2007 and Access 2007 is installed via the installer AccessDatabaseEngine.exe available from: http://www.microsoft.com/en-us/download/details.aspx?id=23734. Similarly, the driver for Office 2010 can be installed via the installer AccessDatabaseEngine.exe available from: http://www.microsoft.com/en-us/download/details.aspx?id=13255. On a Mac, ODBC drivers for Access are available at a cost from http://www.actualtech.com/product_access.php.

There are two ways in which information about the ODBC driver and the location of the database system is provided to ODBC. One way is to provide all the relevant information from within R each time the connection to the database system is established. An alternative is to establish a DSN (Data Source Name) so that the DSN describes the connection, and thus only the DSN needs to be provided each time a connection is made. Both methods are outlined in the next section.

On Windows, the ODBC Source Administrator is used to establish DSNs. The ODBC Source Administrator is accessed though Control Panel -> Administrative Tools -> Data Sources (ODBC). The connectors available on a particular machine and the names used to access them are given under the Drivers tab. A specific DSN is established by clicking on the Add button under the User DSN tab. Choose the appropriate driver from the list in the pop-up window, then click Finish. A dialogue specific to the database system being used will appear, and it is here that the additional information required for the connection is entered. The name selected for the specific DSN is used each time a connection is made.

By way of example, I have an Excel workbook in the XLSX format. On my machine, the path to the workbook is: C:\Users\Sandy\Documents\SST\Database\Test.xlsx. The steps to establish the DSN are:

  1. Click on the Add button under the User DSN tab in the ODBC Source Administrator;
  2. Select Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb) in the list of drivers in Names;
  3. Click Finish;
  4. In the dialogue that opens, enter a name (I used MyTestBook) for the Data Source Name, and a description (I used Test scores);
  5. Click Select Workbook to set the path to the workbook; click OK;
  6. Click OK;
  7.  Click OK.

As a second example, I have an Access database in the ACCDB format. On my machine, the path to the database is: C:\Users\Sandy\Documents\SST\Database\Database for playing.accdb. The steps to establish a DSN are:

  1. Click on the Add button under the User DSN tab in the ODBC Source Administrator;
  2. Select Microsoft Access Driver (*.mdb, *.accdb) in the list of drivers in Names;
  3. Click Finish;
  4. In the dialogue that opens, enter a name (I used ToyDB) for the Data Source Name, and a description (I used Toy Database);
  5. Click Select to set the path to the database; click OK;
  6. Click OK;
  7. Click OK.

On a Mac, the ODBC Administrator, found at Applications/Utilities, performs similar functions.

4. Accessing and importing Excel spreadsheets into R using ODBC

Spreadsheets are a common method for distributing data, and R provides several methods for accessing spreadsheet data. Arguably the simplest is to save a spreadsheet as a comma or tab separated file, then to import the data into R using the read.table() or read.csv() functions. Alternatively, there are at least three packages that allow Excel files to be accessed directly. First, Excel files can be accessed using the loadWorkbook() and readWorksheet() functions available in the XLConnect package. The package allows for the general manipulation of Excel files and the production of Excel reports with graphics without the need for Excel to be installed. Second, Excel files can be accessed using the read.xls() function available in the gdata package. The function depends on functionality from the scripting language perl, and therefore it requires perl (http://perl.org) to be installed. A third alternative allows Excel files in the XLSX format to be imported. It uses the read.xlsx() function available in the xlsx package. In most situations, any one of these methods is simpler and possibly preferable to using ODBC, but for the sake of completeness, I outline the ODBC method here.

When using ODBC, the first step is to establish a connection to the workbook. The two methods mentioned in the previous section (using the DSN established in the ODBC Source Administrator, and using the default name plus giving additional connection information from within R) are demonstrated. In an active R session, the following commands will open a connection using the DSN name MyTestBook (this is the name given to the DSN when it was established - see the previous section). The library() function is required once only during an R session. It loads required packages, in this case, the RODBC package. (Note that "loading" and "installing" are different operations. It is not necessary to load all installed packages, but an installed package needs to be loaded before its functionality becomes available.) The connection is opened using the DSN MyTestBook, and is assigned the name con.


 library(RODBC)
 con <- odbcConnect("MyTestBook")


The second method demonstrates how to specify connection information within the R commands. Note that the slashes in the path name are forward slashes, not the usual backward slashes (this is because the backward slash has a special function in R). In this example, the path, surrounded by quotes, is assigned the name wb. The name is used in the specification of the connection, which in turn is assigned the name con2.


 wb <- "C:/Users/Sandy/Documents/SST/Database/Test.xlsx"
 con2 <- odbcConnectExcel2007(wb)


Once a connection has been established, data can be transferred. Often a first step is to ask for the names of available sheets, then to import one or more of the sheets. Using the con connection from above, the following commands return the names of the sheets in the Test.xlsx workbook. Note that ODBC adds a $ (for system files) to the table names.


 tbls <- sqlTables(con)
 tbls$TABLE_NAME

 [1] "Class$"   "School$"   "Student$"


The sqlFetch() function is the simplest way to import a sheet into R. It requires two arguments: the connection, and the table name. The table is imported in the form of a data frame, and the data frame is given the name student. The str() function gives the structure of student data frame.


 student <- sqlFetch(con, "Student$")
 str(student)


A more versatile but more complex method is available using the sqlQuery() function. The method is complex because it requires a properly formatted SQL query (more on SQL later). But the method is versatile because queries can, among other things, select part of a table, they can manipulate the data before the data are imported into R, and they can request tables to be merged before the data are imported into R. The sqlQuery() function requires two arguments: the connection, and a SQL query. The query in the example below is formulated outside the sqlQuery() function, and is assigned the name qry. Note that the query is surrounded by quotes. The query is passed to the sqlQuery() function via its name qry. The query requests that the whole Student table be imported and so it does exactly the same as the sqlFetch() function in the previous example (some of the versatility of the sqlQuery() function will be demonstrated later). The table name (Student$) contains a special character ($ - other special characters include spaces and brackets), and therefore within the query, it should be surrounded by back quotes (`). The star (*) in the query means "all columns in the table", and so the query imports the complete Student table. As with the sqlFetch() function, the table is imported in the form of a data frame, and the data frame is given the name student. The str() function gives the structure of the student data frame.


 qry <- "SELECT * FROM `Student$`"
 student <- sqlQuery(con, qry)
 str(student)


At the end of an R session, it is a good idea to close connections.


 odbcCloseAll()


5. Accessing and importing Access tables into R using ODBC

A connection to the Access database, Database for playing.accdb, can be opened using the DSN (ToyDB) that was established earlier, or the connection can be opened with the connection information supplied in the R commands. Both methods are demonstrated here. Using the first method, the connection is opened using the DSN ToyDB, and is assigned the name con.


 library(RODBC)
 con <- odbcConnect("ToyDB")


Using the second method, the connection information is supplied within the R commands. The connection is assigned the name con2.


 db <- "C:/Users/Sandy/Documents/SST/Database/Database for playing.accdb"
 con2 <- odbcConnectAccess2007(db)


The names of the tables in the Access database are obtained using the sqlTables() function. Note that the function contains the connection (con2) and the expression tableType = "TABLE". Without the expression, the function would return not only the table names but also the names of system tables. The names of the system tables are not needed, so that is why they are omitted. (To see the names of the system tables, enter the command at the R prompt but leave out the tableType = "TABLE" expression.)


 sqlTables(con2, tableType = "TABLE")$TABLE_NAME

 [1] "class"   "school"  "student"


A database table can be imported into R using either the sqlFetch() function or the sqlQuery() function, in much the same way as Excel spreadsheets are imported. The "school" table is imported using sqlFetch, and the "class" table is imported using sqlQuery. In both cases, the table is imported in the form of a data frame. The first is given the name school, and the second is given the name class. The str() functions give the structures of the school and class data frames. The last command closes all connections.


 school <- sqlFetch(con2, "school")
 str(school)

 qry <- "SELECT * FROM class"
 class <- sqlQuery(con2, qry)
 str(class)

 odbcCloseAll()


6. Introduction to SQL

This section provides a quick introduction to the basics of SQL using the Database for playing.accdb database as an example. The database contains three tables: student, class, and school. The student table contains students' test scores for two tests, their gender, and an ID for their class, as well as a student ID. The class table contains information on the classes: the Size of the class and a school ID, as well as the class ID. The school table contains information on each of the schools: the Region, Authority, and Size of the school, as well as the school ID.

The workbook has a structure similar to that of a proper database, in that the tables are normalised. The aim of normalisation is to reduce redundancy in the information stored in the tables. For instance, the student table could have contained the student information plus the class information and the school information. But this is an inefficient way to store the information because the rows containing students common to a class all contain the same class information; similarly, the rows containing students common to a school all contain the same school information. That is, many rows would contain identical class and school information. It is more efficient to store the class and school information in separate tables. In a normalised student table, there is an id representing the class. The id is also found in the class table but only once. Similarly, the class table contains an id representing the school, and same id is found exactly once in the school table. Note that the id representing the schools is located in the class table. If the school id had been stored in the student table, each student would have the id. There are 100 student, and therefore 100 pieces of information would need to be stored. Storing the school id in the class table is more efficient: there are five classes, and therefore only five pieces of information need to be stored. At first, it might appear that students cannot be matched to their schools. They can, but the matching takes place via the class table.

The notes that follow assume a new R session. Therefore, the RODBC package needs to be loaded, a connection to the workbook needs to be opened, and, as a reminder, the table names are requested.


 library(RODBC)
 db <- "C:/Users/Sandy/Documents/SST/Database/Database for playing.accdb"
 con <- odbcConnectAccess2007(db)
 sqlTables(con, tableType = "TABLE")$TABLE_NAME

 [1] "class"   "school"  "student"


6.1 Getting variable names

The sqlColumns() function returns column names (i.e., variable names). The function requires two arguments: the connection and the table name. The three commands below return the column names for the student, class, school tables respectively.


 sqlColumns(con, "student")$COLUMN_NAME

 [1] "studID"  "classID" "Gender"  "Test1"   "Test2"

 sqlColumns(con, "class")$COLUMN_NAME

 [1] "classID" "schID"   "CLSize"

 sqlColumns(con, "school")$COLUMN_NAME

 [1] "schID"     "Region"    "Authority" "SchSize"


6.2 Importing part of a table

The SQL SELECT command was used in earlier sections to import the whole table. It can also be used to import part of a table. If selected variables are required, list the variable names separated by commas. The following command returns two columns corresponding to Test1 and Test2 from the student table.


 qry1 <- "SELECT Test1, Test2 FROM student"
 TestScores <- sqlQuery(con, qry1)
 str(TestScores)


To limit the rows that are returned, the WHERE operator is used. The following command imports all columns in the student table but limits the rows to those students who score greater than 50 on Test1.


 qry2 <- "SELECT * FROM student WHERE Test1 > 50"
 TestScoresLimit <- sqlQuery(con, qry2)
 str(TestScoresLimit)


6.3 Manipulating the data

It is possible to manipulate the data outside R, then to import the manipulations. In the example below, the difference in the two test scores is calculated, then the two test scores and the difference are imported. Furthermore, the difference is given the name Diff using the AS operator.


 qry3 <- "SELECT Test1, Test2, Test2-Test1 AS Diff FROM student"
 TestScoreDiff = sqlQuery(con, qry3)
 str(TestScoreDiff)


6.4 Aggregating data

Aggregated data can be imported using a range of SQL aggregation functions:

   AVG() for the mean;
   STDEV() for the standard deviation;
   MIN() for the minimum value;
   MAX() for the maximum value;
   COUNT() for the number of occurrences.

The following command returns a table of means, standard deviations and counts for the two tests. Note the use of the AS operator to re-name the aggregations.


 qry4 <- "SELECT
         AVG(Test1) AS mean1, AVG(Test2) AS mean2,
         STDEV(Test1) AS sd1, STDEV(Test2) AS sd2,
         COUNT(Test1) AS N1, COUNT(Test2) AS N2
         FROM student"

 sqlQuery(con, qry4)


In conjunction with the GROUP BY operator, aggregations can be obtained for groups in the data. The following returns means, standard deviations and counts for the two tests for males and females in each class.


 qry5 <- "SELECT Gender, classID,
         AVG(Test1) AS mean1, AVG(Test2) AS mean2,
         STDEV(Test1) AS sd1, STDEV(Test2) AS sd2,
         COUNT(Test1) as N1, COUNT(Test2) as N2
         FROM `Student$`
         GROUP BY Gender, classID"

 sqlQuery(con, qry5)


6.5 Merging tables

It is often the case that tables need to be merged; for instance, a mixed effects analysis investigating the effects of school level variables on student test scores requires the student-level data and the school-level data to be available in one data frame. The merging operation matches the data in two tables using common values in ID columns within the two tables. Suppose we want all the class information in the class table to be available in the student table. Both tables contain a classID column, and classID = 1 in the student table is the same class as classID = 1 in the class table. The SQL SELECT command below names the two tables to be merged (student and class); the type of merge to use (LEFT OUTER JOIN); and the columns on which the two tables are matched (classID in both the student and class tables). An outer join is used here because it does not require each record in the joined table to have a matching record in both the original tables. The LEFT OUTER JOIN always contains all records for the "left" table (student), even if there is no matching record for the "right" table (class). In other words, all the student information in the student table is retained; it's just that some of the students might be missing data on the class variables. If an INNER JOIN had been used, then rows would not be generated if there was no matching records in either of the tables. This means that students would be dropped from the merged table if the students belonged to a class for which there was not record in the class table. It is noted that there is no missing class data in either table, so either an outer join or an inner join would have worked.


 qry6 <- "SELECT * FROM
   student LEFT OUTER JOIN class
   ON student.classID = class.classID"

 student <- sqlQuery(con, qry6)
 str(student)


The next step is to merge the three tables so that the school and class information is available in the student table. One way to do this is to use subqueries. A subquery is surrounded by brackets, and the table that results is treated like any other table. There is one restriction though – subquery tables must be given an alias through the AS operator. In the command that follows, the subquery table is a merge of the class and school tables on the school id. The table that results (aliased AS X) is merged with the student table on the class id.


 qry7 <- "SELECT * FROM
      student LEFT OUTER JOIN
         (SELECT * FROM
         class LEFT OUTER JOIN school
         ON class.schID = school.schID) AS X
      ON student.classID = X.classID"

 student <- sqlQuery(con, qry7)
 str(student)


6.6 Other ways to manipulate Excel data

At this point, it is worth noting that the manipulations, aggregations and merges could have been carried out in R. The merging of three tables is demonstrated below using R syntax in place of SQL queries. The first three commands import three tables (student, class, and school) into R in the form of data frames. R's merge() function deals with the merging. First, the student and class data frames are merged to give a data frame called studentNew, then second, the studentNew data frame is merged with the school data frame, and the result is again given the name studentNew. The by = statements give the names of the variable on which the data frames are matched, and all.x = TRUE requests an outer left join (x refers to the x or left most data frame in the merge() function).


 student <- sqlFetch(con, "student")
 class <- sqlFetch(con, "class")
 school <- sqlFetch(con, "school") 
  
 studentNew <- merge(student, class, by = "classID", all.x = TRUE)
 studentNew <- merge(studentNew, school, by = "schID", all.x = TRUE)

 str(studentNew)
 odbcCloseAll()


The two merge commands could be collapsed to a single command containing a merge within a merge.


 studentNew <- merge(merge(student, class, by = "classID",
      all.x = TRUE), school, by = "schID", all.x = TRUE)

 str(studentNew)


Even so, R accommodates the R user who has a preference for SQL. The sqldf package (Grothendieck, 2012) allows SQL commands to be applied to data frames. The following assumes that the three data frames (student, class, and school) have already been imported into R.


 install.packages("sqldf")  # To install – Run once only
 library(sqldf)             # To load – Run each new R session

 studentNew = sqldf("SELECT * FROM
      student LEFT OUTER JOIN
       (SELECT * FROM
       class LEFT OUTER JOIN school
       ON class.schID = school.schID) AS X
      ON student.classID = X.classID")

 str(studentNew)


Note that the SQL SELECT command above is almost identical with the SQL SELECT command used earlier to merge three tables. One difference is in the names of the data frames / tables: above, the names refer to data frames; earlier, the names refer to database tables. Another difference concerns the site of the merging operation: above, the merge took place within R; earlier, the SQL SELECT command was sent through the open connection and the merge took place outside R in the database.

In most instances, it does not matter whether one uses the database (and thus SQL) or R (and thus R syntax or sqldf) to do the manipulating and merging. However, there are times when the R approach to merging could fail. R has limitations with handling very large data files. That is, it might not be possible to accommodate multiple large files in R, and therefore it makes sense to perform the merging in the database using SQL. Or, it is possible for a single file to be so large that it exceeds R's memory limit. In that case, it makes sense to import part of the table; importing either selected variables or a selection of cases, and thus, again, to perform the selections using SQL in the database.

References
Grothendieck, G. (2012). sqldf: Perform SQL selects on R data frames. R package version 0.4-6.4. http://cran.r-project.org/package=sqldf

Kabacoff, R. (2011). R in action: Data analysis and graphics with R. Shelter Island, NY: Manning.

Ripley, B. (2012). ODBC Connectivity. A vignette in the RODBC package.

Ripley, B. & Lapsley, M. (2012). RODBC: ODBC database access. R package version 1.3-6. http://cran.r-project.org/package=RODBC

Spector, P. (2008). Data manipulation with R. New York, NY: Springer.

5 comments:

  1. Incredible article,

    thank you so much ¡¡

    ReplyDelete
  2. When I try to use your example for opening an ms access mdb file using the newest R for windows as soon as I reach the following code:
    school <- sqlFetch(con2, "school")
    str(school)

    qry <- "SELECT * FROM class"
    class <- sqlQuery(con2, qry)
    str(class)

    No matter if I use sqlFetc or sqlQuery I recive the same error: R for windows gui front-end has stopped working...why does this happens? any ideas

    ReplyDelete
  3. superb article -- saved my day.

    ReplyDelete