Welcome to the second of three posts where we are walking through a data visualization task (read the first post here). The goal is to create a U.S. map for a company (John Doe Cleans), where each state in which the company has a presence is shaded and the states with no presence are left unshaded. The wrinkle is that there’s more than one way to interpret “presence.” John Doe Cleans has business locations, which can have territories and vehicles, all three of which have geospatial coordinates.
My goal was to do a quick and dirty solution, which means that code can be a little slapdash, as long as it gives me an accurate map at the end. In Part I, we pulled CSV files for Locations, Territories, Vehicles, and Geopoints into a spreadsheet, separated the Locations into four possible cases, and created a tab for each case.
In Part II, we’re going to move into RStudio, read in the tabs we need from the spreadsheet, and join each Location, Location-Territory, and Location-Vehicle with the Geopoints, so that we have geospatial coordinates for each one. Based on this, we will use a reverse geocoding service to identify the state for each record, based on its geospatial coordinates.
Joining Data in R
Next in the plan is to look up the coordinates for each Geopoint that matches one of our Locations, Territories, or Vehicles. This is where I jumped over to RStudio, so that I could take advantage of the join functions provided in dplyr. I used just a few packages for the task.
- openxlsx – I find this package to have friendly and intuitive functions for reading and writing data in Excel spreadsheets.
- gdata – The rename.vars function is so easy to use, even more so than the similar function in the Tidyverse.
- dplyr – This is the Tidyverse package for transforming raw data into tidy data in a grid. It is from this library that we'll get the join function we'll use.
When we worked on Part I, we found that each Location has to fall into one of four possible cases:
- Location has no Territories and no Vehicles
- Location has some Territories and no Vehicles
- Location has no Territories and some Vehicles
- Location has some Territories and some Vehicles.
Based on these four cases, we divided the Locations into four (mutually exclusive) tabs. Now, recall that each Location has its own Geopoint, separate from any Territories or Vehicles it may have. For example, if the Locations tab has four rows, one falling into each case, then each of the four new tabs gets one record: our Locations dataframe will have 4 rows in it, and the three other dataframes we read in will each have one row. As a result, in Case 1 we actually need to look up the coordinates for all Locations, not just for the ones without a Territory or Vehicle.
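To make the counting concrete, here is a small sketch with hypothetical data: four Locations, one per case. The column names (LocationID, GeopointID, HasTerritory, HasVehicle) are illustrative, not the actual spreadsheet schema.

```r
# Toy Locations data: four rows, one falling into each case.
locations <- data.frame(
  LocationID   = c("L1", "L2", "L3", "L4"),
  GeopointID   = c("G1", "G2", "G3", "G4"),
  HasTerritory = c(FALSE, TRUE, FALSE, TRUE),
  HasVehicle   = c(FALSE, FALSE, TRUE, TRUE)
)

# Case 1 tab: no Territories and no Vehicles -- only L1 qualifies.
case1 <- locations[!locations$HasTerritory & !locations$HasVehicle, ]
nrow(case1)      # 1

# But every Location has its own Geopoint, so the coordinate lookup
# must cover all four rows of `locations`, not just the Case 1 row.
nrow(locations)  # 4
```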
Reading the Spreadsheet Data into RStudio
Now that we’ve figured out how we will proceed, it’s time to read the data into R. Using the read.xlsx function, I need to specify the file location (my current folder), the name of the tab that I want to bring in, and an instruction to interpret the first row as column titles. There are many other options, which you can explore on the Help tab in RStudio.
geoPoints <- read.xlsx(xlsxFile = "./locations.xlsx",
                       sheet = "Geopoints",
                       colNames = TRUE)
This function call takes the data from the Geopoints tab in the spreadsheet and puts it into a dataframe named geoPoints. Everything that was a row in the spreadsheet is a row in the dataframe, and same for columns. To make sure it has worked correctly, you can compare the number of rows in the spreadsheet to the number of rows in the new dataframe, and then review the new dataframe in RStudio.
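Those sanity checks might look like the following, using a hypothetical stand-in for the geoPoints dataframe (the real one comes from read.xlsx, but the checks are the same):

```r
# Stand-in for the dataframe read from the Geopoints tab.
geoPoints <- data.frame(
  `Geopoint ID` = c("G1", "G2", "G3"),
  Longitude     = c(-77.0, -84.4, -71.1),
  Latitude      = c(38.9, 33.7, 42.4),
  check.names   = FALSE
)

nrow(geoPoints)  # should match the spreadsheet's row count (minus the header)
head(geoPoints)  # eyeball the first few rows against the spreadsheet

# 0 means every Geopoint ID is unique, so joins won't duplicate rows.
anyDuplicated(geoPoints$`Geopoint ID`)
```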
Next I’ll read in the Locations tab, so that we can get the coordinates for each business location. Using the same read.xlsx function, I’ll also read in the Territories no Vehicles, Vehicles no Territories, and Territories & Vehicles tabs, so we’ll have everything we need. I named these dataframes allLocations, territoriesNoVehicles, vehiclesNoTerritories, and bothLocations, matching their source tabs.
Joining the Geopoints with Locations
Once all the data tabs are read into RStudio, we need to look at the join functions from dplyr. This set of functions works much like joining two tables in a relational database: we select two dataframes, we choose a column to join on, and we can specify other behaviors as well. The first join we are going to perform is between allLocations and geoPoints. We want the resulting list to include every record in allLocations, whether it has a match in geoPoints or not. Where it does have a matching geoPoints record, we want the longitude and latitude columns to appear with the location. As a result, I’m going to assign allLocations to be my “left” dataframe and geoPoints to be my “right.” The left_join function will give us exactly what we want.
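A quick demonstration of those left-join semantics, with toy dataframes (the IDs are hypothetical): every row of the left dataframe survives, and an unmatched ID comes back with NA coordinates rather than being dropped.

```r
library(dplyr)

# Toy dataframes: L3 points at G9, which has no match in geoPoints.
allLocations <- data.frame(LocationID = c("L1", "L2", "L3"),
                           GeopointID = c("G1", "G2", "G9"))
geoPoints    <- data.frame(GeopointID = c("G1", "G2", "G3"),
                           Longitude  = c(-77.0, -84.4, -71.1),
                           Latitude   = c(38.9, 33.7, 42.4))

# All three allLocations rows survive; G9's Longitude/Latitude are NA.
joined <- left_join(x = allLocations, y = geoPoints, by = "GeopointID")
joined
```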
There is one final thing before we’re ready to join our dataframes. The column that we join on needs to be named exactly the same (case sensitive) in both dataframes. The column names in our example spreadsheet are the same (Geopoint ID), but they have a space in them. Just to be clear and avoid errors, let’s rename them to remove the space. The rename.vars function from gdata is very easy to understand; we give it the dataframe, a list of columns whose names need to be changed, and a parallel list of the new column names to be assigned. My code then looks like this.
allLocations <- rename.vars(data = allLocations,
                            from = c("Geopoint ID"),
                            to = c("GeopointID"))
geoPoints <- rename.vars(data = geoPoints,
                         from = c("Geopoint ID"),
                         to = c("GeopointID"))
We don’t have to use vectors (c()) if we are changing only one column name, but I tend to keep them in case I need to add other columns to the rename later. We will have similar calls for the other three dataframes.
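If you would rather not add gdata as a dependency, base R can do the same rename; this sketch uses a toy dataframe with the same "Geopoint ID" column:

```r
# Toy dataframe whose column name contains a space.
allLocations <- data.frame(`Geopoint ID` = c("G1", "G2"),
                           check.names = FALSE)

# Base R equivalent of rename.vars: replace the matching name in place.
names(allLocations)[names(allLocations) == "Geopoint ID"] <- "GeopointID"
names(allLocations)  # "GeopointID"
```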
Now we’re ready to perform our first join. We have identified our “left” and “right,” decided which type of join to perform, and ensured that our column names match.
geoAllLocations <- left_join(x = allLocations,
                             y = geoPoints,
                             by = "GeopointID")
I get the new dataframe geoAllLocations, which has the columns Location ID, GeopointID, Longitude, and Latitude. The new geoAllLocations should have the same number of rows as the original allLocations dataframe. There may be some rows where the longitude and latitude are NA. Depending on what we’re doing with the dataframes next, we may want to change those values or filter out those rows. We’ll consider this later.
The “Territories no Vehicles” and “Vehicles no Territories” joins are very similar, with the difference that the new dataframe is likely to have more rows than the original in both cases, since one Location can have many Territories or Vehicles. Once we read in the tabs from the spreadsheet and rename columns as needed, the join statements will look something like
geoTerritoriesNoVehicles <- left_join(x = territoriesNoVehicles,
                                      y = geoPoints,
                                      by = "GeopointID")
geoVehiclesNoTerritories <- left_join(x = vehiclesNoTerritories,
                                      y = geoPoints,
                                      by = "GeopointID")
For the Locations that have both Territories and Vehicles, we’ll pull in the data from the spreadsheet tab, split the data into Location+Territories and Location+Vehicles, look up the spatial coordinates for each row, and bind the rows together into one long dataframe.
bothLocations <- read.xlsx(xlsxFile = "./locations.xlsx",
                           sheet = "Territories & Vehicles",
                           colNames = TRUE)
bothTerritories <- select(bothLocations, `Location ID`, `Geopoint ID`)
bothVehicles <- bothTerritories
This gives me two dataframes to work with. We can then use geoPoints to add the coordinates to each, just as we did above.
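Those remaining steps can be sketched as follows; the dataframe names geoBothTerritories, geoBothVehicles, and geoBoth are my own placeholders, and the toy data is hypothetical:

```r
library(dplyr)

# Hypothetical halves split out of the Territories & Vehicles tab.
bothTerritories <- data.frame(LocationID = "L4", GeopointID = "G4")
bothVehicles    <- data.frame(LocationID = "L4", GeopointID = "G5")
geoPoints       <- data.frame(GeopointID = c("G4", "G5"),
                              Longitude  = c(-87.6, -87.7),
                              Latitude   = c(41.9, 41.8))

# Add coordinates to each half, then stack them into one long dataframe.
geoBothTerritories <- left_join(bothTerritories, geoPoints, by = "GeopointID")
geoBothVehicles    <- left_join(bothVehicles,    geoPoints, by = "GeopointID")
geoBoth            <- rbind(geoBothTerritories, geoBothVehicles)
nrow(geoBoth)  # 2
```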
Finding the State for Each Record
There are a number of R packages with spatial information about the 50 U.S. states. If you have more time than money, this is the way to go. I had limited time for my task, which involved countries in addition to the U.S., so I decided that I could accomplish my goal more quickly by using a service. To look up the state for each pair of coordinates, I used Geocodio, which is a ridiculously easy service for geocoding and reverse geocoding. You can upload a spreadsheet or CSV file, get a preview of some points on a map to help you make sure you uploaded the right spreadsheet, and then tell it to get to work. On a free account, you get up to 2,500 lookups (rows in your spreadsheet) per day; beyond that, there’s an additional cost per 1,000 lookups. When Geocodio finishes processing your file, you’ll get a link for a new spreadsheet to download.
Right now, we have five dataframes in RStudio that have spatial coordinates in them. It will be easiest if we can give Geocodio just one file. The rbind function from base R will put together all our dataframes, as long as the columns are named identically in all five. If one of your dataframes has extra columns in it, then the select function from dplyr will let you choose just the columns you want to keep.
someLocations <- select(someLocations, `Location ID`, GeopointID, Longitude, Latitude)
Once we have the right columns, with the same names, and in the same order in all five dataframes, we can do our row bind.
lookupPoints <- rbind(geoAllLocations,
                      geoTerritoriesNoVehicles,
                      geoVehiclesNoTerritories,
                      geoBothTerritories,  # the joined Territories half of the "both" tab
                      geoBothVehicles)     # the joined Vehicles half of the "both" tab
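As a quick illustration of why the column names must line up exactly, here is a toy example: rbind stacks dataframes whose names match, and raises an error (rather than silently misaligning data) when they don't.

```r
# Two dataframes with identical column names stack cleanly.
a <- data.frame(LocationID = "L1", Longitude = -77.0, Latitude = 38.9)
b <- data.frame(LocationID = "L2", Longitude = -84.4, Latitude = 33.7)
combined <- rbind(a, b)
nrow(combined)  # 2

# A mismatched column name ("Lon" instead of "Longitude") is an error.
bad <- data.frame(LocationID = "L3", Lon = -71.1, Latitude = 42.4)
inherits(try(rbind(a, bad), silent = TRUE), "try-error")  # TRUE
```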
Now we can export the lookupPoints dataframe to a spreadsheet file and upload it to Geocodio. The write.xlsx function works in reverse of the read.xlsx function; we supply the name of our dataframe, the path and file name, and an option for whether to overwrite an existing file.
write.xlsx(x = lookupPoints,
           file = "./lookup_points.xlsx",
           overwrite = TRUE)
Now we can go to https://geocod.io to upload our spreadsheet. After a few minutes, Geocodio will tell us it has finished processing, and we can download the new spreadsheet which includes street address (where available), country, city, state, county, and ZIP code.
This is it for Part II, so let’s review what we’ve done.
- We observed that even though we made a tab for the case where a Location has no Territories and no Vehicles, we will need to look up the Geopoint record for all Locations.
- In RStudio, we read in the data from the original Locations tab and the three tabs that have Locations with Territories and/or Vehicles.
- After adjusting column names as needed, we used left_join to add the longitude and latitude to each Location, Location-Territory, and Location-Vehicle.
- Then we combined our five dataframes into one, which we saved off as a spreadsheet and uploaded to the Geocodio service. Now we have a spreadsheet with a state for each “presence” John Doe Cleans has in the U.S.
We’ll start in Part III by reporting states by points of presence, and finally, we’ll use another of my favorite tools, Venngage, to create our map for John Doe Cleans.
We love data science here at Web Teks. If you have an idea for a project, app, or website, let us know how we can help!