News & Blog

Making a map from data: Part I

News & Blog

Data mapping

As the resident data nerd, I get called on from time to time to do a data analysis or visualization. Recently, I had the opportunity to work with some data that had an interesting twist. With certain details adjusted, the problem I faced was to determine in which states a housekeeping company has a presence. This may seem like an easy problem. I mean, you have a presence in a state if you are there, right? But we had to figure out what “there” means, and even what “you” means.

The Challenge

Consider that our housekeeping company (John Doe Cleans) has office locations in various states. These are easy because each location has an address, so let’s call it Location. Then we consider that there may be more than one location in an area, and they serve different territories. Here in coastal Virginia, we have a lot of military presence, so there may be one John Doe Cleans location that has access to military bases while another does not. Each Location could have one territory, several territories, or no territories; to simplify, we will define each Territory by its center point. Finally, each Location may own or manage Vehicles, and those vehicles continuously send their GPS coordinates to the location.

Blog diagrams

For the Location shown in the diagram, we could have as many as 6 geographical points, each of which can count as a presence of John Doe Cleans in the state. We’ll be generous and accept all of the geopoints for a Location in our analysis.

blog entities

It’s also safe to assume that we’re using a relational structure in our database, so each of these is basically an entity, and both the Territory and Vehicle entities use the Location ID to point up to the location they belong to. We have a separate entity that holds the GIS data, so each of our three primary entities also point to the Geopoints entity to get its spatial coordinates. The structure then looks like this, and we have a CSV export of each entity from the database.

The Plan

My goal was to do a quick and dirty solution. I needed to deliver a map in which a state is shaded if John Doe Cleaning has a presence in it or left unshaded if not. To make this map, I need a list of states and some value that tells me that John Doe Cleaning has a presence there. It’s a one-time task and not a code solution for reuse, so we don’t have to have pretty, meaningful code. So it’s time to make a plan.

Looking at the entities, I realized that each Location has to fall into one of four possible cases:

  1. Location has no Territories and no Vehicles
  2. Location has some Territories and no Vehicles
  3. Location has no Territories and some Vehicles
  4. Location has some Territories and some Vehicles.

If we were writing elegant code intended for reuse, we would want to break this down in code. But to produce a result more quickly, we can use a spreadsheet for a lot of this. (As a statistics and data science nerd, I know I’m supposed to say you should do it all in R, but there are some things that are just quicker for me in a spreadsheet.) My plan, then was to

  • Pull the CSV files into a spreadsheet
  • Split the Locations into the four cases
  • Get the coordinates for each row in the four cases
  • Find the state for each set of coordinates
  • Report out each state that shows up in that list
  • Make my map.

For today, we’ll look at those first three items in the plan, and next week we’ll wrap up with the remainder.

The Solution, Part 1

Our first step in the plan is to pull the CSV files into a spreadsheet, so we have them all in one place. Easy-peasy: we can check this one off right quick. Now we have a spreadsheet with four tabs: Locations, Territories, Vehicles, Geopoints.

Spreadsheet Optimization

For step 2, we need to divide our Locations tab into four separate tabs. How do we figure out which Locations belong to which tab? We’ll take advantage of two functions: COUNTIF and AND.

The COUNTIF function has two parts: a value and a range. Your spreadsheet will look at all the data located in the range and count up all the times your value occurs there. I used this to look up each Location and see how many times it occurred on the Territories tab and again on the Vehicles tab.

The AND function can take two or more arguments. Each argument should be a logical statement that can be either TRUE or FALSE. For example, if I used AND(1 = 2, 4 = 4), then the spreadsheet will look at the first statement (1 = 2) and say that it is FALSE. The second statement, (4 = 4) is TRUE. For the AND to give us a TRUE result, both of the statements must be TRUE. As a result, this formula would give us a FALSE in the spreadsheet cell.

Let’s see how these work together to do what we need. First, the COUNTIF functions. You can see the formulas on the Locations tab. In the first one, we look for the Location ID 11454 to see how many times it shows up in the Location ID column on the Territories tab, which is 3. Similarly, we look up 11454 on the Vehicles tab, and see that it occurs 2 times. Now we know our Location has both Territories and Vehicles, which means it falls into Case 4 as they are broken down above.

blog table 1.1.1

blog table 1.1.2

Because I was working for speed, I didn’t take the time to craft one beautiful spreadsheet formula, so I added four more columns to the Locations tab, to represent the four possible cases for each Location. Here you can see how the AND function translates each case into spreadsheet language. The first one, in column E, has no territories and no vehicles. Next in column F, we have some territories and no vehicles.

Blog tables 2.1

Now each Location has one TRUE and three FALSEs in these four columns. I used a filter (on the Data ribbon in Excel or the Data menu in Google Sheets) to select just the TRUE Locations for each case, and then pasted the resulting Locations onto a new tab in the Spreadsheet.

The Conclusion

This is it for Part I, so let’s review what we’ve done.

  1. Given CSV files for Locations of John Doe Cleaning, as well as Territories, Vehicles, and Geopoints, we pulled these together into one spreadsheet for some organization.
  2. Realizing that each Location has to fit one of four cases, we used COUNTIF and AND functions to tell us which case each Location fits.
  3. Based on those results, we split the Locations for each case into separate tabs.

Now that we’ve accomplished this much, we will shift from the spreadsheet into RStudio to continue our work. In Part II, we will complete the first two tasks below, and wrap them all up in Part III.

  • Pull these data into R
  • Join the geographical coordinates for each Location, Location-Territory, and Location-Vehicle
  • Find the state for each set of coordinates
  • Report out each state that shows up in that list
  • Make our map

Stay tuned for part 2. We love data science here at Web Teks . If you have an idea for a project, app or website, let us know how we can help!

 

 

We take processes apart, rethink, rebuild, and deliver them back working smarter than ever before.