This guide provides step-by-step instructions for downloading data from PolicyMap. PolicyMap is a web-based data and mapping application that gives you access to over 15,000 indicators related to demographics, housing, crime, mortgages, health, jobs and more. Data are available at all common geographies (address, block group, census tract, zip code, county, city, state, metropolitan area) as well as unique geographies like school districts and political boundaries.
In this guide, we will download PolicyMap census tract data for the City of Oakland. We will download median housing value. UC Davis provides full access to all PolicyMap tools for staff, students, and faculty. You can access the full site only if you are on campus or logged onto the UC Davis VPN Client. Download the PulseSecure VPN app using the directions outlined here. We will download data from the UCD PolicyMap portal and clean up the files in R.
We’ll be using the package VIM in this guide. Install it.
install.packages("VIM")
Then load it and the other packages we will be using in this guide.
library(tidyverse)
library(tidycensus)
library(VIM)
You should get a map that highlights Oakland’s boundaries.
Now your map should look like the following. In the map legend you can change the year, the data type, and the geographic level. Leave the defaults (Year: 2014-2018, Variable: Median dollars, and Shaded by: Census Tract, 2010).
Let’s download these data. At the top right of the site, click on the download icon.
A window should pop up. The first screen asks you what data to download: it should be “Estimated median value of an owner-occupied home, between 2014-2018” under Layer. Click on the Next, Select Location button.
The next screen asks you to confirm the download; just click on Download CSV.
After a minute or two, a screen like below (on a Mac laptop) should pop up (the file name will differ).
Save the file into an appropriate folder, such as below (Mac laptop).
Bring in the data using read_csv().
pm.file <- read_csv("##YOUR FILE NAME HERE")
We need to clean up the file to get it ready for analysis. The first issue with pm.file is the footnotes located at the bottom of the file. If you view the file in R, you will see the footnotes at the bottom of the tibble. To remove them, use the function slice().
pm.file <- pm.file %>%
  slice(-(114:n()))
The function removes all rows starting from row 114 to the last row in the tibble (the last row is indicated by the function n()). Note the negative sign, which tells the function to remove the rows. Without the negative sign, the function would instead keep those rows.
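As an illustration of the negative/positive distinction (assuming, as above, that the footnotes begin at row 114 of the original, untrimmed file), the two calls below keep the same rows. This is just a sketch, so don’t re-run it on the already-trimmed tibble.
# drop rows 114 through the last row (what we did above)
pm.file %>% slice(-(114:n()))
# equivalently, keep only rows 1 through 113
pm.file %>% slice(1:113)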
We also need to rename variables and keep the ones that are relevant. First, let’s look at pm.file’s column names.
names(pm.file)
## [1] "Census Tract"
## [2] "FIPS Code"
## [3] "Formatted FIPS"
## [4] "Estimated median value of an owner-occupied home, between 2014-2"
The variable Estimated median value of an owner-occupied home, between 2014-2 contains median housing values. Let’s shorten the name because we would not want to type this long name out every time we refer to this variable. Make the name simple and clear. Here, we rename it medval using the rename() function.
pm.file <- pm.file %>%
  rename(medval = "Estimated median value of an owner-occupied home, between 2014-2")
We had to put quotes around the original variable name because it contains spaces.
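Backticks also work for column names that contain spaces; the sketch below is an alternative version of the same rename() call (run one version or the other, not both).
# same result using backticks instead of quotes around the long name
pm.file <- pm.file %>%
  rename(medval = `Estimated median value of an owner-occupied home, between 2014-2`)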
Next, let’s keep the necessary variables: FIPS Code and medval. FIPS Code is the tract GEOID.
pm.file <- pm.file %>%
  select("FIPS Code", medval)
Let’s bring in some census data from the Census API. We covered how to use get_acs() from the tidycensus package in Lab 2.
ca <- get_acs(geography = "tract",
              year = 2018,
              variables = c(tpopr = "B03002_001",
                            nhwhite = "B03002_003", nhblk = "B03002_004",
                            nhasn = "B03002_006", hisp = "B03002_012"),
              state = "CA",
              survey = "acs5")
ca <- ca %>%
  select(-moe) %>%
  spread(key = variable, value = estimate)
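In newer versions of tidyr, spread() has been superseded by pivot_wider(). If you prefer the newer function, a sketch of the equivalent reshaping is below; run it instead of the spread() version, not after it.
# equivalent reshaping with pivot_wider()
ca <- ca %>%
  select(-moe) %>%
  pivot_wider(names_from = variable, values_from = estimate)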
Merge ca into pm.file. The linking variable is FIPS Code in pm.file and GEOID in ca.
pm.file <- pm.file %>%
  left_join(ca, by = c("FIPS Code" = "GEOID"))
Missing values are a part of a social scientist’s life. You can’t avoid them.
You will notice that PolicyMap designates missing values as N/A, whereas R designates missing values as NA. R reads N/A as a character string, so it treats medval as a character rather than a numeric variable.
class(pm.file$medval)
## [1] "character"
We need to replace N/A with NA and then convert the variable to numeric.
pm.file <- pm.file %>%
  mutate(medval = as.numeric(ifelse(medval == "N/A", "NA", medval)))
The code ifelse(medval == "N/A", "NA", medval) says that if the variable medval equals “N/A”, replace it with “NA”; otherwise, keep its original value. as.numeric() then converts medval to a numeric vector.
class(pm.file$medval)
## [1] "numeric"
The next step is to determine what percentage of your cases are missing data. The best function for doing this in R is aggr(), which is in the VIM package. Run the aggr() function as follows.
summary(aggr(pm.file))
##
## Missings per variable:
## Variable Count
## FIPS Code 0
## medval 2
## NAME 0
## hisp 0
## nhasn 0
## nhblk 0
## nhwhite 0
## tpopr 0
##
## Missings in combinations of variables:
## Combinations Count Percent
## 0:0:0:0:0:0:0:0 111 98.230088
## 0:1:0:0:0:0:0:0 2 1.769912
The results show two tables and two plots. The left-hand side plot shows the proportion of cases that are missing values for each variable in the data set. The right-hand side plot shows which combinations of variables are missing. The first table shows the number of cases that are missing values for each variable in the data set. The second table shows the percent of cases missing values based on combinations of variables. The results show that 2 census tracts, or 1.8%, are missing values on the variable medval.
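If you only want a quick count of missing values for a single variable rather than the full aggr() summary, a simple sketch is to combine is.na() with sum(); the result should match the 2 missing tracts reported above.
# number of tracts with a missing median housing value
sum(is.na(pm.file$medval))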
In any statistical analysis, you will need to deal with missing values. For example, if you wanted to find out the average median housing value in Oakland tracts, you would type in
mean(pm.file$medval)
## [1] NA
The mean is NA, which tells you that there are missing values in the variable medval that you need to deal with before R will calculate a value. There are many ways to deal with missing data. One method is to simply ignore or discard cases with a missing value. To do this in the mean() function (and in many other R functions), include the argument na.rm = TRUE.
mean(pm.file$medval, na.rm = TRUE)
## [1] 589359.5
As long as only a small proportion of your cases are missing data, simply ignoring them is often acceptable. Just make sure you are transparent about what you did.
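If you would rather drop the incomplete cases explicitly instead of adding na.rm = TRUE to every function call, one sketch is to filter out the missing tracts first; the object name pm.file.complete is just an illustrative choice.
# keep only tracts with a non-missing median housing value
pm.file.complete <- pm.file %>%
  filter(!is.na(medval))
mean(pm.file.complete$medval)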
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil