In Lab 3, you learned how to process spatial data in R, focusing primarily on reading in, wrangling and mapping areal or polygon data. In this lab, we will cover point data in R. More broadly, we will cover how to access and clean data from Open Data portals and OpenStreetMap. In Lab 4b, we will learn how to process Big Data using Twitter Tweets as a case example. The objectives of this guide are to learn how to convert point data into sf objects, geocode street addresses, establish and reproject Coordinate Reference Systems, bring in data from OpenStreetMap, and run basic point analyses such as neighborhood counts and kernel density maps.
To achieve these objectives, we will first examine the spatial distribution of homeless encampments in the City of Los Angeles using 311 data downloaded from the city’s open data portal. This lab guide follows closely and supplements the material presented in Chapters 2.4, 4.2, 7 and 8 in the textbook Geocomputation with R (GWR) and class Handout 3.
You’ll need to install the following packages in R. You only need to do this once, so if you’ve already installed these packages, skip the code. Also, don’t put install.packages() in your R Markdown document. Copy and paste the code into the R Console.
install.packages(c("tidygeocoder", "osmdata"))
You’ll need to load the following packages. Unlike installing, you will always need to load packages whenever you start a new R session. You’ll also always need to use library() in your R Markdown file.
library(sf)
library(tidyverse)
library(units)
library(tmap)
library(tidycensus)
library(tigris)
library(rmapshaper)
library(leaflet)
library(tidygeocoder)
library(osmdata)
We will need to bring in census tract polygon features and racial composition data from the 2015-2019 American Community Survey using the Census API and keep tracts within Los Angeles city boundaries using a clip. The code for accomplishing these tasks is below. We won’t go through each line of code in detail because we’ve covered all of these operations and functions in prior labs. I’ve embedded comments within the code that briefly explain what each chunk is doing, but go back to prior guides (or RDS/GWR) if you need further help. The only new code I use is rename_with(~ sub("E$", "", .x), everything()) to eliminate the E at the end of the variable names for the estimates.
# Bring in census tract data.
ca.tracts <- get_acs(geography = "tract",
                     year = 2019,
                     variables = c(tpop = "B01003_001", tpopr = "B03002_001",
                                   nhwhite = "B03002_003", nhblk = "B03002_004",
                                   nhasn = "B03002_006", hisp = "B03002_012"),
                     state = "CA",
                     survey = "acs5",
                     output = "wide",
                     geometry = TRUE)
# Make the data tidy, calculate percent race/ethnicity, and keep essential vars.
ca.tracts <- ca.tracts %>%
  rename_with(~ sub("E$", "", .x), everything()) %>%
  mutate(pnhwhite = nhwhite/tpopr, pnhasn = nhasn/tpopr,
         pnhblk = nhblk/tpopr, phisp = hisp/tpopr) %>%
  dplyr::select(c(GEOID, tpop, pnhwhite, pnhasn, pnhblk, phisp))
# Bring in city boundary data
pl <- places(state = "CA", year = 2019, cb = TRUE)
# Keep LA city
la.city <- filter(pl, NAME == "Los Angeles")
# Clip tracts using LA boundary
la.city.tracts <- ms_clip(target = ca.tracts, clip = la.city, remove_slivers = TRUE)
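As a quick sanity check (optional, and just a sketch), you can plot the clipped tract boundaries to confirm the clip worked:
# Base plot of the geometry only
plot(st_geometry(la.city.tracts))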
Point data are the building blocks of object or vector data. They give us the locations of objects or events within an area. Events can be things like crimes and car accidents. Objects can be things like trees, houses, jump bikes or even people, such as the locations of where people were standing during a protest.
Often you will receive point data in tabular (non-spatial) form. These data can be in one of two formats: with geographic coordinates (longitude and latitude) attached, or with street addresses that need to be geocoded.
If you have longitudes and latitudes, you have all the information you need to make the data spatial. This process involves using geographic coordinates (longitude and latitude) to place points on a map. In some cases, you won’t have coordinates but street addresses. Here, you’ll need to geocode your data, which involves converting street addresses to geographic coordinates. These tasks are intimately related to the concept of projection and reprojection, and underlying all of these concepts is the Coordinate Reference System.
Best case scenario is that you have a point data set with geographic coordinates. Geographic coordinates are in the form of a longitude and latitude, where longitude is your X coordinate and spans East/West and latitude is your Y coordinate and spans North/South.
Let’s bring in a csv data set of homeless encampments in Los Angeles City, which was downloaded from the Los Angeles City Open Data portal. I uploaded the data set on GitHub so you can directly read it in using read_csv(). If you’re having trouble bringing this data in, I uploaded the file onto Canvas in the Week 4 - Lab folder.
homeless.df <- read_csv("https://raw.githubusercontent.com/crd230/data/master/homeless311_la_2019.csv")
# Take a look at the data
glimpse(homeless.df)
## Rows: 55,536
## Columns: 34
## $ SRNumber <chr> "1-1523590871", "1-1523576655", "1-152357498…
## $ CreatedDate <chr> "12/31/2019 11:26:00 PM", "12/31/2019 09:27:…
## $ UpdatedDate <chr> "01/14/2020 07:52:00 AM", "01/08/2020 01:42:…
## $ ActionTaken <chr> "SR Created", "SR Created", "SR Created", "S…
## $ Owner <chr> "BOS", "BOS", "BOS", "BOS", "BOS", "LASAN", …
## $ RequestType <chr> "Homeless Encampment", "Homeless Encampment"…
## $ Status <chr> "Closed", "Closed", "Closed", "Closed", "Clo…
## $ RequestSource <chr> "Mobile App", "Mobile App", "Mobile App", "M…
## $ CreatedByUserOrganization <chr> "Self Service", "Self Service", "Self Servic…
## $ MobileOS <chr> "iOS", "Android", "iOS", "iOS", "iOS", NA, "…
## $ Anonymous <chr> "N", "Y", "Y", "Y", "Y", "N", "N", "N", "N",…
## $ AssignTo <chr> "WV", "WV", "WV", "WV", "WV", "NC", "WV", "W…
## $ ServiceDate <chr> "01/14/2020 12:00:00 AM", "01/10/2020 12:00:…
## $ ClosedDate <chr> "01/14/2020 07:51:00 AM", "01/08/2020 01:42:…
## $ AddressVerified <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ ApproximateAddress <chr> NA, NA, NA, NA, NA, "N", NA, NA, NA, NA, "N"…
## $ Address <chr> "CANOGA AVE AT VANOWEN ST, 91303", "23001 W …
## $ HouseNumber <dbl> NA, 23001, NA, 5550, 5550, NA, 5200, 5500, 2…
## $ Direction <chr> NA, "W", NA, "N", "N", NA, "N", "N", "S", "W…
## $ StreetName <chr> NA, "VANOWEN", NA, "WINNETKA", "WINNETKA", N…
## $ Suffix <chr> NA, "ST", NA, "AVE", "AVE", NA, "AVE", "AVE"…
## $ ZipCode <dbl> 91303, 91307, 91367, 91364, 91364, 90004, 91…
## $ Latitude <dbl> 34.19375, 34.19388, 34.17220, 34.17143, 34.1…
## $ Longitude <dbl> -118.5975, -118.6281, -118.5710, -118.5707, …
## $ Location <chr> "(34.1937512753, -118.597510305)", "(34.1938…
## $ TBMPage <dbl> 530, 529, 560, 560, 560, 594, 559, 559, 634,…
## $ TBMColumn <chr> "B", "G", "E", "E", "E", "A", "H", "J", "B",…
## $ TBMRow <dbl> 6, 6, 2, 2, 2, 7, 3, 2, 6, 3, 3, 2, 7, 3, 2,…
## $ APC <chr> "South Valley APC", "South Valley APC", "Sou…
## $ CD <dbl> 3, 12, 3, 3, 3, 13, 3, 3, 1, 3, 7, 3, 3, 4, …
## $ CDMember <chr> "Bob Blumenfield", "John Lee", "Bob Blumenfi…
## $ NC <dbl> 13, 11, 16, 16, 16, 55, 16, 16, 76, 16, 101,…
## $ NCName <chr> "CANOGA PARK NC", "WEST HILLS NC", "WOODLAND…
## $ PolicePrecinct <chr> "TOPANGA", "TOPANGA", "TOPANGA", "TOPANGA", …
The data represent homeless encampment locations in 2019 as reported through the City’s 311 system. 311 is a centralized call center that residents, businesses and visitors can use to request service, report problems or get information from local government. To download the data from LA’s open data portal linked above, I filtered the 311 service requests to those with a RequestType of Homeless Encampment created in 2019 and exported the result as a csv file.
Viewing the file and checking its class, you’ll find that homeless.df is a regular tibble (with no geometry column), not a spatial sf points object.
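For instance, a quick check (a minimal sketch) shows that no sf class is attached:
# homeless.df is a plain tibble/data.frame; the "sf" class is absent
class(homeless.df)
"sf" %in% class(homeless.df) # FALSE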
We will use the function st_as_sf() to create a point sf object of homeless.df. The function requires you to specify the longitude and latitude of each point using the coords = argument, which are conveniently stored in the variables Longitude and Latitude.
homeless.sf <- homeless.df %>%
st_as_sf(coords = c("Longitude", "Latitude"))
Often you will get point data that won’t have longitude/X and latitude/Y coordinates but instead have street addresses. The process of going from address to X/Y coordinates is known as geocoding.
To demonstrate geocoding, type in your home street address, city and state inside the quotes below.
myaddress.df <- tibble(street = "", city = "", state = "")
This creates a tibble with your street, city and state saved in three variables. To geocode addresses to longitude and latitude, use the function geocode(), which is a part of the tidygeocoder package. Use geocode() as follows.
myaddress.df <- myaddress.df %>%
geocode(street = street, city = city, state = state, method = "census")
Here, we specify street, city and state variables. The argument method = 'census' specifies the geocoder used to map addresses to longitude/latitude locations. In the above case, 'census' uses the geocoder provided by the U.S. Census to find the address. Think of a geocoder as R going to the Census geocoder website, searching for each address, plucking the latitude and longitude of the address, and saving it into a tibble named myaddress.df. Other geocoders include Google Maps (requires an API key) and OpenStreetMap.
If you view this object, you’ll find the latitude lat and longitude long attached as columns. Convert this point to an sf object using the function st_as_sf().
myaddress.sf <- myaddress.df %>%
st_as_sf(coords = c("long", "lat"))
Type in tmap_mode("view") and then map myaddress.sf. Zoom into the point. Did it get your home address correct?
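A minimal sketch of that map (assuming you typed in your address above):
tmap_mode("view")
tm_shape(myaddress.sf) +
  tm_dots(col = "red")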
Let’s bring in a csv file containing the street addresses of homeless shelters and services in Los Angeles County, which I also downloaded from Los Angeles’ open data portal. This file is also on Canvas.
shelters.df <- read_csv("https://raw.githubusercontent.com/crd230/data/master/Homeless_Shelters_and_Services.csv")
# Take a look at the data
glimpse(shelters.df)
## Rows: 182
## Columns: 23
## $ source <chr> "211", "211", "211", "211", "211", "211", "211", "211", "…
## $ ext_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cat1 <chr> "Social Services", "Social Services", "Social Services", …
## $ cat2 <chr> "Homeless Shelters and Services", "Homeless Shelters and …
## $ cat3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ org_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "www.catalystfdn.org"…
## $ Name <chr> "Special Service For Groups - Project 180", "1736 Family…
## $ addrln1 <chr> "420 S. San Pedro", "2116 Arlington Ave", "1736 Monterey …
## $ addrln2 <chr> NA, "Suite 200", NA, NA, NA, NA, "4th Fl.", NA, NA, NA, N…
## $ city <chr> "Los Angeles", "Los Angeles", "Hermosa Beach", "Monrovia"…
## $ state <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA…
## $ hours <chr> "SITE HOURS: Monday through Friday, 8:30am to 4:30pm.", …
## $ email <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ url <chr> "ssgmain.org/", "www.1736fcc.org", "www.1736fcc.org", "ww…
## $ info1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ info2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ post_id <dbl> 772, 786, 788, 794, 795, 946, 947, 1073, 1133, 1283, 1353…
## $ description <chr> "The agency provides advocacy, child care, HIV/AIDS servi…
## $ zip <dbl> 90013, 90018, 90254, 91016, 91776, 90028, 90027, 90019, 9…
## $ link <chr> "http://egis3.lacounty.gov/lms/?p=772", "http://egis3.lac…
## $ use_type <chr> "publish", "publish", "publish", "publish", "publish", "p…
## $ date_updated <chr> "2017/10/30 14:43:13+00", "2017/10/06 16:28:29+00", "2017…
## $ dis_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
The file contains no latitude and longitude data, so we need to geocode the street addresses contained in the variables addrln1, city and state. Use the function geocode(). The process may take a few minutes, so be patient.
shelters.geo <- shelters.df %>%
geocode(street = addrln1, city = city, state = state, method = 'census')
Look at the column names.
names(shelters.geo)
## [1] "source" "ext_id" "cat1" "cat2" "cat3"
## [6] "org_name" "Name" "addrln1" "addrln2" "city"
## [11] "state" "hours" "email" "url" "info1"
## [16] "info2" "post_id" "description" "zip" "link"
## [21] "use_type" "date_updated" "dis_status" "lat" "long"
We see the latitudes and longitudes are attached to the variables lat and long, respectively. Notice that not all the addresses were successfully geocoded.
summary(shelters.geo$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 33.74 33.99 34.05 34.07 34.10 34.70 16
Several shelters received an NA. This is likely because the addresses are incorrect, contain errors, or are not fully specified. You’ll have to manually fix these issues, which becomes time consuming if you have a really large data set. You can also try another geocoder. For the purposes of this lab, let’s just discard these, but in practice, make sure to double check your address data (see the document Geocoding_Best_Practices.pdf in the Other Resources folder on Canvas for best practices for cleaning address data).
shelters.geo <- shelters.geo %>%
  filter(!is.na(lat), !is.na(long))
Convert the latitude and longitude data into spatial points using the function st_as_sf().
shelters.sf <- shelters.geo %>%
st_as_sf(coords = c("long", "lat"))
In this section, you learned about geocoding using tidygeocoder. Here’s a badge for your R lapel!
Plot homeless encampments and shelters using functions from the tmap package, which we learned about in Lab 3. This is an example of a basic pin or dot map.
# Turn mapping back to plot view
tmap_mode("plot")
# Dot map
tm_shape(homeless.sf) +
  tm_dots(col = "red") +
tm_shape(shelters.sf) +
  tm_dots(col = "blue")
## Warning: Currect projection of shape homeless.sf unknown. Long-lat (WGS84) is
## assumed.
## Warning: Currect projection of shape shelters.sf unknown. Long-lat (WGS84) is
## assumed.
We get a map that looks correct. But we did get two warnings. These warnings are nothing to sneeze at: they tell us that we haven’t set a projection, which is fine if we’re just mapping, but is no good if we want to do some spatial analyses on the point locations.
What we need to do is set the Coordinate Reference System (CRS). The CRS is an important concept to understand when dealing with spatial data. We won’t go through the real nuts and bolts of CRS, which you can read in GWR Chapters 2.4 and 7, but we’ll go through enough of it so that you can get through most of the CRS related spatial data wrangling tasks in this class. In addition to GWR, Esri also has a nice explanation here. This site also does a thorough job of explaining how to work with CRS in R. You can also read the document Coordinate_Reference_Systems.pdf on Canvas in the Other Resources folder.
The CRS contains two major components: the Geographic Coordinate System (GCS) and the Projected Coordinate System (PCS). A GCS uses a three-dimensional spherical surface to define locations on the earth. The GCS is composed of two parts: the ellipse and the datum. The ellipse is a model of the Earth’s shape - how the earth’s roundness is calculated. The datum defines the coordinate system of this model - the origin point and the axes. You need these two basic components to place points on Earth’s three-dimensional surface. Think of it as trying to create a globe (ellipse) and figuring out where to place points on that globe (datum).
The PCS then translates these points from a globe onto a two-dimensional space. We need to do this because we’re creating flat-paper or on-the-screen maps, not globes (it’s kind of hard carrying a globe around when you’re finding your way around a city).
You can find out the CRS of a spatial data set using the function st_crs().
st_crs(homeless.sf)
## Coordinate Reference System: NA
When we used st_as_sf() above to create homeless.sf, we did not specify a CRS. We should have. Working with spatial data requires both a Geographic Coordinate System (so you know where your points are on Earth) and a Projection (a way of putting points in 2 dimensions). Both. Always. Like Peanut Butter and Jelly. Like Sonny and Cher.
There are two common ways of specifying a coordinate system in R: via the EPSG numeric code or via the PROJ4 formatted string. The PROJ4 syntax consists of a list of parameters, each separated by a space and prefixed with the + character. To specify the PCS, you use the argument +proj=. To specify the GCS, you use the arguments +ellps= to establish the ellipse and +datum= to specify the datum.
How do we know which CRS to use? The most common datums in North America are NAD27, NAD83 and WGS84, which have the ellipsoids clrk66, GRS80, and WGS84, respectively. The datum always specifies the ellipsoid that is used, but the ellipsoid does not specify the datum. This means you can specify +datum= without specifying +ellps= and R will know what to do, but not always the other way around. For example, the ellipsoid GRS80 is also associated with the datum GRS87, so if you specify ellps=GRS80 without the datum, R won’t spit out an error, but will give you an unknown CRS. The most common datum and ellipsoid combinations are listed in Figure 1 in the Coordinate_Reference_Systems.pdf document on Canvas.
When you are bringing in point data with latitude and longitude, the projected coordinate system is already set for you. Latitudes and longitudes are X-Y coordinates, which is essentially a Plate Carree projection. You specify this PCS using the argument (in quotes) +proj=longlat. Let’s use st_as_sf() again on homeless.df, but this time specify the CRS using the argument crs.
homeless.sf <- homeless.df %>%
  st_as_sf(coords = c("Longitude", "Latitude"),
           crs = "+proj=longlat +datum=WGS84 +ellps=WGS84")
The CRS should have spaces only in between +proj=longlat, +datum=WGS84 and +ellps=WGS84, and in no other place. Remember, you could have just specified +datum=WGS84 and R would have known what to do with the ellipsoid. What is the CRS now?
st_crs(homeless.sf)
We can see the PROJ4 string using
st_crs(homeless.sf)$proj4string
## [1] "+proj=longlat +datum=WGS84 +no_defs"
Instead of a PROJ4 string, we can specify the CRS using the EPSG code associated with a GCS and PCS combination. An EPSG code is a four- or five-digit unique number representing a particular CRS definition. The EPSG code for the particular GCS and PCS combination used to create homeless.sf is 4326. Had we looked this up here (another useful site here), we could have used crs = 4326 instead of "+proj=longlat +datum=WGS84" in st_as_sf() as we do below.
homeless.sf2 <- homeless.df %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)
We can verify that the two CRSs are the same.
st_crs(homeless.sf) == st_crs(homeless.sf2)
## [1] TRUE
Let’s set the CRS for the homeless shelter and services points.
shelters.sf <- shelters.geo %>%
st_as_sf(coords = c("long", "lat"), crs = 4326)
Another important problem that you may encounter is that a shapefile or any spatial data set you downloaded from a source contains no CRS (unprojected or unknown). In this case, use the function st_set_crs() to set the CRS. See GWR 7.3 for more details.
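Here is a minimal sketch of st_set_crs(), assuming a hypothetical object mystery.sf whose coordinates you know are WGS84 longitude/latitude:
# Assign (not transform) a CRS to data that arrived without one
mystery.sf <- st_set_crs(mystery.sf, 4326)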
The above section deals with a situation where you are establishing the CRS for the first time. However, you may want to change an already defined CRS. This task is known as reprojection. Why would you want to do this? There are two main reasons:
Reason 1: All spatial data in your current R session should have the same CRS if you want to overlay the objects on a map or conduct any of the multiple layer spatial operations we went through in Lab 3.
Let’s check to see if homeless.sf and shelters.sf have the same CRS.
st_crs(homeless.sf) == st_crs(shelters.sf)
## [1] TRUE
Great. Do they match with la.city.tracts?
st_crs(homeless.sf) == st_crs(la.city.tracts)
## [1] FALSE
Oh no! If you map homeless.sf and la.city.tracts, you’ll find that they may still align: R is often smart enough to reproject on the fly to get them on the same map. However, this does not always happen, and R doesn’t actually change the underlying CRS. This leads to the next reason why we may need to reproject.
Reason 2: Many of R’s geometric functions that require calculating distances (e.g. distance from one point to another) or areas require a standard measure of distance/area. The spatial point data of homeless encampments are in longitude/latitude coordinates. Distance in longitude/latitude is in decimal degrees, which is not a standard measure. We can find out the units of a spatial data set by using the st_crs() function and calling up units as follows.
st_crs(homeless.sf)$units
## NULL
st_crs(la.city.tracts)$units
## NULL
Booo! Not only do we need to reproject homeless.sf, shelters.sf, and la.city.tracts into the same CRS, we need to reproject them to a CRS that handles standard distance measures such as meters or kilometers. The Universal Transverse Mercator (UTM) projected coordinate system works in meters. UTM divides the United States into separate zones; Southern California is in zone 11, as shown in the figure below.
Let’s reproject la.city.tracts, homeless.sf and shelters.sf to a UTM Zone 11 projected coordinate system. Use +proj=utm as the PCS, NAD83 as the datum and GRS80 as the ellipse (popular choices for the projection/datum/ellipse of the U.S.). Whenever you use UTM, you also need to specify the zone, which we do by using +zone=11. To reproject, use the function st_transform() as follows.
la.city.tracts.utm <- la.city.tracts %>%
  st_transform(crs = "+proj=utm +zone=11 +datum=NAD83 +ellps=GRS80")
homeless.sf.utm <- homeless.sf %>%
  st_transform(crs = "+proj=utm +zone=11 +datum=NAD83 +ellps=GRS80")
shelters.sf.utm <- shelters.sf %>%
  st_transform(crs = "+proj=utm +zone=11 +datum=NAD83 +ellps=GRS80")
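Equivalently, you could specify the EPSG code 26911, which corresponds to NAD83 / UTM Zone 11N; a sketch for the tracts:
la.city.tracts.utm <- la.city.tracts %>%
  st_transform(crs = 26911)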
Equal?
st_crs(la.city.tracts.utm) == st_crs(homeless.sf.utm)
## [1] TRUE
Units?
st_crs(la.city.tracts.utm)$units
## [1] "m"
st_crs(homeless.sf.utm)$units
## [1] "m"
st_crs(shelters.sf.utm)$units
## [1] "m"
“m” stands for meters.
Note that you cannot change the CRS if one has not already been established. For example, you cannot use the function st_transform() on homeless.sf if you did not establish the CRS when you used st_as_sf() on homeless.df.
Now, let’s map ’em all.
tm_shape(la.city.tracts.utm) +
  tm_polygons() +
tm_shape(homeless.sf.utm) +
  tm_dots(col = "red") +
tm_shape(shelters.sf.utm) +
  tm_dots(col = "blue")
Main takeaway points: (1) always establish a CRS when you create an sf object; (2) make sure all of your spatial layers share the same CRS before you map them together or run any multiple layer spatial operations; and (3) reproject to a projected CRS with standard distance units (like UTM, which is in meters) before calculating distances or areas. If you stick with these principles, you should be able to get through most issues regarding CRSs. If you get stuck, read GWR Ch. 2.4 and 7.
Another way to bring point data into R is to draw from the wealth of spatial data offered by OpenStreetMap (OSM). OSM is a free and open map of the world created largely by the voluntary contributions of millions of people around the world. Since the data are free and open, there are few restrictions to obtaining and using the data. The only condition of using OSM data is proper attribution to OSM contributors.
We can grab a lot of really cool data from OSM using their API. OSM serves two APIs, namely Main API for editing OSM, and Overpass API for providing OSM data. We will use Overpass API to gather data in this lab. What kinds of things can you bring into R through their API? A lot. Check them out on their Wiki.
Data can be queried for download using a combination of search criteria like location and type of objects. It helps to understand how OSM data are structured. OSM data are stored as a list of attributes tagged in key-value pairs of geospatial objects (points, lines or polygons).
Maybe we’re interested in the proximity of homeless encampments to restaurants. We can bring in restaurants listed by OSM using various functions from the package osmdata. Restaurants are tagged under amenities. Amenities are facilities used by visitors and residents. Here, key is “amenity” and value is “restaurant.” Other amenities include: “university”, “music school”, and “kindergarten” in education, “bus station”, “fuel”, “parking” and others in transportation, and much more.
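If you want to browse the available keys and values from within R, osmdata ships helper functions for this; a quick sketch:
# List OSM keys ("features") and the tag values stored under a key
head(available_features())      # "amenity" is one of many keys
head(available_tags("amenity")) # values under amenity include "restaurant"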
Use the following code to get restaurants in Los Angeles.
data_from_osm_sf <- opq(getbb("Los Angeles, California")) %>% # get the bounding box
  add_osm_feature(key = "amenity", value = "restaurant") %>%  # search for restaurants within the bounding box
  osmdata_sf()                                                # download OSM data as sf
What you get is a list with a lot of information about restaurants mapped in OSM. Let’s extract the geometry and name of the restaurant.
# Select name and geometry from point data for restaurants
resto_osm <- data_from_osm_sf$osm_points %>% # select point data from downloaded OSM data
  select(name, geometry)                     # keep the name and geometry to plot
We get an sf object containing restaurants in Los Angeles.
Finally, we can plot the restaurants using our comrade leaflet(), which we discovered in Lab 3.
# Create a plot in leaflet
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(data = resto_osm)
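As a small optional tweak (a sketch using leaflet’s popup argument), you can show each restaurant’s name on click:
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(data = resto_osm, popup = ~name)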
A limitation with the osmdata package is that it is rate limited, meaning that it cannot download large OSM datasets (e.g. all the OSM data for a large city). To overcome this limitation, the osmextract package was developed, which can be used to download and import binary .pbf files containing compressed versions of the OSM database for pre-defined regions.
Other than mapping their locations (e.g. dot map), what else can we do with point locations? One of the simplest analyses we can do with point data is to examine the distribution of points across an area (also known as point density). When working with neighborhoods, we can examine point distributions by summing up the number of points in each neighborhood. To do this, we can use the function aggregate().
hcamps_agg <- aggregate(homeless.sf.utm["SRNumber"], la.city.tracts.utm, FUN = "length")
The first argument is the sf points object you want to count, in our case homeless.sf.utm. You then include in brackets and quotes the variable containing the unique ID for each point, in our case “SRNumber”. The next argument is the object you want to sum the points for, in our case Los Angeles census tracts la.city.tracts.utm. Finally, FUN = "length" tells R to sum up the number of points in each tract.
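An equivalent way to get these counts (a sketch of an alternative to aggregate(), with counts as a hypothetical name) uses st_intersects() from sf:
# For each tract, st_intersects() returns the indices of the points that
# fall inside it; lengths() then counts the points per tract
counts <- lengths(st_intersects(la.city.tracts.utm, homeless.sf.utm))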
Take a look at the object we created.
glimpse(hcamps_agg)
## Rows: 1,001
## Columns: 2
## $ SRNumber <int> 110, 32, 35, 186, 7, 3, 6, 32, 2, 37, 1, 18, 31, 11, 28, 173,…
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((387834.3 37..., MULTIPOLYGON (((…
The variable SRNumber gives us the number of points in each of Los Angeles’ tracts. Notice that there are missing values.
summary(hcamps_agg)
## SRNumber geometry
## Min. : 1.00 MULTIPOLYGON :1001
## 1st Qu.: 10.00 epsg:NA : 0
## Median : 29.00 +proj=utm ...: 0
## Mean : 56.84
## 3rd Qu.: 69.00
## Max. :822.00
## NA's :28
This is because there are tracts that have no homeless encampments. Convert these NA values to 0 using the replace_na() function within the mutate() function.
hcamps_agg <- hcamps_agg %>%
mutate(SRNumber = replace_na(SRNumber,0))
The function replace_na() tells R to replace any NAs found in the variable SRNumber with the number 0.
Next, we save the variable SRNumber from hcamps_agg into our main data set la.city.tracts.utm by creating a new variable hcamps within mutate().
la.city.tracts.utm <- la.city.tracts.utm %>%
mutate(hcamps = hcamps_agg$SRNumber)
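As a quick sanity check (a sketch), the tract counts should sum to the number of encampments that fall inside a tract:
# Can be smaller than nrow(homeless.sf.utm) if some points fall
# outside the tract boundaries
sum(la.city.tracts.utm$hcamps)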
We can map the count of encampments by census tract, but counts do not take into consideration exposure (remember the discussion regarding counts vs rates in Handout 2). In this case, tracts that are larger in size will likely have more encampments. Let’s calculate the number of encampments per area.
To calculate the number of encampments per area, we’ll need to get the area of each polygon, which we do by using the function st_area(). The units of st_area() follow the CRS, which in our case is meters (so we get meters squared), but we can use the function set_units() from the units package to set the unit of measurement to (the U.S. friendly) miles squared with value = mi2. Use these functions within mutate() to create a new column area that contains each tract’s area.
la.city.tracts.utm<- la.city.tracts.utm %>%
mutate(area=set_units(st_area(la.city.tracts.utm), value = mi2))
The class of the variable area is units.
class(la.city.tracts.utm$area)
## [1] "units"
We don’t want it in class units, but as class numeric. Convert it to numeric using as.numeric().
la.city.tracts.utm <- la.city.tracts.utm %>%
mutate(area = as.numeric(area))
Then calculate the number of homeless encampments per area.
la.city.tracts.utm<- la.city.tracts.utm %>%
mutate(harea=hcamps/area)
Let’s create a choropleth map of encampments per area.
tm_shape(la.city.tracts.utm, unit = "mi") +
  tm_polygons(col = "harea", style = "quantile", palette = "Reds",
              border.alpha = 0, title = expression("Encampments per " * mi^2)) +
  tm_scale_bar(position = c("left", "bottom")) +
  tm_layout(main.title = "Homeless encampments in Los Angeles Tracts 2019",
            main.title.size = 0.95, frame = FALSE,
            legend.outside = TRUE, legend.outside.position = "right")
What is the correlation between neighborhood encampments per area and percent black? What about percent Hispanic? Use the function cor().
cor(la.city.tracts.utm$harea, la.city.tracts.utm$pnhblk, use = "complete.obs")
## [1] -0.03363424
cor(la.city.tracts.utm$harea, la.city.tracts.utm$phisp, use = "complete.obs")
## [1] 0.01480164
Practice Exercise: Instead of encampments per area, map encampments per population tpop. What is the correlation between neighborhood encampments per population and percent black? What about percent Hispanic?
In Handout 2, I briefly describe the use of kernel density maps to show the spatial patterns of points. These are also commonly known as heat maps. They are cool looking and as long as you understand broadly how these maps are created and what they are showing, they are a good exploratory tool. Also, a benefit of using a kernel density map to visually present your point data is that it does away with predefined areas like census tracts. Your point space becomes continuous.
To create a heat map, we turn to our friend ggplot(). Remember that ggplot() is built on a series of <GEOM_FUNCTION>(), which is a unique function indicating the type of graph you want to plot. In the case of a kernel density map, stat_density2d() is the <GEOM_FUNCTION>(). stat_density2d() uses the kde2d() function in the base MASS package on the back end to estimate the density using a bivariate normal kernel.
Let’s create a heat map with stat_density2d() where areas with darker red colors have a higher density of encampments.
ggplot() +
  stat_density2d(data = homeless.df,
                 aes(x = Longitude, y = Latitude,
                     fill = ..level.., alpha = ..level..),
                 alpha = .5, bins = 50, geom = "polygon") +
  geom_sf(data = la.city, fill = NA, color = "black") +
  scale_fill_gradient(low = "blue", high = "red") +
  ggtitle("Homeless Encampments Heat Map") +
  theme_void() +
  theme(legend.position = "none")
Rather than the sf object homeless.sf.utm, we use the regular tibble homeless.df, and indicate in aes() the longitude and latitude values of the homeless encampments. The argument bins = 50 specifies how finely grained we want to show the variation in encampments over space: the higher it is, the more granular (use a higher value than 50 to see what we mean). The argument alpha establishes the transparency of the colors (it ranges from 0 to 1). We add the Los Angeles city boundary using geom_sf(), which is the <GEOM_FUNCTION>() for mapping sf objects. We use scale_fill_gradient() to specify the color scheme, where areas of low encampment density are blue and areas of high encampment density are red.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil