For the final project, you will be expected to download, wrangle and analyze a data set of your own choosing. You can use a data set that you’ve put together for your thesis/dissertation. You can also use publicly available data such as the United States Census. Or you may want to combine publicly available data with data you’ve collected on your own.
This guide is a reference tool describing online sources that provide data at a local scale, typically in CSV or shapefile format. The data sources are organized by topic or theme.
Comprehensive neighborhood data sources
Decennial Census and American Community Survey
The Census represents the most comprehensive source for demographic and socioeconomic data at the census tract level. You can download tract level data from the following sources
You can download Census tract shapefiles (and other spatial data formats) at the following sites
If you want to evaluate tract characteristics over a long time period, you’ll need to account for changes in tract boundary definitions. Social Explorer allows you to get historical census data in 2010 tract boundaries. Other resources for getting data normalized to a certain year’s boundary definition include
NHGIS is part of the vast umbrella known as the Integrated Public Use Microdata Series (IPUMS). IPUMS provides census and survey data from around the world integrated across time and space. If you are interested in downloading individual level Census data (typically a 5% sample), check out IPUMS USA. Unsurprisingly, there is also an IPUMS CPS, which provides individual level data from the Current Population Survey. All the IPUMS brands can be found on their homepage. Similar to the Census API, R can tap into IPUMS data directly through the ipumsr package. Check out one of its vignettes here.
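Normalizing data to a later year's tract boundaries, as discussed above, usually comes down to applying a crosswalk: each old tract is matched to the new tracts it overlaps, and its counts are split according to a weight. The following is a minimal Python sketch of that idea, not the method of any particular source. The tract IDs, weights, and population counts are made up for illustration, the function name `normalize_counts` is hypothetical, and real crosswalk files have their own column layouts.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (2000 tract ID, 2010 tract ID, weight), where
# weight is the assumed share of the 2000 tract's population that falls
# inside the 2010 tract. Weights for each 2000 tract sum to 1.
crosswalk = [
    ("06001400100", "06001400100", 1.0),  # boundary unchanged
    ("06001400200", "06001400201", 0.6),  # tract split into two pieces
    ("06001400200", "06001400202", 0.4),
]

# Hypothetical 2000 population counts keyed by 2000 tract ID.
pop_2000 = {"06001400100": 3000, "06001400200": 5000}


def normalize_counts(counts, crosswalk):
    """Reallocate counts from source tracts to target tracts using weights."""
    out = defaultdict(float)
    for source_id, target_id, weight in crosswalk:
        out[target_id] += counts.get(source_id, 0) * weight
    return dict(out)


pop_2000_in_2010 = normalize_counts(pop_2000, crosswalk)
for tract_id, value in sorted(pop_2000_in_2010.items()):
    print(tract_id, round(value))
```

This kind of weighting only makes sense for counts (people, housing units). Rates and medians cannot be reallocated this way; they have to be recomputed from reallocated counts or taken from a source that has already normalized them.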
Other R packages for bringing in data
ICPSR: A lot of social science data are stored in the Inter-university Consortium for Political and Social Research at the University of Michigan. The R package icpsrdata allows you to grab ICPSR data sets directly through R.
OECD: If you need international data, the Organisation for Economic Co-operation and Development, an intergovernmental economic organization with 38 member countries, can help. R has a package for getting its data, aptly named OECD.
Health characteristics
The following datasets provide health related indicators at small scale geographies.
Department of Housing and Urban Development (HUD)
HUD offers a plethora of small-geography datasets on a variety of housing, built environment, and socioeconomic indicators for the country or select Metropolitan Areas. The main data splash page for HUD is located here. Many of the datasets provide indicators of HUD funding, such as tracts that qualify for the Low-Income Housing Tax Credit. HUD also provides Small Area Fair Market Rents at the ZIP code level, as well as georeferenced data on its eGIS open data portal, which includes point level information.
Work commuting patterns
Eviction rates
Gentrification
Opportunity Atlas
The Opportunity Atlas is an interactive, map-based tool that can trace the root of outcomes, such as poverty and incarceration, back to the neighborhoods in which children grew up. The atlas, in a nutshell, answers the question “Which neighborhoods in America offer children the best chances of climbing the income ladder?” You can view the tool and download all the census tract data here.
Social Capital Atlas
The Social Capital Atlas dataset shows new insights on how communities are connected using data from Facebook friendships. The dataset constructs and analyzes measures of social capital (Connectedness, Cohesiveness and Civic Engagement) across counties, ZIP codes, high schools, and colleges in the United States. Explore the dataset here.
Los Angeles Neighborhood Data for Social Change
A data warehouse created by the University of Southern California that collects a bunch of health, demographic, built environment, and socioeconomic variables at the neighborhood level for the County of Los Angeles. Check the site out here.
CA Neighborhoods and Renter Vulnerability
This project focuses on identifying the broad vulnerabilities to COVID-19 and their disparities across neighborhoods in California.
Big Data
Airbnb: Provides CSV files containing detailed information on Airbnb hosts. Locations are given as longitude/latitude coordinates. They don’t provide historical data.
Bikesharing: Web sites providing public use data on bikesharing. Provides station-to-station data.
OpenStreetMap. osmdata is an R package for downloading OpenStreetMap data. The site provides a couple of vignettes on using the package.
Array of Things. The City of Chicago installed modular sensor boxes around Chicago to collect real-time data on the city’s environment, infrastructure, and activity for research and public use. Other cities have followed.
Zillow. Provides housing price data at the metro, city and ZIP code levels. R has a package for downloading Zillow data directly.
Yelp. A public use dataset put together by Yelp specifically for personal and educational purposes, although it has also been used in academic and applied research. You can also use the Yelp API (here is a tutorial, and another), but there are some restrictions: you need to get an access ID and create your own app. Here is another tutorial for a specific R package that uses the Yelp API.
Uber. A public use dataset that provides anonymized information on Uber usage in select US cities. You’ll need to sign up for an account or use your Facebook or Google account.
Twitter. Twitter provides access to a sample of their tweets. You’ll need to register for API access. Here are some guides to collect and manage tweets in R: here, here, and here.
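Station-to-station trip files, like the bikesharing data listed above, are often summarized as origin-destination counts before any mapping or modeling. The following is a small Python sketch of that step using only the standard library. The column names (`start_station`, `end_station`, `duration_sec`) and the sample rows are hypothetical; each bikeshare system publishes its own schema, so check the file header before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical trip records standing in for a downloaded CSV file.
sample_csv = """start_station,end_station,duration_sec
A,B,300
A,B,420
B,A,360
A,C,900
"""


def count_od_pairs(csv_text):
    """Count trips for each (origin station, destination station) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["start_station"], row["end_station"]) for row in reader)


od_counts = count_od_pairs(sample_csv)
for (origin, destination), trips in od_counts.most_common():
    print(origin, destination, trips)
```

For a real file, replace `io.StringIO(csv_text)` with an opened file handle. Direction matters here: trips from A to B and from B to A are counted separately.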
Open data portals
Many city, county and even state governments maintain open data portals. These portals provide various data sets held and maintained by the public sector. Some of the data are measured at a fine spatial scale, going down to latitude/longitude.
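When portal records come with latitude/longitude, a common first step is to keep only the points that fall inside your study area. Here is a minimal Python sketch using a rectangular bounding box. The records, the coordinates, and the box itself are made up for illustration, and `in_bbox` is a hypothetical helper; for irregular boundaries such as tracts or city limits you would use a proper point-in-polygon test from a spatial library instead.

```python
# Hypothetical point records, e.g. service requests with coordinates.
records = [
    {"id": 1, "lat": 38.5449, "lon": -121.7405},
    {"id": 2, "lat": 38.5816, "lon": -121.4944},
    {"id": 3, "lat": 38.5382, "lon": -121.7617},
]

# Hypothetical study-area bounding box: (min_lon, min_lat, max_lon, max_lat).
bbox = (-121.80, 38.52, -121.70, 38.58)


def in_bbox(record, bbox):
    """Return True if the record's coordinates fall inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min_lon <= record["lon"] <= max_lon
            and min_lat <= record["lat"] <= max_lat)


inside = [r["id"] for r in records if in_bbox(r, bbox)]
print(inside)
```

Note the coordinate order: longitude is the x value and latitude is the y value, and mixing them up is one of the most common errors when working with point data.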
There are a couple of sites that maintain open data portal directories, including
Here are links to various open data portals in US cities (updated 01/05/23)
California
Major Cities
Looking for more data?
Google has a site for searching datasets, akin to Google Scholar, Images, Books and so on. Check it out here.
Kaggle is a crowd-sourced platform for all things data science. This includes competitions, discussion forums, online tutorials, and most importantly, at least for the purpose of this guide, a repository of big data sources. A lot of these data are not pertinent to this class, but some are; specifically, those with geographic information that allows you to connect data to geographic locations. Check out their datasets here.
Esri provides a repository that many of its members use to store various big and open data all in shapefile format. Check out what’s available here.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil