Background
Over the last couple of years I have had the opportunity to work on projects that provide financial, demographic and indicator data from a geographical perspective so that people can find their location on a map and discover information about their locality and how they compare to their neighbors and ideally peers.
Geographies of data
In order to associate data like debt burden, population density, and such, we needed some field that identified the locality so that it would show up on the map, ideally in the correct spot. For datasets provided by state agencies, we would typically have the name of the geographic entity and sometimes the geography type like county, city and township. For data sourced from the Census or Bureau of Labor Statistics, a unique identifier like a FIPs code would be present in the dataset.
Points + Polygons
Choropleths consist of a shape file to show the outlines of a geography plus tabular data to provide the values for each shape which makes it easy to see the differences for an indicator across the same geography type. To get the point coordinates for a geography lots of tools and languages can generate a point within
or a centroid
for a polygon. Not knowing better, I thought that generating centroids for my polygons would be a safe bet and I duly joined those generated latitude and longitude coordinates to all of my tabular data by geography identifier.
Unintended Consquences of creating data
Upon review of customers, some of the values for certain entities were not being displayed and we discovered that these geographies had a shape where the centroid didn’t actually fall inside it, like a census tract that resembled the letter C or was adjacent to a body of water. As our customers happened to reside in Michigan and California, these scenarios were a legitimate concern.
Census + Census == Success?
After digging deeper into geospatial information provided by the Census, I learned about the existence of Gazatteer files that provide geographic attributes for all Census entities like land area, water area and Latitude and Longitude points! I had already known about the Cartographic Boundary and Tiger/Line shapefiles provided by the Census and typically used the Cartographic Boundary files for my shapes because they worked best on the Socrata platform due to lower resolution and number of vertices.
Turns out that San Francisco is kind of a big deal
For a project that included all California cities, our client noticed that no data was showing up for San Francisco and upon inspection of the latitude and longitude provided by the Census data on a map, it very clearly was located in the Pacific ocean. Why Census? Why?
To the Notebook!
To regain credibility, I dug even deeper into spatial matters and was referred to a python library, geopandas by a colleague. Being a huge fan of pandas, I had no doubt that I would love geopandas. My goal was to -
- Validata that the coordinates for all California cities fell within their expected polygon
- Identify those coordinates that did not so they could be replaced by more accurate coordinates
library(htmltools)
# jupyter nbconvert --to html ~/repos/blog/notebooks/point_polygon_validation.ipynb
includeHTML("../../notebooks/point_polygon_validation.html")
Spatial Validation for Census Points & Polygons¶
Latitude and Longitude coordinates provided by the Census should fall inside polygons provided by the Census, right?
%matplotlib inline
# required packages, also need run homebrew for spatial join to work!
# brew install spatialindex
#!pip3 install geopandas descartes matplotlib shapely rtree
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely import geometry, wkt
from shapely.geometry import Point
Load Gazetteer Locations¶
# CSV converted from https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2017_Gazetteer/2017_gaz_place_06.txt
census_gaz = pd.read_csv("data/2017_gaz_place_06.csv")
print(census_gaz.head())
# combine coordinates
census_gaz['Coordinates'] = list(zip(census_gaz.INTPTLONG, census_gaz.INTPTLAT))
census_gaz['Coordinates'] = census_gaz['Coordinates'].apply(Point)
# convert to Pandas Geospatial Dataframe
census_gaz_points = gpd.GeoDataFrame(census_gaz, geometry='Coordinates')
# plot the points!
census_gaz_points.plot(alpha=.7)
Load Cartographic Boundary Shapefile¶
https://www.census.gov/geo/maps-data/data/cbf/cbf_place.html
# downloaded from http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_06_place_500k.zip
census_cb = gpd.read_file("data/cb_2017_06_place_500k/cb_2017_06_place_500k.shp")
print(census_cb.head())
# plot the polygons!
census_cb.plot()
Load TIGER/Line® Shapefile¶
# downloaded/extracted from ftp://ftp2.census.gov/geo/tiger/TIGER2018/PLACE/tl_2018_06_place.zip
census_tl = gpd.read_file("data/tl_2018_06_place/tl_2018_06_place.shp")
print(census_tl.head())
# plot the polygons!
census_tl.plot()
# coordinate reference system warning averted!
census_gaz_points.crs = census_cb.crs
cenus_gaz_points_cb = gpd.sjoin(census_gaz_points, census_cb, how="left", op='intersects')
# check for non-joined geographies
cenus_gaz_points_cb_unjoined = cenus_gaz_points_cb[cenus_gaz_points_cb.GEOID_right.isnull()]
cenus_gaz_points_cb_unjoined
Merge TIGER/Line® Shapefile with Gazetteer¶
# coordinate reference system warning averted!
census_gaz_points.crs = census_tl.crs
cenus_gaz_points_tl = gpd.sjoin(census_gaz_points, census_tl, how="left", op='intersects')
# check for non-joined geographies
cenus_gaz_points_tl_unjoined = cenus_gaz_points_tl[cenus_gaz_points_tl.GEOID_right.isnull()]
cenus_gaz_points_tl_unjoined
What are the FIPs codes for Points not found in Cartographic Boundary shapefile?¶
for idx,row in cenus_gaz_points_cb_unjoined.iterrows():
geoid = row['GEOID_left']
name = row['NAME_left']
print(geoid,str(name))
Filter both shapefiles for 4 places¶
census_cb_test = census_cb
census_cb_test = census_cb_test.query('GEOID in ("0667000","0617918","0626000","0673262")')
census_cb_test
census_tl_test = census_tl
census_tl_test = census_tl_test.query('GEOID in ("0667000","0617918","0626000","0673262")')
census_tl_test
Transform for Socrata Spatial Success¶
- Convert Polygons to Multipolygons
- Save as GEOJSON files
#Handy function provided by https://github.com/geopandas/geopandas/issues/834
upcast_dispatch = {geometry.Point: geometry.MultiPoint,
geometry.LineString: geometry.MultiLineString,
geometry.Polygon: geometry.MultiPolygon}
def maybe_cast_to_multigeometry(geom):
caster = upcast_dispatch.get(type(geom), lambda x: x[0])
return caster([geom])
census_tl.geometry = census_tl.geometry.apply(maybe_cast_to_multigeometry)
census_cb.geometry = census_cb.geometry.apply(maybe_cast_to_multigeometry)
# Convert to GEOJSON
census_tl.to_file("~/data/tl_2018_06_place.json", driver="GeoJSON")
census_cb.to_file("~/data/cb_2018_06_place.json", driver="GeoJSON")
Visually inspect unjoined cities¶
- Plot point on polygon for the GEOID
- Get centroid of polygon
- Plot centroid
- Confirm spatial join between polygon and centroid
The map
- The map layers can be toggled on and off
- If the Tiger/Line layer is turned off the orange point on the left for San Francisco does not fall within the purple Cartographic layer.
- When the Tiger/Line layer is turned on, the green shape for San Francisco encompasses the orange point in the ocean
Lessons Learned
- The Census coordinates were not wrong rather my reliance on simplified shapefiles led me into deep waters
- Many fine folks have created wonderful libraries and tools just waiting to verify that everything is where we think it is and if not show us where it actually is on the map!
Resources
- Gazetteer Files - https://www.census.gov/geo/maps-data/data/gazetteer2017.html
- Cartographic Boundary shapefiles - https://www.census.gov/geo/maps-data/data/cbf/cbf_place.html
- TIGER/Line® shapefiles - https://www.census.gov/cgi-bin/geo/shapefiles/index.php
- Notebook - https://github.com/aliciatb/blog/blob/master/notebooks/point_polygon_validation.ipynb
- Data - https://github.com/aliciatb/blog/tree/master/notebooks/data