PyData 2017, held on the Microsoft campus in the summer of 2017, featured many awesome speakers sharing their visualization libraries. Here are some examples that take those libraries out for a spin using my own data.
Visualization Notebook!
library(htmltools)
# jupyter nbconvert --to html ~/repos/blog/notebooks/visualizations_demo.ipynb
includeHTML("../../notebooks/visualizations_demo.html")
Visualizations
There were at least 3 tributes to Hans Rosling:
- From the Bloomberg team's bqplot
- During the keynote by UW's Jeffrey Heer, Interactive Data Analysis: Visualization and Beyond
- And also by Bokeh & friends
Pandas & Seaborn
Stephen Elston presented Exploring Data with Python, using the pandas and seaborn visualization libraries. What's nice about these Python libraries is that only a few lines of code are required to create compelling visualizations.
First, load the data again, but only a subset of columns this time
We can use either pandas or Dask, so we'll start with Dask and then convert to pandas for fun!
# !pip3 install bokeh matplotlib seaborn
trans_file = 'data/transactions.csv'
data_types = {'type': str,
'type_description': str,
'federal_action_obligation': float,
'description': str,
'award': str,
'fiscal_year': int,
'recipient_legal_entity_id': str,
'recipient_recipient_name': str,
'recipient_business_types_description': str,
'recipient_city_local_government': str,
'recipient_county_local_government': str,
'recipient_inter_municipal_local_government': str,
'recipient_local_government_owned': str,
'recipient_municipality_local_government': str,
'recipient_school_district_local_government': str,
'recipient_township_local_government': str,
'recipient_us_state_government': str,
'recipient_us_federal_government': str,
'recipient_federal_agency': str,
'recipient_federally_funded_research_and_development_corp': str,
'recipient_us_tribal_government': str,
'recipient_foreign_government': str,
'recipient_private_university_or_college': str,
'recipient_educational_institution': str,
'recipient_contracts': str,
'recipient_grants': str,
'recipient_receives_contracts_and_grants': str,
'recipient_location_location_id': str,
'recipient_location_country_name': str,
'recipient_location_state_code': str,
'recipient_location_state_name': str,
'recipient_location_state_description': str,
'recipient_location_city_name': str,
'recipient_location_city_code': str,
'recipient_location_county_name': str,
'recipient_location_county_code': str,
'recipient_location_congressional_code': str,
'recipient_location_zip5': str,
'recipient_location_location_country_code': str,
'place_of_performance_location_id': str,
'place_of_performance_country_name': str,
'place_of_performance_state_code': str,
'place_of_performance_state_name': str,
'place_of_performance_state_description': str,
'place_of_performance_city_name': str,
'place_of_performance_county_name': str,
'place_of_performance_county_code': str,
'place_of_performance_congressional_code': str,
'place_of_performance_zip5': str,
'place_of_performance_location_country_code': str,
'assistance_data_fain': str,
'assistance_data_uri': str,
'assistance_data_cfda_number': str,
'assistance_data_cfda_title': str,
'assistance_data_non_federal_funding_amount': str,
'assistance_data_total_funding_amount': str,
'assistance_data_face_value_loan_guarantee': str,
'assistance_data_original_loan_subsidy_cost': str,
'assistance_data_reporting_period_start': str,
'assistance_data_reporting_period_end': str,
'assistance_data_period_of_performance_start_date': str,
'assistance_data_period_of_performance_current_end_date': str,
'assistance_data_cfda_program_number': str,
'assistance_data_cfda_program_title': str,
'assistance_data_cfda_federal_agency': str,
'assistance_data_cfda_url': str,
'contract_data_piid': str,
'contract_data_naics_description': str
}
trans_cols = ['type','type_description','federal_action_obligation','description','award','fiscal_year',
'recipient_legal_entity_id','recipient_recipient_name','recipient_business_types_description',
'recipient_city_local_government','recipient_county_local_government','recipient_inter_municipal_local_government',
'recipient_local_government_owned','recipient_municipality_local_government','recipient_school_district_local_government','recipient_township_local_government',
'recipient_us_state_government','recipient_us_federal_government','recipient_federal_agency','recipient_federally_funded_research_and_development_corp',
'recipient_us_tribal_government','recipient_foreign_government','recipient_private_university_or_college',
'recipient_educational_institution',
'recipient_contracts','recipient_grants','recipient_receives_contracts_and_grants',
'recipient_location_location_id','recipient_location_country_name','recipient_location_state_code',
'recipient_location_state_name','recipient_location_state_description',
'recipient_location_city_name','recipient_location_city_code','recipient_location_county_name','recipient_location_county_code',
'recipient_location_congressional_code','recipient_location_zip5','recipient_location_location_country_code',
'place_of_performance_location_id','place_of_performance_country_name','place_of_performance_state_code',
'place_of_performance_state_name','place_of_performance_state_description','place_of_performance_city_name',
'place_of_performance_county_name','place_of_performance_county_code','place_of_performance_congressional_code',
'place_of_performance_zip5','place_of_performance_location_country_code',
'assistance_data_fain','assistance_data_uri','assistance_data_cfda_number','assistance_data_cfda_title',
'assistance_data_non_federal_funding_amount','assistance_data_total_funding_amount','assistance_data_face_value_loan_guarantee',
'assistance_data_original_loan_subsidy_cost','assistance_data_reporting_period_start','assistance_data_reporting_period_end',
'assistance_data_period_of_performance_start_date','assistance_data_period_of_performance_current_end_date',
'assistance_data_cfda_program_number','assistance_data_cfda_program_title','assistance_data_cfda_federal_agency',
'assistance_data_cfda_url','contract_data_piid','contract_data_naics_description']
import dask.dataframe as dd
import pandas as pd
import numpy as np
df = dd.read_csv(trans_file, dtype=data_types, usecols=trans_cols)
# convert Dask dataframe to Pandas dataframe
pdf = df.compute()
pdf.head()
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker
# find the counts for each unique type of Contract Award
counts = pdf['type_description'].value_counts()
# define plot area
fig = plt.figure(figsize=(16,12))
# define axis
ax = fig.gca()
counts.plot.bar(ax = ax)
ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda y, p: format(int(y), ',')))
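The function handed to `FuncFormatter` receives the tick value and its position and returns the label text. As a rough sketch, the same formatting can be tried on its own without a plot; the name `thousands` and the sample tick values are illustrative and not part of the notebook.

```python
def thousands(y, pos=None):
    """Format a tick value with comma thousands separators."""
    return format(int(y), ',')

# Made-up tick values, for illustration only
for tick in (0, 2500.0, 1234567.0):
    print(thousands(tick))
```

A named function like this could be passed to `FuncFormatter` in place of the lambda, which may be easier to reuse across several plots.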
grants = pdf[pdf['type'].isin(['02','03','04','05'])].sort_values(by='place_of_performance_state_code')
# check that sort worked looking at first 10 records
grants.groupby(grants.place_of_performance_state_code).federal_action_obligation.sum().head(10)
import seaborn as sns
# Set grid scale and font-size
sns.set(style="darkgrid",font_scale=2)
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(25, 40))
# Count of Awards by State broken down by Type of Grant
ax = sns.countplot(y="place_of_performance_state_code",
hue="type_description",
data=grants)
# Set Labels
ax.set(xlabel='Award Count', ylabel='Place of Performance')
ax.legend()
plt.show()
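A count plot split by `hue` draws one bar per (state, grant type) pair, with the bar length equal to the number of rows in that pair. A minimal sketch of that tally in plain Python, assuming made-up rows rather than the real grants data:

```python
from collections import Counter

# Made-up (state, grant type) rows, for illustration only
rows = [
    ("WA", "Project Grant"),
    ("WA", "Project Grant"),
    ("WA", "Block Grant"),
    ("OR", "Formula Grant"),
]

# One count per (state, type) pair: the bar lengths in a hue-split count plot
award_counts = Counter(rows)

for (state, grant_type), n in sorted(award_counts.items()):
    print(state, grant_type, n)
```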
And now Bokeh!
Bokeh is an interactive visualization library that makes it easy to create rich visualizations without writing a single line of D3! Its roadmap includes integration with Altair, another visualization library created by Jake VanderPlas, as well as migrating its JavaScript code to TypeScript.
Award Amounts by Performance State example
Inspired by the Texas Unemployment Rate gallery example, http://bokeh.pydata.org/en/0.11.1/docs/gallery/texas.html
from bokeh.io import push_notebook, show, output_notebook
from bokeh.models import (
ColumnDataSource,
HoverTool,
LogColorMapper
)
from bokeh.plotting import figure
from bokeh.sampledata.us_states import data as states
output_notebook()
Get the total award amounts by state from the previously loaded DataFrame
state_amounts = pdf.groupby(pdf.place_of_performance_state_code).federal_action_obligation.sum()
Bokeh provides sample data, which makes it easy to join state boundaries to the state awards
This is handy because the USA Spending data doesn't include latitude and longitude.
state_xs = [state["lons"] for state in states.values()]
state_ys = [state["lats"] for state in states.values()]
state_names = [state['name'] for state in states.values()]
# use 0 for any state with no awards so every list stays the same length
state_obligations = [state_amounts.get(state_id, 0) for state_id in states]
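The join here works because both datasets are keyed on the two-letter state code. A small sketch with plain dictionaries shows the idea, including one way to handle a state that has a boundary but no award total. The names `state_totals` and `state_shapes` and all values are hypothetical stand-ins, not the real data.

```python
# Hypothetical totals keyed by state code; "OR" is deliberately missing
state_totals = {"WA": 1_250_000.0, "TX": 9_800_000.0}

# Stand-in for the sample-data dictionary, which is also keyed by state code
state_shapes = {
    "WA": {"name": "Washington"},
    "OR": {"name": "Oregon"},
    "TX": {"name": "Texas"},
}

# Walk the shapes in one fixed order so every list lines up by position
names = [shape["name"] for shape in state_shapes.values()]
totals = [state_totals.get(code, 0.0) for code in state_shapes]

print(list(zip(names, totals)))
```

Iterating over the boundary dictionary, rather than over the totals, keeps every list the same length as the list of shapes, which is what a column-oriented data source needs.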
# colorblind-safe palette courtesy of http://colorbrewer2.org
colors = ["#edf8fb", "#ccece6", "#99d8c9", "#66c2a4", "#2ca25f", "#006d2c"]
state_colors = []
# creating index for choropleth using log of Total Obligation
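for state_id in states:
    try:
        obligation = state_amounts[state_id]
        if obligation > 0:
            # clamp so very large or very small totals stay inside the palette
            idx = min(max(int(np.log(obligation) / 4), 0), len(colors) - 1)
        else:
            # zero or negative net obligation, where the log is undefined
            idx = 1
        state_colors.append(colors[idx])
    except KeyError:
        # state has no awards in the data
        state_colors.append("black")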
source = ColumnDataSource(data=dict(
x=state_xs,
y=state_ys,
name=state_names,
obligation=state_obligations,
color=state_colors
))
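The color index above is a log-scale bucket: the natural log of the total divided by 4, so each step in the palette covers roughly a factor of e^4 (about 55) in dollars. A minimal standalone sketch of that mapping, using only the standard library, keeps the index within the six-color palette. The function name `color_index` and the sample amounts are illustrative.

```python
import math

colors = ["#edf8fb", "#ccece6", "#99d8c9", "#66c2a4", "#2ca25f", "#006d2c"]

def color_index(obligation, n_colors=len(colors), step=4):
    """Map a dollar amount to a palette index using its natural log."""
    if obligation <= 0:
        # log is undefined here; reuse the same fallback bucket as the loop
        return 1
    idx = int(math.log(obligation) / step)
    # clamp so tiny or huge amounts never index outside the palette
    return max(0, min(idx, n_colors - 1))

# Made-up amounts, for illustration only
for amount in (1, 1e6, 1e9, 1e12):
    print(amount, colors[color_index(amount)])
```

Without the clamp, a total above roughly e^24 (about 26 billion) would produce index 6, one past the end of the palette.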
Now you can customize the visualization components for a better user experience
TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,save"
p = figure(
title="Federal Contract Award Obligations, 2017", tools=TOOLS, toolbar_location="left",
x_axis_location=None, y_axis_location=None, plot_width=2000, plot_height=500
)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
fill_color='color', fill_alpha=0.7,
line_color="white", line_width=0.5)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
("Name", "@name"),
("Total Contract Award Obligation", "$@obligation{1,}")
]
show(p, notebook_handle=True)
Resources
- Notebook - https://github.com/aliciatb/blog/blob/master/notebooks/visualizations_demo.ipynb
- Data - available on request (1 GB)