3 min read

Visualizing and Benchmarking Open Data Catalogs

A couple of weeks ago I had the opportunity to co-present a session called Visualizing your data, the Power of BI with Seattle’s Open Data manager, Paul Alley, at the 2019 Tyler Connect conference in Dallas. For our session we wanted to use Power BI to spin up reports on the health of Seattle’s Open Data Catalog, because Power BI makes it easy to connect to data sources, add calculations, visualize the results, and drill down into various facets of the data with an interactive report. As an opportunistic learner, I also saw this as a chance to practice the Learning Launch method for my Design Thinking for Innovation final assignment.

The Score

Paul already had a health calculation in mind based on metadata attributes. In the first iteration of the report, the presence of these attributes would be weighted equally: each attribute was assigned a value of 1 if it contained any value and 0 if it was empty.

(Dataset Description + Category + Department Provider + Keywords + Publishing Frequency) / 5
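
The formula above amounts to counting non-empty fields and dividing by five. A minimal Python sketch of that calculation, where the field names are my own stand-ins for the actual metadata column names:

```python
# Equal-weight metadata score: each field contributes 1 if it has any
# value, 0 if empty. Field names here are illustrative assumptions.
SCORE_FIELDS = [
    "dataset_description",
    "category",
    "department_provider",
    "keywords",
    "publishing_frequency",
]

def metadata_score(record: dict) -> float:
    """Return the share of score fields that contain any value (0.0 to 1.0)."""
    present = sum(1 for field in SCORE_FIELDS if record.get(field))
    return present / len(SCORE_FIELDS)

# A record with only category and keywords filled in scores 2/5:
score = metadata_score({"category": "Public Safety", "keywords": "fire, 911"})
```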

The Data

The first data source we chose is an internal dataset called the Asset Inventory, which contains total views and downloads, creation and update timestamps, and all basic and custom metadata fields. This data is retrievable as a downloaded file or via an API call; in this case we connected our Power BI report to the OData endpoint, https://data.seattle.gov/api/odata/v4/nymu-ywvk.
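
Outside of Power BI, the same endpoint can be pulled with a few lines of Python. A sketch, assuming the standard OData v4 response shape, where rows arrive under a `value` array; the unwrapping is shown against a hand-built sample payload rather than a live call:

```python
import json
import urllib.request

ODATA_URL = "https://data.seattle.gov/api/odata/v4/nymu-ywvk"

def fetch_assets(url: str = ODATA_URL) -> list:
    """Fetch the Asset Inventory; OData v4 returns rows under the 'value' key."""
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return payload.get("value", [])

# The same unwrapping, demonstrated on a hand-built sample payload:
sample = {"@odata.context": "...", "value": [{"name": "Fire 911 Calls"}]}
rows = sample.get("value", [])
```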

The Report

Here’s the report with the data as of 4/20/2019. It was interesting to see that the majority of visits went to a handful of datasets, including real-time fire 911 calls, road weather information stations, sold fleet equipment, and wage data. On the second tab we visualized views by metadata score to see if there was any relationship between better metadata and utilization.


Another common task in performance analysis is to benchmark oneself against one’s peers. We were able to spin up a prototype of the report as an R flexdashboard using a different data source that gave us access to all of the public datasets for every data portal on the Socrata platform. The one thing that changed for this iteration was to use only standard metadata fields available on all portals: instead of Seattle’s custom fields for department provider and publishing frequency, we used source and the presence of any value in a custom field to calculate the score. We also added controls to adjust the weight of each metadata attribute, and, as always for my Shiny apps, the Wes Anderson color palette package supplied the chart colors, specifically Zissou1, which I thought paired well with the cerulean theme.
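
Once weights are adjustable, the score stops being a simple average and becomes a weighted share of the total weight. A sketch of that generalization in Python (the dashboard itself was R/Shiny; field names and weights below are illustrative assumptions):

```python
def weighted_score(record: dict, weights: dict) -> float:
    """Weighted presence score: each field contributes its weight if non-empty,
    normalized by the sum of all weights so the result stays in 0.0 to 1.0."""
    total = sum(weights.values())
    earned = sum(w for field, w in weights.items() if record.get(field))
    return earned / total if total else 0.0

# Hypothetical weights emphasizing the description over other fields:
weights = {"dataset_description": 2, "source": 1, "custom_any": 1}
record = {"dataset_description": "Fire dispatch calls", "source": "SFD"}
score = weighted_score(record, weights)  # 3 of 4 weight points earned
```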

Lessons Learned

  • A small number of datasets receive the majority of visits
  • Datasets with similar names may contain different years of data
  • Some datasets have duplicate names (added a group by u_id)
  • The top datasets cover a variety of topics (911, road stations, and city wages)
  • Top datasets aren’t always updated frequently
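
The first lesson, that visits concentrate in a handful of datasets, is easy to quantify as the share of total visits captured by the top k datasets. A quick sketch with made-up visit counts:

```python
def top_share(visits: list, k: int = 5) -> float:
    """Fraction of total visits captured by the k most-visited datasets."""
    ranked = sorted(visits, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical visit counts: the top 2 of 7 datasets hold 95% of traffic.
share = top_share([1000, 900, 50, 30, 10, 5, 5], k=2)
```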

Next Steps

  • Develop report with narrative for primary domain that includes action items and recommendations
  • Extract custom metadata elements for deeper inspection
  • Add trend line for popularity from last 30 days to last 7 days per dataset to spot rising popularity
  • Add data quality analysis (datatypes for analysis, data contents, data completeness)
  • Identify similar datasets by name and fields, where different years of data could be combined into a single dataset
  • Flag duplicate datasets
  • Add data freshness score relative to expected publishing frequency (if that metadata is captured in a custom metadata field)
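
One way the freshness score in the last item could work is to compare a dataset’s age against the window implied by its publishing frequency. A sketch under stated assumptions: the frequency labels, window lengths, and linear decay below are all my own choices, not fields Socrata actually provides:

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a publishing-frequency label to the maximum
# acceptable age before a dataset counts as stale.
MAX_AGE = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
    "annually": timedelta(days=366),
}

def freshness_score(last_updated: datetime, frequency: str, now: datetime) -> float:
    """1.0 if updated within the expected window, decaying linearly to 0.0
    once the dataset is a full extra window overdue."""
    window = MAX_AGE.get(frequency)
    if window is None:
        return 0.0  # frequency metadata missing: can't judge freshness
    age = now - last_updated
    if age <= window:
        return 1.0
    return max(0.0, 1.0 - (age - window) / window)

now = datetime(2019, 4, 20)
fresh = freshness_score(datetime(2019, 4, 19), "weekly", now)   # within window
stale = freshness_score(datetime(2019, 4, 1), "weekly", now)    # 12 days overdue
```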