Anomaly Detection @ Outcome Health

Created a Slack-based alerting system, which pushed alerts to over half of our Engineering-Product-Design org (~20 of 40 individuals), significantly reduced operational risk, and decreased time spent on repetitive tasks.

Organizations: Outcome Health

Collaborators: Shashin Chokshi, Kyle Gassert, Mike Thoun

Dates: 2018-2019

Focus: Healthcare, Data Science, Data Engineering

Synopsis

Outcome Health is a healthcare innovation company that showcases relevant content to patients, caregivers and healthcare professionals at the point of care.

With a fleet of over 100,000 devices in providers’ offices across the country, Outcome had a great need to optimize its device health monitoring system. I built an Anomaly Detection system to aid this effort.

The Opportunity

While working there, I saw obvious inefficiencies in how we monitored our KPIs for new software releases. Manual monitoring processes tend to be ripe for improvement, and I saw a place where I could directly contribute to efficiency.

At Outcome, we would push software updates in phases, with each phase going to a larger percentage of our 100,000-device network. During each phase, an analyst would manually pull the KPIs for that software release. This was an obvious waste of time, but no one had found the time to automate the process.

The Realization

I knew I could contribute by automating this manual process. But, beyond simply automating a data pull to save time, I realized 2 things:

  1. To further save time, I could push the data to where the users were (Slack), rather than having them pull it. This involved connecting to the Slack API and pushing a chart created in Python (see the sketch after this list).

  2. I could automate some of the inference the human analyst was doing.
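
To make the first point concrete, here is a minimal sketch of the push pattern, assuming a bot token and channel ID; the function name, data source, and comment text are illustrative, and the modern slack_sdk package stands in for whatever Slack client we used at the time.

```python
# Sketch: render a KPI chart with matplotlib and push it into a Slack channel.
# SLACK_BOT_TOKEN, the channel ID, and the data passed in are placeholders.
import os

import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a scheduled worker
import matplotlib.pyplot as plt
from slack_sdk import WebClient


def post_kpi_chart(dates, play_counts, channel_id):
    # Build a simple line chart of the KPI over time.
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(dates, play_counts, marker="o")
    ax.set_title("Daily ad plays")
    fig.savefig("daily_plays.png", bbox_inches="tight")
    plt.close(fig)

    # Upload the chart image to the channel so analysts see it without pulling anything.
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    client.files_upload_v2(
        channel=channel_id,
        file="daily_plays.png",
        title="Daily ad plays",
        initial_comment="Latest KPI pull for the current release phase.",
    )
```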

An example of a detected anomaly sent to Slack.

For example, we run ad campaigns on our network, and one of our main KPIs is the aggregate number of plays our ad campaigns get in a day. This number shouldn't vary much, except on weekends, when it drops to close to zero.

Setting up a check on this value was straightforward, and it saved the analyst the trouble.
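
A check like that can be only a few lines. This is a sketch under assumptions, not the production logic: the function name, the 25% tolerance, and the "weekends are expected to be near zero" rule are illustrative.

```python
# Sketch of a simple baseline check on daily ad plays.
# Tolerance and the weekend rule are illustrative values, not production ones.
from statistics import median


def check_daily_plays(today, plays_today, recent_weekday_plays, tolerance=0.25):
    """Return an alert string if today's play count deviates from its weekday baseline."""
    if today.weekday() >= 5:
        # Saturday/Sunday: plays are expected to drop to close to zero, so skip the check.
        return None

    baseline = median(recent_weekday_plays)
    if baseline and abs(plays_today - baseline) / baseline > tolerance:
        return (
            f"Anomaly: {plays_today:,} plays today vs. a recent weekday "
            f"median of {baseline:,.0f} (more than {tolerance:.0%} deviation)."
        )
    return None
```

When the function returns a message, that string is what gets pushed to the Slack channel alongside the chart.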

The Work

One issue I ran into as my prototype evolved into a consistently used tool was that the monitoring I was trying to automate grew more complex. Without time to chase down every possible issue, I began to send general statements with a small data dump attached.

I quickly realized the data dump was not something my customers (analysts) would ever want to work with. And the customer is who matters.

The opposite of what people want.

The Slack channels I set up for these new checks began to see less use as the alerts became too general. My only choice was to write more specific messages and action steps.

I discovered that a very effective way to do this is to think through the exact resolution steps for an issue. For complex issues, this can be a difficult exercise in forethought, but done properly, it makes the needed messages and action steps crystal clear. And with that clarity came renewed use of my channels, and actions taken as a result!
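
In practice this meant turning a detection into a specific message with its resolution steps attached, rather than a raw dump. The sketch below is illustrative only; the check name, numbers, and runbook steps are hypothetical examples, not our actual runbooks.

```python
# Illustrative only: turn a detected issue into a specific, actionable Slack
# message. The observed/expected figures and runbook steps are made-up examples.
def format_alert(check_name, observed, expected, runbook_steps):
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(runbook_steps, start=1))
    return (
        f":rotating_light: *{check_name}*\n"
        f"Observed {observed:,} vs. expected ~{expected:,}.\n"
        f"*Next steps:*\n{steps}"
    )


message = format_alert(
    "Daily ad plays below baseline",
    observed=41_200,
    expected=58_000,
    runbook_steps=[
        "Check whether a release phase rolled out in the last 24 hours.",
        "Compare play counts by region to rule out a localized outage.",
        "If the drop persists, escalate to the device-health on-call.",
    ],
)
```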

The Outcome

At the time I left the company, "Anomaly Detection" pushed alerts to over half of our Engineering-Product-Design org (~20 of 40 individuals), plus several other stakeholders outside that org. It had become a major part of our operational de-risking efforts, and new channels were continuously being added.

If you'd like more technical details on this project, feel free to check out this Medium article I wrote, which details more specifics on infrastructure, code examples, and more.

The article has memes and everything.

Or, if you'd like a lighter read on the general data science lessons I took from this project, check out the separate Medium article above!