Anomaly Detection @ Outcome Health

Created a Slack-based alerting system, which pushed alerts to over half of our Engineering-Product-Design org (~20 of 40 individuals), significantly reduced operational risk, and decreased time spent on repetitive tasks.

Organizations: Outcome Health

Collaborators: Shashin Chokshi, Kyle Gassert, Mike Thoun

Dates: 2018-2019

Focus: Healthcare, Data Science, Data Engineering

Synopsis

Outcome Health is a healthcare innovation company that showcases relevant content to patients, caregivers and healthcare professionals at the point of care.

With a fleet of over 100,000 devices in providers’ offices across the country, Outcome had a great need to optimize its device health monitoring system. I built an Anomaly Detection system to aid this effort.

The Opportunity

While working there, I saw obvious inefficiencies in how we monitored our KPIs for new software releases. Manual monitoring processes tend to be ripe for improvement, and I saw a place where I could directly contribute to efficiency.

At Outcome, we would push software updates in phases, with each phase going to a larger percentage of our 100,000-device network. During each phase, an analyst would manually pull the KPIs for that software release. This was an obvious waste of time, but no one had found the time to automate the process.

The Realization

I knew I could contribute by automating this manual process. But, beyond simply automating a data pull to save time, I realized 2 things:

  1. To further save time, I could push the data to where the users were (Slack), rather than having them pull it. This involved connecting to the Slack API and pushing a chart created in Python (see the sketch after this list).

  2. I could automate some of the inference the human analyst was doing.
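
To make the first point concrete, here is a minimal sketch of the push pattern, assuming a bot token and channel ID; the function name, data source, and comment text are illustrative, and the modern slack_sdk package stands in for whatever Slack client we used at the time.

```python
# Sketch: render a KPI chart with matplotlib and push it into a Slack channel.
# SLACK_BOT_TOKEN, the channel ID, and the data passed in are placeholders.
import os

import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a scheduled worker
import matplotlib.pyplot as plt
from slack_sdk import WebClient


def post_kpi_chart(dates, play_counts, channel_id):
    # Build a simple line chart of the KPI over time.
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(dates, play_counts, marker="o")
    ax.set_title("Daily ad plays")
    fig.savefig("daily_plays.png", bbox_inches="tight")
    plt.close(fig)

    # Upload the chart image to the channel so analysts see it without pulling anything.
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    client.files_upload_v2(
        channel=channel_id,
        file="daily_plays.png",
        title="Daily ad plays",
        initial_comment="Latest KPI pull for the current release phase.",
    )
```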

An example of a detected anomaly sent to Slack.

For example, we run ad campaigns on our network, and one of our main KPIs is the aggregate number of plays our ad campaigns get in a day. This number shouldn't vary much, except on weekends, when it drops to close to zero.

Setting up a check on this value was straightforward, and it saved the analyst the trouble.
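
A check like that can be only a few lines. This is a sketch under assumptions, not the production logic: the function name, the 25% tolerance, and the "weekends are expected to be near zero" rule are illustrative.

```python
# Sketch of a simple baseline check on daily ad plays.
# Tolerance and the weekend rule are illustrative values, not production ones.
from statistics import median


def check_daily_plays(today, plays_today, recent_weekday_plays, tolerance=0.25):
    """Return an alert string if today's play count deviates from its weekday baseline."""
    if today.weekday() >= 5:
        # Saturday/Sunday: plays are expected to drop to close to zero, so skip the check.
        return None

    baseline = median(recent_weekday_plays)
    if baseline and abs(plays_today - baseline) / baseline > tolerance:
        return (
            f"Anomaly: {plays_today:,} plays today vs. a recent weekday "
            f"median of {baseline:,.0f} (more than {tolerance:.0%} deviation)."
        )
    return None
```

When the function returns a message, that string is what gets pushed to the Slack channel alongside the chart.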

The Work

One issue I ran into as my prototype evolved into a consistently used tool was that the monitoring I was trying to automate grew more complex. Without time to chase down every possible issue, I began to send general statements with a small data dump attached.

I quickly realized the data dump was not something my customers (analysts) would ever want to work with. And the customer is who matters.

The opposite of what people want.

The Slack channels I set up for these new checks began to see less use as the alerts became too general. My only choice was to write more specific messages and action steps.

I discovered that a very effective way to do this is to think through the exact resolution steps for an issue. For complex issues, this can be a difficult exercise in forethought, but done properly, it makes the needed messages and action steps crystal clear. And with that clarity came renewed use of my channels, and actions taken as a result!
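
In practice this meant turning a detection into a specific message with its resolution steps attached, rather than a raw dump. The sketch below is illustrative only; the check name, numbers, and runbook steps are hypothetical examples, not our actual runbooks.

```python
# Illustrative only: turn a detected issue into a specific, actionable Slack
# message. The observed/expected figures and runbook steps are made-up examples.
def format_alert(check_name, observed, expected, runbook_steps):
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(runbook_steps, start=1))
    return (
        f":rotating_light: *{check_name}*\n"
        f"Observed {observed:,} vs. expected ~{expected:,}.\n"
        f"*Next steps:*\n{steps}"
    )


message = format_alert(
    "Daily ad plays below baseline",
    observed=41_200,
    expected=58_000,
    runbook_steps=[
        "Check whether a release phase rolled out in the last 24 hours.",
        "Compare play counts by region to rule out a localized outage.",
        "If the drop persists, escalate to the device-health on-call.",
    ],
)
```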

The Outcome

At the time I left the company, "Anomaly Detection" pushed alerts to over half of our Engineering-Product-Design org (~20 of 40 individuals), plus several other stakeholders outside that org. It had become a major part of our operational de-risking efforts, and new channels were continuously being added.

If you'd like more technical details on this project, feel free to check out this Medium article I wrote, which details more specifics on infrastructure, code examples, and more.

The article has memes and everything.

Or, if you'd like a lighter read on the general data science lessons I took from this project, check out the separate Medium article above!