May 5th, 2021
When we view a photo on Instagram, we can pick up on different signals from the photo that are relevant to us: whether the photo has a bird in it, contains mouthwatering food, or is a photo of a mountain range. At Instagram, we try to develop a variety of signals in order to better recommend content to people on our different surfaces, such as the Home feed and the Explore screen.
As the product has scaled, so has our organization. Numerous teams now work on signal production (teams that create signals like those in the paragraph above: a bird detector or a mountain detector, for example) and on signal consumption (e.g., Home feed, Explore screen), which creates many-to-many relationships between them.
Pictured below: Example many-to-many relationships that can occur
These many-to-many relationships can cause a number of potential issues, like:
This is where the Signals Platform comes in. The Signals Platform sits in between this many-to-many relationship, centralizing signal publication and consumption by providing a unified API and infrastructure.
Pictured below: Adding a layer of indirection between signal producers and signal consumers
Let’s dive into a small view of the needs and design of the Signals Platform. We will cover two core areas: flow initiation and signal post-processing.
Pictured below: A high-level overview of the Signals Platform flow
There are different times at which a team may want a signal to be calculated. In most cases, this is at upload time for a photo. However, we may want to delay calculation if the signal isn’t vital, or we may want to recalculate the signal at another time. Part of the responsibility of the Signals Platform is to provide a selection of entry points that key into the flow that sources the signal. In addition, the Signals Platform provides infrastructure to expand the existing set, allowing teams at Instagram to add further entry points as necessary.
Most of these signals come from machine learning models. An extra layer of complexity is that these models often rely on the output of other machine learning models. As a result, flow initiation is not as simple as providing a callback for photo uploads. We must provide an interface that waits for callbacks from the models that execute at photo upload time, gathers their outputs in a canonical form, and passes those outputs into the downstream models it then executes.
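The dependency handling described above can be sketched as a small dependency-ordered runner. This is only an illustration under stated assumptions, not the platform's actual implementation: the names `run_flow`, `object_detector`, and `bird_closeup` are hypothetical, the models are plain synchronous functions rather than asynchronous callbacks, and the "canonical form" is assumed to be a dictionary keyed by model name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_flow(models, dependencies, media):
    """Run every model after the models it depends on.

    models:       dict of name -> callable(media, upstream_outputs)
    dependencies: dict of name -> set of upstream model names
    Returns a dict of name -> output (the assumed canonical form).
    """
    # Every model gets a graph entry, even if it has no upstream models.
    graph = {name: set(dependencies.get(name, ())) for name in models}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        # Gather only the upstream outputs this model declared.
        upstream = {dep: outputs[dep] for dep in graph[name]}
        outputs[name] = models[name](media, upstream)
    return outputs


def object_detector(media, upstream):
    # Hypothetical upstream model with hard-coded scores.
    return {"bird": 0.9, "mountain": 0.1}


def bird_closeup(media, upstream):
    # Hypothetical downstream model that consumes the detector output.
    return upstream["object_detector"]["bird"] > 0.5


MODELS = {"bird_closeup": bird_closeup, "object_detector": object_detector}
DEPENDENCIES = {"bird_closeup": {"object_detector"}}
```

Calling `run_flow(MODELS, DEPENDENCIES, {"media_id": 1})` runs `object_detector` first and then hands its output to `bird_closeup`. A production flow would additionally need to handle timeouts and failed upstream models, which this sketch leaves out.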
For various reasons, signals may not be calculated for a portion of uploads. This may be due to errors in the system, or because a signal has only recently been onboarded and has not yet run for a large portion of uploads. Consequently, a client of the Signals Platform may want to queue up an ad hoc recalculation of a set of signals. We want not only to provide an entry point into the flow for this purpose, but also to build an interface on top of it to allow self-service.
After the flow executes and we gather the signal, our job is not finished. As discussed earlier, the signal is leveraged in various ways, such as being sent to storage (to be retrieved later) or pushed to a real-time stream (to utilize the information immediately). Depending on what the client specified at signal onboarding, we may take either of these actions, both, or some other set of post-processing actions we provide.
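One way to picture this step is a per-signal configuration that lists the post-processing actions chosen at onboarding. The sketch below is an assumption-laden illustration: `SignalConfig`, `post_process`, the action names `"store"` and `"stream"`, and the in-memory `STORAGE` and `STREAM` stand-ins are all hypothetical and do not describe the platform's real API.

```python
from dataclasses import dataclass

STORAGE = {}  # stand-in for a storage backend (hypothetical)
STREAM = []   # stand-in for a real-time stream (hypothetical)


@dataclass(frozen=True)
class SignalConfig:
    name: str
    actions: tuple  # post-processing actions selected at onboarding


def write_to_storage(media_id, signal_name, value):
    # Keep the value so it can be retrieved later.
    STORAGE[(media_id, signal_name)] = value


def publish_to_stream(media_id, signal_name, value):
    # Emit the value for immediate consumption.
    STREAM.append({"media_id": media_id, "signal": signal_name, "value": value})


ACTIONS = {"store": write_to_storage, "stream": publish_to_stream}


def post_process(config, media_id, value):
    """Run each action the client selected for this signal, in order."""
    for action in config.actions:
        ACTIONS[action](media_id, config.name, value)
```

With `SignalConfig("bird_score", ("store", "stream"))`, a call to `post_process(config, 42, 0.93)` both stores the value and publishes it. Adding a new action only requires registering another function in `ACTIONS`.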
Pictured below: An example flow
In the end, we roll this up into a concise UI flow, allowing our clients to view, manage, and monitor their signals. In addition, we offer UI-based methods to onboard their signals, allowing for most use cases to seamlessly move from modelling completion to onboarding applications.
Along with providing the core flow of signal retrieval and processing, there are three other problems to consider: reliability and latency, signal distribution monitoring, and the signal lifecycle.
As a core provider of signals for vital downstream use cases, we have a commitment to reliability and low latency. This problem becomes more complex when we consider not only the scale of Instagram's traffic, but also the breadth of internal signal development and the number of downstream consumers. We must ensure our system is performant, but also layered appropriately to enable seamless integration with internal use cases across many different teams. As a result, we track numerous metrics related to signal delivery and latency, and we have invested in optimizing our infrastructure to meet our commitments on those metrics. Furthermore, we leverage shared infrastructure where possible, which lets us focus on the domain-specific issues that arise and on additional optimizations.
A recent win in this space was the migration of our post-processing storage solution to shared infrastructure. Initially we provisioned a bespoke storage solution for our use case, but it quickly became outdated, less efficient, and increasingly unreliable. Because designing a storage solution is neither a pressing problem nor one unique to our team (our storage use case generalizes easily), we decided to migrate to a cross-team storage solution. This allowed us to unlock efficiency wins from the up-to-date dedicated infrastructure, ensure reliability through commitments made by the supporting team, and scale easily with Instagram.
Signal distribution monitoring is vital in ensuring the real-time stability of model output and the health of the signals passing through our platform. It is not enough to track signal throughput alone. For example, if we notice real-time regressions in our downstream consumers' metrics, but the throughput of signal publication to these consumers is unchanged, the cause may be a drastic shift in model behavior.
One method we have implemented to monitor signal distribution uses the population stability index (PSI) to measure changes in the distribution of signal scores.
By normalizing the PSI, we are able to create a standard calculation across the signals served on the platform. This enables us to set thresholds for anomaly detection and ensure that the real-time distribution of signals is stable.
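For reference, PSI is commonly computed by binning a baseline and a current sample of scores and summing `(actual - expected) * ln(actual / expected)` over the bins. The sketch below computes that raw value. It is a minimal illustration with assumptions of its own: the function names, the bin edges, and the `eps` floor for empty bins are hypothetical, and the post does not say how the platform normalizes PSI, so no normalization is attempted here.

```python
import math
from bisect import bisect_right


def bin_fractions(scores, edges, eps=1e-6):
    """Fraction of scores per bin; out-of-range scores go to the nearest bin."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * (len(edges) - 1)
    for score in scores:
        index = bisect_right(edges, score) - 1
        index = min(max(index, 0), len(counts) - 1)
        counts[index] += 1
    # Floor each fraction at eps so the logarithm below is always defined.
    return [max(count / len(scores), eps) for count in counts]


def psi(expected_scores, actual_scores, edges, eps=1e-6):
    """Population stability index between a baseline and a current sample."""
    expected = bin_fractions(expected_scores, edges, eps)
    actual = bin_fractions(actual_scores, edges, eps)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


EDGES = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed bins for scores in [0, 1]
```

Identical samples give a PSI of zero, and the value grows as the current sample drifts away from the baseline. An alert would compare the result against a threshold chosen per deployment; the specific threshold is a tuning decision this sketch does not prescribe.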
The signal lifecycle consists of onboarding a signal to the platform, upgrading a signal, and deprecating a signal. Proper abstraction, automation, and compartmentalization are key to ensuring we stay true to the purpose of the platform.
A common issue that signal producers encounter when onboarding or upgrading their models for downstream use cases is hyperparameter optimization of the signal's application. For example, a downstream use case may apply a simple function with several parameters to the raw score output. From offline evaluation against labels we have collected, we can derive values for these parameters, but there may still be gaps between the values that are optimal for the labels and what we measure in online metrics via experimentation.
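To make the offline step concrete, the sketch below tunes a hypothetical two-parameter application function against labeled scores with a plain grid search. Everything here is an assumption for illustration: the sigmoid form of `apply_signal`, the `slope` and `offset` parameters, the log-loss objective, and the candidate grids are not taken from the platform. It covers only offline derivation from labels, not online experimentation.

```python
import math
from itertools import product


def apply_signal(raw_score, slope, offset):
    """Hypothetical application function: squash a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(slope * raw_score + offset)))


def log_loss(params, scores, labels):
    """Mean log loss of the applied scores against 0/1 labels."""
    slope, offset = params
    total = 0.0
    for score, label in zip(scores, labels):
        p = apply_signal(score, slope, offset)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(scores)


def grid_search(scores, labels, slopes, offsets):
    """Return the (slope, offset) pair with the lowest offline log loss."""
    candidates = product(slopes, offsets)
    return min(candidates, key=lambda params: log_loss(params, scores, labels))
```

For example, `grid_search([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], [1, 5, 10], [-5, -2.5, 0])` selects the pair that best separates the low-scoring negatives from the high-scoring positives. Values found this way are only a starting point, since the offline optimum can differ from what performs best in online metrics.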
Consequently, we must run online experiments to determine, verify, and finalize parameters when upgrading or onboarding a signal. Hand-tuning these parameters is a time-consuming manual process. To alleviate this problem, we leverage Bayesian optimization to tune the parameters automatically toward our desired online outcomes. By hooking abstractions of our downstream use cases into automatic parameter tuning, we allow our signal producers to onboard and upgrade signals automatically.
Designing a signals platform to serve all of Instagram requires tackling large, interrelated problems and teasing apart workable components in order to design sustainable and extensible infrastructure. We are committed to providing a reliable, high-quality, and comprehensive platform that enables teams at Instagram to deliver the best user value. If you are interested in joining one of our engineering teams, please visit our careers page.