November 1, 2021
Instagram first started building video ads in 2014. At the time, we focused on moving fast, which got an initial iteration into production quickly but left a lot of room for improvement. Over the last two years, we redesigned the Instagram ads video pipeline for performance. When the new pipeline launched in 2021, it drove a 3.9% improvement in video ad watch time and reduced memory errors by 98%.
For the first iteration of video ads, we reused as much of the existing Facebook Ads infrastructure as possible to save engineering time. This was a quick way to get an initial iteration completed, but the tradeoff of this approach was that the resulting pipeline had poor reliability, wasn’t optimized for Instagram, and was difficult to maintain for Instagram teams.
The Facebook Ads infrastructure already generated video encodings out of the box and stored them in Facebook’s blob storage service. The problem with reusing them directly was that the two products have different deletion behaviors. What do you do if an Instagram ads user deletes their ad but it’s still running on Facebook? To solve this problem, we would’ve had to spend extra engineering time implementing reference counting (refcounting) so that an encoding wouldn’t be deleted until every reference to it had also been deleted.
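To make the refcounting idea concrete, here is a minimal sketch of what a shared store with reference counting could look like. It is an illustration only: the class `RefCountedBlobStore`, its methods, and the owner labels are hypothetical, and this is the approach we chose not to build, not code from either pipeline.

```python
from typing import Dict, Set


class RefCountedBlobStore:
    """Hypothetical in-memory store; a blob is deleted only when no owner references it."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._owners: Dict[str, Set[str]] = {}

    def put(self, blob_id: str, data: bytes, owner: str) -> None:
        # Store the blob and record its first owner.
        self._blobs[blob_id] = data
        self._owners.setdefault(blob_id, set()).add(owner)

    def add_reference(self, blob_id: str, owner: str) -> None:
        # A second product (for example an Instagram ad) starts using the same blob.
        if blob_id not in self._blobs:
            raise KeyError(blob_id)
        self._owners[blob_id].add(owner)

    def release(self, blob_id: str, owner: str) -> bool:
        """Drop one owner's reference. Returns True if the blob was actually deleted."""
        owners = self._owners[blob_id]
        owners.discard(owner)
        if owners:
            return False  # Someone else still needs the encoding.
        del self._blobs[blob_id]
        del self._owners[blob_id]
        return True

    def exists(self, blob_id: str) -> bool:
        return blob_id in self._blobs
```

With this shape, deleting the Instagram ad only releases Instagram's reference, and the encoding stays available for as long as the Facebook ad still holds one.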
We didn’t have the time to do that though, so we opted to copy the Facebook encodings into separate storage. This was cheap to implement and avoided the complexities of shared storage across Facebook and Instagram. When we tried to copy the encodings, we realized that the blob storage service didn’t have an API to copy the files locally within the service. Again, we didn’t have time to wait for that to be implemented. To move fast, the Instagram Ads pipeline copied each video by downloading its encodings locally and then re-uploading them to the blob storage service.
This approach had its problems. Videos are large files. For instance, the average Instagram ad original video is ~33 MB. The web servers handling this file copying were not built to handle long-lived requests that need this much memory. As a result, these requests would often fail for larger videos, which lowered reliability.
In addition, these copied videos were only tested and optimized for delivery on Facebook. Without Instagram-specific optimization, the delivered quality was poorer than ideal. We made several attempts to optimize the old pipeline for Instagram by reprocessing the Facebook encodings with Instagram-specific encoding settings. These efforts did not improve delivered quality as much as anticipated because transcoding is a lossy process, and processing encodings twice significantly degrades quality.
Lastly, the old pipeline was difficult for the Instagram ads teams to maintain since video processing was not their primary focus. It resulted in slower developer velocity, increased incident resolution times, and lowered overall engineering satisfaction.
To tackle the above problems, the Instagram Media Platform and Ads teams collaborated for two years to ship a redesigned Instagram video ads processing pipeline in 2021. This pipeline leverages the same Instagram-optimized encoding recipes that the non-ads traffic uses to produce videos.
As mentioned above, reliability was a big problem with the old design due to the long-lived processes needed to copy the videos. To solve this, we built the new pipeline using a service that processes videos asynchronously. Now, the ads platform web servers just need to call the video processing service with the necessary inputs and the video will start processing in the background.
This changed a major assumption in our pipeline and therefore needed a lot of careful planning and testing. We had to rewrite the ads platform code so that it knew what to do if the video wasn’t ready at publish time. For this, we implemented a retry queue for the ads platform publish requests, with a critical rate-limiting mechanism. When the ads platform pings the video processing service and the video isn’t ready yet, we add the ads publish request to a queue that periodically retries it, up to a maximum number of retries.
The important part here is that we don’t trigger the video processing service every time we see that the video isn’t ready yet. Otherwise, if the video processing service lags or goes down, it will receive more and more inbound requests, making matters worse. To prevent this, we only trigger the video processing pipeline after a substantial amount of time has passed since it was last triggered. Also, we have knobs we can tweak in prod to make the retry interval less aggressive if we ever have higher than expected retry volume.
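The retry behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production code: `PublishRetryPolicy`, `PublishRequest`, `PublishRetryHandler`, and the default values are all hypothetical names and numbers, and the two callbacks stand in for calls to the video processing service.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PublishRetryPolicy:
    """Hypothetical tuning knobs; names and defaults are illustrative only."""
    max_retries: int = 5
    min_trigger_interval_s: float = 600.0  # minimum gap between processing triggers


@dataclass
class PublishRequest:
    video_id: str
    retries: int = 0


class PublishRetryHandler:
    def __init__(
        self,
        policy: PublishRetryPolicy,
        is_video_ready: Callable[[str], bool],
        trigger_processing: Callable[[str], None],
    ) -> None:
        self.policy = policy
        self.is_video_ready = is_video_ready
        self.trigger_processing = trigger_processing
        self.last_triggered_at: Dict[str, float] = {}

    def attempt(self, request: PublishRequest, now: float) -> str:
        """Handle one publish attempt. Returns "published", "retry", or "failed"."""
        if self.is_video_ready(request.video_id):
            return "published"
        if request.retries >= self.policy.max_retries:
            return "failed"
        request.retries += 1
        # Only re-trigger processing if enough time has passed since the last
        # trigger, so a lagging service is not flooded with duplicate requests.
        last = self.last_triggered_at.get(request.video_id)
        if last is None or now - last >= self.policy.min_trigger_interval_s:
            self.trigger_processing(request.video_id)
            self.last_triggered_at[request.video_id] = now
        return "retry"
```

A caller would re-enqueue the request whenever `attempt` returns `"retry"`. Raising `min_trigger_interval_s` corresponds to the kind of knob that makes retries less aggressive when retry volume is higher than expected.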
As mentioned above, removing the synchronous assumption in the ads platform code added a lot of complexity. To ensure that the rollout was smooth, we introduced several phases of testing to build up confidence along the way.
First, we leveraged a "double write" test. In this test, we triggered the new pipeline to generate test encodings only after ads were successfully published. This way, the test would not affect ads creation in any way.
At delivery time, some ads would have two sets of encodings available from this: one from the new pipeline and one from the old pipeline. For ads with two sets of encodings, we conducted an A/B test where half the users received the new encodings and the other half received the old encodings. This test confirmed that the new encodings performed much better than the copied encodings.
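As an illustration of the delivery-time split, the sketch below assigns each user to one arm deterministically. How users were actually bucketed is not described here, so the hashing scheme, the function names, and the experiment label are assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def assigned_pipeline(user_id: str, experiment: str = "ads_video_encodings") -> str:
    """Deterministically place a user in the "new" or "old" arm (50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "new" if bucket < 50 else "old"


def encodings_for_delivery(
    user_id: str, encodings_by_pipeline: Dict[str, List[str]]
) -> List[str]:
    """Pick which set of encodings to serve for one ad impression."""
    if "new" in encodings_by_pipeline and "old" in encodings_by_pipeline:
        # Only ads with both sets of encodings take part in the A/B test.
        return encodings_by_pipeline[assigned_pipeline(user_id)]
    # Ads with a single set of encodings just serve what they have.
    return next(iter(encodings_by_pipeline.values()))
```

Hashing the user ID keeps each user in the same arm across impressions, which is what makes a per-user comparison of watch time meaningful.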
Now that we knew these new encodings were going to be a significant improvement, we needed to figure out how to swap out creation so we could get rid of the old pipeline. To do this, we first kept the synchronous assumption in place and tried to use the new pipeline. If the new pipeline’s encodings were available in time, we’d use them; otherwise, we’d fall back to the old pipeline. From our initial testing and logging, we saw that the new pipeline’s encodings were available in time for a majority of ads. This partial creation test allowed us to verify that the new flow could handle production load, gave us traffic for monitoring the new pipeline, and let us claim a portion of the improvements up front.
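The partial creation test boils down to a "use the new encodings if they are ready, else fall back" decision at publish time. A minimal sketch, assuming hypothetical callbacks `fetch_new` and `copy_old` that stand in for the two pipelines:

```python
from typing import Callable, List, Optional, Tuple


def select_encodings_for_publish(
    fetch_new: Callable[[], Optional[List[str]]],
    copy_old: Callable[[], List[str]],
) -> Tuple[str, List[str]]:
    """Prefer the new pipeline's encodings; fall back to the old pipeline otherwise."""
    new_encodings = fetch_new()  # None or empty means "not available in time"
    if new_encodings:
        return "new_pipeline", new_encodings
    # Assumption for this sketch: the old copy path runs only when it is needed.
    return "old_pipeline", copy_old()
```

Returning the pipeline name alongside the encodings makes it easy to log which path each ad took, which is the kind of signal needed to see how often the new encodings arrive in time.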
After the partial creation test gave us confidence, we then set up the final creation test to completely swap out the old pipeline for the new one. This test helped us verify the new retry queue’s correctness and the overall creation reliability of the new pipeline.
After we launched our new Ads video processing pipeline in 2021, we saw major improvements in production across performance, reliability, and developer efficiency. This new pipeline improved the stability of video ads publishing due to a 98% reduction in memory errors. Also, now that we process the original source video with Instagram-optimized encoding settings, we are seeing a 3.9% increase in overall ads watch time with similar magnitude impact to downstream ads product metrics. Finally, the pipeline is in a much better operational state now that folks with video domain expertise maintain it. This offloads a lot of burden from the IG Ads team and speeds up incident resolution times.
Over the years, Instagram has worked to improve the craft of its product offerings. Performance is one of the ways that we provide delightful experiences for our users. There is always potential for impact considering the scale of Instagram where many of our 1B+ users consume media every day. If this work sounds interesting to you, join us!
Many thanks to Jaspreet Kaur, Ding Ma, and Ephraim Park from the Ads org for this two-year-long collaboration. Also, huge thanks to Ang Li, Haixia Shi, and Lukas Camra for feedback and discussion from the Instagram Media Platform side. Lastly, thanks to our partners Zainab Zahid and Mike Starr for supporting us with the infrastructure that powers the latest video processing pipeline.
Ryan Peterman is a software engineer on Instagram’s Media Platform team in San Francisco.