How to Encode a Video for Uploading
People upload hundreds of millions of videos to Facebook every day. Making sure every video is delivered at the best quality, with the highest resolution and as little buffering as possible, means optimizing not only when and how our video codecs compress and decompress videos for viewing, but also which codecs are used for which videos. But the sheer volume of video content on Facebook also means finding ways to do this that are efficient and don't consume a ton of computing power and resources.
To help with this, we employ a variety of codecs as well as adaptive bitrate streaming (ABR), which improves the viewing experience and reduces buffering by choosing the best quality based on a viewer's network bandwidth. But while more advanced codecs like VP9 provide better compression performance than older codecs like H264, they also consume more computing power. From a pure computing perspective, applying the most advanced codecs to every video uploaded to Facebook would be prohibitively inefficient. That means there needs to be a way to prioritize which videos should be encoded using more advanced codecs.
Today, Facebook handles its high demand for encoding high-quality video content by combining a benefit-cost model with a machine learning (ML) model that lets us prioritize advanced encoding for highly watched videos. By predicting which videos will be highly watched and encoding them first, we can reduce buffering, improve overall visual quality, and allow people on Facebook who may be limited by their data plans to watch more videos.
But this task isn't as straightforward as allowing content from the most popular uploaders, or those with the most friends or followers, to jump to the front of the line. Several factors have to be taken into consideration so that we can provide the best video experience for people on Facebook while also ensuring that content creators still have their content encoded fairly on the platform.
How we used to encode video on Facebook
Traditionally, once a video is uploaded to Facebook, the process to enable ABR kicks in and the original video is rapidly re-encoded into multiple resolutions (e.g., 360p, 480p, 720p, 1080p). Once the encodings are made, Facebook's video encoding system tries to further improve the viewing experience by using more advanced codecs, such as VP9, or more expensive "recipes" (a video industry term for fine-tuned transcoding parameters), such as the H264 very slow preset, to compress the video file as much as possible. Different transcoding technologies (using different codec types or codec parameters) have different trade-offs between compression efficiency, visual quality, and how much computing power is needed.
The question of how to order jobs in a way that maximizes the overall experience for everyone has long been top of mind. Facebook has a specialized encoding compute pool and dispatcher. It accepts encoding job requests that have a priority value attached to them and puts them into a priority queue where higher-priority encoding tasks are processed first. The video encoding system's job is then to assign the right priority to each task. It did so by following a list of simple, hard-coded rules. Encoding tasks could be assigned a priority based on a number of factors, including whether a video is a licensed music video, whether the video is for a product, and how many friends or followers the video's owner has.
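As a rough illustration of that dispatch mechanism (not Facebook's actual implementation, and with hypothetical job fields and priority values):

```python
import heapq
import itertools

class EncodingDispatcher:
    """Toy priority-queue dispatcher: higher-priority jobs are processed first."""

    def __init__(self):
        self._heap = []                # min-heap of (-priority, seq, job)
        self._seq = itertools.count()  # FIFO tie-breaker for equal priorities

    def submit(self, video_id: str, recipe: str, priority: float) -> None:
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), (video_id, recipe)))

    def next_job(self):
        neg_priority, _, job = heapq.heappop(self._heap)
        return job, -neg_priority

dispatcher = EncodingDispatcher()
dispatcher.submit("video_a", "h264_fast", priority=1.0)  # hypothetical rule-based values
dispatcher.submit("video_b", "vp9_1080p", priority=4.2)
print(dispatcher.next_job())  # (('video_b', 'vp9_1080p'), 4.2)
```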
But there were disadvantages to this approach. As new video codecs became available, the number of rules that needed to be maintained and tweaked kept expanding. And since different codecs and recipes have different trade-offs in computing requirements, visual quality, and compression performance, it is impossible to fully optimize the end user experience with a coarse-grained set of rules.
And, perhaps most important, Facebook's video consumption pattern is extremely skewed: Facebook videos are uploaded by people and Pages that span a wide spectrum in terms of their number of friends or followers. Compare the Facebook Page of a big company like Disney with that of a vlogger who might have 200 followers. The vlogger can upload a video at the same time, but Disney's video is likely to get more watch time. However, any video can go viral even if the uploader has a small following. The challenge is to support content creators of all sizes, not just those with the largest audiences, while also acknowledging the reality that a large audience likely means more views and longer watch times.
Enter the Benefit-Cost model
The new model still uses a set of quick initial H264 ABR encodings to ensure that all uploaded videos are encoded at good quality as soon as possible. What's changed, however, is how we calculate the priority of encoding jobs after a video is published.
The benefit-cost model grew out of a few key observations:
- A video consumes computing resources only the first time it is encoded. Once it has been encoded, the stored encoding can be delivered as many times as requested without requiring additional compute resources.
- A relatively small percentage (roughly one-third) of all videos on Facebook generates the majority of overall watch time.
- Facebook's data centers have limited amounts of energy to power compute resources.
- We get the most bang for our buck, so to speak, in terms of maximizing everyone's video experience within the available power constraints, by applying more compute-intensive recipes and advanced codecs to the videos that are watched the most.
Based on these observations, we came up with the following definitions for benefit, cost, and priority:
- Benefit = (relative compression efficiency of the encoding family at fixed quality) * (effective predicted watch time)
- Cost = normalized compute cost of the missing encodings in the family
- Priority = Benefit/Cost
Relative compression efficiency of the encoding family at fixed quality: We measure benefit in terms of the encoding family's compression efficiency. "Encoding family" refers to the set of encoding files that can be delivered together. For example, the H264 360p, 480p, 720p, and 1080p encoding lanes make up one family, and VP9 360p, 480p, 720p, and 1080p make up another family. One challenge here is comparing compression efficiency between different families at the same visual quality.
To understand this, you first have to understand a metric we've developed called Minutes of Video at High Quality per GB data pack (MVHQ). MVHQ links compression efficiency directly to a question people wonder about their internet data allowance: Given 1 GB of data, how many minutes of high-quality video can we stream?
Mathematically, MVHQ can be understood as:
MVHQ = (video length in minutes) / (GB of data needed to deliver the whole video at or above the high-quality threshold)
For example, let's say we have a video where the MVHQ is 153 minutes using the H264 fast preset, 170 minutes using the H264 slow preset, and 200 minutes using VP9. This means delivering the video using VP9 could extend watch time on 1 GB of data by 47 minutes (200-153) at a high visual quality threshold, compared with the H264 fast preset. When calculating the benefit value of this particular video, we use H264 fast as the baseline. We assign 1.0 to H264 fast, 1.1 (170/153) to H264 slow, and 1.3 (200/153) to VP9.
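That baseline calculation is straightforward; here is a minimal sketch using the illustrative MVHQ values from the example above (not measured data):

```python
# Estimated MVHQ (minutes of high-quality video per GB) for each encoding family.
mvhq_minutes = {"h264_fast": 153, "h264_slow": 170, "vp9": 200}

baseline = mvhq_minutes["h264_fast"]  # H264 fast serves as the reference family
relative_efficiency = {family: round(minutes / baseline, 2)
                       for family, minutes in mvhq_minutes.items()}
# {'h264_fast': 1.0, 'h264_slow': 1.11, 'vp9': 1.31}
```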
The actual MVHQ can be calculated only once an encoding is produced, but we need the value before encodings are available, so we use historical data to estimate the MVHQ for each of the encoding families of a given video.
Effective predicted watch time: As described further in the section below, we have a sophisticated ML model that predicts how long a video is going to be watched in the near future across all of its audience. Once we have the predicted watch time at the video level, we estimate how effectively an encoding family can be applied to the video. This accounts for the fact that not all people on Facebook have the latest devices, which can play newer codecs.
For example, about 20 percent of video consumption happens on devices that cannot play videos encoded with VP9. So if the predicted watch time for a video is 100 hours, the effective predicted watch time using the widely adopted H264 codec is 100 hours, while the effective predicted watch time of the VP9 encodings is 80 hours.
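In sketch form, assuming the device-coverage figure from the example (real playable shares vary per video and per codec):

```python
predicted_watch_hours = 100.0

# Fraction of predicted watch time expected on devices able to decode each family.
playable_share = {"h264": 1.0, "vp9": 0.8}  # ~20% of consumption can't play VP9

effective_watch_hours = {family: predicted_watch_hours * share
                         for family, share in playable_share.items()}
# {'h264': 100.0, 'vp9': 80.0}
```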
Normalized compute cost of the missing encodings in the family: This is the amount of logical computing cycles we need to make the encoding family deliverable. An encoding family requires a minimum set of resolutions to be made available before we can deliver a video. For example, for a particular video, the VP9 family may require at least four resolutions. But some encodings take longer than others, meaning not all of the resolutions for a video can be made available at the same time.
As an example, let's say Video A is missing all four lanes in the VP9 family. We can sum up the estimated CPU usage of all four lanes and assign the same normalized cost to all four jobs.
If we are only missing two out of four lanes, as with Video B, the compute cost is the sum of producing the remaining two encodings. The same cost is applied to both jobs. Since the priority is benefit divided by cost, this has the effect of a job's priority becoming more urgent as more lanes become available. Encoding lanes do not provide any value until they are deliverable, so it is important to get to a complete family as quickly as possible. For example, having one video with all of its VP9 lanes adds more value than 10 videos with incomplete (and therefore undeliverable) VP9 lanes.
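Putting the pieces together, here is a minimal sketch of the priority computation; the efficiency, watch-time, and per-lane cost numbers are illustrative values from the examples above, not production figures:

```python
def priority(relative_efficiency: float, effective_watch_hours: float,
             missing_lane_costs: list) -> float:
    """Priority = benefit / cost for the missing encodings of one family."""
    benefit = relative_efficiency * effective_watch_hours
    cost = sum(missing_lane_costs)  # normalized compute for all missing lanes
    return benefit / cost

# Video A: all four VP9 lanes missing (hypothetical per-lane costs).
print(priority(1.3, 80.0, [4.0, 2.0, 1.0, 0.5]))  # ~13.9

# Video B: only two lanes left to produce, so the remaining cost is lower
# and the same benefit yields a higher (more urgent) priority.
print(priority(1.3, 80.0, [1.0, 0.5]))            # ~69.3
```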
Predicting watch time with ML
With the new benefit-cost model in place to tell us how certain videos should be encoded, the next piece of the puzzle is determining which videos should be prioritized for encoding. That's where we now use ML to predict which videos will be watched the most and thus should be prioritized for advanced encodings.
Our model looks at a number of factors to predict how much watch time a video will get within the next hour. It does this by looking at the video uploader's friend or follower count and the average watch time of their previously uploaded videos, as well as metadata from the video itself, including its duration, width, height, privacy status, post type (Live, Stories, Watch, etc.), how old it is, and its past popularity on the platform.
But using all this data to make decisions comes with several built-in challenges:
Watch time has high variance and a very long-tailed, skewed nature. Even when we focus on predicting the next hour of watch time, a video's watch time can range anywhere from zero to over 50,000 hours depending on its content, who uploaded it, and the video's privacy settings. The model must be able to tell not only whether the video will be popular, but also how popular.
The best indicator of next-hour watch time is the video's previous watch time trajectory. Video popularity is generally very volatile by nature. Different videos uploaded by the same content creator can sometimes have vastly different watch times depending on how the community reacts to the content. After experimenting with multiple features, we found that past watch time trajectory is the best predictor of future watch time. This poses two technical challenges in terms of designing the model architecture and balancing the training data:
- Newly uploaded videos don't have a watch time trajectory. The longer a video stays on Facebook, the more we can learn from its past watch time. This means that the most predictive features won't apply to new videos. We want our model to perform reasonably well with missing data, because the earlier the system can identify videos that will become popular on the platform, the more opportunity there is to deliver higher-quality content.
- Popular videos tend to dominate the training data. The patterns of the most popular videos are not necessarily applicable to all videos.
Watch time behavior varies by video type. Stories videos are shorter and get less watch time on average than other videos. Live streams get most of their watch time during the stream or in the few hours after it. Meanwhile, videos on demand (VOD) can have a varied lifespan and can rack up watch time long after they're initially uploaded if people start sharing them later.
Improvements in ML metrics do not necessarily correlate directly with product improvements. Traditional regression loss functions, such as RMSE, MAPE, and Huber loss, are great for optimizing offline models. But a reduction in modeling error does not always translate directly into product improvement, such as a better user experience, more watch time coverage, or better compute utilization.
Building the ML model for video encoding
To solve these challenges, we decided to train our model using watch time event data. Each row of our training/evaluation data represents a decision point that the system has to make a prediction for.
Since our watch time event data can be skewed or imbalanced in many ways, as mentioned above, we performed data cleaning, transformation, bucketing, and weighted sampling on the dimensions we care about.
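One common way to implement that kind of label transformation, bucketing, and weighted sampling looks like the sketch below; this is a generic illustration, not Facebook's actual pipeline, and the bucket boundaries are hypothetical:

```python
import numpy as np

watch_minutes = np.array([0.0, 2.0, 50.0, 1200.0, 48000.0])
label = np.log1p(watch_minutes)  # compress the long-tailed watch time label

# Bucket videos by popularity, then weight samples inversely to bucket size
# so over-represented popularity levels don't dominate training.
buckets = np.digitize(watch_minutes, bins=[1, 10, 100, 1000, 10000])
bucket_counts = np.bincount(buckets, minlength=6)
weights = 1.0 / bucket_counts[buckets]
weights /= weights.sum()

rng = np.random.default_rng(0)
sample_idx = rng.choice(len(watch_minutes), size=4, p=weights)
```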
Also, since newly uploaded videos don't have a watch time trajectory to draw from, we decided to build two models, one for handling upload-time requests and the other for view-time requests. The view-time model uses the three sets of features mentioned above. The upload-time model looks at the performance of other videos a content creator has uploaded and substitutes this for past watch time trajectories. Once a video has been on Facebook long enough to have some past trajectory available, we switch it to the view-time model.
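A minimal sketch of that routing logic, with a hypothetical feature container and model interface (both models are assumed to expose a predict method):

```python
from dataclasses import dataclass, field

@dataclass
class VideoFeatures:
    uploader: dict                                   # friend/follower count, creator averages
    metadata: dict                                   # duration, dimensions, post type, age
    trajectory: list = field(default_factory=list)   # empty for newly uploaded videos

def predict_next_hour_watch_time(video: VideoFeatures, upload_model, view_model) -> float:
    # New videos have no trajectory yet, so fall back to the upload-time model,
    # which substitutes the creator's past-video performance for the trajectory.
    if video.trajectory:
        return view_model.predict(video)
    return upload_model.predict(video)
```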
During model development, we selected the best launch candidates by looking at both root mean square error (RMSE) and mean absolute percentage error (MAPE). We use both metrics because RMSE is sensitive to outliers, while MAPE is sensitive to small values. Our watch time label has high variance, so we use MAPE to evaluate performance on videos that are popular or moderately popular and RMSE to evaluate less watched videos. We also care about the model's ability to generalize well across different video types, ages, and popularity levels. Therefore, our evaluation always includes per-category metrics as well.
MAPE and RMSE are good summary metrics for model selection, but they don't necessarily reflect direct product improvements. Sometimes when two models have a similar RMSE and MAPE, we also translate the evaluation into a classification problem to understand the trade-off. For example, if a video receives 1,000 minutes of watch time but Model A predicts 10 minutes, Model A's MAPE is 99 percent. If Model B predicts 1,990 minutes of watch time, Model B's MAPE will be the same as Model A's (i.e., 99 percent), but Model B's prediction is more likely to result in the video getting a high-quality encoding.
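The arithmetic behind that example, in a short sketch (the watch time values are the illustrative ones from the example):

```python
import numpy as np

def mape(actual, predicted):
    return np.mean(np.abs(predicted - actual) / actual) * 100

actual = np.array([1000.0])               # minutes of watch time
print(mape(actual, np.array([10.0])))     # Model A: 99.0
print(mape(actual, np.array([1990.0])))   # Model B: 99.0 -> identical MAPE,
# but Model B's over-prediction would still trigger advanced encoding.
```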
We also evaluate the classifications that videos are given, because we want to capture the trade-off between applying advanced encodings too often and missing the opportunity to apply them when there would be a benefit. For example, at a threshold of 10 seconds, we count the number of videos where the actual watch time is less than 10 seconds and the prediction is also less than 10 seconds, and vice versa, in order to calculate the model's false positive and false negative rates. We repeat the same calculation for multiple thresholds. This method of evaluation gives us insight into how the model performs on videos of different popularity levels, and whether it tends to suggest more encoding jobs than necessary or to miss some opportunities.
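A minimal sketch of that thresholded evaluation, with made-up watch time values for illustration:

```python
import numpy as np

def fp_fn_rates(actual_sec, predicted_sec, threshold):
    """Treat 'watch time >= threshold' as the positive class."""
    actual_pos = actual_sec >= threshold
    pred_pos = predicted_sec >= threshold
    fp = np.mean(pred_pos & ~actual_pos)  # would encode, but wasn't worth it
    fn = np.mean(~pred_pos & actual_pos)  # missed a video worth encoding
    return fp, fn

actual = np.array([3.0, 8.0, 45.0, 600.0, 12.0])
predicted = np.array([12.0, 5.0, 30.0, 900.0, 4.0])
for threshold in (10, 60, 300):           # repeat at multiple thresholds
    print(threshold, fp_fn_rates(actual, predicted, threshold))
```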
The impact of the new video encoding model
In addition to improving the viewer experience for newly uploaded videos, the new model can identify older videos on Facebook that should have been encoded with more advanced encodings and route more computing resources to them. Doing this has shifted a large portion of watch time to advanced encodings, resulting in less buffering without requiring additional computing resources. The improved compression has also allowed people on Facebook with limited data plans, such as those in emerging markets, to watch more videos at better quality.
What's more, as we introduce new encoding recipes, we no longer have to spend a lot of time evaluating where in the priority range to assign them. Instead, depending on a recipe's benefit and cost values, the model automatically assigns a priority that maximizes overall benefit throughput. For example, we could introduce a very compute-intensive recipe that only makes sense for extremely popular videos, and the model can identify such videos. Overall, this makes it easier for us to continue investing in newer and more advanced codecs to give people on Facebook the best-quality video experience.
Acknowledgements
This work is the collective result of the entire Video Infra team at Facebook. The authors would like to personally thank Shankar Regunathan, Atasay Gokkaya, Volodymyr Kondratenko, Jamie Chen, Cosmin Stejerean, Denise Noyes, Zach Wang, Oytun Eskiyenenturk, Mathieu Henaire, Pankaj Sethi, and David Ronca for all their contributions.
Source: https://engineering.fb.com/2021/04/05/video-engineering/how-facebook-encodes-your-videos/