At uStudio, we continuously evaluate new and evolving cloud solutions to provide the best enterprise video platform to our customers at competitive rates. Because enterprise cloud offerings keep changing across private, hybrid, and public infrastructure, we must also remain flexible in meeting our customers’ security and IT requirements. Recently, a few members of our engineering team were lucky enough to test a variety of our video processing services on Oracle’s new Bare Metal infrastructure.
A summary of our testing includes:
- A 36-core Oracle Bare Metal instance delivered roughly the same throughput as 50 cores’ worth of smaller VM instances from another popular cloud vendor, a ~39% improvement “core-for-core”.
- Our cluster of Oracle bare metal instances displayed more reliable runtime guarantees, free from the “noisy neighbor” effect.
- A single bare metal instance was capable of running multiple concurrent transcoding tasks in “realtime” while demonstrating higher throughput (decreasing customer wait times) for single tasks.
- Ultimately Oracle Bare Metal offers a solid option for enterprises moving video workloads into the cloud.
Why Bare Metal for Video?
First, it is important to understand why video processing tasks might benefit from bare metal servers, especially when compared with traditional cloud-based virtual machines.
Video files are large and complex data sources, delivered in a wide variety of competing formats, containing dense, highly compressed media “tracks” that use advanced codecs such as H.264, HEVC, VP9, ProRes, and others (not to mention a similar cast of audio codecs). Even when compressed, HD video files can range from hundreds of megabytes to dozens of gigabytes in size. To play video across the wide variety of devices, environments, and network conditions required by today’s enterprise software, computationally intensive decoding and encoding operations (known as “transcoding”) are necessary, along with other processing steps such as audio and video correction and VR optimization.
Consequently, we expected our tests to show that video processing benefits from direct access to a system’s computational capabilities without the added cost of virtualization -- faster processing using the same infrastructure. Additionally, bare metal instances should prevent “noisy neighbors”: other virtualized services running on the same machine that can “steal” resources and slow down your tasks.
Our Test Environment
Our first task was to set up the infrastructure necessary to run a variety of video-related tasks across a replicated cluster of equivalent machines.
At uStudio, we have standardized on Docker containers running in Kubernetes, an easy and modern interface to deploy, manage and scale services. While the particulars of our Kubernetes setup are outside the scope of this article, we were pleased with the ability to create a modern Kubernetes cluster atop the infrastructure elements provided by Oracle, even while Oracle had yet to roll out all the capabilities planned for their cloud offerings.
Inside our Oracle Bare Metal environment, we created a three-node cluster of 36-core machines, each with 256 gigabytes of memory and local disk storage. Two of the nodes served as “worker” nodes with minimal service overhead, while the third served as the “master” node and exclusively ran orchestration and task management services so that we could measure performance consistently.
For competitive comparison, we used a forty-machine cluster of smaller dual-core VMs in a popular cloud provider. This is an important point: the intent of our test was not to make an apples-to-apples comparison of one VM instance against one bare metal machine (especially since few if any cloud providers offer machines with Oracle’s performance) but instead to compare overall cluster throughput, reliability, and efficiency. Horizontal scaling of tasks across distributed infrastructure (while taking full advantage of the underlying hardware) is our primary goal when providing a cloud service to our customers, and our tests measure accordingly.
For our primary test, we performed a simple, linear transcode of a 1080p 23.98 fps “mezzanine” source video to a 1080p MP4 (using H.264 Main profile and AAC audio) encoded at 1.5 Mbps, which is appropriate for delivery of 1080p content via the web. We chose this workload because it is common across much of our enterprise customer content. For this test, we read the source video from local disk and wrote the results back to local disk, to isolate the CPU performance of the nodes.
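The test itself is not tied to a particular encoder. As a minimal sketch, assuming FFmpeg with the libx264 encoder (an assumption for illustration, as are the file names and the choice to apply the 1.5 Mbps target to the video track), an equivalent command line could be assembled like this:

```python
import shlex


def build_transcode_command(source_path, output_path, video_bitrate="1500k"):
    """Build an FFmpeg argument list for a 1080p H.264 Main / AAC MP4 transcode.

    Assumes FFmpeg with libx264 is installed; the paths are hypothetical.
    """
    return [
        "ffmpeg",
        "-i", source_path,      # mezzanine source read from local disk
        "-c:v", "libx264",      # H.264 video encoder
        "-profile:v", "main",   # H.264 Main profile
        "-b:v", video_bitrate,  # target video bitrate (assumed 1.5 Mbps)
        "-c:a", "aac",          # AAC audio
        output_path,            # the .mp4 extension selects the MP4 container
    ]


command = build_transcode_command("mezzanine_1080p.mov", "output_1080p.mp4")
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run` on a machine where FFmpeg is available; no scaling filter is included because the source and output are both 1080p.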
The Oracle Bare Metal cluster showed significant performance improvements over the VM cluster, at various workloads:
- When transcoding a single 2 minute 8 second video, the smaller VMs took, on average, 342.1 seconds, while the Bare Metal nodes took 38.28 seconds, an almost 9x improvement in transcoding time.
- Each Oracle Bare Metal node could transcode about 9 videos simultaneously in "realtime" (i.e. processing video at the same framerate as the source video).
- The peak "cluster FPS" of the Bare Metal cluster of two worker nodes was approximately 465 frames per second, occurring at 9 concurrent transcodes per node, at which point each node was transcoding an average of 232.78 frames per second. This is roughly equivalent to the “cluster FPS” of a 50-node cluster of standard dual-core virtual machines, which averaged 8.98 FPS each.
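The headline ratios follow directly from the measurements above. The short sketch below reproduces the arithmetic; every constant is taken from the figures quoted in this article, and only the variable names are ours.

```python
SOURCE_FPS = 23.98          # frame rate of the source video
VM_SECONDS = 342.1          # average single transcode on a dual-core VM
BARE_METAL_SECONDS = 38.28  # single transcode on a bare metal node
NODE_PEAK_FPS = 232.78      # per-node throughput at 9 concurrent transcodes
VM_FPS = 8.98               # average throughput of one dual-core VM
WORKER_NODES = 2            # bare metal worker nodes in the cluster
CORES_PER_NODE = 36         # cores per bare metal node
EQUIVALENT_VM_NODES = 50    # quoted size of the comparable VM cluster
CORES_PER_VM = 2            # dual-core VMs

# Single-task speedup: how much sooner one customer gets a result.
speedup = VM_SECONDS / BARE_METAL_SECONDS

# How many source-rate streams one node sustains at peak throughput.
realtime_streams = NODE_PEAK_FPS / SOURCE_FPS

# Aggregate throughput, and the number of VMs needed to match it.
cluster_fps = NODE_PEAK_FPS * WORKER_NODES
equivalent_vms = cluster_fps / VM_FPS

# Core-for-core gain: VM cores needed per bare metal core, minus one.
core_for_core_gain = (
    (EQUIVALENT_VM_NODES * CORES_PER_VM) / (WORKER_NODES * CORES_PER_NODE) - 1
)

print(f"single-task speedup:  {speedup:.1f}x")
print(f"realtime streams:     {realtime_streams:.1f} per node")
print(f"cluster FPS:          {cluster_fps:.1f}")
print(f"equivalent VMs:       {equivalent_vms:.1f}")
print(f"core-for-core gain:   {core_for_core_gain:.0%}")
```

Dividing the measured cluster FPS by the per-VM rate gives slightly more than 50 VMs, so the rounded 50-node comparison and the resulting ~39% core-for-core figure are, if anything, on the conservative side.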
These numbers highlight several benefits of using bare metal machines over standard VMs:
First, during times of reduced load, the video transcoding software is able to take advantage of the increased parallelism of each node, resulting in better performance and a faster turnaround time for users. While expected due to the discrepancy in size between the VMs and the bare metal instances, it’s valuable to point out that a larger machine results in improved user experience during lower cluster utilization.
Second, the bare metal nodes are capable of handling a large number of streams at realtime speeds, which is critical for delivering live video over the web. This performance is comparable to many expensive, dedicated hardware solutions for video transcoding. Since some workflows are not always optimized (or feasible) for horizontal scalability, the raw performance of a single instance provides compatibility for vertically-oriented tasks in a cloud environment.
Third, each bare metal node is able to handle the workload of ~25 of the smaller VM nodes at peak utilization. When priced competitively, this could lead to significant cost savings for clusters handling large volumes of data.
A final note on performance is the reliability of the throughput numbers on the Bare Metal machines. At the maximum amount of parallelism we tested for each whole cluster (40 simultaneous transcodes on 40 VMs, 36 simultaneous transcodes on 2 Bare Metal nodes), the VMs showed significantly less reliable throughput:
- On the VMs, each with 1x concurrency, the average transcode duration was 343s, but the maximum was 664s.
- On the Bare Metal nodes, each with 18x concurrency (36 transcodes across 2 nodes), the average duration was 242.1s, and the maximum was 300.3s.
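A simple way to compare the two results is the ratio of the slowest transcode to the average one. The sketch below uses only the durations listed above; the helper name is ours.

```python
def worst_case_ratio(average_seconds, maximum_seconds):
    """Return how many times longer the slowest transcode took than the average."""
    return maximum_seconds / average_seconds


vm_ratio = worst_case_ratio(343.0, 664.0)
bare_metal_ratio = worst_case_ratio(242.1, 300.3)

print(f"VM worst case:         {vm_ratio:.2f}x the average")
print(f"Bare Metal worst case: {bare_metal_ratio:.2f}x the average")
```

A ratio close to 1 means the slowest job finished near the average time, which is what makes wait times predictable for customers.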
The likely cause of this degraded performance on the VMs was noisy neighbors: other VMs on the same physical host, possibly being used by other customers of the IaaS provider, using shared resources and reducing the performance of our VMs. This is a common occurrence in cloud VM environments, with a real impact: in this case, a customer waiting twice as long for a transcode to complete. On the Bare Metal nodes, there are never any noisy neighbors, resulting in more consistent and reliable performance.
When delivering live video or providing real-time processing capabilities, it is even more crucial that you are able to predict the required resources for your tasks -- a “noisy neighbor” during a live stream could potentially cause delays processing the broadcast, affecting all viewers or even bringing down the stream entirely. For effective capacity planning and scaling, you must be able to accurately predict the throughput of your infrastructure.
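One way to turn measured throughput into a capacity estimate is sketched below. The per-node figure of 232.78 FPS and the 23.98 fps source rate come from our tests; the 50-stream target and the 20% headroom are hypothetical values chosen purely for illustration.

```python
import math


def nodes_required(streams, source_fps, node_fps, headroom=0.2):
    """Estimate the nodes needed to keep `streams` transcodes at realtime speed.

    `headroom` is the fraction of each node's throughput held in reserve
    (a hypothetical planning margin, not a measured value).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_fps = node_fps * (1 - headroom)
    # Only whole streams count: a partially served stream falls behind realtime.
    streams_per_node = math.floor(usable_fps / source_fps)
    if streams_per_node < 1:
        raise ValueError("a node cannot sustain even one realtime stream")
    return math.ceil(streams / streams_per_node)


with_margin = nodes_required(50, source_fps=23.98, node_fps=232.78)
without_margin = nodes_required(50, source_fps=23.98, node_fps=232.78, headroom=0.0)
print(with_margin, without_margin)
```

An estimate like this is only as good as the throughput figure fed into it, which is why consistent per-node performance matters: on infrastructure with noisy neighbors, the headroom would have to be far larger to give the same assurance.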
Additional Use Cases
We ran a number of other tasks including chunked transcoding and VR video processing. While the ultimate benefits are roughly the same as described above, here are a few interesting results:
- For “chunked transcoding” (splitting the file into smaller chunks and encoding them in parallel), we were able to transcode the 1080p 2 minute 8 second video in a total time of 10s.
- The effect is even more pronounced with a 32 minute 4K file (roughly 4x the resolution of a 1080p file), which was transcoded in around 5 minutes.
- We were able to process and transcode an equirectangular 30fps 4K 360° (VR) source video into a cube-mapped 1440p 6Mbps output file at more than 60fps (2x+ realtime), based on Facebook’s approach to cube mapping.
These examples benefit from parallelization on a local file source, meaning a single machine may transfer the file locally once but process 10-20 segments simultaneously, reducing network utilization while vastly improving the overall time to process a single file.
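The chunked approach can be sketched as follows. This is a simplified illustration under stated assumptions, not our production pipeline: the 8 second chunk length and 16 workers are hypothetical, and `transcode_chunk` is a placeholder where a real worker would invoke the encoder on its segment.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(duration_seconds, chunk_seconds):
    """Split a duration into (start, length) pairs that cover the whole file."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    chunks = []
    start = 0.0
    while start < duration_seconds:
        length = min(chunk_seconds, duration_seconds - start)
        chunks.append((start, length))
        start += length
    return chunks


def transcode_chunk(chunk):
    """Placeholder for encoding one segment of the locally stored source file."""
    start, length = chunk
    return {"start": start, "length": length}


def transcode_in_parallel(duration_seconds, chunk_seconds, workers):
    """Encode all chunks concurrently and return the results in timeline order."""
    chunks = plan_chunks(duration_seconds, chunk_seconds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so segments can be joined directly.
        return list(pool.map(transcode_chunk, chunks))


# The 2 minute 8 second test video (128 seconds) in hypothetical 8 second chunks.
segments = transcode_in_parallel(duration_seconds=128, chunk_seconds=8, workers=16)
print(len(segments), "segments")
```

In practice, chunk boundaries need to fall on keyframes so that the encoded segments join cleanly, and the source file is transferred to the machine only once before all workers read from local disk.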
Oracle is also aggressively pricing their outbound bandwidth for their cloud services. While we did not test their outbound capabilities for this article, the combination of bare metal and less expensive outbound bandwidth might present a compelling option for several video distribution use cases including:
- Video packaging origin behind global CDNs for streaming delivery
- Secure origin service supporting WAN optimization deployments (P2P, transparent caching, etc.) inside and across the corporate firewall
- “Just-in-time” transcoding service for serving dynamic versions of an asset for different browsers, network environments, etc.
When evaluating our configuration and tests, one might ask why a standard cloud deployment might have so many smaller machines instead of using fewer, larger virtual machines. We regularly evaluate overall throughput against the cost and maintainability of our infrastructure, and unfortunately, the failure rate of standard cloud-based virtual machines (whether due to noisy neighbors, virtualization upgrades and migrations, or underlying hardware failures) is relatively high. Since it is often more cost effective to simply delete the instance and restart any tasks running on it versus spending precious support time trying to resurrect a virtual instance, we run smaller instances to reduce the number of potentially affected tasks.
This is an important consideration when evaluating the appropriate cloud solution: dedicated bare metal instances will likely fail less frequently due to the lack of virtualization complexity, but any failures will likely have a much larger impact due to their increased capacity and throughput.
In closing, the Oracle Bare Metal Cloud Service offers a variety of benefits for video processing tasks (and similar computationally intensive “big data” operations). It sits alongside other Oracle Cloud services, such as the Virtual Cloud Network, which are designed to make it easy to extend and configure your on-premises deployments to work with cloud environments.
However, when evaluating any cloud-based infrastructure solution, the first step is to understand the nature of the workloads (applications, data, and requirements) that must be migrated. As an enterprise video platform, uStudio works with a diverse set of customers on video management, distribution, measurement, and more, and we can help find the best solution for your video initiatives.
Contact us to find out how we can power any video use case using Oracle’s new cloud solutions.