After spending one and a half years at an agency building the new website of the Swiss meteorological agency, I recently decided to return to the startup world and accepted a position as a senior fullstack engineer at AirConsole.

AirConsole is a cloud-based video game console that lets players use their smartphones as controllers. It offers a wide range of multiplayer games that can be played directly from the web browser, on Android TV and soon in cars.

One of the challenges we face is improving the quality assurance process for our games. To do so, we need to record at least two video streams simultaneously: the player's phone and the screen. Then we need to merge the videos into a single, understandable video report that the game devs can use to reproduce and fix the issues.

Camera

The first problem I needed to solve was: how do you record two synchronised videos at the same time?

After some research, we realised that dashcams do exactly that: they have two cameras, one at the front filming the road and one at the rear.

We decided to go for the Vantrue N4:

The back camera (normally used to film the drive) will film the tester. That way, they can explain the problem while it happens.

The process

Now that we have a camera, we need to merge the three streams it produces. The filenames follow this format:

20230215_174131_0696_E_A.MP4
20230215_174131_0696_E_B.MP4
20230215_174131_0696_E_C.MP4
YYYYMMDD_HHmmSS_<id>_<event or normal recording>_<stream_id>.MP4
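
Since the three streams of one recording share everything up to the stream id, a simple shell glob is enough to pick them up together. A small sketch (the prefix is just the example above; adapt it to the recording you need):

# The three streams of one recording only differ in the last field (A/B/C).
prefix=20230215_174131_0696_E
ls "${prefix}"_?.MP4    # lists ..._A.MP4, ..._B.MP4 and ..._C.MP4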

First, create an accident

The event recording is used to record accidents: the dashcam has a gyroscope and can detect a collision with another vehicle. This event can also be triggered by pressing a button on the camera.
At AirConsole, the camera records all the time, and when the tester sees a bug, they press the accident button so that the recording is labeled as an accident in the filename. That way, it cannot be removed as easily from the SD card, and the tester can easily find the stream files at the end of the day to file the different issues.
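
At the end of the day, the event recordings can be pulled off the SD card with a couple of shell commands. A small sketch, assuming the card is mounted and copied into a flat working folder (the mount path and folder name are made up):

# Copy only the event recordings ("_E_" in the name) into a working folder
mkdir -p bug_reports
cp /media/sdcard/*_E_*.MP4 bug_reports/

# One line per recording (each recording consists of three stream files)
cd bug_reports
ls *_E_*.MP4 | cut -d_ -f1-4 | sort -u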

Then, improve the footage

For the bug report, we do not want to watch 3 different videos at the same time. We want one video that we can send over to explain the issue. Therefore, we need to merge the video streams.

FFMPEG

FFmpeg is an amazing tool for merging videos, but by default it runs on the CPU, which is quite slow: only about 2x faster than real time. That is unacceptable for us.
But we have an unused gaming PC with an NVIDIA GTX 1070 in it, so we will try to use it to merge our videos.

Installing the tools

NVIDIA's tools are proprietary, so you need to install the suite yourself on your PC. You first need to install the CUDA toolkit, as stated in the docs:

To compile FFmpeg, the CUDA toolkit must be installed on the system, though the CUDA toolkit is not needed to run the FFmpeg compiled binary.

Then you need to recompile FFmpeg to enable libnpp, which allows FFmpeg to use the NVIDIA GPU. I found no way around recompiling FFmpeg; it is quite slow but achievable.

The best way to compile FFmpeg on Windows is to use media-autobuild_suite, which has a section dedicated to CUDA: https://github.com/m-ab-s/media-autobuild_suite#notes-about-cuda-sdk
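
Once the build is done, it is worth checking that the binary actually picked up the CUDA pieces before running the big command. A quick sanity check (a sketch; run it from the MSYS2 shell the suite provides, or any shell that has grep):

# scale_npp needs a build with --enable-libnpp (and therefore --enable-nonfree),
# overlay_cuda and h264_nvenc need the CUDA/NVENC support enabled as well
ffmpeg -hide_banner -hwaccels                               # should list "cuda"
ffmpeg -hide_banner -filters  | grep -E "scale_npp|overlay_cuda"
ffmpeg -hide_banner -encoders | grep h264_nvenc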

Merge the files

In the end, we want the following layout for our videos: the main stream on the left two thirds of the frame, with the two other streams stacked on top of each other on the right.

The command I came up with is the following:

ffmpeg -y \
-hwaccel cuda \
-hwaccel_output_format cuda \
-i 20230215_174131_0696_E_A.MP4 \
-hwaccel cuda \
-hwaccel_output_format cuda \
-i 20230215_174131_0696_E_B.MP4 \
-hwaccel cuda \
-hwaccel_output_format cuda \
-i 20230215_174131_0696_E_C.MP4 \
-filter_complex "[0:v]scale_npp=960:-2:format=yuv420p,hwdownload,pad=w=1.5*iw:h=ih:x=0:y=0,hwupload_cuda,scale_npp=format=nv12[base];[1:v]scale_npp=480:-2:format=nv12[overlay_video];[2:v]scale_npp=480:-2:format=nv12[overlay_video2];[base][overlay_video]overlay_cuda=x=960:y=0:repeatlast=false[intermediate];[intermediate][overlay_video2]overlay_cuda=x=960:y=270:repeatlast=false" \
-c:v h264_nvenc output.mp4

Some of the flags are not very complicated to understand:

The "-y" option specifies to overwrite output files without asking.
The "-hwaccel cuda" option enables CUDA hardware acceleration for decoding video frames.
The "-hwaccel_output_format cuda" option specifies the output format for hardware acceleration.
The "-i" option followed by the input video file path specifies the input video files to process.
The "-c:v h264_nvenc" option specifies that the output video should be encoded using the H.264 codec with the NVIDIA NVENC encoder.

However, the filter_complex is the most difficult part (a more readable way to write the same graph is sketched after this breakdown):

  1. [0:v] - This selects the first video stream as the main/base video stream for processing.
  2. scale_npp=960:-2:format=yuv420p - This scales the selected video to a width of 960 pixels, with the height automatically adjusted to maintain the aspect ratio, and converts the pixel format to yuv420p (format is needed to pass it back to cpu).
  3. hwdownload - This downloads the scaled video frame from the GPU to system memory. We need to download it because the next filter (pad) runs on the CPU.
  4. pad=w=1.5*iw:h=ih:x=0:y=0 - This adds padding to the left and right sides of the video frame to achieve a width of 1.5 times the original width, and maintains the original height.
  5. hwupload_cuda - This uploads the padded video frame back to the GPU for further processing.
  6. scale_npp=format=nv12 - This converts the pixel format of the video frame to nv12. This conversion is necessary because the overlay_cuda filter that is used later in the filterchain requires the input video frames to be in the nv12 format.
  7. [base] - This renames the output of the previous filterchain to "base" for later use. This will be our base video: the left part of the final frame.
  8. [1:v] - This selects the second video stream for processing as an overlay on top of the base video stream.
  9. scale_npp=480:-2:format=nv12 - This scales the selected video stream to a width of 480 pixels, with the height automatically adjusted to maintain the aspect ratio, and converts the pixel format to nv12. This conversion is necessary because the overlay_cuda filter that is used later in the filterchain requires the input video frames to be in the nv12 format.
  10. [overlay_video] - This renames the output of the previous filterchain to "overlay_video" for later use. This will be the top right part of the video.
  11. [2:v] - This selects the third video stream for processing as another overlay on top of the base video stream.
  12. scale_npp=480:-2:format=nv12 - This scales the selected video stream to a width of 480 pixels, with the height automatically adjusted to maintain the aspect ratio, and converts the pixel format to nv12.
  13. [overlay_video2] - This renames the output of the previous filterchain to "overlay_video2" for later use. This will be the bottom right part of the video.
  14. [base][overlay_video]overlay_cuda=x=960:y=0:repeatlast=false - This combines the base and first overlay video streams by overlaying the "overlay_video" stream on top of the "base" stream, offset by a horizontal displacement of 960 pixels and a vertical displacement of 0 pixels. The repeatlast=false option is used to avoid repeating the last frame of the overlay stream when it ends before the base stream.
  15. [intermediate] - This renames the output of the previous filterchain to "intermediate" for later use.
  16. [intermediate][overlay_video2]overlay_cuda=x=960:y=270:repeatlast=false - This combines the intermediate stream (output from the previous filterchain) and the second overlay stream by overlaying the "overlay_video2" stream on top of the "intermediate" stream, offset by a horizontal displacement of 960 pixels and a vertical displacement of 270 pixels. The repeatlast=false option is used to avoid repeating the last frame of the overlay stream when it ends before the intermediate stream.
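
Writing the whole filtergraph on one line quickly becomes unreadable. Here is a sketch of the exact same graph built up in a bash variable, one chain per line, and passed as a single argument:

# Build the filtergraph piece by piece for readability (identical to the graph above)
filter="[0:v]scale_npp=960:-2:format=yuv420p,hwdownload,pad=w=1.5*iw:h=ih:x=0:y=0,hwupload_cuda,scale_npp=format=nv12[base];"
filter+="[1:v]scale_npp=480:-2:format=nv12[overlay_video];"
filter+="[2:v]scale_npp=480:-2:format=nv12[overlay_video2];"
filter+="[base][overlay_video]overlay_cuda=x=960:y=0:repeatlast=false[intermediate];"
filter+="[intermediate][overlay_video2]overlay_cuda=x=960:y=270:repeatlast=false"

ffmpeg -y \
-hwaccel cuda -hwaccel_output_format cuda -i 20230215_174131_0696_E_A.MP4 \
-hwaccel cuda -hwaccel_output_format cuda -i 20230215_174131_0696_E_B.MP4 \
-hwaccel cuda -hwaccel_output_format cuda -i 20230215_174131_0696_E_C.MP4 \
-filter_complex "$filter" \
-c:v h264_nvenc output.mp4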

Performance

Because we're using the GPU instead of the CPU, we can run much more in parallel, and with the current setup we achieve a speed of ~20x real time. So a 3-minute bug report now takes about 9 seconds to encode.
The limiting factor seems to be the decoding part of the GPU.
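
If you want to reproduce the measurement, FFmpeg's -benchmark flag prints timing statistics at the end of the run, and the speed= value in the progress line is the real-time ratio quoted above. A sketch, reusing the $filter variable from the previous section:

# Same merge command as before, with -benchmark added to report timings at the end
ffmpeg -y -benchmark \
-hwaccel cuda -hwaccel_output_format cuda -i 20230215_174131_0696_E_A.MP4 \
-hwaccel cuda -hwaccel_output_format cuda -i 20230215_174131_0696_E_B.MP4 \
-hwaccel cuda -hwaccel_output_format cuda -i 20230215_174131_0696_E_C.MP4 \
-filter_complex "$filter" \
-c:v h264_nvenc output.mp4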

Result

Tinkering with fancy solutions

We do not want the tester to hold the rear camera above the phones all day long, so I started my 3D printer, got some wood and built this masterpiece:

The 3D-printed holder