Multiple Gstreamer Video Input Switching and Compositing

I use two cameras for my mobile livestreaming setup for birding. I’ve got one wide camera that I mainly use as I walk to show the general area, and a zoom camera that I use to focus in on a specific bird. I can switch between the two cameras, and sometimes I also want to display both.

I get asked how it works, so I figured I’d write a high-level blog post about the setup without getting into the nitty-gritty details.

Here’s an example, with some debug overlays enabled. The zoom camera is the main view, with the inset being the wide view of the area I’m birding in.

As I primarily stream from Nvidia Jetson devices, I’ll focus on those, but the following should be adaptable to anything that supports gstreamer and some sort of compositor (preferably hardware-accelerated if you’re on an SBC like the Jetsons).

Software Used

There are two main ingredients to make this all work: gstreamer and gst-interpipe. I use gstd to make things easier to automate.

Gstreamer is obvious: it’s required to run everything. gst-interpipe is used to split the pipelines into different pieces. I’ve structured things as a pipeline for each of the inputs, a pipeline for the compositing, and a pipeline for the encoding and output. Why not split each input into its own pipeline, with audio and video separate? I had some strange audio/video synchronization issues with each in its own separate pipeline, and found that keeping them in the same pipeline kept their timestamps consistent with each other.

gstd is just a daemon that wraps gstreamer pipelines and allows them to be controlled through an API. I chose it because I didn’t want to get bogged down writing C/C++ code to use gstreamer directly.
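To give an idea of the workflow, here’s a minimal sketch using gstd-client, the command-line client that ships with gstd (the pipeline and element names are just placeholders, and the pipeline itself is a trivial test pattern):

# start the daemon (if it isn't already running)
gstd

# create a named pipeline from a gst-launch style description, then start it
gstd-client pipeline_create testpipe "videotestsrc name=src ! autovideosink"
gstd-client pipeline_play testpipe

# change a property on an element of the running pipeline
gstd-client element_set testpipe src pattern ball

# stop and remove the pipeline when done
gstd-client pipeline_stop testpipe
gstd-client pipeline_delete testpipe

The pipelines below are created and controlled the same way, just with longer descriptions.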

The Pipelines

This is an example of an input pipeline for a V4L2 video device with an ALSA audio device. In this case, it captures raw (uncompressed) video and audio from a UVC capture device. There would be two of these input pipelines for two capture devices, and an input could be audio-only or video-only. It needn’t be a V4L2 source either; it could be an srtsrc, rtpsrc, or any other source that gstreamer supports. If the input is encoded, it will need to be decoded to raw in the input pipeline. Each input needs a unique name property on its interpipesink.

I have found that putting the audio and video inputs in the same pipeline can help with synchronization issues that sometimes crop up. Note that Nvidia’s compositor doesn’t seem to like mixed framerates, so there is an explicit videorate to keep it happy. There isn’t any reason why all of these pipelines couldn’t be put into one big pipeline, but I split them up for better separation of concerns.

alsasrc device=hw:<alsa src id> ! queue ! audioconvert ! audio/x-raw,format=S16LE,channels=2 ! volume volume=1.0 mute=False ! queue ! interpipesink name=audio-input-1 \
v4l2src device=/dev/video<v4l2 src id> ! videorate ! video/x-raw,framerate=<common framerate> ! queue ! interpipesink name=video-input-1

Next comes the compositor. In this example, there are two 1080p sources, with one resized to 480 × 270 (a quarter of 1080p in each dimension) as a picture-in-picture (PiP) inset in the bottom-left corner. The compositor works on RGBA only, so there’s an nvvidconv step to convert the two inputs to RGBA. On a non-Jetson system, this would be accomplished using the videoconvert and compositor elements (or similar); a rough equivalent is sketched after the Jetson pipeline below.

interpipesrc format=time listen-to=video-input-1 block=false name=pip-in-1 ! nvvidconv ! 'video/x-raw(memory:NVMM),format=RGBA' ! comp.sink_0 \
interpipesrc format=time listen-to=video-input-2 block=false name=pip-in-2 ! nvvidconv ! 'video/x-raw(memory:NVMM),format=RGBA' ! comp.sink_1 \
nvcompositor name=comp sink_0::xpos=0 sink_0::ypos=0 sink_0::width=1920 sink_0::height=1080 sink_0::zorder=0 sink_1::xpos=10 sink_1::ypos=750 sink_1::width=480 sink_1::height=270 sink_1::zorder=1 ! 'video/x-raw(memory:NVMM),format=RGBA,width=1920,height=1080' ! interpipesink name=mixer-stream-pip
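For reference, here’s roughly what the software-only equivalent could look like on a non-Jetson machine, using videoconvert and the stock compositor element. This is an untested sketch; the pad properties (xpos, ypos, width, height, zorder) mirror the nvcompositor ones above.

interpipesrc format=time listen-to=video-input-1 block=false name=pip-in-1 ! videoconvert ! video/x-raw,format=RGBA ! comp.sink_0 \
interpipesrc format=time listen-to=video-input-2 block=false name=pip-in-2 ! videoconvert ! video/x-raw,format=RGBA ! comp.sink_1 \
compositor name=comp sink_0::xpos=0 sink_0::ypos=0 sink_0::width=1920 sink_0::height=1080 sink_0::zorder=0 sink_1::xpos=10 sink_1::ypos=750 sink_1::width=480 sink_1::height=270 sink_1::zorder=1 ! video/x-raw,format=RGBA,width=1920,height=1080 ! interpipesink name=mixer-stream-pip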

And finally, the encoder block. In this case, it encodes the video to HEVC (H.265) and the audio to Opus. The video and audio are then muxed into MPEG-TS and sent over the network using SRT. Here, the audio is taken from the first input, and the video is the output of the compositor.

interpipesrc format=time listen-to=audio-input-1 is-live=true name=audio-stream-src ! volume volume=1.0 mute=False ! audioconvert ! opusenc ! opusparse ! queue ! mux. \
interpipesrc format=time listen-to=mixer-stream-pip block=false name=stream-in ! nvvidconv ! queue ! nvv4l2h265enc ! h265parse config-interval=-1 ! mux. \
mpegtsmux name=mux ! rndbuffersize max=1316 min=1316 ! srtsink uri=srt://<srt host>:<srt port>
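To tie it together, each of the three pipeline descriptions above gets created as its own named pipeline in gstd and then started. A sketch, with the descriptions abbreviated (…) and with made-up pipeline names (input1, input2, mixer, output) that I’ll reuse in the examples below:

gstd-client pipeline_create input1 "alsasrc ... ! interpipesink name=audio-input-1 v4l2src ... ! interpipesink name=video-input-1"
gstd-client pipeline_create input2 "alsasrc ... ! interpipesink name=audio-input-2 v4l2src ... ! interpipesink name=video-input-2"
gstd-client pipeline_create mixer "interpipesrc listen-to=video-input-1 ... ! interpipesink name=mixer-stream-pip"
gstd-client pipeline_create output "interpipesrc listen-to=audio-input-1 ... ! srtsink uri=srt://host:port"

gstd-client pipeline_play input1
gstd-client pipeline_play input2
gstd-client pipeline_play mixer
gstd-client pipeline_play output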

Okay, But How Does it Work?

The “magic” in the switching is being able to dynamically relink interpipesink elements to interpipesrc elements: simply change the listen-to property on an interpipesrc to change inputs. If compositing (PiP) isn’t needed, the compositor pipeline can be omitted and the encoder/muxer/transmitter pipeline’s interpipesrc can be switched directly. With the compositor, the inset and main source can be swapped by changing which interpipesink each interpipesrc listens to.
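With gstd, each switch ends up being a one-liner. A sketch, reusing the hypothetical pipeline names from above (output holds the encoder’s interpipesrc named stream-in, mixer holds pip-in-1 and pip-in-2):

# no compositor: point the encoder's video source at the other camera
gstd-client element_set output stream-in listen-to video-input-2

# with the compositor: swap which camera is the main view and which is the inset
gstd-client element_set mixer pip-in-1 listen-to video-input-2
gstd-client element_set mixer pip-in-2 listen-to video-input-1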

Changing the layout of the compositor is similar, but takes more steps: each property on each compositor sink needs to be changed individually. Care needs to be taken that the inset’s zorder is a higher number than the main image’s, otherwise it’ll end up hidden behind it. The one-property-at-a-time limitation is particular to gstd; code using gstreamer directly can change several properties at once.
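For example, moving the inset from the bottom-left to the top-right corner would look something like the following. I’m assuming here that gstd accepts the pad-qualified property names (sink_1::xpos and friends) through element_set; if your version addresses pad properties differently, the sequence is the same, only the syntax changes.

# new inset position: 1920 - 480 - 10 = 1430 from the left, 10 from the top
gstd-client element_set mixer comp sink_1::xpos 1430
gstd-client element_set mixer comp sink_1::ypos 10
# width, height and zorder are unchanged here; if the inset were being promoted
# to the main view, its zorder would need updating too so it isn't hidden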

There’s also no reason why more inputs can’t be used. A Jetson Nano, for instance, can do 4 simultaneous inputs, depending on the input. Simply add more input pipelines and more sinks to the nvcompositor, and set the properties appropriately.
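A third input, for instance, would just be another copy of the input pipeline (with unique interpipesink names) plus an extra branch and sink pad on the compositor, along these lines (abbreviated sketch, placing the new inset in the bottom-right corner):

interpipesrc format=time listen-to=video-input-3 block=false name=pip-in-3 ! nvvidconv ! 'video/x-raw(memory:NVMM),format=RGBA' ! comp.sink_2 \
nvcompositor name=comp ... sink_2::xpos=1430 sink_2::ypos=750 sink_2::width=480 sink_2::height=270 sink_2::zorder=2 ! ...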

Updates

07-Feb-2024: I made a small copy-paste error in one of the gstreamer pipelines; this has been fixed. I also made a couple of edits/clarifications based on questions and feedback.
