Benchmarking Privacy-Preserving Motion Detection

Monday, February 1, 2021


The prototype is composed of three elements: a camera, a gateway, and a backend. It is based on ffmpeg and implemented as three bitstream filters, one for each element of the system.


The camera captures and encodes video in H.264 format. It extracts the motion vectors of the H.264-encoded stream and encrypts them with functional encryption, using the CiFEr library. The encrypted motion vectors are then bundled as side data alongside the AES-encrypted video stream. This is possible because the motion vectors are stored in H.264 Network Abstraction Layer (NAL) units of type H264_NAL_SEI, while the image data of the video itself is stored in NAL units of type H264_NAL_SLICE and H264_NAL_IDR_SLICE. These NAL unit types can therefore be encrypted differently without interfering with one another. SEI (Supplemental Enhancement Information) messages are additional messages that can carry data in any format, and that data need not be video-related, which is why we store the functionally encrypted motion vectors in them. Symmetrically encrypting the two slice NAL unit types is enough to make the video unreadable to anyone who does not possess the encryption key.
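The split by NAL unit type described above can be sketched in Python. This is only an illustration, not the prototype's ffmpeg bitstream filter: the numeric type values are the ones defined by H.264 (and used by the ffmpeg constants of the same name), while `split_annexb` and `treatment` are hypothetical helpers, the input is assumed to be an Annex B byte stream, and leaving other NAL types (such as parameter sets) in the clear is an assumption of this sketch.

```python
# NAL unit type values from the H.264 specification; the names mirror
# the ffmpeg constants mentioned in the text.
H264_NAL_SLICE = 1
H264_NAL_IDR_SLICE = 5
H264_NAL_SEI = 6

START_CODE = b"\x00\x00\x01"


def nal_unit_type(nal: bytes) -> int:
    """Return the 5-bit nal_unit_type stored in the first header byte."""
    return nal[0] & 0x1F


def split_annexb(stream: bytes) -> list:
    """Split an Annex B byte stream into NAL units (hypothetical helper).

    Simplification: zero bytes right before a start code are treated as
    padding, and emulation prevention bytes are not handled.
    """
    nals = []
    start = stream.find(START_CODE)
    while start != -1:
        begin = start + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = len(stream) if nxt == -1 else nxt
        nal = stream[begin:end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
        start = nxt
    return nals


def treatment(nal: bytes) -> str:
    """Say which kind of encryption applies to a NAL unit in this sketch."""
    kind = nal_unit_type(nal)
    if kind == H264_NAL_SEI:
        return "functional"  # side data: encrypted motion vectors
    if kind in (H264_NAL_SLICE, H264_NAL_IDR_SLICE):
        return "symmetric"   # image data: AES-encrypted
    return "clear"           # assumption: other NAL types are left untouched
```

For example, `[treatment(n) for n in split_annexb(data)]` lists, unit by unit, how a stream `data` would be handled.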


The gateway uses the corresponding functional encryption key to evaluate whether there is motion within a group of video frames. For each group of frames, if there is motion, the gateway forwards the symmetrically encrypted group of frames to the backend. The data transmitted over the wire is thus reduced to the interesting segments, those where there is motion in the video. Since the video stream is encrypted, the gateway cannot learn anything about the image data inside the video, and therefore does not need to be trusted. The untrusted gateway can perform the computationally intensive motion detection on behalf of the potentially low-powered camera.


Unlike the gateway, the backend can decrypt the received AES encrypted video stream because it knows the symmetric encryption key. The decrypted video stream contains only frames of the original video where motion was detected by the gateway. The backend is able to play the video.


Frames per second

We measure how the number of motion vectors used affects the number of frames that can be processed per second on the camera side and on the gateway side. We also measure how turning on functional encryption of the motion vectors affects performance. Any given frame contains a certain number of motion vectors. We do not need to use all of them, but we want to know the performance impact of using a given number of them.

Side-data size

We measure how the size of the additional side-data varies with the number of motion vectors used. We also measure whether functional encryption of the side-data affects the size of the output video sent to the gateway.

Backend video size

The gateway removes the segments of the video stream in which no significant movement occurs. A segment is a group of pictures (GOP). We define a threshold on the sum of the motion vector norms and call it the GOP threshold. If the sum computed for a GOP exceeds the GOP threshold, the gateway considers that movement is detected and forwards the segment to the backend. Otherwise, the segment is removed. We measure the size of the stream received by the backend and compare it to the size of the original input video produced by the camera to see how much traffic can be saved.
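The forwarding rule can be illustrated with a small Python sketch. It works on plaintext motion vectors, whereas the prototype evaluates the encrypted vectors with a functional encryption key. The Euclidean norm, the `(dx, dy)` representation, and all function names are assumptions made for this example.

```python
import math


def gop_motion_score(motion_vectors):
    """Sum of the Euclidean norms of the (dx, dy) motion vectors of one GOP."""
    return sum(math.hypot(dx, dy) for dx, dy in motion_vectors)


def filter_gops(gops, gop_threshold):
    """Keep only the GOPs whose motion score exceeds the GOP threshold."""
    return [
        gop for gop in gops
        if gop_motion_score(gop["motion_vectors"]) > gop_threshold
    ]


def traffic_saved(original_bytes, forwarded_bytes):
    """Fraction of the original stream that did not have to be forwarded."""
    return 1 - forwarded_bytes / original_bytes
```

With a GOP threshold of 10, a GOP whose vectors are `[(3, 4), (0, 0)]` has a score of 5 and is dropped, while one with `[(30, 40), (6, 8)]` has a score of 60 and is forwarded.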



The benchmarks were run on a single machine hosting all three elements of the system (camera, gateway and backend). This machine is a five-year-old desktop computer with an Intel i7-6700K CPU clocked at 4.0 GHz, with 4 cores and 8 threads, and 32 GB of memory.


All measurements were performed with a pre-recorded 1080p H.264 video at 30 frames per second as input for the camera. For small numbers of motion vectors (up to 95), the camera outputs more frames per second than the gateway. For larger numbers of motion vectors, the gateway and the camera appear to reach similar FPS rates. The camera is still able to deliver a stable 30 frames per second with 40 motion vectors. The gateway drops below 30 frames per second if the number of motion vectors is greater than 13.

With only 3 motion vectors, a small number that is still sufficient for motion detection, the camera outputs 233 FPS and the gateway 67 FPS. It would therefore even be possible to process a 60 frames per second video in real time without any slowdown, with the motion vectors encrypted.
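The reasoning behind this claim can be written as a short Python check. It assumes the camera and the gateway work as a pipeline, so that the slower stage bounds the overall throughput. The function names are hypothetical; the figures are the ones measured above.

```python
def pipeline_fps(camera_fps, gateway_fps):
    """Throughput of the camera-to-gateway pipeline, bounded by the slower stage."""
    return min(camera_fps, gateway_fps)


def is_real_time(camera_fps, gateway_fps, source_fps):
    """True if both stages keep up with the frame rate of the source video."""
    return pipeline_fps(camera_fps, gateway_fps) >= source_fps


# Measured rates with 3 encrypted motion vectors.
CAMERA_FPS = 233
GATEWAY_FPS = 67

ok_30 = is_real_time(CAMERA_FPS, GATEWAY_FPS, 30)  # 30 fps source video
ok_60 = is_real_time(CAMERA_FPS, GATEWAY_FPS, 60)  # 60 fps source video
```

Under this assumption the gateway is the bottleneck at 67 FPS, which is above both 30 and 60 frames per second.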

The same measurements were also taken with motion vector encryption disabled. In that case, the camera and the gateway do not seem to be affected by the number of motion vectors and deliver a stable performance of 419 FPS on average.

Motion vectors make up the side-data sent alongside the video stream from the camera to the gateway. For a small number of motion vectors (1 to 10), the size overhead is roughly 10% when encryption is disabled, and in the 10%-20% range with encryption. This is still acceptable. However, as the number of motion vectors grows, for example to 1500, the overhead reaches 41% without encryption and 1431% with encryption. It is therefore advisable to use a small number of motion vectors to minimize the overhead. This is not a problem, since motion detection can be performed properly with only 3 motion vectors. With 3 motion vectors, the overhead is only 13% with encryption and 9% without encryption. Thus, encryption adds very little overhead when the number of motion vectors is small.
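To make these percentages concrete, here is a minimal Python sketch. It assumes that the overhead is the side-data size expressed relative to the size of the video stream without side-data. The text does not spell out this definition, and the byte counts in the usage note are made up for illustration.

```python
def overhead_percent(video_bytes, side_data_bytes):
    """Side-data size as a percentage of the video stream size (assumed definition)."""
    return 100 * side_data_bytes / video_bytes


def side_data_bytes_for(video_bytes, percent):
    """Side-data size implied by a given overhead percentage."""
    return video_bytes * percent / 100
```

Under this definition, a hypothetical 1,000,000-byte stream with the 13% overhead measured for 3 encrypted motion vectors carries 130,000 bytes of side-data, while the 1431% measured for 1500 encrypted motion vectors would mean 14,310,000 bytes, more than fourteen times the video itself.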

We observed empirically that, once the backend system has started playing the gateway stream, there is no significant delay other than the one due to network latency. If the number of motion vectors is increased, the camera or the gateway may become unable to process the stream fast enough. Since the source video runs at 30 frames per second, a gateway that cannot process frames at that rate delivers them to the backend system slower than the video playback speed, and the video may then play slower than expected. As shown in Figure 1, the maximum number of motion vectors for which processing still happens at 30 frames per second or more is 40 for the camera and 13 for the gateway. However, 3 motion vectors are sufficient for proper movement detection. There is therefore plenty of room for processing videos with higher frame rates.

Next, we used a 10 MB video as input and measured the size of the output video sent to the backend for various values of the Group of Pictures (GOP) threshold. The same measurements were performed with 3, 6 and 9 motion vectors. As expected, the greater the number of motion vectors used, the greater the GOP threshold required for the output stream size to start dropping. Indeed, since the sum of the norms of the motion vectors of a GOP is compared to the GOP threshold to decide whether to forward that GOP to the backend, the result simply confirms that motion detection is performed. The lines of the plot have another use, however. They clearly show the minimum GOP threshold that should be used so that movement is detected. For 3 motion vectors, the threshold should be set to at least 70. For 6 motion vectors, the threshold should be greater than 105. Finally, for 9 motion vectors, the GOP threshold should be greater than 125 for motion to be detected. This also confirms that motion is properly detected with as few as 3 motion vectors.


Encryption of the motion vectors was successfully added to the prototype. We have shown that this encryption adds little overhead when the number of motion vectors is small, and that it allows streaming a 30 fps 1080p video in real time. According to our measurements, there is even room left for streaming up to 60 frames per second without slowdowns.