My PhD (10+ years ago) was in video search, and one of the proposed methods for video comparison used the shot durations of the video. By shots I mean cuts in the camera flow; IIRC Hollywood movies have such a shot cut every 4 seconds on average (for example, when two people talk the camera moves from one person to the other).
I remember that I used two techniques for extracting scene cuts:
* Difference in the brightness (Y) histogram of the YUV video between frames; when that difference is more than a threshold there's a scene cut
* Counting the number of Intra macroblocks per frame on an H.264 encoded video; if that number was more than a threshold then there's a scene cut
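The first technique above can be sketched in a few lines. This is a minimal illustration with NumPy, not the original code; the bin count and threshold are illustrative values that would need tuning on real footage:

```python
import numpy as np

def luma_histogram_cut(prev_y, curr_y, threshold=0.3, bins=64):
    """Flag a scene cut when the luma (Y) histograms of two consecutive
    frames differ by more than `threshold`. `prev_y`/`curr_y` are 2-D
    uint8 luma planes; `threshold` and `bins` are illustrative, not
    tuned constants."""
    h1, _ = np.histogram(prev_y, bins=bins, range=(0, 256))
    h2, _ = np.histogram(curr_y, bins=bins, range=(0, 256))
    # Normalize so the distance is independent of frame size.
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    # L1 distance between the normalized histograms (max 2.0).
    return np.abs(h1 - h2).sum() > threshold
```

In practice you would run this over consecutive decoded frames and possibly require the distance to stand out from a local average, so a gradual fade doesn't trip the threshold.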
Author of PySceneDetect here. The current implementation does exactly what you hint at, except instead of YUV it considers deltas in the HSV domain (specifically, differences in the hue, saturation, and value channels).
Other techniques being considered for future work include use of optical flow, background subtraction, and analyzing histograms.
From what I remember, the Y (luma) component of a YUV video carries more information than the other two components, and it can also be extracted without fully decompressing the video (in MPEG-compressed videos). Of course this info is more than 10 years old (I don't really do any video research any more), so I'd guess there has been progress in that area.
This is indeed correct; I'm just using HSV instead of YUV, but the primary source of information is still the luma/brightness component (although currently all 3 HSV components are averaged, so perhaps a better weighting would improve precision).
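The averaging idea described here can be sketched as follows. This is only an illustration of the concept with NumPy, not PySceneDetect's actual implementation:

```python
import numpy as np

def hsv_frame_delta(prev_hsv, curr_hsv):
    """Average absolute per-pixel change across the H, S and V
    channels -- the idea described above, not PySceneDetect's actual
    code. Frames are arrays of shape (height, width, 3) holding the
    HSV channels."""
    diff = np.abs(curr_hsv.astype(np.float64) - prev_hsv.astype(np.float64))
    # Mean over pixels per channel, then averaged equally across the
    # three channels (a non-uniform weighting here could favour V).
    return diff.reshape(-1, 3).mean(axis=0).mean()
```

Replacing the final equal-weight mean with a weighted one (e.g. weighting V higher) is exactly the kind of tweak mentioned above.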
The image of an MPEG-compressed video is split into 16x16 blocks. For each frame (excluding the first, of course), the compression algorithm tries to match each block with a block in the previous frame (it searches the previous frame for the position with the fewest differences). If it succeeds, it only encodes the differences and the position in the previous frame; this is called an inter (predicted) block. If however it can't match that block with the previous frame, it needs to re-encode it from scratch; that's an intra macroblock. As you'd expect, right after a shot cut there are many more intra macroblocks.
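A toy version of that inter/intra decision can be written directly on pixel data. This sketch (NumPy, exhaustive search, made-up SAD threshold) is far simpler than a real encoder's motion estimation, but it shows why a shot cut makes the intra count spike:

```python
import numpy as np

def count_intra_blocks(prev, curr, block=16, search=8, sad_limit=2000):
    """Toy inter/intra decision: for each 16x16 block of `curr`,
    search a +/-`search` pixel window in `prev` for the best match
    (lowest sum of absolute differences, SAD). Blocks whose best SAD
    still exceeds `sad_limit` count as intra. All thresholds are
    illustrative, not taken from any real codec."""
    h, w = curr.shape
    intra = 0
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            target = curr[by:by + block, bx:bx + block].astype(np.int32)
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        cand = prev[y:y + block, x:x + block].astype(np.int32)
                        sad = np.abs(target - cand).sum()
                        if best is None or sad < best:
                            best = sad
            if best is None or best > sad_limit:
                intra += 1
    return intra
```

Across a shot cut nearly every block fails to find a match, so the per-frame intra count jumps; thresholding that count is the second detection technique described above.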