True vs. Inferred Face Extraction

Today, I decided to look at how true face extraction measures up against inferred face extraction. The former (labeled ‘true extraction’) simply scans the entirety of each frame and finds bounding boxes for all detected faces. The latter (labeled ‘inferred extraction’) requires two steps: the method first searches the frame for bodies, which should be easier than finding faces simply because bodies are larger, and only after finding a body does the program scan that body’s bounding box for a face. Intuitively, this second method should take longer per frame since there are multiple detection stages. An assessment of the accuracy of these two methods is presented using ground truth from a short clip of surveillance video (the same clip detailed in my minNeighbors post).

The program I wrote uses Python3 (3.6.3) and OpenCV (4.1.0). The interactive graphs were generated using Plotly (3.10.0).

Set Up

Similar to this week’s earlier post, the following code sets up Haar cascades for body and face detection using OpenCV’s built-in functions. The xml files can be found at or

body_cascade = cv2.CascadeClassifier('haarcascade_fullbody.xml')
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
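The two pipelines differ only in where `detectMultiScale` is applied. A minimal sketch of both (the function names are mine, and offsetting ROI-relative face coordinates back into frame coordinates is an assumption about the implementation):

```python
def detect_faces_true(frame, face_cascade):
    """True extraction: scan the whole frame for faces."""
    return face_cascade.detectMultiScale(frame)

def detect_faces_inferred(frame, body_cascade, face_cascade):
    """Inferred extraction: find bodies first, then search each body ROI for a face."""
    faces = []
    for (x, y, w, h) in body_cascade.detectMultiScale(frame):
        roi = frame[y:y + h, x:x + w]
        for (fx, fy, fw, fh) in face_cascade.detectMultiScale(roi):
            # offset ROI-relative coordinates back into frame coordinates
            faces.append((x + fx, y + fy, fw, fh))
    return faces
```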

FPS Evaluation

To see the full program’s execution pipeline, please reference

In both cases (true extraction and inferred extraction), the FPS for each frame was calculated by taking the reciprocal of the time needed to process the frame: convert to gray, detect (bodies and) faces, draw bounding boxes, output to file. This process is detailed in my June 3rd post. The first frame of the video was omitted to discount “cold start latency.”
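The per-frame FPS calculation can be sketched as follows (`process_frame` is a stand-in for the gray conversion, detection, box drawing, and file output steps):

```python
import time

def fps_for_frame(process_frame, frame):
    """Return frames-per-second as the reciprocal of one frame's processing time."""
    start = time.perf_counter()
    process_frame(frame)  # convert to gray, detect, draw boxes, output to file
    elapsed = time.perf_counter() - start
    return 1.0 / elapsed
```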

As evidenced by the graph and table below, true extraction processes, on average, ~2.4 more frames per second than inferred extraction. As previously discussed, this matches our intuition. A breakdown of the time spent in each stage of the inferred pipeline may be the subject of a later post.

FPS statistics:

        true       inferred
mean    9.816337   7.397280
std     0.788478   0.873392
min     7.640524   5.090805
25%     9.244480   6.755842
50%     9.706700   7.348257
75%     10.302504  8.137864
max     12.303692  9.598059

Accuracy Evaluation

For this section, we needed some ground truth data. I constructed the data for the surveillance clip myself, so the frame numbers may be subject to a degree of user-induced variability. The video I used features two individuals walking up and down a tunnel hallway at separate times. While walking up the tunnel, the subjects’ faces are not visible. Upon turning around at the top of the hallway, the faces become visible until they exit through the bottom of the view.

This ground truth data was stored in the form of tracklets. Each line in the ground truth file (clip-Appearances.txt) is a comma-separated triple consisting of an actor ID, the first frame where his/her face is visible, and the last frame where the face is visible. For this clip, Actor 0’s face was visible in frames [107, 292] and Actor 1’s face was visible in frames [763, 1018].
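Given the triple-per-line format described above, the `appearances` dict used below can be built with a loader along these lines (the function name is mine; it takes the file's lines rather than opening the file):

```python
def load_appearances(lines):
    """Parse 'actor_id,first_frame,last_frame' triples into a dict:
    actor_id -> (first_frame, last_frame)."""
    appearances = {}
    for line in lines:
        actor, first, last = (int(v) for v in line.strip().split(','))
        appearances[actor] = (first, last)
    return appearances
```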

The below code loops through the frames where an actor’s face is visible ([107-292] and [763-1018]), and counts the number of times that a face is actually detected for each method; I did this separately for each method by commenting out appropriate lines. ‘tp’ and ‘fn’ stand for ‘true positive’ and ‘false negative’, respectively.

tp = 0 # face there, face detected
fn = 0 # face there, face not detected

for actor in appearances.keys():
    for fnum in range(appearances[actor][0], appearances[actor][1]):
        found = true_dict[fnum]['coords_list']        # true extraction
        # found = inferred_dict[fnum]['coords_list']  # inferred extraction
        if len(found) > 0: # detected
            tp += 1
        else: # not detected
            fn += 1

The coords_list variable stores a list of 4-tuples (x, y, x+w, y+h), one per face found. Since no frame in my video contains more than one person, a non-empty list means the single visible face was detected (in general, more than one face could be detected in any frame).
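OpenCV's `detectMultiScale` returns boxes as (x, y, w, h), so the corner-style 4-tuples stored in coords_list can be derived with a small conversion (the helper name is mine):

```python
def to_corner_coords(detections):
    """Convert (x, y, w, h) boxes from detectMultiScale into (x, y, x+w, y+h) tuples."""
    return [(x, y, x + w, y + h) for (x, y, w, h) in detections]
```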

For the true extraction method, (tp, fn) = (257, 183), which roughly equates to an accuracy of 0.5841. For the inferred extraction method, (tp, fn) = (185, 255), which roughly equates to an accuracy of 0.4205. In my opinion, these accuracy values are quite lackluster. Increasing these will certainly be a topic of the next post.
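The accuracy figures above are just the true-positive fraction of the ground-truth face frames, which can be computed as:

```python
def accuracy(tp, fn):
    """Fraction of ground-truth face frames where a face was actually detected."""
    return tp / (tp + fn)
```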

The accuracy drop from true to inferred extraction follows directly from the pipeline: face detection only runs inside a detected body's bounding box, so whenever the body detector fails, no face can be found at all. A more accurate body detector would remedy this; options might include YOLOv3 or Caffe models.

For curiosity’s sake, I decided to investigate the false negatives reported for inferred extraction. The motivation behind this investigation was to see which detection phase was failing: the body detection or the face detection.

The no_body_no_face variable counts how many times a body is not detected when a body is in the frame; the yes_body_no_face counts how many times a body is detected but a face is not.

no_body_no_face = 0 # needs better body detector
yes_body_no_face = 0 # needs better face detector

# loop over 'false negative' frames
for fnum in fn_frames:
    body = inferred_dict[fnum]['found_body']
    face = inferred_dict[fnum]['found_face']
    if not body and not face:
        no_body_no_face += 1
    else:
        yes_body_no_face += 1

After running the above code, one observes that (no_body_no_face, yes_body_no_face) = (54, 201).

So the body detector did not perform as poorly as I had expected; I anticipated more instances of no_body_no_face. About 21% (54/255) of the false negatives are a result of poor body detection. The other 201 instances are a result of poor face detection, which is less surprising given the low resolution of the faces. The main takeaway from this experiment is that we need to explore better options for both face and body detectors.

My full code is available at

Patrick Tinsley

My name is Patrick Tinsley, and I am a graduate student at the University of Notre Dame. My focus is computer vision for surveillance video.