Haar Face Revisited

After careful consideration, I think I have made a final decision for the type of model I will use for upcoming work in facial extraction: Haar cascades. As seen in earlier posts, the cascade method is simple, works well (high FPS) on CPUs, and is accurate even with smaller faces. False positives during face detection will be handled later, but for now I want to start diving into this method in more depth.

Among the things I want to investigate are:

  • different face cascade networks (haarcascade_frontalface_alt.xml, haarcascade_frontalface_alt2.xml, haarcascade_frontalface_alt_tree.xml); these can be found here
  • different face cascade parameters in the detectMultiScale function (namely scaleFactor)

The default face cascade I have been using thus far (in haarcascade_frontalface_default.xml) is described as a “stump-based 24x24 discrete(?) adaboost frontal face detector.” Though I don’t yet know exactly what this means, I intend to find out after some reading up. For now, I will design some experiments to run, and then I will return to this loaded description.

The nice thing about using Haar cascades (at least with the Python interface) is that GPU support is not necessary for fast, accurate results. In fact, there is very little information about running Haar cascades on a GPU through opencv-python. I stumbled across alexanderkoumis’ opencv-gpu-python wrapper (found here), but I could not install it correctly on any of the several machines I have at my disposal… so I ultimately decided to move on.

Experiments Part 1 - Different Networks

In total, there are 4 different cascade files for initializing face cascades. So far, I have used the default setting. Let’s use the three others and look at the results.
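A minimal sketch of how such a comparison could be timed is below. It is not the code behind my numbers; the helper names (measure_fps, make_haar_detector) are made up for illustration, and it assumes the pip opencv-python package, which exposes the bundled cascade directory as cv2.data.haarcascades. It also times only the detection call, not video decoding or drawing.

```python
import time


def measure_fps(detect, frames):
    """Call detect(frame) on every frame and return one FPS value per frame."""
    fps = []
    for frame in frames:
        start = time.perf_counter()
        detect(frame)
        elapsed = time.perf_counter() - start
        # Guard against a zero-length interval on very fast calls.
        fps.append(1.0 / elapsed if elapsed > 0 else float("inf"))
    return fps


def make_haar_detector(xml_name, scale_factor=1.1, min_neighbors=5):
    """Build a detect(frame) callable from one of OpenCV's bundled cascade files.

    Hypothetical helper: assumes cv2.data.haarcascades points at the XML folder.
    """
    import cv2  # imported lazily so measure_fps can be used without OpenCV

    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + xml_name)
    if cascade.empty():
        raise IOError("could not load cascade file: " + xml_name)

    def detect(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        return cascade.detectMultiScale(
            gray, scaleFactor=scale_factor, minNeighbors=min_neighbors
        )

    return detect
```

With a list of BGR frames in hand, something like measure_fps(make_haar_detector('haarcascade_frontalface_alt2.xml'), frames) would give a per-frame FPS series for that cascade file, which can then be summarized per file.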

The below graph shows a close clustering of three of the networks with one reigning supreme in terms of FPS: the ‘tree’-based network. The default and alt2 networks show similar standard deviations, alt shows the lowest, and the tree shows the largest.

FPS          def        alt       alt2       tree
mean    9.938540   9.014827   9.525047  11.113886
std     0.786090   0.596222   0.721722   0.862095
min     7.565579   6.913331   7.827618   7.577237
25%     9.342537   8.595907   8.982168  10.474526
50%     9.861410   8.950025   9.462188  11.033925
75%    10.352830   9.387767   9.999914  11.604876
max    12.608040  11.015148  11.882319  13.671402

Though the above FPS results suggest that we use the ‘tree’ cascade, a quick skim through the output videos shows almost no drawn bounding boxes (low accuracy). Only for larger faces right next to the camera does the detector actually find faces. The ‘alt’ and ‘alt2’ cascades pick up smaller faces, but are spotty; they flicker from frame to frame, whereas the default detector maintains more consistency between frames. Given that ‘tree’ can be ruled out almost immediately due to inaccuracy and ‘alt’ and ‘alt2’ due to frame-to-frame misses, I propose using the default xml file.
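The flicker I describe was judged by eye. One rough way it could be put into a number, sketched here under the assumption that we have a list with the count of detections in each frame, is the fraction of consecutive frame pairs where the detector switches between "found something" and "found nothing". The function name flicker_rate is hypothetical and is not part of my repository.

```python
def flicker_rate(counts):
    """Fraction of consecutive frame pairs where detection turns on or off.

    counts: number of detections per frame, in frame order.
    """
    if len(counts) < 2:
        return 0.0
    flips = sum(
        1
        for prev, cur in zip(counts, counts[1:])
        if (prev > 0) != (cur > 0)
    )
    return flips / (len(counts) - 1)
```

A steadier detector should score closer to 0 on a clip where a face is continuously visible. This only measures on/off changes, so it says nothing about whether the detections are correct.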

Experiments Part 2 - scaleFactor

So, we know that we want to use the default frontal face Haar cascade (haarcascade_frontalface_default.xml). From my first post, we also know something about the minNeighbors parameter in the detectMultiScale function. We will use minNeighbors = 5, which was shown to be both accurate and fast. The next portion of experiments deals with the scaleFactor parameter.

By OpenCV definition, this parameter specifies how much the image size is reduced at each image scale. This was explained in a nice concise way in this article:

Suppose, the scale factor is 1.03, it means we’re using a small step for resizing, i.e. reduce size by 3 %, we increase the chance of a matching size with the model for detection is found, while it’s expensive.
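To make the cost side of that trade-off concrete, here is a rough sketch that counts how many scales a 24x24 detection window can be tried at before it outgrows the frame. The function name count_scales and the 640x480 frame size are assumptions for illustration. OpenCV's own loop also takes minSize and maxSize into account, so treat this as an estimate rather than the exact number of passes.

```python
def count_scales(width, height, scale_factor, window=24):
    """Estimate how many pyramid levels a given scaleFactor produces.

    The window starts at `window` pixels and grows by scale_factor at each
    level until it no longer fits inside the shorter side of the frame.
    """
    if scale_factor <= 1.0:
        raise ValueError("scale_factor must be greater than 1")
    levels = 0
    scale = 1.0
    while window * scale <= min(width, height):
        levels += 1
        scale *= scale_factor
    return levels


# Compare the two settings discussed in this post on an assumed 640x480 frame.
for sf in (1.03, 1.15):
    print(sf, count_scales(640, 480, sf))
```

The smaller step produces several times as many levels, and each level is another full pass of the detector over the image.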

The below graph shows a nice separation between the different FPS streams. As expected, a larger scaleFactor leads to an increase in FPS, since the image is reduced by a larger percentage at each stage and fewer scales have to be searched. Conversely, scaleFactor=1.03 takes longer and has a lower FPS. However, there seems to be a sizable difference in accuracy. The 1.03-scale finds smaller faces, but also reports more false positives; the 1.15-scale cannot find faces as small, but reports fewer false positives. Looks like an accuracy evaluation is in order, which I’ll put below this graph.


The above table is an accuracy evaluation for each of the scaleFactors. However, without appropriate ground truth data for the clip (namely bounding boxes for the faces in each frame) my approach is rather flawed. The true positive counter only increases when the detector finds a face, as can be seen in the code below. Two problems there. Firstly, we do not actually know whether or not there is a face in the bounding box that the detector returns; false positives are a common critique of Haar cascades. Secondly, since the true positive counter increments when more than zero faces are found, counting false positives becomes tricky. Say the faces_found variable has length 2. Which detection actually has a face? Surely, not 2/2, but is it 1/2 or 0/2?

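# dicts and appearances are built earlier in the script:
#   d[fnum]['coords_list'] holds the boxes detected in frame fnum
#   appearances[actor] holds the (start, end) frame range for that actor
for d in dicts:
    tp = 0  # frames with at least one detection (not verified to be a face)
    fn = 0  # frames with no detection at all
    for actor in appearances.keys():
        for fnum in range(appearances[actor][0], appearances[actor][1]):
            faces_found = d[fnum]['coords_list']
            if len(faces_found) > 0:  # detected
                tp += 1
            else:  # not detected
                fn += 1
    total = tp + fn
    # Avoid dividing by zero when no frames were evaluated.
    print('{},{},{}'.format(tp, fn, tp / total if total else 0.0))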

It seems there are two options available to deal with these flaws. The first would be to manually construct the ground truth data for the short clip, which would include bounding boxes for each of the actors’ faces in every frame; I’m leaning toward this since it might have to be done at some point anyway. The second option is to add a second face detector on top of the first to verify whether a face actually is in the region. This seems like overkill and is counterproductive in terms of maximizing FPS; this can be seen in my earlier true_v_infer post.
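If I go with the first option, scoring could look roughly like the sketch below. It is an assumption about how the evaluation might be done, not code from my repository: boxes are (x, y, w, h) tuples as returned by detectMultiScale, a detection counts as a true positive when its intersection-over-union (IoU) with an unmatched ground truth box is at least 0.5 (a commonly used but arbitrary threshold), and the names iou and score_frame are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def score_frame(detections, truths, thresh=0.5):
    """Greedily match detections to ground truth boxes in one frame.

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(truths)
    tp = fp = 0
    for det in detections:
        # Pick the remaining ground truth box that overlaps this detection most.
        best = max(unmatched, key=lambda t: iou(det, t), default=None)
        if best is not None and iou(det, best) >= thresh:
            tp += 1
            unmatched.remove(best)  # each ground truth box is matched once
        else:
            fp += 1
    return tp, fp, len(unmatched)
```

This answers the "1/2 or 0/2" question directly: with two detections and one labeled face, each detection is checked against the label, so one can count as a true positive and the other as a false positive. Summing the three counts over all frames would give precision and recall for each scaleFactor.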

That’s all for now. I’ll check back in later. -1:30

My full code is available at https://github.com/pgtinsley/haar_face_revisited.

Patrick Tinsley

My name is Patrick Tinsley, and I am a graduate student at the University of Notre Dame. My focus is computer vision for surveillance video.