Building an architecturally viable Facial Recognition System

Carrying out Advanced Facial Recognition that scales is no easy task

I bet you’ve seen dozens of blog posts, GitHub repos and research papers telling you how to perform facial recognition on a dataset. I had too, and when I actually had to build a full-scale system that was both accurate and fast, I was left in the dust.

So I decided to write this (call it a short set of notes) on how to build an architecturally viable system that performs Facial Recognition on a bunch of people in a room. [This is top-secret, government level stuff guys! I’m kidding, those systems are probably much more advanced than this :( but this should be a good place to start]

I would not suggest this as an introduction to Facial Recognition, as it assumes some prerequisite knowledge. While this blog covers some of the basics, it’s more focused on building a full-fledged system than on making you familiar with the fundamentals. If you’re new to the topic, I would suggest first reading one of the many tutorials and blogs, or watching a few videos, on how face rec works and what happens behind the scenes.

First of all, before we even start, let’s look into how facial recognition even works under the hood. You may have done this multiple times, and if so here’s a quick revision! Do keep these processes in mind, as optimising all of them is what will help us achieve a perfectly viable system!

The whole process of Facial Recognition is divided into these 4 main parts:

  1. Preparing your Data set / Capturing and Pre-processing your data to input to your algorithms.
  2. Running Facial Detection Algorithm(s) to detect and mark the locations of faces.
  3. Pre-processing the cropped images, passing them to the second model for feature extraction & storing the feature space.
  4. Repeating the above steps for a new image and comparing its feature vectors against the stored ones.

Let’s have a look at what they are and how to do them efficiently, one by one, before we discuss how to bring them all together using a tool called Redis.

STEP 1: Preparing your dataset

This step will probably be one of the most important steps in the whole pipeline of building your complete system. “Why”, you ask? Because how you preprocess this data determines your “training” (that is, ingesting existing faces) speed. And since the very next step is where you ingest tens of thousands of images (if not more), your data needs to be ready to be processed at a speed that keeps your pipeline running smoothly.

There are multiple ways to do this depending on your case. Chances are if you’re building a prototype or a testing model, you’ll have a bunch of images of labelled faces on your disk. That’s well and good; in that case you just need to ensure all of them are in a format that OpenCV’s standard imread() function can read. Better yet, convert all images to a single format, and group them into one folder per person if you have multiple faces per person.
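Here’s a minimal sketch of that clean-up, assuming one sub-folder per person under a root directory. The helper names (`images_by_person`, `convert_to_jpeg`) and the list of extensions are made up for illustration; only `cv2.imread()` and `cv2.imwrite()` are actual OpenCV calls.

```python
from pathlib import Path

# Assumed set of extensions to treat as images; extend it for your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def images_by_person(root):
    """Map each sub-folder name (one per person) to its sorted image paths."""
    grouped = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            grouped[folder.name] = sorted(
                p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    return grouped


def convert_to_jpeg(src, dst_dir):
    """Re-encode one image as JPEG; return the new path, or None if unreadable."""
    import cv2  # local import: only this function needs OpenCV

    image = cv2.imread(str(src))
    if image is None:  # imread() returns None instead of raising on failure
        return None
    dst = Path(dst_dir) / (Path(src).stem + ".jpg")
    dst.parent.mkdir(parents=True, exist_ok=True)
    cv2.imwrite(str(dst), image)
    return dst
```

Checking for `None` matters here: a corrupt or unsupported file will not raise an error in `imread()`, so it is easy to feed empty images into the next step without noticing.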

However, if you’re building a complex production-level model, chances are that, like me, you have a slightly more complex dataset. For example, I had a bunch of videos of the people we wanted in our database, and each 3 to 5 second video covered a complete 180-degree view of a person’s face. This was done to enable our model to recognise them from any possible angle in the field.

If you have anything of the sort, I suggest you split such videos into frames and store them in folders named after the videos (or use a proper naming scheme for your file names). You can use something like this below:

Code to break up your video into frames and save them using OpenCV
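A rough sketch of such a script, assuming your videos are in a format OpenCV can decode. The function names, the `every_n` parameter and the `frame_00012.jpg` naming scheme are placeholders chosen for this example, not a fixed convention.

```python
from pathlib import Path


def frame_path(out_dir, video_path, index):
    """Build a name like out_dir/<video stem>/frame_00012.jpg."""
    return Path(out_dir) / Path(video_path).stem / f"frame_{index:05d}.jpg"


def video_to_frames(video_path, out_dir, every_n=1):
    """Save every `every_n`-th frame as a JPEG; return how many were saved."""
    import cv2  # imported here so frame_path() works without OpenCV installed

    capture = cv2.VideoCapture(str(video_path))
    if not capture.isOpened():
        raise IOError(f"Could not open video: {video_path}")

    index = saved = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of the video (or a decode error)
                break
            if index % every_n == 0:
                target = frame_path(out_dir, video_path, index)
                target.parent.mkdir(parents=True, exist_ok=True)
                cv2.imwrite(str(target), frame)
                saved += 1
            index += 1
    finally:
        capture.release()
    return saved
```

For a 3 to 5 second clip, consecutive frames are nearly identical, so something like `video_to_frames("videos/alice.mp4", "frames", every_n=5)` is usually a reasonable way to cut storage without losing many distinct angles.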

If your dataset is so big that slicing it into frames is only going to double your storage needs, read on! We’ll soon address how to make all this part of a pipeline that works autonomously, processing the frames directly without you needing to store them, like a boss!

STEP 2: Detecting faces

Once you’ve loaded and preprocessed your dataset, this will be the easiest step for you to carry out.

There are already tons of machine learning models that help you locate faces, so I’ll give a quick rundown of a few well-known options and their pros and cons.

  1. DLib based face_recognition library.
  2. MTCNN library, based on the research paper Zhang et al. (2016) [ZHANG2016].
  3. OpenCV’s face detection using Haar Cascades

In comparison, MTCNN has been shown to be the most accurate by far, but it suffers from being extremely slow due to its cascade of multi-task CNNs. It has been reported to run at a maximum of 8 FPS on CPUs, making it one of the slowest models for face detection.

OpenCV’s Haar Cascades actually perform pretty well and are quick too, but DLib takes the crown with its Deep Neural Net implementation, which can be accelerated on GPUs to achieve 40+ FPS on HD images with very usable accuracy.

TL;DR: Choose MTCNN for accuracy, OpenCV for speed and light system requirements, and choose DLib for the Sweet Spot in between the two.
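If you go the OpenCV route, a minimal sketch could look like the following. It assumes the `opencv-python` package, which ships the cascade XML files under `cv2.data.haarcascades`. The helper `xywh_to_trbl` is a made-up convenience that converts OpenCV’s (x, y, w, h) boxes into the (top, right, bottom, left) order used by the face_recognition library, so you can swap detectors later.

```python
def xywh_to_trbl(box):
    """Convert an OpenCV (x, y, w, h) box to (top, right, bottom, left)."""
    x, y, w, h = box
    return (y, x + w, y + h, x)


def detect_faces_haar(image_bgr, scale_factor=1.1, min_neighbors=5):
    """Detect frontal faces in a BGR image with OpenCV's bundled Haar cascade."""
    import cv2  # local import: xywh_to_trbl() does not need OpenCV

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)  # cascades work on grayscale
    boxes = cascade.detectMultiScale(
        gray, scaleFactor=scale_factor, minNeighbors=min_neighbors
    )
    return [xywh_to_trbl(tuple(int(v) for v in box)) for box in boxes]
```

Raising `min_neighbors` trades missed faces for fewer false positives, and it is usually the first knob worth tuning.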

This is how easy face detection can be with face_recognition, a Python library built on top of DLib
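A short sketch, assuming the face_recognition package is installed. `load_image_file()` and `face_locations()` are the library’s own functions; `detect_faces` and `largest_face` are helper names invented for this example.

```python
def largest_face(locations):
    """Pick the biggest (top, right, bottom, left) box, or None if there are none."""
    def area(box):
        top, right, bottom, left = box
        return (bottom - top) * (right - left)

    return max(locations, key=area, default=None)


def detect_faces(image_path, model="hog"):
    """Return the RGB image and a list of (top, right, bottom, left) face boxes."""
    import face_recognition  # local import: largest_face() works without it

    image = face_recognition.load_image_file(image_path)
    # model="hog" runs on the CPU; model="cnn" uses DLib's neural net detector,
    # which is the one that benefits from a GPU.
    locations = face_recognition.face_locations(image, model=model)
    return image, locations
```

When you are enrolling one known person per image or video, `largest_face(locations)` is a simple way to ignore anyone who wandered into the background.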

Full documentation and sample code are available at the links I’ve provided.

STEP 3: Processing Faces

This, this is where the music begins. This is where you get to see the magic happen that will finally lead you to the heart of facial recognition.

Once you have your faces detected and marked with bounding boxes, you can finally start sending them to your feature extraction model.

There are multiple ways and available models to do that, but these are a few of the best-known ones:

  1. DLib’s Face Recognition Model (Python Implementation)
  2. FaceNet (A model proposed by Google Engineers, 2015)
  3. VGGFace (A feature extraction model based off VGG-16)
  4. VGGFace-2 (Based on the newer ResNet-50 Architecture, 2018)

As you can guess, the best performance amongst all of the above comes from VGGFace-2. The reason is simple: a ResNet-50 model trained on a dataset as huge as VGGFace-2 learns an extremely diverse set of features. The feature space of VGGFace-2 is a 1 x 2048 vector, compared to the 1 x 128 feature space of DLib. FaceNet comes close, but overall VGGFace-2 seems to dominate.

Let’s look at how you can quickly get started with implementing one of these:

Here’s a quick implementation using the face_recognition library that we saw before:

Continuing from before, we can now load the face positions we found, vectorize them, and save them to file
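A minimal sketch of that step, assuming you already have an RGB image and its face boxes in (top, right, bottom, left) order. `face_encodings()` is the library’s function; the wrapper names and the choice of one `.npy` file per person are just assumptions for this example.

```python
import numpy as np


def encode_faces(image, locations):
    """Return an (n_faces, 128) array of DLib face encodings."""
    import face_recognition  # local import: the save/load helpers only need numpy

    encodings = face_recognition.face_encodings(
        image, known_face_locations=locations
    )
    return np.array(encodings).reshape(-1, 128)


def save_encodings(path, encodings):
    """Write the encodings to a .npy file."""
    np.save(path, np.asarray(encodings, dtype=np.float64))


def load_encodings(path):
    """Read the encodings back as an (n_faces, 128) array."""
    return np.load(path)
```

Passing `known_face_locations` saves the library from running detection a second time, which adds up quickly over thousands of frames.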

Alternatively, using the keras-vggface library, you can implement the VGGFace models like this:
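A hedged sketch, following the usage shown in the keras-vggface README. It assumes a Keras/TensorFlow version the library still supports (it may need pinning on recent installs), and that your face crops are RGB arrays. `crop_face`, `build_vggface2` and `embed_faces` are names invented here.

```python
import numpy as np


def crop_face(image, location):
    """Cut one face out of an image given a (top, right, bottom, left) box."""
    top, right, bottom, left = location
    return image[top:bottom, left:right]


def build_vggface2():
    """Load the ResNet-50 VGGFace2 model as a 2048-d feature extractor."""
    from keras_vggface.vggface import VGGFace  # heavy import, kept local

    return VGGFace(
        model="resnet50", include_top=False,
        input_shape=(224, 224, 3), pooling="avg",
    )


def embed_faces(model, face_crops):
    """Return one 2048-d vector per RGB face crop."""
    import cv2
    from keras_vggface.utils import preprocess_input

    # The network expects fixed-size 224x224 inputs.
    batch = np.stack([cv2.resize(crop, (224, 224)) for crop in face_crops])
    # version=2 selects the normalisation used for the VGGFace2 models.
    batch = preprocess_input(batch.astype("float32"), version=2)
    return model.predict(batch)  # shape: (number of faces, 2048)
```

Build the model once with `build_vggface2()` and reuse it for every batch; loading the weights is by far the slowest part.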

Now that you have your feature vectors, it’s time to debate how we’re actually going to store them.

The answer to that question mostly depends on how you’re building your architecture: will you have abundant disk space and memory? The easiest way, of course, is to save the numpy arrays to disk, but that means you will have to manually maintain an index of which array belongs to which person, and then load all the vectors from disk whenever you compare a new incoming face.

The better way is to build a good SQL-based database, with one table for persons and another for all face vectors, and a one-to-many relationship between the two. This way, you will also be able to store any additional metadata about the person whose face you’ve recorded in the table!
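To make the two-table idea concrete, here is a small sketch using SQLite from Python’s standard library. The table and column names are assumptions, and storing each vector as a float32 BLOB is just one option among several; a production setup would likely use a server database.

```python
import sqlite3

import numpy as np

SCHEMA = """
CREATE TABLE IF NOT EXISTS person (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS face_vector (
    id        INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(id),  -- one person, many vectors
    vector    BLOB NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_person(conn, name, vectors):
    """Insert a person and all of their face vectors; return the person id."""
    cur = conn.execute("INSERT INTO person (name) VALUES (?)", (name,))
    person_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO face_vector (person_id, vector) VALUES (?, ?)",
        [(person_id, np.asarray(v, dtype=np.float32).tobytes()) for v in vectors],
    )
    conn.commit()
    return person_id


def load_vectors(conn):
    """Return (person_ids, matrix) with one row of the matrix per stored vector."""
    rows = conn.execute(
        "SELECT person_id, vector FROM face_vector ORDER BY id"
    ).fetchall()
    ids = [person_id for person_id, _ in rows]
    matrix = np.array([np.frombuffer(blob, dtype=np.float32) for _, blob in rows])
    return ids, matrix
```

Any extra metadata (employee number, access level and so on) becomes just another column on `person`, without touching the vectors.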

STEP 4: Capturing new faces and running the comparison

Finally! We’ve made it this far!

This is by far the easiest step in the process of facial recognition! All you have to do is repeat the steps you’ve already done with a small change.

Load the face that you’re trying to compare to the ones already stored in your database, run the face locations script, convert it into the feature vectors and then get ready to compare!

Now, let us understand how we’re actually going to compare the face vectors that we have at hand. The easiest way is to compute the Euclidean Distance between the vector you just generated and all the vectors saved in your database. Some research papers also suggest Cosine Distance, as it works extremely well for features with larger dimensions.

You can perform the comparison like this:
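A minimal numpy sketch of both metrics, assuming the stored vectors are stacked into one matrix with a parallel list of labels. The function names are invented for this example. The default threshold of 0.6 is the tolerance the face_recognition library uses for its 128-d encodings; other models (and cosine distance) need their own tuned value.

```python
import numpy as np


def euclidean_distances(known, query):
    """Euclidean distance from `query` to every row of `known`."""
    return np.linalg.norm(known - query, axis=1)


def cosine_distances(known, query):
    """Cosine distance (1 - cosine similarity) from `query` to every row."""
    known_unit = known / np.linalg.norm(known, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    return 1.0 - known_unit @ query_unit


def best_match(known, labels, query, threshold=0.6, metric=euclidean_distances):
    """Return (label, distance) of the closest vector, or (None, distance)
    when even the closest one is further away than `threshold`."""
    distances = metric(np.asarray(known, dtype=float), np.asarray(query, dtype=float))
    index = int(np.argmin(distances))
    distance = float(distances[index])
    if distance > threshold:
        return None, distance
    return labels[index], distance
```

Returning `None` for far-away matches is what lets the system say "unknown person" instead of always picking the nearest face in the database.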

Once you have the closest match, all you have to do is fetch the person’s details from your database!

Easy, wasn’t it?

Now let’s have a look at how you can scale such a process.

Scaling Face-Rec to real-time use

So far we’ve seen how easy it is to perform facial recognition with small snippets like these.

Now, all we have to do is combine them into one coherent pipeline, one that doesn’t keep going back to disk for data I/O or get choked by a backlog of frames while processing input from a video stream.

Enter Redis!

What is Redis, you ask? Redis is simply an in-memory key-value store that can be used for multiple purposes such as pipelining, caching, pub-sub and even in-memory ML model execution!

We’ll be using Redis Streams, an append-only log data type built into Redis (since version 5.0) that lets you log incoming data and enqueue it as events arrive. That data can then be picked up by a Redis module known as RedisGears, which runs a script on it right inside Redis. And the best part? All of this can be driven from Python, making it easy to integrate into our system!
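To give a feel for the Streams side, here is a small sketch using the redis-py client’s `xadd()` and `xread()` calls. It uses a plain Python consumer rather than RedisGears, and the stream name, field names and `max_len` value are all assumptions for illustration.

```python
def connect(host="localhost", port=6379):
    """Create a redis-py client (assumes a Redis 5.0+ server is running)."""
    import redis  # local import so the helpers below work with any client object

    return redis.Redis(host=host, port=port)


def enqueue_frame(client, stream, frame_id, jpeg_bytes, max_len=1000):
    """Append one encoded frame to a stream, capped at roughly `max_len` entries.

    Capping the stream is what stops a slow consumer from causing an
    ever-growing backlog: the oldest frames are simply dropped.
    """
    return client.xadd(
        stream,
        {"frame_id": frame_id, "image": jpeg_bytes},
        maxlen=max_len,
        approximate=True,
    )


def read_frames(client, stream, last_id="$", count=10, block_ms=1000):
    """Yield (entry_id, fields) for entries added after `last_id`.

    "$" means "only entries that arrive from now on". The reply layout
    handled here is the default (RESP2) one returned by redis-py.
    """
    response = client.xread({stream: last_id}, count=count, block=block_ms)
    for _stream_name, entries in response or []:
        for entry_id, fields in entries:
            yield entry_id, fields
```

A camera process would call `enqueue_frame(client, "camera:0", n, jpeg)` for each frame, while a worker loops over `read_frames(...)`, passing the last entry id it saw back in as `last_id` to pick up where it left off.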