Monday, April 23, 2018

Experiences with Microsoft’s Azure Face API

In the last few weeks I have been working with Microsoft’s Azure-based Face API.

If you have never used the API, you might well be surprised by how much information it can return about each face. Here is just a small part of what comes back:
1. The coordinates of the face inside the scene.
2. The amount the face is tilted.
3. A guess of the person’s age.
4. How much the person is smiling.
5. Whether the person is wearing glasses or not.
6. Whether the person has facial hair or not.
7. Whether the person is male or female.
8. A guess at the emotional state of the person.

All of the above, as well as very detailed information about the positions of features within the face, can be obtained.

The API itself has been designed to be very straightforward to use.

To be able to recognize a face, the Microsoft engine in Azure needs some sample images of that face. This set of samples is called the training set, and the project I worked on started by sending a set of images to Azure for each of the people we wanted to recognize later on.
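For anyone who wants to see roughly what that enrolment step looks like, here is a minimal sketch against the Face API's v1.0 REST endpoints, using Python and the requests library. The subscription key, region, person group name and image file names are all placeholders, and error handling is left out:

import requests

# Placeholders - substitute your own key, region and group name.
KEY = "YOUR_FACE_API_KEY"
BASE = "https://westeurope.api.cognitive.microsoft.com/face/v1.0"
GROUP = "office-people"
HEADERS = {"Ocp-Apim-Subscription-Key": KEY}

# 1. Create a person group to hold the training data.
requests.put(f"{BASE}/persongroups/{GROUP}",
             headers=HEADERS, json={"name": "Office people"})

# 2. Create a person and attach several sample images (the training set).
resp = requests.post(f"{BASE}/persongroups/{GROUP}/persons",
                     headers=HEADERS, json={"name": "Alice"})
person_id = resp.json()["personId"]

for path in ["alice1.jpg", "alice2.jpg", "alice3.jpg"]:
    with open(path, "rb") as f:
        requests.post(
            f"{BASE}/persongroups/{GROUP}/persons/{person_id}/persistedFaces",
            headers={**HEADERS, "Content-Type": "application/octet-stream"},
            data=f.read())

# 3. Ask Azure to train the group on the uploaded faces.
requests.post(f"{BASE}/persongroups/{GROUP}/train", headers=HEADERS)

Training runs asynchronously, so in a real project you would poll the persongroups/{id}/training endpoint until the status comes back as succeeded before asking the service to identify anyone.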

When the time came to recognize people, we set up a camera connected to a PC and, every few seconds, sent the current camera image to Azure, asking the Face API to tell us whether any faces were in the image.

If a single person walked up to the camera, the response would be that there was one face in the image we had sent. The Face API is quite capable of picking up many faces in a single image (for instance, where the image shows a number of people seated around a table).
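As a rough illustration, the detection call itself can be as small as this. Again it is a sketch against the v1.0 REST detect endpoint with a placeholder key and region; how you capture a frame to JPEG bytes depends entirely on your camera library:

import requests

KEY = "YOUR_FACE_API_KEY"
BASE = "https://westeurope.api.cognitive.microsoft.com/face/v1.0"

def detect_faces(jpeg_bytes):
    # Send one camera frame and ask for face rectangles plus a few attributes.
    resp = requests.post(
        f"{BASE}/detect",
        params={"returnFaceId": "true",
                "returnFaceAttributes":
                    "age,gender,smile,glasses,facialHair,emotion,headPose"},
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/octet-stream"},
        data=jpeg_bytes)
    return resp.json()   # a list with one entry per face found

with open("frame.jpg", "rb") as f:
    faces = detect_faces(f.read())

print(f"{len(faces)} face(s) in this frame")
for face in faces:
    print(face["faceRectangle"], face["faceAttributes"]["age"])

Each entry in the returned list carries a faceId, and that identifier is what the next step needs.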

Once we know there are faces in an image, we need to use a different function in the Azure Face API: we send just the area around a face to Azure and ask whether that face belongs to someone in our training sets. The response we get back is not just a yes/no answer, but a probability of how likely it is that the face we sent matches someone. Generally, we would choose the highest-probability match (if there is one).
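In the v1.0 REST API this second step is the identify call. Rather than re-sending pixels, you pass in the faceId values that came back from detect, along with the person group that holds your training sets. A sketch, with the same placeholder key, region and group name as before:

import requests

KEY, GROUP = "YOUR_FACE_API_KEY", "office-people"   # same placeholders as above
BASE = "https://westeurope.api.cognitive.microsoft.com/face/v1.0"
HEADERS = {"Ocp-Apim-Subscription-Key": KEY}

def identify(face_ids):
    # face_ids are the faceId values returned by the detect call.
    resp = requests.post(
        f"{BASE}/identify",
        headers=HEADERS,
        json={"personGroupId": GROUP,
              "faceIds": face_ids,
              "maxNumOfCandidatesReturned": 1,
              "confidenceThreshold": 0.5})
    return resp.json()

def report_matches(faces):
    # faces is the list returned by the detect sketch above.
    for result in identify([face["faceId"] for face in faces]):
        if result["candidates"]:
            best = result["candidates"][0]          # highest-probability candidate
            print("matched", best["personId"], "confidence", best["confidence"])
        else:
            print("face detected, but no match in the training set")

The confidenceThreshold of 0.5 is only a starting point: the service returns candidates whose confidence is at or above it, so raising the value trades missed matches for fewer false positives.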

In our project we wanted a PC app to trigger an activity whenever someone the app knew came into range of the camera. In effect, we would also know when they had left, as we would stop seeing them through the camera.
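Putting the pieces together, the app reduces to a polling loop along these lines. This is only a sketch: it reuses the detect_faces and identify helpers from the earlier sketches, the camera capture is left as a stub, and five seconds is an arbitrary polling interval:

import time

def capture_frame():
    # Stub: grab one frame from the camera as JPEG bytes
    # (OpenCV, a vendor SDK, whatever the hardware offers).
    raise NotImplementedError

present = set()   # personIds we believe are currently in front of the camera

while True:
    faces = detect_faces(capture_frame())          # from the detection sketch
    seen = set()
    if faces:                                      # identify() needs at least one faceId
        for result in identify([f["faceId"] for f in faces]):
            if result["candidates"]:
                seen.add(result["candidates"][0]["personId"])

    for person_id in seen - present:
        print("arrived:", person_id)               # trigger the activity here
    for person_id in present - seen:
        print("left:", person_id)                  # no longer seen by the camera

    present = seen
    time.sleep(5)                                  # "every few seconds"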

The Face API made it easy for us to set up the project and begin testing. It was at that stage that we began to realize it was not quite so simple after all.

The first sign was that people who walked past the camera in profile were not recognized. Actually, they weren’t even detected as faces! After some investigation we were able to draw up a list of circumstances that were likely to affect whether someone would be matched.

The first step in getting a match, as noted above, is to detect that there is a face in the image. This step, we discovered, can be affected by quite a few things. Here is a partial list:
1. A person’s head should not be turned away from the camera by more than about 45 degrees (see the sketch after this list).
2. If the camera is positioned too far above the mid-line of the face, no face is detected. Similarly, even if the face and camera are at the same level but the person turns their face too far up or looks down too far, no face is detected.
3. If the face is tipped too far from vertical with respect to the camera, a face will not be detected.
4. The mouth should not be covered.
5. The nose should not be covered.
6. Each eye should be visible, or at most obscured by no more than a finger’s width.
7. Ears, forehead and chin do not need to be visible.
8. Placing a hand against the side of the head or chin does not prevent detection.
9. Beards, moustaches and glasses do not prevent detection.
10. Strong backlighting (e.g. a large window behind a person) can make detection impossible.
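Several of the items above (the head turned away, tipped over, or looking up or down) are reflected in the headPose attribute that the detect call can return, so one option is to discard borderline detections before attempting identification. A small sketch only; the 45 and 30 degree limits are illustrative guesses, not figures from Microsoft’s documentation:

MAX_YAW = 45     # illustrative limits, not values from Microsoft's documentation
MAX_ROLL = 30

def usable_faces(faces):
    # Keep only detections whose head pose is close enough to facing the camera.
    # headPose is returned when the detect call asks for the "headPose" attribute:
    # yaw is the left/right turn and roll the tilt from vertical, in degrees.
    usable = []
    for face in faces:
        pose = face["faceAttributes"]["headPose"]
        if abs(pose["yaw"]) <= MAX_YAW and abs(pose["roll"]) <= MAX_ROLL:
            usable.append(face)
    return usable

Filtering the detect results this way simply avoids spending identify calls on faces that are unlikely to match anyway.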

Even if a face is detected, it may still fail to match against the training set due to other problems:
1. If the place/camera where the training set was collected is different to where the recognition is to be done, the success rate in matching may be lowered.
2. If the resolutions of the cameras used for training and for recognition are very different, the success rate in matching may be lowered.
3. If the camera resolution is high (e.g. 1920x1080), matching is easily achieved at 2 metres distance from the camera. If the camera resolution is low (e.g. 640x480), matching at 2 metres from the camera becomes difficult.
4. If the facial expression at recognition time is too different to the expression used in the training set (e.g. mouth open at recognition, while the training images all had mouth closed), recognition may fail.

Once you know more about the characteristics of the API, achieving a reliable result becomes more than just a matter of putting some code together. The project design may need to juggle the position of the camera, perhaps using more than one camera. Some thought will also need to go into lighting, and possibly into devising techniques to compensate for perfectly normal face-obscuring activities such as people simply turning their heads.

Peter Herman
