An Analysis of Mask R-CNNs

Veer Pandya
Mar 4, 2022


Variations of Image Segmentation


Image classification has advanced significantly within the last few years, and we are now at a point where everyday smartphones can perform complex segmentation tasks and produce accurate results. Convolutional Neural Networks (CNNs) lie at the heart of modern classification algorithms. In this analysis, I’ll talk specifically about the Mask R-CNN algorithm, including a brief overview of its origins, its use cases, and a beginner-friendly programmatic introduction.

Technical Overview

CNNs are incredibly powerful when it comes to identifying a single object within an image. However, their limitations become obvious when you try to extend the classification beyond one object in a confined frame. This matters because the vast majority of pictures don’t contain just a single object. They are filled with complex scenes that a plain classification CNN can’t handle well. This is where Region-based Convolutional Neural Networks (R-CNNs) come in.

R-CNNs work with bounding boxes: rectangular regions that attempt to enclose the various objects within an image. The boxes are proposed by a process called Selective Search, which views the image at differing scales and groups adjacent pixels by color, texture, or intensity to pick out potential objects. Once this is done, each proposed region is resized and passed through a CNN to extract features, and a Support Vector Machine (SVM) on the final layer classifies the potential object.

Classifying Regional Features
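To make the R-CNN flow concrete, here is a minimal sketch in plain Python. It is not the real Selective Search, which merges regions hierarchically across several scales. Instead, `propose_regions` (a hypothetical helper) simply groups neighboring pixels whose intensity is close to a seed pixel and returns one bounding box per group. The `classify` callback is also an assumption and stands in for the CNN plus SVM stage.

```python
from collections import deque


def propose_regions(image, tolerance=10):
    """Group 4-connected pixels of similar intensity.

    Returns one bounding box (x0, y0, x1, y1) per group, with x1 and y1
    exclusive. A toy stand-in for Selective Search, not the real algorithm.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for sy in range(height):
        for sx in range(width):
            if seen[sy][sx]:
                continue
            seed = image[sy][sx]
            seen[sy][sx] = True
            queue = deque([(sy, sx)])
            x0, y0, x1, y1 = sx, sy, sx + 1, sy + 1
            while queue:
                y, x = queue.popleft()
                # Grow the bounding box to include this pixel.
                x0, y0 = min(x0, x), min(y0, y)
                x1, y1 = max(x1, x + 1), max(y1, y + 1)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and abs(image[ny][nx] - seed) <= tolerance):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes


def classify_regions(image, boxes, classify):
    """R-CNN-style loop: every proposal is cropped and classified on its own."""
    results = []
    for x0, y0, x1, y1 in boxes:
        crop = [row[x0:x1] for row in image[y0:y1]]
        results.append(((x0, y0, x1, y1), classify(crop)))
    return results


def toy_classifier(crop):
    """Hypothetical classifier: bright crops count as objects."""
    mean = sum(map(sum, crop)) / (len(crop) * len(crop[0]))
    return "object" if mean > 50 else "background"


image = [
    [0,   0,   0, 0,  0],
    [0, 200, 200, 0,  0],
    [0, 200, 200, 0,  0],
    [0,   0,   0, 0, 90],
]
for box, label in classify_regions(image, propose_regions(image), toy_classifier):
    print(box, label)
```

Notice that `classify` runs once per proposed box. With a real CNN in its place, that per-region pass is what makes the original R-CNN slow.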

While R-CNNs are effective, they are very slow, because the algorithm has to run the CNN separately on every potential object that is proposed in order to classify it. This is where the Fast R-CNN algorithm comes in. This method takes advantage of the fact that most of the proposed regions overlap heavily. So instead of processing each one individually, Fast R-CNN runs a CNN over the entire image once, and then extracts a fixed-size set of features for each proposed region from that shared feature map using the Region of Interest Pooling (RoIPool) method. The next iteration after Fast R-CNN is Faster R-CNN. Faster R-CNN reuses the feature map from the CNN pass over the entire image to generate the region proposals as well, through a Region Proposal Network, which saves time by not having to run a separate proposal step as before.

RoIPooling with Faster R-CNN
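The pooling step can be sketched in a few lines of plain Python. This is a simplified illustration rather than the implementation from any library: `roi_pool` is a hypothetical function that assumes a single-channel feature map stored as a list of lists and a box already expressed in whole feature-map cells.

```python
def roi_pool(feature_map, box, output_size=2):
    """Max-pool the cells inside `box` down to an output_size x output_size grid.

    `box` is (x0, y0, x1, y1) in integer feature-map coordinates, with x1 and
    y1 exclusive. Real RoIPool also rounds image coordinates onto this grid.
    """
    x0, y0, x1, y1 = box
    region = [row[x0:x1] for row in feature_map[y0:y1]]
    height, width = len(region), len(region[0])
    pooled = []
    for i in range(output_size):
        # Integer bin edges; every bin covers at least one row.
        r0 = i * height // output_size
        r1 = max((i + 1) * height // output_size, r0 + 1)
        pooled_row = []
        for j in range(output_size):
            c0 = j * width // output_size
            c1 = max((j + 1) * width // output_size, c0 + 1)
            pooled_row.append(
                max(region[r][c] for r in range(r0, r1) for c in range(c0, c1))
            )
        pooled.append(pooled_row)
    return pooled


# The feature map is computed once and shared by every region of interest.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_pool(feature_map, (0, 0, 4, 4)))  # whole map
print(roi_pool(feature_map, (0, 1, 3, 4)))  # an overlapping 3x3 region
```

Both calls read from the same `feature_map`, which is the point of Fast R-CNN: the expensive CNN pass happens once, and each region only pays for a cheap pooling step.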

At this point, our object detection process has become very efficient and can produce accurate results. One glaring issue, however, is that the classification only occurs within a rectangular box, which doesn’t match the exact outline of each individual object. This is where Mask R-CNN finally comes into play. This algorithm extends the logic of the previous iterations but brings the classification down to a pixel-by-pixel level, resulting in segmentation that can pick out the exact pixels belonging to each object within an image. Mask R-CNN adds a branch to Faster R-CNN that runs a small convolutional network within each bounding box and creates a binary mask by determining whether or not each pixel is part of the object. To make this work, however, researchers at Facebook AI Research had to make a modification known as RoIAlign. RoIPool rounds each region to the coarse grid of the feature map, which leaves the pooled features slightly misaligned with the initial image. RoIAlign avoids that rounding and instead uses bilinear interpolation to sample the feature map at exact locations.

Framework for Mask R-CNN Segmentation
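Here is a rough sketch of both ideas in plain Python, assuming RoIAlign samples the feature map with bilinear interpolation as in the Mask R-CNN paper. The function names are hypothetical, and this version takes a single sample at the center of each bin, whereas the paper samples several points per bin and aggregates them. The 0.5 threshold in `binary_mask` is a default chosen for illustration.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Interpolate the feature map at a fractional (y, x) location."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so the four neighbors always exist.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature_map, box, output_size=2):
    """Sample one point at the center of each bin, with no coordinate rounding.

    `box` is (x0, y0, x1, y1) in fractional feature-map coordinates.
    """
    x0, y0, x1, y1 = box
    bin_h = (y1 - y0) / output_size
    bin_w = (x1 - x0) / output_size
    return [
        [
            bilinear_sample(feature_map, y0 + (i + 0.5) * bin_h, x0 + (j + 0.5) * bin_w)
            for j in range(output_size)
        ]
        for i in range(output_size)
    ]


def binary_mask(probabilities, threshold=0.5):
    """Turn per-pixel object probabilities into a 0/1 mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(roi_align(feature_map, (0.25, 0.25, 2.25, 2.25)))
print(binary_mask([[0.9, 0.2], [0.5, 0.4]]))
```

Because the box edges stay fractional, a region that starts a quarter of a cell into the feature map is sampled exactly there, instead of being snapped to the nearest cell as in RoIPool.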

Now that you have a good understanding of the history and architecture of Mask R-CNNs, let’s talk about some of the advantages and use cases this advancement brings to the data science field. As I mentioned before, this algorithm is efficient and accurate, and it allows for detailed image segmentation. This is particularly useful when parsing image data for human behavior, such as pose detection and body language. It can also be applied to satellite image data to detect different types of sports fields. This algorithm can even be used for detecting biological features such as organs in CT and MRI scans. As you can see, Mask R-CNN has the potential to improve many different fields in our modern world.

Detecting an Individual Organ in a Multi-Organ Segmentation

Ethical Review

As algorithms continue to evolve and automated machine learning gets more powerful, the question of ethics looms ever larger. Just as there are many advantages to Mask R-CNNs, there are also many potential privacy concerns to consider. This algorithm is powerful enough to pick out individual people based on their shape. More controversially, this can be extended to skin color as well. As advancements continue to be made, we must make a conscious decision to consider the moral impacts in order to avoid catastrophic consequences in the future.

Programmatic Introduction

When it comes to actually trying out Mask R-CNN and learning how to use the algorithm in practice, nothing is better than the official code located on GitHub. There, you can find a detailed guide that uses a pre-trained model to detect and segment objects. If you’re inclined to dive even deeper, the repo has additional guides that will help you get up and running with training your own Mask R-CNN model for use in custom classification projects.

Mask R-CNN Demo with a Pre-Trained Model

Final Thoughts

In conclusion, Mask R-CNNs are versatile, efficient, and accurate. They have a wide range of applications that will lead to new innovations in many different professions. As always, this is just another step in the exciting and fast-growing field of data science. As more people learn and become passionate about image classification, new algorithms will be created to solve the issues that arise as the world continues to progress rapidly.


Mask R-CNN for Ship Detection & Segmentation

A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN

Rich feature hierarchies for accurate object detection and semantic segmentation

Fast R-CNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Mask R-CNN

Mask R-CNN: A Beginner’s Guide

An Improved Mask R-CNN Model for Multiorgan Segmentation

Mask R-CNN for Object Detection and Segmentation