Speedy Motion Detection

Differently shaped and sized objects falling

In the first video, we observe multiple objects of various shapes, sizes, and colors falling one at a time. Our algorithm does a good job of tracking medium and large objects, but it misses small objects, such as the roll of red nylon string, altogether. Another issue, common to all the objects, is that each object’s shadow is included as part of the object. This makes sense, since the shadow causes areas of the image to change significantly, but it is not desirable: we only want to capture the moving object itself, not parts of the background. One success in this video is that the motion of the stick (the pole knocking the objects off the shelf) is not detected, which is a sign that our speed filter is working. And since all of the falling objects are initially part of the background yet their absence is not tracked, we also know that our motion masking step is working.

Two objects falling simultaneously

In the second video, we observe two objects falling simultaneously. Both are being tracked as they fall, which demonstrates that multiple objects can be detected at the same time. One of the objects is missed for one frame, but overall, the result is good. The same issue can be seen with the inclusion of the shadow as part of the object.

Object moving slowly and then falling

In the third video, a tissue box falls in the first half. In the second half, a backpack enters the frame slowly and, during the last few seconds, is dropped and falls at high speed. The detector captures the falling tissue box as expected; nothing new is demonstrated there, other than the fact that the same algorithm is used for both halves of the video. The second half gives us our first object that enters the camera view instead of initially being part of the background. Its motion is not tracked while it moves slowly. Only after it begins falling does the detector register two speedy objects: the falling bag and, for a brief moment, the stick that is flung up when the backpack is released. This shows that our algorithm can handle new objects coming into view and confirms that our speed filter behaves correctly.

Motivation

A classmate and I wanted to do a computer vision project that was fun but still applicable to real life. At first, we thought of doing something sports-related, such as tracking the trajectory of a basketball. However, we refrained from pursuing this idea because tracking an object whose shape, color, and size we already knew was not very challenging. In other words, it was too specific a problem: sports have strictly defined rules and regulations that tempt you to solve an easier problem (e.g., thresholding the color orange to find basketballs) at the cost of applicability to more general problems. But the idea of motion tracking for basketball prompted a question: what if we didn’t know what object we were tracking? Now we had a problem that seemed too broad to solve before the quarter ended.

We thought of scenarios in which motion tracking would be useful. Child safety was the first thing that popped into mind. I remember thinking of all the news articles I’ve read about furniture falling on children and causing serious injury and sometimes death. This occurs frequently enough that public officials and furniture companies began pushing people to anchor furniture to the walls to prevent it from tipping over and crushing a child. However, this is a solution that requires a lot of foresight (and anchors). It doesn’t erase the inevitability of accidents. What happens if the anchor fails? Suppose the anchored shelf tips but doesn’t fall to the ground; none of the objects on the shelf are fixed in place, so they will still fall. What I’m getting at is that fast-moving objects pose a grave danger to humans, especially children. We expect to be safe in our homes, but the reality is that accidents will happen, and when they do, a timely response can make the difference between life and death. In the past decade, the use of home surveillance systems and doorbell cameras has skyrocketed, mostly in developed countries such as the United States. Billions of surveillance cameras watch over areas all over the world. What if there was a way to use them to track fast-moving objects and alert people to potentially dangerous situations?

Mission

Our team set out with the goal of developing a prototype of an algorithm that can be used by the numerous, stationary cameras attached to the sides of buildings and corners inside suburban homes to detect fast-moving objects. The concepts explored in this project can be extended to a number of other applications, including the detection of falling people (down stairs, in elderly homes, etc.) and the tracking of projectiles and prediction of where they travel (useful in the defense and sports industry).

Problem Statement

As stated, we are only tackling fixed camera views, since the goal is to integrate our software with existing surveillance cameras. The problem becomes daunting if the camera is attached to a moving object. If the camera were moving faster than the selected threshold (think of a camera attached to a car), then all stationary objects in the background would appear to be fast-moving from the camera’s perspective, and it would be hard to distinguish objects that may pose a danger from objects that only appear fast because the camera is moving relative to the background. Hence, we assume that our camera view is stationary, with a mostly static background. This is a reasonable assumption because surveillance cameras are usually placed to oversee rooms, entrances, and intersections, where the background does not change sporadically.

For simplicity’s sake, we assume that motion occurs in a plane parallel to the camera’s image plane. This can also be thought of as keeping the camera in 1-point perspective. We favor this perspective because we don’t have to worry about depth or keep track of how objects change in size, which makes calculating an object’s speed a much simpler task. Our design can still be used for motion that isn’t parallel to the image plane; the caveat is that the calculated speed won’t be accurate, so objects moving slower than the threshold may be categorized as fast and vice versa.

We should be agnostic to the shape, size, and color of the fast objects: anything that is moving will be tracked. However, the user should have the option to exclude objects under or above a certain size; for instance, this may be useful in an area inhabited by a lot of flying birds. The initial position of the object also doesn’t matter; it can start within the view of the camera or move into frame later on. We should be able to track multiple objects simultaneously, although there is a limitation to this feature that I will discuss in a later section. Lastly, the output should only include objects that are moving with sufficient speed, where the speed threshold is a user-specified parameter. It is important to note that the “speed” we calculate is measured in pixels per unit time and is not the object’s speed in reality; again, this follows from our decision to ignore depth for simplicity.

Design Specifications

  1. Fast: We want to be able to run the algorithm in real-time. This means no heavy computations or chunky time/space complexities.

  2. Versatile: Works on any object, background, and lighting condition (although tuning of parameters may be required for extreme cases). No prior knowledge or big data is needed; the user shouldn’t have to supply the algorithm with a lot of data or parameters for it to work. Tuning of parameters and prior knowledge (of, say, the size of the objects) may be helpful, but shouldn’t be a crutch.

  3. Robust: Works on cameras with low resolution or frames per second. The algorithm is resistant to vibrations or slight variations in camera position. We expect a stationary (with respect to the background) camera, but we also anticipate that the mount of the camera is not entirely rigid, nor is the structure it is attached to.

  4. Intuitive: Built with minimal use of black-box libraries. We understand that we are likely not inventing something novel. We remind ourselves that this project is for our own learning and should reflect how we think about the problem, not how others have solved it. Furthermore, we want to avoid black-box libraries because any flaws we observe in the final result would be difficult to trace back to their causes with confidence.

  5. Transparent: Every step of the process should be visualizable. This ties back to being able to source the cause of problems we encounter. I will include examples of these in later sections, but technicians should have the option to view motion masks, moving object IDs, and the tracking of speedy objects as tools to diagnose problems users encounter.


Algorithm Overview

The algorithm can be broken down into four main steps, the first of which is motion masking. This is where differences between frames are used to construct motion masks that tell us where motion is occurring. The second step is connected component labeling, where we take our motion masks and extract a list of moving objects. Each moving object is described by the position of its leftmost pixel, the position of its topmost pixel, its width, and its height. The next step is object identification and tracking, where each object is assigned a similarity score (and an identification number) for tracking purposes. If an object in one frame has a similar enough score to an object in another frame, we can say they are the same object. Lastly, once we have information about objects and their positions over time, we can calculate their speeds and filter out objects moving below the threshold.


Motion Masking

The idea behind motion masking is simple: compare two frames and check whether the difference in a pixel’s value (i.e., hue, saturation, or value) is greater than a minimum threshold. If it is, then that pixel is included in the motion mask. However, this raises the question of which two frames we should compare. I will discuss the benefits and disadvantages of using a static reference versus a dynamic one.

Motion Masking Using a Static Reference

One choice is to compare the 0th frame with the Nth frame. This works really well if the 0th frame contains only the background and all the moving objects enter the frame later. However, a problem arises for objects that are initially part of the 0th frame and then begin to move: both the area the moving object has entered and the area it has left incur significant change. The resulting motion mask therefore includes both the position of the object at frame N and the object’s initial position; in other words, the absence of the object is also tracked. Since the initial position of the object stays the same no matter which frame is compared against the 0th frame, this area of absence remains in the same place for the entire duration. The code below demonstrates how the static motion mask is made.

def detect_motion(self, prev_frame, curr_frame, threshold):
    # Subtract in floating point so unsigned 8-bit frames don't wrap around,
    # then flag pixels whose normalized change exceeds the threshold.
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    return diff / 255.0 > threshold

# Static reference: compare the 0th frame with the current frame (one channel at a time).
static_motion_mask = self.detect_motion(self.color_video[0][:,:,channel], self.color_video[frame_idx+1][:,:,channel], threshold)

Motion Masking Using a Dynamic Reference

The other choice is to compare consecutive frames, that is, the (N-1)th frame and the Nth frame. We encounter a similar problem with this method: the absence of the object is also included in the resulting motion mask. This time, the mask captures both the position of the object at frame N-1 (instead of frame 0) and its position at frame N. The area of absence follows the object in motion, creating a trailing effect (like a lagging mouse cursor). The code below demonstrates how the dynamic motion mask is made.

# Dynamic reference: compare consecutive frames (one channel at a time).
dyn_motion_mask = self.detect_motion(self.color_video[frame_idx][:,:,channel], self.color_video[frame_idx+1][:,:,channel], threshold)

Neither the static-reference mask nor the dynamic-reference mask captures only where the object is in the current frame. Both masks capture the object in motion, but each also captures a different area of absence. If we take the intersection of the two masks (keeping only the pixels present in both), we keep the object’s current position and get rid of the areas of absence.

# Keep only the pixels flagged in both the static and dynamic masks.
self.motion_mask = np.logical_and(static_motion_mask, dyn_motion_mask)

Intersection of Static and Dynamic Motion Masks

The three figures above depict an idealized version of the results we would get if we thresholded individual pixels. In reality, we don’t get nice, complete boxes even in the simplest cases. To see why, imagine a falling box with two identical stripes running horizontally across it. In the current frame, we record the color of the pixels at the bottom stripe. In the next frame, the box has moved down such that the top stripe is now where the bottom stripe was in the previous frame. The difference in color at those pixels is negligible since the stripes are the same color, so they do not appear in the mask, while the rest of the box does because it has a slight color gradient. Apply this idea to individual pixels instead of stripes, and you have the reason why the motion masks are pixelated and full of noise. The solution to this problem is to perform a series of morphological operations. The order of these operations was determined through testing: we found the best results were obtained by first opening with a small kernel and then closing with a large kernel. This removes noise in the background and fills in the objects in motion without changing their size and shape by much.

Morphological Operations
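As a rough illustration of this step, the snippet below applies an opening followed by a closing using OpenCV’s morphologyEx; the kernel sizes are placeholders rather than the values we actually used.

import cv2 as cv
import numpy as np

# Placeholder kernel sizes: a "small" kernel for opening and a "large" one for closing.
open_kernel = np.ones((3, 3), np.uint8)
close_kernel = np.ones((15, 15), np.uint8)

mask = self.motion_mask.astype(np.uint8)
# Opening (erosion then dilation) removes isolated noise pixels in the background.
mask = cv.morphologyEx(mask, cv.MORPH_OPEN, open_kernel)
# Closing (dilation then erosion) fills the holes inside the moving objects.
self.motion_mask = cv.morphologyEx(mask, cv.MORPH_CLOSE, close_kernel)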

Connected Component Labelling

Once we have obtained a nice-looking motion mask, we need a way to extract objects from it (the mask being a 2-dimensional array of 1s and 0s). We determine the locations of the isolated regions by performing connected component labelling, opting to use OpenCV’s implementation, which has been optimized for speed, rather than writing our own. The result is a list of moving objects with the following attributes: position of the leftmost pixel, position of the topmost pixel, width, and height.

Connected Component Labelling

# connectedComponentsWithStats expects an 8-bit, single-channel image; each stats row
# holds [left, top, width, height, area], and row 0 is the background component.
self.motion_objs = cv.connectedComponentsWithStats(self.motion_mask.astype(np.uint8), 4, cv.CV_32S)[2][1:]

The next step is optional but provided to users: we filter out objects with an area less than a user-specified threshold. This helps eliminate some background noise, but the main purpose of the step is to let users who have prior knowledge about the objects that will be in motion specify the sizes of objects they want to include and/or exclude. Again, as stated in the design specifications, no prior knowledge of the objects should be needed, so the algorithm runs fine without a minimum area.

# Keep only objects whose bounding box (width * height) exceeds the minimum area.
self.motion_objs = [obj for obj in self.motion_objs if obj[2]*obj[3] > min_area]

I will note that the labels assigned to each object by connected component labelling aren’t useful for tracking, because there is no guarantee that the same object will receive the same label in two different frames. The labeling is only used to distinguish regions within a single frame, not to draw connections between frames.


Object Identification and Tracking

In order to keep track of a moving object’s trajectory, we needed a way to tell what objects were the same in two different frames. Our approach was to calculate a score for each object. If the score was similar enough between two objects, then we could say they’re the same. The way we calculated scores was admittedly crude and meant as a placeholder for a more refined and precise method of determining object similarity. We processed the pixels of the object in the HSV color space, averaging the value of each channel. We multiplied the averages by weights, which we manually determined through extensive testing, and summed the result to get a score for the object that lay between 0 and 1.

Object Scoring

One of the issues we encountered came from treating each object as all of the pixels inside its bounding box. Background pixels would be processed as part of the object, skewing the object’s score. We devised a simple but incomplete solution to this problem: we weighted the pixels at the center of the bounding box more heavily than those at the edges, on the idea that most of the object is likely located near the center of its bounding box.
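To illustrate the weighting, the sketch below averages one HSV channel over a bounding-box crop using a Gaussian weight centered on the box; the Gaussian form and the sigma_frac parameter are illustrative assumptions, not the exact weighting we used.

import numpy as np

def center_weighted_average(patch, sigma_frac=0.25):
    # patch: one HSV channel cropped to an object's bounding box.
    # Weight each pixel by a Gaussian centered on the box (illustrative choice).
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma_y, sigma_x = max(h * sigma_frac, 1e-6), max(w * sigma_frac, 1e-6)
    weights = np.exp(-(((ys - cy) / sigma_y) ** 2 + ((xs - cx) / sigma_x) ** 2) / 2.0)
    return float(np.sum(weights * patch) / np.sum(weights))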

Of course, we cannot expect an object’s score to be identical between different frames. This is partly due to our flawed scoring system and partly due to the fact that the object may be rotating while it is in translational motion. Hence, slight variations in the scores of the same object across frames are expected, and we needed a way to decide how similar two objects’ scores must be for them to be considered the same object. Our approach was to quantize the scores by rounding them to the nearest bin.

Score Quantization and Object History

However, issues can arise if we choose too small or too large a quantization step; this will be discussed in more detail in a later section. Once we could determine which objects were the same between different frames, we organized all the objects’ data into a dictionary, with the score as the key and a list of the object’s times and positions as the value. This data structure made it easy to determine whether an object had moved from a previous position in the camera view (its score is already a key in the dictionary) or had just begun moving and had yet to be tracked (its score is not a key in the dictionary). There is another complication that I will discuss later, but in short, I also made the assumption that two objects cannot have the same score within the same frame, or time step. Put together, the code implementation of the scoring and tracking of objects looks like the following:

def similiarity_score(self, frame_idx, tracked_obj, wH, wS, wV):
    # Average each HSV channel over the object's bounding box, normalize (OpenCV hue
    # spans 0-179; saturation and value span 0-255), and combine with hand-tuned weights.
    h = self.calculate_average(frame_idx, tracked_obj, 0) / 179.0
    s = self.calculate_average(frame_idx, tracked_obj, 1) / 255.0
    v = self.calculate_average(frame_idx, tracked_obj, 2) / 255.0
    return wH*h + wS*s + wV*v

def update_obj_hist(self, frame_idx, time, wH, wS, wV, similarity):
    unique = {}
    for obj in self.motion_objs:
        raw_score = self.similiarity_score(frame_idx, obj, wH, wS, wV)
        # Quantize the score so slightly different scores map to the same bin.
        score = round(self.round_nearest(raw_score, similarity), 2)
        # Assume at most one object per score within a single frame; extras are ignored.
        if score not in unique:
            if score in self.obj_hist:
                # Seen before: reuse the existing ID and append to the object's history.
                new_obj = (time, self.obj_hist[score][-1][1], obj, score)
                self.obj_hist[score].append(new_obj)
            else:
                # New object: start a history entry and assign a fresh ID.
                self.obj_hist[score] = [(time, self.id_counter, obj, score)]
                self.id_counter += 1
            unique[score] = True
        self.motion_objs_stamped.append((time, 0, obj, score))
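One detail not shown above is round_nearest. Assuming it simply snaps a score to the nearest multiple of the bin width, a plausible (hypothetical) implementation would be:

def round_nearest(self, value, step):
    # Hypothetical implementation: snap the score to the nearest multiple of `step`
    # (the bin width), so nearby scores fall into the same bin.
    return round(value / step) * step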

Speed Filter

The last step of the algorithm is quite simple. Every time an object’s score is already a key in the dictionary, we can calculate its speed by dividing the pixel distance it has traveled by the elapsed time. If the object is traveling faster than the user-specified threshold, we append the object’s data to a list of speedy objects. To visualize our results, we draw green bounding boxes around these speedy objects.

Calculation of Object’s Velocity
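Below is a minimal sketch of this step. It assumes the (time, id, stats, score) history entries built by update_obj_hist above; the helper itself is a hypothetical stand-in, not our exact implementation.

import math

def find_speedy_objects(obj_hist, speed_threshold):
    # Hypothetical helper: walk each tracked object's history and keep detections
    # whose pixel speed between consecutive sightings exceeds the threshold.
    speedy = []
    for score, entries in obj_hist.items():
        for (t0, _, stats0, _), (t1, _, stats1, _) in zip(entries, entries[1:]):
            dx = stats1[0] - stats0[0]              # change in leftmost-pixel x
            dy = stats1[1] - stats0[1]              # change in topmost-pixel y
            speed = math.hypot(dx, dy) / (t1 - t0)  # pixel distance per unit time
            if speed > speed_threshold:
                speedy.append((t1, stats1, score, speed))
    return speedy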


Analysis

In our design specifications, we stated that the implementation should be fast, so we tested it in real time. We mocked up a second version of the code with minor adaptations so that the feed from a camera, rather than frames from an existing video, was fed into the algorithm. Testing this live version, we observed some lag while moving objects were being tracked. However, we thought it was good enough for a first iteration, given that no time had been spent making the code as efficient as possible. The shadow problem was less apparent when testing the live version because we didn’t use a white backdrop.

Also in our design specifications, we stated that our design should work with any camera and be resistant to slight vibrations or changes in camera position. While we didn’t measure metrics that could tell us whether we achieved this, we can reason about why our implementation does. When we process videos and camera feeds, we lower the resolution and the frame rate of the images in software. Hence, within reason, it doesn’t matter how good the camera is; the algorithm chooses a resolution that we found captures the least noise while keeping enough pixels that small objects aren’t neglected. We are also confident that our algorithm is resistant to vibrations or slight variations in camera position: minor shifts in camera position cause edges to be captured in the motion mask, but since we open and close the motion masks, these edges should be removed.
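For reference, this kind of software downscaling can be done with a simple resize plus frame skipping; the sketch below is illustrative, and the target width is a placeholder rather than the resolution we chose.

import cv2 as cv

TARGET_WIDTH = 320   # placeholder working width, not the value we settled on

def downsample(frame):
    # Shrink the frame to the working width while preserving its aspect ratio;
    # INTER_AREA is the usual interpolation choice for downscaling.
    h, w = frame.shape[:2]
    new_h = int(h * TARGET_WIDTH / w)
    return cv.resize(frame, (TARGET_WIDTH, new_h), interpolation=cv.INTER_AREA)

# Lowering the effective frame rate amounts to skipping frames, e.g. processing
# only every other frame of the feed.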

Lastly, we specified that our design should be transparent: every step should be visualizable. We added options, which users can simply toggle on or off, to view the motion masks or the objects in motion (including ones moving below the speed threshold) along with their identification scores. See the image below for an example.

Identification Score (left) and Motion Mask (right)
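The overlay itself can be drawn with standard OpenCV primitives. The helper below is an illustrative sketch (its name and arguments are not from our code); it assumes each object is a stats row of the form [left, top, width, height, area].

import cv2 as cv

def draw_debug_view(frame, motion_objs, scores, show_scores=True):
    # Hypothetical overlay helper: outline each moving object and optionally print
    # its identification score above the bounding box.
    vis = frame.copy()
    for (x, y, w, h, _), score in zip(motion_objs, scores):
        x, y, w, h = int(x), int(y), int(w), int(h)
        cv.rectangle(vis, (x, y), (x + w, y + h), (0, 255, 0), 2)
        if show_scores:
            cv.putText(vis, f"{score:.2f}", (x, y - 5),
                       cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return vis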

There are two likely reasons why smaller objects are being missed. The first is the erosion of sparse areas in the motion masks: sometimes cleaning out noise erases small objects. The second is related to the inclusion of shadow in the object’s bounding box. The shadow of a small object is approximately the same size as, if not larger than, the object itself, so when the score is calculated for a small object, the sampled pixels are split fairly evenly between object and background pixels, resulting in highly variable scores for the same object across frames. Although these issues are most apparent with smaller objects, they could also be why large objects are occasionally missed for single frames. Looking at the objects’ scores, I would say that the majority of our problems come from our flawed method of scoring objects to determine similarity.

Another issue also arises from our crude method of scoring. If two objects have the same score in the same frame, our algorithm arbitrarily ignores one of them. We do this because, without a smarter way of judging the similarity of objects between frames, such an occurrence would imply that the same object is in two places at once. We never tested having two identical objects move in the camera’s view at the same time, so this shouldn’t have been a problem. However, because our scoring is flawed (background pixels contribute to the score), two different objects would on occasion receive the same score in the same frame, causing one of them to be ignored. This is why one of the stuffed animals is missed for a single frame as it falls simultaneously with another stuffed animal. It also highlights one of the limitations of our algorithm: we aren’t able to track similar objects (like two red balls) moving at the same time.


Improvements

There are three major areas of improvement that would fix the issues we encountered in our demonstrations. First, an effort should be made to eliminate the capture of objects’ shadows. We could apply some thresholding to create a mask of the areas without shadows and then intersect that mask with our motion mask so that shadowed regions are dropped.

Second, there should be an additional step we take when calculating an object’s score. Right now, we are iterating through all the pixels in the bounding box, which includes some of the background. We proposed using an edge detector on the area within the bounding box to extract the object’s contour. Then, we can calculate the object’s score only using the pixels inside the contour.
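A rough sketch of this proposed refinement, using OpenCV’s Canny edge detector and contour extraction (the thresholds and helper name are placeholders):

import cv2 as cv
import numpy as np

def object_pixel_mask(gray_patch):
    # Proposed refinement (not implemented in our prototype): find the object's contour
    # inside its bounding box and build a mask so only interior pixels are scored.
    # gray_patch is assumed to be an 8-bit grayscale crop of the bounding box.
    edges = cv.Canny(gray_patch, 50, 150)                     # placeholder thresholds
    contours, _ = cv.findContours(edges, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(gray_patch, dtype=np.uint8)
    if contours:
        largest = max(contours, key=cv.contourArea)
        cv.drawContours(mask, [largest], -1, color=255, thickness=cv.FILLED)
    return mask  # score only the pixels where mask == 255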

Last, we suggest incorporating machine learning into the object-scoring step. Right now, we manually adjust the weights of the object’s attributes (the average of each channel in the HSV color space); machine learning could determine these weights for us. The training data would consist of two sets of object pairs: objects that are the same but possibly rotated, and objects that are different. We could also include more object attributes, such as the straightness of the object’s contours or the object’s size, in the scoring. We could even apply the Scale-Invariant Feature Transform (SIFT) to determine an object’s keypoints and use those to match objects across frames.
