Ankit Khare

Street Parking using Deep Learning and Computer Vision

2019-01-29T14:20:00+00:00

A pre-trained Mask-RCNN from Matterport can easily be used to detect cars in a parking lot. To try it out, I recorded a video of the parking area near my apartment. Even with my hands shaking from the cold, the prototype successfully detects an available parking space.

Pardon my shaking hands. It was cold outside.

Observe the change of color in the other parking spots. It happens mainly because the camera moves while recording, so a parked car drifts out of its marked spot. Using the Twilio API, we can easily generate a number and use it to send a custom message to our own cell phone whenever a parking spot becomes available. There’s a great Medium post here which describes the process flow. The underlying assumption is that the first frame determines the parking spots, so no car in the first frame should be a moving one.

Assumption: The first frame will determine the parking spots and no car in the first frame should be in motion

This is very inconvenient. We can’t expect to take our cell phone out and get fooled by a moving car just because it was in the first frame. So, we need to think of something better. What about identifying the static cars by observing them for 5 seconds and assuming that only those are parked in the authorized parking area? This way, moving cars would not hamper our system.

Observe the car passing by at the beginning of the video. Our new method is working great!

The approach is pretty simple. I took two consecutive frames and compared them for possible motion using frame subtraction. Next, I masked out (filled with black) the area occupied by the moving vehicle so that MASK-RCNN would not capture it.

This frame makes the operations performed in the code fairly intuitive, I guess

    import cv2
    import numpy as np

    # Excerpt: video_capture (a cv2.VideoCapture) and counter are set up elsewhere in the notebook
    while video_capture.isOpened():
        success, frame = video_capture.read()

        if not success:
            print("couldn't read video")
            break

        elif counter < 40:
            # read the next frame from the same capture and diff the two frames to look for motion
            success, frame2 = video_capture.read()
            if not success:
                break
            d = cv2.absdiff(frame, frame2)
            grey = cv2.cvtColor(d, cv2.COLOR_BGR2GRAY)
            blur = cv2.GaussianBlur(grey, (1, 1), 0)
            ret, th = cv2.threshold(blur, 20, 255, cv2.THRESH_BINARY)

            # dilate, then erode, with a large kernel so the moving car becomes one solid blob
            # that can be blanked out before MASKRCNN sees the frame
            kernel = np.ones((30, 30), np.uint8)
            dilated = cv2.dilate(th, kernel, iterations=1)
            eroded = cv2.erode(dilated, kernel, iterations=1)

            # fill the contours with black for an even better masking of the vehicle
            # ([-2:] copes with findContours returning 3 values in OpenCV 3 and 2 in OpenCV 4)
            c, h = cv2.findContours(eroded, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]
            frame2 = cv2.drawContours(frame2, c, -1, (0, 0, 0), cv2.FILLED)

For the full code, check park_clever.ipynb in the Git repo here.
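To make the frame-subtraction step easier to picture, here is a minimal pure-Python sketch of the difference-and-threshold idea on two tiny made-up grayscale frames. The function name `motion_mask` and the pixel values are illustrative assumptions; the notebook itself does this with OpenCV on real video frames.

```python
def motion_mask(frame_a, frame_b, threshold=20):
    """Absolute difference of two grayscale frames, thresholded to 0 or 255."""
    return [[255 if abs(a - b) > threshold else 0
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


# Hypothetical 3 x 4 frames: a bright "car" (value 200) moves one pixel to the right.
frame_a = [[10, 10, 10, 10],
           [10, 200, 200, 10],
           [10, 10, 10, 10]]
frame_b = [[10, 10, 10, 10],
           [10, 10, 200, 200],
           [10, 10, 10, 10]]

for row in motion_mask(frame_a, frame_b):
    print(row)
```

Only the pixels the car left and the pixels it entered light up in the mask, which is why the dilation step with a large kernel is useful for turning those fragments into one solid region.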

Let’s check how well our system performs at night, just for fun :)

Credit to Mask RCNN: it works pretty well even at night with a poor-quality input video

What if we use an iPhone 7 Plus? Let’s see:

Far better! It's funny how the leftmost car gets identified by MASK-RCNN with full confidence as soon as the headlights of the 'Camry' focus on it

Easy, right? Now all I ask of you is to check my other YOLO repository to see how we can speed up the process using batch processing. Essentially, I am asking you to read multiple frames, keep them in a buffer, and then send them to the GPU for processing all at once to maximize GPU utilization. This way you might be able to take full advantage of Colab’s free K80 GPU with its 12 GB of memory. Have fun, and do let me know if you come up with any further cool ideas! You can find the code on my Git. The code is runnable on Google Colab.
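As a rough illustration of the buffering idea, here is a small sketch that groups frames into fixed-size batches. The helper name `batched_frames` and the batch size are assumptions for illustration, and the integers stand in for decoded video frames; this is not the code from the YOLO repository.

```python
from itertools import islice


def batched_frames(frames, batch_size):
    """Yield lists of up to batch_size frames from any iterable of frames."""
    iterator = iter(frames)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for decoded video frames; real code would yield images instead.
fake_frames = range(10)
for batch in batched_frames(fake_frames, batch_size=4):
    # A real pipeline would hand the whole batch to the detector in one call here.
    print(len(batch), batch)
```

The last batch may be smaller than the others, so whatever consumes the batches should not assume a fixed size.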

Intro to CNN

2018-04-11T11:20:00+00:00

Introduction

Convolutional Neural Networks: the three words together sound like a weird combination of biology and math with a little computer science sprinkled in, but these networks have been some of the most influential innovations in the field of computer vision. 2012 was the first year that neural nets grew to prominence, as Alex Krizhevsky used them to win that year’s ImageNet competition (basically, the annual Olympics of computer vision), dropping the classification error record from 26% to 15%, an astounding improvement at the time. Ever since, a host of companies have been using deep learning at the core of their services. The classic, and arguably most popular, use case of these networks is image processing. Within image processing, let’s take a look at how to use CNNs for image classification.

The Problem Space

Image classification is the task of taking an input image and outputting a class (cat, dog, etc.) or a probability of classes that best describes the image. For humans, this task of recognition is one of the first skills we learn from the moment we are born, and it comes naturally and effortlessly to us as adults. Without even thinking twice, we’re able to identify the environment we’re in, quickly and seamlessly. When we see an image or just look at the world around us, most of the time we can immediately characterize the scene and give each object a label, all without even consciously noticing. These skills of quickly recognizing patterns, generalizing from prior knowledge, and adapting to different image environments are ones that we do not share with our fellow machines.

Inputs and Outputs to the network

When a computer views an image (takes an image as input), it sees an array of pixel values. Depending on the resolution and size of the image, it will see, for example, a 32 x 32 x 3 array of numbers (the 3 refers to the RGB values). Just to drive the point home, let’s say we have a color image in JPG form and its size is 480 x 480. The representative array will be 480 x 480 x 3. Each of these numbers is given a value from 0 to 255 that describes the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it outputs numbers that describe the probability of the image being a certain class (.70 for cat, .25 for dog, .05 for bird, etc.).

What We Want our Computer to Do

Now that we know the problem as well as the inputs and outputs, let’s think about how to approach it. What we want the computer to do is to differentiate between all the images it is given and figure out the unique features that make a dog a dog or a cat a cat. This is the process that goes on subconsciously in our minds as well. When we look at a picture of a dog, we can classify it as such if the picture has identifiable features like paws or 4 legs. In a similar way, the computer is able to perform image classification by looking for low-level features such as edges and curves, and then building up to more abstract concepts through a series of convolutional layers. This is a general overview of what a CNN does. Now, let’s get into the specifics.

Biological Connection

But first, a little history. It’s not going to be boring at all, I promise. You may have been thinking about something related to neuroscience or biology when you first heard the term Convolutional Neural Networks, and guess what? You would be right. CNNs take their structural inspiration from biology, specifically from the visual cortex. The visual cortex has small regions of cells that are sensitive to specific regions of the visual field. A fascinating experiment by Hubel and Wiesel in 1962 (Video) expanded on this idea: they showed that some individual neuronal cells in the brain responded (or fired) only in the presence of edges of a certain orientation. Hubel and Wiesel found that all of these neurons were organized in a columnar architecture and that together they were able to produce visual perception. This idea of specialized components inside a system having specific tasks (the neuronal cells in the visual cortex looking for specific characteristics) is one that machines use as well, and it is the basis behind CNNs.

Structure

Back to the particulars. A more detailed overview of what CNNs do would be that you take the image, pass it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers, and get an output. As we said earlier, the output can be either a single class or a probability of classes that best describes the image. The hard part now is understanding what each of these layers does. So let’s start with the most important one.

First Layer – Math Part

In a CNN the first layer is always a convolutional layer. The first thing to make sure of is that you remember what the input to this conv layer (I will use that abbreviation a lot) is. As we already mentioned, the input is a 32 x 32 x 3 array of pixel values. Now, the best way to explain a conv layer is to imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 5 x 5 area. And now, let’s imagine this flashlight sliding across all the areas of the input image. In machine learning terms, this flashlight is called a filter (or sometimes a neuron or a kernel), and the region it is shining over is called the receptive field. This filter is also an array of numbers (the numbers are called weights or parameters). A very important note is that the depth of this filter has to be the same as the depth of the input (this makes sure the math works out), so the dimensions of this filter are 5 x 5 x 3. Now, let’s take the first position the filter is in, for example. That would be the top left corner. As the filter slides, or convolves, around the input image, it multiplies the values in the filter by the original pixel values of the image (aka computing element-wise multiplications). These multiplications are all summed up (mathematically speaking, that is 75 multiplications in total). So now you have a single number. Remember, this number only represents the filter being at the top left of the image. Now we repeat this process for every location on the input volume. (The next step would be to move the filter to the right by 1 unit, then right again by 1, and so on.) Every unique location on the input volume produces a number. After sliding the filter over all the locations, you will find that what you’re left with is a 28 x 28 x 1 array of numbers, which we call an activation map or feature map.
The reason you get a 28 x 28 array is that there are 784 different locations where a 5 x 5 filter can fit on a 32 x 32 input image. Those 784 numbers are mapped to a 28 x 28 array.
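The arithmetic above can be checked with a short sketch. It assumes stride 1 and no padding, works on a single depth slice for brevity (with 3 channels you would add up the three slices, giving the 75 products per position), and uses made-up all-ones data. The function names are illustrative, not from any library.

```python
def conv_output_size(input_size, filter_size, stride=1):
    """Number of positions a filter can take along one dimension (no padding)."""
    return (input_size - filter_size) // stride + 1


def convolve_valid(image, kernel):
    """Slide a square 2D kernel over a 2D image (lists of lists), stride 1, no padding."""
    k = len(kernel)
    out_h = conv_output_size(len(image), k)
    out_w = conv_output_size(len(image[0]), k)
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [[1] * 32 for _ in range(32)]   # one 32 x 32 depth slice, all ones
kernel = [[1] * 5 for _ in range(5)]    # one 5 x 5 filter slice, all ones
feature_map = convolve_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))   # 28 28
print(feature_map[0][0])                       # 25, the sum of 25 products
print(conv_output_size(32, 5) ** 2)            # 784 filter positions
```

The same helper gives 26 for a 7 x 7 filter on a 32 x 32 input, since 32 - 7 + 1 = 26.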

(Quick Note: Some of the images I used, including the one above, came from this awesome book, Michael Nielsen’s “Neural Networks and Deep Learning.” Strongly recommended, by the way.) Let’s say now that we use two 5 x 5 x 3 filters instead of one. Our output volume would then be 28 x 28 x 2. Using more filters increases the depth of the output volume, so the layer can describe more distinct features at each spatial location. This is mathematically what happens in a convolutional layer.

First Layer – High Level Perspective

Let’s talk about what this convolution actually does at a high level. Each of these filters can be thought of as a feature identifier. When I say features, I’m talking about things like straight edges, simple colours, and curves. Think about the simplest characteristics that all images have in common. Let’s say that our first filter is 7 x 7 x 3 and is going to be a curve detector. (In this section, for simplicity, let’s ignore the fact that the filter is 3 units deep and only consider the top depth slice of the filter and the image.) As a curve detector, the filter will have a pixel structure in which there are higher numerical values along the area that has the shape of a curve.

Now let’s get back to visualizing this mathematically. When we have this filter at the top left corner of the input volume, it computes multiplications between the filter values and the pixel values in that region. Now let’s take an example of an image we’d like to classify, and put our filter in the top left corner.

Remember, what we need to do is multiply the values in the filter by the image’s original pixel values.

Basically, if there is a shape in the input image that generally resembles the curve this filter represents, then all of the multiplications summed up together will result in a large value! Now let’s see what happens with our filter when it moves.

The value is a lot smaller! This is because there was nothing in that section of the image that responded to the curve detector filter. Remember, the output of this conv layer is an activation map. So, in the simple case of a one-filter convolution (and if that filter is a curve detector), the activation map shows the areas of the picture in which curves are most likely to occur. In this example, the top left value of our 26 x 26 x 1 activation map (26 because of the 7 x 7 filter instead of 5 x 5) will be 6600. This high value means that there is probably some kind of curve in the input volume that caused the filter to activate. The top right value in our activation map will be 0 because there was nothing in the input volume that caused the filter to activate (or, more simply, there was no curve in that region of the original image). Remember, this is for one filter only. This is just a filter that detects lines curving outwards and to the right. We can have other filters for lines that curve to the left or for straight edges. The more filters, the greater the depth of the activation map, and the more information we have about the input volume.
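Here is a tiny sketch of that "large value versus small value" behaviour. The 3 x 3 diagonal-line filter and both patches are made-up numbers chosen for illustration (they are not the curve detector or the image from the figures), and `filter_response` is a hypothetical helper name.

```python
def filter_response(patch, kernel):
    """Sum of element-wise products between an image patch and a filter."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel)
               for p, w in zip(patch_row, kernel_row))


# A hypothetical 3 x 3 "diagonal line" detector (weights are made up).
kernel = [[30, 0, 0],
          [0, 30, 0],
          [0, 0, 30]]

# One patch contains a bright diagonal line; the other is bright elsewhere.
matching_patch = [[50, 0, 0],
                  [0, 50, 0],
                  [0, 0, 50]]
other_patch = [[0, 0, 50],
               [0, 0, 0],
               [50, 0, 0]]

print(filter_response(matching_patch, kernel))  # 4500
print(filter_response(other_patch, kernel))     # 0
```

The filter produces a large number only where the bright pixels line up with its large weights, which is all "activating" means here.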

Disclaimer: The filter I described in this section was simplified in order to describe the math that goes on during a convolution. In the image below, you will see some examples of actual visualizations of the filters of a trained network’s first conv layer. Nonetheless, the main argument remains the same. The filters on the first layer convolve around the input image and “activate” (or compute high values) when the specific feature they are looking for is in the input volume.

(Quick Note: The above image was taken from Stanford’s CS 231N course taught by Andrej Karpathy and Justin Johnson. Recommended for anyone seeking a deeper understanding of CNNs.)

Going Deeper Through the Network

Now, in a traditional convolutional neural network architecture, there are other layers interspersed between these conv layers. I would strongly encourage those interested to read about them and understand their function and effects, but in general, they provide nonlinearities and preservation of dimension that help improve the robustness of the network and control overfitting. A classic CNN architecture would be a stack of conv layers interleaved with nonlinear and pooling layers, ending in a fully connected layer.

The last layer, however, is an important one that we will go into later. Let’s just take a step back and review what we have learned so far. We have talked about what the filters in the first conv layer are designed to detect. They detect low-level features such as edges and curves. As one would imagine, in order to predict whether an image is a type of object, we need the network to be able to recognize higher-level features such as hands or paws or ears. So let’s think about what the output of the network is after the first conv layer. It would be a 28 x 28 x 3 volume (assuming we use three 5 x 5 x 3 filters). When we go through another conv layer, the output of the first conv layer becomes the input of the 2nd conv layer. Now, that’s a bit harder to visualize. When we were talking about the first layer, the input was simply the original image. When we talk about the 2nd conv layer, though, the input is the activation map(s) that result from the first layer. So each layer of the input basically describes the locations in the original image where certain low-level features appear. Now, when you apply a set of filters on top of that (pass it through the 2nd conv layer), the output will be activations that represent higher-level features. Types of these features could be semicircles (a combination of a curve and a straight edge) or squares (a combination of several straight edges). As you go through the network and through more conv layers, you get activation maps that represent more and more complex features. By the end of the network, you may have some filters that activate when there is handwriting in the image, filters that activate when they see pink objects, and so on. If you want more information about visualizing filters in ConvNets, Matt Zeiler and Rob Fergus had an excellent research paper on the topic. Jason Yosinski also has a YouTube video which gives a great visual representation.
Another interesting thing to note is that as you go deeper into the network, the filters start to have a larger and larger receptive field, which means they can consider information from a larger area of the original input volume (another way of putting it is that they are responsive to a larger region of pixel space).
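How quickly the receptive field grows can be worked out with a few lines of arithmetic. The sketch below assumes plain stacked conv layers with no pooling between them, and `receptive_field` is an illustrative helper name rather than a library function.

```python
def receptive_field(filter_sizes, strides=None):
    """Receptive field (input pixels per dimension) after each stacked conv layer."""
    if strides is None:
        strides = [1] * len(filter_sizes)
    field, jump = 1, 1
    sizes = []
    for k, s in zip(filter_sizes, strides):
        field += (k - 1) * jump   # each layer widens the view by (k - 1) input steps
        jump *= s                 # strides make later layers step further on the input
        sizes.append(field)
    return sizes


print(receptive_field([5, 5, 5]))  # [5, 9, 13]
```

So a value in the third of three 5 x 5 conv layers depends on a 13 x 13 region of the original image, and pooling or strides larger than 1 make the region grow faster still.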

Fully Connected Layer

Now that we can detect these high-level features, the icing on the cake is attaching a fully connected layer to the end of the network. This layer basically takes an input volume (whatever the output is of the preceding conv or ReLU or pool layer) and outputs an N-dimensional vector, where N is the number of classes that the program has to choose from. For example, if you wanted a digit classification program, N would be 10, because there are 10 digits. Each number in this N-dimensional vector represents the probability of a certain class. For example, if the resulting vector for a digit classification program is [0 .1 .1 .75 0 0 0 0 0 .05], then this represents a 10 percent probability of the image being a 1, a 10 percent probability of the image being a 2, a 75 percent probability of the image being a 3, and a 5 percent probability of the image being a 9. (Side note: there are other ways you can represent the output, but I’m just showing the softmax approach.) The way this fully connected layer works is by looking at the output of the previous layer (which, as we remember, should represent the activation maps of high-level features) and determining which features correlate most with a particular class. For example, if the program predicts that some image is a dog, the activation maps will have high values that represent high-level features like a paw or 4 legs. Similarly, if the program predicts that some image is a bird, the activation maps will have high values that represent high-level features like wings or a beak. Basically, an FC layer looks at which high-level features most strongly correlate with a particular class and has particular weights, so that when you compute the products between the weights and the previous layer, you get the correct probabilities for the different classes.
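A toy version of this last step might look like the sketch below: a fully connected layer computes one score per class, and a softmax turns the scores into probabilities. The three "features", the two classes, and all the weights are made up for illustration; a real network learns thousands of such weights.

```python
import math


def fully_connected(features, weights, biases):
    """One score per class: dot product of the features with that class's weights."""
    return [sum(f * w for f, w in zip(features, class_weights)) + b
            for class_weights, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical high-level features: [has_paws, has_wings, has_beak]
features = [1.0, 0.0, 0.0]
# Made-up weights, one row per class: dog, bird
weights = [[2.0, -1.0, -1.0],
           [-1.0, 2.0, 2.0]]
biases = [0.0, 0.0]

probs = softmax(fully_connected(features, weights, biases))
print(probs)   # the dog probability is much larger than the bird probability
```

Because the "paws" feature carries a large positive weight for the dog class and a negative one for the bird class, an input with paws ends up with most of the probability on dog.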

Training (AKA: What Makes This Stuff Work)

Now, this is the one aspect of neural networks that I have deliberately not mentioned yet, and it is probably the most important part. You may have had lots of questions while reading. How do the filters in the first conv layer know to look for edges and curves? How does the fully connected layer know which activation maps to look at? How do the filters in each layer know what values to have? The way the computer is able to adjust its filter values (or weights) is through a training process called backpropagation.

Before we get into backpropagation, we first have to take a step back and talk about what a neural network needs in order to work. At the moment we were all born, our minds were fresh. We didn’t know what a cat or dog or bird was. In a similar way, before the CNN starts, the weights or filter values are randomized. The filters don’t know to look for edges and curves. The filters in the higher layers don’t know to look for paws and beaks. As we grew older, however, our parents and teachers showed us different pictures and images and gave us a corresponding label. This idea of being given an image and a label is the training process that CNNs go through. Before getting too deep into it, let’s just say that we have a training set with thousands of pictures of dogs, cats, and birds, and each of the pictures has a label saying what animal it shows. Back to backprop.

So backpropagation can be divided into 4 distinct sections: the forward pass, the loss function, the backward pass, and the weight update. During the forward pass, you take a training image which, as we recall, is a 32 x 32 x 3 array of numbers and pass it through the entire network. On our first training example, since all of the weights or filter values were randomly initialized, the output will probably be something like [.1 .1 .1 .1 .1 .1 .1 .1 .1 .1], basically an output that does not give preference to any number in particular. With its current weights, the network is unable to look for those low-level features and is therefore unable to draw any reasonable conclusion about what the classification might be. This brings us to the loss function part of backpropagation. Remember that what we are using right now is training data. This data has both an image and a label. Let’s say, for example, that the first training image you input was a 3. The label for the image would be [0 0 0 1 0 0 0 0 0 0]. A loss function can be defined in many different ways, but a common one is MSE (mean squared error), which is ½ times (actual − predicted) squared, summed over the outputs: $L = \sum \frac{1}{2}(\text{actual} - \text{predicted})^2$.
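As a quick numerical check of that loss, here is a sketch that evaluates it for an untrained, uniform output and for a more confident one. The helper name and the "confident" prediction are made up for illustration.

```python
def half_squared_error(target, predicted):
    """L = sum over outputs of 1/2 * (target - predicted)^2."""
    return sum(0.5 * (t - p) ** 2 for t, p in zip(target, predicted))


label = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # one-hot label for the digit 3
untrained = [0.1] * 10                    # no preference for any digit
confident = [0, 0, 0.05, 0.9, 0, 0, 0, 0, 0.05, 0]

print(half_squared_error(label, untrained))
print(half_squared_error(label, confident))
```

The uniform guess gives a loss of about 0.45, while the confident and correct prediction gives about 0.0075, which is the direction training tries to push the network in.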

Let’s say this value is equal to the variable L. As you can imagine, the loss will be extremely high for the first couple of training images. Now, let’s just think about this intuitively. We want to get to a point where the predicted label (the output of the ConvNet) is the same as the training label (this means that our network got its prediction right). In order to get there, we want to minimize the amount of loss we have. Visualizing this as a simple optimization problem in calculus, we want to find out which inputs (weights in our case) contributed most directly to the loss (or error) of the network.

This is the mathematical equivalent of dL/dW, where W is the weights at a particular layer. What we want to do now is perform a backward pass through the network, which determines which weights contributed most to the loss and finds ways to adjust them so that the loss decreases. Once we compute this derivative, we go on to the final step, which is the weight update. This is where we take all the weights of the filters and update them so that they change in the opposite direction of the gradient.

The learning rate is a parameter selected by the programmer. A high learning rate means that the weight updates take bigger steps and therefore, it may take less time for the model to converge on an optimal set of weights. However, an overly high learning rate could result in jumps that are too large and not accurate enough to reach the optimum point.
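The effect of the learning rate is easy to see on a toy problem. The sketch below applies the update rule w = w - learning_rate * dL/dw to a made-up one-weight loss L(w) = (w - 3)^2, whose best value is w = 3. The loss, the step count, and the three learning rates are all assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize L(w) = (w - 3)^2 with repeated updates w = w - lr * dL/dw."""
    for _ in range(steps):
        grad = 2 * (w - 3)            # dL/dw for this toy loss
        w = w - learning_rate * grad  # step against the gradient
    return w


for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With 0.01 the weight is still far from 3 after 20 steps, with 0.1 it is close to 3, and with 1.1 every step overshoots by more than the last, so the weight moves further and further away from the optimum.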

One training iteration is the process of forward pass, loss function, backward pass, and parameter update. For each set of training images (commonly called a batch), the program will repeat this process for a fixed number of iterations. Once you finish the parameter update on the last training example, hopefully the network should be trained well enough that the weights of the layers are tuned correctly.

Testing

Finally, to see whether or not our CNN works, we take a different set of images and labels (can’t double dip between training and testing!) and pass the images through the CNN. We compare the outputs to the ground truth and see if our network works!
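One simple way to do that comparison is to count how often the most probable class matches the label. The sketch below uses made-up outputs for three held-out images, and `accuracy` is an illustrative helper, not a function from a particular framework.

```python
def accuracy(predicted_probs, true_labels):
    """Fraction of examples whose highest-probability class matches the label."""
    correct = 0
    for probs, label in zip(predicted_probs, true_labels):
        predicted_class = probs.index(max(probs))
        if predicted_class == label:
            correct += 1
    return correct / len(true_labels)


# Made-up network outputs for three held-out images over classes [cat, dog, bird]
outputs = [[0.70, 0.25, 0.05],
           [0.10, 0.80, 0.10],
           [0.40, 0.35, 0.25]]
labels = [0, 1, 2]   # ground truth: cat, dog, bird
print(accuracy(outputs, labels))
```

Here the network gets the cat and the dog right but calls the bird a cat, so two of the three test images are classified correctly.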

How Companies Use CNNs

Data, data, data. The companies that have lots of this magic 4-letter word are the ones that have an inherent advantage over the rest of the competition. The more training data you can give to a network, the more training iterations you can make, the more weight updates you can make, and the better tuned the network is when it goes to production. Facebook (and Instagram) can use all the photos of the billion users it currently has, Pinterest can use information from the 50 billion pins on its site, Google can use search data, and Amazon can use data from the millions of products that are bought every day. And now you know the magic behind how they use it.

Disclaimer: Although this post should be a good start to understanding CNNs, it is by no means a comprehensive overview. Things not discussed in this post include the nonlinear and pooling layers as well as hyperparameters of the network such as filter sizes, strides, and padding. Topics such as network architecture, batch normalization, vanishing gradients, dropout, initialization techniques, non-convex optimization, biases, choices of loss functions, data augmentation, regularization methods, computational considerations, modifications of backpropagation, and more were also not discussed (yet).

Collection of Material to understand BackProp

2017-10-10T15:55:52+00:00

I went through all of the links listed below and found them to be a comprehensive guide to understanding backpropagation. Please consider going through these awesome tutorials and articles to get a solid grip on backprop, as it is the backbone of neural networks.

My notes on GitHub!

2017-09-28T12:55:52+00:00

Here are a few interesting videos which would help you get started with GitHub:

https://www.youtube.com/watch?v=SWYqp7iY_Tc - Intro
https://www.youtube.com/watch?v=HVsySz-h9r4 - Intro
https://www.youtube.com/watch?v=xuB1Id2Wxak - Intro + few details
Playlist - Intro + a lot of details

Let’s start with some one-liners:

Git is a version control tool
GitHub Inc. is an organization that provides web-based hosting services for distributed version control
GitHub is Git based and has its own features as well
Git is open source
Git is written in C and hence it is fast
Git is lightweight (it doesn’t use a lot of space or processing power of your computer)
Git does lossless compression of files to store them on local repository or central repository
Git is secure: it names and checks every object by its SHA-1 hash (a hash, not encryption)
Git follows a non-linear structure called a Directed Acyclic Graph (DAG)
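To make the SHA-1 point concrete, here is a small sketch of how Git derives the ID of a file’s contents (a "blob"): it hashes a short header followed by the bytes of the file. This assumes a repository using Git’s default SHA-1 object format, and `git_blob_hash` is just an illustrative helper name.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """SHA-1 object ID Git assigns to a file's contents (a 'blob')."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b""))                # ID of an empty file
print(git_blob_hash(b"hello world\n"))   # ID of a one-line file
```

You can compare the result with what `git hash-object {file}` prints for the same file. Since the ID depends only on the contents, identical files are stored once, and any corruption changes the hash.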

Setup:

ssh-keygen
cat {path of public ssh key (the .pub file)}
Add the public key to your GitHub settings. After this, you will be able to push to or pull from your central repo.

Common commands and their use:

git clone {url} {folder where to clone}
git init
git status
git remote add origin {git url}
git remote -v

git add -A
git commit -m "msg"
git commit -a -m "msg" stages all modified tracked files and then commits them (new, untracked files still need git add).
git checkout <first 8 characters of your commit hash id>

git log
git diff
check the difference between the working tree (your files in the project folder) and the staging area (index); use git diff HEAD to compare against the last commit in the local repository

git pull origin master
git push -u origin master

Common Branching commands:

git branch
git branch {brname} creates a new branch from your current commit, so it contains everything in the branch you are on (for example, master)

git checkout {brname} move to the branch specified
git checkout -b {brname}
This command will create a new branch and checkout the new branch at the same time.
git push -u origin {brname}
push branch to central repo. Be on the branch which you are pushing.

Few other useful commands:

git remote get-url origin
git remote set-url origin git@github.com:ankit1khare/Img-Cap.git
git reset HEAD – A little tricky. I’ll have to explain this in another post
git rm --cached {file}
Suppose you added a file to staging area. But now you want to remove it since it is not needed anymore but might be needed later on. Use this command to remove a file from index before commit. You have files indexed before commit but you want to remove one of them from index so that you don't commit it accidentally. Changes to the file remain intact. This doesn't apply to "untracked" files.
git merge {brname}
Suppose you want to merge your new branch with master. Then you must be on master branch and the branch name in above command must be your new branch. So, be on destination branch where merging is happening.
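git branch --merged
Lists the branches that have already been merged into the branch you are currently on.
git rebase master
You are on your new branch, and both it and master have commits the other doesn’t have. When you rebase the branch onto master, Git replays your branch’s commits on top of the tip of master and moves the head of your branch there. Master itself does not change until you merge the rebased branch into it, but once you do, master will have everything that was extra in the new branch and it will look as if you developed all of it linearly.
git branch -d {brname}
deletes the local branch
git push origin --delete {brname}
deletes the branch from the central repo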
git branch -a
display names of all branches with their locations (remote or local)

Getting started with Jekyll on Windows platform

2017-09-07T16:20:00+00:00

There are many blogs and videos on how to set up Jekyll on Windows, but it can be confusing and complex, especially for those who are unfamiliar with Linux and Jekyll. This is due to version incompatibility and the fact that Jekyll is not officially supported on the Windows platform. A lot of the relevant material is also more than a year old. When using templates and themes, it can be difficult to deal with gems and bundles, especially if you are not familiar with Ruby and Rails. This post aims to help you use this powerful static website development tool and set it up on Windows without any hassle.

To start, I recommend following this simple video tutorial: https://www.youtube.com/watch?v=BTX_uh_v99I

There are also many templates available that you can use to develop a beautiful website with powerful functionality. Check them out here: http://themes.jekyllrc.org/

Here are some tips for saving time during the setup process:

1. Follow all the steps carefully, and remember to restart PowerShell as an administrator every time it is mentioned.

2. If the bundle is not the latest version and you try to use existing templates (maybe from http://themes.jekyllrc.org/), you may get a “gem not found” error. In this case, force the bundle to switch to the latest version.

3. Here’s another useful video if you are having problems setting up templates: https://www.youtube.com/watch?v=bty7LHm14CA The most common problems that I have noticed are related to relative paths and the slash ‘/’.

4. It’s worth understanding the basic functionality of Jekyll and how it is used with GitHub, especially if you are a developer or even if you just want to create a basic portfolio: https://www.youtube.com/watch?v=SWVjQsvQocA

5. If your localhost page (served on port 4000 by default with Jekyll) is blank, try clearing the cache and restarting your computer before attempting to reinstall Ruby or gems.

6. If you encounter an error with the ‘jekyll serve’ command due to a version mismatch, try using ‘bundle exec jekyll serve’ instead.

I hope this helps!