Introduction to Data Science - Unit : 2 - Topic 4 : CASE STUDY: DISCERNING DIGITS FROM IMAGES


One common approach on the web to stopping automated programs from breaking into user accounts is the Captcha check: a picture of distorted text and numbers that a human user must decipher and enter into a form field before submitting the form to the web server.


A simple Captcha control can be used to prevent automated spam being sent through an online web form.

With the help of the Naïve Bayes classifier, a simple yet powerful algorithm for categorizing observations into classes that's explained in more detail in the sidebar, you can recognize digits in textual images. These images aren't unlike the Captcha checks many websites have in place to make sure you're not a computer trying to break into user accounts. Let's see how hard it is to teach a computer to recognize images of numbers.

Our research goal is to let a computer recognize images of numbers (step one of the data science process).

The data we’ll be working on is the MNIST data set, which is often used in the data science literature for teaching and benchmarking.

With the bit of theory in the sidebar, you’re ready to perform the modeling itself. Make sure to run all the upcoming code in the same scope because each piece requires the one before it. An IPython file can be downloaded for this chapter from the Manning download page of this book.

The MNIST images can be found in the data sets package of Scikit-learn and are already normalized for you (all scaled to the same 8x8 pixel size), so we won't need much data preparation (step three of the data science process). But let's first fetch our data, as step two of the data science process, with the following listing.

Step 2 of the data science process: fetching the digital image data
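A minimal sketch of this fetch step, using Scikit-learn's load_digits helper (the print statements are illustrative):

```python
# Step 2: fetch the handwritten-digit images bundled with Scikit-learn.
from sklearn import datasets

digits = datasets.load_digits()   # 1,797 labeled 8x8 grayscale images
print(digits.images.shape)        # (1797, 8, 8)
print(digits.target[:10])         # the labels: [0 1 2 3 4 5 6 7 8 9]
```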

The following code demonstrates this process and is step four of the data science process: data exploration.


Step 4 of the data science process: using Scikit-learn
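The exploration step can be sketched as follows, displaying one image so you can see what the raw data looks like (matplotlib's matshow is an assumed choice of plotting function here):

```python
# Step 4: look at one image to understand what the data looks like.
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
plt.matshow(digits.images[0], cmap=plt.cm.gray_r)  # a blurry "0"
plt.show()
print(digits.images[0])  # the same image as an 8x8 matrix of gray values
```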

 



Figure 3.4 shows how a blurry “0” image translates into a data matrix.

Figure 3.4 shows the actual code output, but perhaps figure 3.5 can clarify this slightly, because it shows how each element in the vector is a piece of the image.

Easy so far, isn't it? There is, naturally, a little more work to do. The Naïve Bayes classifier expects a flat list of values for each observation, but digits.images stores every image as a two-dimensional array (a matrix) reflecting the shape of the image. To flatten each matrix into a list of 64 values, we need to call reshape() on digits.images. The net result is a one-dimensional array per image that looks something like this:

array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0.,
        0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0.,
        0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0.,
        0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
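Under the hood this flattening is a single reshape call; a minimal sketch (the variable name data is illustrative):

```python
# Flatten each 8x8 image matrix into a 64-value row vector.
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))  # shape: (1797, 64)
print(data[0])  # the first "0" image as a flat array of 64 gray values
```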



The previous snippet shows the matrix of figure 3.5 flattened (the number of dimensions reduced from two to one). From this point on, it's a standard classification problem, which brings us to step five of the data science process: model building.

Now that we have a way to pass the contents of an image into the classifier, we need to pass it a training data set so it can start learning how to predict the numbers in the images. We mentioned earlier that Scikit-learn contains a subset of the MNIST database (1,797 images), so we'll use that. Each image is also labeled with the digit it actually shows. The classifier will build a probabilistic model in memory of the most likely digit shown in an image, given its grayscale values.

Once the program has gone through the training set and built the model, we can then pass it the test set of data to see how well it has learned to interpret the images using the model. The following listing shows how to implement these steps in code.
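A minimal sketch of this train-and-test step with Scikit-learn's Gaussian Naïve Bayes implementation (the two-thirds/one-third split and the random seed are illustrative choices, not prescribed by the text):

```python
# Step 5: train a Gaussian Naive Bayes classifier and test its predictions.
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Hold out a third of the images as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.33, random_state=42)

model = GaussianNB().fit(X_train, y_train)  # build the probabilistic model
predicted = model.predict(X_test)           # classify the unseen images
print("accuracy:", (predicted == y_test).mean())
```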

Unsupervised learning

It's generally true that most large data sets don't come with labels, so unless you sort through the data and label it yourself, the supervised learning approach won't work.

Many techniques exist for each of these unsupervised learning approaches. In the real world, however, you're always working toward the research goal defined in the first phase of the data science process, so you may need to combine or try several techniques before either the data set can be labeled (enabling supervised learning techniques) or the goal itself is achieved.
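To make the contrast concrete, here is a small sketch of an unsupervised approach applied to the same digit images: clustering them without ever looking at the labels. Choosing k-means with ten clusters (one per digit) is an assumption for illustration, not part of the original case study:

```python
# Unsupervised sketch: group the digit images without using their labels.
from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

# k=10 assumes one cluster per digit; the algorithm never sees digits.target.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])  # cluster ids, not digit labels
```

Inspecting which true digits land in each cluster is one way to label such a data set after the fact, bridging back to supervised learning.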
