Introduction to Data Science - Unit 2, Topic 4
CASE STUDY: DISCERNING DIGITS FROM IMAGES
A common approach on the web to stopping computers from
hacking into user accounts is the Captcha check: a picture of text and numbers
that the human user must decipher and enter into a form field before submitting
the form back to the web server.
A simple Captcha control can be used to prevent automated spam being sent through an online web form.
With the help of the Naïve
Bayes classifier, a simple yet powerful algorithm for categorizing
observations into classes (explained in more detail in the sidebar), you
can recognize digits from textual images. These images aren’t unlike the
Captcha checks many websites have in place to make sure you’re not a computer
trying to hack into the user accounts. Let’s see how hard it is to let a
computer recognize images of numbers.
Our research goal is
to let a computer recognize images of numbers (step one of the data science
process).
The data we’ll be
working on is the MNIST data set, which is often used in the data science
literature for teaching and benchmarking.
With the bit of theory in the sidebar, you’re ready to perform the
modeling itself. Make sure to run all the upcoming code in the same scope
because each piece requires the one before it. An IPython file can be
downloaded for this chapter from the Manning download page of this book.
The MNIST images can be found in the data sets package of Scikit-learn
and are already normalized for you (all scaled to the same size: 8x8 pixels),
so we won't need much data preparation (step three of the data science
process). But let's first fetch our data as step two of the data science
process, with the following listing.
Step 2 of
the data science process: fetching the digital image data
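The original listing is not reproduced here, but a minimal sketch of this fetching step, assuming Scikit-learn's bundled digits data set (the variable names are illustrative, not necessarily the book's), could look like this:

```python
# Step 2 sketch: fetch the digit images bundled with scikit-learn.
from sklearn import datasets

digits = datasets.load_digits()   # 1,797 labeled 8x8 grayscale digit images
print(digits.images.shape)        # (1797, 8, 8): images as 8x8 matrices
print(digits.target[:10])         # labels: the digit each image actually shows
```

Each entry of `digits.images` is one image, and `digits.target` holds its label.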
The following code
demonstrates step four of the data science process: data
exploration.
Step 4 of the data science process: using Scikit-learn
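As a hedged sketch of this exploration step (the book's listing uses the pylab interface, `pl`; matplotlib's `pyplot` is the equivalent assumed here):

```python
# Step 4 sketch: render the first digit image and inspect its data matrix.
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
plt.matshow(digits.images[0], cmap=plt.cm.gray_r)  # draw the image as a grid
plt.show()                                         # displays a blurry "0"
print(digits.images[0])   # the same image as an 8x8 matrix of gray values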
Figure 3.4 shows how a blurry “0” image translates into a data matrix.
Figure 3.4 shows the
actual code output, but perhaps figure 3.5 can clarify this slightly, because
it shows how each element in the vector is a piece of the image.
Easy so far, isn’t
it? There is, naturally, a little more work to do. The Naïve Bayes classifier
expects a flat list of values, but each image in digits.images is a two-dimensional
array (a matrix) reflecting the shape of the image. To flatten it into a list, we
need to call reshape() on digits.images. The net result will be a
one-dimensional array per image that looks something like this:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
         15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
          8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
          5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
          1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
          0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
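The flattening step itself might be sketched as follows (a minimal sketch; the variable names are assumptions, not the book's listing):

```python
# Flatten each 8x8 image matrix into one row of 64 values,
# the shape the classifier expects.
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape(n_samples, -1)  # (1797, 8, 8) -> (1797, 64)
print(data[0])  # the flattened first image, matching the array shown above
```
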
The previous code
snippet shows the matrix of figure 3.5 flattened (the number of dimensions was
reduced from two to one) to a Python list. From this point on, it’s a standard
classification problem, which brings us to step five of the data science
process: model building.
Now that we have a
way to pass the contents of an image into the classifier, we need to pass it a
training data set so it can start learning how to predict the numbers in the
images. We mentioned earlier that Scikit-learn contains a subset of the MNIST
database (1,797 images), so we'll use that. Each image is also labeled with the
number it actually shows. Training on these labeled images builds a probabilistic
model in memory of the most likely digit shown in an image given its grayscale values.
Once the program has
gone through the training set and built the model, we can then pass it the test
set of data to see how well it has learned to interpret the images
using the model. The
following listing shows how to implement these steps in code.
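A hedged sketch of these model-building and testing steps, using Scikit-learn's Gaussian Naïve Bayes implementation (details such as the train/test split are assumptions and may differ from the book's listing):

```python
# Step 5 sketch: train a Gaussian Naive Bayes classifier on part of the
# data, then test it on images the model has never seen.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()
data = digits.images.reshape(len(digits.images), -1)  # flatten to 64 values

# Hold out a third of the images to measure how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.33, random_state=42)

model = GaussianNB().fit(X_train, y_train)  # learn P(gray values | digit)
predictions = model.predict(X_test)         # classify the unseen images
print("accuracy:", (predictions == y_test).mean())
```

Comparing `predictions` against `y_test` shows how often the model picked the right digit.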
Unsupervised
learning
It’s generally true
that most large data sets don’t have labels on their data, so unless you sort
through it all and label it yourself, the supervised learning approach
won’t work.
Many techniques exist
for each of these unsupervised learning approaches. In the real
world, however, you’re always working toward the research goal defined in the first phase
of the data science process, so you may need to combine or try several
techniques before the data set can be labeled (enabling supervised
learning techniques) or the goal itself is achieved.