Introduction to Data Science - Unit 2 - Topic 3: MODELLING PROCESS FOR FEATURE ENGINEERING
The modelling phase consists of four steps:
1. Feature engineering and model selection
2. Training the model
3. Model validation and selection
4. Applying the trained model to unseen data
Before you find a good model, you'll probably iterate among the first three steps. The last step isn't always present, because sometimes the goal isn't prediction but explanation (root cause analysis). For instance, you might want to find out the causes of species' extinctions but not necessarily predict which species is next in line to leave our planet.
It's possible to chain or combine multiple techniques. When you chain multiple models, the output of the first model becomes an input for the second model. When you combine multiple models, you train them independently and combine their results. This last technique is also known as ensemble learning, sketched below.
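As a minimal sketch of ensemble learning, assuming scikit-learn as the tooling (a reasonable choice since the text mentions Python, not something the text prescribes), two models are trained independently on a synthetic placeholder data set and their predictions are combined by majority vote:

# Ensemble learning sketch: two independently trained models,
# results combined by majority ("hard") voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=3)),
])
ensemble.fit(X, y)               # each model is trained independently
print(ensemble.predict(X[:5]))   # predictions combined by majority vote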
FEATURE ENGINEERING AND MODEL SELECTION
When engineering features, you must come up with and create possible predictors for the model. This is one of the most important steps in the process, because a model recombines these features to arrive at its predictions. Often you may need to consult an expert or the appropriate literature to come up with meaningful features.
Certain features are the variables you get from a data set, as is the case with the data sets provided in our exercises and in most school exercises. In practice you'll need to find the features yourself, and they may be scattered among different data sets. In several projects we had to bring together more than 20 different data sources before we had the raw data we required. Often you'll need to apply a transformation to an input before it becomes a good predictor, or to combine multiple inputs. An example of combining multiple inputs is an interaction variable: the impact of either single variable is low, but if both are present their impact becomes immense. This is especially true in chemical and medical environments.
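A hypothetical illustration of an interaction variable, with pandas: the two drug-dose columns below are invented for the example, and the new predictor is simply their product, which only becomes nonzero when both doses are present.

# Interaction variable sketch: a new feature built from two inputs.
import pandas as pd

df = pd.DataFrame({
    "drug_a_dose": [0.0, 1.0, 0.0, 1.0],
    "drug_b_dose": [0.0, 0.0, 1.0, 1.0],
})
# The interaction feature is the product of the two inputs:
# it is nonzero only when both drugs are administered together.
df["dose_interaction"] = df["drug_a_dose"] * df["drug_b_dose"]
print(df)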
When the initial features are created, a model can be fitted to the data.
TRAINING YOUR MODEL
With the right predictors in place and a modeling technique in mind, you can progress to model training. In this phase you present your model with data from which it can learn.
The most common modeling techniques have industry-ready implementations in almost every programming language, including Python. These enable you to train your models by executing a few lines of code, as the sketch below shows. For more state-of-the-art data science techniques, you'll probably end up doing heavy mathematical calculations and implementing them with modern computer science techniques.
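To show how little code the common case takes, here is a sketch using scikit-learn; the synthetic regression data stands in for the features you engineered earlier.

# Training a model really is a few lines once the features exist.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X_train, y_train = make_regression(n_samples=100, n_features=3, random_state=0)
model = LinearRegression().fit(X_train, y_train)  # the learning step
print(model.coef_)  # coefficients estimated from the training data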
Once a model is trained, it's time to test whether it can be extrapolated to reality: model validation.
VALIDATION AND PREDICTION
Data science has many modeling techniques, and the question is which one is the right one to use. A good model has two properties: it has good predictive power and it generalizes well to data it hasn't seen. To achieve this you define an error measure (how wrong the model is) and a validation strategy. Two common error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems. The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better. The mean squared error is the average of the squared differences between the predicted and actual values; again, lower is better.
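Both error measures can be computed with scikit-learn; the labels and values below are made up purely for illustration.

# Computing the two error measures mentioned above.
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification error rate: fraction of mislabeled observations.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
error_rate = 1 - accuracy_score(y_true, y_pred)   # 1 of 5 wrong -> 0.2

# Mean squared error: average squared difference from the truth.
mse = mean_squared_error([2.0, 3.5, 4.0], [2.5, 3.0, 4.5])  # -> 0.25
print(error_rate, mse)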
Many validation strategies exist, including the following common ones (a sketch of the first two follows the list):
• Dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that's never used for model creation). This is the most common technique.
• K-folds cross validation. This strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.
• Leave-one out. This approach is the same as k-folds, but with k equal to the number of observations: you always leave exactly one observation out and train on the rest of the data. This is used only on small data sets, so it's more valuable to people evaluating laboratory experiments than to big data analysts.
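The holdout and k-folds strategies are sketched below with scikit-learn; the 80/20 split and k=5 are common choices rather than fixed rules, and the data set is synthetic.

# Holdout: 20% of the observations are never used for model creation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# K-folds (k=5): each part serves exactly once as the test data set,
# so all the available data gets used.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)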
Once you've constructed a good model, you can (optionally) use it to predict the future.
PREDICTING NEW OBSERVATIONS
If you've implemented the first three steps successfully, you now have a performant model that generalizes to unseen data. The process of applying your model to new data is called model scoring. In fact, model scoring is something you implicitly did during validation; only now you don't know the correct outcome. By now you should trust your model enough to use it for real.
Model scoring involves two steps. First, you prepare a data set that has features exactly as defined by your model. This boils down to repeating the data preparation you did in step one of the modeling process, but for the new data set. Then you apply the model to this new data set, and this results in a prediction, as in the sketch below.
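One way to sketch this with scikit-learn is a pipeline: the scaler repeats the data preparation (step one) and the trained model produces the prediction (step two). All data here is synthetic and stands in for your own observations.

# Model scoring: prepare the new data, then apply the trained model.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
scorer = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# New, unseen observations with exactly the same three features.
X_new, _ = make_regression(n_samples=5, n_features=3, random_state=1)
predictions = scorer.predict(X_new)  # preparation + prediction in one call
print(predictions)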