Introduction to Data Science - Unit : 2 - Topic 3 : MODELLING PROCESS FOR FEATURE ENGINEERING


MODELLING PROCESS FOR FEATURE ENGINEERING

The modelling phase consists of four steps:

1.     Feature engineering and model selection

2.     Training the model

3.     Model validation and selection

4.     Applying the trained model to unseen data

Before you find a good model, you’ll probably iterate among the first three steps. The last step isn’t always present because sometimes the goal isn’t prediction but explanation (root cause analysis). For instance, you might want to find out the causes of species’ extinctions but not necessarily predict which one is next in line to leave our planet.

It’s possible to chain or combine multiple techniques. When you chain multiple models, the output of the first model becomes an input for the second model. When you combine multiple models, you train them independently and combine their results. This last technique is also known as ensemble learning.
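To make the combining idea concrete, here is a minimal sketch of ensemble learning in Python, assuming scikit-learn is available; the toy data and the two model choices are illustrative assumptions, not part of the text above.

```python
# Minimal sketch: combining two independently trained models (ensemble learning)
# with scikit-learn's VotingClassifier. The toy data is illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each model is trained independently; their votes are combined at prediction time.
ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=1000)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```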

 

FEATURE ENGINEERING AND MODEL SELECTION

With feature engineering, you come up with and create the possible predictors for the model. This is one of the most important steps in the process because a model recombines these features to achieve its predictions. Often you may need to consult an expert or the appropriate literature to come up with meaningful features.

Sometimes the features are variables you get directly from a data set, as is the case with the data sets provided in our exercises and in most school exercises. In practice you’ll need to find the features yourself, and they may be scattered among different data sets. In several projects we had to bring together more than 20 different data sources before we had the raw data we required. Often you’ll need to apply a transformation to an input before it becomes a good predictor, or to combine multiple inputs. An example of combining multiple inputs is an interaction variable: the impact of either single variable is low, but if both are present their impact becomes immense. This is especially true in chemical and medical environments.
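As an illustration of such a combined input, here is a minimal sketch of creating an interaction variable with pandas; the column names and values are made up for this example.

```python
# Minimal sketch: building an interaction feature from two raw inputs.
# The column names (dose_a, dose_b) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "dose_a": [0.0, 1.0, 0.0, 1.0],
    "dose_b": [0.0, 0.0, 1.0, 1.0],
})

# The interaction term is only "active" when both inputs are present,
# capturing an effect neither variable shows on its own.
df["dose_a_x_dose_b"] = df["dose_a"] * df["dose_b"]
print(df)
```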

When the initial features are created, a model can be trained on the data.

Training your model

With the right predictors in place and a modeling technique in mind, you can progress to model training. In this phase you present your model with data from which it can learn.

The most common modeling techniques have industry-ready implementations in almost every programming language, including Python. These enable you to train your models by executing a few lines of code. For more state-of-the-art data science techniques, you’ll probably end up doing heavy mathematical calculations and implementing them with modern computer science techniques.
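As a minimal sketch of such an industry-ready implementation, the following few lines train a linear regression with scikit-learn on a generated toy data set; the specific model and data are illustrative assumptions.

```python
# Minimal sketch: training a model with an off-the-shelf implementation
# (here scikit-learn's LinearRegression on a toy regression data set).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

model = LinearRegression()
model.fit(X, y)          # the model "learns" its coefficients from the data
print(model.coef_)
```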

Once a model is trained, it’s time to test whether it can be extrapolated to reality: model validation.

VALIDATION AND PREDICTION

Data science has many modeling techniques, and the question is which one is the right one to use. A good model has two properties: it has good predictive power and it generalizes well to data it hasn’t seen. To achieve this you define an error measure (how wrong the model is) and a validation strategy. Two common error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems. The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better.
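As a small illustration, the sketch below computes both error measures with scikit-learn on made-up labels and values.

```python
# Minimal sketch: the two error measures mentioned above, computed with scikit-learn.
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification error rate = fraction of mislabeled observations (lower is better).
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
print(1 - accuracy_score(y_true_cls, y_pred_cls))   # 0.2 -> 20% mislabeled

# Mean squared error for a regression problem.
y_true_reg = [2.5, 0.0, 2.0, 8.0]
y_pred_reg = [3.0, -0.5, 2.0, 7.0]
print(mean_squared_error(y_true_reg, y_pred_reg))
```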

Many validation strategies exist, including the following common ones (a short code sketch follows the list):

•  Holdout: dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that’s never used for model creation). This is the most common technique.

•  K-folds cross validation: this strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.

•  Leave-one-out: this approach is the same as k-folds, but with k equal to the number of observations, so each test fold contains a single observation. It’s used only on small data sets, so it’s more valuable to people evaluating laboratory experiments than to big data analysts.
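The following minimal sketch, assuming scikit-learn and a toy data set, sets up each of the three strategies; only the splitting logic is shown, and the training itself is omitted.

```python
# Minimal sketch of the three validation strategies on a toy data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X, y = make_classification(n_samples=30, n_features=4, random_state=0)

# 1. Holdout: e.g. 80% for training, 20% kept aside and never used for model creation.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. K-folds cross validation: every observation is used for testing exactly once.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    pass  # train on X[train_idx], test on X[test_idx]

# 3. Leave-one-out: each test fold contains a single observation.
for train_idx, test_idx in LeaveOneOut().split(X):
    pass  # test_idx holds exactly one observation
```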

Once you’ve constructed a good model, you can (optionally) use it to predict the future.

Predicting new observations

If you’ve implemented the first three steps successfully, you now have a performant model that generalizes to unseen data. The process of applying your model to new data is called model scoring. In fact, model scoring is something you implicitly did during validation, only now you don’t know the correct outcome. By now you should trust your model enough to use it for real.

Model scoring involves two steps. First, you prepare a data set that has features exactly as defined by your model. This boils down to repeating the data preparation you did in step one of the modeling process, but for a new data set. Then you apply the model to this new data set, and this results in a prediction.
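A minimal sketch of these two steps, assuming a scikit-learn model and made-up feature names and values, could look like this:

```python
# Minimal sketch: scoring new observations with a trained model.
# Feature names and values are hypothetical; the model is a toy example.
import pandas as pd
from sklearn.linear_model import LinearRegression

# A model trained earlier in the process (toy training data for completeness).
train = pd.DataFrame({"feature_1": [0.0, 1.0, 2.0, 3.0],
                      "feature_2": [1.0, 0.5, 0.2, 0.1]})
target = [1.0, 2.1, 2.9, 4.2]
model = LinearRegression().fit(train, target)

# Step 1: prepare the new data set with exactly the features the model expects.
new_data = pd.DataFrame({"feature_1": [4.0, 5.0],
                         "feature_2": [0.05, 0.02]})

# Step 2: apply the model; the result is a prediction for each new observation.
print(model.predict(new_data))
```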
