What Is Predictive Modeling?
Before answering the question ‘what is predictive modeling’, it helps to understand the basic uses of data. The three most common uses are analytics, monitoring, and prediction. Data analytics uses past data to infer which event occurred, when it occurred, and why. Real-time data monitoring provides information and insights for live event tracking and detection, so that proactive action can be taken. Predictive data modeling (also referred to as predictive analytics or predictive analysis) analyzes past data to predict an outcome, either a future outcome or an unknown past outcome. This article explains what predictive modeling is, how it works, and which techniques are commonly used.
What Is Predictive Modeling?
Predictive modeling uses data and statistics to predict an outcome. The outcome is most often in the future, for example the result of a sports game predicted from the past statistics of the teams involved. It can also be the unknown outcome of an event that has already occurred, for example identifying a crime suspect from existing data.
Predictive analytics has been adopted by a wide range of organizations. According to a 2017 report issued by Zion Market Research, the global market was projected to reach approximately $10.95 billion by 2022, growing at a compound annual growth rate (CAGR) of around 21 per cent between 2016 and 2022.
How Does Predictive Modeling Work?
A predictive modeling program is built from algorithms (classification algorithms, clustering algorithms, regression algorithms, etc.), which in turn rely on statistical models. Before looking at the algorithms, let’s look at what statistical models are.
A statistical model is a set of assumptions about how the data was generated. Fitted to sample data drawn from a larger data source, it lets us estimate the probability of an event, and that estimate is the prediction we are looking for. There are two main types of statistical models (and hence two main classes of predictive models):
Parametric Statistical Models
Non-parametric Statistical Models
Parametric statistical models have a finite, well-defined set of parameters. A simple example of this class is the normal distribution (also known as the bell curve), which is fully described by two parameters: its mean and its standard deviation.
Because the normal distribution curve is fixed by a finite set of parameters, using it as a prediction model only requires knowing their values; the probability that an observation falls within a given range can then be read off the curve.
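As a minimal sketch of how a parametric model yields a prediction, the snippet below uses Python's standard-library `statistics.NormalDist`. The mean of 70 and standard deviation of 10 are hypothetical values chosen only for illustration (imagine test scores); they do not come from any real data set.

```python
from statistics import NormalDist

# Hypothetical parameters (an assumption for illustration):
# test scores with mean 70 and standard deviation 10.
scores = NormalDist(mu=70, sigma=10)

# Probability that a score is at or below 80,
# i.e. one standard deviation above the mean.
p_below_80 = scores.cdf(80)

# Probability that a score falls between 60 and 80,
# i.e. within one standard deviation of the mean.
p_60_to_80 = scores.cdf(80) - scores.cdf(60)

print(f"P(score <= 80)       = {p_below_80:.4f}")
print(f"P(60 <= score <= 80) = {p_60_to_80:.4f}")
```

With these two parameters alone, the model gives roughly 0.84 for the first probability and roughly 0.68 for the second. No raw data is needed at prediction time, which is the defining convenience of a parametric model.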
Non-parametric statistical models do not assume that the data follows a distribution described by a fixed set of parameters; instead, the shape of the model is derived from the data itself. They are used when assumptions such as normality do not hold. For example, if you record the time it takes different patients to recover from a cold while taking a particular drug, the data may not be normally distributed (for instance, a few patients may take much longer than most, and the same patient may respond differently when re-tested), so a model that assumes a bell curve could fit poorly.
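For contrast, a non-parametric estimate can be sketched by working directly from the observations, with no distribution assumed. The recovery times below are made-up numbers used purely for illustration, and `empirical_probability` is a hypothetical helper name, not a library function.

```python
# Made-up recovery times in days (illustrative only, not real patient data).
recovery_days = [3, 4, 4, 5, 5, 5, 6, 7, 9, 14]


def empirical_probability(samples, threshold):
    """Return the fraction of observed samples that are <= threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


# Estimate the chance of recovering within 5 days straight from the data.
p_within_5_days = empirical_probability(recovery_days, 5)
print(f"Estimated P(recovery <= 5 days) = {p_within_5_days:.2f}")
```

Here the estimate depends on every observation rather than on a mean and a standard deviation, so the long-recovery outlier (14 days) does not distort an assumed curve.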
The type of statistical model forms the basis for the prediction model, and that depends on the type of data. Different predictive modelling algorithms have been developed, and the right algorithm depends on the data and the desired outcome. Let’s take a look at some predictive model types.
Predictive Modeling Techniques
Classification Predictive Modeling
This is the first of five predictive modeling techniques explored in this article. The classification model analyzes existing historical data to sort, or ‘classify’, data into categories. It is best suited to ‘yes’ or ‘no’ questions, such as whether a potential lead will convert or whether a loan will be approved. The following are some important characteristics of a classification model:
A classification model is built from data that can be classified into two (or more) categories.
The input variables can be real-valued or discrete.
A classification problem with two classes is called a two-class or binary classification problem, and one with more than two classes is called a multi-class classification problem.
A classification problem where a single input can be assigned more than one class is called a multi-label classification problem.
Most often, a classification predictive model predicts a class (‘yes’, an email belongs in spam, or ‘no’, it does not) rather than a continuous value (like a real number). It can, however, be set up to predict a continuous value that represents the probability of a class, which is then mapped to an output class. For example, the threat level of an email can be scored from 0.1 to 0.9 (in steps of 0.1), with 0.1 implying the email is unlikely to be spam and 0.9 implying it is very likely spam. The model’s output in this case is a number between 0.1 and 0.9, which is translated into ‘spam’ or ‘not spam’ according to whether it lies closer to the lower or the upper end of the range. Classification algorithms can be built in many different tools and are a common example of predictive modeling in R.
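The mapping from a score to a class can be sketched in a few lines of Python. This is an illustration only: the function name `classify_email` and the cut-off of 0.5 (the midpoint between the 0.1 and 0.9 ends of the scale) are assumptions, and a real system would choose its threshold based on the cost of false positives and false negatives.

```python
def classify_email(spam_score, threshold=0.5):
    """Map a spam score in [0, 1] to a class label.

    The 0.5 default threshold is an assumed midpoint, not a recommended value.
    """
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must be between 0 and 1")
    return "spam" if spam_score >= threshold else "not spam"


# Hypothetical scores that a classification model might produce.
scores = [0.1, 0.3, 0.7, 0.9]
labels = [classify_email(s) for s in scores]

for score, label in zip(scores, labels):
    print(f"score={score:.1f} -> {label}")
```

Keeping the threshold as a parameter makes the trade-off explicit: raising it marks fewer emails as spam, lowering it marks more.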
Random Forest Predictive Modeling
Random forest is one of the most commonly used predictive modeling techniques. It can be used for both classification and regression. The model gets its name because the algorithm combines many decision trees. Each individual tree depends on the values of a random vector that is sampled independently, and each tree is grown as far as possible. The following is an example of a single decision tree (not a full random forest):