Machine learning has been the biggest buzzword this past year. In addition to a large influx of roles that require the skill, it has some of the best use cases across a wide range of industries, from healthcare to competitive gaming.
The only problem is that most aspiring practitioners learn to rely on scikit-learn as a plug-and-play option to train and evaluate their models, without understanding the theoretical and mathematical concepts behind the subject.
This guide aims to close that gap by giving you a solid foundation in machine learning.
What exactly is machine learning?
Simply put, machine learning is a set of methods that can detect patterns in data and use those patterns to make predictions about new data.
There are three broad types of methods in Machine Learning:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
In supervised learning we want to predict the outcome Y using a set of training inputs X. The outcome Y is usually labelled beforehand. In mathematical terms we can represent this as:
D = {(Xi, Yi)}, for i = 1, ..., N
Where D is the training set the model learns from, Xi is the i-th set of input values, Yi is the corresponding labelled output and N is the number of training examples.
Practically speaking, Xi is the set of features of one example in the dataset (Length, Width, Height) and Yi is the output we want to predict (Male or Female).
In the example above we have considered Yi to be a categorical variable with two outcomes (‘Male’ or ‘Female’). Such a machine learning problem is called a classification problem. Yi can also be a number, in which case the machine learning problem is called a regression problem.
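To make the classification setting concrete, here is a minimal sketch of a supervised learner written without any library. It assumes a nearest-centroid rule, which is just one simple choice picked for illustration, and the measurements in X are invented toy values rather than real data.

```python
from math import dist

# Hypothetical toy data: each row of X is (length, width, height); Y holds the labels.
X = [(5.1, 3.5, 1.4), (4.9, 3.0, 1.4), (6.7, 3.1, 4.7), (6.3, 2.9, 4.5)]
Y = ["Female", "Female", "Male", "Male"]


def fit_centroids(X, Y):
    """Training step: compute the mean feature vector (centroid) of each class."""
    groups = {}
    for xi, yi in zip(X, Y):
        groups.setdefault(yi, []).append(xi)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }


def predict(centroids, x):
    """Prediction step: return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


centroids = fit_centroids(X, Y)
prediction = predict(centroids, (5.0, 3.2, 1.5))
print(prediction)
```

The labelled pairs (Xi, Yi) are used only in fit_centroids; predict then works on an input the model has never seen, which is the whole point of supervised learning.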
In unsupervised learning we have a set of input data and we try to find patterns within it.
The difference between supervised and unsupervised learning is that the former has labels or outcomes predefined for us while the latter does not.
In mathematical terms we can represent unsupervised learning as follows:
D = {Xi}, for i = 1, ..., N
Where D is the dataset the model learns from and Xi is the i-th set of input values from which we want to extract a pattern.
A typical example of unsupervised learning is clustering, in which we create groups from a set of input data. This is illustrated in the figure below, in which we use an unsupervised learning algorithm to extract 4 groups from our input data:
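As a rough sketch of how clustering can work, the code below implements a bare-bones k-means. This is an assumption on my part: the article does not say which clustering algorithm produced the figure. The points are invented, only 2 groups are used to keep the example short, and the first k points are taken as the starting centroids, which is a simplification.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Minimal k-means: returns (centroids, cluster index of each point)."""
    centroids = list(points[:k])  # simplification: first k points as initial centroids
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return centroids, assignments


# Hypothetical 2-D inputs with no labels attached.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Note that no Yi appears anywhere: the groups come purely from the structure of the inputs.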
In reinforcement learning, the model is given a set of possible actions and must learn, from the rewards or penalties those actions produce, which of them are the correct ones to execute. Mathematically reinforcement learning is represented as follows:
Where A is the set of actions given to us and Ai are the actions the model learns to execute.
Parametric Vs. Non-Parametric:
The distinction between parametric and non-parametric models is a fundamental one in machine learning.
If the machine learning technique makes assumptions about the structure of the data (e.g. there are 4 clusters) then it is parametric; otherwise it is non-parametric.
Parametric models are computationally simpler, but their assumptions can lead to inaccuracy when the data does not match them. Non-parametric models are more flexible.
When we build models that fit the training data perfectly, the result is overfitting: the model has very low error on the data it was trained on but performs terribly on new data. A typical example of overfitting is illustrated below:
The model built using the green line is an overfit model while the model built using the red line will generalize better on new data that it sees.
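The same effect can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the models from the figure: the data is invented (roughly y = 2x with noise on the training set), the overfit model is a "memorizer" that replays the label of the nearest training point, and the simpler model is a least-squares straight line.

```python
def mse(predictions, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def fit_memorizer(xs, ys):
    """Overfit model: predicts the label of the closest training input."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


# Hypothetical data: training labels are noisy, test labels follow y = 2x.
train_x, train_y = [0, 1, 2, 3, 4, 5], [1, 1, 5, 5, 9, 9]
test_x, test_y = [0.4, 1.4, 2.4, 3.4, 4.4], [0.8, 2.8, 4.8, 6.8, 8.8]

slope, intercept = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
memo_train = mse([memorizer(x) for x in train_x], train_y)
memo_test = mse([memorizer(x) for x in test_x], test_y)

print("memorizer train/test error:", memo_train, memo_test)
print("line      train/test error:", line_train, line_test)
```

The memorizer reaches zero error on the training set because it has fitted the noise, yet on this toy data its error on the unseen points is far larger than that of the straight line.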
Performance measurement is done to determine whether our model has learnt well.
In supervised learning we want to know whether we have predicted the outcomes correctly given a set of input values. We can measure this using the misclassification rate, which is the proportion of incorrectly classified examples, or its complement, the proportion of correctly classified examples.
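Both quantities take only a few lines to compute. The following is a small sketch; the function name and the label lists are made up for the example.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the actual labels."""
    errors = sum(1 for actual, predicted in zip(y_true, y_pred) if actual != predicted)
    return errors / len(y_true)


# Hypothetical actual and predicted labels for five examples.
y_true = ["Male", "Female", "Female", "Male", "Female"]
y_pred = ["Male", "Female", "Male", "Male", "Female"]

error_rate = misclassification_rate(y_true, y_pred)  # one mistake out of five
correct_rate = 1 - error_rate                        # proportion classified correctly
print(error_rate, correct_rate)
```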
Another tool we can use to assess performance is the learning curve. The learning curve is defined as the percentage of correct predictions on the test set as a function of the training set size.
A typical example of a learning curve is illustrated below:
We can see from the learning curve above that as the size of the training set increases, the proportion of correct predictions on the test set approaches 1, or 100%.
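A learning curve can be computed by retraining on progressively larger portions of the training data and scoring each model on the same test set. The sketch below assumes a 1-nearest-neighbour classifier on invented one-dimensional data (label 1 when x is at least 5) and simply takes the first n training examples; in practice the subsets would usually be drawn at random.

```python
def nearest_neighbour_predict(train, x):
    """Return the label of the training example whose input is closest to x."""
    _, label = min(train, key=lambda example: abs(example[0] - x))
    return label


def learning_curve(pool, test_set, sizes):
    """Proportion of correct test predictions for each training set size."""
    curve = []
    for n in sizes:
        train = pool[:n]  # simplification: use the first n examples
        correct = sum(
            1 for x, y in test_set if nearest_neighbour_predict(train, x) == y
        )
        curve.append(correct / len(test_set))
    return curve


# Hypothetical (input, label) pairs.
pool = [(1.0, 0), (8.0, 1), (3.0, 0), (6.5, 1), (4.4, 0), (5.6, 1), (4.8, 0), (5.1, 1)]
test_set = [(0.5, 0), (2.0, 0), (4.0, 0), (4.6, 0), (4.7, 0), (4.9, 0),
            (5.3, 1), (6.0, 1), (7.0, 1), (9.0, 1)]

curve = learning_curve(pool, test_set, sizes=[2, 4, 6, 8])
print(curve)
```

On this toy data the proportion of correct predictions rises with the training set size until it reaches 1.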
While creating your models you may encounter different types of learning curves as illustrated by the figure below:
The curve you want to see is the ‘realizable’ curve. The ‘redundant’ curve is caused by a large number of irrelevant features in the training set. The ‘non realizable’ curve is caused by missing attributes.
Typically we use 20% of the data we have as the test set and 80% as the training set. The disadvantage of this approach is that we are not using all the data at our disposal to train the model.
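An 80/20 split can be sketched with the standard library alone. The helper name split_train_test and the fixed seed are choices made for this example; shuffling first is a common precaution so that the test set is not just the tail of an ordered file.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and hold out a fraction of it as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))  # stand-in for 10 rows of a dataset
train_set, test_set = split_train_test(examples)
print(len(train_set), len(test_set))
```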
One solution to the problem above is called K-fold cross-validation. This is illustrated for you in the figure below:
The dataset is divided into 5 parts. In each of the five runs above a different part (the dark cell) is used as the test set, while the remaining parts (the white cells) are used as the training set.
In this way every example is used for both training and testing across the runs. The above procedure is called 5-fold cross-validation. We compute the score for each of the 5 runs and take the average of all the scores. This is a better estimate of the error rate than a single score.
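The procedure can be sketched as follows. The helper names, the toy labels and the deliberately trivial "model" (always predict the most common training label) are assumptions made to keep the example short; any fit and score functions with the same shape could be plugged in, and the data would normally be shuffled before being cut into folds.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k runs."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(X, Y, fit, score, k=5):
    """Train and score once per fold, then return (mean score, per-fold scores)."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [Y[i] for i in train])
        scores.append(score(model, [X[i] for i in test], [Y[i] for i in test]))
    return sum(scores) / len(scores), scores


def fit_majority(train_x, train_y):
    """Toy model: remember the most common training label."""
    return max(set(train_y), key=train_y.count)


def accuracy_of_constant(model, test_x, test_y):
    """Proportion of test labels equal to the constant prediction."""
    return sum(1 for y in test_y if y == model) / len(test_y)


# Hypothetical dataset of 10 examples; the inputs are unused by the toy model.
X = list(range(10))
Y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

mean_score, fold_scores = cross_validate(X, Y, fit_majority, accuracy_of_constant, k=5)
print(fold_scores, mean_score)
```

Each index lands in the test set exactly once, so every example contributes to both training and evaluation.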
Some of the most common performance metrics are:
- Accuracy
- Confusion Matrix
- Precision
- Recall
- F1 score
- ROC curve
1. Accuracy:
Accuracy is defined as the proportion of correct responses:
Accuracy = (number of correct predictions) / (total number of predictions)
A higher accuracy score is generally what we want from our models.
2. Confusion Matrix:
A confusion matrix is used to evaluate classification models by cross-tabulating the actual labels against the predicted labels, as in the table illustrated below:
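For a two-class problem the matrix has four cells: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The sketch below counts them from two label lists; the function name, the "yes"/"no" labels and the data are invented for the example.

```python
def confusion_counts(y_true, y_pred, positive="yes"):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical actual and predicted labels for eight examples.
y_true = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
y_pred = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]

counts = confusion_counts(y_true, y_pred)
print(counts)
```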
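3. Precision:
To obtain the positive predictive power of our predictions we use the precision metric. In simple terms it is the fraction of ‘yes’ predictions that are correct. It is given by the formula below:
Precision = TP / (TP + FP)
4. Recall:
Recall is also known as the true positive rate or sensitivity. It is the fraction of actual ‘yes’ cases that we predicted correctly, and is given by the formula below:
Recall = TP / (TP + FN)
Here TP is the number of true positives, FP the number of false positives and FN the number of false negatives.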
5. F1 score:
The F1 score is the harmonic mean of precision and recall, so it weights the two equally. It is given by the formula shown below:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
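Precision, recall and the F1 score can all be computed from the confusion matrix counts. The sketch below uses made-up counts (8 true positives, 2 false positives, 4 false negatives) and returns 0 when a denominator is zero, which is one common convention rather than the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts taken from a confusion matrix.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is preferred over a plain average of the two.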
6. ROC curve:
The ROC curve is a plot of the True Positive rate against the False Positive rate as the classification threshold is varied, and is illustrated below:
The area under the curve gives the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one, while the dotted line is the performance of a random classifier on average.
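The points of a ROC curve can be obtained by sorting the examples by predicted score and lowering the threshold one example at a time. The sketch below assumes labels coded as 1 (positive) and 0 (negative), distinct scores, and invented data; the area is estimated with the trapezoid rule and then cross-checked by counting how often a positive example outscores a negative one.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical labels and predicted scores for six examples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)

# Cross-check: fraction of (positive, negative) pairs ranked in the right order.
positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]
pairwise = sum(p > n for p in positives for n in negatives) / (
    len(positives) * len(negatives)
)
print(round(auc, 4), round(pairwise, 4))
```

A random classifier would trace the diagonal from (0, 0) to (1, 1), giving an area of 0.5.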
This guide covers the fundamentals of Machine Learning for anyone interested in taking their first steps into the subject.
Happy Machine Learning!