In this post, I will walk you through one of the most common supervised machine learning algorithms: the Support Vector Machine.

A support vector machine (SVM) is a supervised machine learning algorithm used for both classification and regression problems. The variant used for classification is called the “Support Vector Classifier” and the variant used for regression is called the “Support Vector Regressor”. The goal of the SVM algorithm is to find the decision boundary that best separates n-dimensional space into classes, so that new data points can be assigned to the correct category. This best decision boundary is called a hyperplane. SVM picks the extreme points (vectors) that define the hyperplane. These extreme cases are called support vectors, hence the name Support Vector Machine.

##### Support Vectors

The data points (vectors) that lie closest to the hyperplane and determine its position are called support vectors. They are called this because they “support” the hyperplane: moving or removing one of them can change the boundary, whereas the remaining points have no effect on it.
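To make this concrete, here is a minimal sketch on a hypothetical six-point toy dataset (`X_toy` and `y_toy` are made up for illustration and are not part of the letters data). After fitting, scikit-learn exposes the points it kept through the `support_vectors_` attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of three points each
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1000)
clf.fit(X_toy, y_toy)

# Only the points nearest the boundary are stored as support vectors
print(clf.support_vectors_)
print(clf.n_support_)  # number of support vectors per class
```

The points far from the boundary do not appear in `support_vectors_`, which is why they could be dropped without changing the fitted hyperplane.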

##### Hyperplane

In n-dimensional space, many lines or decision boundaries can separate the classes, but we need the one that classifies the data points best, that is, the one with the largest margin to the nearest points of each class. This best boundary is known as the hyperplane of the SVM.

The distance between the observations and the threshold is called the margin. When we allow some observations to fall inside the margin or to be misclassified, it is called a soft margin. We use cross-validation to determine how many misclassifications and observations to allow inside the soft margin to get the best classification.
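In scikit-learn the softness of the margin is controlled by the `C` parameter of `SVC`. The sketch below uses hypothetical overlapping data from `make_blobs` (the centers and spread are arbitrary choices for illustration) to show how `C` changes the number of support vectors, and how cross-validation can compare the settings.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical overlapping two-class data, generated only for illustration
X_blob, y_blob = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                            cluster_std=1.5, random_state=0)

n_support = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_blob, y_blob)
    # A small C tolerates more margin violations, so the margin is wider
    # and more points typically end up as support vectors
    n_support[C] = int(clf.n_support_.sum())
    cv_acc = cross_val_score(SVC(kernel="linear", C=C), X_blob, y_blob, cv=5).mean()
    print(f"C={C}: support vectors={n_support[C]}, CV accuracy={cv_acc:.3f}")
```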

##### SVM can be of two types:

- Linear SVM – Linear SVM is used for linearly separable data, that is, a dataset that can be classified into two classes by using a single straight line. Such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
- Non-linear SVM – Non-linear SVM is used for non-linearly separable data, that is, a dataset that cannot be classified by using a straight line. Such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
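The difference between the two types shows up clearly on data that no straight line can split. The following sketch uses scikit-learn's `make_circles` generator as a stand-in dataset (an assumption for illustration, unrelated to the letters data) and compares a linear kernel with an RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.3, random_state=0)

acc_linear = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
acc_rbf = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear kernel accuracy: {acc_linear:.3f}")
print(f"rbf kernel accuracy:    {acc_rbf:.3f}")
```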

##### SVM Kernels and Kernel functions

In SVM, kernels are used to handle data that is not linearly separable in its original, low-dimensional space. A kernel function computes the similarity of two points as if they had been mapped into a higher-dimensional space, without ever computing that mapping explicitly. This technique is known as the kernel trick. If the data is not linearly separable, you can use the kernel trick to make it work: a problem that is non-separable in the original space can become separable once more dimensions are added.
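A small worked example may help here. For 2-D inputs, the degree-2 polynomial kernel (x · z)² gives the same number as an inner product in a 3-D feature space. The helper names `phi` and `poly2_kernel` below are hypothetical and exist only for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the 3-D feature space
via_kernel = poly2_kernel(x, z)    # same value, computed in the original 2-D space
print(explicit, via_kernel)
```

The kernel never builds the 3-D vectors, which is what keeps the approach cheap when the implied feature space is very large.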

The function of the kernel is to take data as input and transform it into the required form. The kernel functions commonly used for SVM models are linear, polynomial, radial basis function (RBF), and sigmoid. The most used type of kernel function is RBF.
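Since RBF is the kernel used most often, it may be useful to see what it computes: K(a, b) = exp(-gamma · ||a - b||²). The sketch below checks a hand-written version against `rbf_kernel` from `sklearn.metrics.pairwise`; the value of `gamma` and the two points are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.5  # illustrative value, not a recommendation
a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 4.0]])

# RBF kernel written out by hand
manual = np.exp(-gamma * np.sum((a - b) ** 2))

# The same quantity from scikit-learn (returns a 1x1 matrix here)
from_sklearn = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(manual, from_sklearn)
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood.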

##### IMPLEMENTATION OF SUPPORT VECTOR MACHINE USING PYTHON

We are using the Letter-Recognition dataset for this implementation. You can download the dataset here.

Starting with the process, import the necessary libraries required for the model.

```
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale
```

`letters = pd.read_csv("letter-recognition.csv")`

##### Data Pre-processing

`print("Dimensions : ", letters.shape)`

*Dimensions : (20000, 17)*

The **head()** method displays the first 5 rows of the dataset.

`letters.head()`

The **info()** method prints a concise summary of the data, including the index dtype, the column dtypes, the non-null counts, and the memory usage.

`letters.info()`

```
#prints the column names
letters.columns
```

Some column names contain a trailing space, e.g. ‘xbox ’, which throws an error when the column is indexed by its clean name. So let’s rename the columns.

```
letters.columns = ['letter', 'xbox', 'ybox', 'width', 'height', 'onpix',
'xbar', 'ybar', 'x2bar', 'y2bar', 'xybar', 'x2ybar',
'xy2bar', 'xedge','xedgey', 'yedge', 'yedgex']
print(letters.columns)
```
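If you would rather keep the original header names and only remove the stray whitespace, pandas can strip it in one step. This is an alternative sketch on a hypothetical two-row frame (`df` is made up to mimic headers with trailing spaces); it keeps whatever names the file already has instead of assigning new ones.

```python
import pandas as pd

# Hypothetical frame mimicking headers that carry trailing spaces
df = pd.DataFrame({"letter": ["T", "I"], "xbox ": [2, 5], "ybox ": [8, 12]})

# Strip leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()

print(list(df.columns))
print(df["xbox"])  # indexing by the clean name now works
```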

```
#Prints the unique values in the letter column
order = list(np.sort(letters['letter'].unique()))
print(order)
```

*[‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’, ‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘V’, ‘W’, ‘X’, ‘Y’, ‘Z’]*

Let us plot a graph to understand how the various attributes vary across the letters.

```
plt.figure(figsize=(16, 8))
sns.barplot(x='letter', y='xbox',
data=letters,
order=order)
```

```
#Finding mean values of the columns for the letters
letter_means = letters.groupby('letter').mean()
letter_means.head()
```

```
#Plotting the mean values in the form of heatmap.
plt.figure(figsize=(18, 10))
sns.heatmap(letter_means)
```

##### Data Preparation

Let’s conduct some data preparation steps before preparing the model. First, let’s see if it is important to rescale the features, since they may have varying ranges.

For example, here are the average values of each feature:

```
# Average feature values
round(letters.drop('letter', axis=1).mean(), 2)
```

```
# Splitting into X and y
X = letters.drop("letter", axis = 1)
y = letters['letter']
```

**Feature Scaling or Standardization**

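Feature scaling is a data preprocessing step applied to the independent variables (features) of the data. The scale function used here standardizes each feature to zero mean and unit variance, so that features with larger ranges do not dominate the distance computations inside the SVM.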

```
# Scaling the features
X_scaled = scale(X)
# Train Test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 101)
```
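One caveat: the snippet above scales the full dataset before splitting, so the test rows contribute to the scaling statistics. A common alternative is to put a `StandardScaler` and the classifier in a pipeline, so that the scaler is fitted on the training rows only. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, so that the example runs without the CSV file); it is not the method this post follows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 16 features, like the letters data
X_demo, y_demo = make_classification(n_samples=500, n_features=16,
                                     n_informative=8, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=101)

# The scaler is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))  # statistics come from X_tr
print(pipe.score(X_te, y_te))
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted within each fold.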

##### Model Building

Let’s first build two basic models – Linear and Non-Linear with default hyperparameters, and then compare the accuracies.

```
# Linear model
model_linear = SVC(kernel='linear')
# Fit the model
model_linear.fit(X_train, y_train)
```

We have used kernel=’linear’ here, which fits a linear decision boundary; this works well when the data is linearly separable.

```
# Predicting the result
y_pred = model_linear.predict(X_test)
```

```
# Accuracy
print("Accuracy : ", metrics.accuracy_score(y_true=y_test, y_pred=y_pred), "\n")
# Confusion matrix
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))
```

*Accuracy : 0.8523333333333334*
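With 26 classes the printed confusion matrix is large, so it helps to know how to read it. In scikit-learn, rows are the true labels and columns are the predicted labels, and the accuracy is the diagonal sum divided by the total count. Here is a tiny sketch with hypothetical three-letter labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, made up for illustration
y_true_demo = ["A", "A", "B", "B", "C", "C"]
y_pred_demo = ["A", "B", "B", "B", "C", "A"]
labels = ["A", "B", "C"]

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)                        # rows: true label, columns: predicted label
print(cm.trace() / cm.sum(), acc)  # both expressions give the accuracy
```

Off-diagonal entries show which letters are confused with which, for example the single true "A" that was predicted as "B".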

The linear model gives approximately 85% accuracy. Let’s look at a non-linear model with default hyperparameters.

```
# Non-Linear model
# using rbf kernel, C=1, default value of gamma
non_linear_model = SVC(kernel='rbf')
# fit the model
non_linear_model.fit(X_train, y_train)
```

```
# Predicting the result
y_pred = non_linear_model.predict(X_test)
```

```
# Accuracy
print("Accuracy : ", metrics.accuracy_score(y_true=y_test, y_pred=y_pred), "\n")
# Confusion Matrix
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))
```

*Accuracy : 0.9383333333333334*

The non-linear model gives approximately 94% accuracy. Thus, let’s tune the hyperparameters of the non-linear model.

**Grid Search: Hyperparameter Tuning**

Machine learning models have hyperparameters that you must set in order to customize the model to your dataset. Grid Search defines a search space as a grid of hyperparameter values and evaluates every position in the grid.

**KFold** provides train/test indices to split the data into training and validation sets. It splits the dataset into k consecutive folds (without shuffling by default; below we pass shuffle=True so the rows are shuffled first). Each fold is then used once as a validation set while the k – 1 remaining folds form the training set.

```
# Creating a KFold object with 5 splits
folds = KFold(n_splits = 5, shuffle = True, random_state = 101)
```
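To see what such a `KFold` object produces, the sketch below splits a hypothetical array of ten rows (`X_small`, made up for illustration) and prints the indices of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows, used only to show the indices KFold generates
X_small = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=101)

val_sets = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_small)):
    # Each row lands in the validation set of exactly one fold
    val_sets.append(val_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```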

Cross-validation tests the model’s ability to predict new data that was not used in fitting it, in order to flag problems like overfitting or selection bias, and to give insight into how the model will generalize to an independent dataset.

Let’s now tune the model to find the optimal values of C and gamma corresponding to an RBF kernel. We’ll use 5-fold cross-validation.

```
# Specify range of hyperparameters
# Set the parameters by cross-validation
hyper_params = [ {'gamma': [1e-2, 1e-3, 1e-4], 'C': [1, 10, 100, 1000]}]
```
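Before launching the search it is worth estimating its cost. The grid above can be enumerated with scikit-learn's `ParameterGrid`; the fold count of 5 matches the `KFold` object created earlier.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the post
grid = ParameterGrid({"gamma": [1e-2, 1e-3, 1e-4], "C": [1, 10, 100, 1000]})

n_candidates = len(grid)     # every gamma paired with every C
n_fits = n_candidates * 5    # one fit per candidate per fold

print(n_candidates, n_fits)
for params in list(grid)[:3]:
    print(params)            # a few of the combinations that will be tried
```

By default `GridSearchCV` also refits the best candidate once on the whole training set after these fits.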

```
model = SVC(kernel="rbf")
# Set up GridSearchCV()
model_cv = GridSearchCV(estimator = model,
param_grid = hyper_params, scoring= 'accuracy',
cv = folds, verbose = 1, return_train_score=True)
# Fit the model
model_cv.fit(X_train, y_train)
```

```
# Cross validation results
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results
```

```
# Converting C to numeric type for plotting on x-axis
cv_results['param_C'] = cv_results['param_C'].astype('int')
# Plotting
plt.figure(figsize=(16,6))
```

```
# Subplot 1/3
plt.subplot(131)
gamma_01 = cv_results[cv_results['param_gamma']==0.01]
plt.plot(gamma_01["param_C"], gamma_01["mean_test_score"])
plt.plot(gamma_01["param_C"], gamma_01["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.01")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
```

```
# subplot 2/3
plt.subplot(132)
gamma_001 = cv_results[cv_results['param_gamma']==0.001]
plt.plot(gamma_001["param_C"], gamma_001["mean_test_score"])
plt.plot(gamma_001["param_C"], gamma_001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
```

```
# subplot 3/3
plt.subplot(133)
gamma_0001 = cv_results[cv_results['param_gamma']==0.0001]
plt.plot(gamma_0001["param_C"], gamma_0001["mean_test_score"])
plt.plot(gamma_0001["param_C"], gamma_0001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.0001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
```

The plots above show some useful insights:

- More strongly non-linear models (high gamma) perform much better than the near-linear, low-gamma ones for this dataset.
- At any value of gamma, a high value of C leads to better performance.
- None of the models tend to overfit (even the complex ones), since the training and test accuracies closely follow each other.

This suggests that the problem and the data are inherently non-linear in nature, and a complex model will outperform simple, Linear models in this case.
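The overfitting check can also be done numerically instead of by eye, by subtracting the mean validation score from the mean training score for each candidate. The sketch below does this on a small synthetic dataset from `make_moons` with an arbitrary grid (both are assumptions so the example runs quickly); the `gap` column name is my own.

```python
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset, used only to keep the example fast
X_m, y_m = make_moons(n_samples=300, noise=0.25, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"gamma": [0.01, 1, 100], "C": [1, 100]},
                      scoring="accuracy", cv=5, return_train_score=True)
search.fit(X_m, y_m)

results = pd.DataFrame(search.cv_results_)
# A large positive gap means the model fits the training folds
# much better than the held-out folds
results["gap"] = results["mean_train_score"] - results["mean_test_score"]

print(results[["param_gamma", "param_C",
               "mean_train_score", "mean_test_score", "gap"]])
```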

Let’s now choose the best hyperparameters.

```
# Printing the optimal accuracy score and hyperparameters
best_score = model_cv.best_score_
best_hyperparams = model_cv.best_params_
print("The best test score is {0} corresponding to hyperparameters {1}"
.format(best_score, best_hyperparams))
```

*The best test score is 0.9517142857142857 corresponding to hyperparameters {‘C’: 1000, ‘gamma’: 0.01}*

##### Building and Evaluating the Final Model

```
# Model with optimal hyperparameters
model = SVC(C=1000, gamma=0.01, kernel="rbf")
# Fit the model
model.fit(X_train, y_train)
# Predict the result
y_pred = model.predict(X_test)
# Metrics
print("Accuracy : ", metrics.accuracy_score(y_test, y_pred), "\n")
print(metrics.confusion_matrix(y_test, y_pred), "\n")
```

*Accuracy : 0.9596666666666667*
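Rebuilding the model by hand is not strictly necessary. With its default `refit=True`, `GridSearchCV` retrains the best candidate on the full training set and exposes it as `best_estimator_`. A minimal sketch, using the iris dataset and a small arbitrary grid as stand-ins so that it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data; the post itself uses the letters dataset
X_i, y_i = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_i, y_i,
                                          test_size=0.3, random_state=101)

search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"gamma": [1e-2, 1e-1], "C": [1, 10]},
                      cv=5)  # refit=True is the default
search.fit(X_tr, y_tr)

# The refitted model carries the winning hyperparameters
final_model = search.best_estimator_
same_predictions = bool((search.predict(X_te) == final_model.predict(X_te)).all())

print(search.best_params_, final_model.score(X_te, y_te), same_predictions)
```

This avoids copying the best values of C and gamma by hand, which matters when the grid is rerun and the winners change.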

**Conclusion**

The non-linear (RBF) kernel reaches about 96% accuracy, much higher than the 85% of the linear kernel. We can conclude that the problem is highly non-linear in nature.

##### Advantages of SVM

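- SVM can be used for linearly separable as well as non-linearly separable data. Linearly separable data can be handled with a hard margin, whereas data that is not perfectly separable calls for a soft margin.
- SVM is relatively memory efficient, since the decision function uses only a subset of the training points (the support vectors).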
- SVM works relatively well when there is a clear margin of separation between classes.
- SVM is effective in high-dimensional spaces.
- SVM is effective in cases where the number of dimensions is greater than the number of samples.

##### Disadvantages of SVM

- The choice of the kernel is perhaps the biggest limitation of the support vector machine. With so many kernels available, it becomes difficult to choose the right one for the data.
- The SVM algorithm is not well suited to large data sets, since training time grows quickly with the number of samples.
- SVM does not perform very well when the data set is noisy, i.e. when the target classes overlap.
- In cases where the number of features for each data point greatly exceeds the number of training samples, the SVM can overfit unless the kernel and the regularization term are chosen carefully.