## Understanding Support Vector Machine via Examples

— MachineLearning, Python — 6 min read

In the previous post on Support Vector Machines (SVM), we looked at the mathematical details of the algorithm. In this post, I will discuss practical implementations of SVM for classification as well as regression. I will use the iris dataset as the example for the classification problem, and randomly generated data as the example for the regression problem.

In Python, scikit-learn is a widely used library for implementing machine learning algorithms. SVM is also available in scikit-learn and follows the usual structure (import the library, create the object, fit the model and predict). The sklearn.svm module mainly provides two classes: sklearn.svm.SVC for classification and sklearn.svm.SVR for regression.

As pointed out by **Admiral deblue** in the comments below, all practical implementations of SVMs
have strict requirements for training and testing (prediction). The first requirement is that all
data should be numerical. Therefore, if you have categorical features, they need to be converted to
numerical values using variable transformation techniques like one-hot encoding, label encoding,
etc. SVM implementations in Python also do not support missing values, so you need to either
remove rows with missing values or use some form of data imputation. The
sklearn.preprocessing.Imputer class was the usual tool for this exercise; in recent scikit-learn
releases it has been replaced by sklearn.impute.SimpleImputer. Furthermore, since SVMs assume that
all features are on a comparable scale (so that all feature variables are treated equally), it is
best to scale the features before training the model. sklearn.preprocessing.StandardScaler
standardizes each feature to zero mean and unit variance, while
sklearn.preprocessing.MinMaxScaler rescales each feature to a fixed range such as 0 to 1.
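The imputation and scaling steps described above can be sketched as follows. This is a minimal illustration on a made-up toy matrix (the values in `X_raw` are assumptions chosen only for the example), using `SimpleImputer` with the mean strategy and `StandardScaler`; a real project would pick the imputation strategy to suit the data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numeric feature matrix with one missing value (illustrative data only)
X_raw = np.array([[1.0, 200.0],
                  [2.0, np.nan],
                  [3.0, 400.0],
                  [4.0, 600.0]])

# Replace each missing value with the mean of its column
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_raw)

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_imputed)

print(X_imputed)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

In practice the imputer and scaler should be fitted on the training data only and then applied to the test data, so that no information leaks from the test set.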

In general, sklearn models require the training data (X) to be a 2-d numpy array and the dependent variable (y) to be a 1-d numpy array. With newer versions, a Pandas DataFrame and Series can also be used to provide X and y to sklearn models.
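As a small sketch of the DataFrame route, the snippet below loads iris with `as_frame=True` (available in reasonably recent scikit-learn versions, and assumed here) and passes a DataFrame and a Series straight to `SVC`. The variable names `X_df` and `y_series` are only illustrative.

```python
import pandas as pd
from sklearn import datasets, svm

# Load iris as pandas objects instead of numpy arrays
iris = datasets.load_iris(as_frame=True)
X_df = iris.data.iloc[:, :2]   # first two feature columns, still a DataFrame
y_series = iris.target         # a pandas Series

# Fit directly on the DataFrame and Series
clf = svm.SVC(kernel="linear", C=1.0).fit(X_df, y_series)

# Predictions come back as a numpy array
pred = clf.predict(X_df.head())
print(type(X_df).__name__, type(y_series).__name__, pred.shape)
```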

sklearn.pipeline provides an impressive set of tools to deal with the various aspects of data preparation for training different models in a coherent manner. This will be the topic of a post in the near future.

The iris dataset is a simple dataset that contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are NOT linearly separable from each other. Each instance has 4 features:

- sepal length
- sepal width
- petal length
- petal width

A typical problem to solve is to predict the *class* of the iris plant based on these 4 features.
For brevity and visualization, in this example we will be using only the first two features.

Below is the simplest implementation of an SVM classifier for this problem, using the linear and the radial basis function (rbf) kernels.

```python
import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter notebooks only)

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

# Plot resulting Support Vector boundaries with original data
# Create fake input data for prediction that we will use for plotting
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min) / 100  # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

# Create the SVC model object
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C, decision_function_shape='ovr').fit(X, y)
Z = svc.predict(X_plot)
Z = Z.reshape(xx.shape)

plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')

# Create the SVC model object
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='rbf', C=C, decision_function_shape='ovr').fit(X, y)

Z = svc.predict(X_plot)
Z = Z.reshape(xx.shape)

plt.subplot(122)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with RBF kernel')

plt.show()
```

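As with any machine learning algorithm, we need to choose/tune hyper-parameters for these models. The important parameters to tune are $C$ (the penalty parameter of the error term; remember from our last post that it acts as a regularization parameter for SVM) and $\gamma$ (the kernel coefficient for the ‘rbf’, ‘poly’ and ‘sigmoid’ kernels). In the above example, we did not set $\gamma$, so the library default was used. With the older default `gamma='auto'` this is $\gamma = \frac{1}{n_{features}} = 0.5$; since scikit-learn 0.22 the default is `gamma='scale'`, which uses $\gamma = \frac{1}{n_{features} \cdot \mathrm{Var}(X)}$.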
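To make the two default conventions for $\gamma$ concrete, here is a small sketch that computes both values by hand for the two-feature iris data and checks that passing the hand-computed "scale" value gives the same model as `gamma='scale'`. The names `gamma_auto` and `gamma_scale` are illustrative, and the snippet assumes a scikit-learn version that supports `gamma='scale'`.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

n_features = X.shape[1]
gamma_auto = 1.0 / n_features               # 'auto': 1 / n_features
gamma_scale = 1.0 / (n_features * X.var())  # 'scale': 1 / (n_features * Var(X))

# Fit once with the named option and once with the hand-computed value
clf_named = svm.SVC(kernel="rbf", gamma="scale").fit(X, y)
clf_manual = svm.SVC(kernel="rbf", gamma=gamma_scale).fit(X, y)

same = np.allclose(clf_named.decision_function(X),
                   clf_manual.decision_function(X))
print(gamma_auto, gamma_scale, same)
```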

SVM by definition is well suited for binary classification. In order to perform multi-class classification, the problem needs to be transformed into a set of binary classification problems.

There are two approaches to do this:

**One vs. Rest Approach (OvR)**: This strategy involves training a single classifier per class,
with the samples of that class as positive samples and all other samples as negatives. This
strategy requires the base classifiers to produce a real-valued confidence score for its decision,
rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple
classes are predicted for a single sample.

**One vs. One Approach (OvO)**: In the one-vs.-one (OvO) strategy, one trains K(K − 1)/2 binary
classifiers for a K-way multi-class problem; each receives the samples of a pair of classes from
the original training set, and must learn to distinguish these two classes. At prediction time, a
voting scheme is applied: all K(K − 1)/2 classifiers are applied to an unseen sample and the class
that got the highest number of "+1" predictions gets predicted by the combined classifier. Like
OvR, OvO suffers from ambiguities in that some regions of its input space may receive the same
number of votes.

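In the svm.SVC implementation, the `decision_function_shape` parameter lets you choose between
the two conventions. Note that scikit-learn's SVC always trains one-vs-one classifiers internally;
the parameter only controls whether the decision function is returned in the OvR shape
`(n_samples, n_classes)` or in the original OvO shape `(n_samples, n_classes * (n_classes - 1) / 2)`.
OvO was the default in older releases for historical reasons; recent releases default to `'ovr'`,
which is the recommended setting.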
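The difference between the two settings is easiest to see in the shape of the decision function. The sketch below uses a synthetic 4-class dataset from `make_blobs` (an assumption made for the example, since with the 3 iris classes both shapes happen to have 3 columns) and compares the two options.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic 4-class problem (illustrative data only)
X_blobs, y_blobs = make_blobs(n_samples=200, centers=4, random_state=0)
K = len(np.unique(y_blobs))

ovr = svm.SVC(kernel="linear", decision_function_shape="ovr").fit(X_blobs, y_blobs)
ovo = svm.SVC(kernel="linear", decision_function_shape="ovo").fit(X_blobs, y_blobs)

ovr_shape = ovr.decision_function(X_blobs).shape  # one column per class
ovo_shape = ovo.decision_function(X_blobs).shape  # one column per pair of classes
same_predictions = np.array_equal(ovr.predict(X_blobs), ovo.predict(X_blobs))

print(K, ovr_shape, ovo_shape, same_predictions)
```

With the default settings, the predicted labels should not depend on this option, since it changes only how the decision values are reported.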

Let us first understand what effects the $C$ and $\gamma$ parameters have on SVM models. The higher the value of $\gamma$, the more closely the model tries to fit the training data set exactly, which increases the generalization error and causes over-fitting. $C$ controls the trade-off between a smooth decision boundary and classifying the training points correctly.
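One way to see the over-fitting effect of $\gamma$ is to compare the accuracy on the training data with a cross-validated accuracy for a few values. The sketch below does this on the two-feature iris data; the particular $\gamma$ values are arbitrary choices for illustration, and the exact numbers will depend on the scikit-learn version.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

results = {}
for gamma in [0.1, 10, 1000]:
    clf = svm.SVC(kernel="rbf", C=1.0, gamma=gamma)
    train_acc = clf.fit(X, y).score(X, y)             # accuracy on the data it was fitted on
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # accuracy on held-out folds
    results[gamma] = (train_acc, cv_acc)
    print(f"gamma={gamma:<6} train accuracy={train_acc:.3f}  5-fold CV accuracy={cv_acc:.3f}")
```

A widening gap between the two columns as $\gamma$ grows is the usual signature of over-fitting.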

We will be using 5-fold cross validation to perform grid search to calculate optimal hyper-parameters. This is easily achieved in scikit-learn using the sklearn.model_selection.GridSearchCV class.

```python
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.utils import shuffle

# shuffle the dataset
X, y = shuffle(X, y, random_state=0)

# Split the dataset: 75% for training (development), 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Set the parameters by cross-validation
parameters = [{'kernel': ['rbf'],
               'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5],
               'C': [1, 10, 100, 1000]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

print("# Tuning hyper-parameters")
print()

# Grid search with 5-fold cross-validation over all parameter combinations
clf = GridSearchCV(svm.SVC(decision_function_shape='ovr'), parameters, cv=5)
clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on training set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
```

Output:

`1# Tuning hyper-parameters23Best parameters set found on development set:45{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}67Grid scores on training set:890.634 (+/-0.066) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}100.634 (+/-0.066) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}110.634 (+/-0.066) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}120.768 (+/-0.168) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}130.768 (+/-0.161) for {'C': 1, 'gamma': 0.2, 'kernel': 'rbf'}140.768 (+/-0.173) for {'C': 1, 'gamma': 0.5, 'kernel': 'rbf'}150.634 (+/-0.066) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}160.634 (+/-0.066) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}170.768 (+/-0.168) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}180.750 (+/-0.193) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}190.750 (+/-0.193) for {'C': 10, 'gamma': 0.2, 'kernel': 'rbf'}200.732 (+/-0.183) for {'C': 10, 'gamma': 0.5, 'kernel': 'rbf'}210.634 (+/-0.066) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}220.768 (+/-0.168) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}230.759 (+/-0.178) for {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}240.741 (+/-0.164) for {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}250.723 (+/-0.175) for {'C': 100, 'gamma': 0.2, 'kernel': 'rbf'}260.732 (+/-0.183) for {'C': 100, 'gamma': 0.5, 'kernel': 'rbf'}270.768 (+/-0.168) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}280.759 (+/-0.178) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}290.750 (+/-0.193) for {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}300.732 (+/-0.183) for {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}310.732 (+/-0.183) for {'C': 1000, 'gamma': 0.2, 'kernel': 'rbf'}320.696 (+/-0.164) for {'C': 1000, 'gamma': 0.5, 'kernel': 'rbf'}330.768 (+/-0.173) for {'C': 1, 'kernel': 'linear'}340.759 (+/-0.178) for {'C': 10, 'kernel': 'linear'}350.759 (+/-0.178) for {'C': 100, 'kernel': 'linear'}360.759 (+/-0.178) for {'C': 1000, 'kernel': 'linear'}`

We have done a few things in the above code. Let us break them down into steps.

First, if you pay attention to the input dataset, it lists the three different classes of iris plants in order. So that the models do not depend on this ordering, it is safer to first shuffle the dataset. This is achieved using the shuffle() function. We also want to set aside a fraction of the dataset for the final test of our algorithm's success. This is done using the train_test_split() function. In this particular case, we have kept aside about a quarter of the dataset for testing.
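As a side note, `train_test_split()` shuffles the data by default, and its `stratify` argument keeps the class proportions the same in both parts. A minimal sketch of a stratified split on the two-feature iris data follows; using `stratify=y` is a suggestion of this example and is not part of the code above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Hold out 25% of the samples, keeping the three classes balanced in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))  # samples per class
```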

Moving to the main part of the code: the tuning of hyper-parameters for SVM. It is done using the
`GridSearchCV()` class (the `clf = GridSearchCV(...)` and `clf.fit(...)` lines in the above code
block). At the end, we also print the accuracy score for each set of parameters. The best set of
parameters is available through the `clf.best_params_` attribute.

Classification Scoring

By default scikit-learn uses accuracy as the score for classification tasks. GridSearchCV() provides the option to use alternative scoring metrics via the `scoring` parameter. Some common alternatives are precision, recall and ROC AUC, with different averaging strategies such as micro, macro and weighted (for example `'f1_macro'` or `'precision_weighted'`).
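As a brief sketch of the `scoring` parameter, the snippet below runs a small grid search that ranks candidates by macro-averaged F1 instead of accuracy. The reduced parameter grid is an assumption made to keep the example short.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Small illustrative grid; 'f1_macro' averages the per-class F1 scores equally
param_grid = {"kernel": ["linear"], "C": [1, 10, 100]}
grid = GridSearchCV(svm.SVC(), param_grid, scoring="f1_macro", cv=5)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```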

Finally, we can test our model on the test dataset and evaluate various classification metrics
using the `classification_report()` function.

```python
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
```

Output:

```
Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       0.73      0.92      0.81        12
          2       0.91      0.71      0.80        14

avg / total       0.88      0.87      0.87        38
```

Apart from accuracy, the three major metrics for evaluating a classification task are precision, recall and f1-score.

**Precision**: The precision is the ratio `tp / (tp + fp)`, where `tp` is the number of true
positives and `fp` the number of false positives. The precision is intuitively the ability of the
classifier not to label as positive a sample that is negative.

**Recall**: The recall is the ratio `tp / (tp + fn)`, where `tp` is the number of true positives and
`fn` the number of false negatives. The recall is intuitively the ability of the classifier to find
all the positive samples.

**$F_1$-Score**: It can be interpreted as the harmonic mean of the precision and recall,
$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$. It reaches its best value at 1
and its worst at 0.

**Support**: Although not a scoring metric, it is an important quantity when looking at different
metrics. It is the number of occurrences of each class in `y_true`.
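The definitions above can be checked by hand on a small example. The sketch below uses made-up label vectors (not the iris results), counts `tp`, `fp` and `fn` for one class, and compares the results with scikit-learn's `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up true and predicted labels for a 3-class problem (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

positive = 1  # treat class 1 as the "positive" class
tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = np.sum(y_true == positive)

# The same quantities as computed by scikit-learn for class 1
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, labels=[positive])
print(precision, recall, f1, support)
print(p[0], r[0], f[0], s[0])
```

Here class 1 has 3 true positives, 1 false positive and 1 false negative, so precision, recall and $F_1$ all come out to 0.75 with a support of 4.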

Let us first generate a random dataset on which we want to fit a regression model. In order to
have a good visualization of our results, it is best to use a single feature as an example.
In order to study the effect of non-linear models, we will generate our data from the `sin()`
function and add noise to every fifth point.

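```python
# 200 sorted sample points in [0, 5) and their sine values
X = np.sort(5 * np.random.rand(200, 1), axis=0)
y = np.sin(X).ravel()
# add noise to every fifth target value
y[::5] += 3 * (0.5 - np.random.rand(40))
```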

Below is the simplest implementation of an SVM for this regression problem. In scikit-learn, SVM regression models are implemented using the svm.SVR class.

In this example, we see the simplest implementation of SVM regressors with the linear, polynomial of degree 3 and the radial basis function (rbf) kernels.

```python
from sklearn.svm import SVR

# One regressor per kernel
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3, degree=3)

# Fit each model and predict on the training inputs
y_rbf = svr_rbf.fit(X, y).predict(X)
y_lin = svr_lin.fit(X, y).predict(X)
y_poly = svr_poly.fit(X, y).predict(X)

lw = 2
plt.figure(figsize=(12, 7))
plt.scatter(X, y, color='darkorange', label='data')
plt.plot(X, y_rbf, color='navy', lw=lw, label='RBF model')
plt.plot(X, y_lin, color='c', lw=lw, label='Linear model')
plt.plot(X, y_poly, color='cornflowerblue', lw=lw, label='Polynomial model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
```

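Output: a plot titled "Support Vector Regression" showing the data points with the RBF, linear and polynomial model fits (x-axis "data", y-axis "target").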

The common hyper-parameters in the case of SVM regressors are: $C$ (the penalty parameter of the
error term), $\epsilon$ (specifies the epsilon-tube within which no penalty is associated in the
training loss function with points predicted within a distance epsilon from the actual value) and
$\gamma$ (the kernel coefficient for the ‘rbf’, ‘poly’ and ‘sigmoid’ kernels). Given that our
example is extremely simplified, we won't be able to observe any significant impact of any of these
parameters. In general, similar to the classification case, `GridSearchCV` can be used to tune SVM
regression models as well.
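A minimal sketch of such a search for the RBF regressor is shown below. The parameter grid, the fixed random seed, the shuffled `KFold` splitter (used because the inputs are sorted) and the choice of negative mean squared error as the score are all assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Same kind of noisy sine data as above, with a fixed seed for repeatability
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(40))

# Illustrative grid over C, gamma and epsilon
param_grid = {"C": [1, 10, 100],
              "gamma": [0.01, 0.1, 1],
              "epsilon": [0.01, 0.1, 0.5]}

# Shuffle the folds so that each one covers the whole input range
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=cv,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean squared error of the best candidate
```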

That brings us to the end of our discussion of the different aspects of Support Vector Machine algorithms. In the first post on the topic, I described the theory and the mathematical formulation of the algorithm. In this post, I discussed the implementation details in Python and ways to tune various hyper-parameters in both classification and regression cases. From practical experience, SVMs are great for:

- Small to medium data sets only. Training becomes extremely slow in the case of larger data sets.
- Data sets with low noise. When the data set is noisy, i.e. the target classes overlap, SVMs perform very poorly.
- Problems where the feature dimension is very large. SVMs are extremely helpful, especially when the number of features is larger than the number of samples.
- Memory-efficient models. Since only a subset of the training points (called support vectors) is used in the decision function, SVMs are quite memory efficient. This also leads to extremely fast prediction.

An important point to note is that the SVM doesn’t directly provide probability estimates; in the scikit-learn implementation these are calculated using an expensive internal cross-validation.
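For illustration, probability estimates can be requested with the `probability=True` option of `SVC`, which enables `predict_proba()` at the cost of slower training. The sketch below assumes a scikit-learn version in which this option is available, and uses the two-feature iris data.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# probability=True triggers the extra internal cross-validation during fit
clf = svm.SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])  # one row per sample, one column per class
print(proba.shape)
print(proba.sum(axis=1))          # each row sums to 1
```

Because the probabilities come from a separate calibration step, the class with the highest probability may occasionally differ from the label returned by `predict()`.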

Finally, as with any machine learning algorithm, I would suggest that you try SVMs yourself and explore their power by tuning the various hyper-parameters. I want to hear about your experience with SVMs: how have you tuned SVM models to avoid over-fitting and reduce the training time? Please share your views and experiences in the comments below.