Large Scale Multiple Kernel Learning

This page contains supplementary information on our JMLR paper "Large Scale Multiple Kernel Learning" by Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer and Bernhard Schölkopf. The paper can be found here.

Abstract

While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that this problem can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works on hundreds of thousands of examples or hundreds of kernels to be combined, and aids automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed-up mechanisms for SVMs, especially when used with sparse feature maps as they appear for string kernels, allowing us to train a string kernel SVM on a 10 million example real-world splice data set from computational biology. We integrated multiple kernel learning into our machine learning toolbox SHOGUN, for which the source code is publicly available at http://raetschlab.org/suppl/shogun.
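
The central object in MKL is a conic combination of kernels, K = sum_m beta_m K_m with beta_m >= 0 and sum_m beta_m = 1, whose weights beta are optimized jointly with the SVM. A minimal Matlab sketch of this combination (the toy data, example widths and weights below are illustrative assumptions, not values from the paper):

% Sketch: conic combination of RBF kernel matrices as used in MKL.
x  = randn(2, 10);                          % 10 toy 2-D data points
D2 = sum(x.^2,1)' + sum(x.^2,1) - 2*(x'*x); % pairwise squared distances
widths = [0.1 1 10];                        % example RBF widths
beta   = [0.2 0.5 0.3];                     % example weights on the simplex
K = zeros(10);
for m = 1:3
  K = K + beta(m) * exp(-D2/widths(m));     % weighted sum of base kernels
end
% MKL learns beta jointly with the SVM instead of fixing it a priori.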

Multiple Kernel Learning Examples

These are Matlab examples for classification and regression. They use our machine learning toolbox SHOGUN, which must be installed to run them.

MKL for classifying Christmas stars

% This script should enable you to rerun the experiment in the
% paper that we labeled with "christmas star".
%
% The task is to classify two star-shaped classes that share the
% midpoint. The difficulty of the learning problem depends on the
% distance between the classes, which is varied.
%
% Our model selection leads to a choice of C = 0.5. The model
% selection is not repeated inside this script.


% Preliminary settings:

C = 0.5;          % SVM parameter
cache_size = 50;  % cache per kernel in MB
svm_eps = 1e-3;   % SVM epsilon
mkl_eps = 1e-3;   % MKL epsilon

no_obs = 2000;   % number of observations per class and per set (train/test)
k_star = 20;     % number of "leaves" of the stars
alpha = 0.3;     % noise level of the data

radius_star(:,1) = [4.1:0.2:10]';    % increasing radius of the first class
radius_star(:,2) = 4*ones(length(radius_star(:,1)),1);   % fixed radius of the second class
                                     % distance between the classes: radius_star(:,1)-radius_star(:,2)
rbf_width = [0.01 0.1 1 10 100];     % widths of the five RBF kernels


%%%%
%%%% Main loop: train MKL for every data set (the different distances between the stars)
%%%%

sg('send_command','loglevel ERROR');
sg('send_command','echo OFF');


for kk = 1:size(radius_star,1)

  % data generation
  fprintf('MKL for radius %+02.2f                                                      \n', radius_star(kk,1))

  dummy(1,:) = rand(1,4*no_obs);
  noise = alpha*randn(1,4*no_obs);

  dummy(2,:) = sin(k_star*pi*dummy(1,:)) + noise;         % sine
  dummy(2,1:2*no_obs) = dummy(2,1:2*no_obs) + radius_star(kk,1);         % distance shift: first class
  dummy(2,(2*no_obs+1):end) = dummy(2,(2*no_obs+1):end) + radius_star(kk,2); % distance shift: second class

  dummy(1,:) = 2*pi*dummy(1,:);

  x(1,:) =  dummy(2,:).*sin(dummy(1,:));
  x(2,:) =  dummy(2,:).*cos(dummy(1,:));

  train_y = [-ones(1,no_obs) ones(1,no_obs)];
  test_y = [-ones(1,no_obs) ones(1,no_obs)];

  train_x = x(:,1:2:end);
  test_x  = x(:,2:2:end);

  clear dummy x;

  % train MKL

  sg('send_command','clean_kernels');
  sg('send_command','clean_features TRAIN');
  sg('add_features','TRAIN', train_x);       % add the training features once per kernel
  sg('add_features','TRAIN', train_x);
  sg('add_features','TRAIN', train_x);
  sg('add_features','TRAIN', train_x);
  sg('add_features','TRAIN', train_x);
  sg('set_labels','TRAIN', train_y);         % set the labels
  sg('send_command', 'new_svm LIGHT');
  sg('send_command', 'use_linadd 0');
  sg('send_command', 'use_mkl 1');
  sg('send_command', 'use_precompute 0');
  sg('send_command', sprintf('mkl_parameters %f 0', mkl_eps));
  sg('send_command', sprintf('svm_epsilon %f', svm_eps));
  sg('send_command', 'set_kernel COMBINED 0');
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(1) ));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(2) ));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(3) ));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(4) ));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(5) ));
  sg('send_command', sprintf('c %1.2e', C));
  sg('send_command', 'init_kernel TRAIN');
  sg('send_command', 'svm_train');
  [b,alphas] = sg('get_svm');
  w(kk,:) = sg('get_subkernel_weights');

  % calculate train error

  sg('send_command','clean_features TEST');
  sg('add_features','TEST',train_x);
  sg('add_features','TEST',train_x);
  sg('add_features','TEST',train_x);
  sg('add_features','TEST',train_x);
  sg('add_features','TEST',train_x);
  sg('set_labels','TEST', train_y);
  sg('send_command', 'init_kernel TEST');
  sg('send_command', 'set_threshold 0');
  result.trainout(kk,:)=sg('svm_classify');
  result.trainerr(kk)  = mean(train_y~=sign(result.trainout(kk,:)));

  % calculate test error

  sg('send_command', 'clean_features TEST');
  sg('add_features','TEST',test_x);
  sg('add_features','TEST',test_x);
  sg('add_features','TEST',test_x);
  sg('add_features','TEST',test_x);
  sg('add_features','TEST',test_x);
  sg('set_labels','TEST',test_y);
  sg('send_command', 'init_kernel TEST');
  sg('send_command', 'set_threshold 0');
  result.testout(kk,:)=sg('svm_classify');
  result.testerr(kk)  = mean(test_y~=sign(result.testout(kk,:)));

end
disp('done. w now contains the kernel weightings; result contains the train/test outputs and errors.')
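
To inspect the result, the learned kernel weights and the test error can be plotted against the class distance. A minimal sketch, assuming w, result and radius_star from the script above are still in the workspace:

% Sketch: kernel weights and test error as a function of the star radius.
figure;
subplot(2,1,1);
plot(radius_star(:,1), w);              % one curve per RBF width
xlabel('radius of first class'); ylabel('kernel weight');
subplot(2,1,2);
plot(radius_star(:,1), result.testerr); % test error per data set
xlabel('radius of first class'); ylabel('test error');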

MKL for regression

  • sine wave:
% This script should enable you to rerun the experiment in the
% paper that we labeled "sine".
%
% In this regression task a sine wave is to be learned.
% We vary the frequency of the wave.

% Preliminary settings:

% Parameters for the SVMs
C          = 10;     % obtained via model selection (not included in this script)
cache_size = 10;     % cache per kernel in MB
mkl_eps    = 1e-3;   % MKL epsilon (precision threshold)
svm_eps    = 1e-3;   % SVM epsilon
svr_tube_eps = 1e-2; % SVR tube epsilon
debug = 0;

% Kernel widths for the five basic RBF kernels
rbf_width(1) = 0.005;
rbf_width(2) = 0.05;
rbf_width(3) = 0.5;
rbf_width(4) = 1;
rbf_width(5) = 10;

% data
f = [0.1:0.2:5];   % values for the different frequencies
no_obs = 1000;     % number of observations

if debug
      sg('send_command', 'loglevel ALL');
      sg('send_command', 'echo ON');
else
      sg('send_command', 'loglevel ERROR');
      sg('send_command', 'echo OFF');
end

for kk = 1:length(f)    % big loop for the different learning problems

  % data generation

  train_x = 1:(((10*2*pi)-1)/(no_obs-1)):10*2*pi;
  train_y = sin(f(kk)*train_x);

  % initialize MKL-SVR
  sg('send_command', 'new_svm SVRLIGHT');
  sg('send_command', 'use_mkl 1');
  sg('send_command', 'use_precompute 3');
  sg('send_command', sprintf('mkl_parameters %f 0', mkl_eps));
  sg('send_command', sprintf('c %f',C));
  sg('send_command', sprintf('svm_epsilon %f',svm_eps));
  sg('send_command', sprintf('svr_tube_epsilon %f',svr_tube_eps));
  sg('send_command', 'clean_features TRAIN' );
  sg('send_command', 'clean_kernels');
  sg('set_labels', 'TRAIN', train_y);               % set labels
  sg('add_features','TRAIN', train_x);              % add features for every SVR
  sg('add_features','TRAIN', train_x);
  sg('add_features','TRAIN', train_x);
  sg('add_features','TRAIN', train_x);
  sg('add_features','TRAIN', train_x);
  sg('send_command', 'set_kernel COMBINED 0');
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(1)));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(2)));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(3)));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(4)));
  sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(5)));

  sg('send_command', 'init_kernel TRAIN');
  sg('send_command', 'svm_train');

  weights(kk,:) = sg('get_subkernel_weights');
  fprintf('frequency: %02.2f   rbf-kernel-weights:  %02.2f %02.2f %02.2f %02.2f %02.2f           \n', f(kk), weights(kk,:))
end
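
The printed weights show the weight mass moving towards narrower kernels as the frequency grows; this can be visualized with a minimal sketch, assuming weights, f and rbf_width from the script above:

% Sketch: kernel weight per RBF width and sine frequency.
figure;
imagesc(weights');                 % rows: kernels, columns: frequencies
set(gca, 'YTick', 1:5, 'YTickLabel', rbf_width);
xlabel('frequency index (into f)'); ylabel('RBF kernel width');
colorbar;
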
  • linear and sine mixture:
% This script should enable you to rerun the experiment in the
% paper that we labeled "mixture linear and sine".
%
% The task is to learn a regression function where the true function
% is given by a mixture of two sine waves plus a linear trend.
% We vary the frequency of the second, higher-frequency sine wave.

% Setup: MKL on 10 RBF kernels of different widths on 1000 examples


% Preliminary settings

% Kernel widths for the ten basic RBF kernels
rbf_width(1) = 0.001;
rbf_width(2) = 0.005;
rbf_width(3) = 0.01;
rbf_width(4) = 0.05;
rbf_width(5) = 0.1;
rbf_width(6) = 1;
rbf_width(7) = 10;
rbf_width(8) = 50;
rbf_width(9) = 100;
rbf_width(10) = 1000;

% SVM parameters
C          = 1;
cache_size = 50;    % cache per kernel in MB
mkl_eps    = 1e-4;  % MKL epsilon
svm_eps    = 1e-4;  % SVM epsilon
svm_tube   = 0.01;  % SVR tube epsilon
debug      = 0;

% data
f = [0:20];  % parameter that varies the frequency of the second sine wave
no_obs = 1000;    % number of observations

if debug
      sg('send_command', 'loglevel ALL');
      sg('send_command', 'echo ON');
else
      sg('send_command', 'loglevel ERROR');
      sg('send_command', 'echo OFF');
end

for kk = 1:length(f)   % big loop for the different learning problems

      % data generation

      train_x = 0:((4*pi)/(no_obs-1)):4*pi;
      trend = 2 * train_x * (pi/(max(train_x)-min(train_x)));
      wave1 = sin(train_x);
      wave2 = sin(f(kk)*train_x);
      train_y = trend + wave1 + wave2;

      % MKL learning

      sg('send_command', 'new_svm SVRLIGHT');
      sg('send_command', 'use_mkl 1');
      sg('send_command', 'use_precompute 0');       % do not precompute kernel matrices
      sg('send_command', sprintf('mkl_parameters %f 0',mkl_eps));
      sg('send_command', sprintf('c %f',C));
      sg('send_command', sprintf('svm_epsilon %f',svm_eps));
      sg('send_command', sprintf('svr_tube_epsilon %f',svm_tube));
      sg('send_command', 'clean_features TRAIN' );
      sg('send_command', 'clean_kernels' );

      sg('set_labels', 'TRAIN', train_y);               % set labels
      sg('add_features','TRAIN', train_x);              % add features for every basic SVM
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('add_features','TRAIN', train_x);
      sg('send_command', 'set_kernel COMBINED 0');
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(1)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(2)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(3)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(4)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(5)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(6)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(7)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(8)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(9)));
      sg('send_command', sprintf('add_kernel 1 GAUSSIAN REAL %d %f', cache_size, rbf_width(10)));
      sg('send_command', 'init_kernel TRAIN');
      sg('send_command', 'svm_train');

      weights(kk,:) = sg('get_subkernel_weights');
      fprintf('frequency: %02.2f   rbf-kernel-weights:  %02.2f %02.2f %02.2f %02.2f %02.2f %02.2f %02.2f %02.2f %02.2f %02.2f           \n', f(kk), weights(kk,:))
end
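
Note that both regression scripts only extract the kernel weights and never evaluate the learned function. A minimal prediction sketch, mirroring the TEST handling of the classification example above; it assumes the ten-kernel SVR from the last loop iteration is still trained and that svm_classify returns the real-valued SVR outputs, as it does for the classifier:

% Sketch: evaluate the trained MKL-SVR on new inputs.
test_x = 0:((4*pi)/499):4*pi;            % 500 hypothetical test points
sg('send_command', 'clean_features TEST');
for i = 1:10                             % one feature copy per kernel
  sg('add_features', 'TEST', test_x);
end
sg('send_command', 'init_kernel TEST');
pred_y = sg('svm_classify');             % real-valued regression outputs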

15 Million Splice Dataset

The Splice dataset has the following format:

-1    TTCCAAACCCAAATAGTCAGAGTGCAAACCCTCACAGTAAACACAAGACTCTAAGCTCCCAGTGTGCCTCCAGCCATCTCCCCTGTTCATGTGGAGCTTTTCTCCTTTGCCAGCGGGGATCTGCAGCTATCTGGGAGTGCC
-1    TTGTTTATTGATTCTCTTTATCCTGGTGATATATTTGCAGGTTGCAGATATTTGTGAAGAAGAAGTGATATGGTTGGCTGTGTCCTCACCCAAGTCTCATCTTGAATTACAGCTCCCATAATCTCCATGTGTTGTGGGAAG
-1    AAAACAGGTACTAGAATTATATCTGTCATTGACCTAAAAAGGATAAAGAGAGTTGGCAGAAGATACAACTGCATGTAGGGGAATATGCTTTTCATTAACTCTGTAAAGTCGGGTTTTATCTGTTTGAAGGCTTATATAAGT
-1    CACCAGTGAACGGCCAAGTGACACGAGTGACACCATGAGCTTGGTGCCCTCTCCATCCCAAGCCAGAGGCGGAAGCCAGGCCCTTCCTCCCAGCCCAGACTCCTACATCCCAAACTTGAGCCATGGCACACATGCTGGGCA
-1    TCCACCCGCCCCGGCCTCCCGAAGTGCTGGGATTACCATGCCCAGCCCATCCAAATCTTTAGTGTTTTCCATCCATTTATCCCTTCCTCCATCTTGGAAGGACCCTAGAGCCAGACTTCCTGGGTTTTAAATCCTAATTCC
-1    TTCGTCAAGATGACTAATGATAAACAGCAAGCCAGGTGCTGAGATTTTTGGGGGGAATGAAGGGGGTATGAAAAGAAGAGGAAATACAGCGCAGGTCTGGGGGCCCGTCACAGCCCTTGCACTTGGCCTTGTGCTTCCGCT
-1    GGTTTGTGTGTACTTGCATACCCTGTAGTCTAGTACATTTTATATGGCTATGCTTTATAGAGCTTTAGAAAGTGAGGTCAAGCTAAATTTCTTGACTTTAAGGGTGGCCTGAATAGTTCACCATAATCTCATTATTGAAAC
-1    GTGAGAATCTGTTCTTGGAGGTTTCAGGGAAGTGTTTACAGGGAGATGTTGTTTGAGCTGAGACTTGAAGAGTAGGTGTATACCAGGCTGACAAGGTGACAAAATGGCCTTCTGTGGAGGAGGAAATAATCTGTGCAAAGT
-1    GTCCTCTCAACCAGGAAGGGAGCAGGGAGGGTGGCTGCAGGGCCGCAGGTGGGGAGGTGCAGGTGGGAGAGAGGCCCTCTGGTCTGGTCTGGTCTGGGCTGGGTGGTGCAGGGCAGATGGTCAGGCCCCAGCACATGCCAC
-1    GTAGCTGGGACTACAGATGCGTGCCACCACGCCCAGCTAATTTTTTGTATTTTTTTAAGTAGAGATGGGGTTTCACCGTGTTAGCTAGGATGGTCTCGGTCTCCTGACCTTGTGGTTTGCCCACCTCGACCTCCCAAAGTG
...

Here the first column is the label (+1 or -1), separated by a tab from a 141-character string that consists only of the characters A, C, G and T. Since the uncompressed file is about 2GB in size, you can find here bzip2-compressed splits 0-10 (each about 50MB); a minimal loading sketch follows the list:

  1. human_acceptor_splice_data.txt_00.bz2
  2. human_acceptor_splice_data.txt_01.bz2
  3. human_acceptor_splice_data.txt_02.bz2
  4. human_acceptor_splice_data.txt_03.bz2
  5. human_acceptor_splice_data.txt_04.bz2
  6. human_acceptor_splice_data.txt_05.bz2
  7. human_acceptor_splice_data.txt_06.bz2
  8. human_acceptor_splice_data.txt_07.bz2
  9. human_acceptor_splice_data.txt_08.bz2
  10. human_acceptor_splice_data.txt_09.bz2
  11. human_acceptor_splice_data.txt_10.bz2
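
A minimal Matlab sketch for loading one split after decompressing it (e.g. with bunzip2 human_acceptor_splice_data.txt_00.bz2; the decompressed file name below is an assumption based on the split names):

% Sketch: parse lines of the form "<label>\t<141-char ACGT string>".
fid = fopen('human_acceptor_splice_data.txt_00', 'r');
data = textscan(fid, '%d %s');   % column 1: labels, column 2: sequences
fclose(fid);
labels    = data{1};             % vector of +1 / -1 labels
sequences = data{2};             % cell array of 141-character strings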