Tutorials

Note

Many of the datasets I used are available here.

Examples for binary classification

The following code performs binary classification with \(\ell_2\)-regularized logistic regression, with no intercept, on the criteo dataset (21Gb, huge sparse matrix):

from cyanure.estimators import Classifier
from cyanure.data_processing import preprocess
import scipy.sparse
import numpy as np

y_path = "dataset/criteo_y.npz"
x_path = "dataset/criteo_X.npz"


#load criteo dataset 21Gb, n=45840617, p=999999
dataY=np.load(y_path, allow_pickle=True)
y=dataY['arr_0']
X = scipy.sparse.load_npz(x_path)

#normalize the rows of X in-place, without performing any copy
preprocess(X,normalize=True,columns=False)
#declare a binary classifier for l2-logistic regression;
#it uses the auto solver by default and performs at most 500 epochs
classifier=Classifier(loss='logistic',penalty='l2',lambda_1=0.1/X.shape[0],max_iter=500,tol=1e-3,duality_gap_interval=5, verbose=True, fit_intercept=False)
classifier.fit(X,y)

Before commenting on the previous choices, let us run the above code on a single thread of an Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz with 32Gb of memory.

Info : Matrix X, n=45840617, p=999999
Info : *********************************
Info : Catalyst Accelerator
Info : MISO Solver
Info : Incremental Solver
Info : with uniform sampling
Info : Lipschitz constant: 0.250004
Info : Logistic Loss is used
Info : L2 regularization
Info : Epoch: 5, primal objective: 0.456014, time: 159.994
Info : Best relative duality gap: 14383.9
Info : Epoch: 10, primal objective: 0.450885, time: 370.813
Info : Best relative duality gap: 1004.69
Info : Epoch: 15, primal objective: 0.450728, time: 578.932
Info : Best relative duality gap: 6.50049
Info : Epoch: 20, primal objective: 0.450724, time: 787.282
Info : Best relative duality gap: 0.068658
Info : Epoch: 25, primal objective: 0.450724, time: 997.926
Info : Best relative duality gap: 0.00173208
Info : Epoch: 30, primal objective: 0.450724, time: 1215.44
Info : Best relative duality gap: 0.00173207
Info : Epoch: 35, primal objective: 0.450724, time: 1436.1
Info : Best relative duality gap: 9.36947e-05
Info : Time elapsed : 1448.06

The solver used was catalyst-miso; the problem was solved up to accuracy tol=0.001 in about 20 minutes after 35 epochs (without counting the time needed to load the dataset from the hard drive): the solver stops as soon as the relative duality gap falls below tol, which happens here at epoch 35 (\(9.4\times 10^{-5} \leq 10^{-3}\)). The regularization parameter was chosen to be \(\lambda=\frac{1}{10n}\), which is close to the optimal value given by cross-validation. Even though a full grid search with cross-validation would be more costly, this experiment nevertheless shows that processing such a large dataset does not necessarily require massive investments in Amazon EC2 credits, GPUs, or distributed computing architectures.
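
Choosing \(\lambda\) by hand as above is of course a shortcut. For readers who want to tune it, the sketch below illustrates one way a small manual search over lambda_1 with a train/validation split could be organized on a subsample of the data, reusing the X and y loaded above. It is only an illustrative sketch under assumptions: the subsample size, the candidate values, and the single split are arbitrary choices made for the example, not a recommended protocol.

import numpy as np
from cyanure.estimators import Classifier

n = X.shape[0]
rng = np.random.default_rng(0)
subset = rng.choice(n, size=1000000, replace=False)  # work on a random subsample to keep the search cheap
split = len(subset) // 2
train, valid = subset[:split], subset[split:]

best_lambda, best_acc = None, -np.inf
for scale in [1.0, 0.1, 0.01]:  # candidate values, lambda = scale / n_train
    lam = scale / len(train)
    clf = Classifier(loss='logistic', penalty='l2', lambda_1=lam,
                     max_iter=100, tol=1e-3, fit_intercept=False, verbose=False)
    clf.fit(X[train], y[train])
    acc = np.mean(clf.predict(X[valid]) == y[valid])  # validation accuracy
    if acc > best_acc:
        best_lambda, best_acc = lam, acc
print("best lambda_1: {}, validation accuracy: {:.4f}".format(best_lambda, best_acc))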

In the next example, we use the squared hinge loss with \(\ell_1\)-regularization on the rcv1 dataset, choosing the regularization parameter such that the obtained solution has about 10% non-zero coefficients (this is verified in the sketch after the solver output below). The experiment runs on 16 threads, and we also fit an intercept:

from cyanure.estimators import Classifier
from cyanure.data_processing import preprocess
import numpy as np
import scipy.sparse

#load rcv1 dataset about 1Gb, n=781265, p=47152
data = np.load('dataset/rcv1.npz',allow_pickle=True)
y=data['y']
y = np.squeeze(y)
X=data['X']

X = scipy.sparse.csc_matrix(X.all()).T # n x p matrix, csr format

#normalize the rows of X in-place, without performing any copy
preprocess(X,normalize=True,columns=False)
#declare a binary classifier for squared hinge loss + l1 regularization
classifier=Classifier(loss='sqhinge',penalty='l1',lambda_1=0.000005,max_iter=500,tol=1e-3, duality_gap_interval=10, verbose=True, fit_intercept=True)
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X,y)

which yields:

Info : Matrix X, n=781265, p=47152
Info : Memory parameter: 20
Info : *********************************
Info : QNing Accelerator
Info : MISO Solver
Info : Incremental Solver
Info : with uniform sampling
Info : Lipschitz constant: 1
Info : Squared Hinge Loss is used
Info : L1 regularization
Info : Epoch: 10, primal objective: 0.0916455, time: 9.38925
Info : Best relative duality gap: 0.486061
Info : Epoch: 20, primal objective: 0.0916331, time: 18.2816
Info : Best relative duality gap: 0.0197286
Info : Epoch: 30, primal objective: 0.0916331, time: 30.6386
Info : Best relative duality gap: 0.000296367
Info : Time elapsed : 30.806
Info : Total additional line search steps: 4
Info : Total skipping l-bfgs steps: 0
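
To check that the computed solution indeed has about 10% non-zero coefficients, one can inspect the learned weights. The snippet below is a minimal sketch assuming the fitted estimator exposes its weights through a scikit-learn-style coef_ attribute:

import numpy as np

w = classifier.coef_  # learned weight vector (assumed scikit-learn-style attribute)
print("fraction of non-zero coefficients: {:.3f}".format(np.mean(w != 0)))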

Multiclass classification

Let us now do something a bit more involved and perform multinomial logistic regression on the ckn_mnist dataset (10 classes, n=60000, p=2304, dense matrix), with multi-task group Lasso regularization, using two Intel(R) Xeon(R) Silver 4112 CPUs @ 2.60GHz with 32Gb of memory, and choosing a regularization parameter that yields a solution with about 5% non-zero coefficients (checked in the sketch after the solver output below):

from cyanure.estimators import Classifier
from cyanure.data_processing import preprocess
import numpy as np


#load ckn_mnist dataset 10 classes, n=60000, p=2304
data=np.load('dataset/ckn_mnist.npz')
y=data['y']
y = np.squeeze(y)
X=data['X']

#center and normalize the rows of X in-place, without performing any copy
preprocess(X,centering=True,normalize=True,columns=False)
#declare a multinomial logistic classifier with group Lasso regularization
classifier=Classifier(loss='multiclass-logistic',penalty='l1l2',lambda_1=0.0001,max_iter=500,tol=1e-3,duality_gap_interval=5, verbose=True, fit_intercept=False)
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X,y)

which produces:

Info : Matrix X, n=60000, p=2304
Info : Memory parameter: 20
Info : *********************************
Info : QNing Accelerator
Info : MISO Solver
Info : Incremental Solver
Info : with uniform sampling
Info : Lipschitz constant: 0.25
Info : Multiclass logistic Loss is used
Info : Mixed L1-L2 norm regularization
Info : Epoch: 5, primal objective: 0.340267, time: 23.5437
Info : Best relative duality gap: 0.332296
Info : Epoch: 10, primal objective: 0.337646, time: 47.2198
Info : Best relative duality gap: 0.069921
Info : Epoch: 15, primal objective: 0.337337, time: 70.9591
Info : Best relative duality gap: 0.0177314
Info : Epoch: 20, primal objective: 0.337294, time: 94.5435
Info : Best relative duality gap: 0.0106599
Info : Epoch: 25, primal objective: 0.337285, time: 127.509
Info : Best relative duality gap: 0.00454883
Info : Epoch: 30, primal objective: 0.337284, time: 160.711
Info : Best relative duality gap: 0.00094165
Info : Time elapsed : 161.034
Info : Total additional line search steps: 4
Info : Total skipping l-bfgs steps: 0
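
The multi-task group Lasso penalty selects variables jointly: a variable is either zero for all 10 classes or active for all of them. The sketch below checks the announced sparsity level, under the assumption that the fitted estimator exposes a scikit-learn-style coef_ attribute with one set of weights per class (the axis over which the group norm is taken may need to be adapted to the actual orientation of the weight matrix):

import numpy as np

W = classifier.coef_  # assumed weight matrix of shape (n_classes, n_features)
group_norms = np.linalg.norm(W, axis=0)  # l2 norm of each variable's weights across the 10 classes
print("fraction of selected variables: {:.3f}".format(np.mean(group_norms != 0)))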

Learning the multiclass classifier took about 161 seconds (roughly 2 minutes and 41 seconds). To conclude, we provide one last, more classical example: learning \(\ell_2\)-regularized logistic regression classifiers on the same dataset, in a one-vs-all fashion:

from cyanure.estimators import Classifier
from cyanure.data_processing import preprocess
import numpy as np


#load ckn_mnist dataset 10 classes, n=60000, p=2304
data=np.load('dataset/ckn_mnist.npz')
y=data['y']
y = np.squeeze(y)
X=data['X']

#center and normalize the rows of X in-place, without performing any copy
preprocess(X,centering=True,normalize=True,columns=False)
#declare a binary l2-logistic classifier, used in a one-vs-all fashion
classifier=Classifier(loss='logistic',penalty='l2',lambda_1=0.01/X.shape[0],max_iter=500,tol=1e-3,duality_gap_interval=10, multi_class="ovr",verbose=True, fit_intercept=False)
# uses the auto solver by default, performs at most 500 epochs
classifier.fit(X,y)

Then, the 10 classifiers are learned in parallel using the 2 CPUs, which gives the following output after about 29 seconds:

Info : Matrix X, n=60000, p=2304
Info : Solver 7 has terminated after 30 epochs in 20.0901 seconds
Info :    Primal objective: 0.0105676, relative duality gap: 0.000956126
Info : Solver 9 has terminated after 30 epochs in 20.5337 seconds
Info :    Primal objective: 0.0162128, relative duality gap: 0.000267688
Info : Solver 2 has terminated after 40 epochs in 25.8979 seconds
Info :    Primal objective: 0.010768, relative duality gap: 3.20012e-05
Info : Solver 1 has terminated after 40 epochs in 26.1818 seconds
Info :    Primal objective: 0.00555594, relative duality gap: 0.000841066
Info : Solver 5 has terminated after 40 epochs in 26.4256 seconds
Info :    Primal objective: 0.00918652, relative duality gap: 5.50489e-05
Info : Solver 4 has terminated after 50 epochs in 28.2959 seconds
Info :    Primal objective: 0.00892122, relative duality gap: 4.20708e-05
Info : Solver 0 has terminated after 50 epochs in 28.4744 seconds
Info :    Primal objective: 0.00581546, relative duality gap: 4.98054e-05
Info : Solver 3 has terminated after 50 epochs in 28.6934 seconds
Info :    Primal objective: 0.00806731, relative duality gap: 4.7563e-05
Info : Solver 8 has terminated after 50 epochs in 28.8942 seconds
Info :    Primal objective: 0.0154151, relative duality gap: 1.63124e-05
Info : Solver 6 has terminated after 50 epochs in 29.0729 seconds
Info :    Primal objective: 0.00696687, relative duality gap: 3.22834e-05
Info : Time for the one-vs-all strategy
Info : Time elapsed : 29.3725
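
Once fitted, the one-vs-all model can be used like any other classifier. As a minimal sketch, the training accuracy can be computed as follows, assuming predict follows the usual scikit-learn semantics (a proper evaluation would of course use a held-out test set):

import numpy as np

y_pred = classifier.predict(X)  # predicted labels on the training data
print("training accuracy: {:.4f}".format(np.mean(y_pred == y)))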