Benchmarks

We consider the problem of \(\ell_2\)-logistic regression for binary classification, or multinomial logistic regression if multiple classes are present.

Datasets

We will present the results obtained by the solvers of Cyanure on 8 datasets, presented in the Table below. The 5 first datasets can be found on the LIBSVM dataset web-page. The ckn_mnist dataset is generated by encoding the MNIST dataset with a two-layer convolutional kernel network (CKN), NIPS’16. The last two datasets have been created by extracting the frozen representations given by models pre-trained with DINO. All datasets samples are normalized with \(\ell_2\)-norm and centered for dense datasets.

Note

Many of the datasets we used are available here. Thanks to Chih-Jen for allowing Julien to post them here in .npz format.

Dataset

Sparse

Num classes

n

p

Size (in Gb)

covtype

No

1

581012

54

0.25

alpha

No

1

500000

500

2

real-sim

No

1

72309

20958

0.044

epsilon

No

1

250000

2000

4

ocr

No

1

2500000

1155

23.1

ckn_mnist

No

10

60000

2304

0.55

imnet

No

1000

1281167

2048

9.8

places

No

205

2448873

2048

19

Setup

To select a reasonable regularization parameter \(\lambda\) for each dataset, we first split each dataset into 80% training and 20% validation, and select the optimal parameter from a logarithmic grid \(2^{-i}/n\) with \(i=1,\ldots,16\) when evaluating trained model on the validation set. Then, we keep the optimal parameter \(\lambda\), merge training and validation sets and report the objective function values in terms of execution time for various solvers. The experiments are run on one Intel(R) Xeon(R) Gold 6430 having access to 126 Go of RAM and a NVIDIA 6000 ADA GPU (48 Go of memory). Experiments were conducted on Linux using the Mamba Python 3.11 distribution.

In the evaluation, we include solvers that can be called from CuML, such as [LBFGS] or [OWL-QN]. We run each solver with different tolerance parameter tol=1e-5, 1e-6, 1e-7, 1e-9, 1e-11, 1e-13, 1e-15 in order to obtain several points illustrating their accuracy-speed trade-off. Each method is run for at most 500 epochs.

Results

The results are presented below. There are 8 datasets, and we are going to group them into categories leading to similar conclusions. We start with those requiring a small regularization parameter (e.g., \(\lambda=1/(100n)\)), which lead to more difficult optimization problems since there is less strong convexity.

Note

I am well aware of the limitations of this study (single runs, lack of error bars) Yet, the conclusions seem robust enough given the number of methods and datasets we used in this study. In this case, the curves are the results of 20 runs for Cyanure GPU and 3 runs for CuML and Cyanure CPU (Except for imnet and places dataset).

optimal \(\lambda\): covtype, epsilon, ckn_mnist – the hard ones

For these datasets, regularization is important, but not crucial to achieve a good predictive accuracy and thus the optimal \(\lambda\) is small. For instance, for ckn_mnist, the accuracy on test data is typically above 99%, and the dimension p for covtype is so small that regularization is useless. This leads to an interesting setting with clear conclusions.

Conclusions
  • qning and catalyst accelerations are very useful. Note that catalyst works well in practice.

  • The GPU version is at least 10 times faster than the CPU one. When we compare the two versions the GPU version is much faster, and we will see in the following part that bigger the dataset is bigger is the speed gap. (It is possible to almost reach a 100 times speedup)

  • We can observe a significative difference for the solution depending on features datatype.

  • Cyanure does better than sklearn-saga, lbfgs, and liblinear, sometimes with several orders of magnitudes. Note that sklearn-saga does as bad as our regular srvg solver for these dataset, which confirms that the key to obtain faster results is acceleration.

_images/covtype_logistic_l2.png _images/epsilon_logistic_l2.png _images/ckn_mnist_logistic_l2.png

optimal \(\lambda\): alpha, rcv1, real-sim, ocr, kddb – the easy ones

For these datasets, the optimal regularization parameter is close to \(\frac{1}{n}\), which is a regime where acceleration does not bring benefits in theory. The results below are consistent with theory and we can draw the following conclusions:

  • accelerations is useless here, as predicted by theory, which is why the ‘auto’ solver only uses acceleration when needed.

  • qning-miso and catalyst-miso are still among the best solvers here, but the difference with svrg is smaller. sklearn-saga is sometimes competitive, sometimes not.

_images/alpha_logistic_l2.png _images/ocr_logistic_l2.png _images/imnet_logistic_l2.png _images/places_logistic_l2.png