kautodiff.{h,c}
and kann.{h,c}
.You are encouraged to include these files in your source code tree. Noinstallation is needed. To compile examples:kann_save()
or use it to classify aMNIST image:Task | Framework | Machine | Device | Real | CPU | Command line |
---|---|---|---|---|---|---|
mnist-mlp | KANN+SSE | Linux | 1 CPU | 31.3s | 31.2s | mlp -m20 -v0 |
Mac | 1 CPU | 27.1s | 27.1s | |||
KANN+BLAS | Linux | 1 CPU | 18.8s | 18.8s | ||
Theano+Keras | Linux | 1 CPU | 33.7s | 33.2s | keras/mlp.py -m20 -v0 | |
4 CPUs | 32.0s | 121.3s | ||||
Mac | 1 CPU | 37.2s | 35.2s | |||
2 CPUs | 32.9s | 62.0s | ||||
TensorFlow | Mac | 1 CPU | 33.4s | 33.4s | tensorflow/mlp.py -m20 | |
2 CPUs | 29.2s | 50.6s | tensorflow/mlp.py -m20 -t2 | |||
Tiny-dnn | Linux | 1 CPU | 2m19s | 2m18s | tiny-dnn/mlp -m20 | |
Tiny-dnn+AVX | Linux | 1 CPU | 1m34s | 1m33s | ||
Mac | 1 CPU | 2m17s | 2m16s | |||
mnist-cnn | KANN+SSE | Linux | 1 CPU | 57m57s | 57m53s | mnist-cnn -v0 -m15 |
4 CPUs | 19m09s | 68m17s | mnist-cnn -v0 -t4 -m15 | |||
Theano+Keras | Linux | 1 CPU | 37m12s | 37m09s | keras/mlp.py -Cm15 -v0 | |
4 CPUs | 24m24s | 97m22s | ||||
1 GPU | 2m57s | keras/mlp.py -Cm15 -v0 | ||||
Tiny-dnn+AVX | Linux | 1 CPU | 300m40s | 300m23s | tiny-dnn/mlp -Cm15 | |
mul100-rnn | KANN+SSE | Linux | 1 CPU | 40m05s | 40m02s | rnn-bit -l2 -n160 -m25 -Nd0 |
4 CPUs | 12m13s | 44m40s | rnn-bit -l2 -n160 -t4 -m25 -Nd0 | |||
KANN+BLAS | Linux | 1 CPU | 22m58s | 22m56s | rnn-bit -l2 -n160 -m25 -Nd0 | |
4 CPUs | 8m18s | 31m26s | rnn-bit -l2 -n160 -t4 -m25 -Nd0 | |||
Theano+Keras | Linux | 1 CPU | 27m30s | 27m27s | rnn-bit.py -l2 -n160 -m25 | |
4 CPUs | 19m52s | 77m45s |
sgemm
) implemented in MKL. As isshown in a previous micro-benchmark, MKL/OpenBLAS can be twice asfast as the implementation in KANN.sgemm
routine from a BLAS library (enabled bymacro HAVE_CBLAS
). Linked against OpenBLAS-0.2.19, KANN matches thesingle-thread performance of Theano on Mul100-rnn. KANN doesn't reduceconvolution to matrix multiplication, so MNIST-cnn won't benefit fromOpenBLAS. We observed that OpenBLAS is slower than the native KANNimplementation when we use a mini-batch of size 1. The cause is unknown.