245 lines
7.4 KiB
ReStructuredText
245 lines
7.4 KiB
ReStructuredText
matrixmultiply
|
||
==============
|
||
|
||
General matrix multiplication for f32, f64 matrices. Operates on matrices with
|
||
general layout (they can use arbitrary row and column stride).
|
||
|
||
Please read the `API documentation here`__
|
||
|
||
__ https://docs.rs/matrixmultiply/
|
||
|
||
|
||
We presently provide a few good microkernels portable and for x86-64, and
|
||
only one operation: the general matrix-matrix multiplication (“gemm”).
|
||
|
||
This crate was inspired by the tmacro/microkernel approach to matrix
|
||
multiplication that is used by the BLIS_ project.
|
||
|
||
.. _BLIS: https://github.com/flame/blis
|
||
|
||
|crates|_
|
||
|
||
.. |crates| image:: https://img.shields.io/crates/v/matrixmultiply.svg
|
||
.. _crates: https://crates.io/crates/matrixmultiply
|
||
|
||
Development Goals
|
||
-----------------
|
||
|
||
- Code clarity and maintainability
|
||
- Portability and stable Rust
|
||
- Performance: provide target-specific microkernels when it is beneficial
|
||
- Testing: Test diverse inputs and test and benchmark all microkernels
|
||
- Small code footprint and fast compilation
|
||
- We are not reimplementing BLAS.
|
||
|
||
Benchmarks
|
||
----------
|
||
|
||
- ``cargo bench`` is useful for special cases and small matrices
|
||
- The best gemm and threading benchmark is is ``examples/benchmarks.rs`` which supports custom sizes,
|
||
some configuration, and csv output.
|
||
Use the script ``benches/benchloop.py`` to run benchmarks over parameter ranges.
|
||
|
||
Blog Posts About This Crate
|
||
---------------------------
|
||
|
||
+ `gemm: a rabbit hole`__
|
||
|
||
__ https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/
|
||
|
||
Recent Changes
|
||
--------------
|
||
|
||
- 0.3.2
|
||
|
||
- Add optional feature ``cgemm`` for complex matmult functions ``cgemm`` and
|
||
``zgemm``
|
||
|
||
- Add optional feature ``constconf`` for compile-time configuration of matrix
|
||
kernel parameters for chunking. Improved scripts for benchmarking over ranges
|
||
of different settings. With thanks to @DutchGhost for the const-time
|
||
parsing functions.
|
||
|
||
- Improved benchmarking and testing.
|
||
|
||
- Threading is now slightly more eager to threads (depending on matrix element count).
|
||
|
||
- 0.3.1
|
||
|
||
- Attempt to fix bug #55 were the mask buffer in TLS did not seem to
|
||
get its requested alignment on macos. The mask buffer pointer is now
|
||
aligned manually (again, like it was in 0.2.x).
|
||
|
||
- Fix a minor issue where we were passing a buffer pointer as ``&T``
|
||
when it should have been ``&[T]``.
|
||
|
||
- 0.3.0
|
||
|
||
- Implement initial support for threading using a bespoke thread pool with
|
||
little contention.
|
||
To use, enable feature ``threading`` (and configure number of threads with the
|
||
variable ``MATMUL_NUM_THREADS``).
|
||
|
||
Initial support is for up to 4 threads - will be updated with more
|
||
experience in coming versions.
|
||
|
||
- Added a better benchmarking program for arbitrary size and layout, see
|
||
``examples/benchmark.rs`` for this; it supports csv output for better
|
||
recording of measurements
|
||
|
||
- Minimum supported rust version is 1.41.1 and the version update policy
|
||
has been updated.
|
||
|
||
- Updated to Rust 2018 edition
|
||
|
||
- Moved CI to github actions (so long travis and thanks for all the fish).
|
||
|
||
- 0.2.4
|
||
|
||
- Support no-std mode by @vadixidav and @jturner314
|
||
New (default) feature flag "std"; use default-features = false to disable
|
||
and use no-std.
|
||
Note that runtime CPU feature detection requires std.
|
||
|
||
- Fix tests so that they build correctly on non-x86 #49 platforms, and manage
|
||
the release by @bluss
|
||
|
||
- 0.2.3
|
||
|
||
- Update rawpointer dependency to 0.2
|
||
- Minor changes to inlining for ``-Ctarget-cpu=native`` use (not recommended -
|
||
use automatic runtime feature detection.
|
||
- Minor improvements to kernel masking (#42, #41) by @bluss and @SuperFluffy
|
||
|
||
- 0.2.2
|
||
|
||
- New dgemm avx and fma kernels implemented by R. Janis Goldschmidt
|
||
(@SuperFluffy). With fast cases for both row and column major output.
|
||
|
||
Benchmark improvements: Using fma instructions reduces execution time on
|
||
dgemm benchmarks by 25-35% compared with the avx kernel, see issue `#35`_
|
||
|
||
Using the avx dgemm kernel reduces execution time on dgemm benchmarks by
|
||
5-7% compared with the previous version's autovectorized kernel.
|
||
|
||
- New fma adaption of the sgemm avx kernel by R. Janis Goldschmidt
|
||
(@SuperFluffy).
|
||
|
||
Benchmark improvement: Using fma instructions reduces execution time on
|
||
sgemm benchmarks by 10-15% compared with the avx kernel, see issue `#35`_
|
||
|
||
- More flexible kernel selection allows kernels to individually set all
|
||
their parameters, ensures the fallback (plain Rust) kernels can be tuned
|
||
for performance as well, and moves feature detection out of the gemm loop.
|
||
|
||
Benchmark improvement: Reduces execution time on various benchmarks
|
||
by 1-2% in the avx kernels, see `#37`_.
|
||
|
||
- Improved testing to cover input/output strides of more diversity.
|
||
|
||
.. _#35: https://github.com/bluss/matrixmultiply/issues/35
|
||
.. _#37: https://github.com/bluss/matrixmultiply/issues/37
|
||
|
||
- 0.2.1
|
||
|
||
- Improve matrix packing by taking better advantage of contiguous inputs.
|
||
|
||
Benchmark improvement: execution time for 64×64 problem where inputs are either
|
||
both row major or both column major changed by -5% sgemm and -1% for dgemm.
|
||
(#26)
|
||
|
||
- In the sgemm avx kernel, handle column major output arrays just like
|
||
it does row major arrays.
|
||
|
||
Benchmark improvement: execution time for 32×32 problem where output is column
|
||
major changed by -11%. (#27)
|
||
|
||
- 0.2.0
|
||
|
||
- Use runtime feature detection on x86 and x86-64 platforms, to enable
|
||
AVX-specific microkernels at runtime if available on the currently
|
||
executing configuration.
|
||
|
||
This means no special compiler flags are needed to enable native
|
||
instruction performance!
|
||
|
||
- Implement a specialized 8×8 sgemm (f32) AVX microkernel, this speeds up
|
||
matrix multiplication by another 25%.
|
||
|
||
- Use ``std::alloc`` for allocation of aligned packing buffers
|
||
|
||
- We now require Rust 1.28 as the minimal version
|
||
|
||
- 0.1.15
|
||
|
||
- Fix bug where the result matrix C was not updated in the case of a M × K by
|
||
K × N matrix multiplication where K was zero. (This resulted in the output
|
||
C potentially being left uninitialized or with incorrect values in this
|
||
specific scenario.) By @jturner314 (PR #21)
|
||
|
||
- 0.1.14
|
||
|
||
- Avoid an unused code warning
|
||
|
||
- 0.1.13
|
||
|
||
- Pick 8x8 sgemm (f32) kernel when AVX target feature is enabled
|
||
(with Rust 1.14 or later, no effect otherwise).
|
||
- Use ``rawpointer``, a µcrate with raw pointer methods taken from this
|
||
project.
|
||
|
||
- 0.1.12
|
||
|
||
- Internal cleanup with retained performance
|
||
|
||
- 0.1.11
|
||
|
||
- Adjust sgemm (f32) kernel to optimize better on recent Rust.
|
||
|
||
- 0.1.10
|
||
|
||
- Update doc links to docs.rs
|
||
|
||
- 0.1.9
|
||
|
||
- Workaround optimization regression in rust nightly (1.12-ish) (#9)
|
||
|
||
- 0.1.8
|
||
|
||
- Improved docs
|
||
|
||
- 0.1.7
|
||
|
||
- Reduce overhead slightly for small matrix multiplication problems by using
|
||
only one allocation call for both packing buffers.
|
||
|
||
- 0.1.6
|
||
|
||
- Disable manual loop unrolling in debug mode (quicker debug builds)
|
||
|
||
- 0.1.5
|
||
|
||
- Update sgemm to use a 4x8 microkernel (“still in simplistic rust”),
|
||
which improves throughput by 10%.
|
||
|
||
- 0.1.4
|
||
|
||
- Prepare support for aligned packed buffers
|
||
- Update dgemm to use a 8x4 microkernel, still in simplistic rust,
|
||
which improves throughput by 10-20% when using AVX.
|
||
|
||
- 0.1.3
|
||
|
||
- Silence some debug prints
|
||
|
||
- 0.1.2
|
||
|
||
- Major performance improvement for sgemm and dgemm (20-30% when using AVX).
|
||
Since it all depends on what the optimizer does, I'd love to get
|
||
issue reports that report good or bad performance.
|
||
- Made the kernel masking generic, which is a cleaner design
|
||
|
||
- 0.1.1
|
||
|
||
- Minor improvement in the kernel
|