The traditional LL/SC atomics perform poorly on modern arm64 systems with
many CPU cores. Since the recent conversion of the sched lock to a mutex,
some systems appear to hang when the sched lock is contended. ARMv8.1
introduced the LSE extension, which provides atomic instructions such as
CAS that perform much better. Unfortunately these can't be used on older
ARMv8.0 systems. Use -moutline-atomics to make the compiler generate
function calls for atomic operations, and provide an implementation of
the functions we use in the kernel that uses LSE when available and
falls back on LL/SC otherwise.
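
For illustration only, the dispatch described above might look roughly
like the sketch below. It is not the code from this commit: the names
sketch_have_lse and sketch_cas4_* are hypothetical, the flag is assumed
to be set once at boot from CPU feature detection, and only a 32-bit
acquire/release compare-and-swap is shown. With -moutline-atomics the
compiler calls helpers with names along the lines of
__aarch64_cas4_acq_rel; the sketch uses its own name to avoid clashing
with those. On non-arm64 hosts both paths fall back to a compiler
builtin so the example still builds.

```c
#include <stdint.h>

/* Hypothetical flag, assumed to be set at boot when the CPU has LSE. */
int sketch_have_lse;

#if defined(__aarch64__)
/* LSE path: one compare-and-swap instruction (ARMv8.1 and later). */
static inline uint32_t
sketch_cas4_lse(volatile uint32_t *p, uint32_t expected, uint32_t desired)
{
	__asm volatile(
	    ".arch_extension lse\n\t"
	    "casal %w0, %w2, %1"
	    : "+r" (expected), "+Q" (*p)
	    : "r" (desired)
	    : "memory");
	return expected;	/* CASAL writes the old value back here */
}

/* LL/SC path: exclusive load/store retry loop (works on ARMv8.0). */
static inline uint32_t
sketch_cas4_llsc(volatile uint32_t *p, uint32_t expected, uint32_t desired)
{
	uint32_t old, fail;

	__asm volatile(
	    "1: ldaxr %w0, %2\n\t"
	    "   cmp   %w0, %w3\n\t"
	    "   b.ne  2f\n\t"
	    "   stlxr %w1, %w4, %2\n\t"
	    "   cbnz  %w1, 1b\n\t"	/* store-exclusive failed, retry */
	    "2:"
	    : "=&r" (old), "=&r" (fail), "+Q" (*p)
	    : "r" (expected), "r" (desired)
	    : "cc", "memory");
	return old;
}
#else
/* Portable stand-in so the sketch builds on other architectures. */
static inline uint32_t
sketch_cas4_lse(volatile uint32_t *p, uint32_t expected, uint32_t desired)
{
	/* On failure the builtin stores the current value in expected. */
	__atomic_compare_exchange_n(p, &expected, desired, 0,
	    __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
	return expected;
}
#define sketch_cas4_llsc sketch_cas4_lse
#endif

/*
 * Out-of-line helper: returns the value found at *p; the swap happened
 * if and only if that value equals expected.
 */
uint32_t
sketch_cas4_acq_rel(volatile uint32_t *p, uint32_t expected, uint32_t desired)
{
	if (sketch_have_lse)
		return sketch_cas4_lse(p, expected, desired);
	return sketch_cas4_llsc(p, expected, desired);
}
```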

Fixes regressions seen on Ampere Altra and Apple M2 Pro/Max/Ultra since
the conversion of the sched lock to a mutex.

tested by claudio@, phessler@, mpi@
ok patrick@