Provide additional BN primitives for BN_ULLONG architectures.
On BN_ULLONG architectures, the C compiler can usually do a decent job
of optimising primitives, however it struggles to see through primitive
calls due to type narrowing. As such, providing explicit versions of
compound primitives can result in the production of more optimal code.
For example, on arm the bn_mulw_addw_addw() primitive can be replaced
with a single umaal instruction, which provides significant performance
gains.
Rather than intermingling #ifdef/#else throughout the header, the
BN_ULLONG defines are pulled up above the normal functions. This also
allows complex compound primitives to be reused. The conditionals have also
been changed from BN_LLONG to BN_ULLONG, since that is what really matters.
ok tb@