Clean up alignment handling for SHA-512.
All assembly implementations are required to perform their own alignment
handling. In the case of the C implementation, on strict alignment
platforms, unaligned data will be copied into an aligned buffer. However,
most platforms then perform byte-by-byte reads (via the PULL64 macros).
Instead, remove SHA512_BLOCK_CAN_MANAGE_UNALIGNED_DATA and move alignment
handling to sha512_block_data_order() - if the data is aligned then simply
perform 64-bit loads and then do endian conversion via be64toh(). If the
data is unaligned then use memcpy() and be64toh() (in the form of
crypto_load_be64toh()). Overall this reduces complexity and can improve
performance (on aarch64 we get a ~10% performance gain with aligned input
and a ~1-2% gain on armv7), while the same movq/bswapq is generated
for amd64 and movl/bswapl for i386.
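
For illustration only, the two load paths described above can be sketched
roughly as follows. This is not the committed code: the sketch_* names are
hypothetical, the body of crypto_load_be64toh() is not reproduced here, and
<endian.h> providing be64toh() is an assumption that holds on OpenBSD and
glibc but not everywhere.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* assumption: glibc may need this for be64toh() */
#endif
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for an unaligned big-endian load: copy the bytes
 * into a local uint64_t with memcpy(), then convert to host order.
 */
static inline uint64_t
sketch_load_be64(const uint8_t *src)
{
	uint64_t v;

	memcpy(&v, src, sizeof(v));	/* valid for any alignment of src */
	return be64toh(v);
}

/*
 * Hypothetical helper: load the 16 message words of one 128-byte block,
 * choosing the load path based on the alignment of the input pointer.
 */
static void
sketch_load_block(const uint8_t *in, uint64_t W[16])
{
	int i;

	assert(in != NULL);

	if ((uintptr_t)in % sizeof(uint64_t) == 0) {
		/*
		 * Aligned: plain 64-bit loads followed by be64toh().
		 * Assumes the caller's buffer may be read as uint64_t.
		 */
		const uint64_t *in64 = (const uint64_t *)(const void *)in;

		for (i = 0; i < 16; i++)
			W[i] = be64toh(in64[i]);
	} else {
		/* Unaligned: memcpy() plus be64toh() for each word. */
		for (i = 0; i < 16; i++)
			W[i] = sketch_load_be64(in + i * 8);
	}
}
```

Both paths produce the same words for the same input bytes; only the way
the bytes are fetched differs, which is why the compiler is free to emit a
single load plus byte swap where the hardware allows it.
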
ok tb@