Avoid a four byte overread in gcm_ghash_4bit() on amd64.
The assembly code for gcm_ghash_4bit() reads one too many times from Xi,
resulting in a four byte overread. Prevent this by not loading the next
value in the final iteration of the loop. If another full iteration is
required the next Xi value will be loaded at the top of the outer_loop.
Many thanks to Douglas Gliner <Douglas.Gliner at sony dot com> for finding
and reporting this issue, along with a detailed reproducer.
Same diff from deraadt@
ok tb@