Clean up BN_mod_mul() and simplify BN_mod_sqr().
Use the same naming/code pattern in BN_mod_mul() as is used in BN_mul().
Note that the 'rr' allocation is unnecessary, since both BN_mul() and
BN_sqr() handle the case where r == a || r == b. However, it avoids a
potential copy on the exit from BN_mul()/BN_sqr(), so leave it in place
for now.
Turn BN_mod_sqr() into a wrapper that calls BN_mod_mul(), since the latter
already calls BN_sqr() in the a == b case. The supposed gain of calling
BN_mod_ct() instead of BN_nnmod() does not really exist.
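
For illustration only, the wrapper pattern can be sketched with plain
64-bit integers standing in for BIGNUMs. This is not the libcrypto code:
toy_mod_mul() and toy_mod_sqr() are hypothetical names, operands are
limited to 32 bits so the product cannot overflow, and the local 'rr'
mirrors the temporary result discussed above so that r may alias a or b.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for BN_mod_mul(): *r = (*a * *b) mod *m.
 * Returns 1 on success and 0 on failure, like the BN API does.
 * Operands are limited to 32 bits so the 64-bit product cannot overflow.
 */
static int
toy_mod_mul(uint64_t *r, const uint64_t *a, const uint64_t *b,
    const uint64_t *m)
{
	uint64_t rr;

	if (*m == 0)
		return 0;
	if (*a > UINT32_MAX || *b > UINT32_MAX)
		return 0;

	/* Compute into a temporary so that r may alias a or b. */
	rr = (*a * *b) % *m;
	*r = rr;

	return 1;
}

/*
 * Hypothetical stand-in for BN_mod_sqr(): a thin wrapper that passes
 * the same operand twice, so the a == b case is handled in one place.
 */
static int
toy_mod_sqr(uint64_t *r, const uint64_t *a, const uint64_t *m)
{
	return toy_mod_mul(r, a, a, m);
}
```

With a = 12 and m = 7, toy_mod_sqr(&a, &a, &m) returns 1 and leaves
a == 4, which shows the aliased r == a case going through the wrapper.
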
ok tb@