Spliced TCP sockets become faster when the output part is running
as its own task thread. This is inspired by userland copy where a
process also has to go through the scheduler. This gives the socket
buffer a chance to be filled up and tcp_output() is called less
often and with bigger chunks.
When two kernel tasks share all the workload, the current scheduler
implementation will hang userland processes on single cpu machines.
As a workaround put a yield() into the splicing thread after each
task execution. This reduces the number of calls of tcp_output()
even more.
OK tedu@ mpi@