One major issue discussed in research papers is reducing the overhead
of IOVA allocation. As far as I can see, the current "best solution"
is to cache IOVA ranges in per-CPU magazines. I don't think we have this
issue at all thanks to bus_dmamap_create(9). The map is created ahead
of time, and we know the maximum size of the DMA transfer. Since with
smmu(4) each domain has its own IOVA space, allocating IOVA 'early' is
essentially free. Pagetable mapping also incurs a performance penalty,
though, since we allocate pagetable entry descriptors through pools.
Since we have the IOVA early, we can allocate those early as well. This
allocation is a bit more expensive, but it can be optimized further.
All this means that there is no allocation overhead in hot code paths.
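
To illustrate the split between create time and the hot path, here is
a minimal user-space sketch. It is not the smmu(4) code: every toy_*
name is hypothetical, the per-domain IOVA allocator is reduced to a
bump pointer, malloc/calloc stand in for the pools, and buffers are
assumed to be page-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE	4096u

/* Hypothetical per-domain state: a private IOVA space (bump pointer). */
struct toy_domain {
	uint64_t next_iova;	/* next unused IOVA in this domain */
};

/* Hypothetical stand-in for a pagetable entry descriptor. */
struct toy_pte_desc {
	uint64_t iova;
	uint64_t pa;
	int valid;
};

struct toy_map {
	uint64_t iova;			/* IOVA range reserved at create */
	size_t npages;			/* pages covered by the reservation */
	struct toy_pte_desc *ptes;	/* descriptors reserved at create */
};

/* Create time: reserve IOVA and descriptors for the maximum size. */
static struct toy_map *
toy_map_create(struct toy_domain *dom, size_t maxsize)
{
	size_t npages = (maxsize + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	struct toy_map *map;

	if (npages == 0 || (map = malloc(sizeof(*map))) == NULL)
		return NULL;
	map->ptes = calloc(npages, sizeof(*map->ptes));
	if (map->ptes == NULL) {
		free(map);
		return NULL;
	}
	map->npages = npages;
	map->iova = dom->next_iova;
	dom->next_iova += (uint64_t)npages * TOY_PAGE_SIZE;
	return map;
}

/* Hot path: fills in preallocated descriptors, never allocates. */
static int
toy_map_load(struct toy_map *map, uint64_t pa, size_t len)
{
	size_t i, npages = (len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;

	if (npages > map->npages)
		return -1;	/* larger than promised at create time */
	for (i = 0; i < npages; i++) {
		map->ptes[i].iova = map->iova + (uint64_t)i * TOY_PAGE_SIZE;
		map->ptes[i].pa = pa + (uint64_t)i * TOY_PAGE_SIZE;
		map->ptes[i].valid = 1;
	}
	return 0;
}

/* Hot path: invalidates the descriptors but keeps the reservation. */
static void
toy_map_unload(struct toy_map *map)
{
	size_t i;

	for (i = 0; i < map->npages; i++)
		map->ptes[i].valid = 0;
}

/* Destroy time: the only other place where memory is released. */
static void
toy_map_destroy(struct toy_map *map)
{
	free(map->ptes);
	free(map);
}
```

In this model toy_map_load() and toy_map_unload() touch only memory
that toy_map_create() set up, which is the property described above. A
real unload would additionally have to invalidate the IOTLB.
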
The "only" things remaining are assigning IOVA to the segments,
adjusting the pagetable mappings, and flushing the IOTLB on unload.
There may be a way to do a combined flush for NICs: since we hand a
list of mbufs to the network stack, we could do the IOTLB invalidation
only once, right before we pass the mbuf list to the upper layers.
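
A rough sketch of what such a combined flush could look like, again as
a user-space toy model rather than driver code. All toy_* names are
hypothetical, and the "flush" is only a counter so that the two
strategies can be compared. One assumption is worth spelling out: until
the invalidation has completed, the device may still reach the buffers
through stale IOTLB entries, so the flush has to happen before the
buffers are handed on or reused.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IOMMU state, reduced to counters for illustration. */
struct toy_iommu {
	unsigned nflush;	/* IOTLB invalidations issued so far */
	unsigned npending;	/* unloaded buffers not yet flushed */
};

/* Hypothetical RX buffer; stands in for an mbuf plus its DMA map. */
struct toy_rxbuf {
	int mapped;		/* nonzero while its mappings are live */
};

/* Stand-in for one IOTLB invalidation covering all pending unloads. */
static void
toy_iotlb_flush(struct toy_iommu *mmu)
{
	mmu->nflush++;
	mmu->npending = 0;
}

/* Tear down the mapping, but leave the invalidation to the caller. */
static void
toy_unload_deferred(struct toy_iommu *mmu, struct toy_rxbuf *buf)
{
	if (!buf->mapped)
		return;
	buf->mapped = 0;
	mmu->npending++;
}

/* Per-buffer behaviour: every unload pays for its own flush. */
static void
toy_unload(struct toy_iommu *mmu, struct toy_rxbuf *buf)
{
	toy_unload_deferred(mmu, buf);
	if (mmu->npending > 0)
		toy_iotlb_flush(mmu);
}

/*
 * RX completion: unload the whole batch, then flush once. Only after
 * this returns would the list be passed to the upper layers.
 */
static void
toy_rx_complete(struct toy_iommu *mmu, struct toy_rxbuf *bufs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		toy_unload_deferred(mmu, &bufs[i]);
	if (mmu->npending > 0)
		toy_iotlb_flush(mmu);
}
```

With four received buffers, calling toy_unload() on each one counts
four invalidations, while a single toy_rx_complete() over the same
four counts one. Whether that translates into a measurable win on real
hardware is something this model cannot show.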