From: dlg
Date: Mon, 14 Feb 2022 04:33:18 +0000 (+0000)
Subject: update sbchecklowmem() to better detect actual mbuf memory usage.
X-Git-Url: http://artulab.com/gitweb/?a=commitdiff_plain;h=c8502062f4e459f4aa1494ae11120bce0339a0b7;p=openbsd

update sbchecklowmem() to better detect actual mbuf memory usage.

previously sbchecklowmem() (and sonewconn()) would look at the mbuf
and mbuf cluster pools to see if they were approaching their hard
limits. based on how many mbufs/clusters were allocated against the
limits, socket operations would start to fail with ENOBUFS until
utilisation went down.

mbufs and clusters have changed a lot since then though. there are
now many mbuf cluster pools, not just one for 2k clusters. because
of this the mbuf layer now limits the amount of memory all the mbuf
pools can allocate backend pages from rather than limiting the
individual pools. this means sbchecklowmem() ends up looking at the
default pool hard limit, which is UINT_MAX, which in turn means
sbchecklowmem() probably never applies backpressure. this is made
worse on multiprocessor systems where per cpu caches of mbuf and
cluster pool items are enabled, because the number of in-use pool
items is distorted by the cpu caches.

this switches sbchecklowmem() to looking at the page allocations
made by all the pools instead. the big benefit of this is that the
page allocations are much more representative of the overall mbuf
memory usage in the system. the downside is that the backend page
allocation accounting does not see idle memory held by pools. pools
cannot release partially free pages to the page backend (obviously),
and pools cache idle items to avoid thrashing on the backend page
allocator. this means the page allocation level is higher than the
memory used by actual in-flight mbufs.

however, this can also be a benefit. the backend page allocation is
a kind of smoothed out "trend" line. mbuf utilisation over short
periods can be extremely bursty because of things like rx ring
dequeue and fill cycles, or large socket sends. if you're trying to
grow socket buffers while these things are happening, luck becomes
an important factor in whether it will work or not. because pools
cache idle items, the backend page utilisation better represents
the overall trend of activity in the system and will give more
consistent behaviour here.

this diff is deliberately simple. we're basically going from "no
limits" to "some sort of limit" for sockets again, so keeping the
code simple means it should be easy to understand and tweak in the
future.
ok djm@ visa@ claudio@
---

diff --git a/sys/kern/uipc_mbuf.c b/sys/kern/uipc_mbuf.c
index acac2c0dbc8..2f11e8c43d5 100644
--- a/sys/kern/uipc_mbuf.c
+++ b/sys/kern/uipc_mbuf.c
@@ -1,4 +1,4 @@
-/* $OpenBSD: uipc_mbuf.c,v 1.281 2022/02/08 11:28:19 dlg Exp $ */
+/* $OpenBSD: uipc_mbuf.c,v 1.282 2022/02/14 04:33:18 dlg Exp $ */
 /* $NetBSD: uipc_mbuf.c,v 1.15.4.1 1996/06/13 17:11:44 cgd Exp $ */
 
 /*
@@ -1502,6 +1502,12 @@ m_pool_init(struct pool *pp, u_int size, u_int align, const char *wmesg)
 	pool_set_constraints(pp, &kp_dma_contig);
 }
 
+u_int
+m_pool_used(void)
+{
+	return ((mbuf_mem_alloc * 100) / mbuf_mem_limit);
+}
+
 #ifdef DDB
 void
 m_print(void *v,
diff --git a/sys/kern/uipc_socket2.c b/sys/kern/uipc_socket2.c
index 42a61e60bd2..6da3c74ac6e 100644
--- a/sys/kern/uipc_socket2.c
+++ b/sys/kern/uipc_socket2.c
@@ -1,4 +1,4 @@
-/* $OpenBSD: uipc_socket2.c,v 1.116 2021/11/06 05:26:33 visa Exp $ */
+/* $OpenBSD: uipc_socket2.c,v 1.117 2022/02/14 04:33:18 dlg Exp $ */
 /* $NetBSD: uipc_socket2.c,v 1.11 1996/02/04 02:17:55 christos Exp $ */
 
 /*
@@ -155,7 +155,7 @@ sonewconn(struct socket *head, int connstatus)
 	 */
 	soassertlocked(head);
 
-	if (mclpools[0].pr_nout > mclpools[0].pr_hardlimit * 95 / 100)
+	if (m_pool_used() > 95)
 		return (NULL);
 	if (head->so_qlen + head->so_q0len > head->so_qlimit * 3)
 		return (NULL);
@@ -517,13 +517,13 @@ int
 sbchecklowmem(void)
 {
 	static int sblowmem;
+	unsigned int used = m_pool_used();
 
-	if (mclpools[0].pr_nout < mclpools[0].pr_hardlimit * 60 / 100 ||
-	    mbpool.pr_nout < mbpool.pr_hardlimit * 60 / 100)
+	if (used < 60)
 		sblowmem = 0;
-	if (mclpools[0].pr_nout > mclpools[0].pr_hardlimit * 80 / 100 ||
-	    mbpool.pr_nout > mbpool.pr_hardlimit * 80 / 100)
+	else if (used > 80)
 		sblowmem = 1;
+
 	return (sblowmem);
 }
 
diff --git a/sys/sys/mbuf.h b/sys/sys/mbuf.h
index 488b75b525a..72655624edb 100644
--- a/sys/sys/mbuf.h
+++ b/sys/sys/mbuf.h
@@ -1,4 +1,4 @@
-/* $OpenBSD: mbuf.h,v 1.253 2021/05/15 08:07:20 yasuoka Exp $ */
+/* $OpenBSD: mbuf.h,v 1.254 2022/02/14 04:33:18 dlg Exp $ */
 /* $NetBSD: mbuf.h,v 1.19 1996/02/09 18:25:14 christos Exp $ */
 
 /*
@@ -429,6 +429,7 @@ void	m_align(struct mbuf *, int);
 struct	mbuf *m_clget(struct mbuf *, int, u_int);
 void	m_extref(struct mbuf *, struct mbuf *);
 void	m_pool_init(struct pool *, u_int, u_int, const char *);
+u_int	m_pool_used(void);
 void	m_extfree_pool(caddr_t, u_int, void *);
 void	m_adj(struct mbuf *, int);
 int	m_copyback(struct mbuf *, int, int, const void *, int);
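
for reference, a minimal userland sketch of the hysteresis the new
sbchecklowmem() implements: the low-memory flag is only cleared once
pool page usage drops below 60% and only set once it climbs above 80%,
so usage bouncing around inside the 60-80% band does not flap the
backpressure decision. pool_used_pct() and the main() driver below are
hypothetical stand-ins for the kernel's m_pool_used() and its callers,
not kernel code.

#include <stdio.h>

/*
 * hypothetical stand-in for m_pool_used(): percentage of the mbuf
 * page limit currently allocated from the backend page allocator.
 */
static unsigned int
pool_used_pct(unsigned int alloc, unsigned int limit)
{
	return ((alloc * 100) / limit);
}

/*
 * same shape as the new sbchecklowmem(): a sticky flag with a 60/80
 * hysteresis band, so short bursts inside the band do not flip the
 * backpressure decision back and forth.
 */
static int
checklowmem(unsigned int used)
{
	static int lowmem;

	if (used < 60)
		lowmem = 0;
	else if (used > 80)
		lowmem = 1;

	return (lowmem);
}

int
main(void)
{
	/* walk usage up through the band and back down again */
	unsigned int samples[] = { 50, 70, 85, 75, 65, 55 };
	unsigned int limit = 100;
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		unsigned int used = pool_used_pct(samples[i], limit);
		printf("used %3u%% -> lowmem %d\n", used, checklowmem(used));
	}
	/* prints 0 0 1 1 1 0: the flag only clears again below 60% */
	return (0);
}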