On Fri, Apr 05, 2002 at 01:18:26AM -0800, Andrew Morton wrote:
> 
> Andrea,
> 
> Marcelo would prefer that the VM retain the oom killer.  The thinking
> is that if try_to_free_pages fails, then we're better off making a
> deliberate selection of the process to kill rather than the random(ish)
> selection which we make by failing the allocation.
> 
> One example is at
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=101405688319160&w=2
> 
> That failure was with vm-24, which I think had the less aggressive

Yes, vm-24 had a problem; it is fixed in the latest releases.

> i/dcache shrink code.  We do need to robustly handle the no-swap-left
> situation.
> 
> So I have resurrected the oom killer.  The patch is below.
> 
> During testing of this, a problem cropped up.  The machine has 64 megs
> of memory, no swap.  The workload consisted of running `make -j0
> bzImage' in parallel with `usemem 40'.  usemem will malloc a 40
> megabyte chunk, memset it and exit.
> 
> The kernel livelocked.  What appeared to be happening was that ZONE_DMA
> was short on free pages, but ZONE_NORMAL was not.  So this check:
> 
> 	if (!check_classzone_need_balance(classzone))
> 		break;
> 
> in try_to_free_pages() was seeing that ZONE_NORMAL had some headroom
> and was causing a return to __alloc_pages().
> 
> __alloc_pages has this logic:
> 
> 	min = 1UL << order;
> 	for (;;) {
> 		zone_t *z = *(zone++);
> 		if (!z)
> 			break;
> 
> 		min += z->pages_min;
> 		if (z->free_pages > min) {
> 			page = rmqueue(z, order);
> 			if (page)
> 				return page;
> 		}
> 	}
> 
> 
> On the first pass through this loop, `min' gets the value
> zone_dma.pages_min + 1.  On the second pass through the loop it gets
> the value zone_dma.pages_min + 1 + zone_normal.pages_min.  And this is
> greater than zone_normal.free_pages! So alloc_pages() gets stuck in an
> infinite loop.

This is a bug I fixed in the -rest patch; the same code is also broken
on NUMA. The deadlock cannot happen if you apply all my patches.
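
To make the arithmetic in the quoted loop concrete, here is a minimal
user-space sketch. It is not kernel code: struct zone_sketch and both
helper names are hypothetical, the zone order follows the quoted
description (ZONE_DMA first), and the per-zone check only loosely stands
in for the "ZONE_NORMAL had some headroom" observation. It shows a zone
that is above its own pages_min yet fails the cumulative `min' test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's zone_t. */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long pages_min;
};

/*
 * Walk a zone list the way the quoted loop does: the threshold "min"
 * accumulates pages_min from every zone visited so far.
 * Returns the index of the first zone with free_pages > min, or -1
 * if no zone passes.
 */
static int first_zone_above_min(const struct zone_sketch *zones, size_t nr,
				unsigned int order)
{
	unsigned long min = 1UL << order;
	size_t i;

	assert(zones != NULL);
	for (i = 0; i < nr; i++) {
		min += zones[i].pages_min;
		if (zones[i].free_pages > min)
			return (int)i;
	}
	return -1;
}

/* Per-zone view: is this zone above its own pages_min, ignoring the others? */
static int zone_has_own_headroom(const struct zone_sketch *z)
{
	return z->free_pages > z->pages_min;
}
```

With made-up numbers such as { free 10, min 128 } for DMA and
{ free 300, min 255 } for NORMAL, the second zone has headroom by its own
watermark (300 > 255), but the accumulated threshold for an order-0
request is 1 + 128 + 255 = 384, so the walk finds no usable zone.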

As for your patch, it reintroduces a deadlock by looping in GFP and
relying on the oom killer (which will also go and kill the bigger task
most of the time). The oom killer can select a task in D state, or it
can send a SIGTERM. Secondly, you broke google DB (the right fix for
that min thing is the point-of-view watermarks in the -rest patch in my
collection).

The worst thing is that with the oom killer we have to keep looping, so
if the task is for whatever reason hung in R state in the kernel, the
machine will deadlock. The current way, it will make progress either in
do_exit or in the -ENOMEM fail path (modulo getblk, which is not too bad
anyway). The current memory balancing has so far been good enough at
killing as a function of probability, so I didn't feel the need to risk
(at the very least theoretical) deadlocks there. This is why I left it
disabled.

Andrea