24d4e7f642882a8a13da170b4ba86eec8fa91bf2
9019 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8bdd638091 |
mm: fix direct reclaim writeback regression
Shortly before 3.16-rc1, Dave Jones reported:
WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
xfs_vm_writepage+0x5ce/0x630 [xfs]()
CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
Call Trace:
xfs_vm_writepage+0x5ce/0x630 [xfs]
shrink_page_list+0x8f9/0xb90
shrink_inactive_list+0x253/0x510
shrink_lruvec+0x563/0x6c0
shrink_zone+0x3b/0x100
shrink_zones+0x1f1/0x3c0
try_to_free_pages+0x164/0x380
__alloc_pages_nodemask+0x822/0xc90
alloc_pages_vma+0xaf/0x1c0
handle_mm_fault+0xa31/0xc50
etc.
970 if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
971 PF_MEMALLOC))
I did not respond at the time, because a glance at the PageDirty block
in shrink_page_list() quickly shows that this is impossible: we don't do
writeback on file pages (other than tmpfs) from direct reclaim nowadays.
Dave was hallucinating, but it would have been disrespectful to say so.
However, my own /var/log/messages now shows similar complaints
WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
from stressing some mmotm trees during July.
Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
to mapping->a_ops->writepage()?
Yes, 3.16-rc1's commit
|
||
|
|
355cb09304 |
Merge tag 'urgent-slab-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull slab fix from Mike Snitzer:
"This fixes the broken duplicate slab name check in
kmem_cache_sanity_check() that has been repeatedly reported (as
recently as today against Fedora rawhide).
Pekka seemed to have it staged for a late 3.15-rc in his 'slab/urgent'
branch but never sent a pull request, see:
https://lkml.org/lkml/2014/5/23/648"
* tag 'urgent-slab-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
slab_common: fix the check for duplicate slab names
|
||
|
|
0253d634e0 |
mm: hugetlb: fix copy_hugetlb_page_range()
Commit |
||
|
|
792ceaefe6 |
mm/fs: fix pessimization in hole-punching pagecache
I wanted to revert my v3.1 commit
|
||
|
|
b1a366500b |
shmem: fix splicing from a hole while it's punched
shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: since a page faulted is
one that then appears page_mapped(), needing unmap_mapping_range() and
i_mmap_mutex to be unmapped again.
But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.
shmem_file_splice_read() is the main other user of SGP_CACHE, which can
instantiate shmem pagecache pages in the read-only case (without holding
i_mutex, so perhaps concurrently with a hole-punch). Probably it's
silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
ought to be safe, but might bring surprises - not a change to be rushed.
shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it could
be provoked to keep hole-punch busy or not.
We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use over
a range raises questions - should it actually be per page? can these get
starved themselves?
The origin of this part of the problem is my v3.1 commit
|
||
|
|
8e205f779d |
shmem: fix faulting into a hole, not taking i_mutex
Commit |
||
|
|
c118678bc7 |
mm: do not call do_fault_around for non-linear fault
Ingo Korb reported that "repeated mapping of the same file on tmpfs using remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when the process exits". He bisected the bug to |
||
|
|
a0f7a756c2 |
mm/rmap.c: fix pgoff calculation to handle hugepage correctly
I triggered VM_BUG_ON() in vma_address() when I tried to migrate an anonymous hugepage with mbind() in the kernel v3.16-rc3. This is because pgoff's calculation in rmap_walk_anon() fails to consider compound_order() only to have an incorrect value. This patch introduces page_to_pgoff(), which gets the page's offset in PAGE_CACHE_SIZE. Kirill pointed out that page cache tree should natively handle hugepages, and in order to make hugetlbfs fit it, page->index of hugetlbfs page should be in PAGE_CACHE_SIZE. This is beyond this patch, but page_to_pgoff() contains the point to be fixed in a single function. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
45ccaf4764 | Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux into for-3.16-rcX | ||
|
|
743162013d |
sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
a8ddc8215e |
cgroup: distinguish the default and legacy hierarchies when handling cftypes
Until now, cftype arrays carried files for both the default and legacy
hierarchies and the files which needed to be used on only one of them
were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
gets confusing very quickly and we may end up exposing interface files
to the default hierarchy without thinking it through.
This patch makes cgroup core provide separate sets of interfaces for
cftype handling so that the cftypes for the default and legacy
hierarchies are clearly distinguished. The previous two patches
renamed the existing ones so that they clearly indicate that they're
for the legacy hierarchies. This patch adds the interface for the
default hierarchy and apply them selectively depending on the
hierarchy type.
* cftypes added through cgroup_subsys->dfl_cftypes and
cgroup_add_dfl_cftypes() only show up on the default hierarchy.
* cftypes added through cgroup_subsys->legacy_cftypes and
cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.
* cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
same array for the cases where the interface files are identical on
both types of hierarchies.
* This makes all the existing subsystem interface files legacy-only by
default and all subsystems will have no interface file created when
enabled on the default hierarchy. Each subsystem should explicitly
review and compose the interface for the default hierarchy.
* A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
makes subsystems which haven't decided the interface files for the
default hierarchy to present the legacy files on the default
hierarchy so that its behavior on the default hierarchy can be
tested. As the awkward name suggests, this is for development only.
* memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
array isn't used on the default hierarchy. The flag is removed.
v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.
v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
as suggested by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
||
|
|
2cf669a58d |
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
Currently, cftypes added by cgroup_add_cftypes() are used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype addition functions and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch adds cgroup_add_legacy_cftypes() which currently is a simple wrapper around cgroup_add_cftypes() and replaces all cgroup_add_cftypes() usages with it. While at it, this patch drops a completely spurious return from __hugetlb_cgroup_file_init(). This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
||
|
|
5577964e64 |
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
||
|
|
40f6123737 |
Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo: "Mostly fixes for the fallouts from the recent cgroup core changes. The decoupled nature of cgroup dynamic hierarchy management (hierarchies are created dynamically on mount but may or may not be reused once unmounted depending on remaining usages) led to more ugliness being added to kernfs. Hopefully, this is the last of it" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: break kernfs active protection in cpuset_write_resmask() cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() kernfs: introduce kernfs_pin_sb() cgroup: fix mount failure in a corner case cpuset,mempolicy: fix sleeping function called from invalid context cgroup: fix broken css_has_online_children() |
||
|
|
aa6ec29bee |
cgroup: remove sane_behavior support on non-default hierarchies
sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> |
||
|
|
1ced953b17 |
blkcg, memcg: make blkcg depend on memcg on the default hierarchy
Currently, the blkio subsystem attributes all of writeback IOs to the
root. One of the issues is that there's no way to tell who originated
a writeback IO from block layer. Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages. The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.
cgroup now has a mechanism to express such dependency -
cgroup_subsys->depends_on. This patch declares that blkcg depends on
memcg so that memcg is enabled automatically on the default hierarchy
when available. Future changes will make blkcg map the memcg tag to
find out the cgroup to blame for writeback IOs.
As this means that a memcg may be made invisible, this patch also
implements css_reset() for memcg which resets its basic
configurations. This implementation will probably need to be expanded
to cover other states which are used in the default hierarchy.
v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
build failure. Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
|
||
|
|
66d2f4d28c |
shmem: fix init_page_accessed use to stop !PageLRU bug
Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
in isolate_lru_pages() at mm/vmscan.c:1281!
Commit
|
||
|
|
0bc1f8b068 |
hwpoison: fix the handling path of the victimized page frame that belong to non-LRU
Until now, the kernel has the same policy to handle victimized page
frames that belong to kernel-space(reserved/slab-subsystem) or
non-LRU(unknown page state). In other word, the result of handling
either of these victimized page frames is (IGNORED | FAILED), and the
return value of memory_failure() is -EBUSY.
This patch is to avoid that memory_failure() returns very soon due to
the "true" value of (!PageLRU(p)), and it also ensures that
action_result() can report more precise information("reserved kernel",
"kernel slab", and "unknown page state") instead of "non LRU",
especially for memory errors which are detected by memory-scrubbing.
Andi said:
: While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
:
: soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
: page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
: page flags: 0x3ffff800004081(locked|slab|head)
: ------------[ cut here ]------------
: kernel BUG at mm/rmap.c:1495!
:
: I think what happened is that a LRU page turned into a slab page in
: parallel with offlining. memory_failure initially tests for this case,
: but doesn't retest later after the page has been locked.
:
: ...
:
: I ran this patch in a loop over night with some stress plus
: the mcelog test suite running in a loop. I cannot guarantee it hit it,
: but it should have given it a good beating.
:
: The kernel survived with no messages, although the mcelog test suite
: got killed at some point because it couldn't fork anymore. Probably
: some unrelated problem.
:
: So the patch is ok for me for .16.
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
496a8e6865 |
msync: fix incorrect fstart calculation
Fix a regression caused by
|
||
|
|
8a5b20aeba |
slub: fix off by one in number of slab tests
min_partial means minimum number of slab cached in node partial list.
So, if nr_partial is less than it, we keep newly empty slab on node
partial list rather than freeing it. But if nr_partial is equal or
greater than it, it means that we have enough partial slabs so should
free newly empty slab. Current implementation missed the equal case so
if we set min_partial is 0, then, at least one slab could be cached.
This is critical problem to kmemcg destroying logic because it doesn't
works properly if some slabs is cached. This patch fixes this problem.
Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
immediately").
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
dc78327c0e |
mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER
With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE, the following is triggered at early boot: SMP: Total of 8 processors activated. devtmpfs: initialized Unable to handle kernel NULL pointer dereference at virtual address 00000008 pgd = fffffe0000050000 [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407 Internal error: Oops: 96000006 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44 task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000 PC is at __list_add+0x10/0xd4 LR is at free_one_page+0x270/0x638 ... Call trace: __list_add+0x10/0xd4 free_one_page+0x26c/0x638 __free_pages_ok.part.52+0x84/0xbc __free_pages+0x74/0xbc init_cma_reserved_pageblock+0xe8/0x104 cma_init_reserved_areas+0x190/0x1e4 do_one_initcall+0xc4/0x154 kernel_init_freeable+0x204/0x2a8 kernel_init+0xc/0xd4 This happens because init_cma_reserved_pageblock() calls __free_one_page() with pageblock_order as page order but it is bigger than MAX_ORDER. This in turn causes accesses past zone->free_list[]. Fix the problem by changing init_cma_reserved_pageblock() such that it splits pageblock into individual MAX_ORDER pages if pageblock is bigger than a MAX_ORDER page. In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all architectures expect for ia64, powerpc and tile at the moment, the âpageblock_order > MAX_ORDERâ condition will be optimised out since both sides of the operator are constants. In cases where pageblock size is variable, the performance degradation should not be significant anyway since init_cma_reserved_pageblock() is called only at boot time at most MAX_CMA_AREAS times which by default is eight. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Mark Salter <msalter@redhat.com> Tested-by: Christopher Covington <cov@codeaurora.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
391acf970d |
cpuset,mempolicy: fix sleeping function called from invalid context
When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66 [ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130 [ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150 [ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380 [ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20 [ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90 [ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@vger.kernel.org> |
||
|
|
d05f0cdcbe |
mm: fix crashes from mbind() merging vmas
In v2.6.34 commit |
||
|
|
0378730142 |
slab: fix oops when reading /proc/slab_allocators
Commit
|
||
|
|
f00cdc6df7 |
shmem: fix faulting into a hole while it's punched
Trinity finds that mmap access to a hole while it's punched from shmem can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE) from completing, until the reader chooses to stop; with the puncher's hold on i_mutex locking out all other writers until it can complete. It appears that the tmpfs fault path is too light in comparison with its hole-punching path, lacking an i_data_sem to obstruct it; but we don't want to slow down the common case. Extend shmem_fallocate()'s existing range notification mechanism, so shmem_fault() can refrain from faulting pages into the hole while it's punched, waiting instead on i_mutex (when safe to sleep; or repeatedly faulting when not). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f72e7dcdd2 |
mm: let mm_find_pmd fix buggy race with THP fault
Trinity has reported:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
lock_acquire (arch/x86/include/asm/current.h:14
kernel/locking/lockdep.c:3602)
_raw_spin_lock (include/linux/spinlock_api_smp.h:143
kernel/locking/spinlock.c:151)
remove_migration_pte (mm/migrate.c:137)
rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
remove_migration_ptes (mm/migrate.c:224)
migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
migrate_misplaced_page (mm/migrate.c:1733)
__handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
handle_mm_fault (mm/memory.c:3948)
__get_user_pages (mm/memory.c:1851)
__mlock_vma_pages_range (mm/mlock.c:255)
__mm_populate (mm/mlock.c:711)
SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
I believe this comes about because, whereas collapsing and splitting THP
functions take anon_vma lock in write mode (which excludes concurrent
rmap walks), faulting THP functions (write protection and misplaced
NUMA) do not - and mostly they do not need to.
But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
an instant (indeed, for a long instant, given the inter-CPU TLB flush in
there), leaves *pmd neither present not trans_huge.
Which can confuse a concurrent rmap walk, as when removing migration
ptes, seen in the dumped trace. Although that rmap walk has a 4k page
to insert, anon_vmas containing THPs are in no way segregated from
4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
that instant when a trans_huge pmd is temporarily absent.
I don't think we need strengthen the locking at the THP end: it's easily
handled with an ACCESS_ONCE() before testing both conditions.
And since mm_find_pmd() had only one caller who wanted a THP rather than
a pmd, let's slightly repurpose it to fail when it hits a THP or
non-present pmd, and open code split_huge_page_address() again.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Dave Jones <davej@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
5338a93722 |
mm: thp: fix DEBUG_PAGEALLOC oops in copy_page_rep()
Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops in copy_page_rep() called from copy_user_huge_page() called from do_huge_pmd_wp_page(). I believe this is a DEBUG_PAGEALLOC false positive, due to the source page being split, and a tail page freed, while copy is in progress; and not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will prevent a miscopy from being made visible. Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to the usual get_page() and put_page() on head page in the usual config; but get and put references to all of the tail pages when DEBUG_PAGEALLOC. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7cd2b0a34a |
mm, pcp: allow restoring percpu_pagelist_fraction default
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:
divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6
However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.
This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.
This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
4a705fef98 |
hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry
There's a race between fork() and hugepage migration, as a result we try to "dereference" a swap entry as a normal pte, causing kernel panic. The cause of the problem is that copy_hugetlb_page_range() can't handle "swap entry" family (migration entry and hwpoisoned entry) so let's fix it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: <stable@vger.kernel.org> [2.6.37+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
13ace4d0d9 |
tmpfs: ZERO_RANGE and COLLAPSE_RANGE not currently supported
I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE support being added to fallocate(); but didn't realize until now that I had been too stupid to future-proof shmem_fallocate() against new additions. -EOPNOTSUPP instead of going on to ordinary fallocation. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: <stable@vger.kernel.org> [3.15] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e020d5bd8a |
mm: nommu: per-thread vma cache fix
mm could be removed from current task struct, using previous vma->vm_mm
It will crash on blackfin after updated to Linux 3.15. The commit "mm:
per-thread vma caching" caused the crash. mm could be removed from
current task struct before
mmput()->
exit_mmap()->
delete_vma_from_mm()
the detailed fault information:
NULL pointer access
Kernel OOPS in progress
Deferred Exception context
CURRENT PROCESS:
COMM=modprobe PID=278 CPU=0
invalid mm
return address: [0x000531de]; contents of:
0x000531b0: c727 acea 0c42 181d 0000 0000 0000 a0a8
0x000531c0: b090 acaa 0c42 1806 0000 0000 0000 a0e8
0x000531d0: b0d0 e801 0000 05b3 0010 e522 0046 [a090]
0x000531e0: 6408 b090 0c00 17cc 3042 e3ff f37b 2fc8
CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
task: 0572b720 ti: 0569e000 task.ti: 0569e000
Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014
SEQUENCER STATUS: Not tainted
SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 2806
EXCAUSE : 0x27
physical IVG3 asserted : <0xffa00744> { _trap + 0x0 }
physical IVG15 asserted : <0xffa00d68> { _evt_system_call + 0x0 }
logical irq 6 mapped : <0xffa003bc> { _bfin_coretmr_interrupt + 0x0 }
logical irq 7 mapped : <0x00008828> { _bfin_fault_routine + 0x0 }
logical irq 11 mapped : <0x00007724> { _l2_ecc_err + 0x0 }
logical irq 13 mapped : <0x00008828> { _bfin_fault_routine + 0x0 }
logical irq 39 mapped : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
logical irq 40 mapped : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
RETE: <0x00000000> /* Maybe null pointer? */
RETN: <0x0569fe50> /* kernel dynamic memory (maybe user-space) */
RETX: <0x00000480> /* Maybe fixed code section */
RETS: <0x00053384> { _exit_mmap + 0x28 }
PC : <0x000531de> { _delete_vma_from_mm + 0x92 }
DCPLB_FAULT_ADDR: <0x00000008> /* Maybe null pointer? */
ICPLB_FAULT_ADDR: <0x000531de> { _delete_vma_from_mm + 0x92 }
PROCESSOR STATE:
R0 : 00000004 R1 : 0569e000 R2 : 00bf3db4 R3 : 00000000
R4 : 057f9800 R5 : 00000001 R6 : 0569ddd0 R7 : 0572b720
P0 : 0572b854 P1 : 00000004 P2 : 00000000 P3 : 0569dda0
P4 : 0572b720 P5 : 0566c368 FP : 0569fe5c SP : 0569fd74
LB0: 057f523f LT0: 057f523e LC0: 00000000
LB1: 0005317c LT1: 00053172 LC1: 00000002
B0 : 00000000 L0 : 00000000 M0 : 0566f5bc I0 : 00000000
B1 : 00000000 L1 : 00000000 M1 : 00000000 I1 : ffffffff
B2 : 00000001 L2 : 00000000 M2 : 00000000 I2 : 00000000
B3 : 00000000 L3 : 00000000 M3 : 00000000 I3 : 057f8000
A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000
USP : 056ffcf8 ASTAT: 02003024
Hardware Trace:
0 Target : <0x00003fb8> { _trap_c + 0x0 }
Source : <0xffa006d8> { _exception_to_level5 + 0xa0 } JUMP.L
1 Target : <0xffa00638> { _exception_to_level5 + 0x0 }
Source : <0xffa004f2> { _bfin_return_from_exception + 0x6 } RTX
2 Target : <0xffa004ec> { _bfin_return_from_exception + 0x0 }
Source : <0xffa00590> { _ex_trap_c + 0x70 } JUMP.S
3 Target : <0xffa00520> { _ex_trap_c + 0x0 }
Source : <0xffa0076e> { _trap + 0x2a } JUMP (P4)
4 Target : <0xffa00744> { _trap + 0x0 }
FAULT : <0x000531de> { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
Source : <0x000531da> { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
5 Target : <0x000531da> { _delete_vma_from_mm + 0x8e }
Source : <0x00053176> { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
6 Target : <0x0005314c> { _delete_vma_from_mm + 0x0 }
Source : <0x00053380> { _exit_mmap + 0x24 } JUMP.L
7 Target : <0x00053378> { _exit_mmap + 0x1c }
Source : <0x00053394> { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
8 Target : <0x00053390> { _exit_mmap + 0x34 }
Source : <0xffa020e0> { __cond_resched + 0x20 } RTS
9 Target : <0xffa020c0> { __cond_resched + 0x0 }
Source : <0x0005338c> { _exit_mmap + 0x30 } JUMP.L
10 Target : <0x0005338c> { _exit_mmap + 0x30 }
Source : <0x0005333a> { _delete_vma + 0xb2 } RTS
11 Target : <0x00053334> { _delete_vma + 0xac }
Source : <0x0005507a> { _kmem_cache_free + 0xba } RTS
12 Target : <0x00055068> { _kmem_cache_free + 0xa8 }
Source : <0x0005505e> { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
13 Target : <0x00055052> { _kmem_cache_free + 0x92 }
Source : <0x0005501a> { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
14 Target : <0x00054ff4> { _kmem_cache_free + 0x34 }
Source : <0x00054fce> { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
15 Target : <0x00054fc0> { _kmem_cache_free + 0x0 }
Source : <0x00053330> { _delete_vma + 0xa8 } JUMP.L
Kernel Stack
Stack info:
SP: [0x0569ff24] <0x0569ff24> /* kernel dynamic memory (maybe user-space) */
Memory from 0x0569ff20 to 056a0000
0569ff20: 00000001 [04e8da5a] 00008000 00000000 00000000 056a0000 04e8da5a 04e8da5a
0569ff40: 04eb9eea ffa00dce 02003025 04ea09c5 057f523f 04ea09c4 057f523e 00000000
0569ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000
0569ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0569ffa0: 0566f5bc 057f8000 057f8000 00000001 04ec0170 056ffcf8 056ffd04 057f9800
0569ffc0: 04d1d498 057f9800 057f8fe4 057f8ef0 00000001 057f928c 00000001 00000001
0569ffe0: 057f9800 00000000 00000008 00000007 00000001 00000001 00000001 <00002806>
Return addresses in stack:
address : <0x00002806> { _show_cpuinfo + 0x2d2 }
Modules linked in:
Kernel panic - not syncing: Kernel exception
[ end Kernel panic - not syncing: Kernel exception
Signed-off-by: Steven Miao <realmz6@gmail.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
fb009e3a99 |
percpu: Use ALIGN macro instead of hand coding alignment calculation
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
05064084e8 |
fix __swap_writepage() compile failure on old gcc versions
Tetsuo Handa wrote:
"Commit
|
||
|
|
16b9057804 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
|
||
|
|
9c1d5284c7 |
Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus
Backmerge of dcache.c changes from mainline. It's that, or complete rebase... Conflicts: fs/splice.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
f6cb85d00e |
shmem: switch to iter_file_splice_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
14208b0ec5 |
Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
|
||
|
|
b738d76465 |
Don't trigger congestion wait on dirty-but-not-writeout pages
shrink_inactive_list() used to wait 0.1s to avoid congestion when all
the pages that were isolated from the inactive list were dirty but not
under active writeback. That makes no real sense, and apparently causes
major interactivity issues under some loads since 3.11.
The ostensible reason for it was to wait for kswapd to start writing
pages, but that seems questionable as well, since the congestion wait
code seems to trigger for kswapd itself as well. Also, the logic behind
delaying anything when we haven't actually started writeback is not
clear - it only delays actually starting that writeback.
We'll still trigger the congestion waiting if
(a) the process is kswapd, and we hit pages flagged for immediate
reclaim
(b) the process is not kswapd, and the zone backing dev writeback is
actually congested.
This probably needs to be revisited, but as it is this fixes a reported
regression.
Reported-by: Felipe Contreras <felipe.contreras@gmail.com>
Pinpointed-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
f8409abdc5 |
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Clean ups and miscellaneous bug fixes, in particular for the new
collapse_range and zero_range fallocate functions. In addition,
improve the scalability of adding and remove inodes from the orphan
list"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
ext4: handle symlink properly with inline_data
ext4: fix wrong assert in ext4_mb_normalize_request()
ext4: fix zeroing of page during writeback
ext4: remove unused local variable "stored" from ext4_readdir(...)
ext4: fix ZERO_RANGE test failure in data journalling
ext4: reduce contention on s_orphan_lock
ext4: use sbi in ext4_orphan_{add|del}()
ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
ext4: remove unnecessary double parentheses
ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
ext4: make local functions static
ext4: fix block bitmap validation when bigalloc, ^flex_bg
ext4: fix block bitmap initialization under sparse_super2
ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
ext4: avoid unneeded lookup when xattr name is invalid
ext4: fix data integrity sync in ordered mode
ext4: remove obsoleted check
ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
ext4: fix locking for O_APPEND writes
...
|
||
|
|
3f17ea6dea |
Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ... |
||
|
|
b1de0d139c |
mm: convert some level-less printks to pr_*
printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f59428ab73 |
mm/kmemleak-test.c: use pr_fmt for logging
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
33041a0d76 |
mm: mark remap_file_pages() syscall as deprecated
The remap_file_pages() system call is used to create a nonlinear mapping, that is, a mapping in which the pages of the file are mapped into a nonsequential order in memory. The advantage of using remap_file_pages() over using repeated calls to mmap(2) is that the former approach does not require the kernel to create additional VMA (Virtual Memory Area) data structures. Supporting of nonlinear mapping requires significant amount of non-trivial code in kernel virtual memory subsystem including hot paths. Also to get nonlinear mapping work kernel need a way to distinguish normal page table entries from entries with file offset (pte_file). Kernel reserves flag in PTE for this purpose. PTE flags are scarce resource especially on some CPU architectures. It would be nice to free up the flag for other usage. Fortunately, there are not many users of remap_file_pages() in the wild. It's only known that one enterprise RDBMS implementation uses the syscall on 32-bit systems to map files bigger than can linearly fit into 32-bit virtual address space. This use-case is not critical anymore since 64-bit systems are widely available. The plan is to deprecate the syscall and replace it with an emulation. The emulation will create new VMAs instead of nonlinear mappings. It's going to work slower for rare users of remap_file_pages() but ABI is preserved. One side effect of emulation (apart from performance) is that user can hit vm.max_map_count limit more easily due to additional VMAs. See comment for DEFAULT_MAX_MAP_COUNT for more details on the limit. [akpm@linux-foundation.org: fix spello] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
cf2c81279e |
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e231875ba7 |
mm: memcontrol: clean up memcg zoneinfo lookup
Memcg zoneinfo lookup sites have either the page, the zone, or the node id and zone index, but sites that only have the zone have to look up the node id and zone index themselves, whereas sites that already have those two integers use a function for a simple pointer chase. Provide mem_cgroup_zone_zoneinfo() that takes a zone pointer and let sites that already have node id and zone index - all for each node, for each zone iterators - use &memcg->nodeinfo[nid]->zoneinfo[zid]. Rename page_cgroup_zoneinfo() to mem_cgroup_page_zoneinfo() to match. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
aedf95ea05 |
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
Kmemleak could ignore memory blocks allocated via memblock_alloc() leading to false positives during scanning. This patch adds the corresponding callbacks and removes kmemleak_free_* calls in mm/nobootmem.c to avoid duplication. The kmemleak_alloc() in mm/nobootmem.c is kept since __alloc_memory_core_early() does not use memblock_alloc() directly. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1741196281 |
mm/mempool.c: update the kmemleak stack trace for mempool allocations
When mempool_alloc() returns an existing pool object, kmemleak_alloc() is no longer called and the stack trace corresponds to the original object allocation. This patch updates the kmemleak allocation stack trace for such objects to make it more useful for debugging. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ffe2c748e2 |
mm: introduce kmemleak_update_trace()
The memory allocation stack trace is not always useful for debugging a memory leak (e.g. radix_tree_preload). This function, when called, updates the stack trace for an already allocated object. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
aae0ad7ae5 |
mm/kmemleak.c: use %u to print ->checksum
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
688eb988d1 |
vmscan: memcg: always use swappiness of the reclaimed memcg
Memory reclaim always uses swappiness of the reclaim target memcg (origin of the memory pressure) or vm_swappiness for global memory reclaim. This behavior was consistent (except for difference between global and hard limit reclaim) because swappiness was enforced to be consistent within each memcg hierarchy. After "mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control" each memcg can have its own swappiness independent of hierarchical parents, though, so the consistency guarantee is gone. This can lead to an unexpected behavior. Say that a group is explicitly configured to not swapout by memory.swappiness=0 but its memory gets swapped out anyway when the memory pressure comes from its parent with a It is also unexpected that the knob is meaningless without setting the hard limit which would trigger the reclaim and enforce the swappiness. There are setups where the hard limit is configured higher in the hierarchy by an administrator and children groups are under control of somebody else who is interested in the swapout behavior but not necessarily about the memory limit. From a semantic point of view swappiness is an attribute defining anon vs. file proportional scanning of LRU which is memcg specific (unlike charges which are propagated up the hierarchy) so it should be applied to the particular memcg's LRU regardless where the memory pressure comes from. This patch removes vmscan_swappiness() and stores the swappiness into the scan_control structure. mem_cgroup_swappiness is then used to provide the correct value before shrink_lruvec is called. The global vm_swappiness is used for the root memcg. [hughd@google.com: oopses immediately when booted with cgroup_disable=memory] Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |