Wednesday, December 14, 2011

Swapper Subsystem Files Structure

swap.c: it moves pages around in the lru lists
swapfile.c: it handles the actual swapping of pages to the backing store.
                  it also provides function calls to control swapping from usermode
swap_state.c: it handles the swap cache
vmscan.c: glues the swapper subsystem to the memory manager.
                 provides more high-level functions than the memory manager.
                 kswapd is here. 

Tuesday, December 13, 2011

Discussion on swap.c/put_page related functions

There are two phases of these functions:
1. delete the page from lru cache (__page_cache_release)
2. freeing the page to memory allocator

Consider the allocation process: 1. page is allocated, 2. page table entries are fixed 3. page is added to lru cache.

In the put page functions, page table entries are not handled. So the control path should fix/remove the appropriate page table entries before calling these functions.


Thursday, December 8, 2011

Get Tasks Page Tables

step 1: get the current task
step 2: get the runqueue
step 3: get the cfs rbtree root
step 4: traverse the tree (recursive)
for each node, get the scheduling entity
from the scheduling entity, get the task pointer
from the task structure, get the pgd
print pgd information :D

Notes on rbtree

definition
----------

#define RB_RED 0
#define RB_BLACK 1
struct rb_node
{
    unsigned long rb_parent_color;  /* parent pointer with the color in the low bits */
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

(there is no separate color field: the color lives in the low bits of rb_parent_color)

usage:
-------

Notes on Swapping (how to totally smash it)

modified pages (backed by a block device) are not swapped out, but they can be synchronized with the block device.

kernel can discard pages (backed by a file) which are not modifiable.

page reclaim = { selecting, swapping, synchronizing, discarding }

private mapping = map a file in memory, but changes to that memory are not reflected to the backing block device.

pages candidates for the swap area
1. MAP_PRIVATE, 2. MAP_ANONYMOUS, 3. IPC pages

inside swap areas, there are slots and clusters. swap areas may have different priorities. clustering makes it fast to do readaheads to/from swap space.

swap_map (an array of per-slot usage counters, not a plain bitmap) is used to keep track of used and free slots in a swap area.

to relieve memory pressure, there are kswapd (background reclaim under normal memory pressure) and direct reclaim (under extreme memory pressure).

when swapping is not enough, there is the OOM killer. so, we can see there are levels of mechanisms to keep the system running.

there are 2 lists in each zone: active and inactive. pages are transferred between these 2 lists based on their h/w accessed bit.

swap_info_struct keeps track of each swap area. similar to the way the kernel keeps track of a block device, it keeps track of the swap area.

the first slot of a swap area is used for identification purposes and for keeping state information about the swap partition. (i have to see how that is done)

the extent list works like a page table in VM: it maps the linear page slots of a swap area to the scattered blocks on disk. a file-based swap area needs more extent structures than a partition-based swap area, where the blocks are sequential.

Reverse Mapping

vm_area_struct: 3 members are needed: shared, anon_vma_node, anon_vma

2 alternatives: anonymous pages and file-mapped pages

General MM Notes

Linux kernel space has 4 kinds of memory mapping
1. direct mapping
2. vmalloc mapping
3. kmap mapping
4. fixmap mapping

direct mapping is used for low memory pages which are used to store permanent data structures, page tables, interrupt tables and so on.

vmalloc mapping is used for non-contiguous memory mapping. i can use it any way possible. if you have a space reserved in vmalloc space and a bunch of pages in your hand, you can map those pages to that vmalloc space by modifying the kernel page tables. (example usage: insmod uses it to store modules)

kmap mapping is similar to the vmalloc space, but only one pgd entry in the master kernel page table is dedicated for kmap virtual addresses.

fixmap mapping is similar to kmap mapping. (example usage: map a page table residing on a high mem page)

for all these 4 kinds of mappings, virtual and physical pages are given as soon as the mapping request is placed and served. a page fault here means something terribly wrong is going on inside.

user process pages are similar to kernel vmalloc space, except the virtual addresses are in user space and the physical pages are not allocated and mapped until a page fault is generated on those virtual address spaces.

Buddy System (MM)

During initialization in memmap_init_zone(), all pages are marked movable.

Each block of pages inside the buddy allocator has a head page. the head page has the PG_buddy flag set and page->private=order. the other pages do not have these bits set.

Compound Page (Huge TLB page):
A group of consecutive pages: a head page and tail pages. for each page, the PG_compound flag is set and page->private points to the head page. The first tail page stores the destructor 'free_compound_page' and the order of the block in its lru.next and lru.prev fields.

A free page can be in 2 places: 1. buddy list 2. pcp list
When a free page is in buddy list and is the head page of a block, then page->private=order
When a free page is in pcp list, then page->private=migratetype

The node and zone for a page is encoded in the upper portion of the page flag.

Hot-N-Cold PCP

[ref. PLKA p146, p183]

per_cpu_pageset structure. each zone has pageset[NR_CPUS].
NR_CPUS = 1 (uni-processor), 2-32 (smp32), 2-64 (smp64)

per_cpu_pageset -> per_cpu_pages[2] ... 0:hot, 1:cold

per_cpu_pages { count, high, batch (chunk size for BA), list }
... low watermark: 0, high watermark: high (PLKAfig3.6)

list of pages are kept track of by page->lru field.

I will take the page from the back of the list, test it and put it in the front of the list.

(Performance) I can have a free page reserve to fill up pcp quickly.

Free Pages Grab and Release

If you directly take out page from the free list you need to be careful about the following

** pcp is protected by disabling local irqs (local_irq_disable, local_irq_enable).
** zone is protected by a zone-specific spin lock with irqs disabled (spin_lock_irqsave(&z->lock, flags), spin_unlock_irqrestore(&z->lock, flags)).

pcp
----
1. any time you grab a page from the pcp, you must call prep_new_page() on it.
2. when you grab page block manually from bdyalloc, you need to either
... 2.a clear the page->count before releasing that page to freelist (set_page_count). or
... 2.b prep_new_page on that block just after you extract that page.

bdyalloc
---------
1. when you take out a page block from bdyalloc, you need to clear the PG_buddy flag and set private=0 on the first page of the block. (rmv_page_order)
2. when you put a block of pages back into bdyalloc, you need to set the order (set_page_order).

remove pages: (you must call prep_new_page)
__rmqueue_smallest: works on the freelists of a zone (not the pcp)

__rmqueue_fallback: similar but takes care of migratetype more flexibly

rmqueue_bulk: zone, order, count ...

a little more user-friendly: no need to call prep_new_page
buffered_rmqueue: works both on pcp and freelists (based on the order)

insert pages: (page count must be 0)
free_hot_cold_page: frees pages into the pcp list

Page Migration

the old page is locked, and the pte entries of the old page are replaced with migration entries, on which page faults will wait till the migration is finished

the new page is locked and set not-uptodate before the migration so that processes will wait on the lock till the migration ends.

unmap old page with migration entries
replace the radix tree

if a page is locked then the processes trying to map that page in their page tables will wait on the lock. also, the page won't be swapped out. 

there are two granularities of locking: 1. the vma lock, which is a virtual-level lock and does not correspond to any physical page, and 2. the page lock, which is the physical page lock.

page ref count means some process or kernel component is currently using or holding that physical page somehow. (page_count)

page map count means that the page is mapped into some process page table (page_mapcount; the stored count is offset by one, hence the +1)

map count <= ref count 

if the physical page is in the swap cache, then the private field will contain the swap entry for that page.
this entry can be used as the swap pte for installation in the page table.

the vma private field is sometimes used for containing non-linear vma information. 


page refcount set/increase/decrease

set
-----
page_alloc.c/free_pages_bootmem() ... sets the count of the first page of a page block to 1
page_alloc.c/prep_new_page() ... sets the count of the first page of a page block to 1
page_alloc.c/split_page() ... sets the count of the first page of each of the page blocks to 1

increase
--------------


decrease
---------------