Wednesday, April 29, 2009

Page Reclaim

PFRA (page frame reclaiming algorithm) works on
1. user mode process pages
2. kernel caches

there are two aspects of swapping: the mechanism (writeback) and the policy (page reclaim).

each zone has 2 lru lists: active and inactive. if a page is on an lru list, the PG_lru flag is set. if it is on the active list, PG_active is also set; for a page on the inactive list, PG_active is cleared (there is no separate inactive flag). pages are linked into a list through the lru member of struct page. the lru lists of a zone are protected by that zone's lru_lock spinlock.
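a minimal user-space sketch of the two lists and the flag rules above. all names (sketch_zone, sketch_lru_page, sketch_lru_add, ...) are made up for illustration and are not the kernel's own; the lru_lock is left out, so this only models the single-threaded bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

/* tiny circular doubly linked list, in the spirit of the kernel's list_head */
struct sketch_node { struct sketch_node *prev, *next; };

/* hypothetical stand-in for struct page: list linkage plus the two flags */
struct sketch_lru_page {
    struct sketch_node lru;  /* links the page into one of the zone's lists */
    bool pg_lru;             /* models PG_lru: page is on some lru list */
    bool pg_active;          /* models PG_active: page is on the active list */
};

/* hypothetical stand-in for a zone: two list heads and their lengths */
struct sketch_zone {
    struct sketch_node active, inactive;
    unsigned long nr_active, nr_inactive;
};

static void sketch_list_init(struct sketch_node *h) { h->prev = h->next = h; }

static void sketch_list_add(struct sketch_node *n, struct sketch_node *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void sketch_list_del(struct sketch_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void sketch_zone_init(struct sketch_zone *z)
{
    sketch_list_init(&z->active);
    sketch_list_init(&z->inactive);
    z->nr_active = z->nr_inactive = 0;
}

/* a new page starts on the inactive list: PG_lru set, PG_active clear */
static void sketch_lru_add(struct sketch_zone *z, struct sketch_lru_page *p)
{
    assert(!p->pg_lru);
    sketch_list_add(&p->lru, &z->inactive);
    p->pg_lru = true;
    p->pg_active = false;
    z->nr_inactive++;
}

/* promotion: unlink from inactive, link into active, set PG_active */
static void sketch_activate(struct sketch_zone *z, struct sketch_lru_page *p)
{
    assert(p->pg_lru && !p->pg_active);
    sketch_list_del(&p->lru);
    sketch_list_add(&p->lru, &z->active);
    p->pg_active = true;
    z->nr_inactive--;
    z->nr_active++;
}
```

in the real kernel every one of these list operations would happen with the zone's lru_lock held.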

the lru lists contain only pages from: 1. user mode address spaces, 2. the page cache. so, free pages of the zone are never on an lru list.

these pages are referenced by: 1. user processes, 2. fs layer code, 3. device drivers

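mark_page_accessed() makes a slow, two-step move of pages from the inactive list towards the active list (using the combination of the PG_active and PG_referenced flags): a page has to be accessed twice before it is promoted.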
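the flag combination can be sketched as a small state machine. this is a simplified model, not the kernel function: struct sketch_page and sketch_mark_accessed() are hypothetical names, and the real code also checks PG_lru and moves the page between lists.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical stand-in for struct page: only the two flags of interest */
struct sketch_page {
    bool active;      /* models PG_active */
    bool referenced;  /* models PG_referenced */
};

/* models the effect of one mark_page_accessed() call on the flags */
static void sketch_mark_accessed(struct sketch_page *p)
{
    assert(p);
    if (!p->referenced) {
        /* first access in this state: just remember it */
        p->referenced = true;
    } else if (!p->active) {
        /* second access to an inactive page: promote, start over */
        p->active = true;
        p->referenced = false;
    }
    /* active and referenced: nothing left to do */
}
```

so the sequence is: inactive,unreferenced -> inactive,referenced -> active,unreferenced -> active,referenced. a single stray access never promotes a page.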


there is no single lock over all the page tables of a memory object. the lock is taken at the pmd level only: the page that a pmd entry points to (pmd_page) has a spinlock associated with it.

for a swapped page, the private field of the page encodes the swap entry number. (NOTE: for a buffer page, the private field points to the buffer head of the first block of data.)
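a sketch of how a swap entry can be packed into one word such as the private field: the swap area index in the top bits, the slot offset in the rest. the 5-bit width and all sketch_* names are assumptions for illustration; the kernel's real layout is architecture dependent.

```c
#include <assert.h>

/* assumption: 5 bits select the swap area, the remaining bits the slot */
#define SKETCH_TYPE_BITS   5
#define SKETCH_TYPE_SHIFT  (sizeof(unsigned long) * 8 - SKETCH_TYPE_BITS)
#define SKETCH_OFFSET_MASK ((1UL << SKETCH_TYPE_SHIFT) - 1)

/* wrapped in a struct so it cannot be mixed up with a plain integer */
typedef struct { unsigned long val; } sketch_swp_entry;

/* pack (swap area index, slot offset) into one word */
static sketch_swp_entry sketch_swp_make(unsigned long type, unsigned long offset)
{
    sketch_swp_entry e;

    assert(type < (1UL << SKETCH_TYPE_BITS));
    e.val = (type << SKETCH_TYPE_SHIFT) | (offset & SKETCH_OFFSET_MASK);
    return e;
}

/* which swap area the entry belongs to */
static unsigned long sketch_swp_type(sketch_swp_entry e)
{
    return e.val >> SKETCH_TYPE_SHIFT;
}

/* which slot inside that swap area */
static unsigned long sketch_swp_offset(sketch_swp_entry e)
{
    return e.val & SKETCH_OFFSET_MASK;
}
```

for example, sketch_swp_make(3, 12345) yields an entry whose type reads back as 3 and whose offset reads back as 12345.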

Notes on swapping

modified pages backed by a block device are not swapped out, but they can be synchronized with the block device.

the kernel can discard pages that are backed by a file and cannot be modified, since they can be read back from the file later.

page reclaim = { selecting, swapping, synchronizing, discarding }

private mapping = a file is mapped into memory, but changes to that memory are not reflected to the backing block device.
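a user-space sketch of that definition, assuming a linux/posix system with a writable /tmp. mmap(), mkstemp() and pread() are the standard calls; the function name sketch_private_mapping_demo() is made up here.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* returns 1 if the file kept its original bytes after a write through a
 * MAP_PRIVATE mapping, 0 if the file changed, -1 on any setup error */
static int sketch_private_mapping_demo(void)
{
    char path[] = "/tmp/private_map_XXXXXX";
    const char original[] = "hello";
    char readback[sizeof original];
    char *map;
    int seen_in_memory;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);  /* the file stays alive until fd is closed */

    if (write(fd, original, sizeof original) != (ssize_t)sizeof original)
        goto out;

    map = mmap(NULL, sizeof original, PROT_READ | PROT_WRITE,
               MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    map[0] = 'J';  /* copy-on-write: only this process's copy changes */
    seen_in_memory = (map[0] == 'J');
    munmap(map, sizeof original);

    /* read the file itself again, bypassing the (now gone) mapping */
    if (pread(fd, readback, sizeof readback, 0) != (ssize_t)sizeof readback)
        goto out;

    result = seen_in_memory &&
             memcmp(readback, original, sizeof original) == 0;
out:
    close(fd);
    return result;
}
```

the modified copy of such a page has no place in the file to go back to, which is why it cannot simply be synchronized with the block device.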

candidate pages for the swap area
1. modified pages of MAP_PRIVATE mappings, 2. MAP_ANONYMOUS pages, 3. IPC pages
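the notes above boil down to a small decision: swap, synchronize, or discard. a sketch of that decision, with a hypothetical struct sketch_page_info carrying the three properties the notes mention (none of these names exist in the kernel):

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_action { SKETCH_DISCARD, SKETCH_SYNC, SKETCH_SWAP };

/* hypothetical summary of the properties that matter for the choice */
struct sketch_page_info {
    bool file_backed;      /* false for anonymous and IPC pages */
    bool private_mapping;  /* mapped with MAP_PRIVATE */
    bool dirty;            /* modified since it was read in */
};

/* what reclaim would have to do with the page before freeing the frame */
static enum sketch_action sketch_reclaim_action(const struct sketch_page_info *p)
{
    assert(p);
    if (!p->file_backed)
        return SKETCH_SWAP;     /* no backing file: only swap can hold it */
    if (!p->dirty)
        return SKETCH_DISCARD;  /* unchanged: can be re-read from the file */
    if (p->private_mapping)
        return SKETCH_SWAP;     /* changed, but must not reach the file */
    return SKETCH_SYNC;         /* changed, shared: write back to the file */
}
```

note that a clean page of a private mapping is still discarded, not swapped; only the modified ones need a swap slot.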

inside a swap area, there are slots and clusters. swap areas may have different priorities. clustering groups consecutive slots, which makes readahead from swap space (and writes to it) fast.

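a per-slot map is used to keep track of used and free slots in a swap area. (in the kernel this is swap_map, an array of per-slot usage counters rather than a plain bitmap.)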
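a sketch of slot tracking, using one small usage counter per slot (0 = free) instead of single bits. the fixed size of 8 slots and the sketch_* names are assumptions for illustration; slot 0 is skipped because the first slot of a swap area is reserved for identification.

```c
#include <assert.h>

#define SKETCH_NR_SLOTS 8  /* assumption: a tiny swap area */

/* hypothetical swap area: one usage counter per slot, 0 means free */
struct sketch_swap_area {
    unsigned short use_count[SKETCH_NR_SLOTS];
};

/* returns the index of a newly taken slot, or -1 if the area is full */
static int sketch_alloc_slot(struct sketch_swap_area *a)
{
    int i;

    /* slot 0 is never handed out: it holds the area's header */
    for (i = 1; i < SKETCH_NR_SLOTS; i++) {
        if (a->use_count[i] == 0) {
            a->use_count[i] = 1;
            return i;
        }
    }
    return -1;
}

/* drops one reference; the slot is free again when the count hits 0 */
static void sketch_free_slot(struct sketch_swap_area *a, int slot)
{
    assert(slot > 0 && slot < SKETCH_NR_SLOTS);
    assert(a->use_count[slot] > 0);
    a->use_count[slot]--;
}
```

a counter instead of a bit lets several users share one swapped-out page; the slot is only reusable after the last one lets go.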

to relieve memory pressure, there are kswapd (normal memory pressure) and direct reclaim (extreme memory pressure).

when swapping is not enough, there is the OOM killer. so, there are several levels of mechanisms to keep the system running.
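the levels can be sketched as an escalation ladder. this is a deliberately simplified model: the two thresholds (low_mark, min_mark), the direct_reclaim_failed input and all names are assumptions made up here, and the real kernel logic has many more conditions.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_level {
    SKETCH_NONE,    /* enough free memory, nothing to do */
    SKETCH_KSWAPD,  /* background reclaim */
    SKETCH_DIRECT,  /* the allocating process reclaims by itself */
    SKETCH_OOM      /* last resort: kill a process */
};

/* picks the mechanism for the current state of free memory.
 * assumption: low_mark and min_mark are page counts, min_mark <= low_mark */
static enum sketch_level sketch_pick_level(unsigned long free_pages,
                                           unsigned long low_mark,
                                           unsigned long min_mark,
                                           bool direct_reclaim_failed)
{
    assert(min_mark <= low_mark);
    if (free_pages > low_mark)
        return SKETCH_NONE;
    if (free_pages > min_mark)
        return SKETCH_KSWAPD;   /* normal pressure */
    if (!direct_reclaim_failed)
        return SKETCH_DIRECT;   /* extreme pressure */
    return SKETCH_OOM;          /* reclaim could not free anything */
}
```

each level only kicks in when the cheaper one before it was not enough.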

there are 2 lists for each zone: active and inactive. pages are transferred between these 2 lists based on their h/w accessed bit.

swap_info_struct keeps track of each swap area, similar to the way the kernel keeps track of a block device.

the first slot of a swap area is used for identification purposes and for keeping state information about the swap partition. (i have to see how that is done)

the extent list works like a page table in VM: it maps the linear page slots of a swap area to the scattered blocks on disk. for a file-based swap area, the number of extent structures is larger than for a partition-based swap area, where the blocks are sequential.
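a sketch of the lookup an extent list makes possible. struct sketch_extent and sketch_slot_to_block() are hypothetical names; the sketch uses a plain array instead of a linked list and assumes one disk block per page slot.

```c
#include <assert.h>
#include <stddef.h>

/* one contiguous run: nr_pages slots starting at start_page live in
 * consecutive disk blocks starting at start_block */
struct sketch_extent {
    unsigned long start_page;
    unsigned long nr_pages;
    unsigned long start_block;
};

/* translates a slot number into a disk block.
 * returns 0 and stores the block on success, -1 if no extent covers it */
static int sketch_slot_to_block(const struct sketch_extent *ext, size_t n,
                                unsigned long slot, unsigned long *block)
{
    size_t i;

    assert(block);
    for (i = 0; i < n; i++) {
        if (slot >= ext[i].start_page &&
            slot - ext[i].start_page < ext[i].nr_pages) {
            *block = ext[i].start_block + (slot - ext[i].start_page);
            return 0;
        }
    }
    return -1;
}
```

a partition needs a single extent such as {0, 8, 0}, so slot 5 is block 5. a swap file whose blocks are scattered needs several, e.g. {0, 4, 100} and {4, 4, 500}, where slot 5 ends up at block 501.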