Linux Memory Management Notes

Tuesday, May 26, 2009

Memory Area Management

a memory area is a sequence of contiguous physical addresses. the idea comes from the fact that the kernel keeps allocating the same kinds of objects, so grouping them saves a lot of memory.

qus: what happens when a cache grows or shrinks? it is not guaranteed to get pages contiguous with the ones it already has. it simply gets another group of contiguous pages.

there are 13 lists of memory areas, with sizes from 32 to 131072 bytes (each size is double the previous one). each list has 2 caches: dma and normal.
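a minimal sketch of the size ladder described above. the names area_size and area_index_for are made up for this note and are not kernel functions; only the numbers (13 sizes, 32 to 131072 bytes) come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_AREA_SIZES 13   /* number of lists, from the note above */
#define MIN_AREA_SIZE  32u  /* smallest memory area size in bytes */

/* Size of the i-th memory area list: 32, 64, ..., 131072 bytes. */
static size_t area_size(unsigned int i)
{
    assert(i < NUM_AREA_SIZES);
    return (size_t)MIN_AREA_SIZE << i;
}

/* Index of the smallest area that can hold `bytes`, or -1 if none can. */
static int area_index_for(size_t bytes)
{
    for (unsigned int i = 0; i < NUM_AREA_SIZES; i++)
        if (bytes <= area_size(i))
            return (int)i;
    return -1;
}
```

a request for 100 bytes lands in the 128-byte list, so some space is always wasted inside the area. that is the price of keeping only 13 sizes.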

user processes do not have access to or use the slab allocator. it is solely used by the kernel.

2 types of caches: general (used by the slab allocator for its own purposes) and specific (used by the rest of the kernel)

kmem_cache keeps track of the other caches.

pages that belong to a slab have the PG_slab flag set on them. slab pages cannot come from ZONE_HIGHMEM because page_address() would return NULL in that case :D.

every physical or abstract entity has a descriptor in the kernel. a page frame within a slab has its PG_slab flag set, lru.next=cache desc, lru.prev=slab desc. lru is normally used by the buddy allocator to chain free blocks of consecutive pages. as the page is no longer in the buddy allocator, the lru field can be reused for something useful. slab pages are pseudo-free: they may hold no live objects, but the buddy allocator does not track them as free.

the page frames of a slab can be found from s_mem in the slab desc and gfporder in the cache desc.
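a small sketch of how the lru field can carry the two back-pointers. all structs here are trimmed, hypothetical stand-ins, the flag bit value is made up, and which of next/prev holds which descriptor is an assumption of this sketch (next for the cache, prev for the slab, as far as i recall from mm/slab.c of that era).

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Toy stand-ins for the kernel descriptors (hypothetical, heavily trimmed). */
struct cache_desc { const char *name; unsigned int gfporder; };
struct slab_desc  { void *s_mem; unsigned int inuse; };

#define PG_SLAB_BIT 0x1ul   /* made-up bit value for this sketch */

struct page_desc {
    unsigned long flags;
    struct list_head lru;   /* reusable once the page leaves the buddy lists */
};

/* Mark a page as owned by a slab and stash both back-pointers in lru. */
static void page_attach_to_slab(struct page_desc *pg,
                                struct cache_desc *cache,
                                struct slab_desc *slab)
{
    assert(pg && cache && slab);
    pg->flags |= PG_SLAB_BIT;
    pg->lru.next = (struct list_head *)cache;  /* assumption: next -> cache */
    pg->lru.prev = (struct list_head *)slab;   /* assumption: prev -> slab  */
}

/* Recover the owning cache descriptor from a slab page. */
static struct cache_desc *page_get_cache(const struct page_desc *pg)
{
    return (struct cache_desc *)pg->lru.next;
}

/* Recover the owning slab descriptor from a slab page. */
static struct slab_desc *page_get_slab(const struct page_desc *pg)
{
    return (struct slab_desc *)pg->lru.prev;
}
```

the point of the trick: given only the address of an object, the allocator can reach the page descriptor and from there the slab and the cache, without any extra lookup table.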

Saturday, May 23, 2009

Page Cache

during hibernation, RAM is written to swap space.

page cache is a set of radix trees, one per owner's address space object, that makes it quick to find a page belonging to that owner.

kinds of pages in page cache

regular file, directories
block device data
swapped out processes
files of special filesystems (eg, ipc shm)

page owner is the inode object. (the idea of owner of a page comes only when the page cache is in action). r/w depends on the type of owner. 3 types of owner:

regular file
block device file
swap area and swap cache

the inode contains the address space object, which is the page cache for that owner (pages and methods). a block device also has an inode for it. the idea of the owner is based on the uniqueness of the resource.

the radix tree of the pages in a page cache is searched in a way similar to page table lookup. the number of bits taken for indexing depends on the height of the tree.
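a sketch of the index slicing, in the same spirit as splitting a linear address into page table indices. the 6 bits per level (64 slots per node) is an assumption of this sketch, and slot_at_level and max_index are made-up names.

```c
#include <assert.h>

#define MAP_SHIFT 6u                    /* assumed: 64 slots per node */
#define MAP_SIZE  (1ul << MAP_SHIFT)
#define MAP_MASK  (MAP_SIZE - 1)
#define LONG_BITS (sizeof(unsigned long) * 8)

/* Slot to follow at `level` (0 = root) in a tree of the given height. */
static unsigned int slot_at_level(unsigned long index,
                                  unsigned int height,
                                  unsigned int level)
{
    assert(height > 0 && level < height);
    unsigned int shift = (height - 1 - level) * MAP_SHIFT;
    assert(shift < LONG_BITS);
    return (unsigned int)((index >> shift) & MAP_MASK);
}

/* Largest page index that a tree of this height can address. */
static unsigned long max_index(unsigned int height)
{
    unsigned int bits = height * MAP_SHIFT;
    if (bits >= LONG_BITS)
        return ~0ul;
    return (1ul << bits) - 1;
}
```

with these numbers a tree of height 1 covers page indices 0..63 and height 2 covers 0..4095. page index 131 in a height-2 tree goes through slot 2 of the root and then slot 3 of the leaf node.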

find_get_page() increases the page usage counter (page->count). find_lock_page() additionally sets the PG_locked flag.

the radix tree is a kind of database for the pages (by utilizing the tag field)

PG_writeback means the page is currently being written back to disk (tag[1]).

A buffer page is a page with buffer heads. every page in a page cache should be a buffer page ... isn't it?

all anon/file reads and writes are executed on RAM only. if a page gets written, the PG_dirty flag is set during page fault. if a dirty page is in an addr space, then it must be written back to disk.

the private field of a buffer page points to the buffer head of the first block.

Sunday, May 17, 2009

Memory Mapping

2 kinds of pages: anonymous pages (pure RAM) and file mapped pages (backed by a file on disk/blk device). 2 levels of pages: physical and logical structure.

anonymous pages have mapping=NULL, mapped pages have mapping=addr_space. mapped pages also form a radix tree under the addr spc object.

the layer above page layer is the memory region layer. that layer keeps some flag to specify the kind of pages underneath.

initially, the memory region is not linked with any page frames. during page faults, the region gets its pages (anon or mapped). during page reclaim the link is broken temporarily.

processes use file objects. the memory region links the file object to the (logical) pages. if the pages are anon, the region does not use 'vm_pgoff' and 'vm_file'. if the pages are mapped, then 'vm_file'=file object and 'vm_pgoff'=offset of the region's first page within the file, in pages.
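a sketch of how 'vm_pgoff' turns a faulting address into a page index within the file. struct region is a trimmed, hypothetical stand-in for the memory region descriptor, and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12u   /* assumed 4 KiB pages */

/* Trimmed, hypothetical stand-in for a memory region descriptor. */
struct region {
    unsigned long vm_start;  /* first address of the region */
    unsigned long vm_end;    /* first address past the region */
    unsigned long vm_pgoff;  /* offset of vm_start in the file, in pages */
    const void   *vm_file;   /* NULL for an anonymous region */
};

/* Page index within the backing file for an address inside the region. */
static unsigned long file_page_index(const struct region *r, unsigned long addr)
{
    assert(r != NULL && r->vm_file != NULL);
    assert(addr >= r->vm_start && addr < r->vm_end);
    return r->vm_pgoff + ((addr - r->vm_start) >> PAGE_SHIFT);
}
```

the result is the index used to look the page up in the file's page cache, which is why the same file page can be shared by regions that map it at different addresses.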

the address_space object links the physical pages and the virtual memory regions associated with a file.

page_tree member is the radix tree of physical pages (the page cache)
i_mmap is the radix priority search tree (PST) of the memory region objects. PST is used for reverse mapping.

mapping (address space) has the host field pointing to the inode. file object->f_mapping = address space. one object on disk is one address space object. different modes of files are tracked by the file objects.

if vm_area_struct.vm_file is NULL then this region does not map any file.

if a memory region does not map a file, its nopage method is NULL (anonymous mapping).

Tuesday, May 5, 2009

Handling Swap-Page Faults

read_swap_cache_async locks the page, and the block layer unlocks it when the read finishes.

read ahead pages are kept in swap cache.

swap area is on disk. swap cache is in RAM.

the first part of read_swap_cache_async reserves RAM pages for the swap-in and puts them neatly in the swap cache space. then in the 2nd part, swap_readpage reads the page from swap area (disk/ram) to the swap cache.

swap_readpage initiates the data transfer by { get_swap_bio: req gen, submit_bio: req send }

swap cache is just a radix tree.

the swap entry (the page's index in the swap area) is kept in the page's private field.

swapin_readahead is just an additional layer around read_swap_cache_async ... tune it by using /proc/sys/vm/page-cluster
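a sketch of what the page-cluster knob controls, assuming readahead pulls in the aligned group of 2^page-cluster slots around the faulting slot. swap_ra_window is a made-up name, and skipping slot 0 (which holds the swap area header) and clamping at the end of the area are assumptions of this sketch.

```c
#include <assert.h>

struct ra_window {
    unsigned long first;   /* first slot to read */
    unsigned long count;   /* number of slots to read */
};

/* Aligned group of 2^page_cluster slots that contains `offset`. */
static struct ra_window swap_ra_window(unsigned long offset,
                                       unsigned int page_cluster,
                                       unsigned long max_slots)
{
    assert(page_cluster < 16 && offset < max_slots);
    struct ra_window w;
    w.first = (offset >> page_cluster) << page_cluster;
    w.count = 1ul << page_cluster;
    if (w.first == 0) {            /* slot 0 is the header, never read it */
        w.first = 1;
        w.count--;
    }
    if (w.first + w.count > max_slots)   /* do not run past the area */
        w.count = max_slots - w.first;
    return w;
}
```

with page-cluster set to 3, a fault on slot 21 reads slots 16..23 in one go. a larger value means fewer, bigger reads, which only pays off if neighbouring slots really belong together.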

Wednesday, April 29, 2009

Page Reclaim

PFRA works on
1. user mode process pages
2. kernel caches

there are two aspects of swapping: the mechanism (writeback) and the policy (page reclaim).

each zone has 2 lru lists: active and inactive. if a page is on an lru list, its PG_lru flag is set. if it is on the active list, PG_active is also set; on the inactive list, PG_active is clear. pages are linked through their lru list member. each zone's lru lists are protected by that zone's lru_lock spinlock.

lru list contains pages from only: 1. user mode addr spc pages, 2. page cache. so, no free pages of that zone will be in lru list.

these pages are referenced by: 1. usr proc, 2. fs layer code, 3. device driver

mark_page_accessed() makes a slow move of pages between active and inactive list (using act, ref flags combination)
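a sketch of that slow move as a tiny state machine over the two flags. struct lru_state and touch_page are made-up names, and this models only the flag transitions, not the list manipulation or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two flags that drive the move between the lists. */
struct lru_state {
    bool active;       /* on the active list */
    bool referenced;   /* touched once since the last transition */
};

/* One access to the page: two touches are needed to get promoted. */
static void touch_page(struct lru_state *p)
{
    assert(p != NULL);
    if (!p->active && p->referenced) {
        p->active = true;        /* second touch: move to the active list */
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;    /* first touch: only remember it */
    }
}
```

so the sequence is (inactive, unref) -> (inactive, ref) -> (active, unref) -> (active, ref). a page that is touched only once never reaches the active list, which keeps one-off reads from pushing out the working set.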

there is no single lock over all the page tables of a memory object. the lock is taken at the pmd level only, so a pmd page has a spinlock associated with it.

for a page in the swap cache, the private field encodes the swap entry number. (NOTE: for a buffer page, the private field instead points to the buffer head of the first block of data.)
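a sketch of how a swap entry number could pack two things into one word: which swap area, and which slot inside it. the layout (area type in the top 5 bits, slot offset in the rest) is an assumption of this sketch, and the function names are made up.

```c
#include <assert.h>

#define TYPE_BITS   5u   /* assumed: room for up to 32 swap areas */
#define ENTRY_BITS  (sizeof(unsigned long) * 8)
#define TYPE_SHIFT  (ENTRY_BITS - TYPE_BITS)
#define OFFSET_MASK ((1ul << TYPE_SHIFT) - 1)

/* Pack (swap area number, slot offset) into one word. */
static unsigned long swap_entry_make(unsigned int type, unsigned long offset)
{
    assert(type < (1u << TYPE_BITS) && offset <= OFFSET_MASK);
    return ((unsigned long)type << TYPE_SHIFT) | offset;
}

/* Which swap area the entry refers to. */
static unsigned int swap_entry_type(unsigned long entry)
{
    return (unsigned int)(entry >> TYPE_SHIFT);
}

/* Which slot inside that swap area. */
static unsigned long swap_entry_offset(unsigned long entry)
{
    return entry & OFFSET_MASK;
}
```

because the whole thing fits in an unsigned long, it can sit in the private field of the page descriptor without any extra allocation.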

Notes on swapping

modified pages (backed by the block device) are not swapped out, but they can be synchronized with the block device.

kernel can discard pages (backed by a file) which are not modifiable.

page reclaim = { selecting, swapping, synchronizing, discarding }

private mapping = map a file in memory, but changes to that memory are not reflected to the backing block device.

page candidates for the swap area
1. MAP_PRIVATE, 2. MAP_ANONYMOUS, 3. IPC pages
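the swap / synchronize / discard choice from these notes can be sketched as one decision function. struct page_info and choose_action are made up for this note, and the rules are only my reading of the points above, not the actual PFRA logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reclaim_action { RECLAIM_DISCARD, RECLAIM_SYNC, RECLAIM_SWAP };

/* Hypothetical summary of what the reclaim code knows about a page. */
struct page_info {
    bool file_backed;      /* backed by a file or block device */
    bool private_mapping;  /* MAP_PRIVATE: changes must not reach the file */
    bool dirty;            /* modified since it was read in */
};

static enum reclaim_action choose_action(const struct page_info *p)
{
    assert(p != NULL);
    if (!p->file_backed)
        return RECLAIM_SWAP;     /* anonymous or IPC page: only swap can hold it */
    if (p->private_mapping && p->dirty)
        return RECLAIM_SWAP;     /* modified private copy cannot go back to the file */
    if (p->dirty)
        return RECLAIM_SYNC;     /* write it back to the backing device */
    return RECLAIM_DISCARD;      /* clean copy, can be re-read later */
}
```

selecting which pages to look at in the first place is the job of the lru lists. this function only covers what to do once a page has been picked.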

inside swap areas, there are slots and clusters. swap areas may have different priorities. clustering makes it fast to do readaheads to/from swap space.

bitmap is used to keep track of used and free slots in swap area.
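a toy version of slot tracking with a plain bitmap, as the note says. the names and the 64-slot area are made up, and the kernel's own bookkeeping may well carry more per-slot state than one bit (a use count, for example), so treat this as the idea only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NSLOTS        64u
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS        ((NSLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct swap_bitmap { unsigned long words[NWORDS]; };

/* 1 if the slot is taken, 0 if it is free. */
static int slot_in_use(const struct swap_bitmap *b, unsigned int slot)
{
    assert(b != NULL && slot < NSLOTS);
    return (int)((b->words[slot / BITS_PER_WORD] >> (slot % BITS_PER_WORD)) & 1ul);
}

/* First-fit allocation; slot 0 is skipped because it holds the header. */
static long slot_alloc(struct swap_bitmap *b)
{
    for (unsigned int s = 1; s < NSLOTS; s++) {
        if (!slot_in_use(b, s)) {
            b->words[s / BITS_PER_WORD] |= 1ul << (s % BITS_PER_WORD);
            return (long)s;
        }
    }
    return -1;   /* swap area is full */
}

/* Give a slot back to the area. */
static void slot_free(struct swap_bitmap *b, unsigned int slot)
{
    assert(b != NULL && slot > 0 && slot < NSLOTS);
    b->words[slot / BITS_PER_WORD] &= ~(1ul << (slot % BITS_PER_WORD));
}
```

first-fit tends to hand out neighbouring slots to pages swapped out together, which is what makes clustered readahead worthwhile.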

to reduce swapping pressure, there are kswapd (normal mem pressure) and direct reclaim (extreme memory pressure).

when swapping is not enough, there is OOM killer. so, we can see there are levels of mechanisms to keep the system running.

there are 2 lists of each zone: active and inactive. pages are transferred between these 2 lists based on their h/w accessed bit.

swap_info_struct keeps track of each swap area, similar to the way the kernel keeps track of a block device.

the first slot of a swap area is used for identification purpose and keeping state information about the swap partition. (i have to see how that is done)

the extent list works like a page table in VM: it maps the linear page slots of a swap area to the scattered blocks on disk. a file-based swap area needs more extent structures than a partition-based one, where the blocks are sequential.
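a sketch of the slot-to-block translation through an extent list. struct extent and extent_lookup are hypothetical names for this note, blocks are assumed to be page-sized, and a plain array with a linear scan stands in for whatever list structure the kernel really uses.

```c
#include <assert.h>
#include <stddef.h>

/* One run of swap slots that is contiguous on disk. */
struct extent {
    unsigned long start_page;   /* first swap slot covered by this run */
    unsigned long nr_pages;     /* number of slots in the run */
    unsigned long start_block;  /* disk block of the first slot */
};

/* Returns 0 and stores the disk block in *block, or -1 if no extent covers `page`. */
static int extent_lookup(const struct extent *ext, size_t n,
                         unsigned long page, unsigned long *block)
{
    assert(ext != NULL && block != NULL);
    for (size_t i = 0; i < n; i++) {
        if (page >= ext[i].start_page &&
            page - ext[i].start_page < ext[i].nr_pages) {
            *block = ext[i].start_block + (page - ext[i].start_page);
            return 0;
        }
    }
    return -1;
}
```

a swap partition would need a single extent covering every slot, while a fragmented swap file needs one extent per contiguous run, which is why the file-based case has more of them.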