Tuesday, July 28, 2009

Page Reclaim IV (Swap Cache)

pages in the swap cache have the following properties...

page->mapping = NULL ... why is that? afaik, it should point to the swap cache addr space object (it turns out page_mapping() special-cases PG_swapcache and returns &swapper_space, so the mapping field itself stays NULL)
the PG_swapcache flag is set
page->private = the swapped-out page identifier

a swap cache page is linked to a page slot in the swap area and to the process(es) that still reference it.

a single swapper_space address space object is used as the swap cache for the whole system. question: how is a page found inside the swap cache? ans: by using the swapped-out page identifier as the radix tree index.
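a minimal sketch of the lookup, assuming the 2.6-era swapper_space and swp_entry_t (the real function is lookup_swap_cache in mm/swap_state.c):

    #include <linux/swap.h>
    #include <linux/pagemap.h>

    /* the swapped-out page identifier doubles as the page's index
     * in the swap cache's radix tree */
    static struct page *swap_cache_lookup_sketch(swp_entry_t entry)
    {
            /* find_get_page() also takes a reference on the page */
            return find_get_page(&swapper_space, entry.val);
    }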

It is possible to know how many processes are sharing a swapped-out page. Until every one of them has faulted it back in, the page is kept temporarily in the swap cache.

the swap area clusters page slots (for efficiency during r/w, i.e. swap in/out)

Page Structure Finds

PG_swapcache: the page is in the swap cache

PG_locked: the page is undergoing I/O (swap in/out, file r/w, etc.)

Friday, July 24, 2009

Page Reclaim I (Basics) - Swapping Subsystem (swapfile.c, swap_state.c)

3 kinds of pages handled:
1. anon mem pages (usr stack/heap)
2. dirty pages of private memory mappings (their changes never reach the file on disk)
3. IPC shared pages

note that dirty pages of shared file mappings are not swapped out; they are written back to the file on disk and dropped from RAM, because this has the same effect as swapping them out.

a swap area can be a physical disk partition or a separate file inside a larger partition. its data is meaningful only while the system is running and is discarded across reboots, so a swap area carries very little control information (unlike a filesystem).

extents and priority -- interesting

the swap_info array contains a descriptor (swap_info_struct) for every active swap area (disk partition/bdev or file).

swapped-out page id (32 bits): bit 0 = 0, bits 1..7 = swap area number, bits 8..31 = page slot index

the difference between a blank pte and a swapped pte: the blank pte is all zeros, while a swapped pte carries a non-zero id (slot 0 is reserved, so a valid id is never all zeros).
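an illustrative encoding of that layout (the kernel's real encoding is arch-specific; these names are made up):

    /* bits 1..7: swap area number, bits 8..31: page slot index */
    #define AREA_SHIFT 1
    #define SLOT_SHIFT 8

    static unsigned long make_swp_id(unsigned long area, unsigned long slot)
    {
            /* bit 0 stays 0, so a swapped pte never looks "present";
             * slot 0 is reserved for the swap header, so a valid id
             * is never all zeros and cannot be mistaken for a blank pte */
            return (slot << SLOT_SHIFT) | (area << AREA_SHIFT);
    }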

Friday, June 26, 2009

Page Tables

page->ptl is a spinlock used when the page frame holds a page table full of ptes (the table a pmd entry points to). such a page-table page can live in high memory or in the normal zone.

there are two reserved kmap slots with fixed virtual addresses (KM_PTE0, KM_PTE1) to temporarily map a page-table page that sits in high memory.

pte_offset_map_lock gets you the (mapped) page table with its lock already taken, plus the lock itself.
to change any pte inside this table we must hold that lock; after finishing the work, release it with pte_unmap_unlock.
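a minimal sketch of the pattern (2.6-era API; mm, pmd and addr are assumed to be set up by the caller):

    #include <linux/mm.h>

    static void touch_pte_sketch(struct mm_struct *mm, pmd_t *pmd,
                                 unsigned long addr)
    {
            spinlock_t *ptl;
            pte_t *pte;

            pte = pte_offset_map_lock(mm, pmd, addr, &ptl); /* map + lock */
            if (pte_present(*pte)) {
                    /* safe to inspect or modify the pte while ptl is held */
            }
            pte_unmap_unlock(pte, ptl);                     /* unlock + unmap */
    }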

Sunday, June 21, 2009

Page Reclamation V (Periodic Reclaim) - kswapd and vmscan.c

kswapd is a kernel thread, one instance per memory node, each sleeping on the kswapd_wait queue of its node. it exists because we need atomic memory allocations to succeed, so the system should always keep a fairly decent amount of free pages. try_to_free_pages, by contrast, does its work only when memory is already scarce, and that is a time-consuming (non-atomic) job.

reclaiming is done on behalf of a process.

Tuesday, June 9, 2009

using pcp list of another cpu from one cpu

there will be a main module bound to some specific cpu
the main module will have a workqueue on that cpu

inside the main module, create 2 kernel threads (kernel_thread)
then bind them to different cpus (kthread_bind)
each of them will register a timer that runs on its specific cpu,
take pages out of that cpu's pcp list,
and put them on the common workqueue

they will have a completion mechanism to flag their completion and then do_exit()

during cleanup_module(), the main module will wait on the completions of both kernel threads.

but those pages will be tested by the testing function on a specific cpu (a sketch of the thread setup follows).
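one way to realize this plan, sketched with the 2.6 kthread API (worker_fn and the pcp-draining work are placeholders):

    #include <linux/kthread.h>
    #include <linux/completion.h>
    #include <linux/err.h>

    static DECLARE_COMPLETION(worker_done);

    static int worker_fn(void *data)
    {
            /* ... register the timer, pull pages off this cpu's pcp
             * list, hand them to the shared workqueue ... */
            complete(&worker_done);  /* flag completion, then exit */
            return 0;
    }

    static int start_worker_on(int cpu)
    {
            struct task_struct *t;

            t = kthread_create(worker_fn, NULL, "pcp_worker/%d", cpu);
            if (IS_ERR(t))
                    return PTR_ERR(t);
            kthread_bind(t, cpu);    /* pin the thread to the chosen cpu */
            wake_up_process(t);
            return 0;
    }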

Saturday, June 6, 2009

Manual Extraction of Free pages from pcp or buddy list

Page count must be 0 when returning a page to the list of free pages (page->_count)

page->lru keeps the link inside pcp

count=1 just after extracting the page from any free list

the pcp list is locked locally via get_cpu()/put_cpu() plus local_irq_save()/local_irq_restore().

the buddy allocator is locked globally by the corresponding zone's spinlock (zone->lock). the __rmqueue* functions work on the buddy lists under that lock. this spinlock does not conflict with the pcp local locking.
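the two patterns side by side, as a sketch (2.6 era; 'zone' is assumed to be a valid struct zone pointer):

    #include <linux/mmzone.h>
    #include <linux/smp.h>

    static void locking_sketch(struct zone *zone)
    {
            unsigned long flags;
            int cpu;

            /* pcp list: per-cpu data, so pinning the cpu and disabling
             * local irqs is enough, no spinlock needed */
            cpu = get_cpu();
            local_irq_save(flags);
            /* ... take a page from this cpu's pageset of the zone ... */
            local_irq_restore(flags);
            put_cpu();

            /* buddy lists: shared by all cpus, so take the zone lock */
            spin_lock_irqsave(&zone->lock, flags);
            /* ... __rmqueue() removes a block from zone->free_area ... */
            spin_unlock_irqrestore(&zone->lock, flags);
    }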

Tuesday, May 26, 2009

Memory Area Management

a memory area is a sequence of contiguous physical addresses. the idea came from the observation that the kernel keeps using objects of the same kind, so grouping them saves a lot of memory.

question: what happens when a cache grows or shrinks? it cannot extend its existing contiguous pages; it simply acquires another group of contiguous pages (a new slab).

there are 13 lists of memory areas with sizes from 32 to 131072 bytes. each size has 2 caches: one for DMA and one for normal allocations.
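an illustrative table of those sizes and the pick-the-smallest-fit rule (kmalloc does essentially this against its size caches):

    #include <stddef.h>

    /* 13 general-cache sizes, each existing twice (DMA and normal) */
    static const size_t cache_sizes[13] = {
            32, 64, 128, 256, 512, 1024, 2048, 4096,
            8192, 16384, 32768, 65536, 131072
    };

    /* pick the smallest general cache that fits a request of n bytes */
    static int size_class(size_t n)
    {
            int i;

            for (i = 0; i < 13; i++)
                    if (n <= cache_sizes[i])
                            return i;
            return -1;  /* larger than the biggest general cache */
    }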

user processes do not have access to or use the slab allocator. it is solely used by the kernel.

2 types of caches: general (used by the slab allocator itself, e.g. the kmalloc size caches) and specific (used by the rest of the kernel)

the kmem_cache cache keeps track of the descriptors of all the other caches.

pages belonging to a slab have the PG_slab flag set. slabs cannot get pages from ZONE_HIGHMEM because page_address() would return NULL for an unmapped highmem page :D.

every physical or abstract entity has a descriptor in the kernel. a page frame within a slab has its PG_slab flag set, lru.next = cache descriptor, lru.prev = slab descriptor. lru is normally used by the buddy allocator to chain free consecutive pages; since a slab page is no longer under the buddy allocator, the field is free to be reused for something useful. slab pages are pseudo-free: they may be full of free objects, yet the buddy allocator no longer tracks them.
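a sketch of that lru-field reuse, modeled on the helpers in 2.6's mm/slab.c:

    #include <linux/mm.h>

    static inline void page_set_cache(struct page *page,
                                      struct kmem_cache *cache)
    {
            page->lru.next = (struct list_head *)cache; /* -> cache desc */
    }

    static inline void page_set_slab(struct page *page, struct slab *slabp)
    {
            page->lru.prev = (struct list_head *)slabp; /* -> slab desc */
    }

    static inline struct kmem_cache *page_get_cache(struct page *page)
    {
            return (struct kmem_cache *)page->lru.next;
    }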

the page frames of a slab can be located via the s_mem field of the slab descriptor plus the gfporder field of the cache descriptor.

Saturday, May 23, 2009

Page Cache

during hibernation, RAM is written to swap space.

the page cache is a set of radix trees (one per owner) that lets the kernel quickly find a page inside the address space object of its owner.

kinds of pages in page cache
  • regular files, directories
  • block device data
  • swapped out processes
  • file of special fs (eg, ipc shm)
the page owner is the inode object (the idea of a page's owner arises only when the page cache is in action). r/w behavior depends on the type of owner. 3 types of owner:
  • regular file
  • block device file
  • swap area and swap cache
the inode contains the address space object, which is the page cache (pages plus methods). a block device also has an inode of its own. the idea of the owner is based on the uniqueness of the resource.

the radix tree of the pages in a page cache is searched much like a page table lookup. the number of index bits consumed depends on the height of the tree.
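an illustrative lookup that mirrors the page-table-walk idea (2.6's radix tree uses 6 index bits, i.e. 64 slots, per level):

    #define MAP_SHIFT 6
    #define MAP_MASK  ((1UL << MAP_SHIFT) - 1)

    struct rnode { void *slots[1 << MAP_SHIFT]; };

    static void *radix_lookup_sketch(struct rnode *root, unsigned long index,
                                     unsigned int height)
    {
            void *slot = root;

            while (height--) {
                    struct rnode *node = slot;

                    if (node == NULL)
                            return NULL;
                    /* each level consumes the next MAP_SHIFT bits of
                     * the page index, highest bits first */
                    slot = node->slots[(index >> (height * MAP_SHIFT)) & MAP_MASK];
            }
            return slot;
    }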

find_get_page() increases the page usage counter (page->_count). find_lock_page() additionally sets the PG_locked flag.

the radix tree is a kind of database for the pages (by utilizing the tag field)

PG_writeback means the page is currently being written back to disk (radix tree tag 1, PAGECACHE_TAG_WRITEBACK).

A buffer page is a page that has buffer heads attached. should every page in a page cache be a buffer page? ... apparently not: buffer heads are attached only when needed, so many page-cache pages never get them.

all anon/file reads and writes are executed on RAM only. when a mapped page is written to, the PG_dirty flag gets set during the page fault. a dirty page belonging to an addr space must eventually be written back to disk.

the private field of a buffer page points to the buffer head of the first block.

Sunday, May 17, 2009

Memory Mapping

2 kinds of pages: anonymous pages (pure RAM) and file-mapped pages (backed by a file on disk/blk device). and 2 levels of pages: the physical frames and the logical structure above them.

anonymous pages have mapping = NULL; mapped pages have mapping = the address space object. mapped pages also sit in the radix tree under that addr space object.

the layer above the page layer is the memory region layer. that layer keeps flags that specify the kind of pages underneath.

initially, a memory region is not linked with page frames. during page faults the region acquires pages (anonymous or mapped), and during page reclaim the link is temporarily broken.

processes use the file object. the memory region links the file object to the (logical) pages. if the pages are anonymous, the memory region does not use 'vm_pgoff' and 'vm_file'; if they are file-mapped, then 'vm_file' = the file object and 'vm_pgoff' = the location within the file.

the address_space object links the physical pages and the virtual memory regions associated with a file.
  • page_tree member is the radix tree of physical pages (the page cache)
  • i_mmap is the radix priority search tree (PST) of the memory region objects. the PST is used for reverse mapping.
the mapping (address space) has a host field pointing to the inode; file object->f_mapping = the address space. one object on disk means one address space object, while different open modes of the same file are tracked by separate file objects.

if vm_area_struct.vm_file is NULL then this region does not map any file.

if a memory region does not map a file, its nopage method is NULL (anonymous mapping).
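both tests combined, as a sketch (2.6-era fields of vm_area_struct):

    #include <linux/mm.h>

    static int is_anonymous(struct vm_area_struct *vma)
    {
            /* no backing file and no nopage method: anonymous mapping */
            return vma->vm_file == NULL &&
                   (vma->vm_ops == NULL || vma->vm_ops->nopage == NULL);
    }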

Tuesday, May 5, 2009

Handling Swap-Page Faults

read_swap_cache_async locks the page, and the block layer unlocks it when the read finishes.

read ahead pages are kept in swap cache.

swap area is on disk. swap cache is in RAM.

the first part of read_swap_cache_async reserves RAM pages for the swap-in and puts them neatly into the swap cache. in the second part, swap_readpage reads the page from the swap area (disk/ram) into that swap cache page.
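a condensed sketch of the two phases (names as in 2.6's mm/swap_state.c; error paths omitted):

    static struct page *swapin_sketch(swp_entry_t entry,
                                      struct vm_area_struct *vma,
                                      unsigned long addr)
    {
            struct page *page;

            /* phase 1: reserve a RAM page and insert it, locked, into
             * the swap cache radix tree under the swap entry's index */
            page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
            if (!page || add_to_swap_cache(page, entry))
                    return NULL;

            /* phase 2: start the disk read; the block layer unlocks
             * the page when the data has arrived */
            swap_readpage(NULL, page);
            return page;
    }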

swap_readpage initiates the data transfer: get_swap_bio generates the request (a bio) and submit_bio sends it to the block layer.

swap cache is just a radix tree.

the swap entry (the page's slot index within the swap area) is kept in the page's private field.

swapin_readahead is just an additional layer around read_swap_cache_async ... tune it via /proc/sys/vm/page-cluster

Wednesday, April 29, 2009

Page Reclaim

PFRA works on
1. user mode process pages
2. kernel caches

there are two aspects of swapping: the mechanism (writeback) and the policy (page reclaim).

each zone has 2 lru lists: active and inactive. if a page is on an lru list, its PG_lru flag is set; if it is on the active list, PG_active is also set (and cleared when inactive, there is no separate inactive flag). pages are linked in through their lru list member, and the whole list is protected by the zone's lru_lock spinlock.

the lru lists contain pages from only: 1. user-mode address spaces, 2. the page cache. so none of the zone's free pages will ever be on an lru list.

these pages are referenced by: 1. usr proc, 2. fs layer code, 3. device driver

mark_page_accessed() moves pages between the inactive and active lists slowly, using the PG_active/PG_referenced flag combination as a two-step filter.
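the flag combination at work, with logic as in 2.6's mm/swap.c:

    void mark_page_accessed_sketch(struct page *page)
    {
            /* second touch of an inactive page: promote it */
            if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
                    activate_page(page);
                    ClearPageReferenced(page);
            } else if (!PageReferenced(page)) {
                    /* first touch: only remember the reference */
                    SetPageReferenced(page);
            }
    }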

there is no lock over a memory object's page tables as a whole; locking is taken at the pmd level only. so each pmd-mapped page table has a lock (spinlock) associated with it.

the page's private field encodes the swap entry number. (NOTE: for a buffer page, the private field instead points to the buffer head of the first block of data.)

Notes on swapping

modified pages that are backed by a block device are not swapped out; they are synchronized with the block device instead.

the kernel can simply discard pages backed by a file that cannot be modified, since they can be re-read at any time.

page reclaim = { selecting, swapping, synchronizing, discarding }

private mapping = mapping a file into memory such that changes to that memory are not reflected back to the backing block device.

page candidates for the swap area:
1. MAP_PRIVATE, 2. MAP_ANONYMOUS, 3. IPC pages

inside swap areas there are slots and clusters. swap areas may have different priorities. clustering makes readahead to/from swap space fast.

swap_map (an array of usage counters, one per slot, rather than a plain bitmap) keeps track of used and free slots in a swap area.

to reduce swapping pressure there are kswapd (normal memory pressure) and direct reclaim (extreme memory pressure).

when swapping is not enough, there is the OOM killer. so there are several levels of mechanisms to keep the system running.

each zone has 2 lists: active and inactive. pages are transferred between these 2 lists based on their h/w accessed bit.

a swap_info_struct keeps track of each swap area, similar to the way the kernel keeps track of a block device with its descriptor.

the first slot of a swap area is used for identification purposes and for keeping state information about the swap partition. (i have to see how that is done)

the extent list works like a page table in VM: it maps the linear page slots of a swap area to the (possibly scattered) blocks on disk. a file-based swap area needs more extent structures than a partition-based one, where the blocks are sequential.
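a sketch of the mapping (fields modeled on struct swap_extent in <linux/swap.h>):

    struct extent {
            unsigned long start_page;  /* first swap slot of this run */
            unsigned long nr_pages;    /* length of the run           */
            unsigned long start_block; /* first disk block of the run */
    };

    /* translate a swap slot to a disk block by scanning the extents */
    static unsigned long slot_to_block(struct extent *ext, int n,
                                       unsigned long slot)
    {
            int i;

            for (i = 0; i < n; i++)
                    if (slot >= ext[i].start_page &&
                        slot < ext[i].start_page + ext[i].nr_pages)
                            return ext[i].start_block +
                                   (slot - ext[i].start_page);
            return 0;  /* not found */
    }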