Wednesday, December 14, 2011

Swapper Subsystem Files Structure

swap.c: it moves pages around in the lru lists
swapfile.c: it handles actual swapping of the pages to backing store
                  it also provides function calls to control swapping from usermode
swap_state.c: it handles swap cache
vmscan.c: glues the swapper subsystem to the memory manager.
                 provides higher-level functions than the memory manager does.
                 kswapd is here. 

Tuesday, December 13, 2011

Discussion on swap.c/put_page related functions

There are two phases of these functions:
1. deleting the page from the lru cache (__page_cache_release)
2. freeing the page to the memory allocator

Consider the allocation process: 1. page is allocated, 2. page table entries are set up, 3. page is added to the lru cache.

In the put page functions, page table entries are not handled. So the control path should fix/remove the appropriate page table entries before calling these functions.
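
a minimal sketch of that ordering (a hypothetical helper, not kernel code; page_remove_rmap()'s signature varies slightly between kernel versions):

static void release_mapped_page(struct vm_area_struct *vma, unsigned long addr,
                                pte_t *ptep, struct page *page)
{
        ptep_clear_flush(vma, addr, ptep);  /* tear down the pte and flush the TLB first */
        page_remove_rmap(page);             /* drop the reverse-mapping count */
        put_page(page);                     /* phases 1+2: off the lru (if last user), back to the allocator */
}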


Thursday, December 8, 2011

Get a Task's Page Tables

step 1: get the current task
step 2: get the runqueue
step 3: get the cfs rbtree root
step 4: traverse the tree (recursively)
for each node, get the scheduling entity
from the scheduling entity, get the task pointer
from the task structure, get the pgd
print pgd information :D
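
a rough sketch of this walk (my own code, assuming a ~2.6/3.x kernel): the root comes from the runqueue's cfs_rq->tasks_timeline.rb_node (steps 1-3). note that with group scheduling a sched_entity may belong to a task group rather than a task, in which case the container_of() below would be wrong.

static void print_pgd_of_entity(struct rb_node *node)
{
        struct sched_entity *se;
        struct task_struct *task;

        if (!node)
                return;

        se = rb_entry(node, struct sched_entity, run_node);
        task = container_of(se, struct task_struct, se);
        if (task->mm)           /* kernel threads have no mm */
                printk(KERN_INFO "%s: pgd=%p\n", task->comm, task->mm->pgd);

        print_pgd_of_entity(node->rb_left);     /* step 4: recursive traversal */
        print_pgd_of_entity(node->rb_right);
}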

Notes on rbtree

definition
----------

#define RB_RED   0
#define RB_BLACK 1
struct rb_node
{
        unsigned long rb_parent_color;  /* parent pointer with the color packed into the low bits */
        struct rb_node *rb_right;
        struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

usage:
-------
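a small usage sketch (my own example, with a hypothetical struct my_item): in-order iteration using rb_first()/rb_next() and rb_entry().

struct my_item {
        struct rb_node node;    /* embedded rbtree node */
        unsigned long key;
};

static void walk_tree(struct rb_root *root)
{
        struct rb_node *n;

        for (n = rb_first(root); n; n = rb_next(n)) {
                struct my_item *it = rb_entry(n, struct my_item, node);
                printk(KERN_INFO "key=%lu\n", it->key);
        }
}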

Notes on Swapping (how to totally smash it)

modified pages (backed by a block device) are not swapped out, but they can be synchronized with the block device.

the kernel can simply discard pages (backed by a file) which are not modifiable.

page reclaim = { selecting, swapping, synchronizing, discarding }

private mapping = map a file in memory, but changes to that memory are not reflected back to the backing block device.

candidate pages for the swap area:
1. MAP_PRIVATE, 2. MAP_ANONYMOUS, 3. IPC pages

inside swap areas, there are slots and clusters. swap areas may have different priorities. clustering makes readahead from swap space fast.

a bitmap is used to keep track of used and free slots in a swap area.

to relieve memory pressure, there are kswapd (normal memory pressure) and direct reclaim (extreme memory pressure).

when swapping is not enough, there is the OOM killer. so we can see there are several levels of mechanisms to keep the system running.

there are 2 lists in each zone: active and inactive. pages are transferred between these 2 lists based on their h/w accessed bit.

swap_info_struct keeps track of each swap area, similar to the way the kernel keeps track of a block device.

the first slot of a swap area is used for identification purposes and for keeping state information about the swap partition. (i have to see how that is done)

the extent list works like a page table in VM: it maps the linear page slots of a swap area to the scattered blocks on disk. for a file-based swap area, the number of extent structures is larger than for a partition-based swap area, where the blocks are sequential.

Reverse Mapping

vm_area_struct: 3 members are needed: shared, anon_vma_node, anon_vma

2 alternatives: anonymous pages and file-mapped pages

General MM Notes

Linux kernel space has 4 kinds of memory mapping
1. direct mapping
2. vmalloc mapping
3. kmap mapping
4. fixmap mapping

direct mapping is used for low memory pages which are used to store permanent data structures, page tables, interrupt tables and so on.

vmalloc mapping is used for non-contiguous memory mapping. i can use it any way possible. if you have space reserved in the vmalloc area and a bunch of pages in your hand, you can map those pages into that vmalloc space by modifying the kernel page tables. (example usage: insmod uses it to store modules)
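
a tiny sketch (a hypothetical module-side example) of the usual vmalloc()/vfree() pairing; the allocation is virtually contiguous but physically scattered, and the kernel page tables are fixed up for us:

#include <linux/vmalloc.h>

static int *big_table;

static int alloc_table(void)
{
        big_table = vmalloc(1024 * 1024 * sizeof(int)); /* a few MB, physically scattered */
        if (!big_table)
                return -ENOMEM;
        return 0;
}

static void free_table(void)
{
        vfree(big_table);
}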

kmap mapping is kind of like the vmalloc space, but only one pgd entry in the master kernel page table is dedicated to kmap virtual addresses.

fixmap mapping is similar to kmap mapping. (example usage: map a page table residing on a high mem page)

for all these 4 kinds of mapping, virtual and physical pages are given as soon as the mapping request is placed and served. a page fault here means something is going terribly wrong inside.

user process pages are similar to kernel vmalloc space, except that the virtual addresses are in user space and the physical pages are not allocated and mapped until a page fault is generated on those virtual addresses.

Buddy System (MM)

During initialization in memmap_init_zone(), all pages are marked movable.

Each block of pages inside the buddy allocator has a head page. the head page has the PG_buddy flag set and page->private=order. the other pages do not have these set.

Compound Page (Huge TLB page):
A group of consecutive pages: a head page and tail pages. each page has the PG_compound flag set and page->private pointing to the head page. The first tail page stores the destructor 'free_compound_page' and the allocation order in its lru.next and lru.prev fields.

A free page can be in 2 places: 1. buddy list 2. pcp list
When a free page is in buddy list and is the head page of a block, then page->private=order
When a free page is in pcp list, then page->private=migratetype

The node and zone of a page are encoded in the upper portion of page->flags.
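
a small sketch (assuming a ~2.6.x kernel with the PG_buddy flag described above): reading the order of a free block from its head page, and the node/zone packed into the upper bits of page->flags.

static void inspect_page(struct page *page)
{
        if (PageBuddy(page))    /* head page of a free block in the buddy lists */
                printk(KERN_INFO "free block of order %lu\n", page_private(page));

        printk(KERN_INFO "node=%d zone=%s\n",
               page_to_nid(page), page_zone(page)->name);
}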

Hot-N-Cold PCP

[ref. PLKA p146, p183]

per_cpu_pageset structure. each zone has pageset[NR_CPUS].
NR_CPUS = 1 (uni-processor), 2-32 (smp32), 2-64 (smp64)

per_cpu_pageset -> per_cpu_pages[2] ... 0:hot, 1:cold

per_cpu_pages { count, high, batch (chunk size for BA), list }
... low watermark: 0, high watermark: high (PLKAfig3.6)

the list of pages is linked through the page->lru field.

I will take the page from the back of the list, test it and put it in the front of the list.

(Performance) I can have a free page reserve to fill up pcp quickly.
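
a sketch of walking the hot list based on the layout above (PLKA-era, ~2.6.24; the pageset field is a per-CPU pointer on NUMA configs and the pcp[2] hot/cold split disappears in later kernels, so treat the field access as an assumption):

static void dump_hot_pcp(struct zone *zone, int cpu)
{
        struct per_cpu_pages *pcp = &zone->pageset[cpu].pcp[0]; /* 0 = hot list */
        struct page *page;

        list_for_each_entry(page, &pcp->list, lru)      /* pages linked through page->lru */
                printk(KERN_INFO "pfn=%lu\n", page_to_pfn(page));

        printk(KERN_INFO "count=%d high=%d batch=%d\n",
               pcp->count, pcp->high, pcp->batch);
}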

Free Pages Grab and Release

If you directly take a page out of the free list, you need to be careful about the following:

** pcp is protected by local irq (local_irq_disable, local_irq_enable).
** zone is protected by a zone-specific spin lock with irq (spin_lock_irqsave(&z->lock, flags), spin_unlock_irqrestore(&z->lock, flags)).

pcp
----
1. any time you grab a page from pcp, you must call prep_new_page() on it.
2. when you grab page block manually from bdyalloc, you need to either
... 2.a clear the page->count before releasing that page to freelist (set_page_count). or
... 2.b prep_new_page on that block just after you extract that page.

bdyalloc
---------
1. when you take a page block out of bdyalloc, you need to clear the PG_buddy flag and set private=0 on the first page of the block. (rmv_page_order)
2. when you put a block of pages back into bdyalloc, you need to set the order (set_page_order).

remove pages: (you must call prep_new_page)
__rmqueue_smallest: works on the freelists of a zone (not the pcp)

__rmqueue_fallback: similar but takes care of migratetype more flexibly

rmqueue_bulk: zone, order, count ...

a little more user-friendly: no need to call prep_new_page
buffered_rmqueue: works both on pcp and freelists (based on the order)

insert pages: (page count must be 0)
free_hot_cold_page: frees a page into the pcp list (hot or cold)
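
a sketch (my own, not kernel code) tying the rules above together: take a block straight from the zone freelists under the zone lock, then make it usable with prep_new_page(). note that __rmqueue_smallest() and prep_new_page() are static inside page_alloc.c and their signatures vary between versions, so this only works if hacked into that file.

static struct page *grab_block(struct zone *zone, unsigned int order)
{
        unsigned long flags;
        struct page *page;

        spin_lock_irqsave(&zone->lock, flags);
        page = __rmqueue_smallest(zone, order, MIGRATE_MOVABLE); /* freelists, not the pcp */
        spin_unlock_irqrestore(&zone->lock, flags);

        if (page && prep_new_page(page, order, GFP_KERNEL))      /* sets count, clears flags */
                return NULL;                                     /* bad page detected */
        return page;
}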

Page Migration

the old page is locked, and the pte entries pointing to the old page are replaced with migration entries; a page fault on them will wait till the migration is finished.

the new page is locked and set not-uptodate before the migration so that processes will wait on it till the migration ends.

unmap the old page, installing migration entries
replace the page in the radix tree

if a page is locked then processes trying to map that page into their page tables will wait on the lock. also, the page won't be swapped out.

there are two granularities of locking: 1. the vma lock, which is a virtual-level lock and does not correspond to any physical page, and 2. the page lock, which locks the physical page.

page ref count means some process or kernel component is currently using or holding that physical page somehow. (page_count)

page map count means the page is mapped into some process page table (page_mapcount returns _mapcount + 1)

map count <= ref count 
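
a tiny illustration (a hypothetical helper) of that relationship: a page pinned only by the kernel has a positive refcount but a zero mapcount, while every pte that maps a page also holds a reference.

static bool only_kernel_users(struct page *page)
{
        return page_count(page) > 0 && page_mapcount(page) == 0;
}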

if the physical page is in the swap cache, then the private field will contain the swap entry for that page.
this entry can be used to build the swap pte installed in the page table.

vma private is used sometimes for containing non-linear vma information. 


page refcount set/increase/decrease

set
-----
page_alloc.c/free_pages_bootmem() ... sets it to 1 on the first page of a page block
page_alloc.c/prep_new_page() ... sets it to 1 on the first page of a page block
page_alloc.c/split_page() ... sets it to 1 on the first page of each of the resulting page blocks

increase
--------------


decrease
---------------

Monday, November 21, 2011

Dissection of Page Fault Handler

__do_fault (memory.c, 3.1.1)

as this function creates a new mapping for a page which does not exist yet, TLB entries are not altered.

This function has two parts:
1. allocate a page and prepare it appropriately.
2. fix page tables to point to this page.

In the first if block we see there are three tasks:
1. prepare the anonymous vma
2. allocate the page
3. register the page with memory control group
       if unable, release the page
in all three cases, return VM_FAULT_OOM on failure.

the COW page is allocated before any other processing because this reduces the lock_page() holding time on the page cache page.

how to detect if it is a COW request: FAULT_FLAG_WRITE is set and the vma is NOT shared. COW is typically exercised when a parent forks and a new process is created, but the memory pages are not copied right away. instead, the virtual memory regions are marked non-shared and the page table entries are marked read-only, so the next write fault can detect COW.
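
roughly how that check reads in mm/memory.c (3.1-era), paraphrased:

        if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
                /* private write fault: allocate a fresh COW page up front,
                 * copy into it once the fs hands back the original page */
        }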

after setting up the COW page, the function prepares the vmf structure (struct vm_fault).

next, the fs fault handler (vma->vm_ops->fault) is invoked. by the time control reaches __do_fault, it is already decided that filesystem code is involved.


...

the backing address space might want to know when a page is about to become writable. the filesystem code implements this functionality; in that case, the vma->vm_ops->page_mkwrite function will be present.




Wednesday, March 23, 2011

Small Test Project Idea on Reverse Mapping?

Difficulty Levels:
[6 - very easy, 5 - easy, 4 - medium, 3 - hard, 2 - very hard, 1 - extremely hard]

P1. find out mapcounts of each of the memory pages. -- L6
P2. find out PTEs for a shared anonymous page (involves handling a simple anon region) -- L4
P3. find out PTEs for a shared file page (involves handling address space) -- L3
P4. find out the names of the processes that are sharing a page. -- L5 after P2/3 are done
P5. force remap of shared page -- L3
P6. measure TLB effect using 'perf' -- L4

P7. find out PGD, PMD, PTEs for a given memory region descriptor -- L4
P8. find out number of allocated page frames for a process (using its mm descriptor rss fields) -- L5

P9. Restrict memory for a process.











Monday, March 21, 2011

Page Reclamation III (Reverse Mapping)

LWN article (anon_mm) (anon_vma) (comparison)

from PLKA:
page structure contains a single element to implement reverse mapping { atomic_t _mapcount; }.
two other data structures are needed: priority search tree for file address space, and a linked list for anon address space.
region descriptor (vm_area_struct) has all info to generate reverse mapping: union shared, anon_vma_node, anon_vma
this is called object-based reverse mapping because a page is not directly associated with processes; rather, memory regions are associated with the page (and, therefore, the processes too)

anon mapping:
page->mapping: points to the anon_vma referenced by the memory region descriptor (vma)
page->index: relative position of the page in the memory region
the lowest bit of page->mapping will be 1 for an anon mapping (PAGE_MAPPING_ANON) or 0 for a file mapping.
adding the page into any mapping increments page->_mapcount.

mapcount and activity are not synonymous, because mapcount is static, whereas activity means the page is being actively used right now. activity means the _PAGE_ACCESSED bit is set in the page table entry for that page in a memory region of a process. so, during the page_referenced() function, we need to get each memory region for that page, get the page table entry, check the _PAGE_ACCESSED bit, and clear it if it is set. interestingly, page_referenced returns the number of _PAGE_ACCESSED bits that were set for that particular page across all the processes (memory regions) that are using that page.

from ULK3:
the page structure stores a backward link to the memory region descriptors. a mem reg desc contains the PGD, which can be used to find the PTE for that page. thus, we can get the list of PTEs from a given page structure easily. to find the number of places from where this page is mapped, we can use the page->_mapcount field. to see whether the mapping is file or anon, we have to look at the lowest bit of page->mapping. page->index contains the relative position of that page from the beginning of the mem reg.
[note: a page in the buddy should have a mapcount of -1; a non-shared page has mapcount 0; a shared page has mapcount 1+]

now, page->mapping links the data structures that connect the memory regions for this page.
page->mapping == NULL: this page is in the swap cache.
page->mapping points to an anon_vma if the last bit is 1 (anon mapping).
page->mapping points to an address_space if the last bit is 0 (file mapping).
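
a sketch of how the low bit of page->mapping separates the two mapped cases (this is essentially what the kernel's PageAnon() test does):

static void classify_mapping(struct page *page)
{
        unsigned long m = (unsigned long)page->mapping;

        if (!m)
                printk(KERN_INFO "swap cache / not mapped\n");
        else if (m & PAGE_MAPPING_ANON)
                printk(KERN_INFO "anon: anon_vma=%p\n", (void *)(m & ~PAGE_MAPPING_ANON));
        else
                printk(KERN_INFO "file: address_space=%p\n", page->mapping);
}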

anon memory desc:
when the kernel assigns the first page to an anonymous memory region, it allocates an anon_vma data structure, which has a lock and a list head.
memory regions are kept in that list: mem_reg->anon_vma = anon_vma, and mem_reg->anon_vma_node maintains the list.
notice there is a lock involved here, so think about it when considering scalability with many shared (anonymous) pages.

[note: vma->vm_pgoff = offset of memory region in the mapped file, page->index = offset of the page in the memory region]

to find the PTE, we need the actual linear address of the page (in that memory region); this is very important. if somehow we can't figure out the linear address of the page for a memory region, we need to search all the PTEs in that particular memory region for that page; this happens for non-linear memory mappings. for a particular memory region, we can get the PTEs because we have the beginning and the ending addresses, so it is easy to query the page table structures to view their current state.

A page might have different linear addresses depending on the memory region it is mapped into. To find the PTE, we need the PGD and the linear address. whenever thinking about a page mapped in memory, think about both its linear address and its physical address.
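
a sketch of the lookup (mirroring rmap's vma_address() computation plus a standard page-table walk; assumes a linearly mapped vma on a ~2.6/3.x kernel):

static pte_t *pte_for_page(struct mm_struct *mm, struct vm_area_struct *vma,
                           struct page *page)
{
        unsigned long addr;
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;

        /* linear address = vma start + (page offset - vma's offset in the file) */
        addr = vma->vm_start + ((page->index - vma->vm_pgoff) << PAGE_SHIFT);

        pgd = pgd_offset(mm, addr);
        if (pgd_none(*pgd))
                return NULL;
        pud = pud_offset(pgd, addr);
        if (pud_none(*pud))
                return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
                return NULL;
        return pte_offset_map(pmd, addr);       /* caller must pte_unmap() when done */
}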

[pitfalls]
mremap() may crash the party by directly modifying page table entries.
if PTE says the page is accessed, then unmapping won't take place as that page is considered in-use.
locked/reserved memory regions can also nullify remapping effort.

file address space desc:








Friday, March 18, 2011

Page Reclamation II (Policy)

There are four levels of page activities: AR (00,01,10,11) [A=PG_Active, R=PG_Referenced page flag]

the page referenced flag is cleared each time an activity check is performed. there are two types of functionality: activity checking and moving pages between lists.

mark_page_accessed() pushes pages towards AR=11 and page_referenced() pushes the page towards AR=00. ironically, page_referenced should have been called page_ref_killed()... :))

page_referenced() additionally tells us how many of the mappings have referenced this page since the last check.

the swap token overrides the resetting of the PG_referenced bit and keeps it set so that the process does not suffer from swapping under heavy swapping pressure.

shrink_active_list() puts some active pages from active lru list to inactive list. shrink_inactive_list() swaps out some pages from inactive lru list.

active and inactive lists are protected by spinlock zone->lru_lock

PG_lru is set only when the page is on an LRU list. the code moves pages from the LRU list onto a local list to avoid holding the lru_lock for a long time.
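
a sketch of that local-list trick (not the real isolate_lru_pages(), just the idea): detach a batch of pages under the lock, then do the expensive per-page work with the lock dropped.

static void process_batch(struct zone *zone)
{
        LIST_HEAD(local);
        struct page *page, *next;

        spin_lock_irq(&zone->lru_lock);
        /* ... list_move() a handful of pages from the zone's inactive
         *     list onto 'local', clearing PG_lru for each ... */
        spin_unlock_irq(&zone->lru_lock);

        list_for_each_entry_safe(page, next, &local, lru) {
                /* expensive per-page work happens here, without lru_lock */
        }
}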





Thursday, March 17, 2011

Source Files Used in Page Reclamation

generic:
pagevec.h -- page vectors


functional:
mm/swap.c -- LRU, activate_page(), mark_page_accessed()
mm/rmap.c -- page unmapping, page_referenced()
mm/vmscan.c -- isolate_lru_pages()

Page Usage Counter

increments:
page_cache_get() in lru_cache_add() -- because this page is now in the LRU cache

decrements:
try_to_unmap_one() -- because one more process stopped using this page after a successful unmap

Locks and Sequential Processing in Swapping

lru_cache_add() -- disable preemption

Page Map Counter

_mapcount refers to reverse mapping info. its initial value is -1. it becomes 0 when the page is inserted into the reverse mapping data structures and is incremented by 1 for each additional user. so, when the page is inside the buddy allocator, its mapcount is -1.

Increments:

Decrements:


Page Flags

Page Reclaim and Swapping:
PG_lru is set when the page is moved onto an lru list; if it is an active page then the PG_active bit is also set.

Page activity determination:
PG_referenced and PG_active bits

Buddy:

Slab:

Files:


Wednesday, March 16, 2011

discussion on rmap.c

* Unlike common anonymous pages, anonymous hugepages have no accounting code and no lru code, because we handle hugepages differently from common pages.

* anon vma does not point to pages but pages point to anon vma.

these are all in-use pages:
rmap code extensively uses the address space (file/anon) to extract mapping info of pages and vmas. primarily, pages are mapped into vmas, through the page tables.
secondarily, they are mapped into file/anon address spaces too.
third, swapping uses the LRU lists of pages.
mapping counts are also used extensively in these files.

the swapping subsystem should depend on rmap extensively, because swapping needs to unmap and remap pages on the fly.

two types of walk are involved for rmap: file and anon. the address space needs to be locked during the walk. this locking will have scalability problems, since, as we know, locking means a sequential code block irrespective of the number of cores we have.



Monday, March 7, 2011

Writing Style

After some time, the swap token is passed to some other process that also undergoes swapping and requires memory more exigent than the current token holder.

Thanks to this extension, the algorithm does take minimum account of whether pages are used frequently or not, but does not deliver the performance expected of state-of-the-art memory management.

How can the kernel mark or sort pages as simply as possible in order to estimate access frequency without requiring an inordinate amount of time to organize data structures?

Hardly any part of the kernel entails as many technical difficulties as the virtual memory subsystem.

As I have demonstrated, not only are the technical details of paramount importance in achieving an efficient and powerful implementation of the swap system, but the design of the overall system must also support the best possible interaction between the various components to ensure that actions are performed smoothly and harmoniously.

This is combined with a ‘‘least recently used’’ method to realize page reclaim policy.





Page Reclamation I (Basics)

The swapping subsystem has a few components:
1. Activation of page selection (periodic/explicit)
2. Page reclaim (policy)
3. Reverse Mapping
4. Swap cache
5. Writeback mechanism


a little distantly related: fault on a swapped page (notes)

code pages are file pages and are kept track of by the file address space object. the difference between a code page and a data page is not the backing file, but whether that page is modifiable in memory or not. if not modifiable, then it is (most probably) a code page; if it is modifiable then it is a data page. a modified data page is synchronized with the backing block device during swapping, but non-modifiable code pages are just discarded.

process heap and stack pages are anonymous pages because they do not have any backing store on a block device. a memory area can be mapped anonymously using mmap too. these anon pages are swapped out to the swap area.

private mapping is an exception: in this case the data is on the backing file, but private mapping decouples the data from the file. so privately mapped pages are swapped out to the swap area.

IPC shared pages are also swapped to swap area.

PFRA is all about identifying which data is more important than the rest. less important data is swapped out to the swap area if required. from the viewpoint of a process, the OS needs to find out the working set of that process. there might be a read working set (PCM-based) and a write working set (DRAM-based).

PFRA needs to be 1) fair and 2) scalable

there is the idea of a swap token, which makes a process immune to swapping because the OS tries not to swap out that process's pages, thereby making that process run faster.

there can be multiple swap areas with different priorities depending on their relative speed. to keep track of slots in a swap area, the kernel uses a bitmap. both swap partitions and swap files are simply called swap files in kernel terminology.

each zone has an active and an inactive list of pages.

shrink_zone() is just one function which periodically moves pages from the active to the inactive lists. A page is first regarded as inactive, but it must earn some importance to be considered active.



















*** Thoughts and Ideas ***

* finding out read working set and write working set of a process... read working set can reside in PCM and write working set can reside in DRAM.
*


***
* the page uses two flags to keep track of LRU pages. Can't this technique be used for my work too, to clear tested flags automatically? Testing closely resembles the existing swapping technique. the scalability issue can also be resolved based on the ideas acquired in this work.

scalability improvement: each CPU has a pagevec (pvec) through which LRU pages are added to the central LRU list in batches.

task: look into the Linux scalability analysis paper to get some more ideas.