Wednesday, March 23, 2011

Small Test Project Idea on Reverse Mapping?

Difficulty Levels:
[6 - very easy, 5 - easy, 4 - medium, 3 - hard, 2 - very hard, 1 - extremely hard]

P1. find out the mapcount of each memory page. -- L6
P2. find out the PTEs for a shared anonymous page (involves handling a simple anon region) -- L4
P3. find out the PTEs for a shared file page (involves handling an address space) -- L3
P4. find out the names of the processes that are sharing a page. -- L5 after P2/3 are done
P5. force a remap of a shared page -- L3
P6. measure the TLB effect using 'perf' -- L4

P7. find out the PGD, PMD, and PTEs for a given memory region descriptor -- L4
P8. find out the number of allocated page frames for a process (using its mm descriptor rss fields; see the sketch below) -- L5

P9. Restrict memory for a process.
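
for P8, a minimal sketch, assuming a 2.6.3x kernel where get_mm_rss() sums the per-mm file and anon rss counters; task_rss_pages() is a made-up wrapper name:

    #include <linux/sched.h>
    #include <linux/mm.h>

    /* resident page frames of a task, read off its mm descriptor's
     * rss counters */
    static unsigned long task_rss_pages(struct task_struct *task)
    {
            struct mm_struct *mm = task->mm;

            /* kernel threads have no user address space */
            return mm ? get_mm_rss(mm) : 0;
    }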


Monday, March 21, 2011

Page Reclamation III (Reverse Mapping)

LWN article (anon_mm) (anon_vma) (comparison)

from PLKA:
the page structure contains a single element to implement reverse mapping: { atomic_t _mapcount; }.
two other data structures are needed: a priority search tree for the file address space, and a linked list for the anon address space.
the region descriptor (vm_area_struct) has all the info needed to generate the reverse mapping: the shared union, anon_vma_node, and anon_vma fields.
this is called object-based reverse mapping because a page is not directly associated with processes; rather, memory regions are associated with the page (and, therefore, the processes too).

anon mapping:
page->mapping: points to the anon_vma referenced by the memory region descriptor (vma->anon_vma)
page->index: relative position of the page within the memory region
the last bit of page->mapping is 1 for an anon mapping (PAGE_MAPPING_ANON) and 0 for a file mapping.
adding the page into any mapping increments page->_mapcount.
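
a minimal sketch of that last-bit check, loosely after page_lock_anon_vma() in mm/rmap.c (the wrapper name here is made up):

    #include <linux/mm.h>
    #include <linux/rmap.h>

    /* return the anon_vma if this page is anonymously mapped,
     * NULL if page->mapping really is an address_space (file) */
    static struct anon_vma *page_anon_vma_sketch(struct page *page)
    {
            unsigned long mapping = (unsigned long)page->mapping;

            if (!(mapping & PAGE_MAPPING_ANON))
                    return NULL;    /* file mapping */
            return (struct anon_vma *)(mapping & ~PAGE_MAPPING_ANON);
    }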

mapcount and activity are not synonymous: mapcount is static, whereas activity means the page is being actively used right now. activity means the _PAGE_ACCESSED bit is set in the page table entry for that page in some memory region of a process. so, inside the page_referenced() function, we need to visit each memory region for that page, get the page table entry, check the _PAGE_ACCESSED bit, and clear it if it is set. interestingly, the return value of page_referenced() is the number of _PAGE_ACCESSED bits that were set for that particular page across all the processes (memory regions) using it.
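
a sketch of that per-region check, loosely after page_referenced_one() in late-2.6 mm/rmap.c (simplified; the real function also handles the swap token, mlocked vmas, and mapcount bookkeeping):

    #include <linux/mm.h>
    #include <linux/rmap.h>

    /* test-and-clear the accessed bit for this page in one vma */
    static int page_referenced_one_sketch(struct page *page,
                                          struct vm_area_struct *vma,
                                          unsigned long address)
    {
            pte_t *pte;
            spinlock_t *ptl;
            int referenced = 0;

            /* walk the page tables to the PTE mapping this page */
            pte = page_check_address(page, vma->vm_mm, address, &ptl, 0);
            if (!pte)
                    return 0;
            /* clears _PAGE_ACCESSED and flushes the stale TLB entry */
            if (ptep_clear_flush_young(vma, address, pte))
                    referenced = 1;
            pte_unmap_unlock(pte, ptl);
            return referenced;
    }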

from ULK3:
the page structure stores a backward link to the data structure that collects the memory region descriptors. each region descriptor points to an mm descriptor, whose PGD can be used to find the PTE for that page. thus, we can get the list of PTEs from a given page structure fairly easily. to find the number of places from which this page is mapped, we can use the page->_mapcount field. to see whether the mapping is file or anon, we look at the last bit of page->mapping. page->index contains the relative position of that page from the beginning of the memory region.
[note: a page in the buddy allocator should have a mapcount of -1, a non-shared page 0, and a shared page 1+]

now, page->mapping links to the data structure that connects the memory regions for this page:
page->mapping = NULL: the page is in the swap cache.
page->mapping = anon_vma if the last bit is 1 (anon mapping)
page->mapping = address_space if the last bit is 0 (file mapping)

anon memory desc:
when the kernel assigns the first page to an anonymous memory region, it allocates an anon_vma data structure, which holds a lock and a list head.
memory regions are kept on that list: mem_reg->anon_vma = anon_vma, and mem_reg->anon_vma_node links the region into the list.
notice there is a lock involved here, so think about it when considering scalability with many shared (anonymous) pages.
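
for reference, the early-2.6 structure is tiny (from include/linux/rmap.h; later kernels add fields):

    /* one anon_vma links all vmas that may map the same anon pages */
    struct anon_vma {
            spinlock_t lock;        /* protects the vma list below */
            struct list_head head;  /* vmas, linked via anon_vma_node */
    };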

[note: vma->vm_pgoff = offset of memory region in the mapped file, page->index = offset of the page in the memory region]

to find the PTE, we need the actual linear address of the page in that memory region. this is very important. if we somehow can't figure out the linear address of the page for a memory region, we have to search all the PTEs in that memory region for that page; this happens for non-linear memory mappings. for a particular memory region we can always get at the PTEs, because we have its beginning and ending addresses, so it is easy to query the page table structures to view its current state.
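
computing that linear address is just arithmetic on the offsets noted above; a sketch after vma_address() in mm/rmap.c (assuming PAGE_CACHE_SIZE == PAGE_SIZE):

    #include <linux/mm.h>

    /* linear address of a page inside one vma: page->index is the
     * page's offset in the mapping, vm_pgoff is the vma's offset */
    static unsigned long vma_address_sketch(struct page *page,
                                            struct vm_area_struct *vma)
    {
            pgoff_t pgoff = page->index;
            unsigned long address;

            address = vma->vm_start +
                      ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
            if (address < vma->vm_start || address >= vma->vm_end)
                    return -EFAULT; /* e.g. a non-linear mapping */
            return address;
    }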

A page might have different linear addresses depending on the memory region it is mapped into. to find the PTE, we need the PGD and the linear address. whenever thinking about a page mapped in memory, think about both its linear address and its physical address.
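
given the PGD and a linear address, the descent to the PTE is mechanical; a minimal sketch using the standard helpers (four-level layout of the era):

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    /* walk pgd -> pud -> pmd to the pte of one linear address;
     * the caller must pte_unmap() the result */
    static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
    {
            pgd_t *pgd = pgd_offset(mm, addr);
            pud_t *pud;
            pmd_t *pmd;

            if (pgd_none(*pgd) || pgd_bad(*pgd))
                    return NULL;
            pud = pud_offset(pgd, addr);
            if (pud_none(*pud) || pud_bad(*pud))
                    return NULL;
            pmd = pmd_offset(pud, addr);
            if (pmd_none(*pmd) || pmd_bad(*pmd))
                    return NULL;
            return pte_offset_map(pmd, addr);
    }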

[pitfalls]
mremap() may crash the party by directly modifying page table entries.
if the PTE says the page has been accessed, unmapping won't take place, as that page is considered in-use.
locked/reserved memory regions can also nullify a remapping effort.

file address space desc:


Friday, March 18, 2011

Page Reclamation II (Policy)

There are four levels of page activities: AR (00,01,10,11) [A=PG_Active, R=PG_Referenced page flag]

the page reference flag is cleared each time an activity check is performed. there are two types of functionality here: activity checks and moving pages between the lists.

mark_page_accessed() pushes pages towards AR=11 and page_referenced() pushes them towards AR=00. ironically, page_referenced() should have been called page_ref_killed()... :))
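
the push towards AR=11 is literally a two-step ladder; this is essentially mark_page_accessed() from 2.6-era mm/swap.c, commented:

    /* inactive,unreferenced -> inactive,referenced
     * inactive,referenced   -> active,unreferenced
     * active,unreferenced   -> active,referenced */
    void mark_page_accessed(struct page *page)
    {
            if (!PageActive(page) && PageReferenced(page) &&
                PageLRU(page)) {
                    activate_page(page);       /* move to active list */
                    ClearPageReferenced(page);
            } else if (!PageReferenced(page)) {
                    SetPageReferenced(page);
            }
    }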

page_referenced() additionally tells us how many references to this page have been made since the last check.

the swap token overrides the resetting of the PG_referenced bit and keeps it set, so that the token-holding process does not suffer under heavy swapping pressure.

shrink_active_list() moves some pages from the active LRU list to the inactive list; shrink_inactive_list() swaps out some pages from the inactive LRU list.

the active and inactive lists are protected by the spinlock zone->lru_lock.

PG_lru is set only while a page is on an LRU list. the scan code moves pages off the LRU list onto a local list, so that zone->lru_lock is not held for a long time.
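
a sketch of that local-list trick, loosely after the 2.6-era shrink_cache()/shrink_inactive_list() in mm/vmscan.c (isolate_lru_pages() is internal to that file; batching and accounting details omitted):

    #include <linux/mm.h>
    #include <linux/swap.h>

    /* detach a batch of pages under the zone lock, then work on
     * the batch with the lock dropped */
    static void scan_batch_sketch(struct zone *zone, int nr_to_scan)
    {
            LIST_HEAD(page_list);           /* local, private list */
            int nr_scanned, nr_taken;

            spin_lock_irq(&zone->lru_lock);
            nr_taken = isolate_lru_pages(nr_to_scan,
                                         &zone->inactive_list,
                                         &page_list, &nr_scanned);
            zone->nr_inactive -= nr_taken;
            spin_unlock_irq(&zone->lru_lock);

            /* pages on page_list now have PG_lru cleared; the actual
             * reclaim work proceeds here without zone->lru_lock */
    }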





Thursday, March 17, 2011

Source Files Used in Page Reclamation

generic:
pagevec.h -- page vectors


functional:
mm/swap.c -- LRU, activate_page(), mark_page_accessed()
mm/rmap.c -- page unmapping, page_referenced()
mm/vmscan.c -- isolate_lru_pages()

Page Usage Counter

increments:
page_cache_get() in lru_cache_add() -- because this page is now in the LRU cache

decrements:
try_to_unmap_one() -- because a process stopped using this page after a successful unmap

Locks and Sequential Processing in Swapping

lru_cache_add() -- disable preemption
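
the preemption point comes from the per-CPU pagevec; this is essentially lru_cache_add() from early-2.6 mm/swap.c, commented:

    /* pages are staged in a per-cpu pagevec and added to the LRU
     * in batches, taking zone->lru_lock once per batch */
    void lru_cache_add(struct page *page)
    {
            /* get_cpu_var() disables preemption */
            struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);

            page_cache_get(page);            /* pin while staged */
            if (!pagevec_add(pvec, page))    /* 0 => pvec is full */
                    __pagevec_lru_add(pvec); /* drain under lru_lock */
            put_cpu_var(lru_add_pvecs);      /* re-enable preemption */
    }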

Page Map Counter

_mapcount holds the reverse mapping info. its initial value is -1; it becomes 0 when the page is inserted into the reverse mapping data structure and is incremented by 1 for each additional user. so, while a page sits in the buddy allocator, its mapcount is -1.
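
because of the -1 bias, the reader helper in include/linux/mm.h adds one back:

    /* number of places this page is mapped; 0 means unmapped */
    static inline int page_mapcount(struct page *page)
    {
            return atomic_read(&page->_mapcount) + 1;
    }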

Increments:
page_add_anon_rmap(), page_add_file_rmap() -- because a new PTE now maps this page

Decrements:
page_remove_rmap() -- because a PTE mapping this page was torn down


Page Flags

Page Reclaim and Swapping:
PG_lru is set when the page is moved onto an LRU list; if it is an active page, the PG_active bit is additionally set.

Page activity determination:
PG_referenced and PG_active bits

Buddy:

Slab:

Files:


Wednesday, March 16, 2011

discussion on rmap.c

* Unlike common anonymous pages, anonymous hugepages have no accounting code and no lru code, because we handle hugepages differently from common pages.

* anon vma does not point to pages but pages point to anon vma.

these are all in-use pages:
rmap code extensively uses the address spaces (file/anon) to extract mapping info about pages and vmas. primarily, pages are mapped into vmas, through the page tables.
secondarily, they are mapped in the file/anon address spaces too.
third, swapping uses the LRU lists of pages.
the mapping count is also used extensively in these files.

the swapping subsystem depends on rmap extensively, because swapping needs to unmap (and later remap) pages on the fly.

two types of walking are involved in rmap: file and anon. the address space must be locked during the walk. this locking will have scalability problems, since a locked code block runs sequentially no matter how many cores are available.
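
the split between the two walks is visible at the top of the unmap path; a sketch after try_to_unmap() in 2.6-era mm/rmap.c (the two callees are internal to that file):

    #include <linux/rmap.h>

    /* dispatch on the mapping type: anon pages walk the anon_vma
     * list (under anon_vma->lock), file pages walk the address
     * space's priority tree (under i_mmap_lock) */
    static int try_to_unmap_sketch(struct page *page)
    {
            if (PageAnon(page))
                    return try_to_unmap_anon(page);
            if (page->mapping)
                    return try_to_unmap_file(page);
            return SWAP_FAIL;
    }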



Monday, March 7, 2011

Writing Style

After some time, the swap token is passed to some other process that also undergoes swapping and requires memory more exigent than the current token holder.

Thanks to this extension, the algorithm does take minimum account of whether pages are used frequently or not, but does not deliver the performance expected of state-of-the-art memory management.

How can the kernel mark or sort pages as simply as possible in order to estimate access frequency without requiring an inordinate amount of time to organize data structures?

Hardly any part of the kernel entails as many technical difficulties as the virtual memory subsystem.

As I have demonstrated, not only are the technical details of paramount importance in achieving an efficient and powerful implementation of the swap system, but the design of the overall system must also support the best possible interaction between the various components to ensure that actions are performed smoothly and harmoniously.

This is combined with a ‘‘least recently used’’ method to realize page reclaim policy.





Page Reclamation I (Basics)

The swapping subsystem has a few components:
1. Activation of page selection (periodic/explicit)
2. Page reclaim (policy)
3. Reverse Mapping
4. Swap cache
5. Writeback mechanism


a little distantly related: fault on a swapped page (notes)

code pages are file pages, and they are kept track of by the file's address space object. the difference between a code page and a data page is not the backing file, but whether the page is editable in memory. if it is not editable, it is (most probably) a code page; if it is editable, it is a data page. an editable data page is synchronized with the backing block device during swapping, but non-editable code pages are simply discarded during swapping (they can be re-read from the file later).

the process heap and stack consist of anonymous pages, because they have no backing store on a block device. a memory area can be mapped anonymously using mmap too. these anon pages are swapped out to the swap area.

a private mapping is an exception: in this case the data originates from a backing file, but the private mapping decouples the (modified) data from the file, so privately mapped pages are swapped out to the swap area.

IPC shared pages are also swapped to swap area.

the PFRA is all about identifying which data is more important than the rest; less important data is swapped out to the swap area if required. from the viewpoint of a process, the OS needs to find the working set of that process. there might be a read working set (PCM-based) and a write working set (DRAM-based).

PFRA needs to be 1) fair and 2) scalable

there is the idea of a swap token, which makes a process immune to swapping: the OS tries not to swap out that process's pages, thereby letting it run faster.

there can be multiple swap areas with different priorities depending on their relative speed. to keep track of the slots in a swap area, the kernel uses an array of per-slot usage counters (swap_map). all swap files/areas are called swap files in kernel terminology.

each zone has an active and an inactive list of pages.

shrink_zone() is the single function that periodically moves pages from the active to the inactive lists. a page is first regarded as inactive; it must earn some importance to be considered active.
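
a rough sketch of its shape (function names as in the era's mm/vmscan.c; both are internal there, and the arguments are simplified):

    #include <linux/swap.h>

    /* demote some active pages, then reclaim from the inactive list */
    static void shrink_zone_sketch(struct zone *zone,
                                   struct scan_control *sc)
    {
            unsigned long nr_to_scan = SWAP_CLUSTER_MAX;

            /* active -> inactive: clears PG_active on the movers */
            shrink_active_list(nr_to_scan, zone, sc);
            /* try to unmap, write back, and free inactive pages */
            shrink_inactive_list(nr_to_scan, zone, sc);
    }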



*** Thoughts and Ideas ***

* finding out read working set and write working set of a process... read working set can reside in PCM and write working set can reside in DRAM.


***
* a page uses two flags to keep track of LRU state. can't this technique be used for my work too, to clear tested flags automatically? testing closely resembles the existing swapping technique. the scalability issue can also be resolved based on the ideas acquired in this work.

scalability improvement: each CPU has a pagevec (pvec) in which LRU pages are batched before being added to the central LRU list (see the lru_cache_add() sketch above).

task: look into the Linux scalability analysis paper to get some more ideas.