Thursday, March 17, 2011

Page Flags

Page Reclaim and Swapping:
PG_lru is set when the page is moved onto an LRU list; if it is an active page, the PG_active bit is set as well.

Page activity determination:
PG_referenced and PG_active bits

Buddy:

Slab:

Files:


Wednesday, March 16, 2011

discussion on rmap.c

* Unlike common anonymous pages, anonymous hugepages have no accounting code and no lru code, because we handle hugepages differently from common pages.

* An anon_vma does not point to its pages; rather, pages point to their anon_vma.

these are all in-use pages:
The rmap code makes extensive use of address spaces (file- or anon-backed) to extract mapping information about pages and VMAs. Primarily, pages are mapped into VMAs via page tables; secondarily, they are tracked in file/anon address spaces; third, swapping uses the LRU lists of pages. The page mapping count is also used extensively in these files.

The swapping subsystem depends on rmap extensively, because swapping needs to unmap and remap pages on the fly.

Two types of rmap walk are involved: file-backed and anonymous. The address space must be locked during the walk. This locking will have scalability problems, since a lock serializes the code block regardless of how many cores are available.



Monday, March 7, 2011

Writing Style

After some time, the swap token is passed to another process that is also undergoing swapping and whose need for memory is more exigent than the current token holder's.

Thanks to this extension, the algorithm does take minimum account of whether pages are used frequently or not, but does not deliver the performance expected of state-of-the-art memory management.

How can the kernel mark or sort pages as simply as possible in order to estimate access frequency without requiring an inordinate amount of time to organize data structures?

Hardly any part of the kernel entails as many technical difficulties as the virtual memory subsystem.

As I have demonstrated, not only are the technical details of paramount importance in achieving an efficient and powerful implementation of the swap system, but the design of the overall system must also support the best possible interaction between the various components to ensure that actions are performed smoothly and harmoniously.

This is combined with a ‘‘least recently used’’ method to realize page reclaim policy.





Page Reclamation I (Basics)

The swapping subsystem has a few components:
1. Activation of page selection (periodic/explicit)
2. Page reclaim (policy)
3. Reverse Mapping
4. Swap cache
5. Writeback mechanism


a little distantly related: faulting on a swapped-out page (notes)

Code pages are file pages and are tracked by the file's address space object. The difference between a code page and a data page is not the backing file but whether the page is modifiable in memory. If it is not modifiable, it is (most probably) a code page; if it is modifiable, it is a data page. A modified data page must be synchronized with the backing block device during swapping, but non-modifiable code pages are simply discarded, since they can be re-read from the file later.

Process heap and stack pages are anonymous pages because they have no backing store on a block device. A memory area can be mapped anonymously with mmap() too. These anon pages are swapped out to a swap area.

Private mappings are an exception: here the data does come from a backing file, but the private mapping decouples the data from the file, so privately mapped pages are swapped out to a swap area.

IPC shared pages are also swapped to swap area.

The PFRA is all about identifying which data is more important than the rest; less important data is swapped out to a swap area if required. From the viewpoint of a process, the OS needs to find that process's working set. There might be a read working set (PCM-based) and a write working set (DRAM-based).

PFRA needs to be 1) fair and 2) scalable

There is the idea of a swap token, which makes a process immune to swapping: the OS tries not to swap out that process's pages, thereby making the process run faster.

There can be multiple swap areas with different priorities depending on their relative speed. To keep track of the slots in a swap area, the kernel uses a bitmap. All swap files/partitions are called swap files in kernel terminology.

each zone has an active and an inactive list of pages.

shrink_zone() is one function that periodically moves pages from the active to the inactive list. A page is first regarded as inactive and must earn some importance before it is considered active.



















*** Thoughts and Ideas ***

* finding out read working set and write working set of a process... read working set can reside in PCM and write working set can reside in DRAM.
*


***
* The page structure uses two flags to keep track of LRU pages. Can't this technique be used in my work too, to clear tested flags automatically? Testing closely resembles the existing swapping technique. The scalability issue can also be resolved based on ideas acquired from this work.

scalability improvement: each CPU has a list of pvecs through which LRU pages are added in batches to the central LRU list.

task: look into the Linux scalability analysis paper to get some more ideas.







Tuesday, December 14, 2010

using red-black tree

http://lwn.net/Articles/184495/

#include <linux/rbtree.h>

we need an rbtree root of type struct rb_root:

[looking for an entry]
struct intv_obj *intv = NULL;
struct rb_node *node = chnl->rb_root.rb_node;

while (node) {
	struct intv_obj *intv_tmp;

	intv_tmp = rb_entry(node, struct intv_obj, rb_node);
	if (time < intv_tmp->end) {
		intv = intv_tmp;
		if (intv_tmp->start <= time)
			break;          /* time falls inside this interval: done */
		node = node->rb_left;
	} else {
		node = node->rb_right;
	}
}
return intv;

[inserting an entry]
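A sketch of insertion using the standard rb_link_node()/rb_insert_color() pattern from the LWN article. The intv_obj/chnl names follow the lookup snippet above and ordering by start is my assumption; this is kernel-side code and will not compile in userspace.

```c
struct rb_node **link = &chnl->rb_root.rb_node;
struct rb_node *parent = NULL;

/* Walk down to the leaf position where the new node belongs. */
while (*link) {
	struct intv_obj *intv_tmp;

	parent = *link;
	intv_tmp = rb_entry(parent, struct intv_obj, rb_node);
	if (new->start < intv_tmp->start)
		link = &parent->rb_left;
	else
		link = &parent->rb_right;
}

rb_link_node(&new->rb_node, parent, link);       /* splice in as a leaf */
rb_insert_color(&new->rb_node, &chnl->rb_root);  /* rebalance/recolor */
```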

[deleting an entry]
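Deletion is just rb_erase() on the embedded node plus freeing the object; again a kernel-side sketch assuming the same structures as above:

```c
rb_erase(&intv->rb_node, &chnl->rb_root);
kfree(intv);    /* assuming the object was kmalloc()ed */
```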

Friday, November 19, 2010

Huge Pages in Linux

ref: http://lwn.net/Articles/374424/

Some useful formulas for the TLB miss penalty are given in Part 1. Everyone thinks about fitting app data and kernel data inside the CPU cache; this boosts performance a lot.

Database workloads gain about 2-7% performance from huge pages, whereas the gain for scientific workloads can range between 1% and 45%.

In the initial support for huge pages on Linux, huge pages were faulted at the same time as mmap() was called. This guaranteed that all references would succeed for shared mappings once mmap() returned successfully. Private mappings were safe until fork() was called. Once called, it's important that the child call exec() as soon as possible or that the huge page mappings were marked MADV_DONTFORK with madvise() in advance. Otherwise, a Copy-On-Write (COW) fault could result in application failure by either parent or child in the event of allocation failure.