Thursday, December 8, 2011

Buddy System (MM)

During initialization in memmap_init_zone(), all pages are marked as movable (MIGRATE_MOVABLE).

Each block of pages inside the buddy allocator has a head page. The head page has the PG_buddy flag set and page->private = order; the other pages do not have these bits set.

Compound Page (Huge TLB page):
A group of consecutive pages: a head page and tail pages. For each page, the PG_compound flag is set and page->private points to the head page. The first tail page stores the destructor free_compound_page and the compound order in its lru.next and lru.prev fields.

A free page can be in 2 places: 1. a buddy free list, or 2. a pcp list.
When a free page is in a buddy list and is the head page of a block, page->private = order.
When a free page is in a pcp list, page->private = migratetype.

The node and zone of a page are encoded in the upper bits of the page->flags field.

Hot-N-Cold PCP

[ref. PLKA p146, p183, ]

per_cpu_pageset structure: each zone has pageset[NR_CPUS].
NR_CPUS = 1 (uniprocessor), 2-32 (SMP32), 2-64 (SMP64)

per_cpu_pageset -> per_cpu_pages[2] ... 0:hot, 1:cold

per_cpu_pages { count, high, batch (chunk size for BA), list }
... low watermark: 0, high watermark: high (PLKAfig3.6)

the list of pages is kept track of via the page->lru field.

I will take the page from the back of the list, test it, and put it at the front of the list.

(Performance) I can keep a reserve of free pages to fill up the pcp quickly.

Free Pages Grab and Release

If you directly take a page out of the free list, you need to be careful about the following:

** pcp is protected by disabling local irqs (local_irq_disable, local_irq_enable).
** zone is protected by a zone-specific spin lock with irqs disabled (spin_lock_irqsave(&z->lock, flags), spin_unlock_irqrestore(&z->lock, flags)).

pcp
----
1. any time you grab a page from the pcp, you must call prep_new_page() on it.
2. when you grab a page block manually from bdyalloc, you need to either
... 2.a clear page->count before releasing that page back to the free list (set_page_count), or
... 2.b call prep_new_page() on that block just after you extract it.

bdyalloc
---------
1. when you take a page block out of bdyalloc, you need to clear the PG_buddy flag and set private = 0 on the first page of the block (rmv_page_order).
2. when you put a block of pages back into bdyalloc, you need to set the order (set_page_order).

remove pages: (you must call prep_new_page)
__rmqueue_smallest: works on the free lists of a zone (not the pcp)

__rmqueue_fallback: similar, but handles the migratetype more flexibly

rmqueue_bulk: zone, order, count ...

a little more user-friendly (no need to call prep_new_page):
buffered_rmqueue: works on both the pcp and the free lists (based on the order)

insert pages: (page count must be 0)
free_hot_cold_page: frees a page into the pcp list

Page Migration

the old page is locked, and the pte entries pointing to it are replaced with migration entries, on which a page fault will wait until the migration is finished.

the new page is locked and marked not-uptodate before the migration so that processes will wait on it until the migration ends.

unmap the old page, installing migration entries
replace the entry in the radix tree

if a page is locked, then processes trying to map that page into their page tables will wait on the lock. also, the page won't be swapped out.

there are two granularities of locking: 1. the vma lock, which is a virtual-level lock and does not correspond to any physical page, and 2. the page lock, which locks the physical page.

the page ref count means some process or kernel component is currently using or holding that physical page somehow (page_count).

the page map count means the page is mapped into some process page table (page_mapcount returns the stored _mapcount + 1).

map count <= ref count

if the physical page is in the swap cache, the private field contains the swap entry for that page.
this entry can be used to build the swap pte installed in the page table.

vma private is sometimes used to hold non-linear vma information.


page refcount set/increase/decrease

set
-----
page_alloc.c/__free_pages_bootmem() ... sets the count of the first page of a page block to 1
page_alloc.c/prep_new_page() ... sets the count of the first page of a page block to 1
page_alloc.c/split_page() ... sets the count of the first page of each resulting page block to 1

increase
--------------


decrease
---------------

Monday, November 21, 2011

Dissection of Page Fault Handler

__do_fault (memory.c, 3.1.1)

as this function creates a new mapping for a page that does not exist yet, TLB entries are not altered.

This function has two parts:
1. allocate a page and prepare it appropriately.
2. fix page tables to point to this page.

In the first if block we see there are three tasks:
1. prepare the anonymous vma
2. allocate the page
3. register the page with the memory control group
       if unable, release the page
in all three cases, return VM_FAULT_OOM on failure.

the COW page is allocated before any other processing because this reduces the time lock_page() is held on the page cache.

how to detect whether it is a COW request: FAULT_FLAG_WRITE is set and the vma is NOT shared. COW is used notably when a parent forks and a new process is created: the memory pages are not copied right away; instead, the virtual memory regions are marked not shareable and the page table entries are marked read-only, so the next page fault can detect the COW.

after fixing up the COW page, the function prepares the vmf structure (struct vm_fault).

next, the fs fault handler is invoked. by the time control reaches __do_fault, it has already been decided that fs code is involved.


...

the backing address space might want to know whether the page is about to become writable. the filesystem code implements this functionality; in that case, the vma->vm_ops->page_mkwrite function will be present.




Wednesday, March 23, 2011

Small Test Project Idea on Reverse Mapping?

Difficulty Levels:
[6 - very easy, 5 - easy, 4 - medium, 3 - hard, 2 - very hard, 1 - extremely hard]

P1. find out the mapcounts of each of the memory pages. -- L6
P2. find out the PTEs for a shared anonymous page (involves handling a simple anon region) -- L4
P3. find out the PTEs for a shared file page (involves handling the address space) -- L3
P4. find out the names of the processes that are sharing a page. -- L5 after P2/3 are done
P5. force a remap of a shared page -- L3
P6. measure the TLB effect using 'perf' -- L4

P7. find out the PGD, PMD, and PTEs for a given memory region descriptor -- L4
P8. find out the number of allocated page frames for a process (using its mm descriptor rss fields) -- L5

P9. Restrict memory for a process.