How does Linux determine the address of another process to execute with a syscall? Like in this example?

mov rax, 59
mov rdi, progName
syscall

It seems there is a bit of confusion about my question. To clarify, what I was asking is how `syscall` works, independently of the registers or arguments passed: how it knows where to jump, where to return, etc. when another process is called.

# syscall

The `syscall` instruction is really just an Intel/AMD CPU instruction. Here is the synopsis from the instruction reference:
```
IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
THEN #UD;
FI;
RCX ← RIP; (* Will contain address of next instruction *)
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH (* Operating system provides CS; RPL forced to 0 *)
(* Set rest of CS to a fixed value *)
CS.Base ← 0;
(* Flat segment *)
CS.Limit ← FFFFFH;
(* With 4-KByte granularity, implies a 4-GByte limit *)
CS.Type ← 11;
(* Execute/read code, accessed *)
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;
(* Entry is to 64-bit mode *)
CS.D ← 0;
(* Required if CS.L = 1 *)
CS.G ← 1;
(* 4-KByte granularity *)
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
(* SS just above CS *)
(* Set rest of SS to a fixed value *)
SS.Base ← 0;
(* Flat segment *)
SS.Limit ← FFFFFH;
(* With 4-KByte granularity, implies a 4-GByte limit *)
SS.Type ← 3;
(* Read/write data, accessed *)
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;
(* 32-bit stack segment *)
SS.G ← 1;
(* 4-KByte granularity *)
```
The most important parts are the two operations that save and replace the `RIP` register:

RCX ← RIP
RIP ← IA32_LSTAR

In other words, there must be code at the address stored in `IA32_LSTAR` (a model-specific register that the kernel sets up at boot), and `RCX` holds the return address.

The `CS` and `SS` segments are also set so that the kernel code runs at privilege level 0 (CPL 0, the most privileged level).

A `#UD` (invalid-opcode) exception is raised if the CPU is not in 64-bit mode or if `syscall` has not been enabled in `IA32_EFER.SCE`, as the first lines of the synopsis show. It also happens on CPUs that do not implement the instruction.

# How is `RAX` interpreted?

This is just an index into a table of kernel function pointers. First the kernel does a bounds-check (and returns -ENOSYS if `RAX > __NR_syscall_max`), then dispatches to (C syntax) `sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);`

```
; Intel-syntax translation of Linux 4.12 syscall entry point
... ; save user-space registers etc.
call [sys_call_table + rax * 8] ; dispatch to sys_execve() or whatever kernel C function

;;; execve probably won't return via this path, but most other calls will
... ; restore registers except RAX return value, and return to user-space
```

Modern Linux is more complicated in practice because of mitigations for x86 vulnerabilities like Meltdown and L1TF: the kernel changes the page tables so that most of kernel memory isn't mapped while user-space is running. The above code is a literal translation (from AT&T syntax) of `call *sys_call_table(, %rax, 8)` from `ENTRY(entry_SYSCALL_64)` in [Linux 4.12 arch/x86/entry/entry_64.S][1] (before Spectre/Meltdown mitigations were added). Also related:

[To see links please register here]

has some more details about the kernel side of system-call dispatching.

# Fast?

The instruction is said to be _fast_. This is because in the old days one would have to use a software interrupt such as `int 0x80`. An interrupt switches to the kernel stack, pushes several registers onto it, and the handler has to use the rather slow `IRET` to leave the exception state and return to the address just after the interrupt. This is generally much slower.

With `syscall`, most of that overhead is avoided. However, for what you're asking, this is not really relevant.

Another instruction which is used alongside `syscall` is `swapgs`. Since `syscall` does not switch stacks by itself, `swapgs` gives the kernel a way to reach its own data and, from there, its own stack. You should look at the Intel/AMD documentation about those instructions for more details.

# New Process?

The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.

When you create a new process, Linux creates a task. For that to work, it runs code which does things such as:

* Make sure the executable exists
* Set up a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly-created virtual address space)
* Allocate a stack buffer
* Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to
* Set up the start address in the task (the ELF entry point from the executable)
* Mark the task as ready to run (Linux calls this state "running")

This is, of course, super simplified.

The start address is defined in your ELF binary. The kernel really only needs to determine that one address, save it as the task's current `RIP`, and "return" to user-space. The normal demand-paging mechanism takes care of the rest: if the code is not yet loaded, the CPU raises a #PF page-fault exception and the kernel loads the necessary code at that point. In most cases, though, the loader will already have loaded some part of the program as an optimization to avoid that initial page fault.

(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)

All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.

(Editor's note: the MMU isn't really a coprocessor; in modern CPUs the virtual memory logic is tightly integrated into each core, alongside the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)

# Determine the Address?

So, now we understand that all processes are loaded at the same virtual address (0x400000 is the default chosen by `ld` for non-PIE executables on x86-64 Linux). The MMU translates it to the real physical address. How does the kernel decide on that physical address? Well, it has a memory allocation function. It's that simple.

It calls a "malloc()" type of function which searches for a memory block which is not currently used and creates (a.k.a. loads) the process at that location. If no memory block is currently available, the kernel checks for swapping something out of memory. If that fails, the creation of the process fails.

When creating a process, it will allocate pretty large blocks of memory to start with. It is not unusual to allocate 1 MB or 2 MB buffers to start a new process. This makes things go a lot faster.

Also, if the program is already running and you start it again, a lot of the memory used by the running instance can be reused. In that case the kernel does not allocate or load those parts again. It uses the MMU to share the pages that can be made common to both instances. In most cases the code can be shared since it is read-only, and data can be shared when it is also marked read-only. Data that is not marked read-only can still be shared as long as it hasn't been modified yet; in this case it's marked as _copy on write_.

[1]:

[To see links please register here]
