/* Make a thread the running thread.  The thread must previously have
   been sleeping, and must not be holding the CPU semaphore.  This
   will set the
   thread state to VgTs_Runnable, and the thread will attempt to take
   the CPU semaphore.  By the time it returns, tid will be the running
   thread. */
extern void VG_(set_running) ( ThreadId tid );

/* Set a thread into a sleeping state.  Before the call, the thread
   must be runnable, and holding the CPU semaphore.  When this call
   returns, the thread will be set to the specified sleeping state,
   and will not be holding the CPU semaphore.  Note that another
   thread could be running by the time this call returns, so the
   caller must be careful not to touch any shared state.  It is also
   the caller's responsibility to actually block until the thread is
   ready to run again. */
extern void VG_(set_sleeping) ( ThreadId tid, ThreadStatus state );
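
A sketch of the intended call pattern (not a verbatim excerpt from
the sources): a thread gives up the CPU around a blocking operation,
using VgTs_WaitSys as an example sleeping state.

   VG_(set_sleeping)(tid, VgTs_WaitSys);  /* drop run_sema */
   /* ... block in the kernel; another thread may be running now,
      so don't touch any shared state here ... */
   VG_(set_running)(tid);                 /* re-acquire run_sema */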


The master semaphore is run_sema in vg_scheduler.c.


(what happens at a fork?)

VG_(scheduler_init) registers sched_fork_cleanup as a child atfork
handler.  sched_fork_cleanup, among other things, reinitializes the
semaphore with a new pipe so the process has its own.
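
A self-contained sketch (not Valgrind's actual code) of a pipe-based
semaphore like run_sema, and the kind of reinitialization a child
atfork handler has to do; all names here are illustrative:

   #include <unistd.h>

   static int sema_fds[2];

   static void sema_init(void)
   {
      char token = 's';
      pipe(sema_fds);
      write(sema_fds[1], &token, 1);  /* one token: one running thread */
   }

   static void sema_down(void)  /* acquire: blocks until the token is read */
   {
      char token;
      read(sema_fds[0], &token, 1);
   }

   static void sema_up(void)    /* release: put the token back */
   {
      char token = 's';
      write(sema_fds[1], &token, 1);
   }

   /* child-side atfork handler: throw away the pipe shared with the
      parent and build a private one, as sched_fork_cleanup does */
   static void fork_child_cleanup(void)
   {
      close(sema_fds[0]);
      close(sema_fds[1]);
      sema_init();
   }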

--------------------------------------------------------------------

Re:   New World signal handling
From: Jeremy Fitzhardinge <jeremy@goop.org>
To:   Julian Seward <jseward@acm.org>
Date: Mon Mar 14 09:03:51 2005

Well, the big-picture things to be clear about are:

   1. signal handlers are process-wide global state
   2. signal masks are per-thread (there's no notion of a process-wide
      signal mask)
   3. a signal can be targeted to either
         1. the whole process (any eligible thread is picked for
            delivery), or
         2. a specific thread

1 is why it is always a bug to temporarily reset a signal handler (say,
for SIGSEGV), because if any other thread happens to be sent one in that
window it will cause havoc (I think there's still one instance of this
in the symtab stuff).
2 is the meat of your questions; more below.
3 is responsible for some of the nitty detail in the signal stuff, so
it's worth bearing in mind to understand it all.  (Note that even if a
signal is targeted at the whole process, it's only ever delivered to
one particular thread; there's no such thing as a broadcast signal.)

While a thread is running core code or generated code, it has almost
all its signals blocked (all but the fault signals: SEGV, BUS, ILL, etc).

Every N basic blocks, each thread calls VG_(poll_signals) to see what
signals are pending for it.  poll_signals grabs the next pending signal
which the client signal mask doesn't block, and sets it up for delivery;
it uses the sigtimedwait() syscall to fetch blocked pending signals
rather than have them delivered to a signal handler.   This means that
we avoid the complexity of having signals delivered asynchronously via
the signal handlers; we can just poll for them synchronously when
they're easy to deal with.
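
A sketch of the polling idea (everything except sigtimedwait() itself
is illustrative): a zero timeout turns sigtimedwait() into a pure
poll for pending signals, with no handler involved.

   #include <signal.h>
   #include <time.h>

   /* Returns the number of one pending signal from *wanted (the set
      the client's mask doesn't block), filling in *info; returns -1
      with errno==EAGAIN if none is pending. */
   static int poll_one_signal(const sigset_t *wanted, siginfo_t *info)
   {
      struct timespec zero = { 0, 0 };
      return sigtimedwait(wanted, info, &zero);
   }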

Fault signals, being caused by a specific instruction, are the exception
because they can't be held off; if they're blocked when an instruction
raises one, the kernel will just summarily kill the process.  Therefore,
they always need to be unblocked, and the signal handler is called when
an instruction raises one of these exceptions. (It's also necessary to
call poll_signals after any syscall which may raise a signal, since
signal-raising syscalls are considered to be synchronous with respect to
their signal; ie, calling kill(getpid(), SIGUSR1) will call the handler
for SIGUSR1 before kill is seen to complete.)
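
The kill(getpid(), SIGUSR1) behaviour is easy to demonstrate
stand-alone (single-threaded case):

   #include <assert.h>
   #include <signal.h>
   #include <unistd.h>

   static volatile sig_atomic_t got_usr1 = 0;
   static void on_usr1(int sig) { got_usr1 = 1; }

   int main(void)
   {
      signal(SIGUSR1, on_usr1);
      kill(getpid(), SIGUSR1);  /* handler runs before kill() returns */
      assert(got_usr1 == 1);
      return 0;
   }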

The one time when the thread's real signal mask actually matches the
client's requested signal mask is while running a blocking syscall.  We
have to set things up to accept signals during a syscall so that we get
the right signal-interrupts-syscall semantics.  The tricky part about
this is that there's no general atomic
set-signal-mask-and-block-in-syscall mechanism, so we need to fake it
with the stuff in VGA_(client_syscall)/VGA_(interrupted_syscall).
These two basically form an explicit state machine, where the state
variable is the instruction pointer, which allows it to determine what
point the syscall got to when the async signal happens.  By keeping the
window where signals are actually unblocked very narrow, the number of
possible states is pretty small.
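
Roughly, the faked atomicity looks like this (a sketch with
illustrative names; the real code is assembly, so the handler knows
the exact instruction boundaries of each state):

   #include <signal.h>

   extern sigset_t client_mask, all_blocked;
   extern long do_syscall(long sysno, const long *args);

   static long blocking_syscall(long sysno, const long *args)
   {
      long res;
      sigprocmask(SIG_SETMASK, &client_mask, NULL); /* state 1: before */
      res = do_syscall(sysno, args);                /* state 2: during */
      sigprocmask(SIG_SETMASK, &all_blocked, NULL); /* state 3: after  */
      return res;
      /* If a signal lands anywhere in this window, the handler
         (VGA_(interrupted_syscall)) looks at the saved IP to decide
         whether the syscall hadn't started, was interrupted, or had
         completed, and fixes up the thread state accordingly. */
   }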

This is all quite nice because the kernel does almost all the work of
determining which thread should get a signal, what the correct action
for a syscall when it has been interrupted is, etc.  Particularly nice
is that we don't need to worry about all the queuing semantics, and the
per-signal special cases (which is, roughly, signals 1-32 are not queued
except when they are, and signals 33-64 are queued except when they aren't).

BUT, there's another complexity: because the Unix signal mechanism has
been overloaded to deal with two separate kinds of events (asynchronous
signals raised by kill(), and synchronous faults raised by an
instruction), we can't block a signal for one form and not the other. 
That is, because we have to leave SIGSEGV unblocked for faulting
instructions, it also leaves us open to getting an async SIGSEGV sent
with kill(pid, SIGSEGV). 

To handle this case, there's a small per-thread signal queue (I'm
using tid 0's queue for "signals sent to the whole process" - a hack,
I'll admit).  If an async SIGSEGV (etc) signal appears, then it is
pushed onto the appropriate queue.  VG_(poll_signals) also checks
these queues for pending signals to decide what signal to deliver
next.  These queues are only manipulated with *all* signals blocked,
so there's no risk of two concurrent async signal handlers modifying
the queues at once.  Also, because the likelihood of actually being
sent an async SIGSEGV is pretty low, the queues are only allocated on
demand.
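
A sketch of that queue (layout and names are illustrative, not the
actual ones):

   #include <signal.h>
   #include <stdlib.h>

   #define N_THREADS  256   /* illustrative */
   #define QUEUE_LEN  8

   typedef struct {
      siginfo_t sigs[QUEUE_LEN];
      int       n;
   } SigQueue;

   /* entry 0 holds signals sent to the whole process */
   static SigQueue *sig_queues[N_THREADS];

   /* Only ever called with *all* signals blocked, so two handlers
      can never race on a queue. */
   static void push_signal(int tid, const siginfo_t *info)
   {
      SigQueue *q = sig_queues[tid];
      if (q == NULL)   /* allocated on demand: async SIGSEGVs are rare */
         q = sig_queues[tid] = calloc(1, sizeof(SigQueue));
      if (q->n < QUEUE_LEN)
         q->sigs[q->n++] = *info;
   }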



There are two mechanisms to prevent disaster if multiple threads get
signals concurrently.  One is that a signal handler is set up to block a
set of signals while the signal is being delivered.  Valgrind's handlers
block all signals, so there's no risk of a new signal being delivered to
the same thread until the old handler has finished.
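
That per-delivery blocking is just the sa_mask field; a handler
installed along these lines (a generic sketch, not the core's actual
installation code) cannot be preempted by any other signal:

   #include <signal.h>
   #include <string.h>

   static void install_handler(int signo,
                               void (*fn)(int, siginfo_t *, void *))
   {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_sigaction = fn;
      sa.sa_flags = SA_SIGINFO;
      sigfillset(&sa.sa_mask);  /* block everything while fn runs */
      sigaction(signo, &sa, NULL);
   }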

The other is that if the thread which receives the signal is not
running (ie, doesn't hold the run_sema, which implies it must be
waiting for a syscall to complete), then the signal handler will grab
the run_sema before making any global state changes.  Since the only
time we can actually take an async signal is during a blocking
syscall, this should be all the time.  (And since synchronous signals
are always the result of running an instruction, we should already be
holding run_sema.)


Valgrind will occasionally generate signals for itself.  These are
always synchronous faults, resulting either from instruction fetch or
from something an instruction did.  The two mechanisms are the
synth_fault_* functions, which are used to signal a problem while
fetching an instruction, and getting generated code to call a helper
which contains a fault-raising instruction (used to deal with
illegal/unimplemented instructions and with instructions whose only
job is to raise exceptions).
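
The second mechanism can be as simple as this (x86-specific sketch;
"ud2" is an instruction defined to raise an invalid-opcode fault):

   /* generated code calls this when it meets an instruction it
      cannot handle; the ud2 raises SIGILL at a known address */
   static void helper_raise_SIGILL(void)
   {
      __asm__ volatile ("ud2");
   }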

That all explains how signals come in, but the second part is how they
get delivered.

The main function for this is VG_(deliver_signal).  There are three cases:

   1. the process is ignoring the signal (SIG_IGN)
   2. the process is using the default handler (SIG_DFL)
   3. the process has a handler for the signal

In general, VG_(deliver_signal) shouldn't be called for ignored signals;
if it has been called, it assumes the ignore is being overridden (if an
instruction gets a SEGV etc, SIG_IGN is ignored and treated as SIG_DFL).

VG_(deliver_signal) handles the default handler case, and the
client-specified signal handler case.

The default handler case is relatively easy: the signal's default action
is either Terminate, or Ignore.  We can ignore Ignore.

Terminate always kills the entire process; there's no such thing as a
thread-specific signal death. Terminate comes in two forms: with
coredump, or without.  vg_default_action() will write a core file, and
then will tell all the threads to start terminating; it then longjmps
back to the current thread's scheduler loop.  The scheduler loop will
terminate immediately, and the master_tid thread will wait for all the
others to exit before shutting down the process (this is the same
mechanism as exit_group).

Delivering a signal to a client-side handler modifies the thread state
so that there's a signal frame on the stack, and the instruction
pointer is pointing to the handler.  The fiddly bit is that there are
two completely different signal frame formats: old and RT.  While in
theory the exact shape of these frames on stack is abstracted, there
are real programs which know exactly where various parts of the
structures are on stack (most notably, g++'s exception throwing code),
which is why there have to be two separate pieces of code, one for
each frame format.  Another
tricky case is dealing with the client stack running out/overflowing
while setting up the signal frame.
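
In outline, frame setup looks something like this (a very rough
sketch; all types and field names here are hypothetical, and the real
code builds two different frame layouts):

   typedef unsigned long Addr;
   typedef struct {
      Addr sp, ip;
      Addr stack_lowest_addr;
   } ThreadState;   /* illustrative */

   static void setup_client_frame(ThreadState *tst, void *handler,
                                  unsigned long frame_size)
   {
      Addr new_sp = (tst->sp - frame_size) & ~15UL; /* keep it aligned */
      if (new_sp < tst->stack_lowest_addr) {
         /* client stack overflow: fall back to default SIGSEGV
            handling instead of building the frame */
         return;
      }
      /* ... write the frame contents (saved context, return address
         pointing at the sigreturn trampoline, siginfo for RT frames)
         at new_sp ... */
      tst->sp = new_sp;
      tst->ip = (Addr)handler;
   }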

Signal return is also interesting.  There are two syscalls, sigreturn
and rt_sigreturn, which a signal handler will use to resume execution.
The client will call the right one for the frame it was passed, so the
core doesn't need to track that state.  The tricky part is moving the
frame's register state back into the thread's state, particularly all
the FPU state reformatting gunk.  Also, *sigreturn checks for new
pending signals after the old frame has been cleaned up, since there's a
requirement that all deliverable pending signals are delivered before
the mainline code makes progress.  This means that a program could
live-lock on signals, but that's what would happen running natively...

Another thing to watch for: programs which unwind the stack (like gdb,
or exception throwers) recognize the existence of a signal frame by
looking at the code the return address points to: if it is one of the
two specific signal return sequences, it knows it's a signal frame.
That's why the signal handler return address must point to a very
specific set of instructions.
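
For reference, the sequences in question on i386 Linux are (bytes are
arch-specific; other platforms have their own equivalents):

   /* old-style frame: popl %eax; movl $__NR_sigreturn,%eax; int $0x80 */
   static const unsigned char sigreturn_seq[] =
      { 0x58, 0xb8, 0x77, 0x00, 0x00, 0x00, 0xcd, 0x80 };

   /* RT frame: movl $__NR_rt_sigreturn,%eax; int $0x80 */
   static const unsigned char rt_sigreturn_seq[] =
      { 0xb8, 0xad, 0x00, 0x00, 0x00, 0xcd, 0x80 };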


What else.  Ah, the two internal signals.

SIGVGKILL is pretty straightforward: it's just used to dislodge a thread
from being blocked in a syscall, so that we can get the thread to
terminate in a timely fashion.

SIGVGCHLD is used by a thread to tell the master_tid that it has
exited.  However, the only time the master_tid cares about this is when
it has already exited, and it's waiting for everyone else to exit.  If
the master_tid hasn't exited, then this signal is ignored.  It isn't
enough to simply block it, because that will cause a pile of queued
SIGVGCHLDs to build up, eventually clogging the kernel's signal delivery
mechanism.  If it's unblocked and ignored, it doesn't interrupt syscalls
and it doesn't accumulate.


I hope that helps clarify things.  And explain why there's so much stuff
in there: it's tracking a very complex and arcane underlying set of
machinery.

    J

--------------------------------------------------------------------

>I've been seeing references to 'master thread' around the place.
>What distinguishes the master thread from the rest?  Where does
>the requirement to have a master thread come from?
>
It used to be tid 1, but I had to generalize it.

The master_tid isn't very special; its main job is at process shutdown. 
It waits for all the other threads to exit, and then produces all the
final reports. Until it exits, it's just a normal thread, with no other
responsibilities.

The alternative to having a master thread would be to make whichever
thread exits last be responsible for emitting all the output.  That
would work, but it would make the results a bit asynchronous (that is,
if the main thread exits and the others hang around for a while, anyone
waiting on the process would see it as having exited, but no results
would have been produced).

VG_(master_tid) is a variable to handle the case where a threaded program
forks.  In the first process, the master_tid will be 1.  If that program
creates a few threads, and then, say, thread 3 forks, the child process
will have a single thread in it.  In the child, master_tid will be 3. 
It was easier to make the master thread a variable than to try to work
out how to rename thread 3 to 1 after a fork.

    J

--------------------------------------------------------------------

Re:   Fwd: Documentation of kernel's signal routing ?
From: David Woodhouse <...>
To:   Julian Seward <jseward@acm.org>

> Regarding sys_clone created threads.  I have a vague idea that 
> there is a notion of 'thread group'.  I further understand that if 
> one thread in a group calls sys_exit_group then all threads in that
> group exit.  Whereas if a thread calls sys_exit then just that
> thread exits.
> 
> I'm pretty hazy on this:

Hmm, so am I :)

> * Is the above correct?

Yes, I believe so.

> * How is thread-group membership defined/changed?

By specifying CLONE_THREAD in the flags to clone(), you remain part of
the same thread group as the parent. In a single-threaded process, the
thread group id (tgid) is the same as the pid. 

Linux just has tasks, which sometimes happen to share VM -- and now with
NPTL we also share other stuff like signals, etc. The 'pid' in Linux is
what POSIX would call the 'thread id', and the 'tgid' in Linux is
equivalent to the POSIX 'pid'.
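
This is easy to see from a small program: every thread reports the
same getpid() (the tgid) but a different thread id (fetched with
syscall(2), since glibc of this era has no gettid() wrapper):

   #include <pthread.h>
   #include <stdio.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   static void *show_ids(void *arg)
   {
      printf("tgid=%d tid=%ld\n",
             getpid(), (long)syscall(SYS_gettid));
      return NULL;
   }

   int main(void)
   {
      pthread_t t;
      show_ids(NULL);                           /* main thread   */
      pthread_create(&t, NULL, show_ids, NULL); /* second thread */
      pthread_join(t, NULL);
      return 0;
   }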

> * Do you know offhand how LinuxThreads and NPTL use thread groups?

I believe that LT doesn't use the kernel's concept of thread groups at
all. LT predates the kernel's support for proper POSIX-like sharing of
anything much but memory, so it uses only the CLONE_VM (and possibly
CLONE_FILES) flags. I don't _think_ it uses CLONE_SIGHAND -- it does
most of its work by propagating signals manually between threads.

NPTL uses thread groups as generated by the CLONE_THREAD flag, which is
what invokes the POSIX-related thread semantics.
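
In terms of clone() flags, the two styles described above boil down
to roughly this (simplified; real thread libraries pass more flags
for TLS, child-tid pointers and so on):

   #define _GNU_SOURCE
   #include <sched.h>

   /* LinuxThreads-style: each thread is its own thread group */
   #define LT_FLAGS    (CLONE_VM | CLONE_FILES)

   /* NPTL-style: all threads share one thread group.  Note the
      kernel requires CLONE_THREAD to come with CLONE_SIGHAND, and
      CLONE_SIGHAND to come with CLONE_VM. */
   #define NPTL_FLAGS  (CLONE_VM | CLONE_FILES | CLONE_FS | \
                        CLONE_SIGHAND | CLONE_THREAD)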

>   Is it the case that each LinuxThreads thread is in its own
>   group whereas all NPTL threads [in a process] are in a single
>   group?

Yes, that's my understanding.

-- 
dwmw2