comp.mail.imap Discussion of IMAP-based mail systems.

UW 2004g on Linux fork (for mlock) problem

Post #1 by Nik Conwell, 2006-08-24 17:23
Subject: UW 2004g on Linux fork (for mlock) problem

We're running UW 2004g on a CentOS system (kernel 2.6.9-34.0.2.ELsmp),
with about 2500 users on the system. We use restrictive mailspools, thus
mlock for locking.

Occasionally (every couple of days) we see an imap that's stuck in
locking. The grandparent is in env_unix.c dotlock_lock, after EACCES.
The grandparent has forked the parent, and the parent has forked the
child, which execed mlock. See (1) for the processes.

The grandparent is stuck in a mutex. gdb unfortunately doesn't have
anything interesting as to where it is at. See (2) for the lack of
details.

The parent process is Z, wchan of exit, which implies it exited and
somebody needs to wait() on it.

The mlock process is running and in a read on the communication pipe.
I'm interpreting that as meaning it got the lock, told the grandparent
OK (+) and is waiting for the grandparent to do the work and then it
can relinquish the lock.

Anybody ever seen anything like this? I'm inclined to think it's a
kernel bug but I wanted to throw it against the imap newsgroup to see
if anything stuck.

Nik Conwell Boston University nik@bu.edu


(1) The processes look like this:

[grandparent] 4 S foobar 29199 23953 0 77 0 - 1307 322564 10:55 ? 00:00:00 /usr/sbin/imapd
[parent]      1 Z foobar 29202 29199 0 77 0 -    0 exit   10:55 ? 00:00:00 [imapd] <defunct>
[child]       0 S foobar 29203     1 0 79 0 -  370 pipe_w 10:55 ? 00:00:00 /usr/sbin/mlock 4 /mailspool/25/foobar


(2) gdb of grandparent process.

gdb /usr/sbin/imapd 29199
Attaching to program: /usr/sbin/imapd, process 29199
[...]
Reading symbols from /usr/lib/libc-client.so.2004g...Reading symbols
from /usr/lib/debug/usr/lib/libc-client.so.2004g.debug...done.
done.
Loaded symbols for /usr/lib/libc-client.so.2004g
[...]
0x001117a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0 0x001117a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x0050469e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2 0x00496aef in _L_mutex_lock_10230 () from /lib/tls/libc.so.6
#3 0x00000000 in ?? ()

Post #2 by Mark Crispin, 2006-08-24 18:02
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Thu, 24 Aug 2006, Nik Conwell wrote:
> The parent process is Z, wchan of exit, which implies it exited and
> somebody needs to wait() on it.


That's strange, because the grandparent supposedly already reaped it (the
call to grim_pid_reap()). In fact, it reaps it before reading the data
from the pipe to the child.

Offhand, it looks like this is what is happening:

The grandparent is waiting for the parent to terminate, and won't read
from the child's pipe until the wait happens. For some reason, rather
than the reap happening, it's stuck in some internal C library mutex.

The parent is terminated and waiting to be reaped.

The child is waiting for the grandparent to read from the pipe.


If you can figure out what that mutex is, and why the grandparent is stuck
in it instead of reaping the parent, you'll have the key to the entire
puzzle.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Post #3 by Nik Conwell, 2006-08-24 19:27
Subject: Re: UW 2004g on Linux fork (for mlock) problem

Mark Crispin wrote:

> The grandparent is waiting for the parent to terminate, and won't read
> from the child's pipe until the wait happens. For some reason, rather
> than the reap happening, it's stuck in some internal C library mutex.


Thanks for taking a look. I'll throw some debug syslogs in there to
figure out where it's getting stuck. It's annoying gdb doesn't show
where it's at. IIRC strace showed it in mutex or futex or something.
(I've been hanging around for the past couple of days waiting for
another one to happen.)

Post #4 by Mark Crispin, 2006-08-24 21:42
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Thu, 24 Aug 2006, Nik Conwell wrote:
> Thanks for taking a look. I'll throw some debug syslogs in there to
> figure out where it's getting stuck. It's annoying gdb doesn't show
> where it's at. IIRC strace showed it in mutex or futex or something.
> (I've been hanging around for the past couple of days waiting for
> another one to happen.)


The code is in dotlock_lock() in env_unix.c. It's basically doing:
    ...create pipes...
    if (!(pid = fork ())) {          /* create child */
      if (!fork ()) {                /* in child, create grandchild */
        ...stuff to run mlock...     /* in grandchild */
      }
      _exit (1);                     /* child exits immediately */
    }
    else if (pid > 0) {              /* in parent, was child created? */
      waitpid (pid,0,0);             /* reap the child */
      ...read pipe stuff from grandchild...
    }

The purpose of having a child create a grandchild, rather than just
running mlock in the child, is zombie avoidance.

The direct child is immediately reaped, and the grandchild consequently
gets inherited by init which reaps any zombies that it finds that it owns.
The grandchild has the other end of the pipe, and I/O to it is under a
select() timeout in both cases. So, sooner or later, either the pipe data
is sent or eventually both sides give up. Either way, init reaps the
grandchild.
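
As a rough, self-contained illustration of the double-fork pattern described above (not the actual c-client code), the sketch below has a grandchild stand in for mlock and send a single "+" byte back over a pipe. The function name run_detached_helper and the timeout parameter are hypothetical, and error handling is kept minimal.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not c-client code: run a detached grandchild and
 * return the single status byte it sends, or -1 on error or timeout. */
int run_detached_helper(int timeout_sec)
{
    int fds[2];
    pid_t pid;
    int result = -1;
    char c;
    fd_set rfds;
    struct timeval tv;

    if (pipe(fds) < 0) return -1;

    pid = fork();
    if (pid < 0) {                       /* fork failed: no child exists */
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                      /* child */
        pid_t gpid = fork();
        if (gpid == 0) {                 /* grandchild: stands in for mlock */
            close(fds[0]);
            if (write(fds[1], "+", 1) != 1) _exit(1);
            _exit(0);
        }
        _exit(gpid < 0 ? 1 : 0);         /* child exits at once */
    }

    /* parent: reap the child first, so it never lingers as a zombie */
    close(fds[1]);
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        ;                                /* retry if a signal interrupted us */

    /* then wait for the grandchild's answer, but only for a bounded time */
    FD_ZERO(&rfds);
    FD_SET(fds[0], &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    if (select(fds[0] + 1, &rfds, NULL, NULL, &tv) > 0 &&
        read(fds[0], &c, 1) == 1)
        result = (unsigned char)c;

    close(fds[0]);
    return result;
}
```

Calling run_detached_helper(5) should return '+' once the grandchild has written its byte. Because the child exits immediately and is reaped by waitpid(), the grandchild is reparented and never becomes a zombie of the caller.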

Anyway, that's how it's supposed to work on paper. The fact that the
child became a zombie indicates that the reap never was done.

I just realized that there is another possibility. We already discussed
if the waitpid() somehow is hanging in that mutex.

The other possibility is if the fork() to create a child returned -1 to
the parent, but actually did create the child (and hence the grandchild).
In that case, the parent would treat it as a lock failure and block again
(which may be the mutex that you are seeing).

If this is happening, I'd assert that it's either a kernel or
documentation bug. The man page for fork() says "On failure, a -1 will be
returned in the parent's context, no child process will be created..."
There's no mention of an error return from fork() that creates a child.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Post #5 by Nik Conwell, 2006-09-15 17:08
Subject: Re: UW 2004g on Linux fork (for mlock) problem

Hi. Sorry for the delay for a response - other priorities.

Bypassing the mutex, I now have a stack trace. Looks like we're getting
KOD (the kiss-of-death signal) while we were in something in libc holding
a mutex. The following example is a malloc lock (main_arena):


(gdb) where
#0 0x004a7fde in free () from /lib/tls/libc.so.6
#1 0x004c1f55 in tzset_internal () from /lib/tls/libc.so.6
#2 0x004c29ae in tzset () from /lib/tls/libc.so.6
#3 0x004c754e in strftime_l () from /lib/tls/libc.so.6
#4 0x0050992b in vsyslog () from /lib/tls/libc.so.6
#5 0x00509e9f in syslog () from /lib/tls/libc.so.6
#6 0x0804b56c in kodint () at imapd.c:1654
#7 <signal handler called>
#8 0x004a83f3 in _int_malloc () from /lib/tls/libc.so.6
#9 0x004aa0b1 in malloc () from /lib/tls/libc.so.6
#10 0x004a0013 in open_memstream () from /lib/tls/libc.so.6
#11 0x005098ae in vsyslog () from /lib/tls/libc.so.6
#12 0x00509e9f in syslog () from /lib/tls/libc.so.6
#13 0x080525ae in main (argc=5, argv=0xbffffe14) at imapd.c:1363
#14 0x0045ae23 in __libc_start_main () from /lib/tls/libc.so.6
#15 0x0804aa01 in _start ()

The imapd.c:1363 is some extra syslog stuff I've added. The logging I
added is some timing info on the executed command:

syslog(LOG_INFO,"elapsed: %d.%06d; %s",seconds,microseconds,cmd);

I don't think it changes the spirit of the problem, but it will increase
the probability.


Here's an example when we were holding the tzset_lock mutex:

#0 0x004c2a35 in __tz_convert () from /lib/tls/libc.so.6
#1 0x004c0c5d in localtime_r () from /lib/tls/libc.so.6
#2 0x005098fc in vsyslog () from /lib/tls/libc.so.6
#3 0x00509e9f in syslog () from /lib/tls/libc.so.6
#4 0x0804b56c in kodint () at imapd.c:1654
#5 <signal handler called>
#6 0x004c3a44 in __tzfile_compute () from /lib/tls/libc.so.6
#7 0x004c2b37 in __tz_convert () from /lib/tls/libc.so.6
#8 0x004c0ca0 in localtime () from /lib/tls/libc.so.6
#9 0x0014a901 in mail_parse_date (elt=0x809b9c8, s=0xbfffd5f8 "") at mail.c:2948
#10 0x00182225 in unix_parse (stream=0x807c938, lock=0xbfffe3b0, op=1) at unix.c:1343
#11 0x001839d0 in unix_open (stream=0x807c938) at unix.c:504
#12 0x00154eb4 in mail_open (stream=0x807c938, name=0x8068cf0 "INBOX", options=0) at mail.c:1223
#13 0x080531a0 in main (argc=5, argv=0xbffffe14) at imapd.c:938
#14 0x0045ae23 in __libc_start_main () from /lib/tls/libc.so.6
#15 0x0804aa01 in _start ()


We also have a lot of stupid clients (webmail, Outlook Express, etc.)
that insist on making multiple connections to the same mailbox. We're
pretty tied to mbox for now.

Post #6 by Mark Crispin, 2006-09-15 18:03
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Fri, 15 Sep 2006, Nik Conwell wrote:
> Bypassing the mutex, I now have a stack trace. Looks like we're getting
> KOD while we were in something in libc holding a mutex.


If that's the cause of the problem, then the patch below should remedy the
issue. Basically, it instructs imapd not to respond to KOD events that
occur while the mlock interchange is in progress.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

*** env_unix.c.old 2006-08-31 13:37:32.000000000 -0700
--- env_unix.c 2006-09-15 10:00:55.000000000 -0700
***************
*** 1116,1121 ****
--- 1116,1122 ----

if (fd >= 0) switch (errno) {
case EACCES: /* protection failure? */
+ MM_CRITICAL (NIL); /* go critical */
/* make command pipes */
if (!closedBox && !stat (LOCKPGM,&sb) && (pipe (pi) >= 0)) {
if (pipe (po) >= 0) {
***************
*** 1152,1157 ****
--- 1153,1159 ----
base->pipei = pi[0]; base->pipeo = po[1];
/* close child's side of the pipes */
close (pi[1]); close (po[0]);
+ MM_NOCRITICAL (NIL);/* no longer critical */
return LONGT;
}
}
***************
*** 1159,1164 ****
--- 1161,1167 ----
}
close (pi[0]); close (pi[1]);
}
+ MM_NOCRITICAL (NIL); /* no longer critical */
/* find directory/file delimiter */
if (s = strrchr (base->lock,'/')) {
*s = '\0'; /* tie off at directory */

Post #7 by Nik Conwell, 2006-09-15 18:23
Subject: Re: UW 2004g on Linux fork (for mlock) problem

That addresses my initial problem, but from the two stack traces I got
above, the process receiving the KOD isn't in dotlock_lock(). One was
doing a syslog() and one was parsing dates from the mailbox.

Based on glibc's locking, it would seem that calling any glibc function
from within a signal handler could involve deadlock.

Post #8 by Mark Crispin, 2006-09-15 19:57
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Fri, 15 Sep 2006, Nik Conwell wrote:
> That addresses my initial problem, but from the two stack traces I got
> above, the process receiving the KOD isn't in dotlock_lock(). One was
> doing a syslog() and one was parsing dates from the mailbox.
> Based on glibc's locking, it would seem that calling any glibc function
> from within a signal handler could involve deadlock.


That's bad news. That essentially means that a signal handler that
responds to any critical condition (autologout timer, KOD, hangup,
termination) is precluded from doing much of anything, even if it has no
intention of resuming from the signal.

I guess that the authors of glibc had a good reason for not making glibc
be reentrant, but in general it's not a good thing to do.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Post #9 by Sam, 2006-09-15 22:57
Subject: Re: UW 2004g on Linux fork (for mlock) problem

Nik Conwell writes:

> That addresses my initial problem, but from the two stack traces I got
> above, the process receiving the KOD isn't in dotlock_lock(). One was
> doing a syslog() and one was parsing dates from the mailbox.
>
> Based on glibc's locking, it would seem that calling any glibc function
> from within a signal handler could involve deadlock.


Correct. The only functions you can invoke from a signal handler are kernel
syscalls (man section 2). Unless explicitly stated otherwise, none of the
libc functions from man section 3 can be safely called from a signal
handler. Not even malloc.





Post #10 by Sam, 2006-09-15 22:59
Subject: Re: UW 2004g on Linux fork (for mlock) problem

Mark Crispin writes:

> On Fri, 15 Sep 2006, Nik Conwell wrote:
>> That addresses my initial problem, but from the two stack traces I got
>> above, the process receiving the KOD isn't in dotlock_lock(). One was
>> doing a syslog() and one was parsing dates from the mailbox.
>> Based on glibc's locking, it would seem that calling any glibc function
>> from within a signal handler could involve deadlock.

>
> That's bad news. That essentially means that a signal handler that
> responds to any critical condition (autologout timer, KOD, hangup,
> termination) is precluded from doing much of anything, even if it has no
> intention of resuming from the signal.
>
> I guess that the authors of glibc had a good reason for not making glibc
> be reentrant, but in general it's not a good thing to do.


The standard C library was not reentrant long before glibc came on the scene.

I recall reading explicit warnings in AT&T SVR3 man pages that libc
functions are not reentrant.



Post #11 by Mark Crispin, 2006-09-15 23:29
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Fri, 15 Sep 2006, Sam wrote:
> I recall reading explicit warnings in AT&T SVR3 man pages that libc functions
> are not reentrant.


The old AT&T documentation that I have simply warned about the signal
handler trying to dismiss back to a libc function if the signal handler
also called a libc function. It said nothing about a signal handler that
has no intention of returning and instead will exit. That's the case
here.

The apparent purpose of the mutex is to protect the original libc call
from ill effects on it caused by the reentered call. I assume that the
choice for a mutex which waits (which would deadlock) as opposed to one
that caused an abort() call was intentional.

Since the signal handler has no intention of returning, is there a way to
disable the mutex? That is, the signal handler wants to be treated more
like a setjmp()/longjmp(). Otherwise, that precludes the signal handler
from even logging that it was called.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Post #12 by Sam, 2006-09-16 00:08
Subject: Re: UW 2004g on Linux fork (for mlock) problem

Mark Crispin writes:

> On Fri, 15 Sep 2006, Sam wrote:
>> I recall reading explicit warnings in AT&T SVR3 man pages that libc functions
>> are not reentrant.

>
> The old AT&T documentation that I have simply warned about the signal
> handler trying to dismiss back to a libc function if the signal handler
> also called a libc function. It said nothing about a signal handler that
> has no intention of returning and instead will exit. That's the case
> here.


exit() itself is a libc function that tries to flush any open files, before
terminating the process. _exit(), I think, is a syscall that's safe to use
in a signal handler.

>
> The apparent purpose of the mutex is to protect the original libc call
> from ill effects on it caused by the reentered call. I assume that the
> choice for a mutex which waits (which would deadlock) as opposed to one
> that caused an abort() call was intentional.
>
> Since the signal handler has no intention of returning, is there a way to
> disable the mutex? That is, the signal handler wants to be treated more
> like a setjmp()/longjmp(). Otherwise, that precludes the signal handler
> from even logging that it was called.


There is no mutex. The internal data structures in libc simply are not
reentrant. In a pre-threaded world, most malloc implementations, for
example, maintained somewhat involved strategies for recycling memory
blocks, often managing multiple lists of memory blocks of different sizes,
trying to optimize for O(n) performance. If you are interrupted in the
middle of reshuffling internal memory pool lists, you just can't reenter
malloc() even if you have no intention of returning from the signal
handler, since the internal memory lists and pointers are likely to be in
an inconsistent state.

In a modern, threaded world, most internal structures are protected by
mutexes so that they're thread-safe. But it's not just a single mutex
protecting the C library. That would be performance suicide, if only one
thread could be running inside the C library, locking out all other threads
even if they need to do something completely unrelated. Most
implementations use granular mutexes that are simply not exposed to the user
app. glibc -- and this I know -- uses some GNU ld tricks to impose the
overhead of mutexes only when the app is multithreaded and links against
libpthread. Furthermore, many glibc functions use thread-local storage for
temporary scratch space; as such glibc is thread-safe, but not
reentrant-safe.




Post #13 by Mark Crispin, 2006-09-16 02:57
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Fri, 15 Sep 2006, Sam wrote:
>> It said nothing about a signal handler that
>> has no intention of returning and instead will exit. That's the case here.

> exit() itself is a libc function that tries to flush any open files, before
> terminating the process. _exit(), I think, is a syscall that's safe to use
> in a signal handler.


Of course I know that; my use of the verb "exit" instead of the function
name "exit()" was intentional.

Specifically: the signal handler has no intention of returning and instead
will exit with _exit().

> In a pre-threaded world, most malloc implementations, for
> example, maintained somewhat involved strategies for recycling memory blocks;
> often managing multiple lists of memory blocks of different sizes, trying to
> optimize for O(n) performance. If you are interrupted in a middle of
> reshuffling internal memory pool lists, you just can't reenter malloc() even
> if you have no intention of returning from the signal handler, since the
> internal memory lists and pointers are likely to be in an inconsistent state.


I know all this, too. It makes sense that the heap may be in an
inconsistent state during manipulation and thus the heap routines can't
be reentered. That is why I block signals during my calls to heap routines
such as malloc().

> glibc --
> and that I know -- use some gnu ld tricks to impose the overhead of mutexes
> only when the app is multithreaded and links against libpthread.


That doesn't explain why the original poster encountered the mutex. The
server is not multithreaded and doesn't link with any thread library. My
expectation would be that the signal handler was therefore at liberty to
call libc functions, as long as it was reasonably careful.

libc, too, ought to be a bit more thoughtful, especially when
non-threaded. It should not take that much effort for libc to make stdout
calls and syslog() reentrant in a signal handler, even if only in a
special non-buffered form. These are operations that a signal handler is
likely to want to do, as in recording why it's about to commit suicide
rather than just vanishing without a trace.

Without that available, the only recourse is to have a external monitoring
process that records the event instead of the process doing it itself.

It also isn't as if this class of issue is new. The general problem with
software interrupts has been known at least since the 1960s; and was
solved quite well on ITS with PCLSR. Other systems didn't go that far,
but still allowed "dangerous" calls to be made as long as the application
was willing to abandon returning from the interrupt.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Post #14 by Sam, 2006-09-16 03:58
Subject: Re: UW 2004g on Linux fork (for mlock) problem

Mark Crispin writes:

> server is not multithreaded and doesn't link with any thread library. My
> expectation would be that the signal handler was therefore at liberty to
> call libc functions, as long as it was reasonably careful.


The posted backtrace shows that strftime() was getting invoked in the signal
handler indirectly through syslog(). That's definitely not reentrant.
strftime() itself calls tzset(), which calls free(), as the backtrace shows.
We're definitely well into non-reenterable territory.

I can understand the assumption that syslog() would be a syscall. But, it's
not. Can't do that in a signal handler.

You probably have some memory corruption happening here, so you cannot fully
trust the backtrace that shows mutex functions on the stack.

The Linux signal man page actually enumerates the functions that may be
safely invoked from a signal handler, and refers to POSIX as the reference
for the safe function list. So it looks like there's even a 2003 POSIX
standard of functions that are guaranteed to be reenterable. Anything not
on that list is off the table.


Post #15 by Mark Crispin, 2006-09-16 19:18
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Fri, 15 Sep 2006, Sam wrote:
> I can understand the assumption that syslog() would be a syscall. But, it's
> not. Can't do that in a signal handler.


Of course syslog() is not a system call. However, someone should take the
effort to make syslog() work in a signal handler (ditto for stdout
operations) anyway, at least in a signal handler that is not going to
return.

A Google search shows that other developers have made the same complaint;
what has worked in the past doesn't work in glibc. POSIX aside, a lot of
things *did* work in SVR4 and BSD and now have suddenly broken on Linux.

Typical example: A daemon gets a SIGHUP. It wants to syslog() that fact,
and then exit; as opposed to vanishing without a trace. It even knows
that whatever it was doing, it wasn't critical.

But, from what you say, it can't even do a longjmp() to get out of
whatever it might have been doing in the bowels of the C library, or at
least enough to do a syslog() and exit.

It isn't as if solving this in glibc is technically impossible. It isn't.
At most it is moderately challenging due to the complexity added by
threading. Perhaps it may be alright to solve it only for non-threaded
applications, since a threaded application is (1) likely to have the
necessary control infrastructure to provide an alternative and (2) is not
likely to respond to a signal by writing a log message and exiting.

One way would be to have a mutex when non-threaded, but one which fails
instead of blocking. If the mutex fails, then escape to a special mode
that uses its own context and structures and dispenses with such luxuries
as tzset(). This would be done only in certain well-defined cases (the
ones which application developers have been complaining about!).

A more Linuxish way would be to add new calls (such as syslog_r()) for
reentrant versions, thus forcing developers to add conditionals to use
those calls instead of the normal ones. NOT my preferred solution, but
perhaps more palatable to the glibc guys.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Post #16 by Nik Conwell, 2006-09-18 18:39
Subject: Re: UW 2004g on Linux fork (for mlock) problem


Mark Crispin wrote:
> One way would be to have a mutex when non-threaded, but one which fails
> instead of blocking. If the mutex fails, then escape to a special mode
> that uses its own context and structures and dispenses with such luxuries
> as tzset(). This would be done only in certain well-defined cases (the
> ones which application developers have been complaining about!).


glibc could possibly use other types of mutexes - recursive (blocks
other threads but lets the current thread succeed) or error checking
(return EDEADLK instead of blocking), but as far as I can tell (not
far) it's coded to block. Who knows what would break if I changed
that...

Back to the IMAP server, I was thinking about having the USR2 handler
just set a global KOD variable and then resume, and then have slurp()
check for the global KOD. I think I'd have to also do
siginterrupt(SIGUSR2) so that the fgets() will abort with EAGAIN.

Ideally I'd do that for all signals, but in practice it's just been
SIGUSR2.

-nik

Post #17 by Mark Crispin, 2006-09-18 19:02
Subject: Re: UW 2004g on Linux fork (for mlock) problem

On Mon, 18 Sep 2006, Nik Conwell wrote:
> glibc could possibly use other types of mutexes - recursive (blocks
> other threads but lets the current thread succeed) or error checking
> (return EDEADLK instead of blocking), but as far as I can tell (not
> far) it's coded to block. Who knows what would break if I changed
> that...


I don't think that there is much hope of winning by changing the mutexes.
It does need to work that way for threading.

In 2006a I changed things to make sure that it never does a syslog or
stdout I/O if the signal will return.

> Back to the IMAP server, I was thinking about having the USR2 handler
> just set a global KOD variable and then resume, and then have slurp()
> check for the global KOD. I think I'd have to also do
> siginterrupt(SIGUSR2) so that the fgets() will abort with EAGAIN.
> Ideally I'd do that for all signals, but in practice it's just been
> SIGUSR2.


Let me know how it goes.

One potential problem is that this may cause KOD not to respond rapidly
enough if it is doing something potentially time-consuming (such as a
large search) but not "critical" (thus can be aborted).

The other solution is just to stop using traditional UNIX mailbox format,
since it's the only one that needs/uses KOD.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Post #18 by Nik Conwell, 2006-09-19 14:01
Subject: Re: UW 2004g on Linux fork (for mlock) problem


Mark Crispin wrote:
> On Mon, 18 Sep 2006, Nik Conwell wrote:
> > Back to the IMAP server, I was thinking about having the USR2 handler
> > just set a global KOD variable and then resume, and then have slurp()
> > check for the global KOD. I think I'd have to also do
> > siginterrupt(SIGUSR2) so that the fgets() will abort with EAGAIN.
> > Ideally I'd do that for all signals, but in practice it's just been
> > SIGUSR2.

>
> Let me know how it goes.


So far so good. Works in a simple engineered test but I'll have to see
how it shakes out in production for a couple of days. I can't test SSL
but we're not using that on the Linux servers yet. Want me to e-mail
you the patch? (will come from nik@bu.edu)

> One potential problem is that this may cause KOD not to respond rapidly
> enough if it is doing something potentially time-consuming (such as a
> large search) but not "critical" (thus can be aborted).
>
> The other solution is just to stop using traditional UNIX mailbox format,
> since it's the only one that needs/uses KOD.


Unfortunately not possible as we still have legacy stuff that expects
mbox.

All times are GMT +1.