PHWinfo banniere

Titres
PORTAIL ANNUAIRE ARTICLES COMPARATEUR HÉBERGEURS DEVIS FORUMS RÉDUCTEUR D'URL
Précédent   PHWinfo > Autres forums > Forum Programmation & Conception > mysql.general > Re: Problem with *very* slow replication
S'inscrire FAQ Membres Recherche Messages du jour Marquer les forums comme lus
Re: Problem with *very* slow replication

Réponse
 
LinkBack Outils de la discussion
Vieux 30/10/2007, 03h44   #1
Christopher E. Brown
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: Problem with *very* slow replication

On Thu, 25 Oct 2007, Bob@woods.net wrote:

> Not sure that I get the whole picture.
>
> We have been running replication since about 4.0 and we have been through
> several upgrades and are now at 5.0.27.
>
> The 'show slave status' always gives us an accurate reflection of where it is
> at which is usually 0 seconds behind.
>
> Occasionally, it falls behind if the master is really busy (>2200 q/s with
> about 70% being updates/deletes/inserts).
>
> At those times the slave tops out at about 1200 q/s of which most are db mods
> of some kind and some selects since we have reports running against the
> replica and it will fall behind temporarily.
>
> Can you send show slave status and show master status as well and typical
> mytop outputs for master and slave?
>
> That might let me be able to provide more .
>
>
> Bob


Unfortunatly I had to tear down replication as it was causing problems
with the master. (The master will not delete binlogs that a slave is
still loading, when the slave is 40 file behind disk gets short).

CPU load was near zero on both systems (98% idle or better).

Disk load is minimal.

The slave is always up to date with relay file processing and reporting
zero seconds behind.

In short, everything looks fine.


What happens is that the master -> slave binlog feed runs very slow (no
more than abount 10 writes/sec).


So, afer a few days the slave is still reporting zero seconds behind, and
it is zero seconds behind the relay log.

The problem is that while the master is currently writing binlog 650, the
slave is actually zero seconds behind the feed, but the binlog feed has
fallen 20 - 30 files behind (our binlog rolls at 256M).


Since there is no load issue, I expect there is a timing or trigger issue
with the master side proc doing the binlog dump, or the slave side
receiving it.


I can stop/start replication and/or reload both servers, it still holds.

I see the replication restart, with the slave running zero seconds behind
the relay log, the binlog feed starts up right where it left off but the
feed only runs at about 10 writes a second.


Are your running native or LinuxThreads? This is smelling like threading
issue to me (we are running FreeBSD 6.2 with native threading and 5.1.19).

The exact same setup was pre-built on Linux systems (2.6.x Slackware)
before being built out on the production systems (FreeBSD 6.2).

During the testing 1000 writes/sec were no problem (small/simple table,
fits in memory). When I forced a backlog of approx 2GB by shuttong down
the slave on restart the binlog -> relay log feed ran at over 25MB/sec
until caught up.
  Réponse avec citation
Vieux 03/11/2007, 08h20   #2
Christopher E. Brown
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut SOLVED: Problem with *very* slow replication, FreeBSD 6.2



An update for those actually paying attention.

I have been fighing unusual performance issues with replication between
FreeBSD 6.2 machines.

The unusual part is that while replication would never top 10 writes per
second (even while the master was taking hundres of writes per second),
the slave always reported zero seconds behind.

This is on servers with less than 1% CPU used.

The actual problem was not with writing the binlog, or the slave SQL
thread, but the actual transfer of the binlog across the network.

After days of running, the slave would be many Gigs behind the master.

While debugging I tried many things including updating from 5.1.19 to
5.1.22, rebuilding with WITH_PROC_SCOPE_PTH=yes, and even rebuilding using
linuxthreads.


None of this worked.



The problem was rfc1323... Window scaling *SHOULD* have improved
performance given that this is a jumbo frame GigE network.

For reasons I don't understand, with rfc1323 enabled the data transfer
rate for replication is limited to a ~ 200Kbyte/sec (I do not see the same
slowdown for http or scp transfers).


To verify I rebuilt both systems back to default (native threads),
re-inited the Master<->Master replication loop, shutdown one of the
servers and inserted several million records on the live system (about
1.8Gbyte of binlog).


On restarting the second system it read the binlog into the relay log at
20 - 25 Mbyte/sec. The seconds behind master value showed sane values,
and it processed the relay-log backlog at about 6600 writes/sec until
finished.


Further testing included 3,000 inserts/sec to each of the servers
(6,000/sec total) with the master/master replication loop active. During
a run of 10,000,000 inserts to each server replication was never more than
2 seconds behind.


On Tue, 30 Oct 2007, Christopher E. Brown wrote:

> On Thu, 25 Oct 2007, Bob@woods.net wrote:
>
>> Not sure that I get the whole picture.
>>
>> We have been running replication since about 4.0 and we have been through
>> several upgrades and are now at 5.0.27.
>>
>> The 'show slave status' always gives us an accurate reflection of where it
>> is at which is usually 0 seconds behind.
>>
>> Occasionally, it falls behind if the master is really busy (>2200 q/s with
>> about 70% being updates/deletes/inserts).
>>
>> At those times the slave tops out at about 1200 q/s of which most are db
>> mods of some kind and some selects since we have reports running against
>> the replica and it will fall behind temporarily.
>>
>> Can you send show slave status and show master status as well and typical
>> mytop outputs for master and slave?
>>
>> That might let me be able to provide more .
>>
>>
>> Bob

>
> Unfortunatly I had to tear down replication as it was causing problems with
> the master. (The master will not delete binlogs that a slave is still
> loading, when the slave is 40 file behind disk gets short).
>
> CPU load was near zero on both systems (98% idle or better).
>
> Disk load is minimal.
>
> The slave is always up to date with relay file processing and reporting zero
> seconds behind.
>
> In short, everything looks fine.
>
>
> What happens is that the master -> slave binlog feed runs very slow (no more
> than abount 10 writes/sec).
>
>
> So, afer a few days the slave is still reporting zero seconds behind, and it
> is zero seconds behind the relay log.
>
> The problem is that while the master is currently writing binlog 650, the
> slave is actually zero seconds behind the feed, but the binlog feed has
> fallen 20 - 30 files behind (our binlog rolls at 256M).
>
>
> Since there is no load issue, I expect there is a timing or trigger issue
> with the master side proc doing the binlog dump, or the slave side receiving
> it.
>
>
> I can stop/start replication and/or reload both servers, it still holds.
>
> I see the replication restart, with the slave running zero seconds behind the
> relay log, the binlog feed starts up right where it left off but the feed
> only runs at about 10 writes a second.
>
>
> Are your running native or LinuxThreads? This is smelling like threading
> issue to me (we are running FreeBSD 6.2 with native threading and 5.1.19).
>
> The exact same setup was pre-built on Linux systems (2.6.x Slackware) before
> being built out on the production systems (FreeBSD 6.2).
>
> During the testing 1000 writes/sec were no problem (small/simple table, fits
> in memory). When I forced a backlog of approx 2GB by shuttong down the slave
> on restart the binlog -> relay log feed ran at over 25MB/sec until caught up.
>
>

  Réponse avec citation
Vieux 05/11/2007, 01h00   #3
Christopher E. Brown
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: SOLVED: Problem with *very* slow replication, FreeBSD 6.2

On Sat, 3 Nov 2007, bob b wrote:

> Good to hear that you found the problem.
>
> The only remaining puzzle is why the replica reported that it was up to date
> when it was several binlogs behind.
> Possibly the replica was always caught up with the last entry from the very
> slow link.
>
> Perhaps you should report this as a bug? The replication mechanism should be
> able to check the last binlog being written on the master and report that
> difference?
>
> Bob Bankay (from home)



The reporting confusion is due to the fact that the "seconds behind
master" figure is based on the relay logs and how long it will take to catch
up.

For example, I had replication shut down for 45 minutes wile feeding
millions of writes into the master. On slave restart the binlog dump
started, and it went fast. As the relay log grew so did seconds behind
master. One the relay log was up to date, the "seconds behind master" was
based on the execution rate and backlog. (Somthing like 12 minutes and
counting down)


So, a slave is down for 8hrs. It comes online and pulls the binlog in 120
seconds. The "seconds behind master" does not reflect 8hrs, but how many
seconds (at current processing rate) before the slave finishes the relay
logs.


The "seconds behind master" value is really "seconds until currency with
the relay logs" and should prolly be documented as such.


It would be nice if there was a way for the slave to find the actual
current master position and compare with the local state though.

  Réponse avec citation
Vieux 05/11/2007, 02h18   #4
Baron Schwartz
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: SOLVED: Problem with *very* slow replication, FreeBSD 6.2

Christopher E. Brown wrote:
> On Sat, 3 Nov 2007, bob b wrote:
> So, a slave is down for 8hrs. It comes online and pulls the binlog in
> 120 seconds. The "seconds behind master" does not reflect 8hrs, but how
> many seconds (at current processing rate) before the slave finishes the
> relay logs.
>
>
> The "seconds behind master" value is really "seconds until currency with
> the relay logs" and should prolly be documented as such.


This is incorrect. In most circumstances, it's basically the difference
between the timestamp of the binlog event the SQL thread is currently
processing, and the master's current timestamp (as fetched by the I/O
thread). So it really is what it sounds like: the seconds behind the
master. If it says 100, it means the slave is processing an event that
took place 100 seconds ago on the master.

You can read the source code in show_master_info() in sql/slave.cc.

Baron
  Réponse avec citation
Vieux 05/11/2007, 20h03   #5
Bob Bankay X-AST : 7731^29u18e3
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: SOLVED: Problem with *very* slow replication, FreeBSD 6.2

Running on our systems, we have had the replica load data and then
started. The longest delta was about 28 hours behind the master. The
slave status faithfully reported how far behind the master it was, when
the slave was started, even as it was loading its relay-logs from the
master which if I remember correctly took a couple of hours that first
time. During that first part when the relay logs were loading, the
seconds behind would increase and when the relay log caught up the delta
started decreasing rapidly to zero.

We are running x86-64bit hardware with RH Linux and a 1Gbit ethernet
link between master and slave (nothing exotic). The load on the link
never seems to an issue, so we have never monitored it closely.

So, if you are happy with the situation then it is solved.

Cheers,
Bob Bankay



Baron Schwartz wrote:
> Christopher E. Brown wrote:
>> On Sat, 3 Nov 2007, bob b wrote:
>> So, a slave is down for 8hrs. It comes online and pulls the binlog
>> in 120 seconds. The "seconds behind master" does not reflect 8hrs,
>> but how many seconds (at current processing rate) before the slave
>> finishes the relay logs.
>>
>>
>> The "seconds behind master" value is really "seconds until currency
>> with the relay logs" and should prolly be documented as such.

>
> This is incorrect. In most circumstances, it's basically the
> difference between the timestamp of the binlog event the SQL thread is
> currently processing, and the master's current timestamp (as fetched
> by the I/O thread). So it really is what it sounds like: the seconds
> behind the master. If it says 100, it means the slave is processing
> an event that took place 100 seconds ago on the master.
>
> You can read the source code in show_master_info() in sql/slave.cc.
>
> Baron
>


  Réponse avec citation
Réponse


Outils de la discussion

Règles de messages
Vous ne pouvez pas créer de nouvelles discussions
Vous ne pouvez pas envoyer des réponses
Vous ne pouvez pas envoyer des pièces jointes
Vous ne pouvez pas modifier vos messages

Les balises BB sont activées : oui
Les smileys sont activés : oui
La balise [IMG] est activée : oui
Le code HTML peut être employé : non
Trackbacks are oui
Pingbacks are oui
Refbacks are oui


Fuseau horaire GMT +1. Il est actuellement 03h41.


Édité par : vBulletin® version 3.7.3
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Friendly URLs by vBSEO 3.2.0 RC5 Tous droits réservés.
Version française #16 par l'association vBulletin francophone
PHWinfo est un site Éducation Sans Frontières ©2000-2008
Ad Management by RedTyger
©Tous droits réservés par les parties respectives
Page generated in 0,19390 seconds with 13 queries