PHWinfo banniere

Titres
PORTAIL ANNUAIRE ARTICLES COMPARATEUR HÉBERGEURS DEVIS FORUMS RÉDUCTEUR D'URL
Précédent   PHWinfo > Autres forums > Forum Programmation & Conception > mysql.general > excessive time spent in "statistics" status
S'inscrire FAQ Membres Recherche Messages du jour Marquer les forums comme lus
excessive time spent in "statistics" status

Réponse
 
LinkBack Outils de la discussion
Vieux 31/08/2007, 15h58   #1
Lucio Chiappetti
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut excessive time spent in "statistics" status

I have some queries, involving a largish number of JOIN, which
are apparently very slow or even take forever (a mysqladmin processlist
shows them remain in the "statistics" status for a long time, in most
cases I have to kill them after several minutes).

When I first had the problem I googled around and found some reference
(which I've lost) saying that the "statistics" status is actually what one
does with an EXPLAIN SELECT, and that this is done preliminarily to the
actual query. It also said it might occur with a large number of joins
because this analysis, for n joins MIGHT try up to n! combinations, unless
one somehow specified the priorities (but the author did not remember
how).

I thought to have overcome the problem using a feature of the CREATE
VIEW command (see below), but apparently I simply moved it to an higher n.

Now I tried to see how it scales with the number of joins, and the curious
things is that e.g. for n=9 it works fast, for n=15 it works slowly, for
n=18 works fast again and for n=20 takes an infinite time.

I'll first explain my background :

- I have a number of tables (let's call them t0, t1, t2 ...)
- all of them have an auto_increment column called seq which is also
an index

- one table (t0) is "more important" (actually it is list of
celestial X-ray sources while the other are celestial objects
in other wavebands but this is irrelevant to you).

- I have precomputed correlation tables among t0 and each of
the other. These tables are called eg. t0ti, have two columns
t0 and ti corresponding to the t0.seq and ti.seq of objects which
are "associated". They are indexed on (t0,ti). Note that an
object in t0 can be associated with 1 or more or zero (t0ti.ti null)
objects in ti.

- I originally (already under mysql 3) devised a way to identify
counterparts in MORE tables (all these associations are based on
spherical distance + other criteria). This involved creating a
working table G

This table has columns named t0 t1 ... tn (containing the pointers
t0.seq t1.seq ... for counterparts associated and validated according
to some criteria) plus other service columns

The simultaneous access was achieved in our interface by a mechanism
we called virtual tables, which essentially was

SELECT
some subset of columns in some of the t0...tn
or some expression thereof
FROM
G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq

We refer to the t0...tn as "the member tables" of G.

We have different versions of G corresponding to different sets
of member tables and different association criteria.

The largest of our cases has 26 different members.

Our mechanism was such that we defined a subset of columns in
each of the ti (or expressions thereof like distances etc.) as
"interesting", with an associated alias. Our interface usually
showed only such "virtual columns", but had a possibility to add
(naming them manually as ti.colname) to the SELECT also all
other "member columns" normally hidden.

We also allow to correlate a virtual table with a single physical
table tk (be it member or not) using the t0tk correlation table
(t0 is the "First Member").

- the above worked and still works, but has some clumsiness. When
we upgraded to mysql 5 and discovered the CREATE VIEW command
we decided to replace our virtual tables with views.

- for each G we define a view as

create algoritm=temptable view V as
SELECT
some subset of columns in some of the t0...tn or in G
or some expression thereof
FROM
G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq

- the "algorithm=temptable" was required because without it some
of our queries (see below) entered in the "statistics" status forever
already with 11 members

- for the rest the VIEWs work nicely when used standalone and are easier
for the user ...

- ... but on the other hand they HIDE the member columns which are
not explicitly named in CREATE VIEW (where one wants to keep a
manageable number of columns). Hide means here that their names
ti.colname cannot be used in SELECT !

- so we devised an option by which on ticking on "show members also"
one can also include these ti.colname in the query

de facto this doubles the joins, because the statement built is

SELECT
list of (V.colname and ti.colname with i chosen among 0 and n)
FROM
( G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
) left join V on G.seq=V.seq

This statement NOW works (it did not work with e.g. 11 member tables
before we switched to ALGORITHM=TEMPTABLE in the CREATE VIEW).

An "explain select" for a query on such views gives that a view with
n members "with members also enabled" involves 29+n queries

(our [working] maximum is n=26 with 55 queries)

I have also noticed that such "explain select" are relatively fast
(0.3 sec) and are somehow cached (if I repeat one, it takes zero time)

- the last step would be to make a query on a VIEW with "show members
also" ticked, correlated with another table Tk (be Tk either a
member, one of t1...tn, or a non-member)

This involves a query like this

SELECT
list of (V.colname and ti.colname and Tk.colname)
FROM
( VTk left join
(
(G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
) left join V on G.seq=V.seq
) on VTk.V = V.seq
) left join Tk on VTk.Tk=Tk/seq

where the correlation table VTk is also a view

CREATE ALGORITHM=TEMPTABLE VIEW VTk as
select G.seq as V, G.Tk as Tk
from G left join t0Tk on G.t0=t0Tk.t0

which exploits the precomputed correlation among the "important"
table t0 (the "First Member") and Tk. Note that t0Tk is indexed
on (t0,Tk) but the view apparently is not indexed.

- now such a query on our largest table (26 members) hangs in
statistics status forever. I experimented therefore trying to
use a variable names of left joins (this to tell when it stops
working ... in real life I *might* need all, or any of the member).

The query for "view with members also correlated with another table"
in an "explain select" requires 33+n queries for n members.

Now the funny thing is that up to e.g. 9 members "explain select"
is rather fast (0.3 sec).

For 15 members (48 queries) slows down significantly (6.3 sec) and
is not cached (even if repeated soon still takes 6.3 sec).

For 20 members it takes forever.

But for 18 members (51 queries) instead takes only 0.5 sec.

When I say 9,15,18,20 members I mean the first 9,15,18,20 of the full
member list. Such a list is in an arbitrary order (sort of historical the
various tables entered the database).

Also I note that the output of "explain select" presents the various
tables in a varying order. For instance the slow 15-member case but also
the fast 9-member case have "<derived2>" (which is the 13000-element
un-indexed VTk correlation view) interspersed with the other table, the
fast 18-member case has it first.

Questions :

- what does "explain select" actually do and why sometimes hangs ?

- can this be overcome rearranging the order of the joins (note that
the bulk of the members are all joined with G), or introducing
parentheses or with other syntax changes ?

- or has it to do with some configuration parameter, maybe related
to what is cached, cache size or other ?

Thanks in advance to whoever is able to give hints.

--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
  Réponse avec citation
Vieux 01/09/2007, 16h23   #2
Shawn Green
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

Hello Lucio,

(reply below)
Lucio Chiappetti wrote:
> I have some queries, involving a largish number of JOIN, which
> are apparently very slow or even take forever (a mysqladmin processlist
> shows them remain in the "statistics" status for a long time, in most
> cases I have to kill them after several minutes).
>
> When I first had the problem I googled around and found some reference
> (which I've lost) saying that the "statistics" status is actually what one
> does with an EXPLAIN SELECT, and that this is done preliminarily to the
> actual query. It also said it might occur with a large number of joins
> because this analysis, for n joins MIGHT try up to n! combinations, unless
> one somehow specified the priorities (but the author did not remember
> how).



You can find those explanations in our manual at
http://dev.mysql.com/doc/refman/5.0/...formation.html

>
> I thought to have overcome the problem using a feature of the CREATE
> VIEW command (see below), but apparently I simply moved it to an higher n.
>
> Now I tried to see how it scales with the number of joins, and the curious
> things is that e.g. for n=9 it works fast, for n=15 it works slowly, for
> n=18 works fast again and for n=20 takes an infinite time.
>
> I'll first explain my background :
>
> - I have a number of tables (let's call them t0, t1, t2 ...)
> - all of them have an auto_increment column called seq which is also
> an index
>
> - one table (t0) is "more important" (actually it is list of
> celestial X-ray sources while the other are celestial objects
> in other wavebands but this is irrelevant to you).
>
> - I have precomputed correlation tables among t0 and each of
> the other. These tables are called eg. t0ti, have two columns
> t0 and ti corresponding to the t0.seq and ti.seq of objects which
> are "associated". They are indexed on (t0,ti). Note that an
> object in t0 can be associated with 1 or more or zero (t0ti.ti null)
> objects in ti.
>
> - I originally (already under mysql 3) devised a way to identify
> counterparts in MORE tables (all these associations are based on
> spherical distance + other criteria). This involved creating a
> working table G
>
> This table has columns named t0 t1 ... tn (containing the pointers
> t0.seq t1.seq ... for counterparts associated and validated according
> to some criteria) plus other service columns
>
> The simultaneous access was achieved in our interface by a mechanism
> we called virtual tables, which essentially was
>
> SELECT
> some subset of columns in some of the t0...tn
> or some expression thereof
> FROM
> G left join t0 on G.t0=t0.seq
> left join t1 on G.t1=t1.seq
> ...
> left join tn on G.tn=tn.seq
>
> We refer to the t0...tn as "the member tables" of G.



I normally do not get lost in symbolic descriptions such as yours.
However your description of G and how you build it leaves me in the dark.

Can you show me a few sample rows of G (symbolically, if you like) and
describe what is in each column G.t0 to G.tn? What I am curious to know
is what do each of these n object have in common that allows them to
represented as a single tuple on the G table.

You also mention other "service columns". What kinds of information are
you keeping in those?


>
> We have different versions of G corresponding to different sets
> of member tables and different association criteria.
>
> The largest of our cases has 26 different members.
><snip>
> - for each G we define a view as
>


"Each G" ? Again, that makes the concept of what a G really is more
confusing to me. I understand databases and I know more than a little
about stellar cartography, cosmology, and physics. Please don't hold back.


> <snipped (to be revisited later)>
>
> Questions :
>
> - what does "explain select" actually do and why sometimes hangs ?
>
> - can this be overcome rearranging the order of the joins (note that
> the bulk of the members are all joined with G), or introducing
> parentheses or with other syntax changes ?
>
> - or has it to do with some configuration parameter, maybe related
> to what is cached, cache size or other ?
>
> Thanks in advance to whoever is able to give hints.
>


EXPLAIN SELECT simply stops a normal SELECT statement from actually
performing the data retrieval steps and shows us (the users) a
description of the techniques the query engine was about to use to get
at the data. That means that the steps of parsing the query, tokenizing
the symbols, and optimizing the execution plan still take place. It is
during this optimization phase that most of your CPU time is being used
as the engine will work many permutations of joining one table to
another until it reaches a decision about which plan is "less expensive"
that all of the others.

http://dev.mysql.com/doc/refman/5.0/...mizations.html
http://dev.mysql.com/doc/refman/5.0/...imization.html
and the rest of the optimization chapter describe this process in
considerable details
http://dev.mysql.com/doc/refman/5.0/en/query-speed.html

Please do explain the data problems you are trying to solve in some more
detail as I may be able to you to design a less cumbersome method
of achieving the same goals.

--
Shawn Green, Support Engineer
MySQL Inc., USA, www.mysql.com
Office: Blountville, TN
__ ___ ___ ____ __
/ |/ /_ __/ __/ __ \/ /
/ /|_/ / // /\ \/ /_/ / /__
/_/ /_/\_, /___/\___\_\___/
<___/
Join the Quality Contribution Program Today!
http://dev.mysql.com/qualitycontribution.html
  Réponse avec citation
Vieux 03/09/2007, 12h36   #3
Lucio Chiappetti
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

As I'm new on this list and can't find a "policy statement", I assume it
is OK to reply to the list in "discussion list" fashion. If instead it is
preferred to reply to each sender privately and later I'd post a summary,
please let me know and I'll comply.

On Sat, 1 Sep 2007, Shawn Green wrote:
> Lucio Chiappetti wrote:


> > I have some queries, involving a largish number of JOIN, which are
> > [...] very slow or [...] remain in the "statistics" status [forever]


> > This involved creating a working table G
> > [...]
> > The simultaneous access was achieved in our interface by a mechanism
> > we called virtual tables, which essentially was
> >
> > SELECT
> > some subset of columns in some of the t0...tn
> > or some expression thereof
> > FROM
> > G left join t0 on G.t0=t0.seq
> > left join t1 on G.t1=t1.seq
> > ...
> > left join tn on G.tn=tn.seq
> >
> > We refer to the t0...tn as "the member tables" of G.


> I normally do not get lost in symbolic descriptions such as yours.
> However your description of G and how you build it leaves me in the
> dark.
>
> Can you show me a few sample rows of G (symbolically, if you like) and
> describe what is in each column G.t0 to G.tn? What I am curious to know
> is what do each of these n object have in common that allows them to
> represented as a single tuple on the G table.


The explanation was already (too tersely ?) contained in my sentence

> > This table has columns named t0 t1 ... tn (containing the pointers
> > t0.seq t1.seq ... for counterparts associated and validated according
> > to some criteria) plus other service columns


Let me make an example.

t0 is a table of X-ray sources, it contains sky coordinates (ra,dec),
count rates and fluxes in 5 energy bands, detection probabilities etc.
In addition it contains a column named "seq" which is just an
auto_increment sequential number, and is the unique way to identify a
given source (there is a further complication due to the fact some records
are redundant detections of the same source, but since they are not used
in building the G's you'd better ignore this).

t1 might be e.g. a table of optical sources, with sky coordinates,
magnitudes, and, of course, its own "seq". There may be different tables
of optical sources (t1, t2, t3).

Or t4 can be a table of radio sources, with sky coordinates, fluxes, and
its "seq".

Or t5 can be a table of URLs into external astronomical sites like SIMBAD
or NED (if you know them), again with sky coordinates and a "seq".

Some of all these tables have their own identifiers, but sometimes these
aren't unique, or aren't numeric. So in general these tables MAY have an
UNIQUE PRIMARY index which can be either the original identifier, or some
combination of original columns (for instance a source identifier and a
field identifier) and an unique auxiliary key which is my auto_increment
seq (numeric, built at data ingestion). This again is mostly irrelevant to
you. The point is that each table has an unique auto_increment seq.

A "G" table will simply contain the "seq's" in the "member tables". For
ease of use the column names in G will be the table names of the member
tables. I indicated them as t0 t1 t2 ... that their actual names are e.g.
"nov06", "d1t3", "ukidss", "radio", "simbad" is irrelevant.

So a record in G may contain for instance :

- its own seq 253719 (do not be worried by the fact the number is large
there are lots of gaps for records removed during construction)
- the seq in the X-ray table (t0) : 1521
- the seq in an optical table (t1) : 1229
- the seq in another optical table (t2) : 42168
- the seq in the radio table (t3) : null
- the seq in an IR table (t4) : 9

Another record with seq 260429 can have the same t0=1521, but e.g. the seq
in the IR table t4=11, and all other t1 t2 t3 null.

Etc. etc.

Essentially G says that X-ray source 1521 can have up to 2 (or 1 or 7 or
whatever) potential counterparts, one is optical t2=1229 which is the same
as optical t3=42168 and the same as IR t4=9 ; the other is only IR t4=11,
etc. etc.

All associations are pre-computed via some sort of other (proximity)
analysis.

> You also mention other "service columns". What kinds of information are
> you keeping in those?


Information which is irrelevant to the present discussion, except for a
marginal point (see below). Like for instance a numeric rank which says
that the association 1521/1229/42168/null/9 is preferred, and the
association 1521/null/null/null/11 is unlikely, or to be rejected. Or
flags produced during the identification. Or the chance probabilities that
the association of a source in t0 and t1 or t2 is real considered the
distance and the density of objects having a given magnitude.

> > We have different versions of G corresponding to different sets
> > of member tables and different association criteria.


> "Each G" ? Again, that makes the concept of what a G really is more
> confusing to me. I understand databases and I know more than a little
> about stellar cartography, cosmology, and physics. Please don't hold
> back.


Each G in the sense of different versions. I may have one G starting from
a t0 with X-ray sources in an area of the sky, and another with a t0 in
another area or coming from a different analysis. Or even with the same t0
I may have a G whose members are t0 t1 t2 t3 (identification done two
years ago with tables available then) and a G whose members are t0 t1 t5
t6 t7 t8 t9 t10 (identification done now, with more tables, or with newer
versions of some tables). Again this is irrelevant for MY PROBLEM, except
for the following statement :

- I have no problems with the SELECT statements I reported above for
whatever number of t0...tn member tables. Neither if I issue the
statement as such, nor if I encapsulate it in a CREATE VIEW

- I start encounter problems if I want to join the VIEW created with the
above statement with something else, when n is slighly large (e.g. a G
with 11 members), but if I use ALGORITHM=TEMPTABLE the problem goes
away, at least for n=26 (my largest G).

- I have a problem again for a more complex join, which arises around
n=20.

Actually it's look like further experimenting shows a solution, but
I'm still unsure whether that gives the same results of the old one :

such a solution is : replace all LEFT JOIN by STRAIGHT_JOIN

Maybe you'd just want to comment on that ?

> Please do explain the data problems you are trying to solve in some more
> detail as I may be able to you to design a less cumbersome method
> of achieving the same goals.


I thank you for your offer. However we do not want to revolutionize all
our system and interfaces (Shawn, if you are interested I can give you
privately more specific pointers), which has been working well since some
years and used inside our consortium, and has just "gone public" for some
datasets.

The "data problem" is likely to have no or little relation with the
astronomical nature of the data, or the background explained above (which
howver I hope was of some interest), but can be summarized as follows.
Note the problems occur ONLY in the latest steps, which are those of
lesser importance for the user (or important for less users) :

- there are n tables (all with their "seq" column), which contain
several columns, of which some are particularly interesting, and
some other are less interesting

- these tables are associated via a "G" table which links all the seqs
of associated objects. This is precomputed once forever.

- the majority of the users will be interested in a small number of
interesting columns taken from some of the member tables (or
expressions thereof). Howevever we wanted to screen them from entering
in a SELECT statement columns explicitly by names like nov06.seq,
nov06.fluxb, w1t3.magip or even
dist(nov06.ra_corr,nov06.dec_corr,w1t3.ra,w1t3.dec l)*3600 (where dist
is an UDF) ... and I have even worse expressions.

So we originally devised our own java interface to somehow hide the
expansion of such aliases, and later embedded the aliases in a
CREATE VIEW.

The "FROM" of such statements is the one listed above, involving
G and a sequence of left joins on t0 to tn.

This works nicely and fast in both the old (no view) and new (with
view) way.

- a minority of users could be interested in accessing also some of
the columns in the member tables, which are NOT listed in our
non-hidden list or are NOT members of the view.

In the old arrangement, we simply had a check box which enabled
"view member also" and un-hided the hidden columns in a menu, Also
the user could simply type their name (say nov06.snrcd) in the
SELECT if he knew the name.

Such statement was the SAME involving G and a sequence of left joins on
t0 to tn. It is just a matter of listing more columns in the
"select" before the "from".

However if the new arrangement uses a view V to hide the entire
statement, if I WANT USERS TO BE ABLE ALSO TO SEE HIDDEN COLUMNS
(if you want THIS is the data problem !!!) I have to explicitly
repeat the list of joins. I cannot issue

select nov06.snrcd from V

because nov06.snrcd was not included in the definition of the view
(and there are too many columns in the n members t0...tn to be all
named in a viable view). So I generate a statement like that listed
at about 2/3 of the post which started the thread

SELECT
list of (V.colname and ti.colname with i chosen among 0 and n)
FROM
( G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
) left join V on G.seq=V.seq

Again this works provided one uses ALGORITHM=TEMPTABLE in the CREATE
VIEW. Despite the fact some redundant queries are issued.

- a real minority of users (me only ? for debugging purposes ? it
happened me once that I detected and fixed a typo that way )
could be interested in accessing not only the interesting columns
in the view AND the hidden columns in the members, but also
columns in a FURTHER TABLE tk (e.g. which is not yet a member, or
is a new release of some member which I have to evaluate whether
to replace). This involves a more complex join ... which is listed
towards the end of the original post and I won't repeat here

Such join involves one further view which in turn is a join of
the G and a correlation table between tk and the X-ray table t0
(essentially listing all seqs in t0 and tk whose object are closer
than a predefined sky distance).

It is this latter statement (of rare use, hence my reluctance to
revolutionize all the rest) which enters the "statistics state"
forever for large number of members.

However, as I said, replacing LEFT JOIN by STRAIGHT_JOIN seems
to solve it.

Now coming to more general issues, I have read (or re-read) most of the
manual pages you quoted, but your explanation below is what condenses more
effectively what happens.

> EXPLAIN SELECT simply stops a normal SELECT statement from actually
> performing the data retrieval steps and shows us [...]


I used in the past EXPLAIN SELECT mainly to check whether introducing an
index (or sometimes a parenthesized order) improved the performance,
although why it did it was sort of black magic.

> It is during this optimization phase that most of your CPU time is being
> used as the engine will work many permutations of joining one table to
> another until it reaches a decision about which plan is "less expensive"
> that all of the others.


So essentially this matches what a lost Google reference said, that in
case of many joins an excessive times may be spent checking up to n!
combinations unless one somehow "forces the order".

What I gather from some hints in the documentation (and my very rough and
quick test) is that STRAIGHT_JOIN might be the way to force the order (but
will it have some unpleasant side effects or is it safe ?)

Thanks again.

--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
() ascii ribbon campaign - against html mail
/\ http://arc.pasp.de/
  Réponse avec citation
Vieux 03/09/2007, 17h41   #4
Shawn Green
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

Hello Lucio,

Thank you for the excellent description of you problem. I believe I
completely understand both the data you are handling and the problems
you are facing.

I would not ask you to change your schema at this point. There is far
too much work put into it at this phase to suggest a redesign. What I
hope to do is to you to make you queries work better by applying
hints, modifiers, and limits to the optimizer.

http://dev.mysql.com/doc/refman/5.0/en/join.html states:
"STRAIGHT_JOIN is identical to JOIN, except that the left table is
always read before the right table. This can be used for those (few)
cases for which the join optimizer puts the tables in the wrong order."

You asked what would happen if you replaced all of your view's LEFT JOIN
clauses with STRAIGHT_JOIN clauses. When used in the FROM clause (as
part of the definitions of where the data comes from) STRAIGHT_JOIN
would be equivalent to a JOIN which is equivalent to an INNER JOIN. This
is not what you are trying to achieve.

However, there is another place you can put the STRAIGHT_JOIN modifier:
in the SELECT clause
(http://dev.mysql.com/doc/refman/5.0/en/select.html) . Quoting again:
" STRAIGHT_JOIN forces the optimizer to join the tables in the order in
which they are listed in the FROM clause. You can use this to speed up a
query if the optimizer joins the tables in non-optimal order. See
Section 6.2.1, “Optimizing Queries with EXPLAIN”. STRAIGHT_JOIN also can
be used in the table_references list. See Section 12.2.7.1, “JOIN Syntax”.

SELECT STRAIGHT_JOIN ....
FROM ...
LEFT JOIN ...
WHERE ...
...

This will allow you to keep your LEFT JOINs in the <table reference>
portion of your query (everything between FROM and WHERE) but avoid all
of the permutations the optimizer performse related to trying to analyze
which table to join first to which other table. Why are some sets of
tables optimized more quickly than the others? My best guess would be
that some mix of WHERE conditions and table indexes make it much faster
to eliminate combinatorial permutations of JOIN sequences than others.

There is another variable you could use to minimize how many table
permutations the optimizer will examine. This is the variable
optimizer_search_depth:
(http://dev.mysql.com/doc/refman/5.0/...r_search_depth)

(My apologies if that link wraps incorrectly). You can set this variable
on a per-session basis so it would be possible to tune it for each call
to your views. I know that would be quite the administrative hassle to
implement setting this variable for each call to your views but I
provide it here for completeness.

Please try out "SELECT STRAIGHT_JOIN" and let me know if your situation
improves.

BTW - we encourage everyone to reply to the full list on all responses
(unless they are confidential) so that all members can gain from the
knowledge transfer.
--
Shawn Green, Support Engineer
MySQL Inc., USA, www.mysql.com
Office: Blountville, TN
__ ___ ___ ____ __
/ |/ /_ __/ __/ __ \/ /
/ /|_/ / // /\ \/ /_/ / /__
/_/ /_/\_, /___/\___\_\___/
<___/
Join the Quality Contribution Program Today!
http://dev.mysql.com/qualitycontribution.html


  Réponse avec citation
Vieux 04/09/2007, 10h49   #5
Lucio Chiappetti
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

On Mon, 3 Sep 2007, Shawn Green wrote:

> What I hope to do is to you to make you queries work better by
> applying hints, modifiers, and limits to the optimizer.


Many thanks, Shawn.

> "STRAIGHT_JOIN is identical to JOIN, except ...
>
> You asked what would happen if you replaced all of your view's LEFT JOIN
> clauses with STRAIGHT_JOIN clauses. [...] This is not what you are
> trying to achieve.


I realized that. I overlooked "is identical to JOIN" since I generally use
only LEFT JOINs to benefit of null returns (to us it is equally important
to know "source A has these 1/2/n counterparts" and "source A has no
counterparts" in the other catalogue/s).

I originally tested that STRAIGHT_JOIN made the "explain select" arrive to
an end without looping forever, but as soon as I tried a real select I
realized the different behaviour.

> However, there is another place you can put the STRAIGHT_JOIN modifier:
> in the SELECT clause


I discovered that too, and it looks promising. I plan to test whether that
makes any difference on the query result and execution time in some
representative cases.

> Why are some sets of tables optimized more quickly than the others? My
> best guess would be that some mix of WHERE conditions and table indexes
> make it much faster to eliminate combinatorial permutations of JOIN
> sequences than others.


I also realized that the WHERE part enters the optimization too ... this
is of course unpredictable a priori, as it depends on the particular user
needs.

> This is the variable optimizer_search_depth:


This also looks interesting thanks. I did some quick experimenting. For
some reasons a SHOW VARIABLE tells me its normal value is 62. Apparently
this does not depend on the database I select (I thought my real "data"
database with the views and G's could have an higher value than e.g. my
own "administrative" DB or the mysql DB, but it is always 62.

I tried lowering it to 10/15/20 and even to 0 (which according to 5.2.3
in the manual should be a sort of "automatic" value). In all cases the
explain select on our maximal statement (the one which loops forever in
statistics status under default conditions) reaches an end rather fast,
and returns the same result (irrespective of 0/10/15/20).

The ORDER returned by playign with optimizer_search_depth is however
different from the one returned with SELECT STRAIGHT_JOIN. The difference
is just in the order (the type, key and "extra" looks the same, in general
"ref" or "eq_ref" and "using index") so I suppose it should not make much
difference.

> You can set this variable on a per-session basis so it would be possible
> to tune it for each call to your views.


While I'm doing the tests under the mysql linemode client, our production
environment is a Java jsp interface under tomcat. I have to check with my
colleague dealing with java programming what is a "session" for him
(probably a JDBC connection in connection pool).

Actually what looks appealing in optimizer_search_depth vs SELECT
STRAIGHT_JOIN is the possibility of fixing (no tuning per session)
optimizer_search_depth to a decent default (0 ?) once forever, because
this requires no changes to the code (however if SELECT STRAIGHT_JOIN will
require changes concentrated in a few places).

I'll do some experimenting and report back.

--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For
more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
----------------------------------------------------------------------- ()
ascii ribbon campaign - against html mail /\ http://arc.pasp.de/
  Réponse avec citation
Vieux 04/09/2007, 14h34   #6
Lucio Chiappetti
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

On Tue, 4 Sep 2007, Lucio Chiappetti wrote:

> I'll do some experimenting and report back.


In lack of better ways of doing a tie-break, I've done the following
tests (with the linemode client), checking both the results of a query and
the total time spent. I tested 16 different combinations of arbitrary
WHERE conditions and correlation with an arbitrary external tables, using
my "maximal" G (the one with 26 members).

For each I tested 3 cases (total 16*3=48) :

a) the query on the "virtual" table correlated with the external
(the virtual is my G left join t1 ... left join tn). This involves
the lowest number of joins (29) and is of course the fastest (and
also the easiest to get access to "hidden" columns by name, but
is not what I intend our users use because the way to select columns
is clumsy (they should use a VIEW on the 26 members instead ...
normally with no correlation on external that will be equally fast),
but represents my REFERENCE, i.e. I checked the results of the other
test to be the same as this.

b) the query on the VIEW V with "members also" (which implies a redundant
join of V with G left join t1 ... left join tn) and correlated with
an external table. Doubles the joins (59). I used here
SELECT STRAIGHT_JOIN.

c) the same query of (b) but with a normal select, and preceded by
setting optimizer_search_depth=0 (auto).

I did each test in a fresh mysql session to prevent cached valued to
affect the timings.

The good news are that all query go to end without getting stuck in the
statistics state, and that (a),(b),(c) for the same query return the same
results.

Considering the timing instead :

(a) are obviously the fastest (from 0.0 to 0.63 seconds at worst)

(b) the queries with SELECT STRAIGHT_JOIN ("not optimized" ?) are
*in general* the slowest but not too slow, i.e. complete within from
0.8 to 1.9 sec. There are however a couple of cases involving one
external table in which they are slower (though not unfeasible),
from 3 to 12 sec in one case, and from 24 to 45 sec in another.

(c) the queries with optimizer_search_depth=0 have a speed comparable
with (b). *In general* they are marginally faster than (b) (0.7
to 1.6 sec) when (b) is reasonably fast. When (b) is slow,
however (c) is TWICE AS SLOWER (5 to 23 and 48 to 89 sec)

The explain select does not give obvious clues while those particular
table combinations are slower, and, considered that the queries involving
"view + member also + external" concern an absolute minority of users[*]
it is not worth spending more time investigating.
[*] actually a larger minority (I hope it remains such) of our "public"
users accesses the DB once to "take away" all data without performing any
selection :-(

My inclination therefore would be to prefer optimizer_search_depth=0
to SELECT STRAIGHT_JOIN *if* it can be easily arranged in our tomcat
environment, because of the marginal increase in speed despite the
occasional worsening. But the two look almost equivalent.

--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
() ascii ribbon campaign - against html mail
/\ http://arc.pasp.de/
  Réponse avec citation
Vieux 04/09/2007, 20h04   #7
Shawn Green
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

Lucio Chiappetti wrote:
> On Tue, 4 Sep 2007, Lucio Chiappetti wrote:
>
>> I'll do some experimenting and report back.

>
> In lack of better ways of doing a tie-break, I've done the following
> tests (with the linemode client), checking both the results of a query and
> the total time spent. I tested 16 different combinations of arbitrary
> WHERE conditions and correlation with an arbitrary external tables, using
> my "maximal" G (the one with 26 members).
>
> For each I tested 3 cases (total 16*3=48) :
>
> a) the query on the "virtual" table correlated with the external
> (the virtual is my G left join t1 ... left join tn). This involves
> the lowest number of joins (29) and is of course the fastest (and
> also the easiest to get access to "hidden" columns by name, but
> is not what I intend our users use because the way to select columns
> is clumsy (they should use a VIEW on the 26 members instead ...
> normally with no correlation on external that will be equally fast),
> but represents my REFERENCE, i.e. I checked the results of the other
> test to be the same as this.
>
> b) the query on the VIEW V with "members also" (which implies a redundant
> join of V with G left join t1 ... left join tn) and correlated with
> an external table. Doubles the joins (59). I used here
> SELECT STRAIGHT_JOIN.
>
> c) the same query of (b) but with a normal select, and preceded by
> setting optimizer_search_depth=0 (auto).
> <snip>


In your b) test, did you use the SELECT STRAIGHT_JOIN as your outer
SELECT statement or within the CREATE VIEW statement>? If you only tried
it one way, you could try it the other, too.

>
>[*] actually a larger minority (I hope it remains such) of our "public"
> users accesses the DB once to "take away" all data without performing any
> selection :-(
>


It saddens me to see people abuse your hard work in this way.

> My inclination therefore would be to prefer optimizer_search_depth=0
> to SELECT STRAIGHT_JOIN *if* it can be easily arranged in our tomcat
> environment, because of the marginal increase in speed despite the
> occasional worsening. But the two look almost equivalent.
>


Another option would be to build your view dynamically based on the
criteria that a user selects through a web-based interface.

One other way to approach this project would be to normalize your
relationships and have one skinny but very tall table similar to

CREATE TABLE map_table (
object1_id int
, object1_type int
, object2_id int
, object2_type int
, confidence tinyint
)

where confidence would be a whole number from 0 to 100. You would need
only one of these to replace each G table you are generating now.

However, that would be a major difference in how you currently use your
data and I would not suggest this for a near-term solution.
--
Shawn Green, Support Engineer
MySQL Inc., USA, www.mysql.com
Office: Blountville, TN
__ ___ ___ ____ __
/ |/ /_ __/ __/ __ \/ /
/ /|_/ / // /\ \/ /_/ / /__
/_/ /_/\_, /___/\___\_\___/
<___/
Join the Quality Contribution Program Today!
http://dev.mysql.com/qualitycontribution.html
  Réponse avec citation
Vieux 05/09/2007, 09h29   #8
Lucio Chiappetti
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

On Tue, 4 Sep 2007, Shawn Green wrote:
> > On Tue, 4 Sep 2007, Lucio Chiappetti wrote:


> > For each I tested 3 cases (total 16*3=48) :
> >
> > a) the query on the "virtual" table correlated with the external
> > (the virtual is my G left join t1 ... left join tn). [...]
> > represents my REFERENCE,


this case will also exploit full (or default) optimization

> > b) the query on the VIEW V with "members also" (which implies a
> > redundant join of V with G left join t1 ... left join tn) and
> > correlated with an external table. [...] I used here SELECT
> > STRAIGHT_JOIN.


> > c) the same query of (b) but with a normal select, and preceded by
> > setting optimizer_search_depth=0 (auto).


> In your b) test, did you use the SELECT STRAIGHT_JOIN as your outer
> SELECT statement or within the CREATE VIEW statement?


No. I left the CREATE VIEW alone (standard LEFT JOINs) in all cases,
since that will be fully optimized, and used by a majority of users.
SELECT STRAIGHT_JOIN was used only in (b) ["no optimization"] and not in
(c) ["residual optimization"].

Our inclination would now be to use SELECT STRAIGHT_JOIN because it's the
one involving only "elegant" changes to the code. Consider again that the
case "view + member also + other table" will be rare (very few users)
compare to "view alone".

Anyhow all this was rather instructive.

Now I'll start experimenting with UNIONs too ...

--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
() ascii ribbon campaign - against html mail
/\ http://arc.pasp.de/
  Réponse avec citation
Vieux 14/09/2007, 09h33   #9
Jeremy Cole
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

Hi Shawn, Lucio,

> SELECT STRAIGHT_JOIN ....
> FROM ...
> LEFT JOIN ...
> WHERE ...
> ...


Just to correct a point here... if a query uses only LEFT JOIN or RIGHT
JOIN, the join order is fixed by the query's order itself, so using
STRAIGHT_JOIN should have no effect whatsoever. Equally important,
since the join order is fixed when you use LEFT JOIN, you *must* list
the joins in the correct order when writing the query, otherwise you
will see very poor performance.

MySQL's optimizer cannot reorder the joins because it has the potential
to change the result of the query.

Regards,

Jeremy

--
high performance mysql consulting
www.provenscaling.com
  Réponse avec citation
Vieux 14/09/2007, 10h44   #10
Lucio Chiappetti
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

On Fri, 14 Sep 2007, Jeremy Cole wrote:

> Just to correct a point here... if a query uses only LEFT JOIN or RIGHT JOIN,
> the join order is fixed by the query's order itself, so using STRAIGHT_JOIN
> should have no effect whatsoever. Equally important, since the join order is


but experimentally using SELECT STRAIGHT_JOIN *does make* a difference. If
I use it there is no time spent in the "statistics" phase, otherwise there
is. Shawn can comment more conclusively since he knows the inner working
of mysql. I just report a finding by experiment.

> should have no effect whatsoever. Equally important, since the join order is
> fixed when you use LEFT JOIN, you *must* list the joins in the correct order
> when writing the query, otherwise you will see very poor performance.


However the "correct order" is not always obvious to be determined a
priori (like in my cases where I have a "glorified correlation" table G
which shall go first, but all member tables which go next with equal rank
(except the "first member" which is more equal than the other "=) ). The
user can then build a query which uses only SOME of the members. And I've
noticed that the result of EXPLAIN SELECT (the order) changed according to
the content of the query (e.g. the particular WHERE condition).

So for me experimentally use of SELECT STRAIGHT_JOIN is an effective
solution.

--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
() ascii ribbon campaign - against html mail
/\ http://arc.pasp.de/
  Réponse avec citation
Vieux 17/09/2007, 21h42   #11
Pete Harlan
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

On Fri, Sep 14, 2007 at 01:33:51AM -0700, Jeremy Cole wrote:
> Hi Shawn, Lucio,
>
> >SELECT STRAIGHT_JOIN ....
> >FROM ...
> >LEFT JOIN ...
> >WHERE ...
> >...

>
> Just to correct a point here... if a query uses only LEFT JOIN or RIGHT
> JOIN, the join order is fixed by the query's order itself, so using
> STRAIGHT_JOIN should have no effect whatsoever.


True, because you say "only" left or right joins.

> Equally important, since the join order is fixed when you use LEFT
> JOIN, you *must* list the joins in the correct order when writing
> the query, otherwise you will see very poor performance.


"Only" is missing from here, which could be misleading. MySQL will
reorder the t0, t1 and t2 joins in:

select ...
from t0
join t1 on ...
join t2 on ...
left join t3 on ...
where ...

> MySQL's optimizer cannot reorder the joins because it has the potential
> to change the result of the query.


Do have an example in mind?

Thanks,

--Pete
  Réponse avec citation
Vieux 17/09/2007, 21h56   #12
Baron Schwartz
Aucun Avatar
 
Messages: n/a
Hébergeur:
Par défaut Re: excessive time spent in "statistics" status

Just a quick note to add to this conversation:

Pete Harlan wrote:
> On Fri, Sep 14, 2007 at 01:33:51AM -0700, Jeremy Cole wrote:
>> MySQL's optimizer cannot reorder the joins because it has the potential
>> to change the result of the query.


Not in all cases. If a LEFT JOIN query also has a WHERE clause that
disallows NULL rows from the right-hand table, it will (may? I'm not
sure if it always will, but at least some times it will) treat it as an
INNER JOIN, and it's therefore a candidate for reordering. I've also
seen cases where a LEFT JOIN becomes a CROSS JOIN
(http://bugs.mysql.com/bug.php?id=30842).

All RIGHT JOIN are rewritten to equivalent LEFT JOIN, so the same
optimizations can apply.

As a side note, I've really benefited from attending Timour's talk on
the optimizer at the conference this spring. His slides are good
reading too:
http://conferences.oreillynet.com/pr...nov_timour.pdf

Baron
  Réponse avec citation
Réponse


Outils de la discussion

Règles de messages
Vous ne pouvez pas créer de nouvelles discussions
Vous ne pouvez pas envoyer des réponses
Vous ne pouvez pas envoyer des pièces jointes
Vous ne pouvez pas modifier vos messages

Les balises BB sont activées : oui
Les smileys sont activés : oui
La balise [IMG] est activée : oui
Le code HTML peut être employé : non
Trackbacks are oui
Pingbacks are oui
Refbacks are oui


Fuseau horaire GMT +1. Il est actuellement 17h31.


Édité par : vBulletin® version 3.7.3
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Friendly URLs by vBSEO 3.2.0 RC5 Tous droits réservés.
Version française #16 par l'association vBulletin francophone
PHWinfo est un site Éducation Sans Frontières ©2000-2008
Ad Management by RedTyger
©Tous droits réservés par les parties respectives
Page generated in 0,39890 seconds with 20 queries