|
|
|
#1 |
|
Messages: n/a
Hébergeur: |
Hello,
I have a small problem between maths and databases. I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? Best regards Alain |
|
|
|
#2 |
|
Messages: n/a
Hébergeur: |
Alain Reymond wrote:
> Hello, > > I have a small problem between maths and databases. > > I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. > > Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. > > The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. > > I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. > > Are there techniques to manage that kind of problem ? > > Best regards > > > Alain What is in the 55 columns? If you are talking about the distance between places on the earth, then there are only two columns of importance -- latitude and longitude. What is the purpose of this calculation? Is this a homework problem, perhaps? |
|
|
|
#3 |
|
Messages: n/a
Hébergeur: |
sheldonlg wrote:
> Alain Reymond wrote: >> Hello, >> >> I have a small problem between maths and databases. >> >> I have a table in a database with 55 columns of reals and at least >> 200.000 lines (but up to 2.000.000 if I upload everything). So, I am >> in a 55 dimensions space with as many points as lines in my database. >> >> Here is my problem : I try to find the closest neighbour (limited to >> 1.000 points) of a point taken at random in the database. The distance >> is a classical euclidian distance in a 55 dimensions space. >> >> The brutal method is to browse the entire database, calculate the >> distance to the point for each row, order it and limit to the first >> 1000 rows. This is a single SQL instruction but it consumes a lot of >> resources. >> I suppose that it is possible to reduce the amount of calculus by >> eliminating points for which we know that they are to far away. Let's >> take an analogy : if you want to evaluate the closest cities to >> Chamonix - the famous French ski station! -, you will eliminate the >> cities situated in the south hemisphere as you know before evaluating >> anything that they are to far. >> >> Are there techniques to manage that kind of problem ? >> >> Best regards >> >> >> Alain > > What is in the 55 columns? If you are talking about the distance > between places on the earth, then there are only two columns of > importance -- latitude and longitude. > > What is the purpose of this calculation? Is this a homework problem, > perhaps? > As he said - this is 55 dimension space. Sounds like an advanced math/physics problem. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#4 |
|
Messages: n/a
Hébergeur: |
Jerry Stuckle a écrit :
> sheldonlg wrote: >> Alain Reymond wrote: >>> Hello, >>> >>> I have a small problem between maths and databases. >>> >>> I have a table in a database with 55 columns of reals and at least >>> 200.000 lines (but up to 2.000.000 if I upload everything). So, I am >>> in a 55 dimensions space with as many points as lines in my database. >>> >>> Here is my problem : I try to find the closest neighbour (limited to >>> 1.000 points) of a point taken at random in the database. The >>> distance is a classical euclidian distance in a 55 dimensions space. >>> >>> The brutal method is to browse the entire database, calculate the >>> distance to the point for each row, order it and limit to the first >>> 1000 rows. This is a single SQL instruction but it consumes a lot of >>> resources. >>> I suppose that it is possible to reduce the amount of calculus by >>> eliminating points for which we know that they are to far away. >>> Let's take an analogy : if you want to evaluate the closest cities >>> to Chamonix - the famous French ski station! -, you will eliminate >>> the cities situated in the south hemisphere as you know before >>> evaluating anything that they are to far. >>> >>> Are there techniques to manage that kind of problem ? >>> >>> Best regards >>> >>> >>> Alain >> >> What is in the 55 columns? If you are talking about the distance >> between places on the earth, then there are only two columns of >> importance -- latitude and longitude. >> >> What is the purpose of this calculation? Is this a homework problem, >> perhaps? >> > > As he said - this is 55 dimension space. Sounds like an advanced > math/physics problem. Thanks for your answer. In fact, there is a solution using kd-trees. Best regards Alain |
|
|
|
#5 |
|
Messages: n/a
Hébergeur: |
>> I have a small problem between maths and databases.
>> >> I have a table in a database with 55 columns of reals and at least >200.000 lines (but up to 2.000.000 if I upload everything). So, I am in >a 55 dimensions space with as many points as lines in my database. >> >> Here is my problem : I try to find the closest neighbour (limited to >1.000 points) of a point taken at random in the database. The distance >is a classical euclidian distance in a 55 dimensions space. If there's a limit on the distance then there's a limit on the difference in each individual dimension (the "bounding box"). How many points can you eliminate with a clause: WHERE x13 BETWEEN 23.15 AND 25.15 ? Even if you don't do this on all 55 dimensions, try doing it on a couple with the most variation in range. Put an index on each of those columns. Hope that MySQL will use it before applying the computationally expensive formula to the rows. I believe MySQL will only use one index but it should pick the best one. >> The brutal method is to browse the entire database, calculate the >distance to the point for each row, order it and limit to the first 1000 >rows. This is a single SQL instruction but it consumes a lot of >resources. >> I suppose that it is possible to reduce the amount of calculus by >eliminating points for which we know that they are to far away. Let's >take an analogy : if you want to evaluate the closest cities to Chamonix >- the famous French ski station! -, you will eliminate the cities >situated in the south hemisphere as you know before evaluating anything >that they are to far. Exactly - but you need to know that your entire database isn't from Australia before you do that, or have an explicit distance limit. >> Are there techniques to manage that kind of problem ? |
|
![]() |
| Outils de la discussion | |
|
|