|
|
|
#1 |
|
Messages: n/a
Hébergeur: |
Hi,
Need some ideas for this problem: If you had a file containing boundaries [A,B] (up to 25 million) and a file containing a list of points P (up to 10 billion). What would be the fastest way to determine which of the listed boundaries contain a point P? Standard ugly std::count_if and map inserts and lookup are far to slow for this problem. Any other ideas? (problem is a simpler version of a derivative default map issue I'm having) Thanks for any . |
|
|
|
#2 |
|
Messages: n/a
Hébergeur: |
ahmadcorp.m@googlemail.com wrote:
> If you had a file containing boundaries [A,B] (up to 25 million) and a > file containing a list of points P (up to 10 billion). What would be > the fastest way to determine which of the listed boundaries contain a > point P? The 25 million points will happily fit in memory. Make a binary tree where each node represents an interval min to max, and the two sub-trees represent intervals min to mid and mid to max. Then store intervals that cross mid in the node. Pseudo-code: class Node { point_t min; point_t mid; point_t max; Node_ptr leftnode; Node_ptr rightnode; set< pair<point_t,point_t> > cross_mid; public: Node(point_t min_, point_t max_): min(min_), mid((min_+max_)/2), max(max_), leftnode(new Node(min,mid)), // you need to postpone this until rightnode(new Node(mid,max)) // it's needed {} void insert(pair<point_t,point_t> b) { if (b.second<mid) left->insert(b) else if (b.first>mid) right->insert(b) else cross_mid.insert(b); } set< pair<point_t,point_t> > find(point_t p) { return cross_mid.find(p) // left as an exercise :-) union ( (p<mid) ? leftnode : rightnode ) -> find(p); } }; Like all binary trees, this suffers if it becomes unbalanced. Consider inserting in a random order, perhaps. (Are your files ordered?) The subdivision will work best if the ranges are fairly evenly distributed. In this case, the inserts should be O(N log N) and the lookups each O(log N + M). If the ranges are strongly clustered some other approach might be better. Phil. |
|
|
|
#3 |
|
Messages: n/a
Hébergeur: |
Thanks for the info. I have also considered a binary tree, however I
forgot to mention a key point that might not work for a binary tree. We need to determine all applicable boundary values for a specific point. Eg. point 3 is contained within 2000 of the 25 million boundaries specified. A binary tree would imply multiple searches to find all applicable boundaries. Or some trickery based on the depth of node on a tree. |
|
|
|
#4 |
|
Messages: n/a
Hébergeur: |
ahmadcorp.m@googlemail.com wrote:
> Thanks for the info. I have also considered a binary tree, however I > forgot to mention a key point that might not work for a binary tree. > > We need to determine all applicable boundary values for a specific > point. Eg. point 3 is contained within 2000 of the 25 million > boundaries specified. > > A binary tree would imply multiple searches to find all applicable > boundaries. No, why? As you descend the tree, you accumulate all the ranges that include your point. |
|
|
|
#5 |
|
Messages: n/a
Hébergeur: |
On 5 Feb, 16:29, "ahmadcor...@googlemail.com"
<ahmadcor...@googlemail.com> wrote: > Thanks for the info. I have also considered a binary tree, however I > forgot to mention a key point that might not work for a binary tree. > > We need to determine all applicable boundary values for a specific > point. Eg. point 3 is contained within 2000 of the 25 million > boundaries specified. > After wiki-ing a while, I've found the following: http://en.wikipedia.org/wiki/Interval_tree http://en.wikipedia.org/wiki/Segment_tree Phil's solution is very similar to "Alternative 1" in the first article and I think it's the most appropriate for your problem. Dario |
|
|
|
#6 |
|
Messages: n/a
Hébergeur: |
On 6 Feb, 08:00, Dario Saccavino <kath...@gmail.com> wrote:
> On 5 Feb, 16:29, "ahmadcor...@googlemail.com" > > <ahmadcor...@googlemail.com> wrote: > > Thanks for the info. I have also considered a binary tree, however I > > forgot to mention a key point that might not work for a binary tree. > > > We need to determine all applicable boundary values for a specific > > point. Eg. point 3 is contained within 2000 of the 25 million > > boundaries specified. > > After wiki-ing a while, I've found the following: > > http://en.wikipedia.org/wiki/Interva...i/Segment_tree > > Phil's solution is very similar to "Alternative 1" in the first > article and I think it's the most appropriate for your problem. > > Dario Thanks for the references. The algorithm is much clearer now. Thanks to all. Immensely appreciate it. |
|
![]() |
| Outils de la discussion | |
|
|