boost::regex - open ranges a no no? was: Slow Regex Code
Mirco Wahab wrote:
I modified the expression:
> ...
> boost::regex reg("\\b\\d{2,}\\b");
> ...
to:
...
boost::regex reg("\\b\\d\\d+\\b");
...
with tremendous improvements:
> [Windows XP-32, Athlon-64/3200+,@2290MHz]
> - Visual Studio 2008 + Boost 1.35.0 9.3 sec
> - Perl 5.10 (Active-) 10.4 sec
[Windows XP(32bit), Athlon-64/3200+ @2290MHz]
Visual Studio 2008 + Boost 1.35.0 1.8 sec
Perl 5.10.003 (AP, use64bitint=undef) 9.5 sec
> [Linux 2.6.23, Pentium4,@2660MHz]
> - gcc 4.3, -O2, Boost 1.33.1 13.2 sec
> - Perl 5.8.8 8.2 sec
[Linux 2.6.23(32bit), Pentium4/NW @2660MHz]
gcc 4.3.1 -O2, Boost 1.33.1 1.2 sec (user)
Perl 5.8.8 (32bit, use64bitint=undef) 6.2 sec (user)
> [Linux 2.6.23, Core2/Q6600,@3240MHz]
> - gcc 4.3, -O2, Boost 1.33.1 6.3 sec
> - Perl 5.8.8 (i586, use64bitint=undef) 3.2 sec
[Linux 2.6.23(32bit), Core2/Q6600,@3240MHz]
gcc 4.3.1 -O2, Boost 1.33.1 0.55sec (user)
Perl 5.8.8 (32bit, use64bitint=undef) 2.4 sec (user)
> [Linux 2.6.24, Core2/Q9300,@3338MHz]
> - gcc 4.3, -O2, Boost 1.34.1 'std::runtime_error' (??)
> - Perl 5.10 (i586, use64bitint=undef) 10.4 sec
[Linux 2.6.25(32bit), Core2/Q9300,@3338MHz]
gcc 4.3.1, -O3, Boost 1.34.1 0.42sec (user)[*]
Perl 5.10.0 (32bit, use64bitint=undef) 4.0 sec (user)
[*] => after kernel update & gcc update,
g++ -O3 -c boostrg.cxx -o boostrg.o
works now
modified Code, C++:
==>
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>

int number_count(const char *block, unsigned int len)
{
    boost::match_flag_type flags = boost::match_default;
    boost::regex reg("\\b\\d\\d+\\b");
    boost::cmatch what;
    const char *from = block, *to = block+len;
    int n = 0;
    while( boost::regex_search(from, to, what, reg, flags) ) {
        from = what[0].second;              // continue right after this match
        ++n;
    }
    return n;
}

int main ()
{
    // binary mode, so that tellg() and read() agree on the byte count
    // (text mode on Windows translates CR/LF)
    std::ifstream in("nietzsche8.txt", std::ios::binary); // this is a 112 MB file,
                                            // it's 8 x the Nietzsche
    if(in) {                                // fulltext in plain ASCII
        in.seekg(0, std::ios::end);         // get to EOF
        unsigned int len = in.tellg();      // read file pointer
        in.seekg(0, std::ios::beg);         // back to pos 0
        char *block = new char [len+1];     // don't be stingy
        in.read(block, len);                // slurp the file
        int n = number_count(block, len);   // process data
        std::cout << "The text (" << len/1024 << "KB) has "
                  << n << " numbers >= 10!" << std::endl;
        delete [] block;                    // play fair
    }
    return 0;
}
<==
modified Code, Perl:
==>
open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
my $block;
do { local $/; $block = <$fh> };
close $fh;
my $n = 0;
++$n while $block =~ /\b\d\d+\b/g; # process data
print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";
<==
At least for me, a very interesting difference.
Without the open range, Boost::Regex now beats Perl by a significant margin.
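To check that the rewrite changes only the speed and not the result, both patterns can be run over the same text and the counts compared. The following is only a minimal sketch: it uses std::regex from the C++11 standard library instead of Boost.Regex (an assumption made so it compiles without Boost, so its timings say nothing about the numbers above), and the helper names count_matches and same_count are made up for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: count the non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);           // ECMAScript syntax by default
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), re),
        std::sregex_iterator()));           // default-constructed = end of matches
}

// Hypothetical helper: true if the open range \d{2,} and the unrolled
// form \d\d+ find the same number of matches in `text`.
bool same_count(const std::string& text)
{
    return count_matches(text, "\\b\\d{2,}\\b")
        == count_matches(text, "\\b\\d\\d+\\b");
}
```

Calling same_count() on a sample of the input should return true. If it ever returned false, the two expressions would not be interchangeable and the timing comparison would be meaningless.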
Regards
Mirco