#1
Still learning C++. I'm writing some regex code using Boost. It works great. The only thing is, this code seems slow to me compared to the equivalent Perl and Python. I'm sure I'm doing something incorrect. Any tips?

    #include <boost/regex.hpp>
    #include <iostream>

    // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35 /usr/local/lib/libboost_regex-gcc41-mt-s.a
    // g++ numbers.cpp -o numbers.exe -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

    void number_search(const std::string& portion)
    {
        static const boost::regex Numbers("\\b\\d{9}\\b");
        static const boost::regex& rNumbers = Numbers;
        boost::smatch matches;

        std::string::const_iterator Start = portion.begin();
        std::string::const_iterator End = portion.end();

        while (boost::regex_search(Start, End, matches, rNumbers))
        {
            std::cout << matches.str() << std::endl;
            Start = matches[0].second;
        }
    }

    int main ()
    {
        std::string portion;
        while (std::getline(std::cin, portion))
        {
            number_search(portion);
        }
        return 0;
    }
#2
On Jun 8, 6:32 pm, brad <byte8b...@gmail.com> wrote:
> Still learning C++. I'm writing some regex using boost. It
> works great. Only thing is... this code seems slow to me
> compared to equivalent Perl and Python.

Does it seem slow, or is it measurably slower? There are two possibilities:

1. it only seems slower, because the rest of the code is significantly faster, or
2. it really is slower, because Perl and Python can compile the expression into some sort of efficient byte code, since they already have an "execution" machine for such byte code loaded.

Note that pure (non-extended) regular expressions can be made to run considerably faster, since they can be converted to a pure DFA. My own regular expression class does this. For most purposes, however, boost::regex will be fast enough, and worth the added flexibility. (My own regular expression class was designed for a very specific use, where it doesn't need the extensions but does need some additional features which aren't in Boost. For most general use, boost::regex is preferable.)

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
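To illustrate the DFA remark: for a pattern as simple as \b\d{9}\b, the matching can be done in a single left-to-right pass with no backtracking. The sketch below is not the regular expression class mentioned in the post and not part of Boost. The names `is_word_char` and `find_nine_digit_numbers` are made up for this example, and it assumes plain ASCII input in the default "C" locale. It relies on the observation that \b\d{9}\b can only match a maximal run of word characters that consists of exactly nine digits.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: the ASCII characters that \w would match.
static bool is_word_char(char c)
{
    const unsigned char u = static_cast<unsigned char>(c);
    return std::isalnum(u) != 0 || c == '_';
}

// Hypothetical function: collect every token that \b\d{9}\b would match.
// The scan walks each maximal run of word characters once and keeps it
// only if it is exactly nine characters long and all of them are digits.
std::vector<std::string> find_nine_digit_numbers(const std::string& text)
{
    std::vector<std::string> found;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        if (!is_word_char(text[i])) {   // between tokens: skip
            ++i;
            continue;
        }
        const std::size_t start = i;    // first character of a token
        bool all_digits = true;
        while (i < n && is_word_char(text[i])) {
            if (std::isdigit(static_cast<unsigned char>(text[i])) == 0)
                all_digits = false;
            ++i;
        }
        if (all_digits && i - start == 9)
            found.push_back(text.substr(start, 9));
    }
    return found;
}
```

Calling `find_nine_digit_numbers(line)` for each input line and printing the returned strings would stand in for the `number_search` function from the first post. Whether this is actually faster than the regex version would have to be measured.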
#3
brad wrote:
> // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
> /usr/local/lib/libboost_regex-gcc41-mt-s.a
> // g++ numbers.cpp -o numbers.exe
> -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

For starters, you could try adding some optimization flags, such as -O3 and -march=<your architecture> (e.g. -march=pentium4).

(No, I don't know if that will make the regexp matching faster, but it doesn't hurt to try.)
#4
On Sun, 08 Jun 2008 12:32:30 -0400, brad <byte8bits@gmail.com> wrote:
> I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

Try PCRE.

--
Roland Pibinger
"The best software is simple, elegant, and full of drama" - Grady Booch
#5
brad wrote:
> Still learning C++. I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

It's not necessarily slower, but it most probably is.

This caught my attention, so I did some tests. Your code mainly messes around with initialization inside the function; this has nothing to do with Boost regex.

I modified your code to do the following:
- slurp (read into a buffer) a >120MB text file (actually, it's the Nietzsche full text, copied 8 times ;-)
- find all "free" numbers >= 10 (that have at least 2 digits and word boundaries on the left and right sides)
- show the total count of these numbers
- do the same in Perl.

The results (multicore results are "single-threaded"):

[Windows XP-32, Athlon-64/3200+,@2290MHz]
- Visual Studio 2008 + Boost 1.35.0          9.3 sec
- Perl 5.10 (Active-)                       10.4 sec

[Linux 2.6.23, Pentium4,@2660MHz]
- gcc 4.3, -O2, Boost 1.33.1                13.2 sec
- Perl 5.8.8                                 8.2 sec

[Linux 2.6.23, Core2/Q6600,@3240MHz]
- gcc 4.3, -O2, Boost 1.33.1                 6.3 sec
- Perl 5.8.8 (i586, use64bitint=undef)       3.2 sec

[Linux 2.6.24, Core2/Q9300,@3338MHz]
- gcc 4.3, -O2, Boost 1.34.1                'std::runtime_error' (??)
- Perl 5.10 (i586, use64bitint=undef)       10.4 sec

The latter system is not installed completely (it's a test with the SuSE 11 Release Candidate), so the results may get better soon there ;-)

Code, C++:
==>
    #include <boost/regex.hpp>
    #include <fstream>
    #include <iostream>

    int number_count(const char*block, size_t len)
    {
        boost::match_flag_type flags = boost::match_default;
        boost::regex reg("\\b\\d{2,}\\b");
        boost::cmatch m;
        const char *from = block, *to = block+len;
        int n = 0;
        while( boost::regex_search(from, to, m, reg, flags) ) {
            from = m[0].second, ++n;
        }
        return n;
    }

    int main ()
    {
        std::ifstream in("nietzsche8.txt");       // this is a 112 MB file,
                                                  // it's 8 x the Nietzsche
        if(in) {                                  // fulltext in plain ASCII
            in.seekg(0, std::ios::end);           // get to EOF
            unsigned int len = in.tellg();        // read file pointer
            in.seekg(0, std::ios::beg);           // back to pos 0
            char *block = new char [len+1];       // don't be stingy
            in.read(block, len);                  // slurp the file

            int n = number_count(block, len);     // process data
            std::cout << "The text (" << len/1024 << "KB) has "
                      << n << " numbers >= 10!" << std::endl;

            delete [] block;                      // play fair
        }
        return 0;
    }
<==

Code, Perl:
==>
    open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
    my $block; do { local $/; $block = <$fh> }; close $fh;

    my $n; ++$n while $block =~ /\b\d{2,}\b/g;    # process data

    print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";
<==

Regards

Mirco
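One detail of the slurp step in the C++ program: the file is opened in text mode and its size is stored in an `unsigned int`. On Windows, text mode translates line endings, so the number of bytes actually read can be smaller than the size reported by `tellg()`. A more defensive variant might look like the sketch below. `read_whole_file` is a made-up name, not something from the post or from Boost, and the sketch has not been benchmarked against the original.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read an entire file into a std::string.
// Binary mode avoids line-ending translation, so the size reported by
// tellg() and the number of bytes read agree on every platform.
std::string read_whole_file(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    in.seekg(0, std::ios::end);                  // go to EOF
    const std::streamoff size = in.tellg();      // file size in bytes
    if (size < 0)
        throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);                  // back to position 0

    std::string data(static_cast<std::string::size_type>(size), '\0');
    if (!data.empty()) {
        in.read(&data[0], static_cast<std::streamsize>(size));
        // Keep only what was really read, in case the read came up short.
        data.resize(static_cast<std::string::size_type>(in.gcount()));
    }
    return data;
}
```

The returned string owns its buffer, so no `delete []` is needed, and `data.data()` together with `data.size()` can be passed to a function such as `number_count` above.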
#6
On 8 Jun., 18:32, brad <byte8b...@gmail.com> wrote:
> Still learning C++. I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?
>
> #include <boost/regex.hpp>
> #include <iostream>
>
> // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
> /usr/local/lib/libboost_regex-gcc41-mt-s.a
> // g++ numbers.cpp -o numbers.exe
> -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib
>
> void number_search(const std::string& portion)
> {
>
>     static const boost::regex Numbers("\\b\\d{9}\\b");
>     static const boost::regex& rNumbers = Numbers;
>     boost::smatch matches;
>
>     std::string::const_iterator Start = portion.begin();
>     std::string::const_iterator End = portion.end();
>
>     while (boost::regex_search(Start, End, matches, rNumbers))
>     {
>         std::cout << matches.str() << std::endl;
>         Start = matches[0].second;
>     }
> }
>
> int main ()
> {
>     std::string portion;
>     while (std::getline(std::cin, portion))
>     {
>         number_search(portion);
>     }
>     return 0;
> }

As others have pointed out, there are probably two factors here:
- you might not be optimising your code. This can easily cause a slowdown by a factor of 5-10.
- you might be measuring other parts of the library. I/O is the obvious candidate, and if you are using Microsoft's newer C++ compilers you might also be caught by the secure STL code that is only disabled when you add a special define to your build.

I would not expect this kind of code to be fast compared to e.g. Perl. Perl is sort of built with regex in mind, and that part probably is heavily optimised - maybe even written (partly) in assembly.

/Peter
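One way to check the second factor is to time only the matching, with the input already in memory and nothing printed inside the timed region. The sketch below is an assumption-laden illustration, not code from the thread: it uses `std::regex` from C++11 as a stand-in for `boost::regex` so that it builds with the standard library alone, and `count_matches` and `time_matching_seconds` are made-up names. The search loop has the same shape as the one in the quoted program.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: count non-overlapping matches of re in text,
// restarting each search where the previous match ended.
std::size_t count_matches(const std::string& text, const std::regex& re)
{
    std::size_t n = 0;
    std::smatch m;
    std::string::const_iterator from = text.begin();
    const std::string::const_iterator to = text.end();
    while (std::regex_search(from, to, m, re)) {
        from = m[0].second;
        ++n;
    }
    return n;
}

// Hypothetical helper: run count_matches once and return the elapsed
// wall-clock time in seconds. The regex is compiled by the caller and
// no I/O happens between the two clock readings.
double time_matching_seconds(const std::string& text, const std::regex& re,
                             std::size_t& count_out)
{
    const std::chrono::steady_clock::time_point t0 =
        std::chrono::steady_clock::now();
    count_out = count_matches(text, re);
    const std::chrono::steady_clock::time_point t1 =
        std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Comparing this figure with the total run time of the program gives a rough idea of how much of the time goes into reading input and writing output, for example the flush that each `std::endl` performs, rather than into the regex engine.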
#7
On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch <peter.koch.larsen@gmail.com> wrote:
> Perl is sort of built with regex in mind, and that part probably is
> heavily optimised - maybe even written (partly) in assembly.

Perl regex apparently is much slower than Tcl.
#8
Razii wrote:
> On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch
> <peter.koch.larsen@gmail.com> wrote:
>> Perl is sort of built with regex in mind, and that part probably is
>> heavily optimised - maybe even written (partly) in assembly.
>
> Perl regex apparently is much slower than Tcl.

This is like saying a rocket is much faster than an airplane. It is true sometimes but means nothing.

From my own experience, P5-REs are much more versatile compared to Tcl-REs (P5-REs are not 'regular' anymore), and in the hands of an experienced programmer this difference (which might be notable sometimes if many alternations are involved) approaches zero.

For example, there used to be an algorithm-oriented language implementation comparison (http://shootout.alioth.debian.org) where you may find all sorts of results. In a reverse-DNA dump test (http://shootout.alioth.debian.org/gp...vcomp&lang=all), Perl completes in 2 seconds, Tcl in 11 seconds. In another regex-heavy test (http://shootout.alioth.debian.org/gp...exdna&lang=all), Tcl runs in 3.3 seconds, whereas the first (allowed) Perl implementation comes in at 12 seconds. But, using a more Perl-like approach (not allowed in this contest), the Perl program (Perl #3, Perl #6 on the bottom) will complete in 1.2 seconds.

Regards

Mirco