External
Since: Sep 14, 2004 Posts: 1625
(Msg. 1) Posted: Fri Aug 20, 2004 9:23 pm
Post subject: Writing an indexed search engine
Archived from groups: alt>www>webmaster

After due nagging, I decided that as this is a field I know about (or
so I thought <g>) I really ought to try and put it into practice.
The existing scenario is as follows:
User enters a search phrase (single word or actual phrase) and this is
searched for by brute force within 1500 files using a simple system of
(roughly):
    open file
        read line
            does line contain search phrase (case insensitive)?
            if so, output this article
            if we've done 10 articles, finish; otherwise
        carry on with next line
    next file
Surprisingly, this is actually quite fast on a Linux server, less than
2 seconds to return results for a found word or phrase.
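As a rough illustration of the loop above, here is a minimal sketch in Python (the poster's own script is in Perl and was never shown in the thread). It assumes each article occupies a single line of its file, which is how the description reads; the name `brute_force_search` and the `limit` parameter are invented for the example.

```python
def brute_force_search(paths, phrase, limit=10):
    """Scan every file line by line and collect up to `limit` matching lines.

    Assumption: one article per line, matched case-insensitively.
    """
    needle = phrase.lower()
    hits = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if needle in line.lower():
                    hits.append((path, line.rstrip("\n")))
                    if len(hits) >= limit:
                        return hits  # ten articles found: finish early
    return hits
```

Because the function returns as soon as `limit` articles are found, a common word that appears early in the first files costs very little, while a word that appears nowhere forces a read of every file.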
However, far more efficient (in theory, and I'm not arguing) to
produce an index:
<WORD><delimiter><FILE NAME>
such as:
CAT#B1.HTM
DOG#B2.HTM
for every word contained within articles within the 1500 pages, except
that:
1) Stop words (a, the, an, and, of.... and the like) are omitted
2) Only one entry in the index is made for each word/file name
combination
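A minimal Python sketch of building such an index, under some assumptions: documents are passed in as a mapping of file name to text, words are runs of letters and digits, and the stop-word set is only the handful named above. `build_index` and `STOP_WORDS` are illustrative names, not anything from the poster's script.

```python
import re

# Illustrative subset only; a real stop-word list would be longer.
STOP_WORDS = {"a", "the", "an", "and", "of"}

def build_index(documents, delimiter="#"):
    """Return sorted 'WORD#FILE' lines for a mapping of file name -> text."""
    entries = set()  # a set keeps one entry per word/file combination
    for file_name, text in documents.items():
        for word in re.findall(r"[A-Za-z0-9]+", text):
            if word.lower() not in STOP_WORDS:
                entries.add((word.upper(), file_name))
    # Sorting by word is what later allows a scan to stop early.
    return [f"{word}{delimiter}{file_name}" for word, file_name in sorted(entries)]
```

For example, `build_index({"B1.HTM": "The cat and the cat", "B2.HTM": "A dog of a cat"})` yields `CAT#B1.HTM`, `CAT#B2.HTM` and `DOG#B2.HTM`.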
Then, rather than using brute force to search through every one of
1500 files, one scans the index for the search word or first
(non-stop) word in the search phrase until either:
a) a match is found or
b) the index entry word is alphanumerically higher, i.e. we've passed
the point in the index where the search word would be if it existed,
so we can abort.
When a match is found, retrieve the name of the file containing the
search word, and do a brute force search on that file as before. Then
continue down the index until we reach the end or find an
alphanumerically higher entry (eg we pass all the entries).
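The scan-and-abort logic described above could look something like this in Python. It is a sketch that assumes the index is sorted by word and uses `#` as the delimiter, as in the CAT/DOG example; `files_for_word` is a made-up name.

```python
def files_for_word(index_lines, word, delimiter="#"):
    """Walk a sorted index and return the file names listed for `word`."""
    target = word.upper()
    files = []
    for line in index_lines:
        entry_word, _, file_name = line.rstrip("\n").partition(delimiter)
        if entry_word == target:
            files.append(file_name)
        elif entry_word > target:
            break  # sorted index: we are past where the word would be
    return files
```

Each returned file would then be handed to the brute-force search, so only files known to contain the word are opened. Note that the scan is still linear in the index: a word near the end of the alphabet costs far more index reads than one near the start.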
Works fine on the Windows test machine (that surprises you doesn't
it!) even in this rough form, it reduces the search time by about
5/8ths - eg tests take 5 seconds rather than eight to finish.
So why is it that it fails on the Unix machine? By fail, I mean the
program operates correctly, but is so slow that searches using the
index are far slower than using the old brute force method.
It's friday night, I'm off to get pissed.
Any sad gits with a passion for programming might like to play with
this while I'm gone...
Oh, and yes I have considered producing an array of filenames which
satisfy ALL non-stop words in the search phrase, and then just
searching them, thereby searching even less files.
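That last idea amounts to intersecting the file lists of each non-stop word. A small Python sketch, assuming the index has already been loaded into a dictionary of word to set of file names (`postings`, a hypothetical structure) and using the same short illustrative stop-word list:

```python
def candidate_files(postings, phrase,
                    stop_words=frozenset({"a", "the", "an", "and", "of"})):
    """Return the files that contain every non-stop word of `phrase`."""
    words = [w.upper() for w in phrase.split() if w.lower() not in stop_words]
    if not words:
        return set()
    result = set(postings.get(words[0], set()))
    for word in words[1:]:
        result &= postings.get(word, set())  # keep files common to all words
    return result
```

Only the surviving files need the brute-force phrase check, since containing all the words does not guarantee they appear together as a phrase.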
Matt

External
Since: Oct 09, 2003 Posts: 26
(Msg. 2) Posted: Fri Aug 20, 2004 9:23 pm
Post subject: Re: Writing an indexed search engine

Matt Probert wrote:
> However, far more efficient (in theory, and I'm not arguing) to
> produce an index:
>
> <WORD><delimiter><FILE NAME>
In my engine I've found this to cause a huge file.
I'm in the process of converting the <FILE NAME>
list into a list of integers that index into an array of all filenames.
> such as:
> CAT#B1.HTM
> DOG#B2.HTM
Which would, in my system, now be:
CAT#5,25,38,99,505
DOG#88,509,1284
with the numbers being an index into an array of filenames
(which you would, of course, read from a file). Should reduce
the size of the index files a lot.
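A sketch of that layout in Python rather than Perl, purely to illustrate the idea; it is not the poster's code. It assumes the postings start out as a dictionary of word to file names, and the names `compact_index` and `lookup` are invented here.

```python
def compact_index(postings):
    """Turn {WORD: [file names]} into a file-name array plus 'WORD#n,n,n' lines."""
    file_names = sorted({name for names in postings.values() for name in names})
    number = {name: i for i, name in enumerate(file_names)}
    lines = []
    for word in sorted(postings):
        numbers = sorted(number[name] for name in postings[word])
        lines.append(word + "#" + ",".join(str(n) for n in numbers))
    return file_names, lines

def lookup(file_names, lines, word):
    """Resolve a word back to file names via the integer list."""
    for line in lines:
        entry_word, _, numbers = line.partition("#")
        if entry_word == word.upper():
            return [file_names[int(n)] for n in numbers.split(",")]
    return []
```

Each word appears once in the index instead of once per file, which is where most of the size saving comes from; the integers help further when file names are long.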
> Works fine on the Windows test machine (that surprises you doesn't
> it!) even in this rough form, it reduces the search time by about
> 5/8ths - eg tests take 5 seconds rather than eight to finish.
>
> So why is it that it fails on the Unix machine? By fail, I mean the
> program operates correctly, but is so slow that searches using the
> index are far slower than using the old brute force method.
What language? Mine is in Perl, and runs like a bat out of hell
on a 'nix box. Could be the language you're using isn't implemented
as well?

External
Since: Jun 30, 2004 Posts: 148
(Msg. 3) Posted: Fri Aug 20, 2004 9:23 pm
Post subject: Re: Writing an indexed search engine

On 2004-08-20, Matt Probert <comments DeleteThis @probertencyclopaedia.com> wrote:
> Oh, and yes I have considered producing an array of filenames which
> satisfy ALL non-stop words in the search phrase, and then just
> searching them, thereby searching even less files.
>
> Matt
Just trying to be as unhelpful as possible, there's a Perl program called
ksearch that's very very good for sites about your size. Can store data in
text or Berkeley DB files. They even give you a small, working form to query
the data and, if memory serves, it even highlights the key words in the
results. However, I'm not sure it does phrase searches.
You can set up your own stop word file or let ksearch do it for you by
declaring words that appear x numbers of times in documents as junk words.
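The frequency-based approach to junk words can be sketched in a few lines of Python. This is only an illustration of the general idea, not ksearch's actual implementation (which is not shown in this thread); `frequent_words` and `threshold` are hypothetical names.

```python
import re
from collections import Counter

def frequent_words(documents, threshold):
    """Return words occurring at least `threshold` times across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(w.lower() for w in re.findall(r"[A-Za-z0-9]+", text))
    return {word for word, count in counts.items() if count >= threshold}
```

The resulting set could be used in place of a hand-written stop-word list when building the index.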
kscripts.com I think.
ken

External
Since: May 08, 2004 Posts: 952
(Msg. 4) Posted: Fri Aug 20, 2004 9:23 pm
Post subject: Re: Writing an indexed search engine

External
Since: Sep 14, 2004 Posts: 1625
(Msg. 5) Posted: Fri Aug 20, 2004 10:33 pm
Post subject: Re: Writing an indexed search engine

On Fri, 20 Aug 2004 18:23:57 GMT comments DeleteThis @probertencyclopaedia.com
(Matt Probert) broke off from drinking a cup of tea at The Probert
Encyclopaedia to write:
>Works fine on the Windows test machine (that surprises you doesn't
>it!) even in this rough form, it reduces the search time by about
>5/8ths - eg tests take 5 seconds rather than eight to finish.
>
>So why is it that it fails on the Unix machine? By fail, I mean the
>program operates correctly, but is so slow that searches using the
>index are far slower than using the old brute force method.
Okay, found the answer to that. I'm a bit rusty, having been out of
development for a while.
Can YOU guess what was wrong? <g>
Matt

External
Since: Feb 13, 2004 Posts: 1055
(Msg. 6) Posted: Fri Aug 20, 2004 10:33 pm
Post subject: Re: Writing an indexed search engine

Matt Probert wrote:
> On Fri, 20 Aug 2004 18:23:57 GMT
> comments.DeleteThis@probertencyclopaedia.com
> (Matt Probert) broke off from drinking a cup of tea at The
> Probert Encyclopaedia to write:
>
>>Works fine on the Windows test machine (that surprises you
>>doesn't it!) even in this rough form, it reduces the search
>>time by about 5/8ths - eg tests take 5 seconds rather than
>>eight to finish.
>>
>>So why is it that it fails on the Unix machine? By fail, I
>>mean the program operates correctly, but is so slow that
>>searches using the index are far slower than using the old
>>brute force method.
>
> Okay, found the answer to that. I'm a bit rusty, having
> been out of development for a while.
>
> Can YOU guess what was wrong? <g>
I can only see one mistake in what you wrote, which has (I
think) nothing to do with your problem though:
If something gets reduced by 5/8ths, what remains is not 5
instead of 8, but 3 instead of 8.
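The arithmetic can be checked in a couple of lines of Python, using the 8 and 5 second figures from the quoted post (variable names are just for illustration):

```python
from fractions import Fraction

before, after = 8, 5  # seconds, as quoted above

# Going from 8 s to 5 s is a reduction of 3/8, not 5/8.
actual_reduction = Fraction(before - after, before)

# A genuine 5/8 reduction of 8 s would leave 3 s.
left_after_five_eighths = before * (1 - Fraction(5, 8))
```

Here `actual_reduction` is 3/8 and `left_after_five_eighths` is 3.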
--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Rod Stewart - Maggie May

External
Since: Sep 14, 2004 Posts: 1625
(Msg. 7) Posted: Fri Aug 20, 2004 10:38 pm
Post subject: Re: Writing an indexed search engine

On Fri, 20 Aug 2004 18:17:06 GMT m <NOXwebmasterx.DeleteThis@xmbstevensx.com>
broke off from drinking a cup of tea at mbstevens.com to write:
>Matt Probert wrote:
>
>> However, far more efficient (in theory, and I'm not arguing) to
>> produce an index:
>>
>> <WORD><delimiter><FILE NAME>
>
>In my engine I've found this to cause a huge file.
>I'm in the process of converting the <FILE NAME>
>list into a list of integers that index into an array of all filenames.
>
>> such as:
>> CAT#B1.HTM
>> DOG#B2.HTM
>
>Which would, in my system, now be:
>CAT#5,25,38,99,505
>DOG#88,509,1284
>
>with the numbers being an index into an array of filenames
>(which you would, of course, read from a file). Should reduce
>the size of the index files a lot.
Should have mentioned, with 1500+ files we have them all with short
names, such as A1.HTM, A2.HTM etc.
I appreciate what you're saying, but it's not applicable in this
instance.
>
>> Works fine on the Windows test machine (that surprises you doesn't
>> it!) even in this rough form, it reduces the search time by about
>> 5/8ths - eg tests take 5 seconds rather than eight to finish.
>>
>> So why is it that it fails on the Unix machine? By fail, I mean the
>> program operates correctly, but is so slow that searches using the
>> index are far slower than using the old brute force method.
>
>What language? Mine is in Perl, and runs like a bat out of hell
>on a 'nix box. Could be the language you're using isn't implemented
>as well?
Oh yes, it's Perl.
The 1500+ files occupy about 40 MB.
When the CPU is not overloaded (eg NOT > 6% utilization <g>) it takes
about 10 seconds from remote request submission to remote request
satisfaction, irrespective of whether or not the search was
successful.
Mind you, when the CPU is overloaded, it's a bit slow! <BG>
Can you guess what I had wrong? I feel such a fool!
Matt

External
Since: Oct 09, 2003 Posts: 26
(Msg. 8) Posted: Fri Aug 20, 2004 10:38 pm
Post subject: Re: Writing an indexed search engine

Matt Probert wrote:
> When the CPU is not overloaded (eg NOT > 6% utilization <g>) it takes
> about 10 seconds from remote request submission to remote request
> satisfaction, irrespective of whether or not the search was
> successful.
>
> Mind you, when the CPU is overloaded, it's a bit slow! <BG>
>
> Can you guess what I had wrong? I feel such a fool!
Hmm. Snail-gremlins on the telephone lines? We sure have
lots of them where I am.

External
Since: Jul 14, 2003 Posts: 1188
(Msg. 9) Posted: Fri Aug 20, 2004 10:38 pm
Post subject: Re: Writing an indexed search engine

Matt Probert wrote:
>
>
> Oh yes, it's Perl.
>
> The 1500+ files occupy about 40 MB.
>
> When the CPU is not overloaded (eg NOT > 6% utilization <g>) it takes
> about 10 seconds from remote request submission to remote request
> satisfaction, irrespective of whether or not the search was
> successful.
>
> Mind you, when the CPU is overloaded, it's a bit slow! <BG>
>
> Can you guess what I had wrong? I feel such a fool!
>
> Matt
Matt,
Are you doing a sequential search on the index file, or a binary
search? If it's sequential, that could be a large part of your
slowdown.
You can do a binary search on variable length keys (like you have), but
it's more difficult.
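One common way to do this is to binary-search over byte offsets and resynchronise to the next line start after each seek. The Python sketch below is one possible approach, not anything posted in the thread; it assumes an index of `WORD#FILE` lines sorted by word in byte order, opened in binary mode, and `lookup_sorted_index` is an invented name.

```python
import io

def lookup_sorted_index(handle, word, delimiter=b"#"):
    """Binary search a sorted 'WORD#FILE' index with variable-length lines."""
    target = word.upper().encode("ascii")

    def line_start_at_or_after(pos):
        # Step back one byte and read to the end of that line, which lands
        # on the first line start at or after `pos`.
        if pos == 0:
            return 0
        handle.seek(pos - 1)
        handle.readline()
        return handle.tell()

    def key_at(start):
        handle.seek(start)
        line = handle.readline()
        return line.split(delimiter, 1)[0] if line else None  # None at EOF

    handle.seek(0, io.SEEK_END)
    low, high = 0, handle.tell()
    while low < high:  # find the smallest offset whose next line has key >= target
        mid = (low + high) // 2
        key = key_at(line_start_at_or_after(mid))
        if key is None or key >= target:
            high = mid
        else:
            low = mid + 1

    # Read forward from the first candidate line while the key still matches.
    handle.seek(line_start_at_or_after(low))
    files = []
    for line in handle:
        key, _, file_name = line.rstrip(b"\r\n").partition(delimiter)
        if key != target:
            break
        files.append(file_name.decode("ascii"))
    return files
```

The extra work compared with fixed-length records is the resynchronisation step: a seek can land in the middle of a line, so each probe has to find the next line boundary before comparing keys.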
Alternatively - have you thought of doing the search in C? It should
run significantly faster than Perl (great though Perl is).
--
To reply, delete the 'x' from my email
Jerry Stuckle,
JDS Computer Training Corp.
jstucklex DeleteThis @attglobal.net
Member of Independent Computer Consultants Association - www.icca.org

External
Since: Sep 14, 2004 Posts: 1625
(Msg. 10) Posted: Fri Aug 20, 2004 10:42 pm
Post subject: Re: Writing an indexed search engine

On Fri, 20 Aug 2004 18:22:51 -0000 Kenneth <Kenneth.TakeThisOut@nowhere.special>
broke off from drinking a cup of tea at None to write:
>On 2004-08-20, Matt Probert <comments.TakeThisOut@probertencyclopaedia.com> wrote:
>> Oh, and yes I have considered producing an array of filenames which
>> satisfy ALL non-stop words in the search phrase, and then just
>> searching them, thereby searching even less files.
>>
>> Matt
>
>Just trying to be as unhelpful as possible, there's a Perl program called
>ksearch that's very very good for sites about your size. Can store data in
>text or Berkeley DB files. They even give you a small, working form to query
>the data and, if memory serves, it even highlights the key words in the
>results. However, I'm not sure it does phrase searches.
>
>You can set up your own stop word file or let ksearch do it for you by
>declaring words that appear x numbers of times in documents as junk words.
>
>kscripts.com I think.
Yes, and a very traditional search script it is too!
We deliberately operate a different system.
Have you tried our search system? It returns whole articles, not
disjointed bits. We think it's a much more user-friendly system for an
encyclopaedia.
But thanks anyway for being unhelpful <bg>
Matt

External
Since: Jun 30, 2004 Posts: 148
(Msg. 11) Posted: Fri Aug 20, 2004 10:42 pm
Post subject: Re: Writing an indexed search engine

On 2004-08-20, Matt Probert <comments.RemoveThis@probertencyclopaedia.com> wrote:
> On Fri, 20 Aug 2004 18:22:51 -0000 Kenneth <Kenneth.RemoveThis@nowhere.special>
> broke off from drinking a cup of tea at None to write:
>
>>On 2004-08-20, Matt Probert <comments.RemoveThis@probertencyclopaedia.com> wrote:
>>> Oh, and yes I have considered producing an array of filenames which
>>> satisfy ALL non-stop words in the search phrase, and then just
>>> searching them, thereby searching even less files.
>>>
>>> Matt
>>
>>Just trying to be as unhelpful as possible, there's a Perl program called
>>ksearch that's very very good for sites about your size. Can store data in
>>text or Berkeley DB files. They even give you a small, working form to query
>>the data and, if memory serves, it even highlights the key words in the
>>results. However, I'm not sure it does phrase searches.
>>
>>You can set up your own stop word file or let ksearch do it for you by
>>declaring words that appear x numbers of times in documents as junk words.
>>
>>kscripts.com I think.
>
> Yes, and a very traditional search script it is too!
>
> We deliberately operate a different system.
>
> Have you tried our search system? It returns whole articles, not
> disjointed bits. We think it's a much more user-friendly system for an
> encyclopaedia.
>
> But thanks anyway for being unhelpful <bg>
>
> Matt
>
As always, more than happy to contribute my share of useless information to
the Usenet archives.
ken

External
Since: Sep 14, 2004 Posts: 1119
(Msg. 12) Posted: Fri Aug 20, 2004 10:52 pm
Post subject: Re: Writing an indexed search engine

Spake Matt Probert unto thee:
> Can YOU guess what was wrong? <g>
You'd set the "run_slow = true;" option in the config? No idea...
enlighten us please?
--
Dylan Parry
http://webpageworkshop.co.uk - FREE Web tutorials and references
'I am a Bear of Very Little Brain, and long words bother me.' -- A A Milne

External
Since: Apr 25, 2004 Posts: 91
(Msg. 13) Posted: Fri Aug 20, 2004 11:01 pm
Post subject: Re: Writing an indexed search engine

*Dylan Parry* wrote:
> Spake Matt Probert unto thee:
>
>> Can YOU guess what was wrong? <g>
>
> You'd set the "run_slow = true;" option in the config? No idea...
> enlighten us please?
Haven't been paying much attention to the thread so far, but indexing
the index file could produce some interesting delays.
--
Andrew Urquhart
- Contact me: http://andrewu.co.uk/contact/
- 'Staccato signals of constant information
A loose affiliation of millionaires and billionaires' - Paul Simon

External
Since: Oct 09, 2003 Posts: 26
(Msg. 14) Posted: Fri Aug 20, 2004 11:01 pm
Post subject: Re: Writing an indexed search engine

Andrew Urquhart wrote:
> *Dylan Parry* wrote:
>> Spake Matt Probert unto thee:
>>
>>> Can YOU guess what was wrong? <g>
>>
>> You'd set the "run_slow = true;" option in the config? No idea...
>> enlighten us please?
>
> Haven't been paying much attention to the thread so far, but indexing
> the index file could produce some interesting delays.
Would it ever. I generate mine using a copy of the site on my local
machine, then upload an updated copy occasionally. The only part
of the system on the actual site is the stuff that reads the index
files.

External
Since: Sep 14, 2004 Posts: 1625
(Msg. 15) Posted: Sat Aug 21, 2004 10:22 am
Post subject: Re: Writing an indexed search engine

On Fri, 20 Aug 2004 19:52:52 +0100 Dylan Parry
<usenet.RemoveThis@dylanparry.com> broke off from drinking a cup of tea at to
write:
>Spake Matt Probert unto thee:
>
>> Can YOU guess what was wrong? <g>
>
>You'd set the "run_slow = true;" option in the config? No idea...
>enlighten us please?
>
>--
I'd forgotten that in a linear search, items right at the start of
the file are found very quickly, whereas an indexed or binary search
can take longer for those same items, though its execution time is
more consistent from one search to the next.
My test data all came from the start of the file, hence the linear
search was so fast!
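The effect is easy to reproduce with a toy comparison counter. The Python sketch below uses 1500 synthetic sorted words (made up for the illustration, not the real data) and counts how many entries each strategy examines.

```python
def linear_steps(sorted_words, target):
    """Entries examined by a front-to-back scan before it can stop."""
    for steps, word in enumerate(sorted_words, start=1):
        if word >= target:
            return steps
    return len(sorted_words)

def binary_steps(sorted_words, target):
    """Entries examined by a binary search for the same target."""
    low, high, steps = 0, len(sorted_words), 0
    while low < high:
        mid = (low + high) // 2
        steps += 1
        if sorted_words[mid] < target:
            low = mid + 1
        else:
            high = mid
    return steps

words = [f"W{n:04d}" for n in range(1500)]  # synthetic, already sorted
first, last = words[0], words[-1]
```

With this data `linear_steps(words, first)` is 1 and `linear_steps(words, last)` is 1500, while `binary_steps` examines 10 or 11 entries wherever the target sits. Test queries drawn only from the start of the data therefore flatter the linear scan.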
What a wally! <g>
Matt