View Full Version : Anybody ever designed a search engine?


Simplex3
01-18-2006, 03:11 PM
I'm looking for some insight. I need to design a moderately complex search engine and I'm looking to learn from someone else's experience before I eat it myself. Basically I'm trying to index text from a database; none of this is static HTML anywhere. Here's the scenario:

The app is hosted by regions, cross-region searches won't happen.
A customer may belong to multiple regions.
A customer may have multiple offerings.

The search engine needs to search a few text fields for each offering that is available in a given region based on keyword relevance and return a listing of the ids, in order of relevance.

Any pointers are greatly appreciated.

Ari Chi3fs
01-18-2006, 03:13 PM
would www.atomz.com be a solution that you could utilize?

Simplex3
01-18-2006, 03:17 PM
would www.atomz.com be a solution that you could utilize?
Actually, that's way more than I need here, but it's very interesting for another project I have going...

Thanks for the link.

dirk digler
01-18-2006, 03:19 PM
Doesn't Google have a way to do local searches? I have seen a lot of websites do local Google searches.

Simplex3
01-18-2006, 03:21 PM
Doesn't Google have a way to do local searches? I have seen a lot of websites do local Google searches.
Yes. I did fail to mention one thing, though. My customer doesn't want their entire database exposed to Google, Yahoo, etc. This has to be internal to the application.

SLAG
01-18-2006, 03:26 PM
http://www.google.com/enterprise/

dirk digler
01-18-2006, 03:32 PM
http://www.google.com/enterprise/

Damn that shit is expensive

kepp
01-18-2006, 03:37 PM
I've worked on a couple different search engines and designed/implemented one from the ground up. They can pretty much get as complicated as you want.

* What volume of traffic are we talking about? If the volume is light, you can have a very simple implementation that will work fine. For instance, my current employer needed me to design an engine that would handle 100 million requests/day, so that kind of did away with the lightweight options.
* Is there an existing infrastructure that it has to conform to? Windows or Linux? IIS or Apache? etc...
* Will it be type-in queries or some sort of category or index-based search?
* I take it that this will be a web page/site on an intranet?
* What kind of data are you talking about? I know you said "text", but what industry?

Pointers? Hmmm...

* You can't spend too much time designing your database. You can have super-fast code, but if your relations and indexes are bad, it won't do squat.
* The volume also comes into play when choosing your database (if you're not tied to a pre-existing installation). For example, from my experience, MySQL becomes unstable when a table reaches around 60 million rows. Oracle doesn't seem to have that problem. However, MySQL is way faster than Oracle, and Oracle is super-pricey. A lot of tradeoffs here.
* If you can, it's easier to use the "free" route: Linux, MySQL, PHP/Perl. Although this won't be as fast as other implementations, it is easy to implement and maintain.

I'd kind of need more info for more specific help.

htismaqe
01-18-2006, 03:43 PM
I haven't ever designed a search engine, but I did work on a program that could search headers on Usenet and download just .GIF and .JPG attachments automatically...

:D

Simplex3
01-18-2006, 03:59 PM
...
I'd kind of need more info for more specific help.
Wow.

Here's some more info.

It's *nix based, FreeBSD 6 to be exact. It's going to be MySQL for now, if it breaks the 60M row barrier buying Oracle won't be an issue. The web-app will be running apache2/php5 but the search engine can update on a schedule rather than with every insert, so Perl is fine for that task, too (only two *nix languages I'm really comfortable with).

I'm very familiar with database design, tuning indexes, etc. I should be fine there once I figure out how to design the engine itself.

* The traffic volume will be fewer than 1000 queries per minute at all times. Likely fewer than 100 per minute.

* These are all keyword based queries against free-form text. Nothing pre-set.

* The search will originate from a web page, but will go through a search object. The entire app is OOP.

* It won't be industry specific, it's pretty much whatever they want to put in there. On the plus side, I'm only required to index and search four character fields across two tables. I'm not comfortable just using an SQL query per search because I have to join three tables to get the results by region.

I need them indexed by region, so I'm guessing I'll have a set of index tables for each region; that will cut down on total rows per table. There will never be a cross-region search.

The only other monkey-wrench I have is that the content can be scheduled. Each item has a start and end date that must be adhered to.
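Putting the requirements above together — keyword matching, region scoping, relevance ordering, and the start/end scheduling window — the shape of the engine might look roughly like this. This is only a minimal sketch using Python's sqlite3 as a stand-in for MySQL; every table and column name here (`offering`, `keyword_index`, etc.) is made up for illustration, not the app's actual schema, and relevance is crudely approximated as "number of query keywords matched."

```python
# Sketch of a keyword index with region + date-window filtering.
# sqlite3 stands in for MySQL; all names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offering (
    offering_id INTEGER PRIMARY KEY,
    region_id   INTEGER NOT NULL,
    start_time  TEXT NOT NULL,
    end_time    TEXT NOT NULL
);
CREATE TABLE keyword_index (
    keyword     TEXT NOT NULL,
    offering_id INTEGER NOT NULL
);
CREATE INDEX idx_kw ON keyword_index (keyword);
""")

conn.executemany("INSERT INTO offering VALUES (?,?,?,?)", [
    (1, 10, "2006-01-01", "2006-12-31"),
    (2, 10, "2006-01-01", "2006-12-31"),
    (3, 20, "2006-01-01", "2006-12-31"),   # different region
])
conn.executemany("INSERT INTO keyword_index VALUES (?,?)", [
    ("widget", 1), ("blue", 1), ("widget", 2), ("widget", 3),
])

def search(region_id, keywords, now):
    # Rank by how many of the query keywords each offering matches,
    # restricted to one region and to the scheduled date window.
    qmarks = ",".join("?" * len(keywords))
    sql = f"""
        SELECT o.offering_id, COUNT(*) AS relevance
        FROM offering o
        JOIN keyword_index k ON k.offering_id = o.offering_id
        WHERE k.keyword IN ({qmarks})
          AND o.region_id = ?
          AND o.start_time <= ? AND o.end_time >= ?
        GROUP BY o.offering_id
        ORDER BY relevance DESC
    """
    return conn.execute(sql, (*keywords, region_id, now, now)).fetchall()

print(search(10, ["widget", "blue"], "2006-06-15"))  # → [(1, 2), (2, 1)]
```

The indexer that populates `keyword_index` can run on a schedule (cron + Perl or PHP) rather than on every insert, as described above.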

Simplex3
01-18-2006, 04:00 PM
I haven't ever designed a search engine, but I did work on a program that could search headers on Usenet and download just .GIF and .JPG attachments automatically...

:D
You were looking for pictures of people's dogs, right?

htismaqe
01-18-2006, 04:08 PM
You were looking for pictures of people's dogs, right?

We weren't looking for anything in particular. We were providing a "service". :D

If you were looking for dogs, they could be found in alt.sex.bestiality.*.

ferrarispider95
01-18-2006, 08:43 PM
just build a form out of php and query the database, mysql & php are super easy
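In its most minimal form, that suggestion is just a parameterized LIKE query against the text column. A rough sketch, with sqlite3 standing in for MySQL and made-up table/column names:

```python
# Minimal "form field -> database query" search, the simplest approach.
# sqlite3 stands in for MySQL; table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offering (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO offering VALUES (1, 'blue widget')")
conn.execute("INSERT INTO offering VALUES (2, 'red gadget')")

term = "widget"  # in the real app this would come from the web form
rows = conn.execute(
    "SELECT id FROM offering WHERE title LIKE ?", (f"%{term}%",)
).fetchall()
print(rows)  # → [(1,)]
```

This works fine at low volume, though a leading-wildcard LIKE can't use an index, which is why a dedicated keyword-index table scales better.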

kepp
01-19-2006, 07:56 AM
It's *nix based, FreeBSD 6 to be exact. It's going to be MySQL for now, if it breaks the 60M row barrier buying Oracle won't be an issue. The web-app will be running apache2/php5 but the search engine can update on a schedule rather than with every insert, so Perl is fine for that task, too (only two *nix languages I'm really comfortable with).

I'm very familiar with database design, tuning indexes, etc. I should be fine there once I figure out how to design the engine itself.

* The traffic volume will be fewer than 1000 queries per minute at all times. Likely fewer than 100 per minute.

* These are all keyword based queries against free-form text. Nothing pre-set.

* The search will originate from a web page, but will go through a search object. The entire app is OOP.

* It won't be industry specific, it's pretty much whatever they want to put in there. On the plus side, I'm only required to index and search four character fields across two tables. I'm not comfortable just using an SQL query per search because I have to join three tables to get the results by region.

I need them indexed by region, so I'm guessing I'll have a set of index tables for each region; that will cut down on total rows per table. There will never be a cross-region search.

The only other monkey-wrench I have is that the content can be scheduled. Each item has a start and end date that must be adhered to.

That setup will be more than enough to handle 100/min.

* I try to let the SQL handle most of the work, especially if you're using MySQL, because it's fast. I wouldn't be wary of joining multiple tables as long as the DB is tuned/indexed properly. My main query joins 6 tables and it does fine.
* You don't necessarily need a separate table for each region. Just have a region_code or region_id field in the table you use for the indexing. That will make it easier to maintain and will allow for cross-region searching JUST IN CASE someone changes their mind.
* Are you actually going to "index" the 4 character fields, or are you just going to use "WHERE field like 'abc%'" queries? If you're actually going to index them, you may want to have a separate table just for the indexes whose rows 'point' to corresponding rows in the other table(s).
* The date scheduling can actually be pretty easy. You'll have a "main" table where you keep the stuff that will be the object of your searches. Just add 'start_date' & 'end_date' fields and include the appropriate constraints in your query.
* Since your indexes will possibly point to rows in more than one table, sub-selects and/or unions will come in handy. I know Oracle supports them, but the last time I used MySQL, it didn't. You may want to check into that.
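The union point above — index rows pointing into more than one source table — can be sketched like this. sqlite3 stands in for MySQL here, and the two table names (`customer_text`, `offering_text`) are hypothetical, chosen only to echo the two-table setup described earlier in the thread:

```python
# One UNION pulls matches from two source tables into a single result,
# tagging each row with which table it came from.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_text (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE offering_text (id INTEGER PRIMARY KEY, body TEXT);
""")
conn.execute("INSERT INTO customer_text VALUES (1, 'acme widgets')")
conn.execute("INSERT INTO offering_text VALUES (7, 'blue widget kit')")

rows = conn.execute("""
    SELECT 'customer' AS src, id FROM customer_text
    WHERE body LIKE '%widget%'
    UNION ALL
    SELECT 'offering' AS src, id FROM offering_text
    WHERE body LIKE '%widget%'
""").fetchall()
print(rows)  # → [('customer', 1), ('offering', 7)]
```

UNION ALL skips the duplicate-elimination pass, which is usually what you want when the two halves can't overlap anyway.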

This is getting a little long so I'll make another post here in a while...

kepp
01-19-2006, 08:09 AM
Actually, after thinking a little about it, if you use a 'region_id' field in your main table, you wouldn't need to use sub-selects. You'd have something like this for your tables:

CREATE TABLE object_table
(
    object_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    region_id   INT UNSIGNED NOT NULL,
    text_field  VARCHAR(4) NOT NULL,
    start_time  DATETIME NOT NULL,
    end_time    DATETIME NOT NULL,
    INDEX idx1 (object_id, region_id, start_time, end_time)
) TYPE = InnoDB;

CREATE TABLE index_table
(
    indexed_text VARCHAR(4) NOT NULL,
    object_id    INT UNSIGNED NOT NULL,
    INDEX idx2 (indexed_text)
) TYPE = InnoDB;

...and a main query like this:

SELECT ot.object_id
FROM object_table ot, index_table it
WHERE (it.indexed_text LIKE ... OR it.indexed_text = ...)
AND ot.object_id = it.object_id
AND ot.start_time <= NOW()
AND ot.end_time >= NOW()
...

Then, if you need to add other constraints, like the status of an account or something, you just add the appropriate table to the FROM clause and another AND to the WHERE clause, and you're done.
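That last step — bolting on an extra constraint table — might look like this. Again sqlite3 stands in for MySQL, and the `account` table with its `status` column is purely hypothetical, invented to match the "status of an account" example:

```python
# Extending the main search query with one more table and one more AND:
# a hypothetical account table whose status gates the results.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object_table (
    object_id  INTEGER PRIMARY KEY,
    region_id  INTEGER NOT NULL,
    account_id INTEGER NOT NULL,
    start_time TEXT NOT NULL,
    end_time   TEXT NOT NULL
);
CREATE TABLE index_table (indexed_text TEXT, object_id INTEGER);
CREATE TABLE account (account_id INTEGER PRIMARY KEY, status TEXT);
""")
conn.execute("INSERT INTO account VALUES (1, 'active')")
conn.execute("INSERT INTO account VALUES (2, 'suspended')")
conn.executemany("INSERT INTO object_table VALUES (?,?,?,?,?)", [
    (1, 10, 1, "2006-01-01", "2006-12-31"),
    (2, 10, 2, "2006-01-01", "2006-12-31"),  # suspended account
])
conn.executemany("INSERT INTO index_table VALUES (?,?)",
                 [("abc", 1), ("abc", 2)])

rows = conn.execute("""
    SELECT ot.object_id
    FROM object_table ot, index_table it, account a
    WHERE it.indexed_text = 'abc'
    AND ot.object_id = it.object_id
    AND ot.start_time <= '2006-06-15'
    AND ot.end_time >= '2006-06-15'
    AND a.account_id = ot.account_id
    AND a.status = 'active'
""").fetchall()
print(rows)  # → [(1,)] -- the suspended account's object is filtered out
```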