Anybody ever designed a search engine?

Simplex3 · 01-18-2006, 04:11 PM

I'm looking for some insight. I need to design a moderately complex search engine and I'm looking to learn from someone else's experience before I eat it myself. Basically, I'm trying to index text from a database; none of this is static HTML anywhere. Here's the scenario:

The app is hosted by regions, cross-region searches won't happen.
A customer may belong to multiple regions.
A customer may have multiple offerings.

The search engine needs to search a few text fields for each offering that is available in a given region based on keyword relevance and return a listing of the ids, in order of relevance.

Any pointers are greatly appreciated.

teedubya · 01-18-2006, 04:13 PM

would www.atomz.com be a solution that you could utilize?

Simplex3 · 01-18-2006, 04:17 PM

Quote:

Originally Posted by Ali Chi3fs

would www.atomz.com be a solution that you could utilize?

Actually, that's way more than I need here but it's very interesting for another project I have going...

Thanks for the link.

**dirk digler** · 01-18-2006, 04:19 PM

Doesn't Google have something where you can do local searches? I have seen a lot of websites do local Google searches.

Simplex3 · 01-18-2006, 04:21 PM

Quote:

Originally Posted by dirk digler

Doesn't Google have something where you can do local searches? I have seen a lot of websites do local Google searches.

Yes. I did fail to mention one thing, though. My customer doesn't want their entire database exposed to google, yahoo, etc. This has to be internal to the application.

SLAG · 01-18-2006, 04:26 PM

http://www.google.com/enterprise/

**dirk digler** · 01-18-2006, 04:32 PM

Quote:

Originally Posted by SLAG02

http://www.google.com/enterprise/

Damn that shit is expensive

kepp · 01-18-2006, 04:37 PM

I've worked on a couple different search engines and designed/implemented one from the ground up. They can pretty much get as complicated as you want.

* What volume of traffic are we talking about? If the volume is light, you can have a very simple implementation that will work fine. For instance, my current employer needed me to design an engine that would handle 100 million requests/day, so that kind of did away with the lightweight options.
* Is there an existing infrastructure that it has to conform to? Windows or Linux? IIS or Apache? etc...
* Will it be type-in queries or some sort of category or index-based search?
* I take it that this will be a web page/site on an intranet?
* What kind of data are you talking about? I know you said "text", but what industry?

Pointers? Hmmm...

* You can't spend too much time designing your database. You can have super-fast code, but if your relations and indexes are bad, it won't do squat.
* The volume also comes into play when choosing your database (if you're not tied to a pre-existing installation). For example, from my experience, MySQL becomes unstable when a table reaches around 60 million rows. Oracle doesn't seem to have that problem. However, MySQL is way faster than Oracle, and Oracle is super-pricey. A lot of tradeoffs here.
* If you can, it's easier to use the "free" route: Linux, MySQL, PHP/Perl. Although this won't be as fast as other implementations, it is easy to implement and maintain.

I'd kind of need more info for more specific help.

htismaqe · 01-18-2006, 04:43 PM

I haven't ever designed a search engine, but I did work on a program that could search headers on Usenet and download just .GIF and .JPG attachments automatically...

Simplex3 · 01-18-2006, 04:59 PM

Quote:

Originally Posted by kepp

...
I'd kind of need more info for more specific help.

Wow.

Here's some more info.

It's *nix based, FreeBSD 6 to be exact. It's going to be MySQL for now; if it breaks the 60M row barrier, buying Oracle won't be an issue. The web app will be running apache2/php5, but the search engine can update on a schedule rather than with every insert, so Perl is fine for that task, too (those are the only two *nix languages I'm really comfortable with).

I'm very familiar with database design, tuning indexes, etc. I should be fine there once I figure out how to design the engine itself.

* The traffic volume will be fewer than 1000 queries per minute at all times. Likely fewer than 100 per minute.

* These are all keyword based queries against free-form text. Nothing pre-set.

* The search will originate from a web page, but will go through a search object. The entire app is OOP.

* It won't be industry specific, it's pretty much whatever they want to put in there. On the plus side, I'm only required to index and search 4 character fields across two tables. I'm not comfortable just using a sql query per search because I have to join three tables to get the results by region.

I need them indexed by region so I'm guessing I'll have a set of index tables for each region so that will cut down on total rows per table. There will never be a cross-region search.

The only other monkey-wrench I have is that the content can be scheduled. Each item has a start and end date that must be adhered to.

Simplex3 · 01-18-2006, 05:00 PM

Quote:

Originally Posted by htismaqe

I haven't ever designed a search engine, but I did work on a program that could search headers on Usenet and download just .GIF and .JPG attachments automatically...

You were looking for pictures of people's dogs, right?

htismaqe · 01-18-2006, 05:08 PM

Quote:

Originally Posted by Simplex3

You were looking for pictures of people's dogs, right?

We weren't looking for anything in particular. We were providing a "service".

If you were looking for dogs, they could be found in alt.sex.bestiality.*.

ferrarispider95 · 01-18-2006, 09:43 PM

Just build a form in PHP and query the database. MySQL & PHP are super easy.

kepp · 01-19-2006, 08:56 AM

Quote:

Originally Posted by Simplex3

It's *nix based, FreeBSD 6 to be exact. It's going to be MySQL for now; if it breaks the 60M row barrier, buying Oracle won't be an issue. The web app will be running apache2/php5, but the search engine can update on a schedule rather than with every insert, so Perl is fine for that task, too (those are the only two *nix languages I'm really comfortable with).

I'm very familiar with database design, tuning indexes, etc. I should be fine there once I figure out how to design the engine itself.

* The traffic volume will be fewer than 1000 queries per minute at all times. Likely fewer than 100 per minute.

* These are all keyword based queries against free-form text. Nothing pre-set.

* The search will originate from a web page, but will go through a search object. The entire app is OOP.

* It won't be industry specific, it's pretty much whatever they want to put in there. On the plus side, I'm only required to index and search 4 character fields across two tables. I'm not comfortable just using a sql query per search because I have to join three tables to get the results by region.

I need them indexed by region so I'm guessing I'll have a set of index tables for each region so that will cut down on total rows per table. There will never be a cross-region search.

The only other monkey-wrench I have is that the content can be scheduled. Each item has a start and end date that must be adhered to.

That setup will be more than enough to handle 100/min.

* I try to let the SQL handle most of the work - especially if you're using MySQL because it's fast. I wouldn't be wary of joining multiple tables as long as the DB is tuned/indexed properly. My main query joins 6 tables and it does fine.
* You don't necessarily need a separate table for each region. Just have a region_code or region_id field in the table you use for the indexing. That will make it easier to maintain and will allow for cross-region searching JUST IN CASE someone changes their mind.
* Are you actually going to "index" the 4 character fields, or are you just going to use "WHERE field like 'abc%'" queries? If you're actually going to index them, you may want to have a separate table just for the indexes whose rows 'point' to corresponding rows in the other table(s).
* The date scheduling can actually be pretty easy. You'll have a "main" table where you keep the stuff that will be the objects of your searches. Just add 'start_date' & 'end_date' fields and include the appropriate constraints in your query.
* Since your indexes will possibly point to rows in more than one table, sub-selects and/or unions will come in handy. I know Oracle supports them, but the last time I used MySQL, it didn't. You may want to check into that.
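
To make the separate index table idea a bit more concrete, here is a rough sketch in Python of what one scheduled indexing pass might look like. Everything in it is an assumption for illustration only: the `tokenize` and `build_index_rows` helpers, the stop-word list, and the sample offerings are made up, and the real job could just as well be written in Perl or PHP.

```python
import re
from collections import Counter

# Hypothetical stop-word list; tune it for the real data.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to"}


def tokenize(text):
    """Lowercase the text and split it into keyword tokens, dropping stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index_rows(objects):
    """Turn {object_id: [text fields]} into (keyword, object_id, hits) rows.

    Each row 'points' back at the object it came from; hits is how many times
    the keyword occurs across that object's indexed fields.
    """
    rows = []
    for object_id, fields in sorted(objects.items()):
        counts = Counter()
        for field in fields:
            counts.update(tokenize(field))
        for keyword, hits in sorted(counts.items()):
            rows.append((keyword, object_id, hits))
    return rows


# Made-up sample data: object_id -> the text fields to index.
offerings = {
    1: ["Lawn care", "Mowing and lawn edging for the summer"],
    2: ["Snow removal", "Driveway plowing"],
}
index_rows = build_index_rows(offerings)
```

The resulting rows are what a scheduled job would bulk-load into the index table. Keeping a per-keyword hit count is one simple way to get something to sort on later for relevance.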

This is getting a little long so I'll make another post here in a while...

kepp · 01-19-2006, 09:09 AM

Actually, after thinking a little about it, if you use a 'region_id' field in your main table, you wouldn't need to use sub-selects. You'd have something like this for your tables:

CREATE TABLE object_table
(
    object_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    region_id INT UNSIGNED NOT NULL,
    text_field VARCHAR(4) NOT NULL,
    start_time DATETIME NOT NULL,
    end_time DATETIME NOT NULL,
    INDEX idx1 (object_id, region_id, start_time, end_time)
) ENGINE = InnoDB; -- written as TYPE = InnoDB on older MySQL versions

CREATE TABLE index_table
(
    indexed_text VARCHAR(4) NOT NULL,
    object_id INT UNSIGNED NOT NULL,
    INDEX idx2 (indexed_text)
) ENGINE = InnoDB;

...and a main query like this:

SELECT ot.object_id
FROM object_table ot, index_table it
WHERE it.indexed_text like ... or = ...
AND ot.object_id = it.object_id
AND ot.start_time <= NOW()
AND ot.end_time >= NOW()
...

Then, if you needed to add other constraints like the status of an account or something, you just add the appropriate table to the FROM clause and add another AND in, and you're done.
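
For what it's worth, here is one way the pieces could hang together, sketched in Python with the standard `sqlite3` module standing in for MySQL so it runs anywhere. The two tables follow the layout above, but the column types are simplified to what SQLite understands, and the `hits` column, the `search` helper, the sample rows, and the ranking rule (most distinct keywords matched, then total hits) are all assumptions of this sketch rather than anything settled in the thread. The current time is passed in as a parameter instead of calling NOW() so the example is repeatable.

```python
import sqlite3

SCHEMA = """
CREATE TABLE object_table (
    object_id  INTEGER PRIMARY KEY,
    region_id  INTEGER NOT NULL,
    start_time TEXT NOT NULL,
    end_time   TEXT NOT NULL
);
CREATE INDEX idx1 ON object_table (region_id, start_time, end_time);
CREATE TABLE index_table (
    indexed_text TEXT NOT NULL,
    object_id    INTEGER NOT NULL,
    hits         INTEGER NOT NULL
);
CREATE INDEX idx2 ON index_table (indexed_text);
"""


def search(conn, region_id, keywords, now):
    """Return object_ids in one region that are live at `now`, best match first."""
    if not keywords:
        return []
    placeholders = ", ".join("?" for _ in keywords)
    sql = f"""
        SELECT ot.object_id
        FROM object_table ot
        JOIN index_table it ON it.object_id = ot.object_id
        WHERE it.indexed_text IN ({placeholders})
          AND ot.region_id = ?
          AND ot.start_time <= ?
          AND ot.end_time >= ?
        GROUP BY ot.object_id
        ORDER BY COUNT(DISTINCT it.indexed_text) DESC,
                 SUM(it.hits) DESC,
                 ot.object_id
    """
    params = [k.lower() for k in keywords] + [region_id, now, now]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO object_table VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2006-01-01 00:00:00", "2006-12-31 23:59:59"),
        (2, 10, "2006-01-01 00:00:00", "2006-12-31 23:59:59"),
        (3, 10, "2005-01-01 00:00:00", "2005-12-31 23:59:59"),  # expired
        (4, 20, "2006-01-01 00:00:00", "2006-12-31 23:59:59"),  # other region
    ],
)
conn.executemany(
    "INSERT INTO index_table VALUES (?, ?, ?)",
    [
        ("lawn", 1, 2), ("care", 1, 1),
        ("lawn", 2, 1),
        ("lawn", 3, 5),
        ("lawn", 4, 5),
    ],
)
results = search(conn, 10, ["lawn", "care"], "2006-01-19 09:09:00")
```

The region and the schedule are just two more conditions in the WHERE clause, so one shared set of tables covers every region. Dates are stored as ISO-formatted strings here only because SQLite has no DATETIME type; in MySQL the DATETIME columns from the schema above would be compared directly.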
