Friday, February 10, 2012

Best way for a Set-based query

I have a database where each row has a set of keywords separated by
commas eg .

ROW 1 ID,<other columns>,"green,hat,man,boat"
ROW 2 ID,<other columns>,"green,boat,seaside"
ROW 3 ID,<other columns>,"red,hat"

I want users to be able to find similar records to the one they are
looking it (this is art BTW). ie if they are looking at row 1 and
click "Find Similar Works of Art" I want to do a query where I find
records that have a commonality of keywords and obviously the ones
that match the most keywords should be shown first. The keywords are
currently just stored comma-separated in a single column but I am not
attached to this. Nor are they guaranteed to come in any order (ie
they are not alpha sorted) but they could be.

Number of records is around 15000 on a hosted server.

Any tips for best query/algorithm/data storage to achieve this? or
references to useful sources?

Thanks,
RBRoland Burr wrote:

> I have a database where each row has a set of keywords separated by
> commas eg .
> ROW 1 ID,<other columns>,"green,hat,man,boat"
> ROW 2 ID,<other columns>,"green,boat,seaside"
> ROW 3 ID,<other columns>,"red,hat"
> I want users to be able to find similar records to the one they are
> looking it (this is art BTW). ie if they are looking at row 1 and
> click "Find Similar Works of Art" I want to do a query where I find
> records that have a commonality of keywords and obviously the ones
> that match the most keywords should be shown first. The keywords are
> currently just stored comma-separated in a single column but I am not
> attached to this. Nor are they guaranteed to come in any order (ie
> they are not alpha sorted) but they could be.
> Number of records is around 15000 on a hosted server.
> Any tips for best query/algorithm/data storage to achieve this? or
> references to useful sources?
> Thanks,
> RB

normally you break the keywords out into a child table, so that you have two
tables (there may be typos in my SQL, this is off the top of my head):

CREATE TABLE artwork (
ID char(10) PRIMARY KEY,
...other columns...
)

CREATE TABLE keywords (
ID char(10) FOREIGN KEY REFERENCES artwork(ID),
keyword char(10)
)

Index on keyword and the SQL queries should write themselves.

--
Kenneth Downs
Secure Data Software, Inc.
(Ken)nneth@.(Sec)ure(Dat)a(.com)|||Roland Burr (rburr49@.hotmail.com) writes:
> I have a database where each row has a set of keywords separated by
> commas eg .
> ROW 1 ID,<other columns>,"green,hat,man,boat"
> ROW 2 ID,<other columns>,"green,boat,seaside"
> ROW 3 ID,<other columns>,"red,hat"
> I want users to be able to find similar records to the one they are
> looking it (this is art BTW). ie if they are looking at row 1 and
> click "Find Similar Works of Art" I want to do a query where I find
> records that have a commonality of keywords and obviously the ones
> that match the most keywords should be shown first. The keywords are
> currently just stored comma-separated in a single column but I am not
> attached to this. Nor are they guaranteed to come in any order (ie
> they are not alpha sorted) but they could be.

I'm glad to hear that you are not attached to the format, because
this sort of format, is really ugly to work with in SQL, as it
violates a basic principle for relational database design. As Kenneth
said, store the keywords as rows.

Find the number of matches given one item would be as easy as this:

SELECT k1.id, COUNT(*)
JOIN keywords k1
WHERE EXISTS (SELECT *
FROM keywordss k2
WHERE k2.id = @.currentid
AND k1.keyword = k2.keyword)
ORDER BY 2 DESC

But I will have to add the disclaimed that I have not tested the
above.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||>> I have a database where each row has a set of keywords separated by
commas <<

You are in violation of First Normal Form, so we are not even in a
relational model at the start of this.

>> a query where I find records [sic] that have a commonality of
keywords <<

CREATE TABLE Keywords
(title_id INTEGER NOT NULL,
keyword VARCHAR(15) NOT NULL,
PRIMARY KEY (title_id, keyword));

load this table with your search words:

CREATE TABLE Searchwords
(keyword VARCHAR(15) NOT NULL PRIMARY KEY);

now the query is easy.

SELECT K1.title_id, COUNT(*) AS matches
Keywords AS K1,
Searchwords AS S1
WHERE K1.keyword = S1.keyword
GROUP BY K1.title_id
ORDER BY matches DESC;

>>... and obviously the ones that match the most keywords should be
shown first.<<

Actually, ordering is a display issue for the front end and not the
database, but you can add an ORDER BY clause. Having said all of this,
have you considered using a text search tool instead of SQL? Cheap,
fast and much more powerful for this knd of thing.

No comments:

Post a Comment