Hi Guys
I have an application that runs on several sites and has a table with 36 columns, mostly ints and small varchars.
I currently have only one table that stores the data, plus five indexes, and since the table at one location (and others soon) has about 18 million rows I have been trying to come up with a better solution (but only if needed; I don't think I have to tell you that I am a programmer and not a DBA).
The db file size with all the indexes is more than 10 GB, which in itself is not a problem, but is it a bad solution to have it that way?
The questions are:
Are there any big benefits if I split it into several smaller tables or even smaller databases, and make the SPs that get the data aware that, say, 2006's data is in table A and so on?
It's quite important that SELECTs are fast, and that need is far more important than decreasing the size of the database file and so on.
How many rows are okay to have in one table (with 25 columns) before it's too big?
Thanks in advance.
Best regards
Johan, Sweden.
CREATE TABLE [dbo].[Cdr](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Abandon] [varchar](7) NULL,
[Bcap] [varchar](2) NULL,
[BlId] [varchar](16) NULL,
[CallChg] [varchar](6) NULL,
[CallIdentifier] [uniqueidentifier] NULL,
[ChgInfo] [varchar](5) NULL,
[ClId] [varchar](16) NULL,
[CustNo] [smallint] NULL,
[Digits] [varchar](32) NULL,
[DigitType] [varchar](1) NULL,
[Dnis1] [varchar](6) NULL,
[Dnis2] [varchar](6) NULL,
[Duration] [int] NULL,
[FgDani] [varchar](13) NULL,
[HoundredHourDuration] [varchar](3) NULL,
[Name] [varchar](40) NULL,
[NameId] [int] NOT NULL,
[Npi] [varchar](2) NULL,
[OrigAuxId] [varchar](11) NULL,
[OrigId] [varchar](7) NULL,
[OrigMin] [varchar](16) NULL,
[Origten0] [varchar](3) NULL,
[RecNo] [int] NULL,
[RecType] [varchar](1) NOT NULL,
[Redir] [varchar](1) NULL,
[TerId] [varchar](7) NOT NULL,
[TermAuxId] [varchar](11) NULL,
[TermMin] [varchar](16) NULL,
[Termten0] [varchar](3) NULL,
[Timestamp] [datetime] NOT NULL,
[Ton] [varchar](1) NULL,
[Tta] [int] NULL,
[Twt] [int] NULL,
[DateValue] [int] NULL,
[TimeValue] [int] NULL,
[Level] [varchar](50) NOT NULL CONSTRAINT [DF_Cdr_Level] DEFAULT ('x:'),
CONSTRAINT [PK_Cdr] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 10) ON [PRIMARY]
) ON [PRIMARY]|||http://databases.about.com/od/specificproducts/a/firstnormalform.htm
http://en.wikipedia.org/wiki/First_normal_form|||Thanks for the quick reply.
Okay, I've read the articles, but the data in this table is records from a machine and is not duplicated. The name column is duplicated, but I figured it was overkill to move one column out; all the other fields are different from time to time. And every column always has only one value or NULL, so there is no need for a parent-child relation to another table, or?
The actual question was not whether I should split the columns into multiple tables but whether I should split the rows into multiple tables.|||What's the justification for splitting the table data into multiple tables of the exact same structure?
EDIT: That should probably read - "What are your thoughts behind doing this?"|||It's called horizontal table partitioning, and it can be used to create partitioned views; this works best for increasing performance if you are doing it over multiple filegroups on multiple disks.|||The question is whether that increases performance in a single disk/filegroup scenario.
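To make the partitioned-view idea concrete, here is a minimal sketch of what it could look like for this table, assuming a split per year on [Timestamp]; the table, view, and constraint names are illustrative only and the column list is trimmed to the essentials:
CREATE TABLE [dbo].[Cdr_2006](
[Id] [int] NOT NULL,
[Timestamp] [datetime] NOT NULL,
-- ... the remaining Cdr columns ...
CONSTRAINT [PK_Cdr_2006] PRIMARY KEY CLUSTERED ([Id], [Timestamp]),
CONSTRAINT [CK_Cdr_2006_Timestamp] CHECK ([Timestamp] >= '20060101' AND [Timestamp] < '20070101')
)
CREATE TABLE [dbo].[Cdr_2007](
[Id] [int] NOT NULL,
[Timestamp] [datetime] NOT NULL,
-- ... the remaining Cdr columns ...
CONSTRAINT [PK_Cdr_2007] PRIMARY KEY CLUSTERED ([Id], [Timestamp]),
CONSTRAINT [CK_Cdr_2007_Timestamp] CHECK ([Timestamp] >= '20070101' AND [Timestamp] < '20080101')
)
GO
-- Callers query the view; a WHERE clause on [Timestamp] lets the optimizer
-- skip the tables whose CHECK constraints cannot match the requested range.
CREATE VIEW [dbo].[CdrAll]
AS
SELECT * FROM [dbo].[Cdr_2006]
UNION ALL
SELECT * FROM [dbo].[Cdr_2007]
Even on a single disk, the trusted CHECK constraints can reduce the pages read for date-bounded reports, but whether that beats a well-indexed single table is exactly the open question here.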
If I ask it like this: are there any problems related to a table with 18 million rows?
If not, I am quite happy with the current solution, but if I were to split the data into several tables, maybe I would keep the last month's data in one "active" table (since most queries are on the most recent data) and all the rest in an "archive" table.
And if a split would be a good choice, when querying several months what's the best way to look in the two tables? Is it to have an SP that handles all this, or should my data layer handle the access? E.g. two SPs, one for active and one for historic data, combined with two calls if needed?|||If I ask it like this, are there any problems related to a table with 18 million rows? No - the number of rows is irrelevant. You need to see how things perform and act accordingly based on that. And although partitioning is an option, there would be hundreds of things you'd want to consider first. Just to give you perspective, I work with non-partitioned tables with 1/2 billion rows and I expect a lot of the other posters here do also.
Also Don is right, your table would benefit from normalisation and not just first normal form. Remember that it is the number of pages read/written from disk, not the number of rows, that counts. So if you want a performance justification (rather than a logical justification) for normalising: a normalised database will distribute your data across multiple tables, reducing the row size and leading to more rows per page. YMMV of course depending on your queries.
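As a small illustration of that point with the one column already known to repeat: the table carries both [NameId] and a [Name] varchar(40), so - assuming [NameId] really does identify [Name] one-to-one, which this thread has not confirmed - the name text could live once in a lookup table and be joined back in the reports:
CREATE TABLE [dbo].[CdrName](
[NameId] [int] NOT NULL,
[Name] [varchar](40) NULL,
CONSTRAINT [PK_CdrName] PRIMARY KEY CLUSTERED ([NameId] ASC)
) ON [PRIMARY]
-- Reports would then pick the name up with a join instead of storing
-- the 40-byte string on every one of the 18 million rows:
-- SELECT C.[Id], N.[Name], ...
-- FROM [dbo].[Cdr] AS C
-- INNER JOIN [dbo].[CdrName] AS N ON N.[NameId] = C.[NameId]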
What sort of queries are running on it? Single row lookups or reports pulling in lots of data or a combination?|||Woah - this is probably terrible:
CONSTRAINT [PK_Cdr] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 10) ON [PRIMARY]
) ON [PRIMARY]
Unless your rows get a lot of updates that make the data in the columns larger after the initial insert, the fillfactor on a monotonically increasing clustered index should be approaching 100. As it stands, all your clustered index leaf pages are 90% empty (depending on data modifications).
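If the fill factor is the only thing being changed, something along these lines would rebuild the clustered primary key in place (a sketch only; on SQL Server 2005 an ONLINE rebuild would additionally need Enterprise Edition, so it is left at the default here):
-- Rebuild the clustered PK so the leaf pages are packed rather than 90% empty
ALTER INDEX [PK_Cdr] ON [dbo].[Cdr]
REBUILD WITH (FILLFACTOR = 100)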
This is the sort of thing I mean by partitioning being one of your last options - lots of things to consider first :)|||Fancy scripting out the other indexes?|||Thanks guys, seems like I really need to sit down and study some database architecture.
I currently only have one SP that handles the queries that are always used in reporting, so once a row is in the db it "never" changes. But since this table stores telephone call data records that are collected at runtime, it gets a lot of small inserts quite often, say 10-200 records every five minutes. The insert interval is different at different sites. But as I said, once it's in there it's "never" going to change.
So the queries return everything from a couple of million rows down to about 100, based on how long a period your report is covering.
How would you suggest I normalize this table then? Or could you point me to some good resource where I can find out?
I'm sorry that these are maybe newbie questions, but that's really what I am, so :)
I really appreciate your help.
Here are the rest of the indexes:
/****** Object: Index [IX_DateValue] Script Date: 11/21/2007 15:22:20 ******/
CREATE NONCLUSTERED INDEX [IX_DateValue] ON [dbo].[Cdr]
(
[DateValue] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = ON, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
/****** Object: Index [IX_Level] Script Date: 11/21/2007 15:22:38 ******/
CREATE NONCLUSTERED INDEX [IX_Level] ON [dbo].[Cdr]
(
[Level] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = ON, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
/****** Object: Index [IX_OrigId] Script Date: 11/21/2007 15:22:48 ******/
CREATE NONCLUSTERED INDEX [IX_OrigId] ON [dbo].[Cdr]
(
[OrigId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = ON, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
/****** Object: Index [IX_TerId] Script Date: 11/21/2007 15:22:56 ******/
CREATE NONCLUSTERED INDEX [IX_TerId] ON [dbo].[Cdr]
(
[TerId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = ON, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
/****** Object: Index [IX_TimeValue] Script Date: 11/21/2007 15:23:08 ******/
CREATE NONCLUSTERED INDEX [IX_TimeValue] ON [dbo].[Cdr]
(
[TimeValue] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]|||Bump fillfactor for the clustered index to 100.
This is my fave db design article at the mo:
http://www.tonymarston.net/php-mysql/database-design.html
Please post your proc. If that is literally all you do with this table then we can tune it rather well. I suspect we will want to change your clustered index....|||I have some reading to do:)
This is the quite simple SP. The table function SplitDelimitedVarChar takes a varchar and a split char and creates a table that I use to get the correct dates and [Level] values (called exchanges in the SP).
The way I came up with this code is by testing different solutions, and this has been the fastest. As you can see it's a quite simple select query; the columns [DateValue] and [TimeValue] are ints that are created at insert based on the DATETIME column [Timestamp], and are just the int representation of the date and time - they are there because it's easier.
CREATE PROCEDURE [dbo].[spStudio_Get_Cdr]
@beginDate DATETIME,
@endDate DATETIME,
@beginTime INT,
@endTime INT,
@subscribers VARCHAR(MAX),
@exchanges VARCHAR(MAX) = '1:'
AS
BEGIN
SET NOCOUNT ON;
DECLARE @exch TABLE(Item Varchar(50))
INSERT INTO @exch
SELECT Item FROM [dbo].[SplitDelimitedVarChar] (@exchanges, '|') ORDER BY Item
DECLARE @subs TABLE(Item Varchar(19))
INSERT INTO @subs
SELECT Item FROM [dbo].[SplitDelimitedVarChar] (@subscribers, '|') ORDER BY Item
SELECT [id]
,[Abandon]
,[Bcap]
,[BlId]
,[CallChg]
,[CallIdentifier]
,[ChgInfo]
,[ClId]
,[CustNo]
,[Digits]
,[DigitType]
,[Dnis1]
,[Dnis2]
,[Duration]
,[FgDani]
,[HoundredHourDuration]
,[Name]
,[NameId]
,[Npi]
,[OrigAuxId]
,[OrigId]
,[OrigMin]
,[Origten0]
,[RecNo]
,[RecType]
,[Redir]
,[TerId]
,[TermAuxId]
,[TermMin]
,[Termten0]
,[Timestamp]
,[Ton]
,[Tta]
,[Twt]
,[Level]
FROM
[dbo].[Cdr] AS C
INNER JOIN @exch AS E
ON
C.[Level] = E.[Item]
WHERE
(C.[DateValue] BETWEEN FLOOR(CAST(@beginDate AS FLOAT)) AND FLOOR(CAST(@endDate AS FLOAT)))
AND
(C.[TimeValue] BETWEEN @beginTime AND @endTime)
AND
(EXISTS(SELECT * FROM @subs WHERE [Item] = C.[OrigId])
OR
EXISTS(SELECT * FROM @subs WHERE [Item] = C.[TerId]))
END
thanks in advance|||Changing from a datetime to 2 different float values is NOT easier than doing a datediff or a dateadd.|||Agreed.
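For what it's worth, a quick sketch of that point as it applies to the WHERE clause above: for a datetime, FLOOR(CAST(... AS FLOAT)) and DATEDIFF(DAY, 0, ...) both yield the number of whole days since 1900-01-01, so the comparison could be written without the FLOAT round trip (this assumes [DateValue] really is that day number, which is what the existing filter implies):
-- Both expressions return the same whole-day number for a datetime value
SELECT FLOOR(CAST(GETDATE() AS FLOAT)) AS [DayNumberViaFloat],
DATEDIFF(DAY, 0, GETDATE()) AS [DayNumberViaDateDiff]
-- so the date filter could read:
-- C.[DateValue] BETWEEN DATEDIFF(DAY, 0, @beginDate) AND DATEDIFF(DAY, 0, @endDate)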
Jeff has some pertinent comments here - at least two or three articles apply. http://weblogs.sqlteam.com/jeffs/category/283.aspx
More reading ;)|||The idea of your sproc then is that you don't search for a range of "from 2:00pm 1st of Jan 2007 to 4:00pm 8th of April" but "1st of Jan 2007 to 8th of April, between the times of 2:00pm and 4:00pm only". Correct?|||Yes, that's correct; the report can be, say, the whole year of 2006 but only between 07:00 and 16:00, not 20060101 07:00 to 20061231 16:00.
Exactly as you described.
If there is a better way to get this result I'm of course going to try it.
cheers
iful|||If you do decide to partition (which probably is not necessary) and you are on 2005, have a look at partition functions and partition schemes. You don't need to use partitioned views anymore.
This approach to horizontal partitioning is more convenient IMO because you only have one set of indexes, etc, to manage. It behaves as a single table spread among multiple filegroups.
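A minimal sketch of what that looks like on 2005 (the function, scheme, and boundary dates below are illustrative only; spreading the partitions over several filegroups instead of just [PRIMARY] is where the extra I/O benefit would come from):
-- One partition per year, split on the datetime column
CREATE PARTITION FUNCTION [pfCdrYear] (datetime)
AS RANGE RIGHT FOR VALUES ('20060101', '20070101', '20080101')

CREATE PARTITION SCHEME [psCdrYear]
AS PARTITION [pfCdrYear] ALL TO ([PRIMARY])

-- The table (or its clustered index) is then created
-- ON [psCdrYear]([Timestamp]) instead of ON [PRIMARY],
-- and SQL Server handles the row placement.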