Wednesday, March 7, 2012

beware of avg() for large datasets

Not a question, just a post about something I ran into today that surprised me a little.

I have a huge dataset - about 700m rows, one column of which is a tinyint. I needed to take an average of that column, so I did this:

select avg(mytinyint) from mytable

which returned:

Msg 8115, Level 16, State 2, Line 6
Arithmetic overflow error converting expression to data type int.

It turns out that if the sum() over the column is bigger than the int maximum, 2147483647, then avg() throws this error. This surprised me a little - I had assumed SQL Server would use bigint internally when computing an integer average, but it uses int.
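To get a feel for how quickly this can happen: a tinyint maxes out at 255, so in the worst case the int accumulator overflows after only 2147483647 / 255 ≈ 8.4 million rows. And with 700m rows, any average value above about 3 (2147483647 / 700000000 ≈ 3.07) is enough to blow past the int limit.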

so you have to do this:

select avg(cast(mytinyint as bigint)) from mytable

what's annoying to me is, an average of values of type T can never be larger or smaller than the max/min allowed value of type T. This means the SQL Server devs could have always used bigint internally for the sum and count, and then returned cast(sum/count as T) from avg() with no possibility of overflow (unless the sum or count itself exceeded the max bigint! you are on your own if that happens i guess!).
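If you'd rather make the big accumulator explicit yourself, the same workaround can be written as a manual sum over count (same hypothetical column and table names as above; note count(mytinyint) skips NULLs just like avg() does, and the integer division truncates just like avg() does on integer types):

select sum(cast(mytinyint as bigint)) / count(mytinyint) from mytable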

Because of this, if your dataset is large enough you have to cast your column to bigint to make avg() work, even if the column is only 1 measly byte! kind of a waste of space if you ask me. :)

here's a little example of what I'm talking about, that doesn't require you to import a 700m row dataset :)

declare @t table (id int)
insert into @t select 1 union all select 2147483647
select avg(cast(id as bigint)) from @t -- works
select avg(id) from @t -- fails
