Averaging subsets in a large dataset

edited September 2008 in Science & Tech
Hey guys,

I have 17899 pieces of data recorded over an hour. Each reading took place 0.2 seconds after the previous. I need to average the data (vertical wind velocity) every 10 seconds, which would be every 10/0.2 = 50 wind velocity readings.

I had a for loop working for the first 359 subsets, but the last one has fewer than 50 readings in it, so it fails. One variable is defined as the velocity and another as time (in 0.2-second increments). How should I set up this loop? With respect to time (every 10 s)? Or velocity (every 50 readings)?

I also have to do this for 30 seconds, 1 minute, 2 minutes, and 10 minutes... Can I make a loop that does all of this?

Thanks a lot for the help.

Comments

  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited September 2008
    I've never used Matlab. But your code should have a check built in to see whether the record whose value it's pulling in is the last record, so it doesn't crash by trying to read records that aren't there.

    You could create a counter variable and have it add one to its value every time you grab a data value. If you know that you have 17899 records, you could have your code exit when your counter = 17899.
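    In rough Matlab-ish terms (untested, since like I said I've never used it, and all the names are made up), that check could look something like this:
    nRecords = length(data);   %17899 in your case; data is the velocity vector
    counter = 0;               %readings in the current 10-second block
    total = 0;                 %running sum for the current block
    averages = [];             %one average per block
    for k = 1:nRecords
        total = total + data(k);
        counter = counter + 1;
        if counter == 50 || k == nRecords  %block is full, or we just read the last record
            averages(end+1) = total / counter;
            total = 0;
            counter = 0;
        end
    end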
  • BLuKnight Lehi, UT Icrontian
    edited September 2008
    Never used Matlab either. However, you should be able to set up multiple variables to do this. We have five answers we're trying to find, each based on averaging over a different interval of readings.

    To do this, you'll only need one while loop that goes through all the data (while the record index is less than the number of records). Inside this loop you'll need a bunch of counters. Let's work with the 10-second case first.

    So, in your loop you should have an iteration counter, a running total, and a number-of-averages counter. The iteration counter should increment each time through the loop, and each reading should be added to the running total. When the iteration counter hits 50, you'll do the following:

    1. Divide the running total by 50 and store it as that interval's average
    2. Increment the number-of-averages counter by 1
    3. Reset the running total and the iteration counter to 0

    After you're out of the loop, if the iteration counter isn't zero, divide the leftover total by the iteration counter to get the average of the final, shorter interval.

    Best of all, by duplicating the same counters under different variable names for each interval, all you need to change is the iteration counter's conditional from 50 to whatever it needs to be, and you still only pass through the data once. (BTW, this is a great thing to learn if you're heading towards a programming career. I have to process gigabytes of flat files in the smallest time possible. Every extra pass through the data multiplies the run time of the program.)
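    An untested sketch of that one-pass idea (I don't know Matlab, so treat the names as made up; data is the velocity vector):
    blockSizes = [50, 150, 300, 600, 3000];%readings per 10 s, 30 s, 1 min, 2 min, 10 min at 0.2 s/reading
    totals   = zeros(size(blockSizes));    %running total for each interval
    counts   = zeros(size(blockSizes));    %iteration counter for each interval
    averages = cell(size(blockSizes));     %finished averages for each interval
    for k = 1:length(data)                 %one pass through the data
        totals = totals + data(k);
        counts = counts + 1;
        for j = find(counts == blockSizes) %any interval whose block just filled up
            averages{j}(end+1) = totals(j) / blockSizes(j);
            totals(j) = 0;
            counts(j) = 0;
        end
    end
    for j = find(counts > 0)               %leftover partial blocks (the short last subset)
        averages{j}(end+1) = totals(j) / counts(j);
    end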

    Good luck!
  • shwaip bluffin' with my muffin Icrontian
    edited September 2008
    you don't want to use for loops to iterate over large amounts of data in matlab. they are very inefficient...

    this assumes your wind velocity data is in a vector called "data"
    dt = 0.2;
    time = [30, 60, 120, 600];
    vals = cell(1,length(time));
    
    data = data(:);
    ind=1;
    for ii=time
      len = ii/dt;
      maxind = floor(length(data)/len)*len; %this will stop it from going over the end of the array
      arr = reshape(data(1:maxind),len,maxind/len);
      avg = sum(arr)/len;
      vals{ind}=avg;
      ind = ind+1;
    end
    

    your average values will be in the vals{} cell array.
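    for example, with the times above vals{1} will hold the 30-second averages. a quick way to eyeball them (variable names here are just for illustration):
    t30 = ((1:length(vals{1})) - 0.5) * 30;  %midpoint of each 30-second window
    plot(t30, vals{1});
    xlabel('time (s)'); ylabel('mean vertical wind velocity');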

    have fun.
  • shwaip bluffin' with my muffin Icrontian
    edited September 2008
    BLuKnight wrote:
    Never used Matlab either. However, you should be able to set up multiple variables to do this. We have five answers we're trying to find, each based on averaging over a different interval of readings.

    To do this, you'll only need one while loop that goes through all the data (while the record index is less than the number of records). Inside this loop you'll need a bunch of counters. Let's work with the 10-second case first.

    So, in your loop you should have an iteration counter, a running total, and a number-of-averages counter. The iteration counter should increment each time through the loop, and each reading should be added to the running total. When the iteration counter hits 50, you'll do the following:

    1. Divide the running total by 50 and store it as that interval's average
    2. Increment the number-of-averages counter by 1
    3. Reset the running total and the iteration counter to 0

    After you're out of the loop, if the iteration counter isn't zero, divide the leftover total by the iteration counter to get the average of the final, shorter interval.

    Best of all, by duplicating the same counters under different variable names for each interval, all you need to change is the iteration counter's conditional from 50 to whatever it needs to be, and you still only pass through the data once. (BTW, this is a great thing to learn if you're heading towards a programming career. I have to process gigabytes of flat files in the smallest time possible. Every extra pass through the data multiplies the run time of the program.)

    Good luck!

    this is good advice for other programming languages, but not matlab.
  • edited September 2008
    Shwaip,

    Thanks a lot, that loop is great. The averages are slightly off though, I think because of the rounding toward negative infinity you have in the "maxind" variable definition? I just averaged the first few sets of 50 readings in Excel, and the averages from your code are off by a little bit. Is there anything I can try to fix this?

    Also, the first reading is at 0, not 0.2 seconds. Thought that might be a problem also.

    Thanks a lot,

    Dave
  • shwaip bluffin' with my muffin Icrontian
    edited September 2008
    how far off is "slightly off"?

    The code is independent of timestamps.
  • edited September 2008
    They're off by a few hundredths or so, which is why I think the rounding down did it.
  • shwaip bluffin' with my muffin Icrontian
    edited September 2008
    i am 99.999% sure my code is correct.

    all the floor() function is doing is making sure the reshape only uses complete sections, so the last section has the appropriate number of points in it for the average; it will not affect the values at all.
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited September 2008
    Do Excel and Matlab have different levels of default precision?
  • edited September 2008
    I figured it out...

    The code you wrote averages correctly through the 357th set of data (the 357th set of 50 w readings), but there are two more sets that don't get averaged. Apparently there's a digit off by one somewhere or something, haha. Here's my actual code:
    dt = 0.2;%timestep
    time = [10, 30, 120, 600];%averaging times
    vals = cell(1,length(time));%cell array for the average values
    
    data = w(:);%vertical velocity data
    ind=1;%index into vals
    for ii=time
      len = ii/dt;%number of w values in each average (50, 150, 600, 3000)
      maxind = floor(length(data)/len)*len;%last index that fits into a complete set (357, 119, 29, or 5 complete sets)
      arr = reshape(data(1:maxind),len,maxind/len);%puts the w values into a matrix, one complete set per column
      avg = sum(arr)/len;%column-wise average of each set
      vals{ind}=avg;
      ind = ind+1;
    end
    
    I don't know why the final 98 readings don't get averaged... any ideas?

    Thanks
  • shwaip bluffin' with my muffin Icrontian
    edited September 2008
    Gargoyle wrote:
    Do Excel and Matlab have different levels of default precision?

    yes, but it wouldn't account for that much of a difference, i don't think.
  • edited September 2008
    Any ideas shwaip?
  • Khaos New Hampshire
    edited September 2008
    I don't know why the final 98 readings don't get averaged... any ideas?
    Disclaimer: I don't know a thing about MatLab.

    The code posted takes only the number of whole sets and reshapes them into a matrix, so of course the remaining data are not averaged.

    For example:

    17899 / (10 / 0.2) = 17899 / 50 = 357.98 sets

    50 * 0.98 = 49 leftover readings

    or, equivalently:

    mod(17899, 50) = 49

    So the code shwaip posted discards the 49 readings that do not fit into a whole set as defined by the period (10 seconds in this example).

    This may be desirable, since each of the first 357 sets represents 10 seconds of data, while the remaining 49 readings represent only a 9.8-second period. It is not really accurate to compare the two, as they represent different things, and there is no real way to substitute the missing data.

    If you want to process the remaining data into an average, you can isolate them with something like the following (reusing shwaip's variable names and assuming 1-based vectors):
    if length(data) ~= maxind
        LastSetLength = length(data) - maxind;
        LastSetAvg = sum(data(maxind+1:end)) / LastSetLength;
        LastSetPeriod = LastSetLength * dt;%how many seconds the partial set covers
    end
    
    You could include this figure with an annotation stating that it is derived from data for a different period of time, and give the period. I don't see much worth in that, though.

    Edit: Apparently vectors in MatLab are 1-based.
  • shwaip bluffin' with my muffin Icrontian
    edited September 2008
    right.

    basically, the behavior you are seeing is what i wrote the code to do.

    you can't get an average over 10s of data when all you have is 9.8s (well, you can, but that's not the point of looking at longer term averages)