PDA

View Full Version : Averaging subsets in a large dataset


dac5039
12 Sep 2008, 12:44am
Hey guys,

I have 17899 pieces of data recorded over an hour. Each reading took place 0.2 seconds after the previous. I need to average the data (vertical wind velocity) every 10 seconds, which would be every 10/0.2 = 50 wind velocity readings.

I had a for loop working for the first 359 subsets.. but the last one has less than 50 readings in it, so it fails. One variable is defined as the velocity and another is time (in 0.2 second increments). How should I setup this loop? With respect to time(every 10s)? velocity(every 50 readings)?

I also have to do this for 30 seconds, 1 minute, 2 minutes, and 10 minutes... Can I make a loop that does all of this?

Thanks a lot for the help.

Gargoyle
12 Sep 2008, 02:19am
I've never used Matlab. But your code should have a check built in to see if the record whose value it's pulling in is the last record. That way it doesn't crash when it tries to keep reading records when there aren't any more.

You could create a counter variable and have it add one to its value every time you grab a data value. If you know that you have 17899 records, you could have your code exit when your counter = 17899.

BLuKnight
12 Sep 2008, 06:30am
Never used Matlab either. However, you should be able to setup multiple variables to do this. So we have 5 solutions we're trying to find answers to. Each of those is based around finding the average based on the interval of readings.

To do this, you'll only one while loop which goes through all the data. So, while less than number of records. Inside this loop you'll need a bunch of counters. Let's work with the 10 seconds first.

So, in your loop you should have a iteration counter, total counter and number of readings counter. The iteration counter should increment each time you hit your loop. When it hits 50, you'll do the following:

1. Increment the Number of Readings by 1
2. Add the reading to the total counter
3. Reset the Iteration counter to 0

After you're out the the loop, you'll take the total counter and divide it by the number of readings counter.

Best of all, by duplicating the same efforts with different variable names, all you need to do is change the iteration counter's conditional from 50 to whatever it needs to be and only pass through the data once. (BTW, this is a great thing to learn if you're heading towards a programming career. I have to process gigabytes of flat files in the smallest time possible. Each time you have to iterate through the data exponentially increases the run time of the program.)

Good luck!

shwaip
12 Sep 2008, 09:15am
you don't want to use for loops to iterate over large amounts of data in matlab. they are very inefficient...

this assumes your wind velocity data is in a vector called "data"


dt = 0.2;
time = [30, 60, 120, 600];
vals = cell(1,length(time));

data = data(:);
ind=1;
for ii=time
len = ii/dt;
maxind = floor(length(data)/len)*len; %this will stop it from going over the end of the array
arr = reshape(data(1:maxind),len,maxind/len);
avg = sum(arr)/len;
vals{ind}=avg;
ind = ind+1;
end


your average values will be in the vals{} cell array.

have fun.

shwaip
12 Sep 2008, 09:18am
Never used Matlab either. However, you should be able to setup multiple variables to do this. So we have 5 solutions we're trying to find answers to. Each of those is based around finding the average based on the interval of readings.

To do this, you'll only one while loop which goes through all the data. So, while less than number of records. Inside this loop you'll need a bunch of counters. Let's work with the 10 seconds first.

So, in your loop you should have a iteration counter, total counter and number of readings counter. The iteration counter should increment each time you hit your loop. When it hits 50, you'll do the following:

1. Increment the Number of Readings by 1
2. Add the reading to the total counter
3. Reset the Iteration counter to 0

After you're out the the loop, you'll take the total counter and divide it by the number of readings counter.

Best of all, by duplicating the same efforts with different variable names, all you need to do is change the iteration counter's conditional from 50 to whatever it needs to be and only pass through the data once. (BTW, this is a great thing to learn if you're heading towards a programming career. I have to process gigabytes of flat files in the smallest time possible. Each time you have to iterate through the data exponentially increases the run time of the program.)

Good luck!

this is good advice for other programming languages, but not matlab.

dac5039
12 Sep 2008, 04:50pm
Shwaip,

Thanks a lot, that loop is great. The averages are slightly off tho, I think because of the rounding to negative infinity you have in the "maxind" variable definition? I just averaged the first few sets of 50 readings in Excel and the averages are off from your code a little bit. Is there anything I can try to fix this?

Also, the first reading is at 0, not 0.2 seconds. Thought that might be a problem also.

Thanks a lot,

Dave

shwaip
12 Sep 2008, 05:34pm
how off is "slightly off".

The code is independent of timestamps.

dac5039
12 Sep 2008, 05:56pm
Theyre off by a few hundredths or so. Which is why I think the rounding down did it.

shwaip
12 Sep 2008, 07:53pm
i am 99.999% sure my code is correct.

all the floor() function is doing is making sure that the last section has the appropriate number of points in it for the average, it will not affect the values at all.

Gargoyle
12 Sep 2008, 09:57pm
Do Excel and Matlab have different levels of default precision?

dac5039
12 Sep 2008, 10:20pm
I figured it out...

The code you wrote averages correctly through the 357th set of data (357th set of 50 w readings). but there are two more sets that don't get averaged. Apparently theres a digit off by one somewhere or something haha. Heres my actual code..

dt = 0.2;%timestep
time = [10, 30, 120, 600];%averaging times
vals = cell(1,length(time));%array for average values

data = w(: );%vertical velocity data
ind=1;%loop controller
for ii=time
len = ii/dt;%number of w values in each average(50,150,600,3000)
maxind = floor(length(data)/len)*len; %number of averages (357,119,29,5)
arr = reshape(data(1:maxind),len,maxind/len);%puts w values into a matrix
avg = sum(arr)/len;
vals{ind}=avg;
ind = ind+1;
end

I dont know why the final 98 readings don't get averaged.. any ideas?

Thanks

shwaip
12 Sep 2008, 11:47pm
Do Excel and Matlab have different levels of default precision?

yes, but it wouldn't account for that much of a difference, i don't think.

dac5039
14 Sep 2008, 07:18pm
Any ideas shwaip?

Khaos
14 Sep 2008, 08:17pm
I dont know why the final 98 readings don't get averaged.. any ideas?
Disclaimer: I don't know a thing about MatLab.

The code posted takes only the number of whole sets and reshapes them into a matrix, so of course the remaining data are not averaged.

For example

17899 / 10 * 0.2 = 357.98

50 * 0.98 = 49

or..

17899 % ((int)(17899 / 10 * 0.2)) = 49

So the code schwaip posted discards 49 data that do not fit into a whole set as defined by the period, in the case of my example, 10 seconds

This may be desirable, since the first 357 sets represent data for 10 seconds, and the remaining 49 data represent only a 9.8 second period. It is not technically accurate to compare the two as they represent different things and there is no real way to substitute missing data.

If you want to process the remaining data into an average, then you can isolate them with the following psuedo code, reusing schwaips variable names in bold and assuming 1 based vectors:

if (Length(data) != maxind)
{
LastSetLength = Length(data) - maxind;
LastSetAvg = Sum(data[maxind + 1 to Length(data)]) / LastSetLength;
LastSetPeriod = LastSetLength * dt;
}

You could include this figure with an annotation stating that it is derived from data for a different period of time, and give the period. I don't see much worth in that, though.

Edit: Apparently vectors in MatLab are 1-based.

shwaip
14 Sep 2008, 08:35pm
right.

basically, the behavior you are seeing is what i wrote the code to do.

you can't get an average over 10s of data when all you have is 9.8s (well, you can, but that's not the point of looking at longer term averages)