Batch-processing in MATLAB


geodave
17 Jul 2008, 2:13am
Hello All!

I'm fairly new to MATLAB and have been given a task that is quite challenging to me.

I have a text file that contains four columns and many rows (on the order of a few hundred thousand rows). I'm trying to write a script in MATLAB that will read the rows and columns from this file, and write only the first three columns into a new text file.

Here's what the original dataset in the text file looks like:

-16.517754 3.610515 -0.847929 30
-16.472557 3.611480 -0.845726 28
-16.477274 3.617941 -0.846026 30
-16.433626 3.616477 -0.843872 32
-16.431351 3.626801 -0.843872 30
-16.424358 3.630670 -0.843572 32
-16.406473 3.637529 -0.842770 32
-16.406305 3.642901 -0.842820 34
-16.403439 3.655784 -0.842820 34
-16.409687 3.659884 -0.843171 32

Basically, I want columns 1, 2, and 3, but not column 4, to be written in the new text file.

I have written the following script in MATLAB that works very well for a small dataset (a few thousand rows). It creates two text files; one with a header (x, y, z) and another without a header.

clear all

% CHANGE THIS TO YOUR FILE'S NAME
xyz_data = load('serrano_all_data.txt');

x = xyz_data(:,1);
y = xyz_data(:,2);
z = xyz_data(:,3);

B = [x y z]; % new 3-column matrix

% no header
dlmwrite('xyz_no_header.txt',B); % comma delimited

% with headers
fid = fopen('xyz_header.txt','w+t');
fprintf(fid,'x,y,z\n'); % writes headers to text file (no data argument needed)
fclose(fid);
dlmwrite('xyz_header.txt',B,'-append'); % comma delimited, appends new xyz matrix below the header

However, when I try running this script on a much larger dataset (16 million rows), I get the following error in MATLAB:

??? Error using ==> horzcat
Out of memory. Type HELP MEMORY for your options.

Error in ==> three_column_utility at 22
B = [x y z]; % new 3-column matrix

I have a suspicion I'm getting this error because the dataset is very large. When I asked a friend who is more familiar with MATLAB than I am, he suggested I process the data in batches rather than all at once. For example, the script should read the first thousand lines and write them into the new text file in the new format (3 columns instead of 4), then move on to the next thousand lines and append those below the first thousand in the same format, and so on, until the end of the file is reached (using the feof command, I think).

My problem is that I'm not quite sure how to do this (if this is the right approach, that is). Any help/suggestions/tips would be greatly appreciated!

shwaip
17 Jul 2008, 2:46am
Are you doing this in Windows or Linux?

geodave
17 Jul 2008, 3:41am
I'm doing this in Windows.

shwaip
17 Jul 2008, 4:35am
If you had been using Linux, it could have been a one-line bash command :P

Rather than saying:

x = xyz(:,1);
y = xyz(:,2);
z = xyz(:,3);
B = [ x y z ];


you can just address the columns of xyz directly:


xyz(:,1:3);

So your problem is that you're holding several copies of the data in memory at once (xyz_data, then x, y and z, then B). The version below should only keep one copy.

I think this should work.



clear all

% CHANGE THIS TO YOUR FILE'S NAME
xyz_data = load('serrano_all_data.txt');


% no header
dlmwrite('xyz_no_header.txt',xyz_data(:,1:3)); % comma delimited

% with headers
fid = fopen('xyz_header.txt','w+t');
fprintf(fid,'x,y,z\n'); % writes headers to text file
fclose(fid);
dlmwrite('xyz_header.txt',xyz_data(:,1:3),'-append'); % comma delimited, appends new xyz matrix below the header
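One thing worth checking in the dlmwrite calls above: as far as I know, MATLAB's dlmwrite keeps only about five significant digits unless told otherwise, so a value like -16.517754 could come out as -16.518 (worth verifying on your version). The sketch below asks for six decimal places explicitly with the 'precision' option and reads the file back to check. The file name xyz_precision_demo.txt is made up for the demo, and the two rows are taken from the sample data at the top of the thread.

```matlab
% Two sample rows from the thread (first three columns only).
B = [-16.517754 3.610515 -0.847929;
     -16.472557 3.611480 -0.845726];

% Ask for six decimal places explicitly instead of relying on the default.
% 'xyz_precision_demo.txt' is a hypothetical file name used only for this demo.
dlmwrite('xyz_precision_demo.txt', B, 'delimiter', ',', 'precision', '%.6f');

% Read the file back and compare with the original values.
B_back = dlmread('xyz_precision_demo.txt', ',');
max_err = max(abs(B(:) - B_back(:)));
fprintf('largest round-trip error: %g\n', max_err);
```

The same 'precision' argument can be added to the '-append' call, so the rows under the header keep all six decimals too.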

geodave
17 Jul 2008, 8:44am
Thanks for looking at this. I copied and pasted your modified version of my script, and ran it using the small sample of the dataset. It worked nicely. But when I ran it using the large dataset (~0.6 GB text file, four columns by ~16 million rows), I received the following error:

??? Error using ==> load
Out of memory. Type HELP MEMORY for your options.

Error in ==> new_utility at 4
xyz_data = load('serrano_all_data.txt');

It's different from the error I got with my old script: back then MATLAB failed in "horzcat", whereas now it fails in "load".

What are your thoughts on this? Again, thanks for spending the time in helping me with this.

shwaip
17 Jul 2008, 7:00pm
Hi.

Basically, you're running out of memory. You're right that the file is way too large to load in one go.
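To put rough numbers on that, here is a back-of-the-envelope estimate. It assumes every value is stored as a double (8 bytes) and uses the ~16 million rows mentioned above. It ignores whatever temporary memory load itself needs while parsing the text, so treat the result as a lower bound.

```matlab
% Rough memory estimate for the scripts in this thread.
n_rows = 16e6;            % row count quoted in the thread
bytes_per_value = 8;      % assumption: all values are stored as doubles

full_matrix = n_rows * 4 * bytes_per_value;   % xyz_data, all four columns
xyz_columns = n_rows * 3 * bytes_per_value;   % x, y and z together
b_matrix    = n_rows * 3 * bytes_per_value;   % B = [x y z]

full_mb  = full_matrix / 2^20;
total_mb = (full_matrix + xyz_columns + b_matrix) / 2^20;
fprintf('xyz_data alone:         %.0f MB\n', full_mb);
fprintf('xyz_data + x,y,z + B:   %.0f MB\n', total_mb);
```

That works out to roughly 488 MB for xyz_data by itself and about 1.2 GB once x, y, z and B exist alongside it, which is a lot to ask of a 32-bit Windows session (an assumption; nobody here has said which build is in use).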

try this (it'll probably be slow):



clear all

% open files
fp = fopen('serrano_all_data.txt');
fp_head = fopen('xyz_header.txt','w+t');
fp_nohead = fopen('xyz_noheader.txt','w+t');

% write headers
fprintf(fp_head,'x,y,z\n');

while 1
    tline = fgetl(fp); % read one line (tline avoids shadowing the built-in line function)
    if ~ischar(tline),break,end; % fgetl returns -1 at end of file
    spl = regexp(strtrim(tline),'\s+','split'); % split on whitespace
    % the fields are text, so write them with %s and end each row with a newline
    fprintf(fp_head,'%s,%s,%s\n',spl{1},spl{2},spl{3}); % file with header
    fprintf(fp_nohead,'%s,%s,%s\n',spl{1},spl{2},spl{3}); % file without header
end

fclose all; % close all open files

geodave
18 Jul 2008, 12:58am
I tried your latest version of the script, and it was taking quite a long time to process the text file (I had to force MATLAB to quit about 1 hour into the run). However, I then ran your earlier modified version of the script, the one from the post where you suggested addressing the columns directly:

xyz(:,1:3);

on a computer with more RAM... and it worked! It took ~39 minutes to process the ~0.6 GB text file, but it did it flawlessly!

I appreciate your help with this. Thank you!
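
For anyone landing on this thread later: the batch idea from the first post (read a block of rows, append them to the output, repeat until the end of the file) can be sketched with textscan, which accepts a repeat count and can skip a column with %*f. This is only a sketch under a few assumptions: the input is whitespace-delimited with exactly four numeric columns per row, the file names are made up, and chunk_rows is set to 2 so that the tiny demo file needs more than one batch. It was not run against the 16-million-row file from this thread.

```matlab
% Build a tiny sample input from rows quoted in the thread so the sketch runs as-is.
in_name  = 'sample_four_columns.txt';    % hypothetical; use your real file instead
out_name = 'sample_three_columns.txt';   % hypothetical output name
fid = fopen(in_name, 'w');
fprintf(fid, '%s\n', '-16.517754 3.610515 -0.847929 30', ...
                     '-16.472557 3.611480 -0.845726 28', ...
                     '-16.477274 3.617941 -0.846026 30');
fclose(fid);

chunk_rows = 2;   % demo value; something like 100000 is more sensible for real data

fin  = fopen(in_name, 'r');
fout = fopen(out_name, 'w');
fprintf(fout, 'x,y,z\n');                % header line

while ~feof(fin)
    % Read up to chunk_rows rows; %*f parses the fourth column and discards it.
    C = textscan(fin, '%f %f %f %*f', chunk_rows);
    xyz = [C{1} C{2} C{3}];
    if isempty(xyz), break, end          % nothing left but trailing whitespace
    % fprintf walks the data column by column, so pass the transpose
    % to get one x,y,z triple per output line.
    fprintf(fout, '%.6f,%.6f,%.6f\n', xyz');
end

fclose(fin);
fclose(fout);
```

Only one batch of rows is held in memory at a time, so peak usage depends on chunk_rows rather than on the size of the file. Dropping the header line gives the no-header variant.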