Batch-processing in MATLAB

geodave Arizona
edited July 2008 in Science & Tech
Hello All!

I'm fairly new to MATLAB and have been given a task that is quite challenging to me.

I have a text file that contains four columns and many rows (on the order of a few hundred thousand rows). I'm trying to write a script in MATLAB that will read the rows and columns from this file, and write only the first three columns into a new text file.

Here's what the original dataset in the text file looks like:
-16.517754 3.610515 -0.847929 30
-16.472557 3.611480 -0.845726 28
-16.477274 3.617941 -0.846026 30
-16.433626 3.616477 -0.843872 32
-16.431351 3.626801 -0.843872 30
-16.424358 3.630670 -0.843572 32
-16.406473 3.637529 -0.842770 32
-16.406305 3.642901 -0.842820 34
-16.403439 3.655784 -0.842820 34
-16.409687 3.659884 -0.843171 32
Basically, I want columns 1, 2, and 3, but not column 4, to be written to the new text file.

I have written the following script in MATLAB that works very well for a small dataset (a few thousand rows). It creates two text files: one with a header (x, y, z) and another without a header.
clear all

% CHANGE THIS TO YOUR FILE'S NAME
xyz_data = load('serrano_all_data.txt');

x = xyz_data(:,1);
y = xyz_data(:,2);
z = xyz_data(:,3);

B = [x y z];  % new 3-column matrix
 
% no header
dlmwrite('xyz_no_header.txt',B);  % comma delimited

% with headers
fid = fopen('xyz_header.txt','w+t');
fprintf(fid,'x,y,z\n');  % writes headers to text file
fclose(fid);
dlmwrite('xyz_header.txt',B,'-append');  % comma delimited, appends new xyz matrix below the header
However, when I try running this script on a much larger dataset (16 million rows), I get the following error in MATLAB:
??? Error using ==> horzcat
Out of memory. Type HELP MEMORY for your options.

Error in ==> three_column_utility at 22
B = [x y z];  % new 3-column matrix
I suspect I'm getting this error because the dataset is very large. A friend who is more familiar with MATLAB than I am suggested I process the data in batches rather than all at once. For example, the script would read the first thousand lines and write them to the new text file in the new format (3 columns instead of 4), then read the next thousand lines and append them below the first thousand, and so on until the end of the file is reached (using the feof command, I think).

My problem is that I'm not quite sure how to do this (if this is the right approach, that is). Any help/suggestions/tips would be greatly appreciated!

Comments

  • shwaip bluffin' with my muffin Icrontian
    edited July 2008
    are you doing this in windows or linux?
  • geodave Arizona
    edited July 2008
    I'm doing this in Windows.
  • shwaip bluffin' with my muffin Icrontian
    edited July 2008
    if you had been using linux, it could have been a 1-line bash command :P

    rather than saying:
    x = xyz(:,1);
    y = xyz(:,2);
    z  = xyz(:,3);
    B = [ x y z ];
    

    you can just address the columns of xyz:
    xyz(:,1:3);
    
    So, your problem is that you're holding several copies of the data in memory at once (xyz_data, then x, y and z, then B). The version below should only keep 1 full copy.

    i think this should work.
    clear all
    
    % CHANGE THIS TO YOUR FILE'S NAME
    xyz_data = load('serrano_all_data.txt');
    
    
    % no header
    dlmwrite('xyz_no_header.txt',xyz_data(:,1:3));  % comma delimited
    
    % with headers
    fid = fopen('xyz_header.txt','w+t');
    fprintf(fid,'x,y,z\n');  % writes headers to text file
    fclose(fid);
    dlmwrite('xyz_header.txt',xyz_data(:,1:3),'-append');  % comma delimited, appends new xyz matrix below the header
    
  • geodave Arizona
    edited July 2008
    Thanks for looking at this. I copied and pasted your modified version of my script, and ran it using the small sample of the dataset. It worked nicely. But when I ran it using the large dataset (~0.6 GB text file, four columns by ~16 million rows), I received the following error:
    ??? Error using ==> load
    Out of memory. Type HELP MEMORY for your options.
    
    Error in ==> new_utility at 4
    xyz_data = load('serrano_all_data.txt');
    

    It's different from the error I got with my old script: back then MATLAB failed in "horzcat", whereas now it fails in "load".

    What are your thoughts on this? Again, thanks for spending the time in helping me with this.
  • shwaip bluffin' with my muffin Icrontian
    edited July 2008
    Hi.

    Basically, you're running out of memory. You're right that the file is way too big to load all at once.

    try this (it'll probably be slow):
    clear all
    
    % open files
    fp = fopen('serrano_all_data.txt');
    fp_head = fopen('xyz_header.txt','w+t');
    fp_nohead = fopen('xyz_noheader.txt','w+t');
    
    %write headers
    fprintf(fp_head,'x,y,z\n');
    
    while 1
          tline = fgetl(fp); %read line (tline avoids shadowing the built-in line function)
          if ~ischar(tline),break,end; %fgetl returns -1 at end of file
          spl = regexp(tline,' ','split'); %split on space; fields stay as text
          fprintf(fp_head,'%s,%s,%s\n',spl{1},spl{2},spl{3}); %write data to file with head
          fprintf(fp_nohead,'%s,%s,%s\n',spl{1},spl{2},spl{3}); %write data to file with no head
    end
    
    fclose all; %close pointers
    
  • geodave Arizona
    edited July 2008
    I tried your latest version of the script, and it was taking quite a long time to process the text file (I had to force MATLAB to quit after about 1 hour from the start of the run). However, I tried running your older modified version of the script from the post where you suggested I address the columns of x, y, and z as follows:
    xyz(:,1:3);
    
    using a computer with more RAM... and it worked! It took ~39 minutes to process the ~0.6 GB text file, but it did it flawlessly!

    I appreciate your help with this. Thank you!
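
For anyone who lands here without access to a machine with more RAM: the batch idea from the original question can be sketched with textscan, which reads a limited number of rows per call. As a rough guide, 16 million rows x 4 columns x 8 bytes per double is about 512 MB for the loaded matrix alone, which is why load runs out of memory. The sketch below is untested against the original data and makes some assumptions: every row has exactly four numeric fields separated by whitespace, and chunk_rows is a made-up value you would tune yourself. The file names are the ones used earlier in the thread.

```matlab
% Sketch: copy the first three columns of a large text file in chunks,
% so only one chunk is held in memory at a time.
% Assumes every row has four whitespace-separated numeric fields.
in_name    = 'serrano_all_data.txt';   % input file name from the thread
out_name   = 'xyz_header.txt';         % output file with header
chunk_rows = 100000;                   % rows per chunk (hypothetical value)

fin  = fopen(in_name, 'rt');
fout = fopen(out_name, 'wt');
if fin == -1 || fout == -1
    error('Could not open the input or output file.');
end

fprintf(fout, 'x,y,z\n');              % header line

while ~feof(fin)
    % Read up to chunk_rows rows; %*f parses column 4 and discards it.
    C = textscan(fin, '%f %f %f %*f', chunk_rows);
    if isempty(C{1})
        break;                         % nothing more could be read
    end
    xyz = [C{1} C{2} C{3}];            % one chunk as an N-by-3 matrix
    % fprintf walks the data column by column, so transpose to emit rows.
    fprintf(fout, '%.6f,%.6f,%.6f\n', xyz.');
end

fclose(fin);
fclose(fout);
```

To produce the file without a header, drop the fprintf(fout, 'x,y,z\n') line and change out_name. The '%.6f' format matches the six decimal places in the sample data shown in the question. Writing each chunk with a single fprintf call should be much faster than one fgetl call per line, though the actual timing depends on the machine.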