C++ program to search a html file for words

leishi85leishi85 Grand Rapids, MI Icrontian
edited December 2004 in Internet & Media
I'm starting to work on a program that searches an HTML file for words.

The user inputs the HTML file to search, then how many words to look for, and then types in the words.

When searching through the HTML file, the program only looks through the body part of the file for the search words, and then makes the matching words bold and colors their background.

The search is going to be case insensitive.

The problem is that I'm not quite sure what approach to take. I know I have to use strings and vectors, and manipulate the input data so that case won't be an issue, but I'm not too sure how to do this.

Comments

  • rykoryko new york
    edited October 2004
    Well, I can't help you out with creating a program to do what you want, but I can tell you that I have used a product called dtSearch, which will do exactly what you want. I used it to search through millions of court depositions for key words. It listed results by percentage of matches in a small left-hand column, and it highlighted the key word(s) in the body of the original text. It is very cool and customizable.

    Not sure how much the app costs now. I used it over a year ago at my old job, so I didn't have to worry about price, and I am too lazy to look right now.

    Might get you going in the right direction.... :thumbsup:
  • leishi85leishi85 Grand Rapids, MI Icrontian
    edited October 2004
    Thanks, but I'm just looking for some help writing the program, just something to search an HTML file.
  • mmonninmmonnin Centreville, VA
    edited October 2004
    That's one thing we never went over that I wish we had: writing output files and reading from them.
  • JBJB Carlsbad, CA
    edited October 2004
    I'd say you just want to read the file in line by line, looking for the <body> tag. Once you find that, write a function that uses the string's .find function to locate the word you are looking for. If you find that word, edit the string (by inserting the appropriate HTML) and then save your new string to the output in place of what you just read in. Nice and easy :)

    //edit: for case insensitivity, just read in the string and make a temporary copy, running toupper or tolower over each character, so you have a single-case string to search while still keeping the original for when you are finally ready to write out to your file.
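A minimal sketch of the approach JB describes, assuming plain ASCII text. The helper names `toLowerCopy` and `highlightWord` and the yellow background markup are made up for illustration; they are not from the thread. The search runs on lowercase copies, while the output is built from the original line so its capitalization is preserved.

```cpp
#include <cctype>
#include <string>

// Return a lowercase copy; the original string is left untouched.
std::string toLowerCopy(const std::string& text) {
    std::string lowered = text;
    for (std::string::size_type i = 0; i < lowered.size(); ++i) {
        lowered[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(lowered[i])));
    }
    return lowered;
}

// Wrap every case-insensitive match of `word` in `line` with highlight markup.
// Matching is done on lowercase copies; the result uses the original text.
std::string highlightWord(const std::string& line, const std::string& word) {
    if (word.empty()) return line;
    const std::string open = "<b style=\"background-color: yellow\">";  // assumed markup
    const std::string close = "</b>";
    const std::string lowerLine = toLowerCopy(line);
    const std::string lowerWord = toLowerCopy(word);

    std::string result;
    std::string::size_type pos = 0;
    std::string::size_type hit;
    while ((hit = lowerLine.find(lowerWord, pos)) != std::string::npos) {
        result += line.substr(pos, hit - pos);                   // text before the match
        result += open + line.substr(hit, word.size()) + close;  // the match, original case
        pos = hit + word.size();
    }
    result += line.substr(pos);                                  // rest of the line
    return result;
}
```

Note that this matches substrings, so "cat" would also be highlighted inside "category", and it would match text inside HTML tags too. A real version would probably want to check word boundaries and skip tags.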
  • ThraxThrax 🐌 Austin, TX Icrontian
    edited October 2004
    Or you could just open it in a browser and hit ctrl + f.

    Sorry. :p;D
  • BLuKnightBLuKnight Lehi, UT Icrontian
    edited October 2004
    I did this project last year. It was more of a web crawler actually. It indexed the words as well as links in other web pages. Then it made a list by first letter.

    My recommendation is to break up the problem. If you've already learned the basics of object-oriented design and classes, you're ahead of the game.

    Break up your code into modules. In this case you'll probably want to create a parser. This module will be responsible for bringing in the input from the web page. Next you'll need a module that determines whether the incoming word is to be acted upon. For example, if you haven't hit the <body> tag yet, it shouldn't be looking at the word. When this module finds a search word, it should do something special with it. (I'm not giving away all of the answers.)

    Then you should have a module to handle the output. Depending how this module is used, it will be told to output normally or by highlighting the word.

    Hope that helps.
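One possible way to sketch the "has the body started yet" part of that module split, assuming the whole file has already been read into one string. The names `HtmlParts`, `lowerCopy`, and `splitAtBody` are hypothetical. The idea is that only the `body` field is handed to the search step, and the output step glues the three pieces back together.

```cpp
#include <cctype>
#include <string>

// The three regions of a page; only `body` should be searched.
struct HtmlParts {
    std::string before;  // everything up to and including the <body ...> tag
    std::string body;    // text between <body ...> and </body>
    std::string after;   // </body> and everything following it
};

// Lowercase copy, used only for locating tags regardless of case.
std::string lowerCopy(const std::string& text) {
    std::string lowered = text;
    for (std::string::size_type i = 0; i < lowered.size(); ++i) {
        lowered[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(lowered[i])));
    }
    return lowered;
}

// Split a page at its body tags. If no <body> tag is found, everything
// lands in `before` and `body` stays empty.
HtmlParts splitAtBody(const std::string& html) {
    HtmlParts parts;
    const std::string lowered = lowerCopy(html);

    const std::string::size_type open = lowered.find("<body");
    std::string::size_type bodyStart = std::string::npos;
    if (open != std::string::npos) {
        bodyStart = lowered.find('>', open);   // end of the opening tag
    }
    if (bodyStart == std::string::npos) {
        parts.before = html;
        return parts;
    }
    ++bodyStart;                               // first character after '>'

    std::string::size_type bodyEnd = lowered.find("</body", bodyStart);
    if (bodyEnd == std::string::npos) {
        bodyEnd = html.size();                 // no closing tag: body runs to the end
    }

    parts.before = html.substr(0, bodyStart);
    parts.body = html.substr(bodyStart, bodyEnd - bodyStart);
    parts.after = html.substr(bodyEnd);
    return parts;
}
```

Keeping the untouched `before` and `after` pieces around means the output module can write `before + processedBody + after` and get a complete page back.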
  • leishi85leishi85 Grand Rapids, MI Icrontian
    edited October 2004
    Yeah, all the help from you guys is useful. If I have more problems when writing the code, I will post in here and see if you guys can help me with the code-writing part.
  • leishi85leishi85 Grand Rapids, MI Icrontian
    edited November 2004
    I'm having some problems with the code I'm writing.

    It's not compiling, and I'm not too sure if it's going to work as it should.

    Any ideas, guys?
    [php]
    #include <iostream>
    #include <fstream>
    #include <cctype>
    #include <string>
    #include <vector>
    using namespace std;

    int main()
    {
    cout << "File name: ";
    string name;
    cin >> name;
    ifstream in(name.c_str());
    string fileData;
    in >> fileData;
    while (in.fail())
    {
    cout << "Please enter a new file name: " << endl;
    cin >> name;
    ifstream in(name.c_str());
    in >> fileData;
    }

    //=============================================================================
    //brings all string into CAPITALS (only function used in program)


    for (int i=0; i < name.size(); i= i + 1)
    {
    name=toupper(name);
    }

    //==============================================================================
    //pick i number of words to search for ... prompts for each

    vector<string>keywords;
    int keynum;
    string keyword;
    cout << "Type in the number of key words you want to look for: " << endl;
    cin >> keynum;
    for (int count = 0; count <= keynum; count++)
    {
    cout << "Please enter the keyword to search for: " << endl;
    cin >> keynum;
    for (int count = 0; i < keyword.size(); i = i+1)
    {
    keyword = toupper(keyword);
    }
    keywords.push_back(keyword);
    }

    //==============================================================================
    //bring the html file in line by line and vreak apart by whitespace

    int count1=0 ;
    int c=0;
    string line;
    getline(name, line);
    bool condition=false;
    vector<string> words;
    while (condition==false)
    }

    while (count1 < line.size())
    {
    char spacesearch = line[count1];
    if (isspace(spacesearch))
    {
    string word = line.substring(count1-c, count 1-1);
    c = 1;
    // write word into a vector
    words.push_back(word);
    }
    //identifies the point at which </body exists and ends the entire loop
    if (spacesearch=='<' and line[count1+1]=='/' and line[count1=2]=='B')
    {
    condition=true;
    }
    count1++;
    c++;
    }
    getline(name,line);
    }

    //==============================================================================
    //bring in one key word from the vector keyword(s) and compare it to each word
    //found in the word(s) vector
    //then bring in the next key word and repeat
    int wordsize=words.size();
    int h=0;
    int j=0;
    while (h<=keynum)
    {
    string searchword=keywords[h];
    while (j<=wordsize)
    {
    string comparison=words[j];
    if (searchword==comparison)
    {
    //colorizination
    }
    j++;
    }
    h++;
    }
    return(0);
    }
    //==============================================================================
    //==============================================================================
    [/php]
  • BLuKnightBLuKnight Lehi, UT Icrontian
    edited November 2004
    Well, I managed to solve all the errors except for two. I'll see if I can work on these when I get home. Also, I only checked for compile errors; I didn't run the code to see if it works right.

    I also put in comments so you can see where I modified or corrected code. This should make it easier to understand where you made a few mistakes in syntax.

    I hope this helps. C++ is a lot of fun and after you get C down, PHP is a snap. What compiler are you using for your C++ programming?

    [PHP]

    #include <iostream>
    #include <fstream>
    #include <cctype>
    #include <string>
    #include <vector>
    using namespace std;

    int main()
    {
    cout << "File name: ";
    string name;
    cin >> name;
    ifstream in(name.c_str());
    string fileData;
    in >> fileData;
    while (in.fail())
    {
    cout << "Please enter a new file name: " << endl;
    cin >> name;
    ifstream in(name.c_str());
    in >> fileData;
    }

    //=============================================================================
    //brings all string into CAPITALS (only function used in program)

    // Changed i=i+1 to i++
    for (int i=0; i < name.size(); i++)
    {
    name=toupper(name);
    }

    //==============================================================================
    //pick i number of words to search for ... prompts for each

    vector<string>keywords;
    int keynum;
    string keyword;
    cout << "Type in the number of key words you want to look for: " << endl;
    cin >> keynum;
    for (int count = 0; count <= keynum; count++)
    {
    cout << "Please enter the keyword to search for: " << endl;
    cin >> keynum;
    // ERROR HERE
    // replaced int count = 0 with int i = 0
    for (int i = 0; i < keyword.size(); i = i+1)
    {
    keyword = toupper(keyword);
    }
    keywords.push_back(keyword);
    }

    //==============================================================================
    //bring the html file in line by line and break apart by whitespace

    int count1=0;
    int c=0;
    string line;
    getline(name, line);
    bool condition=false;
    vector<string> words;

    /* Old code
    while (condition==false)
    }
    */
    // New Code
    while (condition==false) {
    while (count1 < line.size()) {
    char spacesearch = line[count1];
    if (isspace(spacesearch)) {
    // ERROR
    //string word = line.substring(count1-c, count 1-1);
    // FIX
    string word = line.substr(count1-c, count1-1);

    c = 1;
    // write word into a vector
    words.push_back(word);
    }
    //identifies the point at which </body exists and ends the entire loop

    // ERROR if (spacesearch=='<' and line[count1+1]=='/' and line[count1=2]=='B') {
    if (spacesearch=='<' && line[count1+1]=='/' && line[count1=2]=='B') {
    condition=true;
    }
    count1++;
    c++;
    }
    getline(name,line);
    }

    //==============================================================================
    //bring in one key word from the vector keyword(s) and compare it to each word
    //found in the word(s) vector
    //then bring in the next key word and repeat
    int wordsize=words.size();
    int h=0;
    int j=0;
    while (h<=keynum)
    {
    string searchword=keywords[h];
    while (j<=wordsize)
    {
    string comparison=words[j];
    if (searchword==comparison)
    {
    //colorizination
    }
    j++;
    }
    h++;
    }
    return(0);
    }
    //==============================================================================
    //==============================================================================

    [/PHP]
  • BLuKnightBLuKnight Lehi, UT Icrontian
    edited November 2004
    Okay, I've managed to find out what was causing the istream errors: getline was being handed the file name string instead of the file stream. I've posted comments here and there. I think you'll need to look into the area of your code where it breaks the file into words; I've left a note there in the code. Let me know what else I can do.

    [PHP]
    #include <iostream>
    #include <fstream>
    #include <cctype>
    #include <string>
    #include <vector>
    using namespace std;

    int main() {
    cout << "File name: ";
    string name;
    cin >> name;
    /*
    ifstream in(name.c_str());
    string fileData;
    in >> fileData;
    while ( in.fail() ) {
    cout << "Please enter a new file name: " << endl;
    cin >> name;
    ifstream in(name.c_str());
    in >> fileData;
    }
    */

    fstream fin;
    fin.open(name.c_str(), fstream::in);
    while ( !fin.is_open() ) {
        cout << "Please enter a new file name: " << endl;
        cin >> name;
        // reset the fail flag left by the failed open before trying again
        fin.clear();
        fin.open(name.c_str(), fstream::in);
    }

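    //=============================================================================
    //brings all string into CAPITALS (only function used in program)

    // Changed i=i+1 to i++
    // FIX: toupper converts one character at a time, so index into the string
    // NOTE: this only uppercases the file name, not the contents of the file
    for (string::size_type i=0; i < name.size(); i++) {
        name[i]=toupper(static_cast<unsigned char>(name[i]));
    }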

    //==============================================================================
    //pick i number of words to search for ... prompts for each

    vector<string> keywords;
    int keynum;
    string keyword;
    cout << "Type in the number of key words you want to look for: " << endl;
    cin >> keynum;
    // FIX: count < keynum, since count <= keynum asks for one keyword too many
    for (int count = 0; count < keynum; count++)
    {
        cout << "Please enter the keyword to search for: " << endl;
        // ERROR
        // OLD cin >> keynum;
        cin >> keyword;
        // ERROR HERE
        // replaced int count = 0 with int i = 0
        // FIX: toupper converts one character at a time, so index into the string
        for (string::size_type i = 0; i < keyword.size(); i = i+1) {
            keyword[i] = toupper(static_cast<unsigned char>(keyword[i]));
        }
        keywords.push_back(keyword);
    }

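    //==============================================================================
    //bring the html file in line by line and break apart by whitespace

    /*
    * NOTE:
    * This part of your code needs work. It now waits for the <BODY> tag and stops
    * at the </BODY> tag, but I think you should also be ignoring HTML tags: a word
    * that touches a tag (like <B>WORD</B>) ends up glued to it and will not match.
    */

    string line;
    bool inBody=false;
    bool condition=false;
    vector<string> words;

    /* Old code
    while (condition==false)
    }
    */
    // New Code
    // ERROR: Fixed getline (was getline(name, line)). Reading inside the loop
    // condition also ends the loop at end of file if </BODY never shows up.
    while (condition==false && getline(fin, line)) {
        // uppercase the line so it compares equal to the uppercased keywords
        for (string::size_type i = 0; i < line.size(); i++) {
            line[i] = toupper(static_cast<unsigned char>(line[i]));
        }
        // skip everything before the <BODY ...> tag
        string::size_type start = 0;
        if (!inBody) {
            string::size_type open = line.find("<BODY");
            if (open == string::npos) continue;
            inBody = true;
            string::size_type close = line.find('>', open);
            start = (close == string::npos) ? line.size() : close + 1;
        }
        //identifies the point at which </body exists and ends the entire loop
        // ERROR: line[count1=2] assigned 2 to count1 instead of looking at count1+2
        string::size_type end = line.find("</BODY", start);
        if (end != string::npos) {
            line = line.substr(0, end);
            condition=true;
        }
        // ERROR
        //string word = line.substring(count1-c, count 1-1);
        // FIX: substr takes (start position, length); c is the length of the
        // current word, and both counters start fresh on every line
        int c = 0;
        for (string::size_type count1 = start; count1 <= line.size(); count1++) {
            if (count1 == line.size() || isspace(static_cast<unsigned char>(line[count1]))) {
                if (c > 0) {
                    // write word into a vector
                    words.push_back(line.substr(count1-c, c));
                }
                c = 0;
            }
            else {
                c++;
            }
        }
    }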

    //==============================================================================
    //bring in one key word from the vector keyword(s) and compare it to each word
    //found in the word(s) vector
    //then bring in the next key word and repeat
    int wordsize=words.size();
    int keycount=keywords.size();
    int h=0;
    // FIX: < instead of <=, otherwise the last pass reads past the end of the vector
    while (h<keycount) {
        string searchword=keywords[h];
        // FIX: j has to start over at the first word for every keyword
        int j=0;
        while (j<wordsize) {
            string comparison=words[j];
            if (searchword==comparison) {
                //colorizination
            }
            j++;
        }
        h++;
    }
    return(0);
    }
    //==============================================================================
    //==============================================================================

    [/PHP]
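The note in the code above about ignoring HTML tags could be handled along these lines. This is only a sketch under the assumption that tags are simply everything between '<' and '>' and that words are runs of ASCII letters and digits. The function name `extractWords` is made up for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Collect the words of an HTML fragment in uppercase, skipping anything
// between '<' and '>' so that tag names never end up in the word list.
std::vector<std::string> extractWords(const std::string& html) {
    std::vector<std::string> words;
    std::string current;
    bool inTag = false;
    for (std::string::size_type i = 0; i < html.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(html[i]);
        if (ch == '<') inTag = true;
        const bool isWordChar = !inTag && std::isalnum(ch);
        if (isWordChar) {
            current += static_cast<char>(std::toupper(ch));
        } else if (!current.empty()) {
            words.push_back(current);   // a tag, space, or punctuation ends the word
            current.clear();
        }
        if (ch == '>') inTag = false;
    }
    if (!current.empty()) words.push_back(current);  // word at the very end
    return words;
}
```

With this kind of tokenizer, text like `<b>big</b>` yields the single word BIG instead of a token with the tags glued on, so the later comparison against the uppercased keywords has a chance to match.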
  • leishi85leishi85 Grand Rapids, MI Icrontian
    edited November 2004
    thanks a lot. blueknight
  • edited December 2004
    hi,
    Using the simple code where a=1, b=2, c=3, etc., a word can be assigned a (non-unique) score. For example, computer=111. I want to write a program that can count the number of words in a text file that have a score specified by the user of the program. I want to use the program to find the number of words in a file that have a score of exactly 100.

    Any help would be much appreciated
    Mike