String Parsing in C++

Pleo
Guru
Posts: 587
Joined: Thu Aug 28, 2003 5:24 am
Location: eax

Post by Pleo » Tue Sep 12, 2006 12:44 am

Tutorial/Explanation

Introductory Note
Just as a note, the language used is C++, and the code was written and tested
on Slackware Linux 10.2. If compiled with either -pedantic or -Wall it will
come up with some warnings, but they are easy enough to get rid of.

The Source Code

If you download http://www-personal.une.edu.au/ht/URLTrans.cpp you
will get the unabridged source code. This means that you get a hell of a lot of
commenting, most of which relates directly to the code, and an example main
function showing usage of the function. It would be very easy to modify the
function to obtain other information from the URLs as well. I am only getting
usernames and passwords here because I wanted a simple function.

Introduction

This was written in response to a question on the hackerthreads.org forum
asking whether there is a program that can parse a file of URLs of
the format <protocol>://<username>:<password>@<site> to obtain an output
file with just the usernames and passwords. It seems to me that a topic like
this, which needs to be dealt with fairly regularly, deserved to be spoken
about. So I have written this guide to string parsing using C++.

Run Down
Before you can parse strings to obtain information you first need to examine
the input to determine what rules control its formatting. It can be seen with
this example, <protocol>://<username>:<password>@<site>, that the protocol is
separated from the rest of the URL by the three-character sequence "://".
Consequently, to obtain the protocol we can ask the C++ string class to "find"
the first instance of these characters. This is done by the call
somestring.find( "://", 0 ), which searches for those characters starting at
the beginning (character 0) of the string. The username then runs from the end
of the "://" up to the character ':', the password from there up to the
character '@', and the rest is the site for which the username and password are
used.
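
As a rough illustration of those rules, here is a minimal sketch that cuts one URL into all four pieces. The names UrlParts and splitUrl are made up for this example and are not part of the downloadable source; the sketch also assumes the username contains no ':' and the password contains no '@'.

```cpp
#include <string>

// Hypothetical helper type for this example only.
struct UrlParts {
    std::string protocol, username, password, site;
    bool ok = false;   // stays false when the line does not fit the format
};

// Applies the rules described above: "://" ends the protocol, ':' ends the
// username, '@' ends the password, and the rest is the site.
UrlParts splitUrl(const std::string& url) {
    UrlParts parts;

    const std::string::size_type scheme = url.find("://", 0);
    if (scheme == std::string::npos) return parts;

    const std::string::size_type userStart = scheme + 3;   // skip "://"
    const std::string::size_type colon = url.find(':', userStart);
    if (colon == std::string::npos) return parts;

    const std::string::size_type at = url.find('@', colon + 1);
    if (at == std::string::npos) return parts;

    parts.protocol = url.substr(0, scheme);
    parts.username = url.substr(userStart, colon - userStart);
    parts.password = url.substr(colon + 1, at - colon - 1);
    parts.site     = url.substr(at + 1);
    parts.ok       = true;
    return parts;
}
```

Calling splitUrl( "ftp://alice:secret@example.com" ) fills in "ftp", "alice", "secret" and "example.com"; a line without "://" comes back with ok left as false.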

My Code

Code: Select all

#include <fstream>   // ifstream, used by the function further down
#include <string>
#include <vector>
using namespace std;

// A typedef for each row of the table (user, pass)
typedef vector<string> UserPassRow;
// A typedef for the table, making use of the previously defined row
typedef vector<UserPassRow> UserPassTbl;
This is required by the function as it defines the type of variable the
function uses to store the table of usernames and passwords.

Code: Select all

UserPassTbl URLTranslate( ifstream& fin )
{
    string URLLine;               // The line from input
    string::size_type protocol,   // The position of the characters "://"
                      EndPass,    // The position of the character '@'
                      divideChar; // The position of the character ':'
    UserPassTbl UP;               // The table to store usernames and passwords
    UserPassRow UPRow(2);         // A row to use as a template for adding to
                                  // the table

    // Fetch lines from the input file into the string URLLine. The loop stops
    // as soon as a read fails, which happens at the end of the input file
    while( getline( fin, URLLine ) )
    {
        // Search the line for the end of the protocol section of the URL
        protocol = URLLine.find( "://", 0 );
        // If there isn't one, ignore the line as a malformed URL
        if( protocol == string::npos )
            continue;
        // Erase the protocol and the "://" (they're in the way ;)
        URLLine.erase( 0, protocol + 3 );

        // Search for the beginning of the address section of the URL
        EndPass = URLLine.find( '@', 0 );
        // If there isn't one, ignore the line as a malformed URL
        if( EndPass == string::npos )
            continue;
        // Erase the site section (it's in the way as well); with no count
        // given, erase removes everything from EndPass to the end
        URLLine.erase( EndPass );

        // Find the character that divides the username and the password
        divideChar = URLLine.find( ':', 0 );
        // If there isn't one, ignore the line as a malformed URL
        if( divideChar == string::npos )
            continue;

        // Put the username and password in the row
        UPRow[0] = URLLine.substr( 0, divideChar );
        UPRow[1] = URLLine.substr( divideChar + 1 );

        // Add the row to the table
        UP.push_back( UPRow );
    }

    return UP;
}

An Explanation of the Code

Code: Select all

UserPassTbl URLTranslate( ifstream& fin )
I'm going to assume the majority of readers know how to declare a function, but
I'll explain the use of the '&' character. This character tells the compiler to
pass the variable by reference, that is, to tell the function where the
caller's variable is stored in memory. Normally functions receive a copy of any
variables they are passed. Streams have to be passed by reference (with the &):
the standard stream classes cannot be copied, and the function needs to advance
the caller's own stream rather than a private copy of it.

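The variables are fairly self-explanatory. The condition in the while loop is
worth a second look, though: a loop on !fin.eof() only notices the end of the
file after a read has already failed, so it can run the body one extra time on
an empty string. Using the getline call itself as the condition avoids that.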

Code: Select all

// If there isn't one, ignore the line as a malformed URL
if( protocol == string::npos )
    continue;
The reason we compare against string::npos is that find returns the position at
which that sequence of characters starts, and string::npos if the characters
don't exist within the string. (npos is the largest value of the unsigned
position type; it only looks like -1 once it has been stored in an int, so the
test has to be made before anything is added to the result.) As a result,
seeing as we know all properly formed URLs contain these characters, we can
ignore that input and not waste more time with a useless line, and instead move
on to lines which are valid URLs.
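
A minimal sketch of that check in isolation; afterScheme is a hypothetical helper, not something from the original source:

```cpp
#include <string>

// Returns the position just past "://", or std::string::npos when the
// marker does not occur in the line.
std::string::size_type afterScheme(const std::string& line) {
    const std::string::size_type pos = line.find("://", 0);
    if (pos == std::string::npos) {
        return std::string::npos;   // test first ...
    }
    return pos + 3;                 // ... and only then do arithmetic
}
```

The order matters because the position type is unsigned: adding 3 to npos wraps around to 2, which looks like a perfectly valid position.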

Code: Select all

URLLine.erase( 0, protocol + 3 );
We add 3 to protocol so that the "://" is erased as well: find returns the
position of the first character of the sequence we were looking for, and that
sequence is three characters long.

To use erase you call somestring.erase( position of the first character to
erase, how many characters to erase );
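
Both forms of erase used in this tutorial can be sketched with two small helpers. stripProtocol and stripSite are names made up for this example.

```cpp
#include <string>

// Hypothetical helper: remove the protocol and "://" from the front.
std::string stripProtocol(std::string line) {
    const std::string::size_type pos = line.find("://");
    if (pos != std::string::npos) {
        line.erase(0, pos + 3);   // erase( start, count )
    }
    return line;
}

// Hypothetical helper: remove '@' and everything after it.
std::string stripSite(std::string line) {
    const std::string::size_type at = line.find('@');
    if (at != std::string::npos) {
        line.erase(at);           // count left out: erase to the end
    }
    return line;
}
```

Both helpers take the string by value, so they work on their own copy and the caller's string is not modified.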

Code: Select all

// Assign the first spot in the row for the table as the username
UPRow[0] = URLLine.substr( 0, divideChar );
// Assign the second spot in the row for the table as the password
UPRow[1] = URLLine.substr( divideChar + 1,
URLLine.size() - divideChar );
This will put the username (from 0 - the first character - to the divide
character ':') into the first spot of UPRow and the password (the
divide character to the end of the string) into the second part. substr is
called in an indentical fashion to erase.
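
The same two substr calls, pulled out into a stand-alone sketch. splitAtColon is a hypothetical helper, and returning the whole input as the first element when there is no ':' is a choice made for this example only.

```cpp
#include <string>
#include <utility>

// Splits "username:password" at the first ':'.
std::pair<std::string, std::string> splitAtColon(const std::string& userPass) {
    const std::string::size_type colon = userPass.find(':');
    if (colon == std::string::npos) {
        // No divider: hand back the input with an empty second half.
        return std::make_pair(userPass, std::string());
    }
    return std::make_pair(userPass.substr(0, colon),    // before the ':'
                          userPass.substr(colon + 1));  // after the ':'
}
```

Note that substr( colon + 1 ) is still valid when the ':' is the last character; it simply returns an empty string.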

Code: Select all

UP.push_back( UPRow );
This will add the new row to the table allowing the vector class to handle
memory allocation if it runs out of space.

Conclusion

As I stated earlier, the most important thing about parsing strings to obtain
the information within them is to understand the rules by which the string
separates its individual pieces of data. By finding these rules you can then
reproduce them and so gain access to that data.

My Thoughts
To be honest I think there is more to be learnt by examining my code than by
reading my ramblings. However, if either has helped you, I feel like I have done
something to contribute as opposed to just reading other people's contributions.
There once was a lawyer named Rex
Who was small in the organs of sex.
When charged with exposure
He replied with composure,
De minimus non curat lex.

gohcht
Corporal
Posts: 125
Joined: Tue Jul 18, 2006 9:50 pm
Location: right next to the pacific to be specific

Post by gohcht » Sat Sep 16, 2006 12:38 am

I'm going to assume the majority of readers know how to declare a function, but
I'll explain the use of the '&' character.
Better to assume we are all idiots, at least me. Thanks for a little more in-depth or dummy-fied explanation. I will study your code as suggested and see if it flicks on that lightswitch I have yet to find for programming. I can think of 20 other similar tasks, in different formats and source languages, that I currently take 20 steps to do, when it seems I could save much time later by spending some now.

Thanks Pleo....
I am Just like a SUPERHERO, just with no powers or motivation, and when I am not off saving the world, I like to get drunk and screw.


Post by Pleo » Sat Sep 16, 2006 6:56 pm

Haha, if you can wait a couple of days: as soon as I finish packing my college room I will be re-organising the tutorial, making it easier to understand and adding a deeper run-through of the string class functions, etc.
