Introductory Note
Just as a note, the code is written in C++ and was tested on Slackware
Linux 10.2. If compiled with either -pedantic or -Wall it produces some
warnings, but they are easy enough to get rid of.
The Source Code
If you download http://www-personal.une.edu.au/ht/URLTrans.cpp you
will get the unabridged source code. This means that you get a hell of a lot of
commenting. Most of it relates directly to the code, and there is an example
main function showing usage of the function. It would be very easy to modify the
function to obtain other information from the URLs as well. I am only getting
usernames and passwords here because I wanted a simple function.
Introduction
This was written in response to a question on the hackerthreads.org forum
asking whether there is a program that can parse a file of URLs of
the format <protocol>://<username>:<password>@<site> to produce an output
file with just the usernames and passwords. It seems to me that a task like
this, which needs to be dealt with fairly regularly, deserves to be
spoken about. So I have written this guide to string parsing using C++.
Run Down
Before you can parse strings to obtain information you first need to examine
the input to determine what rules control its formatting. In this example,
<protocol>://<username>:<password>@<site>, the protocol is separated from the
rest of the URL by the three characters "://". Consequently, to obtain the
protocol we can ask the C++ string class to "find" the first instance of these
characters. This is done by the call somestring.find( "://", 0 ), which searches
for those characters starting at the beginning (character 0) of the string. The
username then runs from the end of the "://" until the character ':', the
password from there until the character '@', and the rest is the site for which
the username and password are used.
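To make those rules concrete, here is a minimal sketch that applies them to a single URL. The struct UrlParts and the function splitUrl are names I have made up for illustration and are not part of URLTrans.cpp. It also searches for the ':' before the '@', which is a slightly different order to the function further down.

```cpp
#include <string>

// Hypothetical helper type for this sketch only.
struct UrlParts {
    std::string protocol, username, password, site;
    bool ok;  // false if any of the separators was missing
};

// Splits "<protocol>://<username>:<password>@<site>" using find and substr.
UrlParts splitUrl(const std::string& url) {
    UrlParts parts;
    parts.ok = false;

    // "://" marks the end of the protocol
    std::string::size_type proto = url.find("://", 0);
    if (proto == std::string::npos) return parts;
    std::string::size_type userStart = proto + 3;

    // The first ':' after the protocol marks the end of the username
    std::string::size_type colon = url.find(':', userStart);
    if (colon == std::string::npos) return parts;

    // The first '@' after that colon marks the end of the password
    std::string::size_type at = url.find('@', colon + 1);
    if (at == std::string::npos) return parts;

    parts.protocol = url.substr(0, proto);
    parts.username = url.substr(userStart, colon - userStart);
    parts.password = url.substr(colon + 1, at - colon - 1);
    parts.site     = url.substr(at + 1);  // everything after the '@'
    parts.ok = true;
    return parts;
}
```

Calling splitUrl( "ftp://alice:secret@example.com" ) fills in "ftp", "alice", "secret" and "example.com", while a line without the separators comes back with ok set to false.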
My Code
Code:
#include <fstream>
#include <string>
#include <vector>
using namespace std;

// A Typedefined variable to store each row of the table (user, pass)
typedef vector<string> UserPassRow;
// A Typedefined variable to store the table, making use of the previously
// defined row
typedef vector<UserPassRow> UserPassTbl;
These are the types the function uses to store the table of usernames and
passwords.
Code:
UserPassTbl URLTranslate( ifstream& fin )
{
    string URLLine;       // The line from input
    int protocol,         // The reference to the characters "://"
        EndPass,          // The reference to the character '@'
        divideChar;       // The reference to the character ':'
    UserPassTbl UP;       // The table to store usernames and passwords
    UserPassRow UPRow(2); // A row to use as a template for adding to the
                          // table

    // Fetch lines from the input file into the string URLLine. The loop
    // stops when a read fails, normally at the end of the input file
    while( getline( fin, URLLine ) )
    {
        // Search the line for the end of the protocol section of the URL
        protocol = URLLine.find( "://", 0 );
        // If there isn't one, ignore the line as a malformed URL. The test
        // has to happen before 3 is added, otherwise it can never succeed
        if( protocol == -1 )
            continue;
        // Erase the protocol and the "://" (they're in the way ;)
        URLLine.erase( 0, protocol + 3 );
        // Search for the beginning of the address section of the URL
        EndPass = URLLine.find( '@', 0 );
        // If there isn't one ignore the line as a malformed URL
        if( EndPass == -1 )
            continue;
        // Erase the '@' and the site section (they're in the way as well)
        URLLine.erase( EndPass );
        // Find the character that divides the username and the password
        divideChar = URLLine.find( ':', 0 );
        // If there isn't one ignore the line as a malformed URL
        if( divideChar == -1 )
            continue;
        // Put the username and password in the row
        UPRow[0] = URLLine.substr( 0, divideChar );
        UPRow[1] = URLLine.substr( divideChar + 1,
                                   URLLine.size() - divideChar );
        // Add the row to the table
        UP.push_back( UPRow );
    }
    return UP;
}
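The original question asked for an output file with just the usernames and passwords, so here is a rough sketch of how the returned table could be written out. The names writeTable and formatTable are my own inventions for this example, and the "username password" layout with one pair per line is just an assumption about what the output should look like.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Same shapes as the typedefs used by URLTranslate
typedef std::vector<std::string> UserPassRow;
typedef std::vector<UserPassRow> UserPassTbl;

// Hypothetical helper: writes one "username password" pair per line.
// The single space separator is an assumption, change it to suit.
void writeTable(std::ostream& out, const UserPassTbl& table) {
    for (UserPassTbl::size_type i = 0; i < table.size(); ++i)
        out << table[i][0] << ' ' << table[i][1] << '\n';
}

// Hypothetical helper: the same output collected into a string,
// handy for checking the format without touching a file.
std::string formatTable(const UserPassTbl& table) {
    std::ostringstream out;
    writeTable(out, table);
    return out.str();
}
```

Because writeTable takes a std::ostream&, it should work equally well with an ofstream opened on the output file or with cout.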
An Explanation of the Code
Code:
UserPassTbl URLTranslate( ifstream& fin )
I'll explain the use of the '&' character. This character tells the compiler to
pass this variable by reference, that is, by telling the function where the
variable is stored in memory. Normally functions receive a copy of any variables
they are passed. Streams need to be passed by reference (with the &) because
stream objects cannot be copied, and because the function has to read from the
caller's actual stream rather than from a duplicate of it.
The variables are fairly self-explanatory, as is the condition in the while
loop.
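A small sketch of passing a stream by reference is below. The functions countLines and countLinesInText are hypothetical names of mine, and I have used std::istream& rather than ifstream& so that a string stream can stand in for a file.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines left in a stream.
// The stream is taken by reference; taking it by value would not
// compile because std::istream cannot be copied.
int countLines(std::istream& in) {
    std::string line;
    int count = 0;
    while (std::getline(in, line))
        ++count;
    return count;
}

// Hypothetical helper: wraps some text in a string stream and hands
// that stream to countLines by reference.
int countLinesInText(const std::string& text) {
    std::istringstream in(text);
    return countLines(in);
}
```

Since the function works on the caller's own stream, calling countLines a second time on the same stream returns 0: the first call has already read everything.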
Code:
// If there isn't one, ignore the line as a malformed URL
if( protocol == -1 )
    continue;
The find function returns a reference (an index into the string) to where that
sequence of characters can be found. If they don't exist within the string it
returns string::npos, which becomes -1 when stored in an int as it is here. As
a result, seeing as we know all proper URLs contain these characters, we can
ignore that input and not waste more time with a useless line, and instead move
on to lines which are valid URLs.
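As a quick illustration of that check, the sketch below wraps find in a function that reports a missing "://" as -1. The name protocolIndex is made up for this example, and comparing against string::npos is the more portable way of writing the same test.

```cpp
#include <string>

// Hypothetical helper: index of the first "://" in the line,
// or -1 if the line does not contain it.
int protocolIndex(const std::string& line) {
    std::string::size_type pos = line.find("://", 0);
    if (pos == std::string::npos)
        return -1;  // not found, so treat the line as malformed
    return static_cast<int>(pos);
}
```

For "http://a:b@c" this gives 4, the index of the ':' that starts the "://", and for a line with no "://" it gives -1.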
Code:
URLLine.erase( 0, protocol + 3 );
We add 3 to protocol because find returns a reference to the first character of
the sequence of characters we were looking for, and the "://" itself has to be
erased as well.
To use erase you call somestring.erase( reference to character to start at, how
many characters to erase );
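Here is a minimal sketch of find and erase working together. The function stripProtocol is a hypothetical name for this example. It takes its argument by value, so the caller's string is left alone.

```cpp
#include <string>

// Hypothetical helper: returns a copy of url with the protocol and
// the "://" removed. The string comes back unchanged if there is no
// "://" in it.
std::string stripProtocol(std::string url) {
    std::string::size_type pos = url.find("://", 0);
    if (pos != std::string::npos)
        url.erase(0, pos + 3);  // start at 0, erase protocol plus "://"
    return url;
}
```

So stripProtocol( "http://bob:pw@host" ) gives back "bob:pw@host".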
Code:
// Assign the first spot in the row for the table as the username
UPRow[0] = URLLine.substr( 0, divideChar );
// Assign the second spot in the row for the table as the password
UPRow[1] = URLLine.substr( divideChar + 1,
                           URLLine.size() - divideChar );
This puts the username (the start of the string up to the divide
character ':') into the first spot of UPRow and the password (the
divide character to the end of the string) into the second spot. substr is
called in an identical fashion to erase. If the count given runs past the end
of the string, substr simply stops at the end.
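The same split can be sketched on its own like this. The function splitUserPass is a name I have invented for the example, and it leaves out the second argument of substr for the password, which means "through to the end of the string".

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "user:pass" at the first ':'.
// If there is no ':' the whole string is returned as the username
// and the password is left empty.
std::pair<std::string, std::string> splitUserPass(const std::string& s) {
    std::string::size_type colon = s.find(':', 0);
    if (colon == std::string::npos)
        return std::make_pair(s, std::string());
    // substr( start, count ) for the username, substr( start ) for
    // the password
    return std::make_pair(s.substr(0, colon), s.substr(colon + 1));
}
```

Note that only the first ':' is treated as the divider, so "a:b:c" splits into "a" and "b:c".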
Code:
UP.push_back( UPRow );
push_back adds a copy of the row to the end of the table. The vector takes care
of its own memory allocation if it runs out of space.
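A short sketch of building the table row by row is below. Row, Table and addRow are stand-in names for this example only. Row and Table have the same shape as UserPassRow and UserPassTbl.

```cpp
#include <string>
#include <vector>

// Stand-ins with the same shape as UserPassRow and UserPassTbl
typedef std::vector<std::string> Row;
typedef std::vector<Row> Table;

// Hypothetical helper: appends one (user, pass) row to the table.
void addRow(Table& table, const std::string& user, const std::string& pass) {
    Row row(2);            // a row with two empty strings
    row[0] = user;
    row[1] = pass;
    table.push_back(row);  // stores a copy, the table grows as needed
}
```

Because push_back stores a copy, the same row variable can be reused for the next line, which is what UPRow is used for in URLTranslate.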
Conclusion
As I stated earlier, the most important thing about parsing strings to obtain
the information within them is to understand the rules by which the string
separates its individual pieces of data. By finding these rules you can then
reproduce them and so gain access to that data.
My Thoughts
To be honest I think there is more to be learnt by examining my code than by
reading my ramblings. However, if either has helped you, I feel like I have done
something to contribute as opposed to just reading other people's contributions.