William Jiang

JavaScript,PHP,Node,Perl,LAMP Web Developer – http://williamjxj.com; https://github.com/williamjxj?tab=repositories


An Example of Regular Expressions

As far as I can tell, Perl stands on two legs: regular expressions and hashes. With them, Perl is a powerful tool for solving real problems; without them, Perl is just Shell++.

  • Regular Expressions (RE)
    All the major scripting languages support regular expressions: PHP, Perl, Python, Ruby.
    As far as I know, their syntax is very similar. Regular expressions predate Perl, but Perl's extended syntax became the model the others adopted (PHP's PCRE stands for Perl Compatible Regular Expressions).
    PHP adds one notable convenience: preg_replace accepts arrays of patterns and replacements as well as scalars, so a single call can apply several regular expressions in sequence, whereas Perl's s/// takes one pattern at a time. Here is an example of preg_replace with arrays:

    
    <?php
    $patterns = array ('/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/',
                       '/^\s*{(\w+)}\s*=/');
    $replace = array ('\3/\4/\1\2', '$\1 =');
    echo preg_replace($patterns, $replace, '{startDate} = 1999-5-27');
    ?>
    - output: $startDate = 5/27/1999
    

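    Both $patterns and $replace can be arrays, which Perl's s/// operator does not support directly.
    However, that doesn't mean PHP's regular expressions are superior to Perl's.
    Perl's m//, s/// and tr/// operators (tr/// is transliteration rather than a true regex operator), together with functions such as grep and map, make parsing easier and quicker to write than in most other languages.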

  • Hash Table
    Perl's hash (%hash, or a reference to one: $hash_ref) is different from Java's Hashtable.
    Java exposes every low-level detail, and its data structures can be complex and clumsy at times.
    Perl's hashes (like its arrays) are excellent.
    They make things easier and more intuitive.
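The array form of preg_replace shown earlier can be approximated in Perl with a loop over pattern/replacement pairs. The following is only a minimal sketch: the names @rules and replace_all are made up for illustration, and each replacement is written as a small callback so that the captured groups can be rearranged.

```perl
use strict;
use warnings;

# Ordered list of [pattern, replacement callback] pairs (hypothetical names).
# Each callback receives the captured groups as its arguments.
my @rules = (
    [ qr/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/,
      sub { my ($cc, $yy, $m, $d) = @_; "$m/$d/$cc$yy" } ],
    [ qr/^\s*\{(\w+)\}\s*=/,
      sub { my ($name) = @_; "\$$name =" } ],
);

# Apply every rule in order, replacing all matches, as preg_replace does.
sub replace_all {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($pattern, $make) = @$rule;
        $text =~ s/$pattern/$make->($1, $2, $3, $4)/ge;
    }
    return $text;
}

print replace_all('{startDate} = 1999-5-27'), "\n";   # $startDate = 5/27/1999
```

The rules are applied one after another to the running result, so their order matters, just as it does for the PHP arrays.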

Let's focus on regular expressions. Here is a comparison of PHP's preg_* functions and their Perl counterparts.

PHP               Perl
preg_match        m//
preg_match_all    m//g (in list context or a while loop)
preg_replace      s///g
preg_filter       s/// combined with grep
preg_grep         grep with m//
preg_quote        quotemeta, or \Q...\E inside a pattern
preg_split        split
From the table, we can see that Perl's regular expression support is clearer and more compact.
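To make the comparison concrete, here is a short sketch of the Perl side applied to one sample string. The sample data is made up, and the /r modifier used for the replacement needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $text = 'a1, b22, c333';

my $found   = $text =~ m/\d/ ? 'yes' : 'no';   # like preg_match
my @numbers = $text =~ m/(\d+)/g;              # like preg_match_all
my $masked  = $text =~ s/\d/#/gr;              # like preg_replace (/r returns a copy)
my @fields  = split /,\s*/, $text;             # like preg_split
my @long    = grep { /\d{2,}/ } @fields;       # like preg_grep
my $literal = quotemeta('1+1=2');              # like preg_quote

print "found:   $found\n";
print "numbers: @numbers\n";
print "masked:  $masked\n";
print "fields:  @fields\n";
print "long:    @long\n";
print "literal: $literal\n";
```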
Many CPAN modules extend regular expressions for parsing and extracting data; for parsing XML, they normally use SAX or DOM methods.

Here is a simple example: download a web page from craigslist.org, then parse the page to extract data:

(1) First, download the web page with the standard wget command:
wget http://vancouver.en.craigslist.ca/web/
(2) Second, once the downloaded page has been read into $html, use a Perl regular expression to extract the data, as follows:


if ( $html =~ m{
		Date:
		(.*?)		# Date
		<br
		(?:.*?)
		Reply\s+to:
		(?:.*?)
		<a\s(?:.*?)>
		(.*?)		# email
		</a>
		(?:.*?)
		<div\sid="userbody">
		(.*?)		# content
		</div>
	}sgix ) {
		my ( $date, $email, $t3 ) = ( $1, $2, $3 );
		...
}
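The match above assumes that $html already holds the whole page. A minimal sketch of loading it follows, assuming wget saved the page as index.html in the current directory (its usual name for a URL ending in a slash). The helper name slurp_file is made up for illustration.

```perl
use strict;
use warnings;

# Read a whole file into one scalar ("slurp" mode).
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator: read everything at once
    my $html = <$fh>;
    close $fh;
    return $html;
}

# Assumed file name; pass another path as the first argument if needed.
my $path = @ARGV ? $ARGV[0] : 'index.html';
my $html = -e $path ? slurp_file($path) : '';
print length($html), " characters read from $path\n";
```

Reading the file in one piece matters here because the pattern spans several lines of HTML, which is also why it uses the /s modifier.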

In this example we need three pieces of information: phone number, URL, and email address. The following subroutines do the job.
Regular expressions fit this kind of task well.
(a) parse the HTML and extract a phone number:



sub get_phone
{
	my ($self, $html) = @_;
	return unless $html;
	$html =~ s/<img.*?>//g;
	my ($phone) = $html =~ m{(?:\b|<b>)?([\d\-\(\)\.]{10,})(?:\b|</b>|\s)}s;
	return unless($phone);
	return if ($phone=~m/\.{10,}/); 	# more..........
	return if ($phone=~m/(?:\d\s){3,}/);  # 5 0 0 0 0 0
	$phone =~ s/^\s+// if ($phone=~m/^\s+/); # ' 123'
	$phone =~ s/\s+$// if ($phone=~m/\s+$/); # '123 '
	$phone =~ s/^\.+// if ($phone=~m/^\./);	 # '.1(604)'
	$phone =~ s/^-+// if ($phone=~m/^-/);	 # '-1(604)'
	$phone = '(' . $phone if ($phone=~m"\)" && $phone!~m"\(");
	$phone =~ s/\s/-/g if ($phone=~m/\s/);		#  '123 456 7890'
	$phone =~ s/-\($// if ($phone=~m/-\($/);  # '6789-('
	return $phone;
}

(b) extract a URL from the HTML:



sub get_url
{
	my ($self, $html) = @_;
	return unless $html;
	# first try a full URL: http://... or www....
	my ($url) = $html =~ m{((?:http://|www\.)(?:[\w\-]+\.){1,5}\w+(?:/\S*)?)}si;

	unless ($url) {
		# fall back to a bare domain or file name, but not one that follows '@'
		my $pattern = qr/\.(?:com|ca|info|us|tv|gov)/i;
		if ($html =~ $pattern) {
			($url) = $html =~ m{[^@]\b((?:[\w\-]+\.){1,5}(?:com|us|info|ca|jpg|png|jpeg|gif)(?:/\S*)?)}si;
		}
	}
	return unless $url;
	$url =~ s/<.*$//;		# drop trailing markup
	$url =~ s/">.*$//;		# drop a closing attribute quote and what follows
	$url =~ s/&amp;/&/g;		# decode &amp;
	$url =~ s/["';,?]$//;		# drop one trailing punctuation character
	return $url;
}

(c) extract email:



sub get_email {
	my ( $self, $str ) = @_;
	return unless $str;
	if ( $str =~ m/\@/ ) {
		$str =~ s/\<a.*?>//s;
		$str =~ s/<\/a>.*$//s;    # </a>
		$str = $self->trim($str);
	}
	else {
		$str = '';
	}
	return $str;
}

(d) a helper subroutine that trims whitespace from both ends of a string:



sub trim
{
	my ($self, $str) = @_;
	return '' unless $str;

	$str =~ s/&nbsp;/ /g if ($str =~ m/&nbsp;/);
	$str =~ s/&amp;/&/g if ($str =~ m/&amp;/);
	$str =~ s/^\s+// if ($str =~ m/^\s/);
	$str =~ s/\s+$// if ($str =~ m/\s$/);
	return $str;
}

Using these four subroutines to parse the HTML from craigslist extracts the data we want and copes with most of the formats that appear in listings.
It is simple and saves a lot of time.
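The subroutines above are written as methods (they take $self), so they need a package to live in. Here is a minimal sketch of how they might be wired together and called. The package name Extractor and the sample input are assumptions, and only slightly condensed versions of trim and get_email are included to keep it short.

```perl
use strict;
use warnings;

package Extractor;    # hypothetical package name

sub new { return bless {}, shift }

# Trim whitespace and decode the two entities handled in the post.
sub trim {
    my ($self, $str) = @_;
    return '' unless defined $str;
    $str =~ s/&nbsp;/ /g;
    $str =~ s/&amp;/&/g;
    $str =~ s/^\s+|\s+$//g;
    return $str;
}

# Return the text of the first link if the string contains '@', else ''.
sub get_email {
    my ($self, $str) = @_;
    return '' unless defined $str && $str =~ /\@/;
    $str =~ s/<a.*?>//s;      # drop the opening <a ...> tag
    $str =~ s/<\/a>.*$//s;    # drop </a> and everything after it
    return $self->trim($str);
}

package main;

my $parser = Extractor->new;
my $cell   = ' <a href="mailto:seller@example.org">seller@example.org</a>&nbsp;';
print $parser->get_email($cell), "\n";    # seller@example.org
```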

A bonus question: how can we print a word-frequency or line-frequency summary of this HTML?

To do this, we have to pick each word out of the input stream. Here a word means a run of letters, hyphens, or apostrophes, rather than any run of non-whitespace characters:



        while (<>) {
                while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
                        $seen{$1}++;
                }
        }
        while ( ($word, $count) = each %seen ) {
                print "$count $word\n";
        }

The code above counts words across the whole HTML input (all lines). To count identical lines instead, we can do this:



        while (<>) {
                $seen{$_}++;
        }
        while ( ($line, $count) = each %seen ) {
                print "$count $line";
        }
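Both loops above print in hash order, which is effectively random. A summary is usually easier to read when sorted, so here is a minimal sketch that orders words by descending count. The helper names count_words and by_frequency are made up, and lowercasing the words is an added assumption rather than part of the original loop.

```perl
use strict;
use warnings;

# Count words in a string, using the same pattern as the loop above.
sub count_words {
    my ($text) = @_;
    my %seen;
    $seen{ lc $1 }++ while $text =~ /(\b[^\W_\d][\w'-]+\b)/g;
    return \%seen;
}

# Return the words ordered by descending count, ties broken alphabetically.
sub by_frequency {
    my ($seen) = @_;
    return sort { $seen->{$b} <=> $seen->{$a} or $a cmp $b } keys %$seen;
}

my $seen = count_words("the cat and the hat and the bat");
printf "%d %s\n", $seen->{$_}, $_ for by_frequency($seen);
```

Note that the pattern requires at least two characters, so single-letter words such as "a" are not counted.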

That completes the example: regular expressions used to parse and extract different pieces of content from the original HTML.