William Jiang

JavaScript,PHP,Node,Perl,LAMP Web Developer – http://williamjxj.com; https://github.com/williamjxj?tab=repositories

Tag Archives: perl

Perl Padre IDE

Perl Padre IDE, Catalyst, and others

Perl is good at text processing. Here are some examples:

1. Add space between words
my $text = "ThisTextWithoutSpaces";
$text =~ s/([a-z])([A-Z])/$1 $2/g;
print $text; # This Text Without Spaces
2. Get the size of a file in bytes
my $size = -s "path/to/file.txt";

3. Running Windows? Got a file with UNIX line endings?
perl -ne "s/\n/\r\n/; print;" linux.txt > windows.txt
# This changes "\n" to "\r\n" at the end of every line.
4. Want to make a table of the number of times each word in $text appears?
my %words = ();
$words{lc($_)}++ for $text =~ /\b([\w']+)\b/g;

Each key of the %words hash will be a word in $text, and the value is the number of times that word appeared.

5. Padre, the Perl IDE
Padre is a Perl IDE (integrated development environment); its home page is http://padre.perlide.org/.
After installing Padre, which also installs a recent version of Perl (5.14), install Catalyst, the Perl MVC framework:

$ cpanm Catalyst::Devel

CentOS 6.2: Install Perl’s MongoDB modules without CPAN

Install Perl’s MongoDB module on CentOS 6.2

I have already set up a MongoDB server on a CentOS 6.2 machine, and want to add a Perl client, just like the PHP client.
Installing Perl modules without CPAN is somewhat painful, because CentOS doesn’t install CPAN by default.

Installing MongoDB.pm without the system CPAN is a bit of a challenge. However, it’s Perl, so it should be no problem.
I had pre-installed a 32-bit xampp package on this CentOS machine; it has a 32-bit cpan and perl which I can use. They can’t compile modules for this system, but they can be used to download Perl modules.

(1) So, as the first step, use xampp’s cpan command to download all the dependent modules from the CPAN repository:
$ sudo /opt/lampp/bin/cpan MongoDB

(2) Now all the dependent modules are downloaded, but they fail to compile and so are not installed: the OS is 64-bit while lampp is 32-bit, so the 32-bit /opt/lampp/bin/perl and make can’t build these modules for a 64-bit system.
I have to use CentOS’ own perl and make instead.

$ cd /root/.cpan/build/
$ ls -lrt | grep ^d > /tmp/1234
# saves the list of MongoDB's dependent modules to the file '/tmp/1234'
# Note that the order is very important, hence 'ls -lrt'
$ cd /tmp/
$ cat 1234 | cut -f11 -d ' ' >/tmp/process.sh

(3) After some editing in vi, /tmp/process.sh looks like this:

#!/bin/bash
cd /root/.cpan/build/
for i in `echo Params-Validate-1.06-kNaYxh 
DateTime-0.77-lWIhS7 
Data-Types-0.09-0jo_xI 
Class-Method-Modifiers-1.09-CogBmd 
blah blah...
MongoDB-0.46.3-p4xRQm`
do
    echo $i
    cd $i
    sudo perl Makefile.PL
    sudo make
    sudo make install
    cd -
done

Since the list is already sorted, the installation goes smoothly without errors. After the installation, test it:
$ perldoc MongoDB
It is done. Very cool.

Perl: setup LD_LIBRARY_PATH

Perl setting LD_LIBRARY_PATH

When cpan installs a Perl module that loads a dynamic shared object (.so), it may fail with an error like this:
Can’t load module.so: cannot open shared object file: No such file or directory. The error is raised near /usr/lib64/perl5/DynaLoader.pm line 82:

if ($ldlibpthname_defined &&
    $ldlibpthname ne 'LD_LIBRARY_PATH' &&
    exists $ENV{LD_LIBRARY_PATH}) {
    push(@dl_library_path, split(/$pthsep/, $ENV{LD_LIBRARY_PATH}));
}

It indicates that perl can’t find the required .so file, probably because the environment variable LD_LIBRARY_PATH is not set.
Here are several ways to set LD_LIBRARY_PATH for Perl scripts:

  1. Add in Perl script itself:
    BEGIN {
     $ENV{LD_LIBRARY_PATH} = "/usr/local/lib";
    }
    # after setting the path, load the Perl module which relies on it:
    use Text::module;

    If this causes problems, see “Runtime Linker and LD_LIBRARY_PATH” for a proper description.

  2. Add it to $HOME/.bash_profile
    This is a per-user setting, only available to the current login user. Edit $HOME/.bash_profile and add the following line at the bottom:
    export LD_LIBRARY_PATH=/usr/local/lib
  3. Add it to /etc/profile
    This is a global setting and affects all login users. Edit /etc/profile and add the following line at the bottom:
    export LD_LIBRARY_PATH=/usr/local/lib
  4. Set it on the command line
    This applies to the current session only and disappears at logout. On the command line:
    $ export LD_LIBRARY_PATH=/usr/local/lib
    $ perl -e 'use Text::module; …'

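Solution 1, putting the initial setting in Perl’s BEGIN{} block, keeps the setting next to the code that needs it. Be aware, though, that on Linux the runtime linker usually reads LD_LIBRARY_PATH only when the process starts, so if the BEGIN block has no effect, set the variable before perl is launched (solutions 2 to 4).
In a crontab, where the login environment is not inherited, we can do it like this: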
0 1 * * * (export LD_LIBRARY_PATH=/usr/local/lib; $HOME/perl_script >/dev/null 2>&1)
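
If setting the variable inside the script turns out to be too late for the .so lookup, a common workaround is to set it and then restart the script, so the new process starts with the path already in place. The following is only a sketch under assumptions: the helper names with_lib_dir and ensure_lib_dir are made up for illustration, and /usr/local/lib is simply the path used above.

```perl
use strict;
use warnings;

# Hypothetical helper: put $dir at the front of a colon-separated path list,
# dropping empty entries and any earlier copy of $dir.
sub with_lib_dir {
    my ($dir, $current) = @_;
    my @others = grep { length($_) && $_ ne $dir } split /:/, ($current // '');
    return join ':', $dir, @others;
}

# Hypothetical helper: if $dir is missing from LD_LIBRARY_PATH, add it and
# restart the current script so the new process starts with it in place.
sub ensure_lib_dir {
    my ($dir) = @_;
    my $current = $ENV{LD_LIBRARY_PATH} // '';
    return if grep { $_ eq $dir } split /:/, $current;   # already there
    $ENV{LD_LIBRARY_PATH} = with_lib_dir($dir, $current);
    exec($^X, $0, @ARGV) or die "Cannot re-exec $0: $!";
}

# Typical use, placed before the "use" line of the module that needs the .so:
# BEGIN { ensure_lib_dir('/usr/local/lib') }
```

Because the second run already finds the directory in LD_LIBRARY_PATH, ensure_lib_dir returns immediately and the script cannot loop.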

Perl, unicode/utf8/gb2312 convert

Here is a helpful Chinese article which summarizes conversions between Unicode, UTF-8 and GB2312 in Perl. I list the code here for quick reference:

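# Note: no "use utf8" here. The Chinese string literals below are meant to be
# raw UTF-8 bytes, so this file must be saved as UTF-8.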
use Encode;
use URI::Escape;

$\ = "\n";

# Unicode escape (%uXXXX) -> UTF-8 bytes, printed as hex
$str = '%u6536';
$str =~ s/\%u([0-9a-fA-F]{4})/pack("U",hex($1))/eg;
$str = encode( "utf8", $str );
print uc unpack( "H*", $str );

# Unicode escape (%uXXXX) -> GB2312 bytes, printed as hex
$str = '%u6536';
$str =~ s/\%u([0-9a-fA-F]{4})/pack("U",hex($1))/eg;
$str = encode( "gb2312", $str );
print uc unpack( "H*", $str );

# Chinese text -> URL-escaped UTF-8
$str = "收";
print uri_escape($str);

# URL-escaped UTF-8 -> Chinese text
$utf8_str = uri_escape("收");
print uri_unescape($utf8_str);

# Chinese text -> Perl code points
utf8::decode($str);
@chars = split //, $str;
foreach (@chars) {
    printf "%x ", ord($_);
}

# Chinese text -> standard \uXXXX notation
$text = "汉语";
$text = decode( "utf8", $text );
print map { sprintf( "\\u%x", $_ ) } unpack( "U*", $text );

# Unicode escape (%uXXXX) -> Chinese text
$str = '%u6536';
$str =~ s/\%u([0-9a-fA-F]{4})/pack("U",hex($1))/eg;
$str = encode( "utf8", $str );
print $str;

# Perl \x{...} code points -> Chinese text
my $unicode = "\x{505c}\x{8f66}";
print encode( "utf8", $unicode );

Actually, to convert GB2312 to Unicode and then insert it into a MySQL table with a Unicode collation (for example utf8_general_ci), the following odd-looking way might be more efficient:

use Encode;
$gb=decode("euc-cn",$gb);
$unicode=$dbh->quote($gb);
# now insert $unicode into the MySQL table.

It seems strange, but works fine. Others, such as Encode::from_to() and Encode::encode(), did not work for me.
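
For the conversion step alone, without a database, here is a minimal sketch using only the core Encode module. The function name gb2312_to_utf8 is made up for illustration; the sample character is U+6536, the one used in the snippets above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a GB2312 (EUC-CN) byte string to UTF-8 bytes by going through
# Perl's internal character string.
sub gb2312_to_utf8 {
    my ($gb_bytes) = @_;
    my $chars = decode('euc-cn', $gb_bytes);   # bytes -> characters
    return encode('UTF-8', $chars);            # characters -> UTF-8 bytes
}

# Build GB2312 input from a known code point, then convert it.
my $gb_bytes   = encode('euc-cn', "\x{6536}");
my $utf8_bytes = gb2312_to_utf8($gb_bytes);
print uc unpack('H*', $utf8_bytes), "\n";      # prints E694B6
```

The decode step is the same call as in the DBI snippet above; the only addition is an explicit encode when UTF-8 bytes are needed.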

Perl’s Mason: A solution for large dynamic websites building

Perl’s Mason (http://www.masonhq.com/) is a powerful Perl-based web site development and delivery engine. Currently it has two versions on CPAN: version 1 is HTML::Mason, version 2 is Mason.

With the support of Apache/mod_perl’s persistent environment, Mason is a powerful, high-performance templating system for the web and beyond. It solves the common problems of site development: caching, debugging, templating, maintaining development and production sites, and more. So it is suitable for large, dynamic websites. Amazon and Delicious are among its users.

The following example shows how it works:

1. HTML template file index.mc:
% my $name = "Mason";
Hello world! Welcome to <% $name %>.

2. Mason calls the template file:
#!/usr/bin/perl
use Mason;
my $mason = Mason->new(comp_root => '...');
print $mason->run('/index')->output;

It differs from other template systems: unlike many templating systems, Mason does not attempt to invent an alternate, “easier” syntax for templates. It provides a set of syntax and features specific to template creation, but underneath it is still clearly and proudly recognizable as Perl.

Mason is most often used for generating web pages. It can handle web requests directly via PSGI, or act as the view layer for a web framework such as Catalyst or Dancer.

In Perl, there are many template systems that can be used as the presentation layer, such as:

  • Text::Template
  • HTML::Template
  • Template::Toolkit
  • Mason

Among them, Mason is aimed at building large dynamic websites.
With Mason we can embed Perl code in HTML and construct pages from shared, reusable components.

Perl: install XML::LibXML

According to http://perl-xml.sourceforge.net/faq/, there are many parser modules for XML parsing (DOM-based, SAX-based, or XSLT) to choose from, because no one solution is appropriate in ALL cases.

(1) First of all, make sure to have XML::Parser installed, but don’t plan to use it directly. Other modules provide layers on top of XML::Parser; use them. If you’re looking for a more powerful tree-based approach, try XML::LibXML for a standards-compliant DOM or XML::Twig for a more ‘Perl-like’ API. Both of these modules support XPath. XML::LibXML is very fast, complete and stable. It can run in validating or non-validating modes and offers a DOM with XPath support.

According to the ‘select a parser module’ section of that FAQ, for general-purpose XML processing with Perl the quick answer is that XML::LibXML is usually the best choice. It is stable, fast and powerful. To make the most of the module you need to learn and use XPath expressions.

I used XML::Twig before, which is suitable for parsing large XML documents (e.g., 60MB). This time I am going to use XML::LibXML.

(2) Installing XML::LibXML can easily fail. The following are my steps, which needed some extra manual work to finally make it work.

# cd $HOME/
# cpan
cpan> install XML::LibXML
get Error:
looking for -lxml2... no
looking for -llibxml2... libxml2 not found
Try setting LIBS and INC values on the command line
Or get libxml2 from http://www.libxml.org/

(3) XML::LibXML needs libxml2.so. By default, the system has no libxml2.so, so I have to install it first. Download the latest version, libxml2-git-snapshot.tar.gz (see http://git.gnome.org/browse/libxml2/). On Linux, we can do it like this:


# wget ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
# tar xzvf libxml2-git-snapshot.tar.gz
# cd libxml2-2.7.8/
# ./configure; make; make install;

If everything is fine, libxml2.so is installed in /usr/local/lib/. To check:
# find /usr/local/lib/libxml2.so
The ‘libxml2.so’ should be there.

(4) Now that libxml2.so is installed, continue with XML::LibXML:

# cd /root/.cpan/build/XML-LibXML-1.70/
# vi Makefile.PL     (add the $config{LIBS} path:)
  $config{LIBS} = '-L/usr/local/lib -L/usr/lib -lxml2 -lm';
then:
# perl Makefile.PL
# make; make install;
# perldoc XML::LibXML

to check that it is successfully installed.

That’s the whole process. Manually adding $config{LIBS} is important, because without it perl can’t find libxml2.so to link against.

Perl: install CPAN module

While setting up a new application environment on Linux, I needed to install some Perl modules, such as the excellent WWW::Mechanize. Normally I use the CPAN module to do the installation:

# perl -MCPAN -e shell
cpan> install WWW::Mechanize

However, in many cases the installation runs into problems caused by the system environment, such as:

Failed 1/54 test programs. 0/569 subtests failed.
make: *** [test_dynamic] Error 255
/usr/bin/make test -- NOT OK
Running make install
make test had returned bad status, won't install without force

The tests are strict and fail in many cases. A test failure doesn’t necessarily mean the module can’t be installed; installation can often continue without any problem.
After the failure, my way to skip the tests is like this:

# cd $HOME/.cpan/
# cd build/WWW-Mechanize-1.68/
# make install

It should work. To verify that the module is installed successfully, use:

$ perldoc WWW::Mechanize

to check out its documentation. If the man page is accessible, the installation was successful.

A tip: if we want to install many Perl modules at once, there is no need to run ‘perl -MCPAN -e shell’ and install them one by one.
Create a file, put all the module names in it, and install from that file instead of from the command line; if there is any issue, go directly to /root/.cpan/ to fix it.
This is the quickest and simplest solution.
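
The install-from-a-file tip could look roughly like the sketch below. The file format (one module name per line, with "#" comments allowed) and the helper name read_module_list are assumptions made for this example; CPAN::Shell->install is the programmatic form of the shell's install command.

```perl
use strict;
use warnings;

# Read module names from a text file: one per line, blank lines and
# "#" comments ignored. The file format is an assumption of this sketch.
sub read_module_list {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @modules;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;          # drop comments
        $line =~ s/^\s+|\s+$//g;   # trim whitespace, including the newline
        push @modules, $line if length $line;
    }
    close $fh;
    return @modules;
}

# Install everything listed in the file named on the command line.
if (my $list_file = shift @ARGV) {
    require CPAN;
    CPAN::Shell->install($_) for read_module_list($list_file);
}
```

Saved under a name such as install_modules.pl (also made up), it would be run as: perl install_modules.pl modules.txt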

Perl vs. Python vs. Ruby

This article is from the web. I’m evaluating Python and Ruby as replacements for Perl.

I’ve been using Perl for several years and am very comfortable with it, although I’m definitely not an expert. Perl is a powerful language, but I think it’s ugly and encourages writing bad code, so I want to get rid of it. Python and Ruby both come with Mac OS X 10.2, both have BBEdit language modules, and both promise a cleaner approach to scripting. Over the past few weeks I read the Python Tutorial and the non-reference parts of Programming Ruby, however as of this afternoon I’d not written any Python or Ruby code yet.

Here’s a toy problem I wanted to solve. eSellerate gives me a tab-delimited file containing information about the people who bought my shareware.
I wanted a script to extract from this file the e-mail addresses of people who asked to be contacted when I release the new versions of the products.

I decided to solve this problem in each language and then compare the resulting programs. The algorithm I chose was just the first one that came to mind. I coded it first in Ruby, and then ported the code to Python and Perl, changing it as little as possible. Thus, the style is perhaps not canonical Python or Perl, although since I’m new to Ruby it’s probably not canonical Ruby either. If I were just writing this in Perl, I might have tried to avoid Perl’s messy syntax for nested arrays and instead used an array of strings.

Here’s the basic algorithm:

  1. Read each line of standard input and break it into fields at each tab.
  2. Each field is wrapped in quotation marks, so remove them. Assume that there are no quotation marks in the interior of the field.
  3. Store the fields in an array called record.
  4. Create another array, records and fill it with all the records.
  5. Make a new array, contactRecords, that contains arrays of just the fields we care about: SKUTITLE, CONTACTME, EMAIL.
  6. Sort contactRecords by SKUTITLE.
  7. Remove the elements of contactRecords where CONTACTME is not 1.
  8. Print contactRecords to standard output, with the fields separated by tabs and the records separated by newlines.

And here’s the code:

Perl

#!/usr/bin/perl -w

use strict;

my @records = ();

foreach my $line ( <> )
{
    my @record = map {s/"//g; $_} split("\t", $line);
    push(@records, \@record);
}

my $EMAIL = 17;
my $CONTACTME = 27;
my $SKUTITLE = 34;

my @contactRecords = ();
foreach my $r ( @records )
{
    push(@contactRecords, [$$r[$SKUTITLE], 
          $$r[$CONTACTME], $$r[$EMAIL]]);
}

@contactRecords = sort {$$a[0] cmp $$b[0]} @contactRecords;
@contactRecords = grep($$_[1] eq "1", @contactRecords);

foreach my $r ( @contactRecords )
{
    print join("\t", @$r), "\n";
}

The punctuation and my’s make this harder to read than it should be.

Python

#!/usr/bin/python

import fileinput

records = []

for line in fileinput.input():
    record = [field.replace('"', '') for field in line.split("\t")]
    records.append(record)

EMAIL = 17
CONTACTME = 27
SKUTITLE = 34

contactRecords=[[r[SKUTITLE], r[CONTACTME], r[EMAIL]] for r in records]
contactRecords.sort() # default sort will group by sku title
contactRecords = filter(lambda r: r[1] == "1", contactRecords)

for r in contactRecords:
    print "\t".join(r)

I think the Python version is generally the cleanest to read—that is, it’s the most English-like. I had to look up how join and filter worked, because they weren’t methods of list as I had guessed.

Ruby

#!/usr/bin/ruby

records = []

while gets
    record = $_.split("\t").collect! {|field| field.gsub('"', '') }
    records << record
end

EMAIL = 17
CONTACTME = 27
SKUTITLE = 34

contactRecords=records.collect {|r| [r[SKUTITLE], r[CONTACTME], r[EMAIL]] }
contactRecords.sort! # default sort will group by sku title
contactRecords.reject! {|a| a[1] != "1"}

contactRecords.each {|r|
    print r.join("\t"), "\n"
}

An example of regular expressions

As far as I know, Perl stands on two legs: regular expressions and hashes. With them, Perl is very powerful at solving real problems; without them, Perl is just a Shell++.

  • Regular expressions (RE)
    All the scripting languages have regular expressions: PHP, Perl, Python, Ruby.
    As far as I know, they are very similar: most inherit and extend Perl’s syntax, which popularized the modern RE style.
    For example, PHP’s preg_replace function can apply many regular expressions in one call, which Perl’s s/// cannot do directly. Here is an example of preg_replace operating on arrays as well as scalar variables:

    
    <?php
    $patterns = array ('/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/',
                       '/^\s*{(\w+)}\s*=/');
    $replace = array ('\3/\4/\1\2', '$\1 =');
    echo preg_replace($patterns, $replace, '{startDate} = 1999-5-27');
    ?>
    - output: $startDate = 5/27/1999
    

    In the example above, $patterns and $replace can be arrays, which is not built into Perl’s s///.
    However, that doesn’t mean RE in PHP is superior to RE in Perl.
    Actually, Perl’s m//, s///, tr/// plus other functions (grep, map) make parsing much easier and quicker than in other languages.

  • Hash table
    Perl’s hash (%hash, or a reference: $hash_ref) is different from Java’s Hashtable.
    Java exposes every low-level detail, and its data structures can feel complex and clumsy (sometimes).
    Perl’s hashes (as well as its arrays) are superb.
    They make things easier and more intuitive.
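
Coming back to the preg_replace example: s/// takes one pattern at a time, but a list of pattern/replacement pairs and a short loop give much the same effect in Perl. This is only a sketch; the name apply_rules and the rule layout are made up for illustration, and at most four capture groups are passed along.

```perl
use strict;
use warnings;

# Each rule is [ compiled pattern, code that builds the replacement text
# from the captured groups ].
my @rules = (
    [ qr/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/, sub { "$_[2]/$_[3]/$_[0]$_[1]" } ],
    [ qr/^\s*\{(\w+)\}\s*=/,                  sub { "\$$_[0] =" } ],
);

# Apply every rule in order, replacing all matches of each pattern.
sub apply_rules {
    my ($text, @rules) = @_;
    for my $rule (@rules) {
        my ($pattern, $build) = @$rule;
        $text =~ s/$pattern/$build->($1, $2, $3, $4)/ge;
    }
    return $text;
}

print apply_rules('{startDate} = 1999-5-27', @rules), "\n";
# prints: $startDate = 5/27/1999
```

The rules run in order, as the PHP arrays do, so a later rule sees the result of the earlier ones.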

We focus on RE. Here is a comparison of PHP’s preg_* functions and their Perl counterparts.

php               perl
preg_match        m//
preg_replace      s///
preg_filter       s/// combined with grep
preg_grep         grep with m//
preg_match_all    m//g
preg_quote        quotemeta, \Q...\E
preg_split        split

From the table, we can see that Perl’s RE is clearer and more compact.
There are also many modules on CPAN that go beyond RE for parsing and extracting data; for XML, for example, they normally use SAX or DOM methods.

Here is a simple example that I wrote: download a web page from craigslist.org, then parse the page to extract data.

(1) First, we download the web page using the generic command ‘wget’.
wget http://vancouver.en.craigslist.ca/web/
(2) Second, once the page has been loaded into memory as $html, we can use Perl’s RE to extract data, as follows:


while ( $html =~ m{
		Date:
		(.*?)		# Date
		<br
		(?:.*?)
		Reply\s+to:
		(?:.*?)
		<a\s(?:.*?)>
		(.*?)		# email
		</a>
		(?:.*?)
		<div\sid="userbody">
		(.*?)		# content
		</div>
	}sgix ) {
		my ( $date, $email, $t3 ) = ( $1, $2, $3 );
		...
}
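
To try the same kind of pattern without downloading anything, here is a self-contained sketch. The sample HTML is invented for the example and only imitates the layout the pattern expects; it uses named captures (Perl 5.10 or later) instead of $1, $2, $3.

```perl
use strict;
use warnings;

# Made-up sample standing in for a downloaded posting page.
my $html = <<'HTML';
Date: 2011-05-27, 10:15AM PDT<br>
Reply to: <a href="mailto:seller@example.org">seller@example.org</a><br>
<div id="userbody">Web developer wanted. Call 604-555-0100.</div>
HTML

my @posts;
while ( $html =~ m{
        Date: \s* (?<date>.*?) <br      # posting date
        .*?
        Reply\s+to: .*?
        <a\s[^>]*> (?<email>.*?) </a>   # reply address
        .*?
        <div\sid="userbody">
        (?<body>.*?)                    # ad text
        </div>
    }sgix ) {
    push @posts, { date => $+{date}, email => $+{email}, body => $+{body} };
}

printf "%s | %s | %s\n", @{$_}{qw(date email body)} for @posts;
```

With the /g flag in a while loop, each pass picks up the next posting, so a page with several postings yields several entries in @posts.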

In this example, we need three pieces of information: phone number, URL, and email address. The following subroutines do the job and give accurate results.
Regular expressions are a perfect fit for such a case.
(a) Parse the html and extract the phone number:



sub get_phone
{
	my ($self, $html) = @_;
	return unless $html;
	$html =~ s/<img.*?>//g;
	my ($phone) = $html =~ m{(?:\b|<b>)?([\d\-\(\)\.]{10,})(?:\b|</b>|\s)}s;
	return unless($phone);
	return if ($phone=~m/\.{10,}/); 	# more..........
	return if ($phone=~m/(?:\d\s){3,}/);  # 5 0 0 0 0 0
	$phone =~ s/^\s+// if ($phone=~m/^\s+/); # ' 123'
	$phone =~ s/\s+$// if ($phone=~m/\s+$/); # '123 '
	$phone =~ s/^\.+// if ($phone=~m/^\./);	 # '.1(604)'
	$phone =~ s/^-+// if ($phone=~m/^-/);	 # '-1(604)'
	$phone = '(' . $phone if ($phone=~m"\)" && $phone!~m"\(");
	$phone =~ s/\s/-/g if ($phone=~m/\s/);		#  '123 456 7890'
	$phone =~ s/-\($// if ($phone=~m/-\($/);  # '6789-('
	return $phone;
}

(b) Extract the URL from the html:



sub get_url
{
	my ($self, $html) = @_;
	return unless $html;
	my ($url) = $html =~ m{((http://|www\.)(?:[\w\-]+\.){1,5}\w+(/\S*)?)}sig;

	unless ($url) {
		my $pattern = qr/\.(?:com|ca|info|us|tv|gov)/;
		if ($html=~m/$pattern/i) {		
			($url) = $html =~ m{[^@](?:\b)((?:[\w\-]+\.){1,5}(com|us|info|ca|jpg|png|jpeg|gif)(/\S*)?)}sig;
		}
	}
	$url =~ s/<.*$// if ($url && $url=~m/<.*$/);
	$url =~ s/">.*$// if ($url && $url=~m/">/);
	$url =~ s/&amp;/&/g if ($url && $url=~m/&amp;/);
	$url =~ s/\S$// if ($url && $url=~m/["';,?]$/);
	return $url;
}

(c) Extract the email address:



sub get_email {
	my ( $self, $str ) = @_;
	return unless $str;
	if ( $str =~ m/\@/ ) {
		$str =~ s/\<a.*?>//s;
		$str =~ s/<\/a>.*$//s;    # </a>
		$str = $self->trim($str);
	}
	else {
		$str = '';
	}
	return $str;
}

(d) An RE subroutine to trim whitespace from both ends of a string:



sub trim
{
	my ($self, $str) = @_;
	return '' unless $str;

	$str =~ s/&nbsp;/ /g if ($str =~ m/&nbsp;/);
	$str =~ s/&amp;/&/g if ($str =~ m/&amp;/);
	$str =~ s/^\s+// if ($str =~ m/^\s/);
	$str =~ s/\s+$// if ($str =~ m/\s$/);
	return $str;
}

Using the four subroutines above to parse the html content from craigslist, the extracted data are exactly what we want, and the code copes with most format variations.
It is easy and simple, and saves a lot of time.

A bonus question: how can we print out a word-frequency or line-frequency summary of this html?

To do this, we have to parse out each word in the input stream. We’ll pretend that by word we mean chunk of alphabetics, hyphens, or apostrophes, rather than the non-whitespace chunk idea of a word given in the previous question:



        while (<>) {
                while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
                        $seen{$1}++;
                }
        }
        while ( ($word, $count) = each %seen ) {
                print "$count $word\n";
                }

The above parses the whole html content (multiple lines). If we want to do the same thing for individual lines, we can do it like this:



        while (<>) {
                $seen{$_}++;
                }
        while ( ($line, $count) = each %seen ) {
                print "$count $line";
        }

The above is a complete example of RE usage: parsing and extracting different contents from the original html.
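
The word-frequency loop can also be wrapped in functions and printed most-frequent first. A small sketch: the function names are made up, and lower-casing the words before counting is a choice of this example, not part of the loop above.

```perl
use strict;
use warnings;

# Count words using the same pattern as the loop above, case-insensitively.
sub word_frequencies {
    my ($text) = @_;
    my %seen;
    $seen{ lc $1 }++ while $text =~ /(\b[^\W_\d][\w'-]+\b)/g;
    return \%seen;
}

# Return "count word" lines, highest count first, ties in alphabetical order.
sub frequency_report {
    my ($seen) = @_;
    return map  { "$seen->{$_} $_" }
           sort { $seen->{$b} <=> $seen->{$a} or $a cmp $b } keys %$seen;
}

print "$_\n" for frequency_report(
    word_frequencies("The cat saw the other cat; the end") );
```

Sorting matters because each %seen returns entries in no particular order, which makes the raw output hard to read for a large page.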