Recursive collection of search statistics of a single search criterion within a hierarchy of IMAP folders

 

Emin Gabrielyan

2009-10-19

 

A Perl script is developed (i) discovering automatically all IMAP folders and subfolders on a set of accounts (possibly on different IMAP servers), (ii) executing search criteria within each discovered folder, (iii) collecting and summarizing the statistics, and (iv) reporting the results in a form of a chart within an html page uploaded on one or more specified web servers.

 

For demonstration, refer to the online pages of charts displaying the numbers of replies of Switzernet’s support team to customers sent from different accounts on different servers (contracts, billing, and support).

 

[091017 ii]

 

 

The discovery and collection of statistics relies entirely on IMAP functionality [RFC 3501]. The email headers are not fetched and searching is not carried out locally (as in the example of search folders (ii) of Thunderbird). Several RFC incompliance issues are observed in Zimbra IMAP servers, but the code is made compatible also with the specificities of Zimbra.

 

The system is tested on a configuration with 4 different server machines and a dozen of IMAP accounts to access, some of which having more than 70 IMAP folders each having maximum 5’000 messages. In this configuration, one of the machines runs Dovecot, two machines run Zimbra, and the fourth one runs unknown server software.  With a set of 25 criteria to execute on all messages within each folder of each account, the results are retrieved and the report is uploaded within a total time of 7 minutes.

 

The versions of the package are published in a public code depository. One or more execution servers recurrently download the last version of the code, execute it, and upload the reports on two mirror web servers. Currently the system is configured to generate reports three times per day.

 

[091017]

 

The package contains two configuration files, the ‘names.txt’ and the list of accounts and folders in the format shown in the example below:

 

Login: contracts@switzernet.com

Server: mail.switzernet.com

Dir: Inbox

Dir: Inbox.Feedback

Sub: Inbox.Approved

Sub: Inbox.stamps

 

The header field ‘Dir’ can be repeated several times under a specified account. These entries specify the folders to be scanned non-recursively. One ‘Dir’ entry per folder is needed. The header field ‘Sub’ can be repeated several times for a given account and is used to specify the folders to be scanned recursively.

 

The passwords file contains one or more space (not tab) separated fields. The first field is protocol name (‘imap’ or ‘ftp’), the second field is the server name, the third is the login, and the fourth is the password.

 

imap  mail.switzernet.com   tasks@switzernet.com          xxx

imap  mail3.switzernet.com  billing@mail3.switzernet.com  xxx

ftp   unappel.ch            unappel                       xxx

 

The password file can be kept in the non-public package (to be downloaded each time), or can be stored locally in execution server.

 

The rest of this document comments the Perl code.

 

Code

Comments

#!/usr/bin/perl

#Copyright (c) 2009 by Emin Gabrielyan, Switzernet

$version="aa15.perl.txt";

 

use IO::Socket;

use Net::FTP;

 

local $log=3;

local $timeout=30;

local $fpasswds="../passwords.txt";

local $ffolders="folders.txt";

local $fnames="names.txt";

 

local $SIG{ALRM} = sub { die "Server timeout" };

The log level 3 shows a brief progress report, while the log level 4 shows all messages exchanged with servers.

 

Timeout is the delay in seconds for an individual command exchange.

sub mksearches

{

  my ($search,$name_file)=@_;

  open file,"<$fnames" or {die "cannot access the names"};

  my @names=<file>;

  close file;

  my %searches=();

  foreach(@names)

  {

    s/ +/ /g;

    s/^ *//;

    s/ *$//;

    s/[\r\n]//g;

    my $name=$_;

    s/[^a-z ][^ ]*( |$)/$1/ig;

    s/ +/ /g;

    s/^ *//;

    s/ *$//;

    if(!/^ *$/)

    {

      s/(^| )/ from /ig;

      $searches{$search.$_}=$name;

    }

  }

  return %searches;

}

This subroutine generates the set of search criteria for the names retrieved from the input file. The search aims at messages where the names appear as display names in the from header fields.

 

The special non 7-bit characters do not appear in the IMAP query. The search query is a little bizarre for bypassing the RFC incompliance of Zimbra.

 

A global hash array ‘searches’ is loaded by IMAP search queries (as indexes) and the original names (as values).

sub getfolders

{

  my ($login,$server,$type)=@_;

 

  open file,"<$ffolders" or {die "cannot list folders"};

 

  my $flogin="", $fserver="", @dironl=(), @subdir=();

 

  while(<file>)

  {

    $flogin=$1 if(/^Login: +([^ \r\n]+)[\r\n ]*$/);     

    $fserver=$1 if(/^Server: +([^ \r\n]+)[\r\n ]*$/);

    if($flogin eq $login && $fserver eq $server)

    {

      push(@dironl,$1) if(/^Dir: +([^ \r\n]+)[\r\n ]*$/);

      push(@subdir,$1) if(/^Sub: +([^ \r\n]+)[\r\n ]*$/);

    }

  }

 

  close file;

  return ($type?@subdir:@dironl);

}

This subroutine is responsible to find and return the list of IMAP folders to scan.

 

The inputs of the subroutine are the IMAP login, server, and the type of folders (0 for folders to scan non-recursively, and 1 for folders to scan recursively).

sub getpasswd

{

  my ($proto,$host,$user)=@_;

 

  open file,"<$fpasswds" or die "cannot access the passwords";

  my @lines=<file>;

  close file;

  my $passw="";

  foreach(@lines)

  {

    if(/^$proto +$host +$user +([^ \r\n]+)[\r\n]*$/)

    {

      $passw=$1;

      last;

    }

  }

  return $passw;

}

This subroutine returns the password value for a given protocol, host, and login username.

sub subfolders

{

  my ($login,$server)=@_;

  my @dironl=&getfolders($login,$server,0);

  my @subdir=&getfolders($login,$server,1);

 

  if($log>=2)

  {

    print "\n";

    print " Login: $login\n";

    print "Server: $server\n";

    $"=", ";

    print "   Dir: @dironl\n";

    print "   Sub: @subdir\n";

    print "\n";

  }

 

  alarm($timeout);

 

  local $c=IO::Socket::INET->new(Proto=>"tcp",PeerAddr=>"$server",PeerPort=>143) or {die "cannot connect to IMAP"};

  local $tag=1000;

  local @folders;

 

  while(<$c>) {print if($show); last if(/^\* OK/)}

 

  sub dialog{

    my @commands=@_;

    foreach $cmd (@commands)

    {

      alarm($timeout);

      $tag++;

      my $send="$tag $cmd\n";

      my $expect="$tag OK";

      print $send if($log>=4);

      print $c $send;

      while(<$c>)

      {

        print if($log>=4);

        $found{$cmd}+=split/ +/,$1 if(/^\* SEARCH(?: +(\d|\d[\d ]*\d))?[^\d]*$/);

        push(@folders,$1) if(/^\* LIST \([^)]*\) \"[^\"]*\" (\"[^\"]*\")[\r\n]*$/);

        last if(/^$expect/)

      }

    }

  }

 

  my $passw=&getpasswd('imap',$server,$login);

 

  &dialog("noop", "login $login $passw", "noop");

 

  print "On $server as $login\n" if($log>=2);

 

  @folders=();

  &dialog("noop", "list \"\" \"$_\"", "noop") foreach(@dironl);

  &dialog("noop", "list \"$_\" *", "noop") foreach(@subdir);

 

  my @seq;

  foreach(@folders)

  {

    print "$_\n" if($log>=2);

    @seq=("examine $_");

    push(@seq,$_) foreach(keys %searches);

    &dialog(@seq);

  }

 

  &dialog("noop", "logout");

  close $c;

  alarm(0);

}

This function logins into the server credentials of which are provided as arguments.

 

Then with the two calls to ‘getfolders’ it discovers the sets of folders to scan non-recursively and the folders to scan recursively.

 

It launches an alarm after the timeout seconds and establish a TCP connection with the IMAP port (143) of the target server.

 

The parameter ‘tag’ is the identifier prefixing the IMAP client commands. This parameter is auto incremented after each client command on the server.

 

The subroutine ‘dialog’ takes as input a sequence of commands. If a list output (folder hierarch discovery) or a search output is encountered, the results are stored in ‘found’ hash array or in ‘folders’ array correspondingly.

 

The first dialog is the login exchange.

 

The next two dialog loops are the discoveries of folders and subfolders in the recursive and non-recursive folder sets.

 

Then all discovered folders are examined (the read only counterpart of the select IMAP command) and the set of search criteria is launched within each subfolder.

 

The last dialog is that of the logout.

sub dateandtime

{

  my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);

  $year+=1900;

  $mon+=1;

  return ($sec,$min,$hour,$mday,$mon,$year);

}

This subroutine returns an array containing the components of the current date and time.

sub trapezoid

{

  my $x=$_[0];

  $x=$x-6*int($x/6);

  return $x if($x<1);

  return 1 if($x>=1 && $x<=3);

  return 4-$x if($x>3 && $x<4);

  return 0;

}

 

sub mkcolor

{

  my $x=6*($_[0]-$_[1])/($_[2]-$_[1]+1);

 

  return (&trapezoid($x+2), &trapezoid($x), &trapezoid($x+4));

}

These two functions create a continuous linear color scheme. For an input from 0 to 1, the function returns an RGB color continuously changing from red to green, to blue, then back to green.

 

The generation of this color scheme is carried out also in a Excel file generating draft samples of the HTML report [xls].

sub save

{

  my ($year_of_stat,$month_of_stat,$filename)=@_;

  my @mmmm=qw(January February March April May June July August September October November December);

  my @mmm=qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

  my ($sec,$min,$hour,$mday,$mon,$year)=&dateandtime;

 

  open file,">$filename" or die "cannot write";

 

print file <<____;

<html>

<head>

<title>Support email statistics of Switzernet on contracts@ billing@ support@ rates@ usd-online@ accounts</title>

</head>

<body><font face=arial>

<table border=0 cellpadding=0 cellspacing=0 height=750><tr>

<td valign=bottom bgcolor=#eeeeee align=center>

____

 

  my @invoke=(sort { $a cmp $b } keys %searches);

  my $n=@invoke;

  my $s=0;

  for(my $i=$n-1;$i>=0;$i--)

  {

    my $imap=$invoke[$i];

    $s+=$found{$imap};

  }

  printf file "<font size=2><u>Total %d</u></font><br>",$s;

 

print file <<____;

<table border=0 cellpadding=0 cellspacing=2 width=75>

____

 

  for(my $i=$n-1;$i>=0;$i--)

  {

    my ($r,$g,$b)=&mkcolor($i,0,$n-1);

    my $imap=$invoke[$i];

    my $name=$searches{$imap};

    my $count=$found{$imap};

    my $h=$count*0.15;

    printf file "<tr><td valign=middle align=center bgcolor=#%02x%02x%02x height=%.1f><font size=2 color=white>",255*$r,255*$g,255*$b,$h;

    my $label=$name;

    $label=~s/ //g;

    $label=~s/^(.{1,5}).*$/$1/;

    $label.=" ".$count;

    print file $label if($h>10);

    print file "</font></td></tr>\r\n";

  }

 

print file <<____;

</table>

</td>

<td valign=bottom>

<table border=0 cellpadding=0 cellspacing=0 width=200>

____

 

  for(my $i=$n-1;$i>=0;$i--)

  {

    my ($r,$g,$b)=&mkcolor($i,0,$n-1);

    my $imap=$invoke[$i];

    my $name=$searches{$imap};

    my $count=$found{$imap};

    printf file "<tr bgcolor=#%02x%02x%02x>",255*(0.9+0.1*$r),255*(0.9+0.1*$g),255*(0.9+0.1*$b);

    printf file "<td valign=middle align=right>";

    printf file "<font size=2 color=#%02x%02x%02x>",255*0.8*$r,255*0.8*$g,255*0.8*$b;

    printf file "%s</font></td>",$name;

    printf file "<td valign=middle align=right>";

    printf file "<font size=2 color=#%02x%02x%02x>",255*0.8*$r,255*0.8*$g,255*0.8*$b;

    printf file "%d</font></td></tr>",$count;

  }

 

print file <<____;

</table>

</td>

</tr></table>

<font size=1>

____

 

  printf file "<font size=3><b>%s'%02d</b></font><br>",$mmm[$month_of_stat-1],$year_of_stat%100;

  for(my $i=1;$i<=31;$i++)

  {

    printf file "<a href=%02d.htm>%02d</a> ",$i,$i;

  }

  print  file "<a href=today.htm>today</a><br>";

  printf file "<font size=1 color=gray><i>Created by %s on %04d-%02d-%02d %02d:%02d:%02d</i></font>",$version,$year,$mon,$mday,$hour,$min,$sec;

 

print file <<____;

</font>

</font></body>

</html>

____

 

  close file;

}

This subroutine generates the HTML report. It takes as input the year and the month of statistics, as well as the output HTML filename.

 

It assumes the two global hash arrays, ‘searches’ containing the names, and ‘found’ containing the numbers collected from IMAP servers.

 

The individual months histograms are created by piling in a single column table rows (i.e. cells) of different heights. The background colors of these cells changes (from bottom to top) according to our trapezoid RGB color scheme. The short names followed by values appear in the cell only if the height of the individual histogram piece is sufficiently high.

 

The individual month histograms can be viewed in the raw upload folder [091017 ii].

 

These histograms of individual months are then collected into chart using a main index with html IFRAME tags for each month. This technique permits uploading of only the pages containing the data of current month without regenerating the histograms of previous months. See the code of the main page [091017 ii] defining simply the monthly spaces via IFRAME windows.

 

The report page contains also links to all other eventual reports of the month.

sub upload

{

  my ($file,$host,$user,$dir,$year_of_stat,$month_of_stat)=@_;

 

  my ($sec,$min,$hour,$mday,$mon,$year)=&dateandtime;

 

  my $passw=&getpasswd('ftp',$host,$user);

 

  my $ftp = Net::FTP->new($host, Debug => ($log>=4?1:0)) or die "Cannot connect to ftp";

 

  $ftp->login($user,$passw) or die "Cannot login ", $ftp->message;

  print "ftp logged in $host\n" if ($log>=2);

 

  $ftp->cwd($dir) or die "Cannot change working directory ", $ftp->message;

 

  $ftp->put($file,sprintf("uploads/%02d%02d%02d,%02d%02d%02d,stats,%04d,%02d.htm",

    $year%100,$mon,$mday,$hour,$min,$sec,$year_of_stat,$month_of_stat))

    or die "cannot upload to [uploads]";

  print "ftp cp to [uploads] ok\n" if ($log>=2);

 

  $ftp->put($file,sprintf("%02d%02d/today.htm",$year_of_stat%100,$month_of_stat))

    or die "cannot upload to [yymm]";

  print "ftp cp to [yymm]/today.htm ok\n" if ($log>=2);

 

  if($year==$year_of_stat && $mon==$month_of_stat)

  {

    $ftp->put($file,sprintf("%02d%02d/%02d.htm",$year_of_stat%100,$month_of_stat,$mday))

      or die "cannot upload to [yymm]";

    print "ftp cp to [yymm]/dd.htm ok\n" if ($log>=2);

  }

 

  $ftp->quit();

}

This subroutine is responsible for uploading the report file on a remote web server.

 

It takes as arguments the local filename, the remote ftp hostname, the ftp login username, the ftp directory, the year and the month of the statistic report.

 

The remote server must have a subfolder called ‘uploads’ for keeping a track of all uploaded files without overwriting anything. The server must also contain an YYMM folder for each month of the report.

 

If the report concerns the current month (which is the case most of the time), then the report is additionally copied in YYMM/DD.htm file on the remote server (overwriting the previous version if any).

 

In any case the report overwrites the YYMM/today.htm file of the month’s folder.

sub collect

{

  &subfolders('contracts@switzernet.com',"mail.switzernet.com");

  &subfolders('usd-online@switzernet.com',"mail.switzernet.com");

  &subfolders('support@mail2.switzernet.com',"mail2.switzernet.com");

  &subfolders('billing@mail3.switzernet.com',"mail3.switzernet.com");

  return;

  &subfolders('rates@unappel.ch', "mail.unappel.ch");

  &subfolders('cash@switzernet.com',"mail.switzernet.com");

  &subfolders('tasks@switzernet.com', "mail.switzernet.com");

  &subfolders('projects@switzernet.com',"mail.switzernet.com");

}

This subroutine logins in each account, collects, and summarizes the statistics per each set of search criteria.

sub monthlyreport

{

  my ($year,$month,$criteria)=@_;

  my @mmm=qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

 

  printf "\nPreparing monthly report %04d-%02d\n",$year,$month if($log>=2);

 

  my $search="search $criteria since 1-$mmm[$month-1]-$year before 1-".$mmm[$month%12]."-".($year+int($month/12));

  $search=~s/ +/ /g;

 

  local %searches=&mksearches($search);

  local %found=();

 

  &collect;

 

  if($log>=2)

  {

    print "\nSatatistics of names:\n";

    printf "%5d %s\n",$found{$_},$searches{$_} foreach(sort { $a cmp $b } keys %found)

  }

 

  &save($year,$month,"out.htm");

 

  &upload("out.htm","unappel.ch","unappel","/htdocs/public/091017-support-numbers/stats/",$year,$month);

  &upload("out.htm","switzernet.com","switzernet.com","/public/091017-support-numbers/stats/",$year,$month);

}

This subroutine receives as input the year and the month for which a report is to be generated. The third argument is the common root of all search criteria (multiplied by the number of names provided in the input name file).

 

The numerical values of the year and month are translated into IMAP search criteria.

 

The set of searches is generated and is stored in the hash array, used then by the ‘collect’ subroutine, which summarizes the results in the ‘found’ hash array.

 

These hash arrays are then used by the ‘save’ subroutine generating a report in the ‘out.htm’ file, which is then uploaded to two mirror web locations.

sub cron

{

  my ($criteria)=@_;

  my ($sec,$min,$hour,$mday,$mon,$year)=&dateandtime;

 

  if($mday<=2)

  {

    &monthlyreport($year-($mon==1),$mon==1?12:$mon-1,$criteria);

  }

 

  &monthlyreport($year,$mon,$criteria);

}

This subroutine is the main entry point by default. It is called a couple of times per day. If we are at the beginning of the month we still generate and update the last (completed) month’s stats in addition to the current new month stats. From the third day of the month, only the stats of the current month are generated and uploaded.

sub ytd

{

  my ($criteria)=@_;

  my ($sec,$min,$hour,$mday,$mon,$year)=&dateandtime;

 

  for(my $m=1;$m<=$mon;$m++)

  {

    &monthlyreport($year,$m,$criteria);

  }

}

This subroutine generates the Year To Day history (all monthly stats of the year as of today).

 

This subroutine must be called only once, at the beginning, to initialize the system.

 

#&ytd("not to undisclosed not to inconnus not to switzernet.com");

 

&cron("not to undisclosed not to inconnus not to switzernet.com");

 

#&monthlyreport(2009,9,"not to undisclosed not to inconnus not to switzernet.com");

 

 

 

The samples of calls. By default only the cron daily update entry is used. The root of the search criteria limits the search only to messages addressed to customers.

 

The following short script is responsible for downloading and executing the current package.

 

Script

Coments

#!/bin/bash

 

url="http://unappel.ch/public/091017-support-numbers/code/current.zip"

dir="/root/folders/091017-support-numbers"

 

cd $dir

 

if [ `pwd` = $dir ]

then

  echo "[$0]" working folder found

else

  echo "[$0]" not in working folder

  exit

fi

 

if [ -f current.zip ]

then

  echo "[$0]" removing the previous version

  rm current.zip

else

  echo "[$0]" ..

fi

 

if [ -f current.zip ]

then

  echo "[$0]" cannot remove the old package

  exit

fi

Removing the previous version

/usr/bin/wget $url

 

if [ -f current.zip ]

then

  echo "[$0]" script is downloaded

else

  echo "[$0]" script is not downloaded

  exit

fi

 

if [ -d code ]

then

  echo "[$0]" deleting the old script folder

  rm -r code

fi

 

if [ -d code ]

then

  echo "[$0]" cannot delete the old script folder

  exit

fi

 

mkdir code

 

if [ -d code ]

then

  echo "[$0]" ..

else

  echo "[$0]" cannot access script folder

  exit

fi

 

/usr/bin/unzip current.zip -d code

Downloading and unpacking the Perl script

if [ -f code/run.pl ]

then

  echo "[$0]" perl script is extracted

else

  echo "[$0]" cannot extract the perl script

  exit

fi

 

cd code

/usr/bin/perl run.pl >> ../std.log 2>> ../err.log

cd ..

 

tail -5000 std.log > last.std.log

rm std.log

mv last.std.log std.log

 

tail -5000 err.log > last.err.log

rm err.log

mv last.err.log err.log

 

date >> std.log

date >> err.log

 

echo "[$0]" end of reporting

Executing and rotating the standard and error output log files

 

 

Below are the crontab entries on two execution servers:

 

$ ssh root@xxx.xxx.15.3 crontab -l | grep support-numbers

51 3,11,20 * * * /root/folders/091017-support-numbers/report.sh > /dev/null

 

$ ssh root@xxx.xxx.15.4 crontab -l | grep support-numbers

11 23 * * * /root/folders/091017-support-numbers/report.sh > /dev/null

 

 

References:

 

On-line statistics of answered emails of support@ and billing@ accounts of Switzernet [090309 ii]

 

Charts of the staff replies collected with IMAP search commands [091017 ii]

Charts of the staff replies collected with Thunderbird search (discontinued) [090822 ii]

Charts of customer emails by classes to billing and support accounts [090309 ii]

Migration of IMAP accounts of Support and Billing questions to dedicated distributed servers [091004 ii]

 

 

*   *   *

Copyright © 2009, Switzernet

www.switzernet.com