
New Features in Web-Harvest 2.0

Collect log files, create ZIP archive and store it on FTP server

Here, the new listing feature of the file processor collects the names of files that match a specified pattern. Those files are then packed into a single ZIP archive, and finally the archive is sent to the specified FTP server.

<config>
    <ftp server="my.ftp.server" username="myname" password="mypassword">
        <ftp-put path="logs_${sys.date()}.zip">
            <zip>
                <loop item="logFileName" empty="true">
                    <list>
                        <file action="list" path="c:/logs/" listfilter="20??-??-??.log"/>
                    </list>
                    <body>
                        <zip-entry name="${sys.getFilename(logFileName.toString())}">
                            <file action="read" path="${logFileName}"/>
                        </zip-entry>
                    </body>
                </loop>
            </zip>
        </ftp-put>
    </ftp>
</config>
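
To check the archiving step without an FTP server, the same zip body can be written to a local file first. This is only a sketch: it assumes the file processor accepts action="write" together with type="binary" for binary content, and the output path c:/temp/logs.zip is a placeholder that does not appear in the original example.

<config>
    <!-- Assumption: type="binary" makes the file processor write the raw archive bytes -->
    <!-- c:/temp/logs.zip is a placeholder path -->
    <file action="write" type="binary" path="c:/temp/logs.zip">
        <zip>
            <loop item="logFileName" empty="true">
                <list>
                    <file action="list" path="c:/logs/" listfilter="20??-??-??.log"/>
                </list>
                <body>
                    <zip-entry name="${sys.getFilename(logFileName.toString())}">
                        <file action="read" path="${logFileName}"/>
                    </zip-entry>
                </body>
            </loop>
        </zip>
    </file>
</config>

Once the local archive looks right, the file element can be swapped back for the ftp and ftp-put elements.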

Upload each employee's info from database table to a web server

Employee records are read from a database table, and for each record the employee's information, including an image to upload, is submitted to a web server. The new Web-Harvest 2.0 features used here are database access and file upload with the http processor.

<config>
    <loop item="emp" empty="true">
        <list>
            <database connection="jdbc:mysql://myserver/mydb" 
                jdbcclass="com.mysql.jdbc.Driver" 
                username="myusername" 
                password="mypassword">
                select firstname, lastname, email, image 
                from employee
            </database> 
        </list>
        <body>
            <http url="http://www.my.users/register.html" method="post" multipart="true">
                <http-param name="fname">
                    <template>${emp.get("firstname")}</template>
                </http-param>
                <http-param name="lname">
                    <template>${emp.get("lastname")}</template>
                </http-param>
                <http-param name="email">
                    <template>${emp.get("email")}</template>
                </http-param>
                <http-param name="pic" isfile="true">
                    <script return='emp.get("image").toBinary()'></script>
                </http-param>
            </http>
        </body>
    </loop>
</config>

Find well-rated films and send email with images and previews

The list of films is downloaded from tvguide.com and filtered to keep only the well-rated ones (4 stars or more). For each of them, the review page is visited and the film photo and a short text are extracted. All this information is composed into an HTML table and sent to a Gmail account. The new mail feature is used here together with email attachments.

<config>
    <mail from="tvguide@popularfilms.com" smtp-host="smtp.gmail.com" smtp-port="25" type="html"
          to="myaccount@gmail.com" username="myaccount" password="mypassword" security="tsl" 
          subject="Best rated films from TV Guide">
        <loop item="link" index="index">
            <list>
                <xpath expression='//div[@class="toplist-w"]//tr[count(.//image[@class="stars" and ends-with(@src, "ColorStar.gif")]) >= 4]//a[1]'>
                    <html-to-xml>
                        <http url="http://www.tvguide.com/top-movies"/>
                    </html-to-xml>
                </xpath>
            </list>
            <body>
                <empty>
                    <var-def name="page">
                        <html-to-xml omitunknowntags="true">
                            <http url='${sys.xpath("//@href", link.toString())}'/>
                        </html-to-xml>
                    </var-def>
                    <var-def name="photourl">
                        <xpath expression='//div[@class="obj-review-pic"]//img[1]/@src'>
                            <var name="page"/>
                        </xpath>
                    </var-def>
                </empty>
 
                <template>
                    <![CDATA[ 
                        <h3> ${ index + ". " + sys.xpath("data(.)", link.toString()) } </h3>
                        <div><table><tr><td>
                    ]]>
                    <case>
                        <if condition='${!photourl.toString().equals("")}'>
                            <![CDATA[ <img src="]]>
                                <mail-attach inline="true">
                                    <http url="${photourl}"/>
                                </mail-attach>
                            <![CDATA[ "> ]]>
                        </if>
                    </case>
                    <![CDATA[
                        </td>
                        <td valign='top'>${ sys.xpath("//div[@class='obj-review-recap']/span[1]/text()", page.toString()) }</td>
                        </tr></table></div>
                        <hr>
                    ]]>
                </template>
            </body>
        </loop>
    </mail>
</config>

Download Dilbert comics and store them to database

This example illustrates database inserts, including storing downloaded images in BLOB (Binary Large OBject) fields. A page showing the specified number of strips is downloaded, then the image URLs, numbers of votes, and ratings are extracted with XPath and inserted into the table.

<config>
    <var-def name="count" overwrite="false">50</var-def>
    <loop item="node" empty="true">
        <list>
            <xpath expression="//div[@class='STR_Strip_Full']">
                <html-to-xml>
                    <http url="http://www.dilbert.com/strips/?ViewType=Full&amp;PerPage=${count}"/>
                </html-to-xml>
            </xpath>
        </list>
        <body>
            <var-def name="imgUrl">
                <xpath expression="//div[@class='STR_Content']/a[1]/img[1]/@src"><var name="node"/></xpath>
            </var-def>
            <database connection="jdbc:mysql://myserver/mydb" jdbcclass="com.mysql.jdbc.Driver" username="myuser" password="mypass">
                insert into dilbert (rating, votes, img)
                values (
                    <xpath expression="substring-before( substring-after(//div[@class='STR_Footer']//script[1]/text(), 'curvalue: '), '}' )"><var name="node"/></xpath>,
                    <xpath expression="data(//div[@class='STR_Metric STR_VoteCount'])"><var name="node"/></xpath>, 
                    <db-param>
                        <http url='${sys.fullUrl("http://www.dilbert.com", imgUrl.toString())}'/>
                    </db-param>
                )
            </database> 
        </body>
    </loop>
</config>
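
One way to check the inserted rows is to read them back with the same database processor and append them to a text file. This sketch makes several assumptions: the variable names dbUrl and dbDriver and the path c:/temp/dilbert_check.txt are invented for illustration, attribute values are taken to be evaluated as templates, and sys.lf is taken to be the line-feed constant.

<config>
    <!-- Hypothetical variables that keep the connection settings in one place -->
    <var-def name="dbUrl">jdbc:mysql://myserver/mydb</var-def>
    <var-def name="dbDriver">com.mysql.jdbc.Driver</var-def>

    <loop item="row" empty="true">
        <list>
            <database connection="${dbUrl}" jdbcclass="${dbDriver}" username="myuser" password="mypass">
                select rating, votes from dilbert
            </database>
        </list>
        <body>
            <!-- Appends one line per row; the output path is a placeholder -->
            <file action="append" path="c:/temp/dilbert_check.txt">
                <template>${row.get("rating")} ${row.get("votes")}${sys.lf}</template>
            </file>
        </body>
    </loop>
</config>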