Example: Bookmaker odds at expekt.com

This example extracts odds for soccer matches from the online bookmaker expekt.com. A short analysis of the page shows that the data is laid out in an HTML table with multiple rows, and that heading rows and date rows sit at the same table level as the rows carrying odds. The rows that contain odds data also have alternating CSS styles.

The central point of this configuration is the main loop, which iterates over the table rows and determines for each row whether it is a date row or an odds row. For the rows containing odds, an XQuery expression extracts the data from the individual HTML table cells.

<?xml version="1.0" encoding="UTF-8"?>
 
<!-- Updated on February 8th, 2011 -->
<config>
    <var-def name="league_name">
        <text>
            <xpath expression='//ul[@id="cat_445subUl"]/li//input[@class="inputhidden"]/@name'>
                <html-to-xml>
                    <http url="http://www.expekt.com/sports/"/>
                </html-to-xml>
            </xpath>
        </text>
    </var-def>
    
    <var-def name="url">
        <template>http://www.expekt.com/sports/odds.do?categoryCodes=${league_name.toString().replace("\n", "_")}&amp;datePeriod=1000000</template>
    </var-def>
 
    <file action="write" path="expekt/odds.xml">
        <template><![CDATA[ <odds time="${sys.datetime("dd.MM.yyyy, HH:mm:ss")}"> ]]></template>
            <xquery>
                <xq-param name="doc">
                    <html-to-xml>
                        <http url="${url}"/>
                    </html-to-xml>
                </xq-param>
                <xq-expression><![CDATA[
                    declare variable $doc as node() external;
                    for $odds_header in $doc//div[@class="odds_header"][./span[not(@class)]] 
                    let $date := data($odds_header/span[1]) 
                    for $match_row in $odds_header/following-sibling::div[@class="odds_body"][1]//tr[starts-with(@class,"odds_row")]
                    let $comp := normalize-space(data($match_row/preceding-sibling::tr[./td[@colspan="6"]][1]//span[@class="odds_row_header"]))
                    let $odds := $match_row//td[@class="m1X2"]//span[@class="right"]
                        return
                            <odd date="{$date}" 
                                 time="{data($match_row//span[@class='time'])}"
                                 event="{normalize-space(data($match_row/td[2]))}"
                                 comp="{$comp}">
                                <odd_1>{normalize-space(data($odds[1]))}</odd_1>
                                <odd_x>{normalize-space(data($odds[2]))}</odd_x>
                                <odd_2>{normalize-space(data($odds[3]))}</odd_2>
                            </odd>
                ]]></xq-expression>
            </xquery>
        <![CDATA[ </odds> ]]>
    </file>    
        
</config>

The result of the extraction is the file odds.xml (see odds.xml from October 27th, 2006).
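
Judging from the element constructor in the XQuery expression and the two templates that wrap it, the written file should have roughly the shape sketched below. This is derived from the configuration only, not copied from a real run: every attribute value and element text is a descriptive placeholder, and the actual date and odds formats depend on what the site serves.

```xml
<!-- Sketch of the expected layout of expekt/odds.xml; all values are placeholders. -->
<odds time="dd.MM.yyyy, HH:mm:ss">
    <!-- one <odd> element per table row whose class starts with "odds_row" -->
    <odd date="text of the first span in the odds header"
         time="text of the span with class 'time'"
         event="normalized text of the second cell"
         comp="text of the nearest preceding competition header">
        <odd_1>odds for a home win</odd_1>
        <odd_x>odds for a draw</odd_x>
        <odd_2>odds for an away win</odd_2>
    </odd>
    <!-- further <odd> elements follow, in document order -->
</odds>
```

The reading of odd_1, odd_x and odd_2 as home win, draw and away win is an assumption based on the usual 1X2 convention suggested by the "m1X2" cell class. The configuration itself only takes the first three matching spans in order.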