Modern HTML querying with familiar CSS syntax
WebHarvest 2.2 supports CSS selectors as a simpler alternative to XPath for querying HTML, powered by the jsoup library.
CSS selector support in the XPath plugin
Uses standard CSS selector syntax, not proprietary extensions
Shorter expressions: a.link instead of //a[@class='link']
Same syntax as jQuery and CSS, so web developers already know it
XPath still works! CSS selectors are an optional alternative
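To make the comparison concrete, here is a minimal sketch that runs the same query both ways. It assumes ${html} already holds the page markup, as in the other examples on this page, and that omitting type leaves the element in its default XPath mode (implied by "XPath still works", but not verified here). The two short forms are not strictly equivalent: a.link matches any anchor whose class list contains link, while //a[@class='link'] matches only when the class attribute is exactly link.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- CSS mode: matches <a class="link"> and also <a class="link external"> -->
    <xpath type="css" expression="a.link">
        ${html}
    </xpath>
    <!-- XPath mode (assumed default when type is omitted): exact attribute match only -->
    <xpath expression="//a[@class='link']">
        ${html}
    </xpath>
    <!-- XPath that tests for the class token, closer to what a.link does -->
    <xpath expression="//a[contains(concat(' ', normalize-space(@class), ' '), ' link ')]">
        ${html}
    </xpath>
</config>
```

The third expression shows why the CSS form is attractive for class matching: the token-aware XPath is correct but considerably harder to read.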
How to use CSS selectors in WebHarvest
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- CSS Selector Mode: type="css" -->
    <xpath type="css" expression="a.product-link">
        ${html}
    </xpath>
    <!-- Extract text from all matching elements -->
    <xpath type="css" expression="h1.title">
        ${html}
    </xpath>
    <!-- Get attribute value -->
    <xpath type="css" expression="img.product" attribute="src">
        ${html}
    </xpath>
</config>
Add the type="css" attribute to the xpath element to use CSS selectors instead of XPath.
CSS selector examples for typical scraping tasks
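<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- Class selector: any element with class "product-title" -->
    <xpath type="css" expression=".product-title">${html}</xpath>
    <!-- Element plus class: only div elements with class "result" -->
    <xpath type="css" expression="div.result">${html}</xpath>
</config>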
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- ID selector: the element whose id is "main-content" -->
    <xpath type="css" expression="#main-content">${html}</xpath>
</config>
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- Attribute substring match: links whose href contains "product" -->
    <xpath type="css" expression="a[href*='product']">${html}</xpath>
    <!-- Attribute exact match: submit inputs -->
    <xpath type="css" expression="input[type='submit']">${html}</xpath>
</config>
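The same pattern extends to combinators and other attribute operators. The sketch below uses only standard CSS selector syntax that jsoup supports (descendant and child combinators, prefix and suffix attribute matches). The class names are hypothetical, and it assumes WebHarvest hands the expression to jsoup unchanged, which this page does not confirm.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- Descendant combinator: any h2 nested anywhere inside div.result -->
    <xpath type="css" expression="div.result h2">${html}</xpath>
    <!-- Child combinator: li elements that are direct children of ul.menu -->
    <xpath type="css" expression="ul.menu > li">${html}</xpath>
    <!-- Attribute prefix match, returning the href value (attribute="..." as in the img.product example) -->
    <xpath type="css" expression="a[href^='https://']" attribute="href">${html}</xpath>
    <!-- Attribute suffix match: images whose src ends in .png -->
    <xpath type="css" expression="img[src$='.png']" attribute="src">${html}</xpath>
</config>
```

The > character is legal inside an XML attribute value, so the child combinator needs no escaping; only < and & would have to be written as entities.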