Web Scraping

Martin Weck

This blog post demonstrates a simple web scraping example using four different tools. A short comparison of the four is provided at the end.

HtmlUnit

HtmlUnit is a “GUI-Less browser for Java programs”. The HtmlUnit browser can simulate Chrome, Firefox or Internet Explorer behaviour. It is a lightweight solution that doesn’t have too many dependencies. Generally, it supports JavaScript and Cookies, but in some cases it may fail (e.g., giving an error message that says, “activate cookies in your web browser”, although cookie support is switched on in HtmlUnit). HtmlUnit is used for testing, web scraping, and is the basis for other tools.

Usage

Add the following Maven dependency to your project:

<dependency>
 <groupId>net.sourceforge.htmlunit</groupId>
 <artifactId>htmlunit</artifactId>
 <version>2.19</version>
</dependency>

The following example uses the search bar on the arc42 wiki website to search for all entries that contain the expression Requirements. It clicks on the Next link to use pagination to scrape over several pages:

package com.innoq.sample.htmlunit;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.logging.Level;
import java.util.stream.Collectors;

import org.apache.commons.logging.LogFactory;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class HtmlUnitExample {
    // define usage of firefox, chrome or IE
    private static final WebClient webClient = new WebClient(BrowserVersion.CHROME);

    public static void main(String[] args) {
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setTimeout(2000);
        webClient.getOptions().setUseInsecureSSL(true);
        // overcome problems in JavaScript
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.NoOpLog");
        java.util.logging.Logger
           .getLogger("com.gargoylesoftware.htmlunit")
           .setLevel(Level.OFF);
        java.util.logging.Logger
           .getLogger("org.apache.commons.httpclient")
           .setLevel(Level.OFF);
        java.util.logging.Logger
          .getLogger("com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter")
          .setLevel(Level.OFF);
        java.util.logging.Logger
          .getLogger("com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject")
          .setLevel(Level.OFF);
        java.util.logging.Logger
          .getLogger("com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument")
          .setLevel(Level.OFF);
        java.util.logging.Logger
          .getLogger("com.gargoylesoftware.htmlunit.html.HtmlScript")
          .setLevel(Level.OFF);
        java.util.logging.Logger
          .getLogger("com.gargoylesoftware.htmlunit.javascript.host.WindowProxy")
          .setLevel(Level.OFF);
        java.util.logging.Logger
          .getLogger("org.apache")
          .setLevel(Level.OFF);
        HtmlPage page = null;
        try {
            List<HtmlAnchor> searchResults = new ArrayList<>();
            page = webClient
                    .getPage("http://confluence.arc42.org/display/templateEN/" +
                             "arc42+Template+%28English%29+-+Home");
            final HtmlForm searchForm = htmlElementByXPath(page,
                    "//form[contains(@action,'pagetreesearch')]",
                    HtmlForm.class).get();
            final HtmlInput searchField = htmlElementByXPath(page,
                    "//input[@name='queryString' and contains(@class ,'medium-field')]",
                    HtmlInput.class).get();
            searchField.setValueAttribute("Requirements");
            HtmlSubmitInput submit = (HtmlSubmitInput) searchForm
                    .getElementsByAttribute("input", "type", "submit")
                    .get(0);
            page = submit.click();
            // assigned in the loop body; an Optional should never be null itself
            Optional<HtmlAnchor> nextLink;
            do {
                searchResults.addAll(
                     page.getByXPath("//a[contains(@class,'search-result-link')]")
                     .stream().map(HtmlAnchor.class::cast)
                     .collect(Collectors.toList()));
                nextLink = htmlElementByXPath(page, "//a[@class='pagination-next']",
                      HtmlAnchor.class);
                if (nextLink.isPresent()) {
                    page = nextLink.get().click();
                }
            } while (nextLink.isPresent());
            searchResults.stream().map(a -> a.getTextContent() + "->"
                         + a.getAttribute("href"))
                         .forEach(System.out::println);
        } catch (FailingHttpStatusCodeException | IOException e) {
            e.printStackTrace();
        }
    }

    public static <T> Optional<T>  htmlElementByXPath(
           DomNode node,
           String xpath,
           Class<T> type) {
        return node.getByXPath(xpath).stream()
               .filter(o->type.isAssignableFrom(o.getClass()))
               .map(type::cast).findFirst();
    }
}

By default, HtmlUnit is verbose: it logs a lot of information about CSS and JavaScript issues, which makes it suitable for testing. These logs can be controlled through a logging framework (e.g., log4j) with an appropriate configuration. The example above demonstrates how they can be switched off in the source code instead.

The WebClientOptions object of the WebClient, which represents the browser, allows various settings. The example enables cookies and JavaScript, sets a timeout for loading pages, ignores SSL problems, and handles errors tolerantly.

The actual scraping starts with the call to webClient.getPage(), which loads the website. The returned page object is the root of the DOM tree, which can be traversed using XPath. Unfortunately, the getByXPath() method returns a list of objects: even if it is clear that only one object will match, the user has to extract the first list element. Moreover, the returned objects need to be cast to the appropriate type; the htmlElementByXPath() helper in the example wraps both steps. HtmlUnit provides a class for each HTML tag (e.g., HtmlForm, HtmlInput, HtmlButton, HtmlAnchor). The click() method returns the next page once loading has finished.

HtmlUnit is used without a GUI. Where a GUI is needed, other libraries such as Selenium might be an alternative.
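
The "take the first hit from a list and cast it" step is not specific to HtmlUnit. The following minimal sketch shows the same pattern with the XPath API that ships with the JDK instead of HtmlUnit, so it runs without any dependency. The class name XPathFirstMatch and the tiny sample document are assumptions made for illustration only; unlike HtmlUnit, the JDK parser only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.Optional;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathFirstMatch {

    /** Parses a well-formed XML/XHTML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse document", e);
        }
    }

    /** Returns the first node that matches the XPath and has the requested type. */
    static <T extends Node> Optional<T> first(Node root, String xpath, Class<T> type) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, root, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Node hit = hits.item(i);
                if (type.isInstance(hit)) {
                    return Optional.of(type.cast(hit));
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
                + "<a class='pagination-next' href='/page2'>Next</a>"
                + "</body></html>");
        Optional<Element> next =
                first(doc, "//a[@class='pagination-next']", Element.class);
        System.out.println(next.map(a -> a.getAttribute("href")).orElse("no next page"));
    }
}
```

Returning an Optional keeps the "nothing matched" case explicit for the caller, which is the same design choice the htmlElementByXPath() helper makes.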

Proxy

Normally, if a Java program runs behind a proxy, it is sufficient to tell the JVM which proxy server to contact when it connects to the Internet via HTTP:

java -Dhttp.proxyHost=192.168.0.0 -Dhttp.proxyPort=4711 -jar myApp.ja

In the case of HtmlUnit, a special ProxyConfig object needs to be configured so that the setting is taken into account. Assuming that the proxy has been configured via the command line as shown above, we can configure HtmlUnit’s WebClient like this:

// requires: import com.gargoylesoftware.htmlunit.ProxyConfig;
final String proxyHost = System.getProperty("http.proxyHost");
final String proxyPort = System.getProperty("http.proxyPort");
ProxyConfig proxyConfig = new ProxyConfig(proxyHost, Integer.parseInt(proxyPort));
webClient.getOptions().setProxyConfig(proxyConfig);
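
Note that Integer.parseInt() throws a NumberFormatException if http.proxyPort has not been set. The following sketch, which uses only the JDK, shows one possible way to read both properties defensively before handing them to HtmlUnit. The class name ProxySettings and its fields are hypothetical and not part of HtmlUnit.

```java
import java.util.Optional;
import java.util.Properties;

final class ProxySettings {
    final String host;
    final int port;

    private ProxySettings(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /**
     * Reads http.proxyHost and http.proxyPort from the given properties.
     * Returns an empty Optional if either is missing or the port is not
     * a number between 1 and 65535.
     */
    static Optional<ProxySettings> from(Properties props) {
        String host = props.getProperty("http.proxyHost");
        String port = props.getProperty("http.proxyPort");
        if (host == null || host.trim().isEmpty() || port == null) {
            return Optional.empty();
        }
        try {
            int parsed = Integer.parseInt(port.trim());
            if (parsed < 1 || parsed > 65535) {
                return Optional.empty();
            }
            return Optional.of(new ProxySettings(host.trim(), parsed));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<ProxySettings> proxy = from(System.getProperties());
        System.out.println(proxy
                .map(p -> "proxy " + p.host + ":" + p.port)
                .orElse("no proxy configured"));
        // With HtmlUnit on the classpath, the values would then be passed on
        // as in the snippet above:
        // proxy.ifPresent(p -> webClient.getOptions()
        //         .setProxyConfig(new ProxyConfig(p.host, p.port)));
    }
}
```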

SeleniumHQ

Selenium is a set of tools that automates browsers. Its major use case is testing websites, but it can also be used for web scraping. Selenium starts a web browser with a GUI window, which makes headless tests harder. On the other hand, a GUI window makes it easier to trace the cause of a failure during the scraping process. Moreover, the browser provides full JavaScript and CSS support. Besides Java, Selenium supports C#, Ruby, Python and JavaScript.

Usage

Install chrome-driver by copying the downloaded executable to a directory of your choice. Add the following Maven dependency to your project:

<dependency>
 <groupId>org.seleniumhq.selenium</groupId>
 <artifactId>selenium-java</artifactId>
 <version>2.49.1</version>
</dependency>

Like the HtmlUnit example, the following code uses the search bar on the arc42 wiki website to fetch all links that contain the expression Requirements:

package com.innoq.sample.selenium;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SeleniumExample {
    private WebDriver driver;

    public SeleniumExample(boolean mobile) {
        System.setProperty("webdriver.chrome.driver",
                           "/home/martin/Documents/chromedriver");
        if (mobile) {
            final DesiredCapabilities dc = DesiredCapabilities.chrome();
            dc.setCapability(ChromeOptions.CAPABILITY, new ChromeOptions() {
                {
                    setExperimentalOption("mobileEmulation", new HashMap<String, Object>() {
                        private static final long serialVersionUID =
                            6294953054374614483L;
                        {
                            put("deviceName", "Google Nexus 5");
                        }
                    });
                }
            });
            this.driver = new ChromeDriver(dc);
        } else {
            this.driver = new ChromeDriver();
        }
    }

    private void scrape() {
        driver.get("http://confluence.arc42.org/display/templateEN/"
         + "arc42+Template+%28English%29+-+Home");
        List<String> searchResults = new ArrayList<>();
        try {
            WebElement form = driver.findElement(By.name("pagetreesearchform"));
            WebElement searchElement = form.findElement(By.name("queryString"));
            searchElement.sendKeys("Requirements");
            form.submit();
            int currentPageIndex = 1;
            do {
                searchResults.addAll(driver.findElements(
                        By.xpath("//a[contains(@class,'search-result-link')]"))
                        .stream().map(a -> a.getText() + "->"
                        + a.getAttribute("href"))
                        .collect(Collectors.toList()));

            } while (jumpToNextPage(++currentPageIndex));
        } catch (NoSuchElementException e) {
            e.printStackTrace();
        }
        searchResults.stream().sorted().forEach(System.out::println);
        // quit() closes all browser windows and ends the WebDriver session
        driver.quit();

    }

    private boolean jumpToNextPage(int nextPageIndex) {
        boolean found = false;
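        // switch the implicit wait off temporarily, so that a missing
        // "next" link is detected at once; it is set to 3s again below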
        driver.manage().timeouts().implicitlyWait(0, TimeUnit.MILLISECONDS);
        List<WebElement> nextLink = driver.findElements(
            By.className("pagination-next"));
        if (nextLink.size() > 0) {
            nextLink.get(0).click();
            final WebDriverWait wait = new WebDriverWait(driver, 1);
            wait.until(ExpectedConditions.textToBePresentInElementLocated(
                      By.className("pagination-curr"),
                      String.valueOf(nextPageIndex)));
            found = true;
        }
        driver.manage().timeouts().implicitlyWait(3, TimeUnit.SECONDS);
        return found;
    }

    public static void main(String[] args) {
        new SeleniumExample(false).scrape();
    }
}

The example assumes the downloaded chrome-driver executable to be located at /home/martin/Documents/chromedriver. The WebDriver represents the browser. Its get() method opens the given page in the browser. driver.findElement() returns the first matching WebElement, i.e., a node in the DOM tree. If no such element exists, a NoSuchElementException is thrown. To avoid this exception, the user may call driver.findElements(), which returns a list of all elements that match the given search criteria. If the list is empty, nothing has been found.

The implicit wait determines how long Selenium keeps looking for an element that does not exist yet, so that a page that is still loading does not normally hinder the retrieval of elements. The example works with an implicit wait of 3 seconds (WebDriver itself starts with an implicit wait of 0 unless one is configured). The method jumpToNextPage() demonstrates how this value can be overridden. To wait after a click() on an element, a WebDriverWait object can be created with a timeout in seconds as a parameter. Its until() method lets the driver wait until the condition passed to it is fulfilled, e.g., until an element exists or contains a given text. There are several search criteria, represented by the By object (e.g., by name, class, or id attribute of HTML tags, by HTML tag name, by (partial) link text, or by an XPath expression). If the returned WebElement belongs to a form (i.e., it is the form or any of its sub-elements), the submit() method can be called to submit the form, instead of using click().
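
The idea behind such an explicit wait is to poll a condition until it holds or a timeout expires. The following stand-alone sketch illustrates that idea with JDK classes only. It is not Selenium’s implementation: the class name PollingWait and its parameters are assumptions, and unlike WebDriverWait, which throws an exception on timeout, this sketch simply returns false.

```java
import java.util.function.BooleanSupplier;

final class PollingWait {

    /**
     * Evaluates the condition repeatedly, sleeping pollMillis between
     * attempts, until it returns true or timeoutMillis has elapsed.
     * Returns true if the condition was met in time, false otherwise.
     */
    static boolean until(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // preserve the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // stand-in for "the pagination element shows the expected page index"
        boolean ready = until(() -> System.currentTimeMillis() - start >= 200, 1000, 50);
        System.out.println("condition met: " + ready);
    }
}
```

Polling with a deadline avoids a fixed sleep: the caller continues as soon as the condition holds and waits the full timeout only in the failure case.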

WebDriver may also simulate a mobile browser. The constructor demonstrates how a specific mobile device can be emulated. Besides the ChromeDriver shown here, a FirefoxDriver, OperaDriver, AndroidDriver and iOSDriver are available.

More information on how to use Selenium can be found in the Selenium project documentation.

Headless Mode

If you use an X Window based system, you can send the GUI output to a virtual display that is never shown, which allows Selenium to be used in headless mode. To do so, replace your X server with Xvfb. The X virtual framebuffer can be installed and started on Ubuntu as follows:

$ sudo apt-get install xvfb x11-xkb-utils libxrender-dev libxtst-dev libgtk2.0-0
$ Xvfb :99 -ac &
$ export DISPLAY=:99.0

Once Xvfb is running and the DISPLAY variable has been set to :99.0 (or any other display number), the Java application that uses Selenium can be started from the shell in which the variable has been set.
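
Instead of exporting DISPLAY in the shell, the variable can also be set for a single child process only. The following sketch uses the JDK’s ProcessBuilder for this purpose. It assumes that Xvfb is already running on display 99; the class name HeadlessLauncher and the jar name scraper.jar are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class HeadlessLauncher {

    /**
     * Prepares (but does not start) a child process that inherits the
     * current environment and additionally sees the given DISPLAY value.
     */
    static ProcessBuilder withDisplay(String display, List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().put("DISPLAY", display);
        // forward the child's output to this process's console
        builder.inheritIO();
        return builder;
    }

    public static void main(String[] args) {
        ProcessBuilder builder = withDisplay(":99.0",
                Arrays.asList("java", "-jar", "scraper.jar"));
        System.out.println(builder.command()
                + " with DISPLAY=" + builder.environment().get("DISPLAY"));
        // builder.start() would launch the scraper against the virtual display
    }
}
```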

ui4j

ui4j is a Java 8 library based on the JavaFX Webkit Engine that allows automatic access to web pages for testing or scraping.

Usage

Add the following Maven dependency to your project:

<dependency>
 <groupId>com.ui4j</groupId>
 <artifactId>ui4j-all</artifactId>
 <version>2.1.0</version>
</dependency>

Similar to the HtmlUnit example, the following code uses the search bar on the arc42 wiki website to fetch all links that contain the expression Requirements:

package com.innoq.sample.ui4j;

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import com.ui4j.api.browser.BrowserEngine;
import com.ui4j.api.browser.BrowserFactory;
import com.ui4j.api.browser.BrowserType;
import com.ui4j.api.browser.Page;
import com.ui4j.api.dom.Document;
import com.ui4j.api.dom.Element;

public class Ui4jExample {

    private static BrowserEngine webkit = BrowserFactory.getBrowser(BrowserType.WebKit);

    public static void main(String[] args) {
        //load page
        Page page = webkit
                .navigate("http://confluence.arc42.org/display/templateEN/"
                + "arc42+Template+%28English%29+-+Home");
        Document document = page.getDocument();
        //display page in GUI
        page.show();
        //do the search
        Element searchFormElement =
           document.query("form[action*='pagetreesearch']").get();
        searchFormElement.query("input[name='queryString']")
           .get().setValue("Requirements");
        searchFormElement.getForm().get().submit();
        Map<String, String> descriptionsByLinks = new HashMap<>();
        Optional<Element> nextPage = Optional.empty();
        //for each page
        do {
            if (nextPage.isPresent()) {
                nextPage.get().click();
            }
            wait1Second();
            //put links with description from current page into map
            document.queryAll(".search-result-link").forEach(a -> {
                descriptionsByLinks.put(a.getAttribute("href").get(),
                                        a.getText().get());
            });

            //get next page link
            Optional<Element> pageLink = document.query(".pagination-next");
            if (pageLink.isPresent() && pageLink.get().getText().isPresent()
                    && "Next".equalsIgnoreCase(pageLink.get().getText().get())) {
                nextPage = pageLink;
            } else {
                nextPage = Optional.empty();
            }
        } while (nextPage.isPresent());
        //print result
        descriptionsByLinks.entrySet().stream()
                .sorted(Map.Entry.<String, String> comparingByValue())
                .forEach(e -> System.out.println(e.getValue() + " -> " + e.getKey()));
        page.close();
        webkit.shutdown();
    }

    private static void wait1Second() {
        try {
            Thread.sleep(1000L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

}

Calling webkit.navigate(uri) opens the WebKit browser on the given URI and returns a com.ui4j.api.browser.Page object that holds a com.ui4j.api.dom.Document. This Document object serves as the root for jQuery expressions. The user can search by calling document.query(), which returns a java.util.Optional with the first hit, or alternatively via document.queryAll(), which provides all hits. Once navigated to a sub-node in the tree, the user can restrict the search to its subtree by calling query() or queryAll() on the current node (e.g., searchFormElement.query() in the example above).

The consistent use of java.util.Optional by ui4j replaces checks against null with Optional.isPresent() checks. Technically, it is possible either to call element.click(), where element is a button or link, or to call form.submit() as the proper way to post the form. Note that the example above uses the element.getAttribute() method to fetch the href value, instead of using jQuery.
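
Chains of isPresent() checks, like the one that decides whether there is a Next link in the example above, can often be condensed with Optional.filter(). The following sketch shows this with JDK classes only. The nested Link class is a hypothetical stand-in for a ui4j element and mimics only the getText() call used above.

```java
import java.util.Optional;

final class OptionalChaining {

    /** Hypothetical stand-in for a DOM element whose text may be absent. */
    static final class Link {
        private final String text;

        Link(String text) {
            this.text = text;
        }

        Optional<String> getText() {
            return Optional.ofNullable(text);
        }
    }

    /** Keeps the link only if its text is present and equals "Next", ignoring case. */
    static Optional<Link> nextPage(Optional<Link> pageLink) {
        return pageLink.filter(link ->
                link.getText().filter("Next"::equalsIgnoreCase).isPresent());
    }

    public static void main(String[] args) {
        System.out.println(nextPage(Optional.of(new Link("next"))).isPresent());
        System.out.println(nextPage(Optional.of(new Link("Previous"))).isPresent());
        System.out.println(nextPage(Optional.empty()).isPresent());
    }
}
```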

More examples can be found on the project’s GitHub site.

Headless Mode

It is also possible to execute ui4j without using a GUI. Add the following Maven dependency:

<dependency>
 <groupId>org.jfxtras</groupId>
 <artifactId>openjfx-monocle</artifactId>
 <version>1.8.0_20</version>
 <scope>test</scope>
</dependency>

In addition, the calls to page.show() and page.close() need to be removed from the example above. Moreover, the JVM has to be started with the parameter -Dui4j.headless; otherwise a java.lang.UnsupportedOperationException: Unable to open DISPLAY is thrown if no X Window display is available. If JavaFX does not find any fonts, it may throw an Exception in thread "JavaFX Application Thread" java.lang.ExceptionInInitializerError with the message Error: JavaFX detected no fonts! Please refer to release notes for proper font configuration. To circumvent this problem, add the argument -Dprism.useFontConfig=false. Another way to run ui4j in headless mode is to use Xvfb as described in the Selenium section above.

Jaunt

Jaunt stands for Java Web Scraping & JSON Querying. It does not support JavaScript, but it is extremely fast. Even though its website states the opposite, it is not a free library: the jar file provided on its download page can be used free of charge for one month, whereas a jar for longer-term use costs money. The library cannot be used with a GUI. A detailed tutorial is available. Since Jaunt is not based on a WebKit browser, it gives direct access to the HTTP layer, which eases the handling of REST calls. Its support for parsing JSON payloads is a plus. Instead of relying on XPath or CSS selectors, Jaunt keeps its selectors as short as possible to reduce their sensitivity to structural changes in the DOM tree. The next section demonstrates that Java code using Jaunt is very concise.

Usage

Add the downloaded jar to your project.

package com.innoq.sample.jaunt;

import com.jaunt.Element;
import com.jaunt.NotFound;
import com.jaunt.ResponseException;
import com.jaunt.SearchException;
import com.jaunt.UserAgent;
import com.jaunt.component.Form;

public class JauntExample {

    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        try {
            userAgent.visit("http://confluence.arc42.org/display/templateEN/"
                  + "arc42+Template+%28English%29+-+Home");
            Form form = userAgent.doc.getForm(0);
            form.setTextField("queryString", "Requirements");
            form.submit();
            String nextLink = null;
            do {
                for (Element e : userAgent.doc.findEach("<a data-type=page>")) {
                    System.out.println(e.getText() + "->" + e.getAt("href"));
                }
                try {
                    nextLink = userAgent.doc.findFirst("<a class=pagination-next>")
                               .getAt("href").replace("&amp;", "&");
                } catch (NotFound e) {
                    break;
                }
                userAgent.visit(nextLink);
            } while (true);
        } catch (ResponseException | SearchException e) {
            e.printStackTrace();
        }
    }
}

The userAgent represents the browser. Its visit() method opens a site, whose document is then available through the userAgent’s doc property. The example shows that a form can simply be retrieved by specifying its index. Alternatively, a query like Form form = userAgent.doc.getForm("<form name=pagetreesearchform>"); retrieves it by name. If the submit button is unambiguous, it is sufficient to call submit() on the form without a parameter; otherwise, the label of the button can be passed as a parameter to the submit() method (e.g., submit("Search")). A disadvantage is the heavy use of exceptions: instead of returning an Optional or null when an element cannot be found, Jaunt throws exceptions that need to be handled.
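
If the exception handling becomes too noisy, lookups that throw can be wrapped once and exposed as Optionals. The following sketch demonstrates this pattern with JDK classes only. The class Lookups, the NotFoundException type and the find() method are hypothetical stand-ins and are not part of the Jaunt API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;

final class Lookups {

    /** Hypothetical checked exception that signals a failed lookup. */
    static final class NotFoundException extends Exception {
        private static final long serialVersionUID = 1L;

        NotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical lookup that throws instead of returning null. */
    static String find(Map<String, String> hrefsByClass, String cssClass)
            throws NotFoundException {
        String href = hrefsByClass.get(cssClass);
        if (href == null) {
            throw new NotFoundException("no element with class " + cssClass);
        }
        return href;
    }

    /**
     * Runs the lookup and turns a NotFoundException into an empty Optional.
     * Any other checked exception is rethrown as an IllegalStateException.
     */
    static <T> Optional<T> optional(Callable<T> lookup) {
        try {
            return Optional.ofNullable(lookup.call());
        } catch (NotFoundException e) {
            return Optional.empty();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> links = Collections.singletonMap("pagination-next", "/page2");
        Optional<String> next = optional(() -> find(links, "pagination-next"));
        Optional<String> missing = optional(() -> find(links, "pagination-prev"));
        System.out.println(next.orElse("none") + " / " + missing.orElse("none"));
    }
}
```

With such a wrapper, a pagination loop can test for the presence of a next link instead of breaking out of the loop from a catch block.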

Comparison

Based on the above observations, the following comparison table can be derived (X supported, - not supported, L limited).

Tool        GUI  No GUI  JavaScript/CSS  Free  Fast  XPath Selector  jQuery Selector  CSS Selector  HTML Selector
HtmlUnit    -    -       L               X     -     X               -                -             X
SeleniumHQ  X    L       X               X     -     X               -                X             X
ui4j        X    X       X               X     -     -               X                X             -
Jaunt       -    X       -               -     X     -               -                -             X

There is also a small project available on GitHub that extends HtmlUnit to support CSS selectors and, to a limited extent, jQuery querying.

Martin Weck is a software engineer at innoQ with several years of development experience in large enterprise projects. His technical focus is on Java/JavaScript microservice solutions.
