Web Scraping

This blog post demonstrates a simple web scraping example using three different tools. At the end, a short comparison of the three is provided.

HtmlUnit

HtmlUnit is a “GUI-Less browser for Java programs”. The HtmlUnit browser can simulate Chrome, Firefox or Internet Explorer/Edge behaviour. It is a lightweight solution that does not have too many dependencies. Generally, it supports JavaScript and Cookies. HtmlUnit is used for testing, web scraping, and is the basis for other tools.

Usage

Add the following Maven dependency to your project:

<dependency>
  <groupId>org.htmlunit</groupId>
  <artifactId>htmlunit</artifactId>
  <version>4.18.0</version>
</dependency>

The following example uses the search bar on the INNOQ website to search for all entries that contain the expression scraping:

package com.innoq.sample.htmlunit;

import java.io.IOException;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

import org.htmlunit.BrowserVersion;
import org.htmlunit.FailingHttpStatusCodeException;
import org.htmlunit.SilentCssErrorHandler;
import org.htmlunit.WebClient;
import org.htmlunit.html.DomNode;
import org.htmlunit.html.HtmlAnchor;
import org.htmlunit.html.HtmlButton;
import org.htmlunit.html.HtmlPage;
import org.htmlunit.html.HtmlTextInput;

public class HtmlUnitExample {
    // define usage of firefox, chrome or Edge
    private static final WebClient webClient = new WebClient(BrowserVersion.CHROME);
    
    public static void main(String[] args) {
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setTimeout(2000);
        webClient.getOptions().setUseInsecureSSL(true);
        // overcome problems in JavaScript
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        try {
            final HtmlPage page = webClient
                    .getPage("https://www.innoq.com/en/search/");
            final HtmlTextInput searchField = htmlElementById(page,
                    "q",
                    HtmlTextInput.class).get();
            // alternative to XPath: querySelector(".search-form__btn")
            final HtmlButton searchButton = htmlElementByXPath(page,
            		"//button[@class='search-form__btn' and @type='submit']",
            		HtmlButton.class).get();
            searchField.type("Scraping");
           
            final HtmlPage resultPage = searchButton.click();
            // in real life you may use LOGGER.debug()
            //  System.out.println("HTML source: " + resultPage.getWebResponse().getContentAsString());
            htmlElementsByCssClass(resultPage, ".search-result", HtmlAnchor.class).stream().forEach(System.out::println);
        } catch (FailingHttpStatusCodeException | IOException e) {
            e.printStackTrace();
        }
    }
    
    public static <T> Optional<T> htmlElementByXPath(
           DomNode node,
           String xpath,
           Class<T> type) {
        return node.getByXPath(xpath).stream()
               .filter(o->type.isAssignableFrom(o.getClass()))
               .map(type::cast).findFirst();
    }
    
    public static <T> List<T> htmlElementsByCssClass(
            DomNode node,
            String cssClass,
            Class<T> type) {
        return node.querySelectorAll(cssClass).stream()
                .filter(o->type.isAssignableFrom(o.getClass()))
                .map(type::cast).collect(Collectors.toList());
    }
    
    public static <T> Optional<T> htmlElementById(
            HtmlPage node,
            String htmlTagId,
            Class<T> type) {
        // skip the element instead of failing with a ClassCastException if it has another type
        return Optional.ofNullable(node.getElementById(htmlTagId))
                .filter(type::isInstance).map(type::cast);
    }
}

The above example demonstrates how HtmlUnit can be used with JavaScript. HtmlUnit was originally developed for testing, so by default JavaScript errors result in exceptions. As shown in the example, webClient.getOptions().setThrowExceptionOnScriptError(false) changes that behaviour (cf. JavaScript HowTo). The WebClientOptions object of the WebClient, which represents the browser, allows various other configurations. Besides JavaScript, the example enables cookies, sets a timeout for loading pages, and ignores SSL problems.

The actual scraping starts with the call to webClient.getPage(), which loads the website. The returned page object is the root of the DOM tree, which can be traversed using XPath. As multiple nodes may match a given XPath expression, getByXPath() returns a list of objects, so you need to filter the results and cast them to the expected type. With DomNode.querySelector() and DomNode.querySelectorAll(), HtmlUnit also provides a way to select elements by CSS selector. Of course, it is also possible to select elements by id or name. HtmlUnit provides a class for each HTML tag (e.g. HtmlForm, HtmlInput, HtmlButton, HtmlAnchor).
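The filter-and-cast step can be looked at in isolation. The following is a minimal sketch that uses only the JDK; the class TypedResults and its methods firstOfType() and allOfType() are hypothetical names chosen for illustration and are not part of HtmlUnit. It uses Class.isInstance(), which behaves like the isAssignableFrom() check in the example for non-null elements.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper, not part of HtmlUnit: it only mirrors the
// filter-and-cast step applied to a list of objects of mixed types.
final class TypedResults {

    // Returns the first element that is an instance of the requested type.
    static <T> Optional<T> firstOfType(List<?> candidates, Class<T> type) {
        return candidates.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .findFirst();
    }

    // Returns all elements of the requested type, preserving their order.
    static <T> List<T> allOfType(List<?> candidates, Class<T> type) {
        return candidates.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for a mixed result list; real results would be DOM nodes.
        List<Object> mixed = List.of("text node", 42, "another text node");
        System.out.println(firstOfType(mixed, Integer.class));
        System.out.println(allOfType(mixed, String.class));
    }
}
```

Filtering before casting means that an element of an unexpected type is skipped rather than causing a ClassCastException.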

In the above example, we load the search page of innoq.com, enter a search string, click the search button, and print the links to all content found. The click() method of the button returns the next page once it has finished loading. The HTML of the page can be printed for debugging purposes. HtmlUnit runs without a GUI; libraries like Selenium might be an alternative where a GUI is needed.

Proxy

Normally, if a Java program runs behind a proxy, it is sufficient to configure the JVM to use the proxy server when it connects to the Internet via HTTP:

java -Dhttp.proxyHost=192.168.0.0 -Dhttp.proxyPort=4711 -jar myApp.jar

In the case of HtmlUnit, a special ProxyConfig object needs to be configured so that the setting is taken into account. Assuming that the proxy has been configured via the command line as shown above, we can configure HtmlUnit’s WebClient like this:

final String proxyHost = System.getProperty("http.proxyHost");
final String proxyPort = System.getProperty("http.proxyPort");
ProxyConfig proxyConfig = new ProxyConfig(proxyHost, Integer.parseInt(proxyPort));
webClient.getOptions().setProxyConfig(proxyConfig);
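The snippet above fails with a NumberFormatException if http.proxyPort is not set or is not a number. A more defensive way to read the two properties could look like the following sketch, which uses only the JDK. The class ProxySettings and its methods are hypothetical names, and treating a missing or invalid value as "no proxy" is an assumption made for this example.

```java
import java.net.InetSocketAddress;
import java.util.Optional;

// Hypothetical helper: reads the standard JVM proxy properties defensively.
final class ProxySettings {

    // Reads http.proxyHost and http.proxyPort from the system properties.
    static Optional<InetSocketAddress> fromSystemProperties() {
        return parse(System.getProperty("http.proxyHost"),
                     System.getProperty("http.proxyPort"));
    }

    // Returns an empty Optional if the host or the port is missing or invalid
    // (assumption for this sketch: in that case no proxy is used).
    static Optional<InetSocketAddress> parse(String host, String port) {
        if (host == null || host.trim().isEmpty() || port == null) {
            return Optional.empty();
        }
        try {
            int portNumber = Integer.parseInt(port.trim());
            if (portNumber < 1 || portNumber > 65535) {
                return Optional.empty();
            }
            // createUnresolved() avoids a DNS lookup at configuration time.
            return Optional.of(
                    InetSocketAddress.createUnresolved(host.trim(), portNumber));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("192.168.0.0", "4711").isPresent());
        System.out.println(parse("192.168.0.0", "not-a-port").isPresent());
    }
}
```

The result could then be handed to the ProxyConfig constructor shown above, using getHostString() and getPort() of the returned address, and only when the Optional is present.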

Selenium

Selenium is a set of tools that automates browsers. Its major use case is testing websites, but it can also be used for web scraping. By default, Selenium starts a web browser with a GUI window, which makes headless tests harder. On the other hand, a GUI window makes it easier to trace any causes of failure during the scraping process. Moreover, the browser allows the full usage of JavaScript or CSS. Besides Java, Selenium supports C#, Ruby, Python, JavaScript, and Kotlin. Chromium/Chrome, Firefox, Internet Explorer/Edge, and Safari are supported.

Usage

Having a browser of choice installed that fits your OS, you need to download the appropriate driver for it. The driver version must match the version of the browser. In the following example, we download the chrome driver and copy the downloaded executable to a certain directory (e.g., /home/martin/Documents/). Add the following Maven dependency to your project:

<dependency>
  <groupId>org.seleniumhq.selenium</groupId>
  <artifactId>selenium-java</artifactId>
  <version>4.2.2</version>
</dependency>

Like the HtmlUnit example, the following code uses the search bar on the INNOQ website to fetch all links that contain the expression scraping.

package com.innoq.sample.selenium;

import java.time.Duration;
import java.util.Objects;
import java.util.stream.Collectors;

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SeleniumExample {
    private WebDriver driver;

    private WebDriver createWebDriver(final String uri) {
    	// you need a local browser driver or use a RemoteWebDriver that is e.g. in docker:
    	// cf. org.testcontainers.containers.GenericContainer
    	// docker image selenium/standalone-chrome:101.0
    	// browser driver: https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/
    	System.setProperty("webdriver.chrome.driver",  "/home/martin/Documents/chromedriver");
    	ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.addArguments("--headless");
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-gpu");
        WebDriver driver = new ChromeDriver(chromeOptions);
        driver.get(uri);
        driver.manage().deleteAllCookies();
        driver.manage().addCookie(new Cookie("cookieconsent_status", "deny"));
        return driver;
    }
    
    private void shutdown() {
    	if (Objects.nonNull(driver)) {
    		driver.quit();
    	}
    }
  
    private void scrape() {
    	this.driver = createWebDriver("https://www.innoq.com/en/search/");
        try {
            WebElement form = driver.findElement(By.className("search-form"));
            // alternative By.id("q")
            WebElement searchFormInput = form.findElement(By.name("q"));
            searchFormInput.sendKeys("scraping");
            WebElement searchFormSubmitButton = form.findElement(By.className("search-form__btn"));
            searchFormSubmitButton.click();
            // give the page time to load
            WebDriverWait w = new WebDriverWait(driver, Duration.ofSeconds(3L));
            // we wait max 3s, but if data is available then we go faster
            w.until(ExpectedConditions
            	      .elementToBeClickable(By.className("search-result")));
            // System.out.println("Page with search results: " + driver.getPageSource());
            // just to show selection by XPath, normally: By.className("search-result")
            driver.findElements(
                    By.xpath("//a[contains(@class,'search-result')]"))
                    .stream().map(a -> a.getText() + "->"
                    + a.getAttribute("href")).sorted()
                    .collect(Collectors.toList()).forEach(System.out::println);
        } catch (NoSuchElementException e) {
            e.printStackTrace();
        }
        driver.close();
    }

    public static void main(String[] args) {
    	SeleniumExample test = new SeleniumExample();
    	test.scrape();
    	test.shutdown();
    }
}

The example assumes the downloaded chrome-driver ELF file to be located in /home/martin/Documents/chromedriver. The WebDriver represents the browser. Its get() method opens the passed page in the browser. driver.findElement() returns a found WebElement, i.e., node in the DOM tree. If the element does not exist, a NoSuchElementException is thrown. To avoid this exception, the user may call driver.findElements(), which returns a list that contains all elements that match the given search criteria. If the list is empty, nothing has been found.

By default, Selenium’s get() issues an HTTP GET and returns once the page has been fully loaded. To wait after a click() on an element, create a WebDriverWait object with a timeout and pass the condition to wait for to its until() method. The driver then waits until the condition is met or the timeout expires, in which case a TimeoutException is thrown. Elements are located by search criteria represented by the By object: name, className (a single CSS class name), cssSelector (an arbitrary CSS selector, e.g. to match several classes at once), id, tagName, linkText or partialLinkText, and xpath. If a WebElement belongs to a form (i.e., it is the form or one of its sub-elements), its submit() method can be called to submit the form instead of calling click() on the submit button.
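Conceptually, waiting with a timeout means checking a condition repeatedly until it holds or a deadline has passed. The following JDK-only sketch illustrates that idea. It is not Selenium's implementation: the class PollingWait and its until() method are hypothetical, and unlike WebDriverWait this sketch returns an empty Optional on timeout instead of throwing an exception.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration of a polling wait; not part of Selenium.
final class PollingWait {

    // Polls the condition until it yields a value or the timeout expires.
    static <T> Optional<T> until(Supplier<Optional<T>> condition,
                                 Duration timeout,
                                 Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = condition.get();
            if (result.isPresent()) {
                return result; // condition met, return without further waiting
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // timeout expired
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Stand-in condition that succeeds on the third check.
        Optional<String> result = until(
                () -> ++attempts[0] >= 3
                        ? Optional.of("ready")
                        : Optional.<String>empty(),
                Duration.ofSeconds(3),
                Duration.ofMillis(50));
        System.out.println(result.orElse("timed out"));
    }
}
```

As in the Selenium example, the timeout is only an upper bound: the call returns as soon as the condition is met.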

To tweak the ChromeDriver, you can make use of the ChromeOptions capabilities. For example, to execute the code without opening the browser UI, simply set the headless flag as shown in the example. If you would like to test the mobile version of your website, you can set ChromeOptions to emulate a mobile browser:

// requires java.util.Map
chromeOptions.setExperimentalOption("mobileEmulation", Map.of("deviceName", "Google Nexus 5"));

chromeOptions.addArguments(List.of("--user-agent=Mozilla/5.0 (iPhone;)...")); allows you to specify the user agent. Check out the ChromeDriver documentation for more information on mobile emulation. In general, more information on how to use Selenium can be found here.

Jaunt

Jaunt stands for Java Web Scraping & JSON Querying. It does not support JavaScript, but it is extremely fast. Even though its website states the opposite, it is not a free library: the jar file provided on its download page can be used for free for one month, and a jar for longer-term use costs money. The library cannot be used with a GUI. A detailed tutorial is available. Since Jaunt is not based on a WebKit browser, it gives access to the HTTP layer, which eases the handling of REST calls. Its support for parsing JSON payloads is a plus. Instead of relying on XPath or CSS selectors, Jaunt keeps its selectors as short as possible to reduce their sensitivity to structural changes in the DOM tree. The next section demonstrates that Java code using Jaunt is very concise.

Usage

Add the downloaded jar to your project.

package com.innoq.sample.jaunt;

import com.jaunt.Element;
import com.jaunt.NotFound;
import com.jaunt.ResponseException;
import com.jaunt.SearchException;
import com.jaunt.UserAgent;
import com.jaunt.component.Form;

public class JauntExample {

    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        try {
            userAgent.visit("http://confluence.arc42.org/display/templateEN/"
                  + "arc42+Template+%28English%29+-+Home");
            Form form = userAgent.doc.getForm(0);
            form.setTextField("queryString", "Requirements");
            form.submit();
            String nextLink = null;
            do {
                for (Element e : userAgent.doc.findEach("<a data-type=page>")) {
                    System.out.println(e.getText() + "->" + e.getAt("href"));
                }
                try {
                    nextLink = userAgent.doc.findFirst("<a class=pagination-next>")
                               .getAt("href").replace("&amp;", "&");
                } catch (NotFound e) {
                    break;
                }
                userAgent.visit(nextLink);
            } while (true);
        } catch (ResponseException | SearchException e) {
            e.printStackTrace();
        }
    }
}

The userAgent represents the browser. Its visit() method opens a site, which is then available through the userAgent’s doc property. The example shows that a form can be retrieved simply by specifying its index. Alternatively, a query like Form form = userAgent.doc.getForm("<form name=pagetreesearchform>"); retrieves it by name. If the submit button is unambiguous, it is sufficient to call submit() on the form without a parameter; otherwise, the label of the button can be passed as a parameter to submit() (e.g. submit("Search")). A disadvantage is the heavy use of exceptions: instead of returning an Optional or null when an element cannot be found, Jaunt throws exceptions that need to be handled.
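One way to reduce this exception handling in calling code is to wrap a lookup so that one "not found" exception type becomes an empty Optional. The sketch below uses only the JDK. The class Lookups, its method tryFind(), and the stand-in exception MissingElement are hypothetical names for illustration and are not part of Jaunt.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical helper: converts one "not found" exception type into Optional.
final class Lookups {

    // Stand-in for a library's "element not found" exception.
    static class MissingElement extends Exception {
    }

    // Runs the lookup; the given exception type means "nothing found",
    // any other exception is treated as a real failure and rethrown wrapped.
    static <T> Optional<T> tryFind(Callable<T> lookup,
                                   Class<? extends Exception> notFoundType) {
        try {
            return Optional.ofNullable(lookup.call());
        } catch (Exception e) {
            if (notFoundType.isInstance(e)) {
                return Optional.empty();
            }
            throw new IllegalStateException("lookup failed", e);
        }
    }

    public static void main(String[] args) {
        Optional<String> found = tryFind(() -> "/page/2", MissingElement.class);
        Optional<String> missing = tryFind(
                () -> { throw new MissingElement(); }, MissingElement.class);
        System.out.println(found.isPresent());
        System.out.println(missing.isPresent());
    }
}
```

Applied to the example above, the lambda would wrap the findFirst() call for the pagination link and NotFound.class would be passed as the second argument, so the loop could end on an empty Optional instead of a catch block.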

Comparison

Based on the above observations, the following comparison table can be derived (X supported, - not supported, L limited).

Tool	GUI	No GUI	JavaScript/CSS	Free	Fast	XPath Selector	CSS Selector	HTML Selector
HtmlUnit	-	X	X	X	-	X	X	X
Selenium	X	L	X	X	-	X	X	X
Jaunt	-	X	-	-	X	-	-	X
