This blog post demonstrates a simple web scraping example using three different tools and closes with a short comparison of the three.
Add the following Maven dependency to your project:
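A typical HtmlUnit dependency entry looks like the following (the version number is an assumption; substitute the current release):

```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.36.0</version>
</dependency>
```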
The following example uses the search bar on the INNOQ website to search for all entries that contain the expression scraping:
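A sketch of how this search could look with HtmlUnit; the URL, the CSS selectors, and the field names are assumptions about the page markup, not verified against innoq.com:

```java
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScraper {

    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Load the search page; the returned HtmlPage is the root of the DOM tree.
            HtmlPage page = webClient.getPage("https://www.innoq.com/en/search/");

            // Select the search input and type the query
            // (selectors are assumptions about the page markup).
            HtmlInput searchField = page.querySelector("input[name='q']");
            searchField.type("scraping");

            // Click the search button; click() returns the next page.
            HtmlButton button = page.querySelector("button[type='submit']");
            HtmlPage resultPage = button.click();

            // Collect all result links via XPath and print their URIs.
            List<HtmlAnchor> anchors = resultPage.getByXPath("//a[@href]");
            for (HtmlAnchor anchor : anchors) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }
}
```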
The actual code starts with the call to webClient.getPage(), which loads the website. The returned page object is the root of the DOM tree and can be traversed using XPath. Since multiple nodes may match a given XPath expression, getByXPath() returns a list of objects, so you need to filter and cast the results. With DomNode.querySelector(), HtmlUnit also provides a way to select elements via CSS selectors. Of course, it is also possible to select elements by id or name. HtmlUnit provides a class for each HTML tag (e.g., HtmlForm, HtmlInput, HtmlButton, HtmlAnchor, etc.).
In the above example, we load the search page of innoq.com, enter a search string, click the search button, and print the URIs of all results. The click() method of the button returns the next page once it has finished loading. The HTML of a page can be printed to the screen for debugging purposes. HtmlUnit runs without a GUI; where a GUI is needed, libraries like Selenium might be an alternative.
Normally, if a Java program sits behind a proxy, it is sufficient to configure the JVM to contact the proxy server whenever it connects to the Internet via HTTP:
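Such a configuration is typically passed as system properties on the command line; host, port, and jar name below are placeholders:

```shell
java -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
     -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080 \
     -jar scraper.jar
```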
In the case of HtmlUnit, a special ProxyConfig object needs to be configured so that the setting is taken into account. Assuming that the proxy has been configured via the command line as shown above, we can configure HtmlUnit’s WebClient like this:
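A minimal sketch, assuming HtmlUnit 2.x's two-argument ProxyConfig constructor and the system properties set on the command line above:

```java
// Read the proxy settings the JVM was started with.
String proxyHost = System.getProperty("http.proxyHost");
int proxyPort = Integer.parseInt(System.getProperty("http.proxyPort"));

// Hand them to HtmlUnit explicitly via its own ProxyConfig.
WebClient webClient = new WebClient();
webClient.getOptions().setProxyConfig(new ProxyConfig(proxyHost, proxyPort));
```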
With a browser of your choice installed for your OS, you need to download the appropriate driver for it. The driver version must match the version of the browser. In the following example, we download the Chrome driver and copy the downloaded executable to a certain directory (e.g., /home/martin/Documents/). Add the following Maven dependency to your project:
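A typical Selenium dependency entry looks like the following (the version number is an assumption; substitute the current release):

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
```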
Like the HtmlUnit example, the following code uses the search bar on the INNOQ website to fetch all links that contain the expression scraping.
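A sketch of such a Selenium-based search, assuming Selenium 3's WebDriverWait(WebDriver, long) constructor; the URL and element names are assumptions about the page markup:

```java
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SeleniumScraper {

    public static void main(String[] args) {
        // Tell Selenium where the downloaded chromedriver executable lives.
        System.setProperty("webdriver.chrome.driver",
                "/home/martin/Documents/chromedriver");

        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.innoq.com/en/search/");

            // Fill in and submit the search form
            // (field name is an assumption about the page markup).
            WebElement searchField = driver.findElement(By.name("q"));
            searchField.sendKeys("scraping");
            searchField.submit();

            // Wait up to 10 seconds for result links to appear.
            new WebDriverWait(driver, 10).until(
                    ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("a[href]")));

            // Print the href of every link on the result page.
            List<WebElement> links = driver.findElements(By.cssSelector("a[href]"));
            for (WebElement link : links) {
                System.out.println(link.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```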
The example assumes the downloaded chrome-driver ELF file to be located in
/home/martin/Documents/chromedriver. The WebDriver represents the browser. Its get() method opens the passed page in the browser. driver.findElement() returns a matching WebElement, i.e., a node in the DOM tree. If the element does not exist, a NoSuchElementException is thrown. To avoid this exception, you may call driver.findElements() instead, which returns a list of all elements that match the given search criteria; if the list is empty, nothing was found.
By default, Selenium’s get() performs an HTTP GET and returns once the page has been fully loaded. To wait after a click() on an element, a WebDriverWait object can be created with a timeout in seconds; it lets the driver wait for the existence of an element specified by the criteria passed to WebDriverWait’s until() method. There are several search criteria represented by the By class (e.g., by name, className (if the element has exactly one CSS class), cssSelector (if the element has multiple CSS classes), the id attribute of an HTML tag, the HTML tag name, (partial) link text, or an XPath expression). If the returned WebElement belongs to a form (i.e., the form itself or any sub-element), the submit() method can be called to submit the form instead of using its click() method.
To tweak the ChromeDriver, you can make use of ChromeOptions' capabilities. For example, to execute the code without opening the browser UI, you simply set the headless flag as shown in the example. If you would like to test the mobile version of your website, you can set ChromeOptions to emulate a mobile browser:
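A sketch of both options; the device name is an assumption, consult the ChromeDriver mobile-emulation documentation for the list of supported device names:

```java
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setHeadless(true); // run Chrome without a visible UI

// Emulate a mobile device by name (device name is an assumption).
chromeOptions.setExperimentalOption("mobileEmulation",
        Map.of("deviceName", "Nexus 5"));

WebDriver driver = new ChromeDriver(chromeOptions);
```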
chromeOptions.addArguments(List.of("--user-agent=Mozilla/5.0 (iPhone;)...")); allows you to specify the user agent. Check out the ChromeDriver documentation for more information on mobile emulation.
In general, more information on how to use Selenium can be found here.
Add the downloaded jar to your project.
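A sketch of the same search with Jaunt; the URL, the text-field name, and the button label are assumptions about the page markup:

```java
import com.jaunt.Element;
import com.jaunt.Elements;
import com.jaunt.JauntException;
import com.jaunt.UserAgent;
import com.jaunt.component.Form;

public class JauntScraper {

    public static void main(String[] args) {
        try {
            // The UserAgent represents the browser; visit() opens the site.
            UserAgent userAgent = new UserAgent();
            userAgent.visit("https://www.innoq.com/en/search/");

            // Retrieve the first form on the page by index.
            Form form = userAgent.doc.getForm(0);
            form.setTextField("q", "scraping"); // field name is an assumption
            form.submit("Search");

            // Print the href of every link on the result page.
            Elements links = userAgent.doc.findEvery("<a href>");
            for (Element link : links) {
                System.out.println(link.getAt("href"));
            }
        } catch (JauntException e) {
            // Jaunt signals missing elements and failed requests via exceptions.
            e.printStackTrace();
        }
    }
}
```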
The UserAgent represents the browser; its visit() method opens a site, which is then available through the userAgent’s doc property. The example shows that a form can simply be retrieved by specifying its index. Alternatively, a query like
Form form = userAgent.doc.getForm("<form name=pagetreesearchform>"); retrieves it by name. If the submit button is unambiguous, it is sufficient to call submit() on the form without a parameter; otherwise, the label on the button can be passed as a parameter to the submit() method (e.g.
submit("Search")). A disadvantage is the heavy use of exceptions: instead of returning
Optionals or null when an element cannot be found, exceptions are thrown that need to be handled.
Based on the above observations, the following comparison table can be derived (X = supported, - = not supported, L = limited).