This is the second chapter of the Creating your first crawler tutorial. In the first chapter, we created a crawler that opens the front page of Hacker News. Now we are going to scrape some data from it.

Page function


The Page function is a piece of JavaScript code that is executed in the context of every web page the crawler visits. This is where you define how the data should be extracted. Although you can write the code in plain JavaScript, it's much more convenient to use the jQuery library, as we do in this tutorial.

By default, the Page function has the following implementation:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = {
        myValue: $('TODO').text()
    };
    return result;
}

Note that the jQuery object is accessible through context.jQuery. The data you want to extract from the page has to be returned from the pageFunction() - this is what will be displayed in the Results tab of the Run console.
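To see this contract in isolation, here is a small sketch that can run outside the crawler by stubbing context.jQuery (the stub below is purely illustrative - the real crawler injects actual jQuery into the page):

```javascript
// The crawler calls pageFunction() with a context object; whatever the
// function returns becomes one record in the Results tab.
function pageFunction(context) {
    var $ = context.jQuery;
    return { myValue: $('h1').text() };
}

// Hypothetical stub standing in for the real jQuery-backed context,
// so the contract can be demonstrated without a browser:
var fakeContext = {
    jQuery: function (selector) {
        return { text: function () { return 'Hello from ' + selector; } };
    }
};

console.log(JSON.stringify(pageFunction(fakeContext)));
// → {"myValue":"Hello from h1"}
```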

Finding jQuery/CSS/XPath selectors


Let's extract the title of the first article on Hacker News. To do that, we'll try to find a selector that corresponds to the title's HTML element. Modern web browsers provide tools that make this task pretty easy. For example, in Google Chrome you can right-click on any HTML element and then click Inspect to open Chrome Developer Tools:



It will show something like this:



In the Elements tab you can see the HTML structure of the web page. The best way to find a selector for an HTML element is to look for its ID (<tag id="xxx">) or CSS class name (<tag class="xxx">). If the HTML element has no class or ID, perhaps one of its parent elements does, and the path to the desired child element can then be described uniquely using another selector. There are typically many ways to construct a selector - let your inner hacker shine!

For example, you can see that each Hacker News article has class="athing" (hover the mouse over the HTML code in the Elements tab and the corresponding element on the page will be highlighted). Also note that the parent element of our desired title text has class="title". If we find the first "athing" element and return the text of its "title" sub-element, we'll have the title we need. With jQuery, this translates to the following code:

$('.athing:eq(0) .title').text()

In the code above, the "." prefix selects elements by CSS class, ":eq(0)" limits the result to the first matching element, and text() returns the plain text content of the matching elements. To learn more about jQuery, check this tutorial.

To test your jQuery code in the Chrome Console, you might need to inject jQuery into the web page if it does not already use it (yes, that happens). Simply paste the following code into the Console and press ENTER:

var jq = document.createElement('script');
jq.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(jq);

Running the above jQuery code will show something like this:



Note that the returned text also contains a Hacker News rank, which we don't want to include right now. Since the rank element also has class="title", we only need to limit the selector to the second "title" element:

$('.athing:eq(0) .title:eq(1)').text()
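If you prefer to keep the broader selector, you can instead strip the rank prefix from the combined text in plain JavaScript. A minimal sketch, using a made-up sample of what the broader selector might return (the exact format on the live page may differ):

```javascript
// Made-up sample of the combined "rank + title" text:
var combined = '1. Example article title';

// Remove a leading number followed by a dot and whitespace:
var titleOnly = combined.replace(/^\s*\d+\.\s*/, '');

console.log(titleOnly);
// → Example article title
```

Narrowing the selector as shown above is usually cleaner, but post-processing like this can be handy when no precise selector exists.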

Alternative tools for selectors


If jQuery is not your cup of tea, you can use plain CSS selectors and the native document.querySelectorAll() function. The native equivalent of the above jQuery code looks like this:

document.querySelectorAll(".athing:nth-of-type(1) .title")[1].textContent

Alternatively, if you prefer XPath, you can also use the document.evaluate() function:

document.evaluate('//tr[@class="athing"][1]/td[@class="title"][2]', document, null, XPathResult.STRING_TYPE, null).stringValue

Returning results

Now that we have tested our jQuery selector in a web browser, we can use it in our Page function to extract the title of the first article on Hacker News. The pageFunction() can return any object that can be stringified to JSON:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = {
        title: $('.athing:eq(0) .title:eq(1)').text()
    };
    return result;
}
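Keep in mind that the result is serialized to JSON, so values that cannot be represented in JSON silently disappear. A quick node-runnable illustration (the field names are made up):

```javascript
// JSON.stringify() drops object properties whose values are
// functions or undefined, so they would never reach the Results tab:
var result = {
    title: 'Example title',
    rank: 1,
    helper: function () {},   // dropped during serialization
    missing: undefined        // dropped as well
};

console.log(JSON.stringify(result));
// → {"title":"Example title","rank":1}
```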

And that's all - just one extra line of JavaScript code. Now you can save and run the crawler again and open the Results tab. You should see a table containing the title of the first Hacker News article as well as the URL of the source page:



In the Full data tab you can find all details about the last crawl in JSON format, as well as links to the same data in HTML, CSV and XML formats.

In the next chapter we will improve our crawler to extract all the attributes of all Hacker News articles.
