This is the third chapter of the Creating your first crawler tutorial. In the first and second chapters, we created a crawler that opens the front page of Hacker News and scrapes the title of the first article. Now we are going to scrape the full list of articles.

Returning an array of objects


Our current implementation of the Page function looks like this:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = {
        title: $('.athing:eq(0) .title:eq(1)').text()
    };
    return result;
}

Our goal is to extract the titles of all articles on the front page of Hacker News. As mentioned in the previous chapter, pageFunction() can return any object that can be stringified to JSON, for example, an array of objects describing the individual articles. Since each Hacker News article is enclosed in an HTML element with class="athing", we can iterate through all the articles using the jQuery $(selector).each() function as follows:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = [];
    $(".athing").each(function () {
        result.push({
            title: $(this).find(".title:eq(1)").text()
        });
    });
    return result;
}

In the code above, we defined result as an array, iterated through all Hacker News article HTML elements and added a new object with a title attribute for each article. Note that $(this) inside $(selector).each() refers to the currently iterated HTML element, while find() returns all of its sub-elements matching the given selector.

Run the updated crawler and open the Raw data tab of the Run console, where you will see the results in JSON format.
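
Since the front page changes constantly, your output will differ, but it will be an array of objects matching what pageFunction() returns. The titles below are made up for illustration:

[
    {
        "title": "Example article title"
    },
    {
        "title": "Another example article title"
    }
]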

Adding all attributes


In this section, we are going to extract the other attributes of the Hacker News articles. We'll do it exactly the same way as we extracted title above. Let's start with the rank of each article. Inspecting the corresponding HTML element in Chrome Developer Tools reveals this:
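
The exact markup evolves over time, but each article row looks roughly like this (a simplified excerpt, not a verbatim copy of the page):

<tr class="athing">
    <td class="title"><span class="rank">1.</span></td>
    <td class="votelinks">…</td>
    <td class="title"><a href="https://example.com/article">Example article title</a></td>
</tr>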



Conveniently, the desired HTML element has class="rank", so we can simply use the $(".rank") jQuery selector to add the rank to our results:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = [];
    $(".athing").each(function () {
        result.push({
            title: $(this).find(".title:eq(1)").text(),
            rank: $(this).find(".rank").text()
        });
    });
    return result;
}

Now, as an exercise, try to extract the other attributes, such as link, score, author, time and the number of comments. If you get stuck, your new best friends are W3Schools' JavaScript and jQuery tutorials, Stack Overflow and, of course, Apify support :-)

You should end up with a Page function similar to this one:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = [];
    $(".athing").each(function () {
        // the comments link is the third anchor in the row that follows the article
        var comments = $(this).next("tr").find("a:eq(2)").text();
        // articles with no comments show a "discuss" link instead of a count
        if (comments === "discuss") {
            comments = "0";
        } else {
            comments = comments.replace(/ comments| comment/, "");
        }
        result.push({
            rank: $(this).find(".rank").text(),
            title: $(this).find(".title:eq(1) a:eq(0)").text(),
            link: $(this).find(".title:eq(1) a:eq(0)").attr("href"),
            score: $(this).next("tr").find(".score").text().replace(/ points| point/, ""),
            author: $(this).next("tr").find("a:eq(0)").text(),
            time: $(this).next("tr").find("a:eq(1)").text(),
            comments: comments
        });
    });
    return result;
}

In your pageFunction() code you can perform any data transformation you need, such as cleaning up the number of comments so that it contains only a number, as shown above.
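
If you would rather return the numeric fields as actual numbers instead of strings, a small helper can do the conversion. This is just an optional sketch; the toNumber() helper is our own, not part of the Apify API:

function toNumber(text) {
    // parseInt() reads the leading digits and ignores the rest,
    // so "1." becomes 1 and "42 points" becomes 42
    var value = parseInt(text, 10);
    return isNaN(value) ? 0 : value;
}

You could then write, for example, rank: toNumber($(this).find(".rank").text()) inside the each() loop.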
 
The crawler will now return the full set of attributes for every article on the front page.
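
For example, a single record might look like this (the values are made up for illustration; the real output depends on the current front page):

{
    "rank": "1.",
    "title": "Example article title",
    "link": "https://example.com/article",
    "score": "256",
    "author": "someuser",
    "time": "3 hours ago",
    "comments": "42"
}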



Now we have a crawler that opens the first page of Hacker News and extracts the details of all articles on it. In the next chapter, we're going to enhance the crawler to visit multiple pages.
