There is a certain scraping scenario in which you need to process the same URL many times, but each time with a different setup (e.g. filling in a form with different data on each visit). This is easy to do with Apify, but how to go about it may not be obvious at first glance.

We'll show you how to do this with a simple example: starting a crawler with an array of keywords, entering each keyword into Google separately and retrieving the results from the last page. The tutorial is split into four main parts: enqueuing start pages for all keywords, inputting the keyword into Google, passing the keyword to the final page and extracting the results.

This whole thing could be done in a much simpler way, by directly enqueuing the search URL for each keyword, but we're deliberately choosing this approach to demonstrate some of the less obvious features of the Apify crawler.
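For the record, a minimal sketch of that direct approach might look like this (keyword would come from the same loop over customData that we build below, and q is Google's standard search query parameter):

// enqueue the search results page directly;
// no custom uniqueKey is needed, because each
// search URL is different
context.enqueuePage({
    label: 'result',
    url: 'https://www.google.com/search?q=' +
        encodeURIComponent(keyword),
    interceptRequestData: keyword
});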

Enqueuing start pages for all keywords

First we need to start the crawler on a page from which we'll do our enqueuing. To do that, we create one startURL with the label "enqueue" and the URL "https://example.com". Now we can proceed to enqueue all the search pages. The first part of our pageFunction will look like this:

var $ = context.jQuery;
   
if(context.request.label === 'enqueue'){
   
    // parse input keywords
    var cd = JSON.parse(context.customData);
   
    // process all the keywords
    $.each(cd, function(index, keyword){

        // enqueue the page and pass the keyword in
        // the interceptRequestData attribute
        context.enqueuePage({
            label: 'fill-form',
            url: 'https://google.com',
            interceptRequestData: keyword,
            uniqueKey: Math.random() + ''
        });
    });
   
    // disable output for this page
    context.skipOutput();
}

To set the keywords, we're using the customData crawler parameter. This works well for smaller data sets, but may not be ideal for bigger ones. In such cases you may want to use an approach like Importing a list of URLs from an external source.

Since we're enqueuing the same URL more than once, we need to set our own uniqueKey for each request to get it added to the queue (by default, uniqueKey is set to the same value as the URL, and requests with a duplicate uniqueKey are skipped). The label for the next page will be "fill-form". We're passing the keyword to the next page in the interceptRequestData attribute (which can contain any user data).
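The random uniqueKey above works fine, but keys derived from the keyword itself are easier to recognize in the request queue. Assuming each keyword appears in customData only once, the call could look like this instead:

context.enqueuePage({
    label: 'fill-form',
    url: 'https://google.com',
    interceptRequestData: keyword,
    // one deterministic key per keyword
    // instead of Math.random()
    uniqueKey: 'fill-form-' + keyword
});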

Inputting the keyword into Google

Now we come to the next page (Google). We need to retrieve the keyword and input it into the Google search bar. This will be the next part of the pageFunction:

else if(context.request.label === 'fill-form'){
   
    // retrieve the keyword
    var keyword = context.request.interceptRequestData;

    // input the keyword into the search bar
    $('#lst-ib').val(keyword);

    // submit the form
    $('#tsf').submit();

    // disable output for this page
    context.skipOutput();
}

For the next page to be enqueued correctly, we're going to need a new pseudoURL. Create a pseudoURL with the label "result" and the URL "https://www.google.com/search?[.+]".
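Submitting the form in the previous step navigates the browser to a URL along the lines of https://www.google.com/search?q=apple&..., which this pseudoURL matches; the [.+] part in brackets is a regular expression matching any non-empty query string.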

Passing the keyword to the final page

With each of the results, we might want to show which keyword was used to find it. We could extract the keyword from the results URL, but it may not always be available there. To solve this, we need to pass it along in interceptRequestData again. Since this time the request is created by submitting the form rather than by calling context.enqueuePage, we use the interceptRequest function (which is triggered for every request being added to the queue) to pass the keyword forward to the next page. The interceptRequest function (in Advanced settings) will look like this:

function interceptRequest(context, newRequest) {
    // if the new request doesn't carry its own user data,
    // copy the keyword over from the current page's request
    if(!newRequest.interceptRequestData){
        newRequest.interceptRequestData =
            context.request.interceptRequestData;
    }
    // return the (possibly modified) request so it gets enqueued
    return newRequest;
}
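Note the condition: the keyword is copied only to requests that don't already carry their own interceptRequestData, so the requests we enqueued explicitly in the first part keep the data they were given there.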


Extracting the results

Now we're on the last page and can finally extract the results.

else if(context.request.label === 'result'){

    // create result array
    var result = [];

    // process all the results
    $('.rc').each(function(index, elem){

        // wrap element in jQuery
        var gResult = $(elem);

        // lookup link and text
        var link = gResult.find('.r a');
        var text = gResult.find('.s .st');

        // extract data and add it to result array
        result.push({
            name: link.text(),
            link: link.attr('href'),
            text: text.text(),
            // add the keyword from previous page
            search: context.request.interceptRequestData
        });

    });

    return result;
}
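With the keyword attached, each results page then outputs an array of objects shaped like this (the values here are purely illustrative):

[{
    name: 'Apple',
    link: 'https://www.apple.com/',
    text: 'Apple Inc. designs and sells smartphones, ...',
    search: 'apple'
}]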

To test the crawler, set the customData (in Advanced settings) to something like this: ["apple", "orange", "banana"], then press the Run button to start.
