Typically a crawler on Apify uses one or more Start URLs to start the crawling process. You have several options for defining these URLs:

  • define a static list of Start URLs in basic settings on the crawler configuration page
  • POST an array of Start URLs when starting the crawler via the API, and thus handle these settings dynamically from your application (see the sketch after this list)
  • fetch a list of URLs from an external source via the REST API from the Page function
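
To illustrate the second option, here's a minimal sketch of such an API call. Treat the endpoint as an assumption: the path, IDs and token below are placeholders, so check the API reference for your crawler's exact "execute" URL before using it.

// A sketch of starting a crawler with a dynamic list of Start URLs.
// USER_ID, CRAWLER_ID and API_TOKEN are placeholders; the endpoint itself
// is an assumption - consult the Apify API reference for the exact URL.
// Uses the Fetch API (modern browsers and Node.js 18+).
fetch("https://api.apify.com/v1/USER_ID/crawlers/CRAWLER_ID/execute?token=API_TOKEN", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        startUrls: [
            { key: "START", value: "http://example.com/page-1" },
            { key: "START", value: "http://example.com/page-2" }
        ]
    })
});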

We're going to take a look at the third option.

What is our goal?


Importing a list of URLs is useful when you don't need the crawling functionality itself; you just need to scrape data from a list of URLs you already have. The goal of this article is to show you how to import these URLs from an external source via the REST API, right from the Page function.

External source of URLs


First, you have to prepare an external source of URLs. All it needs is a REST API that can be reached with an AJAX call from the Page function. It can be a database, an application, a cloud service, etc.
We're going to use Google Sheets, which has a REST API for accessing data in sheets. Let's create a new spreadsheet with a single column: a header in the first row, followed by the URLs, one per row. We'll use this one.

To make it available via the API, we have to publish the spreadsheet (File -> Publish to the web...). You'll get a URL where you can check that the spreadsheet has been published, like this one in our case (it may take a moment for the published version to appear). Finally, a URL returning the sheet in JSON format can be constructed as follows:
https://spreadsheets.google.com/feeds/list/SPREADSHEET_ID/SHEET_NUMBER/public/values?alt=json

In our case it's this URL.
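
If you open that URL, you'll get the sheet as a JSON feed. The rows live in data.feed.entry and, since our sheet has a single column, each row's URL ends up in the entry's title.$t field (this is exactly what the Page function below reads). Trimmed down, the response looks roughly like this, with illustrative values:

{
    "feed": {
        "entry": [
            { "title": { "$t": "http://example.com/first-url" } },
            { "title": { "$t": "http://example.com/second-url" } }
        ]
    }
}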

Fetching URLs from the crawler run


To import the list of URLs, we have to run a Page function. We can choose any page for that, but let's use http://example.com. So our approach is to set this page as a Start URL and then import the list of URLs from within its Page function. For this purpose we'll use the context.enqueuePage() function. You can learn more about this internal function in our documentation.

Note:
Since the Start URL itself is just an irrelevant helper page, we talk about importing URLs rather than Start URLs.

And finally here's the content of the pageFunction:

function pageFunction(context) {
    var $ = context.jQuery;
    if (context.request.label === "START") {
        // This is the helper Start URL - instead of scraping it, fetch the
        // list of URLs from the published spreadsheet.
        var SPREADSHEET_ID = "1d31uviYNfik-qi9nlDHgIjuXjgo5BIDA1S1Wsfm1e4E";
        var NUMBER_OF_SHEETS = 2;
        // Fetches one sheet as JSON, enqueues every URL it contains and
        // then continues with the next sheet.
        var loadData = function(id) {
            var urlAPI = "https://spreadsheets.google.com/feeds/list/" + SPREADSHEET_ID + "/" + id + "/public/values?alt=json";
            $.get(urlAPI, function(data) {
                var entries = data.feed.entry;
                $.each(entries, function(index, value) {
                    // Each row of the sheet holds one URL in its first column.
                    var url = value.title.$t;
                    context.enqueuePage(url);
                });
                if (id === NUMBER_OF_SHEETS) {
                    // All sheets have been processed.
                    context.finish();
                } else {
                    loadData(id + 1);
                }
            });
        };
        loadData(1);
        // Don't output a result for this helper page and tell the crawler
        // to wait until the asynchronous calls above reach context.finish().
        context.skipOutput();
        context.willFinishLater();
    } else {
        // A page enqueued from the spreadsheet - scrape it here.
        var result = {
            myValue: $('TODO').text()
        };
        return result;
    }
}

Note: You can use more sheets in your spreadsheet; just set NUMBER_OF_SHEETS in the Page function accordingly, as we did here.

The only thing you have to set in the crawler configuration besides this Page function is the Start URL with the START label (see the sketch below). The crawler will then load our Start URL (its result won't be output thanks to context.skipOutput()) and enqueue and load all URLs from our external source.
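
For reference, the Start URL part of the crawler settings could look like the following sketch. We're assuming the key/value format used by labelled Start URLs in the legacy crawler configuration, so double-check it against your own crawler's settings:

{
    "startUrls": [
        { "key": "START", "value": "http://example.com/" }
    ]
}
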
You can of course use other settings as well, and even crawl onward from the imported URLs; just set Pseudo URLs and Clickable elements accordingly, as in the example below.
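
For example, if the imported pages link to detail pages you also want to crawl, a Pseudo URL along these lines would match them. The domain and path are made up for illustration; the part in brackets is a regular expression:

http://www.example.com/detail/[.*]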

And that's all. Let us know if you need help with the integration of some other external source.
