This article relates to the legacy Apify Crawler product, which is being retired in favor of the apify/legacy-phantomjs-crawler actor. All the information in this article is still valid, and applies to both the legacy Crawler product and the new actor. For more information, please read this blog post.
For new projects, we recommend using the newer apify/web-scraper actor that is based on the modern headless Chrome browser.
This is the fourth and final chapter of the Creating your first crawler tutorial. In previous chapters, we created a crawler that opens the front page of Hacker News and scrapes a list of all articles. Now we'll let the crawler visit multiple pages.
We want the crawler to "click" the More link repeatedly (see the screenshot below) so that it navigates to subsequent pages of articles and runs the Page function on each of them.
First, let's discuss the crawling process of Apify. By default, the crawler repeats the following steps:
1. Add each of the Start URLs into the crawling queue
2. Fetch the first URL from the queue and load it in the virtual browser
3. Execute Page function on the loaded page and save its results
4. Find all links from the page. If a link matches any of the Crawl pseudo-URLs and has not yet been enqueued, add it to the queue
5. Go to step 2.
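The steps above can be sketched as a simple queue-driven loop. This is only an illustration, not Apify's actual implementation: the `crawl` function, its parameters, and the three-page `SITE` dictionary on example.com are all hypothetical stand-ins for the virtual browser, the Page function, and the Crawl pseudo-URL check.

```python
from collections import deque


def crawl(start_urls, fetch_links, page_function, matches_purl):
    """Queue-driven crawl loop mirroring the five steps listed above."""
    queue = deque(start_urls)        # step 1: enqueue the Start URLs
    enqueued = set(start_urls)       # every URL that was ever added to the queue
    results = []
    while queue:
        url = queue.popleft()        # step 2: take the next URL and "load" it
        results.append(page_function(url))  # step 3: run the Page function
        for link in fetch_links(url):       # step 4: inspect outgoing links
            if matches_purl(link) and link not in enqueued:
                enqueued.add(link)
                queue.append(link)
        # step 5: loop back to step 2 until the queue is empty
    return results


# Hypothetical three-page site, used only for this illustration.
SITE = {
    "https://example.com/news": [
        "https://example.com/news?p=2",
        "https://example.com/about",
    ],
    "https://example.com/news?p=2": [
        "https://example.com/news?p=3",
        "https://example.com/news",
    ],
    "https://example.com/news?p=3": [],
}

visited = crawl(
    start_urls=["https://example.com/news"],
    fetch_links=lambda url: SITE.get(url, []),
    page_function=lambda url: url,                 # stand-in Page function
    matches_purl=lambda link: "news?p=" in link,   # stand-in PURL check
)
print(visited)
```

With a `matches_purl` that always returns `False` (no Crawl pseudo-URLs defined), the same loop visits only the start URL and then stops, which is the behavior of the crawler built in the previous chapters.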
This process is depicted in the diagram below. Note that blue elements represent settings or operations that can be modified in crawler settings.
In the context of our example, the crawler only adds the URL of the Hacker News front page to the queue, loads the page, executes the Page function, looks for outgoing links and immediately finishes, as we didn't define any Crawl pseudo-URLs.
Defining Crawl pseudo-URLs
Our current crawler configuration looks like this:
A Crawl pseudo-URL (PURL) is a URL with regular expressions enclosed in [] brackets. Each PURL is matched against every link URL found on a page to decide whether that URL should be enqueued.
If you click the More link at https://news.ycombinator.com/news, you'll be sent to https://news.ycombinator.com/news?p=2. The next click on that link will navigate to https://news.ycombinator.com/news?p=3, and so on. A PURL matching these pages is simply "https://news.ycombinator.com/news?p=[\d+]". Note that the regular expression we used (\d+) means one or more digits. To learn more about regular expressions, check this W3Schools tutorial or regexp101.com.
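To see how such a PURL behaves, here is a minimal Python sketch that turns a pseudo-URL into an ordinary regular expression. It assumes that text inside [] brackets is a regular expression and everything else must match literally. The `purl_to_regex` helper is hypothetical and written only for this illustration, so the crawler's real matching rules may differ in details.

```python
import re


def purl_to_regex(purl):
    """Compile a pseudo-URL: text in [] is a regex, the rest is literal.

    Simplification: a regex that itself contains "]" is not supported.
    """
    pattern = ""
    pos = 0
    for m in re.finditer(r"\[(.*?)\]", purl):
        pattern += re.escape(purl[pos:m.start()])  # literal part before "["
        pattern += "(?:" + m.group(1) + ")"        # regex part inside "[...]"
        pos = m.end()
    pattern += re.escape(purl[pos:])               # trailing literal part
    return re.compile(pattern)


purl_re = purl_to_regex(r"https://news.ycombinator.com/news?p=[\d+]")

for url in [
    "https://news.ycombinator.com/news?p=2",
    "https://news.ycombinator.com/news?p=37",
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/item?id=1",
]:
    # fullmatch requires the whole URL to match, not just a prefix
    print(url, bool(purl_re.fullmatch(url)))
```

Only the first two URLs match. Note that the `?` and `.` characters outside the brackets are treated as plain characters here, which is why the PURL can be written almost exactly like the URL it targets.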
When we add the above PURL to our crawler configuration, the crawler should theoretically start loading the next pages. However, when you run the crawler you will see that it still finishes after the first page. The problem lies in the Clickable elements option in Advanced settings: its default value skips the More link, as explained in the next section.
Defining Clickable elements
The Clickable elements setting is a CSS selector that tells the crawler which HTML elements it should click when looking for outgoing links from a page. The default value excludes links that carry a rel="nofollow" attribute.
In our Hacker News example, if you inspect the More link in Google Developer Tools, you will notice that the link has a rel="nofollow" attribute.
Due to the default Clickable elements setting, the crawler will skip the above link. We could simply set Clickable elements to "a" which means the crawler would click all links regardless of the rel="nofollow" attribute, but such a wide selector is generally not a good idea for two reasons. First, it could slow down the crawling process because Apify would have to click and match many more links. Second, it could cause some unexpected and potentially harmful actions. For example, if you're logged in to Hacker News (Apify can handle that) and you let the crawler click all upvote links, your reputation might be damaged. Therefore, it's better to narrow down the Clickable elements selector to the smallest possible set of links. In our example, the 'a[href^="news?p="]' CSS selector will only match an a element with a href attribute starting with "news?p=", which only matches the More link.
Now our crawler configuration looks as follows:
In the Advanced settings section you can find other useful crawler configuration options. We only want to extract the first 150 articles, so we'll set Max pages per crawl to 5 because each page on Hacker News has 30 articles. We can also increase Delay between requests to 3000 milliseconds to be nice to Hacker News.
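The arithmetic behind these two settings can be written out as follows. The variable names are made up for this sketch, and the run-time estimate assumes the delay is applied once between each pair of consecutive page loads, which is a simplification rather than a documented guarantee.

```python
import math

ARTICLES_WANTED = 150      # how many articles we want to extract
ARTICLES_PER_PAGE = 30     # articles shown on one Hacker News page
DELAY_MS = 3000            # Delay between requests, in milliseconds

# Number of pages needed to collect the wanted articles.
max_pages = math.ceil(ARTICLES_WANTED / ARTICLES_PER_PAGE)

# Assumed lower bound on time spent waiting: one delay between each
# pair of consecutive page loads (page load time itself is excluded).
min_total_delay_s = (max_pages - 1) * DELAY_MS / 1000

print(max_pages)           # 5
print(min_total_delay_s)   # 12.0
```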
You can find more information about advanced crawler settings in the Apify documentation.
Running the crawler
Everything should finally be ready. After you hit the Run button, you should see the crawler load the first 5 pages of Hacker News. After it finishes you should see something like this:
In the Results tab you can see a table with all extracted articles (the view is limited to 100 rows). To see complete results, click on full table in a new window:
In the Full data tab you can find complete details about the last crawler run in JSON format, as well as CSV, HTML and XML.
In the Log tab you can check that your crawler worked according to the process described above. You can also check that the Clickable elements option only matched a single element on every page.
If you want to see more details in the log, simply check the Verbose log option in Advanced settings.
And that's all. If you're missing something here, please let us know. Note that we'll be adding more tutorials describing the advanced concepts of the Apify crawler, such as accessing dynamic page content, authentication and API.