This article is about the legacy Apify Crawler product, which is being phased out and replaced by the apify/legacy-phantomjs-crawler actor. Note that this tutorial can be applied to the new actor as well.
For new projects, we recommend using the newer apify/web-scraper actor, which is based on the modern headless Chrome browser and offers more features and better performance. For complete control over the crawling, you might also consider developing a new actor in Node.js using the Apify SDK.
The tutorial is divided into four chapters:
What is our goal?
Our goal is to create a crawler that can extract the first 150 articles from the front page of Hacker News. The source website looks like this:
We want to extract the data from the website in a structured JSON format:
[
  {
    "title": "From inside Facebook",
    "time": "2 hours ago"
  },
  {
    "title": "That awkward moment when Apple mocked good hardware and poor people",
    "time": "5 hours ago"
  },
  {
    "title": "Akin's Laws of Spacecraft Design*",
    "time": "4 hours ago"
  }
]
Create a new crawler
First, please sign in to Apify and go to the Crawlers section, where you will see copies of the example crawlers from our front page. Click the Add new button to create a new crawler.
You will see a page with crawler settings, which is divided into 4 sections:
- Basic settings shows basic properties of the crawler, such as start URLs, pseudo-URLs and a page function
- Advanced settings contains detailed crawler configuration
- API describes how to start the crawler and fetch results using an API
- Run console helps you develop and debug your crawler
So, let's dive in. First, fill in Custom ID to give your new crawler a name, for example "Hacker News - my very first crawler". Then add "https://news.ycombinator.com/news" to Start URLs to let the crawler know which web page it should open, and click the Save button. That's all you need to perform the first dry run of your crawler, because the Page function is pre-defined for you.
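Expressed as data, the two values entered so far could be sketched as the object below. This is only an illustration: the property names customId and startUrls and the helper hasValidStartUrls are assumptions made for this example, not the exact format the crawler stores or its API expects.

```javascript
// Hypothetical representation of the values typed into Basic settings.
const crawlerSettings = {
    customId: 'Hacker News - my very first crawler',
    startUrls: ['https://news.ycombinator.com/news'],
};

// Hypothetical sanity check: there must be at least one start URL and
// each one must parse as an absolute http(s) URL.
function hasValidStartUrls(settings) {
    return settings.startUrls.length > 0 && settings.startUrls.every((url) => {
        try {
            const parsed = new URL(url);
            return parsed.protocol === 'http:' || parsed.protocol === 'https:';
        } catch (err) {
            // new URL() throws on strings that are not absolute URLs.
            return false;
        }
    });
}

console.log(hasValidStartUrls(crawlerSettings));
```

The start URL needs to be absolute (including the https:// prefix) so that the crawler knows exactly which page to open.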
Your configuration should look as follows:
Go to Run console and click the Run button. After a few seconds, you should see a screenshot of the front page of Hacker News in the Page tab:
Note that the crawler only loaded a single web page, because we didn't tell it how to find the next pages - we will address this issue later.
Now let's have a look at the Results tab, which contains the structured data extracted from the page. As you can see, it only contains dummy values because we only used the pre-filled Page function that did not extract any meaningful values from the web page. In the next chapter, we will fix this.
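To illustrate why the results contain only dummy values, here is a minimal sketch of a page function that never reads the page content. It is not the actual pre-filled code: the function signature, the context.request.url property and the field names are assumptions made for this illustration, and the context object is simulated by hand so the snippet runs on its own in Node.js.

```javascript
// Hypothetical stand-in for a pre-filled page function. It returns fixed
// placeholder values and never looks at the content of the loaded page.
function pageFunction(context) {
    return {
        exampleField: 'placeholder value',   // hypothetical field name
        visitedUrl: context.request.url,     // assumed shape of the context
    };
}

// Simulate a single call by the crawler with a hand-made context object.
const fakeContext = { request: { url: 'https://news.ycombinator.com/news' } };
const result = pageFunction(fakeContext);

console.log(JSON.stringify(result, null, 2));
```

Whatever page is loaded, a function like this produces the same placeholder output, which is why the extraction logic has to be written specifically for the Hacker News page.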