This article relates to the legacy Apify Crawler product, which is being retired in favor of the apify/legacy-phantomjs-crawler actor. All the information in this article is still valid, and applies to both the legacy Crawler product and the new actor. For more information, please read this blog post.
For new projects, we recommend using the newer apify/web-scraper actor that is based on the modern headless Chrome browser.
Apify is typically used for crawling and scraping publicly available data, but sometimes you need to access data that is only available after login. Typical use cases include getting partner prices from an e-commerce site, importing your orders as a seller from various marketplaces, or fetching statistics from a Google business insights page. Many sites and services you use as a signed-in user don't provide any suitable API.
Warning: be extra careful with your crawler configuration when using logins. When you log in automatically using Apify, your crawler acts on your account with your credentials. Narrow the clickable elements down to a minimal set: you definitely don't want to automatically click on all 'a' elements. Also, running many parallel processes or crawlers can easily cause the site to recognize that a crawler is using your account.
How to log in
There are several ways to log in before you actually start crawling; which one works best typically depends on the site you need to log in to. Below are the most commonly used options, with links to more detailed articles about each:
- sending POST data with username and password in a Start URL
- using credentials in Custom HTTP headers
- submitting a form in the first Page function
- importing Initial cookies of a logged-in user
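The form-submission option can be sketched as a legacy-crawler page function that fills in the login form on the first page. This is a minimal sketch, not a drop-in solution: the `#username`, `#password`, and `#login-form` selectors, the `LOGIN` label, and the credential values are assumptions you would replace for the target site.

```javascript
// Hedged sketch of a legacy Crawler pageFunction that logs in when it
// visits the login page. All selectors and credentials below are
// placeholder assumptions -- adjust them to the site you are crawling.
function pageFunction(context) {
    var $ = context.jQuery;
    if (context.request.label === 'LOGIN') {
        $('#username').val('my-user');      // assumed input field id
        $('#password').val('my-secret');    // assumed input field id
        $('#login-form').submit();          // assumed form id
        // Tell the crawler to wait for the post-login navigation
        // instead of finishing the page immediately.
        context.willFinishLater();
        return;
    }
    // ...normal scraping logic for pages visited after login...
    return { url: context.request.url };
}
```

With Cookies persistence enabled, the session cookies obtained by this first page function are then reused by all subsequent pages in the run.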
For all these options, you also need to set Cookies persistence in Advanced settings to either "Per full crawler run" or "Over all crawler runs" so that the whole crawler run preserves the login state. You can find more info about cookie handling in our docs. If your authenticated session expires during the crawler run, you can simply log in again once you notice it: enqueue the login page with its queuePosition property set to "FIRST" so it is processed before the remaining pages.
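Re-logging in after session expiry can be sketched inside a page function as follows. The expiry check (`a.sign-in` appearing when logged out), the `/login` URL, and the `LOGIN` label are all assumptions for illustration; only the `queuePosition: 'FIRST'` pattern is the point here.

```javascript
// Hedged sketch: detect an expired session and re-enqueue the login page
// at the front of the queue. Selector, URL, and label are placeholder
// assumptions -- replace them with values from the target site.
function pageFunction(context) {
    var $ = context.jQuery;
    // Hypothetical check: the site shows a "Sign in" link when logged out.
    var sessionExpired = $('a.sign-in').length > 0;
    if (sessionExpired) {
        context.enqueuePage({
            url: 'https://www.example.com/login',
            label: 'LOGIN',
            queuePosition: 'FIRST' // crawl the login page before other pages
        });
        return null; // nothing to extract from a logged-out page
    }
    return { url: context.request.url };
}
```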
Happy logged-in crawling!