Apify provides a number of actors, such as apify/web-scraper or apify/cheerio-scraper, that make it really simple to crawl web pages and extract data from them. These actors start with a pre-defined list of Start URLs and then optionally follow links recursively to find new pages.

You can enter the Start URLs manually one by one, by linking a remote text file containing the URLs, or by uploading the file directly.

Let's say you have your start URLs to crawl in a Google Spreadsheet, such as this one:

https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU

Of course, you could easily export the spreadsheet to a comma-separated values (CSV) file and then upload that file to the Start URLs control. However, with this approach, changes in the spreadsheet will not be automatically propagated to the actor, and you'd need to upload the file again after every change. That's not very flexible.

Fortunately, there's a better way. By appending /gviz/tq?tqx=out:csv to the Google Spreadsheet URL, you'll get a URL that automatically exports the spreadsheet to CSV. Such a special spreadsheet URL looks like this:

https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU/gviz/tq?tqx=out:csv
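If you ever need to build this URL in code, for example when you work with many spreadsheets, a tiny helper can derive it from the spreadsheet ID. This is just a sketch; the helper name is made up and the ID shown is the example spreadsheet from this article:

```typescript
// Build a CSV export URL for a Google Spreadsheet.
// The /gviz/tq?tqx=out:csv endpoint returns the sheet's contents as CSV.
function spreadsheetCsvUrl(spreadsheetId: string): string {
    return `https://docs.google.com/spreadsheets/d/${spreadsheetId}/gviz/tq?tqx=out:csv`;
}

// Example with the spreadsheet used in this article:
console.log(spreadsheetCsvUrl('1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU'));
// -> https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU/gviz/tq?tqx=out:csv
```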

Then you just need to click Link remote text file and paste the special spreadsheet URL.

And that's it: the actor will now download the up-to-date content of the spreadsheet whenever it starts.
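If you start the actor programmatically rather than from the Apify Console, you can pass the same link in the startUrls input as a requestsFromUrl entry of a request list source. The sketch below uses the apify-client package and assumes a Node 18+ ESM context; YOUR_API_TOKEN and the trivial pageFunction are placeholders you'd replace with your own values:

```typescript
import { ApifyClient } from 'apify-client';

// A sketch of starting apify/web-scraper with Start URLs loaded from the spreadsheet.
// The requestsFromUrl entry tells the actor to download the linked file on startup
// and use every URL it finds in it.
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('apify/web-scraper').call({
    startUrls: [
        {
            requestsFromUrl:
                'https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU/gviz/tq?tqx=out:csv',
        },
    ],
    // Placeholder page function: just records the URL and page title.
    pageFunction: `async function pageFunction(context) {
        return { url: context.request.url, title: document.title };
    }`,
});

console.log('Run finished with status:', run.status);
```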

Beware that the spreadsheet should have a simple structure so that Apify can easily find the URLs in it, and it should only have one sheet.
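If you want to sanity-check that the export URL returns what the actor expects, you can fetch it and look at the values yourself. Here is a rough sketch, assuming Node 18+ with the built-in fetch and URLs in the first column; the crude comma split is only good enough for a quick check:

```typescript
// Quick check: download the CSV export and list the values that look like URLs.
const csvUrl =
    'https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU/gviz/tq?tqx=out:csv';

const response = await fetch(csvUrl);
const csv = await response.text();

// The export quotes each cell, so strip the quotes before checking.
const urls = csv
    .split('\n')
    .map((line) => line.split(',')[0]?.replace(/^"|"$/g, '').trim())
    .filter((value) => value?.startsWith('http'));

console.log(`Found ${urls.length} URLs:`, urls);
```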

Happy crawling of URLs from your Google Spreadsheet!
