Let's say we want to scrape a database of craft beers before the summer season starts. If we're lucky, the website (https://www.brewbound.com) contains a sitemap (https://www.brewbound.com/sitemap.xml). A sitemap is usually located at the path /sitemap.xml . It's always worth trying that URL, because even though it's rarely linked anywhere on the site, it usually contains a list of all the site's pages in XML format. This is what it looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
...
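
Before reaching for a full crawler, it helps to see what happens under the hood: download the sitemap and pull every URL out of its <loc> tags. The XML below is a shortened stand-in for the real sitemap, and the regex is just an illustration; the SDK's requestsFromUrl option does this for us later:

```javascript
// A shortened stand-in for the real sitemap.xml content.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
    </url>
</urlset>`;

// Grab every <loc>...</loc> pair, then strip the tags to keep only the URLs.
const urls = sitemapXml
    .match(/<loc>(.*?)<\/loc>/g)
    .map((tag) => tag.replace(/<\/?loc>/g, ''));

console.log(urls); // [ 'http://www.brewbound.com/advertise' ]
```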

The URLs of breweries are in the form

http://www.brewbound.com/breweries/[BREWERY_NAME]

and the URLs of craft beers are in the form

http://www.brewbound.com/breweries/[BREWERY_NAME]/[BEER_NAME]

which can be matched with the following regular expression:

http:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+

Note the two [^\/<]+ parts of the regex, which exclude the < character. That's because we don't want the match to run into the </loc>  tag that closes each URL.
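
To check that the regex matches what we want, we can test it against a few URL shapes (the brewery and beer names below are made-up examples):

```javascript
const beerRegex = /http:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/;

// Beer pages have two path segments after /breweries/ and match:
console.log(beerRegex.test('http://www.brewbound.com/breweries/some-brewery/some-beer')); // true

// Brewery pages and other pages don't:
console.log(beerRegex.test('http://www.brewbound.com/breweries/some-brewery')); // false
console.log(beerRegex.test('http://www.brewbound.com/advertise')); // false
```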

In the following code, we will use the Apify SDK. First, let's load the beer URLs from the sitemap into a RequestList, using our regex to match only the (craft!) beer URLs and not brewery pages, the contact page, etc.:

const requestList = new Apify.RequestList({
    sources: [{
        requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
        regex: /http:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/
    }],
});

await requestList.initialize();

Now let's use PuppeteerCrawler to open each page from the RequestList with Puppeteer, scrape it and push the results to the default dataset:

const crawler = new Apify.PuppeteerCrawler({
    requestList,

    handlePageFunction: async ({ page, request }) => {
        const data = await page.evaluate(() => {
            // The page's second <h1> contains the title in the "Brewery: Beer" form.
            const title = document.getElementsByTagName('h1')[1].innerText;
            const [brewery, beer] = title.split(':');
            // The beer's description lives in an element with the "productreviews" class.
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        // Save the scraped record to the default dataset.
        await Apify.pushData(data);
    },
});

await crawler.run();
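
The handlePageFunction above relies on the page title having the "Brewery: Beer" form, so destructuring the result of split(':') yields both names. Here is how that parsing step behaves on its own (the names below are made-up, not real Brewbound data):

```javascript
// Hypothetical title in the "Brewery: Beer" form the code above expects.
const title = 'Some Brewery: Some Beer';
const [brewery, beer] = title.split(':');

console.log(brewery);     // 'Some Brewery'
console.log(beer.trim()); // 'Some Beer'
```

Note that split(':') leaves a leading space on the beer name, so calling trim() on both values would give us cleaner data.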

If we run the full code

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{
            requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
            regex: /http:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/
        }],
    });

    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,

        handlePageFunction: async ({ page, request }) => {
            const data = await page.evaluate(() => {
                const title = document.getElementsByTagName('h1')[1].innerText;
                const [brewery, beer] = title.split(':');
                const description = document.getElementsByClassName('productreviews')[0].innerText;

                return { brewery, beer, description };
            });

            await Apify.pushData(data);
        },
    });

    await crawler.run();
});

on the Apify platform, it gives us a nicely formatted spreadsheet containing a list of breweries with their beers and descriptions!
