When running crawlers that go through a single website, each page opened has to load all of its resources again (sadly, the headless browser does not use a cache). The problem is that each resource has to be downloaded over the network, which can be slow and/or unstable (especially when proxies are used).

For this reason, in this article we will take a look at how to cache responses in memory (only those whose "cache-control" header has a "max-age" above 0).

For this example we will use this actor, which goes through the top stories on the CNN website and takes a screenshot of each page it opens.
(The actor is very slow, since it waits until all network requests are finished and the posts contain videos.)

If the actor is run with caching disabled, we get these statistics at the end of the run:

As you can see, for 10 posts (that is how many posts are in the top stories column) and 1 main page, we used 177MB of traffic.

We also stored all the screenshots; you can find them here.

From the screenshot above, it's visible that most of the traffic comes from script files (124MB) and documents (22.8MB).

In situations like this, it's always good to check whether the content of the page is cacheable. You can do that using Chrome's Developer Tools.

If we go to the CNN website, open the Developer Tools and switch to the "Network" tab, there is an option to disable the cache.

With the cache disabled, we can see how much data is transferred when we open the page; the total is shown at the bottom of the Developer Tools.

Now if we uncheck the "Disable cache" checkbox and refresh the page, we will see how much data we can save by caching responses.

If we compare the two, the cache reduces the data transfer on the front page by 88%.
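If you would rather check this from code than from DevTools, a small Puppeteer sketch like the one below can list which responses a site marks as cacheable. (This snippet is only illustrative and not part of the actor; the URL and launch options are placeholders.)

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Log every response whose cache-control header allows caching.
    page.on('response', (response) => {
        const cacheControl = response.headers()['cache-control'] || '';
        const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
        const maxAge = maxAgeMatch ? parseInt(maxAgeMatch[1], 10) : 0;
        if (maxAge > 0) {
            console.log(`cacheable for ${maxAge}s: ${response.url()}`);
        }
    });

    await page.goto('https://edition.cnn.com');
    await browser.close();
})();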

Let's try to emulate this in Puppeteer. All that needs to be done is to check, whenever a response is received, whether it contains the "cache-control" header with a max-age higher than 0. If it does, we save the response's headers, URL and body to memory, and on each subsequent request we check whether the requested URL is already stored in the cache.

The code looks like this:

// At the top of your code
const cache = {};

// The code below should go between the newPage() call and the goto() call

await page.setRequestInterception(true);

// Serve requests from the in-memory cache when we have a fresh copy.
page.on('request', async (request) => {
    const url = request.url();
    if (cache[url] && cache[url].expires > Date.now()) {
        await request.respond(cache[url]);
        return;
    }
    request.continue();
});

// Store cacheable responses (cache-control with max-age > 0) in memory.
page.on('response', async (response) => {
    const url = response.url();
    const headers = response.headers();
    const cacheControl = headers['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    const maxAge = maxAgeMatch ? parseInt(maxAgeMatch[1], 10) : 0;
    // input.cacheResponses is the actor input option that toggles caching.
    if (maxAge && input.cacheResponses) {
        // Skip URLs that are already cached and not expired yet.
        if (cache[url] && cache[url].expires > Date.now()) return;

        let buffer;
        try {
            buffer = await response.buffer();
        } catch (error) {
            // Some responses (e.g. redirects) have no body available.
            return;
        }

        cache[url] = {
            status: response.status(),
            headers: response.headers(),
            body: buffer,
            expires: Date.now() + (maxAge * 1000),
        };
    }
});
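For context, here is a minimal sketch of where those two handlers sit in a plain Puppeteer script. The launch options, target URL and screenshot path are just placeholders, not taken from the actor.

const puppeteer = require('puppeteer');

const cache = {};

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // The caching code goes here: after newPage() and before goto().
    await page.setRequestInterception(true);
    // Replace this pass-through handler with the 'request' and 'response'
    // handlers from the snippet above.
    page.on('request', (request) => request.continue());

    await page.goto('https://edition.cnn.com', { waitUntil: 'networkidle0' });
    await page.screenshot({ path: 'screenshot.png', fullPage: true });

    await browser.close();
})();

The `waitUntil: 'networkidle0'` option here mimics the actor's behaviour of waiting until all network requests are finished, as mentioned above.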

With this code implemented, we can try to run the actor again.

Looking at the statistics, the data transfer went from 177MB to 13.4MB, which is a reduction of about 92%.

Screenshots can be found here.

It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle and CNN has a lot of tracking and analytics scripts that keep the network busy.

Hopefully this short tutorial helps you with your solutions.
