Getting information from inside iframes is a known pain, especially for new developers. After spending some time on Stack Overflow, you usually find answers such as jQuery's contents() method or the native contentDocument property, which can guide you to the insides of an iframe. Still, getting the right identifiers and holding on to that new context is a little annoying. Fortunately, everything is simpler and more straightforward in Puppeteer.

Finding the right iframe

If you are using basic methods of the page object such as 'page.evaluate()', you are already working with frames. Behind the scenes, Puppeteer calls 'page.mainFrame().evaluate()'. Most of the methods you use on the page object can therefore be used the same way on a frame object. To access frames, you simply loop over the main frame's child frames and identify the one you want to use.
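
As a small illustration of that equivalence, the sketch below defines a helper that accepts either a page or a frame. The name readTitle is made up for this example; it relies only on evaluate(), which both objects provide.

```javascript
// Hypothetical helper: works with a Page or a Frame,
// because both expose evaluate() in the same way.
async function readTitle(pageOrFrame) {
    // The callback runs in the browser context of the given page or frame
    return pageOrFrame.evaluate(() => document.title)
}

// Usage inside a Puppeteer script:
// const pageTitle = await readTitle(page)
// const sameTitle = await readTitle(page.mainFrame())
```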

For simple practice with Puppeteer, we can use an online playground and try to scrape the Twitter widget iframe from https://www.imdb.com/

const puppeteer = require('puppeteer')

const browser = await puppeteer.launch()
// in an Actor, use Apify.launchPuppeteer()

const page = await browser.newPage()

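await page.goto('https://www.imdb.com')
// we need to wait for the Twitter widget to load
// (page.waitFor() was deprecated and later removed, so a plain timer is used)
await new Promise(resolve => setTimeout(resolve, 5000))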

let twitterFrame // this will be populated later by our identified frame

for (const frame of page.mainFrame().childFrames()){
    // Here you can use identifying methods such as url(), name() or title()
    // (note that title() returns a promise)
    if (frame.url().includes('twitter')){
        console.log('we found the Twitter iframe')
        twitterFrame = frame
        // we assign this frame to twitterFrame to use it later
    }
}

await browser.close()
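
A fixed five-second wait is either too long or too short depending on the network. One alternative, sketched here under assumptions, is to poll the frame list until a matching frame appears. The helper name waitForFrameMatching, its options and their default values are hypothetical; the only Puppeteer calls it assumes are page.frames(), which returns every frame attached to the page, and frame.url().

```javascript
// Hypothetical helper: polls page.frames() until a frame satisfies the predicate.
async function waitForFrameMatching(page, predicate, { timeout = 10000, interval = 250 } = {}) {
    const deadline = Date.now() + timeout
    while (true) {
        const match = page.frames().find(predicate)
        if (match) return match
        if (Date.now() >= deadline) {
            throw new Error(`No matching frame appeared within ${timeout} ms`)
        }
        // Sleep for a short interval before checking again
        await new Promise(resolve => setTimeout(resolve, interval))
    }
}

// Usage inside a Puppeteer script:
// const twitterFrame = await waitForFrameMatching(page, f => f.url().includes('twitter'))
```

The predicate is synchronous here, so it suits checks on url() or name(); checks that need to query the DOM of the frame would require an async variant.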

If it is hard to identify the iframe you want to access, don't worry. You can use most of the page methods on the frame object to help you identify it, scrape it or manipulate it. You can also go through any nested frames.

let twitterFrame
for (const frame of page.mainFrame().childFrames()){
    if (frame.url().includes('twitter')){
        for (const nestedFrame of frame.childFrames()){
            const tweetList = await nestedFrame.$('.timeline-TweetList')
            if (tweetList){
                console.log('We found the frame with tweet list')
                twitterFrame = nestedFrame
            }
        }
    }
}
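
The nested loops above cover exactly two levels. When the depth is not known in advance, a recursive walk over childFrames() is one option. The sketch below is only an illustration: findFrame is a hypothetical helper, and its predicate may be async so that it can call methods such as frame.$().

```javascript
// Hypothetical helper: depth-first search through a frame and all its descendants.
async function findFrame(frame, predicate) {
    // Check the current frame first, then descend into its children
    if (await predicate(frame)) return frame
    for (const child of frame.childFrames()) {
        const match = await findFrame(child, predicate)
        if (match) return match
    }
    return null
}

// Usage inside a Puppeteer script:
// const twitterFrame = await findFrame(
//     page.mainFrame(),
//     async frame => Boolean(await frame.$('.timeline-TweetList')),
// )
```

The helper returns null when nothing matches, so the caller should check the result before using it.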

Okay, we used a slightly more advanced technique to find a nested iframe. Now that we have it assigned to our twitterFrame variable, the hard work is over and we can start working with it (almost) as we would with the regular page object.

const textFeed = await twitterFrame.$$eval(
    '.timeline-Tweet-text',
    pElements => pElements.map(el => el.textContent)
)
for (const text of textFeed){
    console.log(text)
    console.log('**********')
}

With a little more effort, we could also follow different links from the feed or even play a video, but that is not within the scope of this article. For a full reference on the page and frame objects (and Puppeteer in general), you should study the documentation. New versions are released quite often, so checking the docs regularly can help you stay on top of web scraping and automation.
