Must-have marketer skill: Web Scraping
Here’s a trick: You can do it in Google Sheets. Scrape links, tables, emails, social handles. ANYTHING in a sheet.
Here’s the 2-minute tutorial?
The function to know: =IMPORTXML().
This pulls in any structured data type like CSV, RSS, HTML, or XML from a URL. For this tutorial, we will work with XML or “xpaths.”
=IMPORTXML() takes 2 Arguments
- The URL you want to use, including the “https://” bit and wrapped in “” . For example: “https://kolbyhatch.com“
2. The xpath of the element you want to scrape.
Let’s briefly talk xpaths ?
Think of these as roadmaps to data types on a webpage.
w3schools has a great tutorial on them:
https://www.w3schools.com/xml/xpath_intro.asp
Now xpaths can be simple or complicated, but getting them is easy.
- On the webpage, open inspector (⌃⌘C)
- Find the HTML element you want to scrape (title, H1, href, etc.)
- Right-click, Copy –> “Copy xpath”
For more basic scrapping like, page titles, or links on a webpage, you can use these generic xpaths:
- All links = //@href
- All H1s = //h1 (works for any heading, h2, h3)
- Title = //title
(There’s more on the w3school tutorial above)
Putting it all together:
=ImportXML(“https://thehustle.co/home/“, //@href)
This will scrape all links on The Hustle’s homepage.
Using =ImportXML() function and copied xpaths you can build sheets that scrape…
- Front pages of publishers, giving you article links, titles, authors
- Scrape emails from company “about” pages
- Scrape social handles from pages
[insert endless list]
Try it for yourself: I made a simple scraper that pulls links from any webpage.
Copy the sheet to your drive, drop in a URL and boom.
Here it is:
https://docs.google.com/spreadsheets/d/12Yc4FnOMEWli4mmQw6IJQxDUWwdS-mhz1jlBCgkfPSw/edit?usp=sharing
Personally, I’ve used this trick to quickly pull data from hosted tables, or scrape social handles. You’ll have to tweak/copy the right xpaths).
It’s good for building lists of leads or contacts.
It does have limitations. You can crash a sheet if the data set is too large — I’ve had issues over 300 cells.
And sites like FB, TikTok, Instagram, make it really hard to do this.