Web Scraping for SEO

Web scraping is the magical act of extracting information from a web page. You can do it on one page or millions of pages. There are multiple reasons why scraping is essential in SEO:

We might use it for auditing a website

We might need it in the context of programmatic SEO 

We could use it for providing context to our web analytics

Here at WordLift, we primarily focus on structured data and improving the data quality of content knowledge graphs. We depend on crawling to cope with missing and messy data in various use cases.

Extracting Structured Data from Web Pages using Large Language Models

Recently, I’ve been exploring the potential of OpenAI function calling for extracting structured data from web pages. This could be a game-changer for those who, like us, are actively looking to combine Large Language Models (LLMs) with Knowledge Graphs (KGs).

Why is this exciting? Because the integration of LLMs with KGs is fast becoming a hot topic in tech, and developing a unified framework that can enrich both LLMs and KGs simultaneously is of significant importance.

By using this Colab Notebook, you can extract entity attributes from a list of URLs – even from pages built with JavaScript! In this implementation, I used the schema for LodgingBusiness (hotels, B&Bs, and resorts).
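To make the function-calling idea concrete, here is a minimal sketch, not the notebook's actual code. It shows the kind of function definition (a name plus a JSON Schema for the parameters) that function calling relies on, and a small check of the arguments a model might send back. The function name `extract_lodging_business`, the selection of schema.org properties, and the sample response are all assumptions made for illustration; the API call itself is left out.

```python
import json

# Hypothetical function definition: the name and the choice of properties are
# assumptions, loosely following schema.org/LodgingBusiness.
EXTRACT_LODGING_FUNCTION = {
    "name": "extract_lodging_business",
    "description": "Extract attributes of a hotel, B&B, or resort from page text.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "telephone": {"type": "string"},
            "priceRange": {"type": "string"},
            "checkinTime": {"type": "string"},
            "numberOfRooms": {"type": "integer"},
            "petsAllowed": {"type": "boolean"},
        },
        "required": ["name"],
    },
}

_JSON_TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_arguments(arguments_json: str, function_def: dict) -> dict:
    """Parse the model's JSON arguments and keep only declared, well-typed fields."""
    schema = function_def["parameters"]
    data = json.loads(arguments_json)
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    clean = {}
    for key, value in data.items():
        spec = schema["properties"].get(key)
        if spec is None:
            continue  # drop fields the schema never asked for
        expected = _JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject True/False for integers
        wrong_bool = expected is int and isinstance(value, bool)
        if isinstance(value, expected) and not wrong_bool:
            clean[key] = value
    return clean


# Made-up response: one mistyped field and one field that was never requested.
sample = '{"name": "Hotel Example", "numberOfRooms": "forty", "petsAllowed": true, "rating": 5}'
validated = validate_arguments(sample, EXTRACT_LODGING_FUNCTION)
print(validated)
```

Checking the returned arguments against the same schema you sent is a cheap first line of defense before the data reaches a knowledge graph.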

A few lessons learned from this exploration:

We can seamlessly extract data from webpages using LLMs.

It’s wise to continue using existing scraping techniques where possible. For instance, BeautifulSoup is excellent for scraping titles and meta descriptions.

Using LLMs is slow and expensive, so optimizing the process is key.

After extraction, it’s crucial to thoroughly check and validate the data to ensure its accuracy and reliability. Data integrity is paramount!
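The BeautifulSoup point in the list above can be sketched in a few lines. This is a minimal example, assuming the `beautifulsoup4` package is installed; the sample HTML and the helper name `extract_title_and_description` are made up for illustration, and a real run would first download the page.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_title_and_description(html: str) -> dict:
    """Return the <title> text and the meta description of an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "").strip() if meta else None
    return {"title": title, "meta_description": description}


# Made-up page used only to show the shape of the result.
SAMPLE_HTML = """<html><head><title> Hotel Example </title>
<meta name="description" content="A sample lodging page."></head>
<body></body></html>"""

result = extract_title_and_description(SAMPLE_HTML)
print(result)
```

Fields like these cost nothing to parse, so it makes sense to collect them this way and keep the LLM for attributes that have no stable place in the markup.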

The code is open for modifications and adaptations to suit your needs and can be integrated with the AutoScraper introduced in this article.

AutoScraper, developed by Alireza Mika, makes web scraping fast, simple, and fun. All the credit goes to him for bringing innovation to a sector that isn’t evolving as fast as you might think.

AutoScraper – the new kid on the block

If you are interested in using the library in Python, I suggest you read Ali’s blog post on Medium.

I found this tool very powerful, yet limited to only some use cases, so I decided to build a simple Streamlit web application that you can use immediately.

Jump to the web application here

Here is how the scraping app works 

You provide the URL of the web page used as a template. I am using a product page on our E-commerce demo site as a reference.

You add a list of information (comma separated) that you expect to scrape from that page. Here you can add anything, a snippet of text, the URL of an image, or the structured data property present in the markup. I am adding the title, the price, and the SKU in this example.

You finally hit “Train” and let AutoScraper learn to extract these attributes from similar pages.

You can choose to let AutoScraper run under the assumption that all pages will be the same (choose “exact”) or that they will have a similar structure (choose “similar” instead). 

You can now add a list of pages that you would like to scrape. I have added two samples here. Keep in mind that there is a limit to the total number of characters that you can add (and therefore to the total number of URLs that you can scrape). This is a demonstration tool and should be used only for a limited set of pages.

Voilà, the work is done, and you can now download a CSV containing the price, SKU, and product name for each URL.
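The steps above map onto a handful of AutoScraper calls. The sketch below is not the code of the web application; it assumes the `autoscraper` package is installed, and the helper names `scrape_pages` and `to_csv`, the URLs, and the sample values are all hypothetical. It also assumes each page yields one value per attribute, in the order of the wanted list.

```python
import csv
import io


def scrape_pages(template_url, wanted, urls, mode="exact"):
    """Train on one template page, then apply the learned rules to other pages."""
    # Imported here so the rest of the module works without the package.
    from autoscraper import AutoScraper  # third-party: pip install autoscraper

    scraper = AutoScraper()
    # "Train": wanted holds sample values as they appear on the template page.
    scraper.build(template_url, wanted_list=wanted)
    fetch = scraper.get_result_exact if mode == "exact" else scraper.get_result_similar
    return {url: fetch(url) for url in urls}


def to_csv(results, columns):
    """Serialize {url: [values...]} into CSV text, one row per URL."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", *columns])
    for url, values in results.items():
        writer.writerow([url, *values])
    return buffer.getvalue()


# A real run needs network access, for example:
# results = scrape_pages(
#     "https://example.com/product/1",            # template page
#     ["Sample product title", "$10", "SKU-1"],   # values seen on that page
#     ["https://example.com/product/2"],
# )
# Made-up results, used here only to show the CSV output:
demo_results = {"https://example.com/product/2": ["Another product", "$12", "SKU-2"]}
csv_text = to_csv(demo_results, ["title", "price", "sku"])
print(csv_text)
```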

How to refine the results

In some cases, we might get false positives; in other words, AutoScraper might extract data that we don’t need. In these cases, we’ll need to revise the set of rules that have been identified and keep just what we need. Let’s review an example.  

If we add the URL of the image behind the reference product to the list of attributes that we want to extract, we will get a table with an unneeded column (column 4).

We can now refine the rules by clicking on the “Refine Results” button. Here we can see that if we remove rule_zk7p and hit “Crawl” again, we now have the correct table without column 4.  
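In code, the same refinement can be sketched with AutoScraper's grouped results and its `remove_rules` method. This is an illustration under assumptions: `scraper` is an already trained AutoScraper instance, and the helper name `refine` and the URL in the usage comment are hypothetical.

```python
def refine(scraper, sample_url, unwanted_rule_ids):
    """List what each learned rule extracts, drop the unwanted rules, and re-run."""
    # grouped=True returns a dict that maps each rule id to the values it extracts
    grouped = scraper.get_result_similar(sample_url, grouped=True)
    for rule_id, values in grouped.items():
        print(rule_id, values)
    # Drop the rules that produce columns we do not need
    scraper.remove_rules(unwanted_rule_ids)
    return scraper.get_result_similar(sample_url, grouped=True)


# Usage with a trained scraper (needs network access):
# cleaned = refine(scraper, "https://example.com/product/2", ["rule_zk7p"])
```

If it is easier to name the rules you want than the ones you do not, AutoScraper also offers `keep_rules`, which takes the list of rule ids to retain.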

Existing limitations

This is a demonstrative web app. The UI is a bit clunky when you start refining rules, and in general, it is limited to crawling only a few URLs. If you are looking for something that scales, I would recommend Advertools, a well-known Python library developed by the legendary Elias Dabbas.

If you want to see how you can use it, watch this webinar. Here, Elias Dabbas and Doreid Haddad show how to build a Knowledge Graph using Advertools and WordLift.

Is web scraping illegal?

No, web scraping is generally legal; commercial search engines depend on it. However, there are some considerations to be made:

Some websites might have terms and conditions that do not allow scraping;

Technically speaking, scraping consumes a significant amount of bandwidth and computational resources, so we should do it only when it is needed. Google itself is reviewing its indexing policies to be more environmentally friendly; we should do the same.

How we use the extracted data makes a huge difference. We want to be respectful of others’ content and aware of potential copyright infringements. 

You can find more useful information on this topic here.
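One practical way to act on these considerations is to check a site's robots.txt before fetching anything and to pause between requests. robots.txt is not mentioned above and is not the same thing as a site's terms and conditions, so treat this as a complementary courtesy check rather than a legal test. The sketch uses only Python's standard library; the robots.txt content, the user agent name, and the URLs are made up.

```python
from urllib import robotparser

# Made-up robots.txt; a real crawler would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-seo-bot"  # hypothetical crawler name


def build_parser(robots_txt: str) -> robotparser.RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/products/shoe",
    "https://example.com/private/report",
]
to_fetch = allowed_urls(parser, candidates)
# Fall back to a one-second pause when the site declares no Crawl-delay.
delay = parser.crawl_delay(USER_AGENT) or 1
print(to_fetch, delay)
# A real crawler would call time.sleep(delay) between requests.
```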

How can we scrape information? 

Here is the thread for you:

How to scrape content?

Today, I’ll walk you through all free and paid options you have to extract content from a webpage.

Part of our job is to extract content to use it for our analysis. So it can be useful to know how to do it. pic.twitter.com/uPH2cb2eT3

— Antoine Eripret (@antoineripret) February 17, 2022

The post Web Scraping for SEO appeared first on WordLift Blog.
