How to do NLP Named Entity Recognition with Python (Free Script Included)
Over the years, search and search engine algorithms have evolved. Two of the primary goals of this continued evolution are to better understand the meaning and intent behind search queries and to provide better search results. Achieving these goals helps search engines provide a better experience by answering questions more quickly, guiding users along their search journeys, and surfacing the information they are looking for more efficiently.
This is widely recognized as semantic search. In this new world, search engines have evolved from keyword-matching-based information retrieval systems to semantic-based systems that can more deeply understand search queries and documents rather than simply looking at the frequency of keywords used.
The SEO industry commonly refers to this as search engines evolving from strings to things. This process is more advanced than search engines simply crawling and parsing a page and then extracting keywords to evaluate and rank search results. Rather, a semantic information retrieval system is able to more deeply understand search queries using multiple advanced search ranking systems. Named Entity Recognition/Entities and Natural Language Processing play a major role in making semantic search possible.
This blog post takes a look at four Python libraries and APIs that you can use to perform NLP entity SEO. You can make a copy of the Webfor Entity Analyzer Tool (Public) – Google Colab notebook that we will be covering in more detail below. I also have short YouTube videos below for three of the libraries covering how to get your various API keys and how the scripts work.
There is much that goes into Semantic Search/SEO and we will not be able to cover all related concepts within this blog post. For now, we will focus mainly on natural language processing (NLP) and entities as they relate to semantic SEO.
Here is an overview of what we will cover in the post and the various libraries we will be using to analyze entities:
- How to do NLP Named Entity Recognition with Python (Free Script Included)
- Introduction
- What is Natural Language Processing (NLP)?
- What are Entities?
- How to use Python for Named Entity Recognition (NER)?
- Four Free Python Scripts for Entity Analysis
- How to Use TextRazor API and Python for Named Entity Recognition (NER)? (FREE SCRIPT)
- Analyze a Single URL using TextRazor (TextRazor V1)
- Analyze SERP Results using TextRazor (TextRazor V2)
- How to Use Dandelion.eu API and Python to Analyze Entities? (FREE SCRIPT)
- How to Use spaCy Python Library to Analyze Entities?
- How to Use OpenAI API and Python to Perform Entity Analysis
- Conclusion
What is Natural Language Processing (NLP)?
Natural language processing (NLP) is a branch of artificial intelligence and computer science. NLP’s primary goal is to give computers the ability to understand written and spoken human language and to analyze large corpora of natural language data.
NLP uses computational techniques to analyze and process text or spoken word, enabling machines to better understand and interpret unstructured data, such as written text. NLP has been used in a variety of applications, including machine translation, text summarization, sentiment analysis, question-answering systems, and more. It also plays a large role in search engine algorithms; we will cover more on that later in this post.
The goal of NLP is to extract meaningful information from unstructured or semi-structured data. A common example of unstructured data is the content on a webpage. This content is not organized in a machine-readable format, and search engines have to crawl the webpage and use various algorithms to identify topics, core concepts, and more. NLP can recognize patterns in language, identify entities that are within a corpus of text, and perform tasks such as part-of-speech tagging, sentiment analysis, and more.
One particular task that has become increasingly popular in recent years is named entity recognition (NER), which aims to recognize named entities in text and assign them predefined categories, such as person, organization, or location.
What are Entities?
Entities can be defined as any person, place, or thing. Google and other search engines use entities to understand the meaning behind search queries and to disambiguate search results.
Google uses its Knowledge Graph to understand entities and the relationships or connections between them — also known as semantic triples or tuples (more on this later). Entities on the knowledge graph have their own machine-readable entity ID (MREID) to help disambiguate concepts and entities.
Named entity recognition (NER) is an NLP process that involves identifying named entities from text. This helps machines understand text and human language better and extract meaningful information from unstructured data.
One way an entity can be highlighted in search results is through a Google Knowledge Panel. The information displayed in a knowledge panel is sourced from the connections and relationships in the Google Knowledge Graph.
You can use the Google Knowledge Graph Search API to examine entities on the knowledge graph and the information related to them. Merkle also offers a tool with a nice UX for exploring the Knowledge Graph.
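As a quick illustration, here is a minimal sketch of querying the Knowledge Graph Search API with Python's requests library (the query string is an example; you will need your own API key with the Knowledge Graph Search API enabled):

```python
import requests

# Assumption: your own Google API key with the Knowledge Graph Search API enabled.
API_KEY = "YOUR_API_KEY"

response = requests.get(
    "https://kgsearch.googleapis.com/v1/entities:search",
    params={"query": "Python (programming language)", "key": API_KEY, "limit": 5},
)
response.raise_for_status()

# Each result carries a machine-readable entity ID (the "@id" field),
# a name, and a short description.
for item in response.json().get("itemListElement", []):
    result = item["result"]
    print(result.get("@id"), "|", result.get("name"), "|", result.get("description", ""))
```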
How to use Python for Named Entity Recognition (NER)?
We are going to look at how to use Python to analyze entities and perform named entity recognition with four different Python libraries and APIs.
Python is one of the best resources a digital marketer can use to analyze and visualize competitors’ SEO strategies, identify opportunities, and much more. Entity analysis is one example of how Python can be extremely helpful in uncovering gaps and opportunities. Many libraries can be used for analyzing entities; we will explore some of them, including spaCy and Dandelion.eu.
There are also extremely powerful visualization libraries that we will examine. They offer insights that most standard SEO tools cannot. Plotting libraries like Matplotlib and Plotly are two examples that are powerful for SEOs, digital marketers, and business owners.
Using Python to analyze rankings goes much deeper than just entity analysis. Topic mapping, n-gram analysis, query counting, and clustering are just a few examples of other powerful SEO techniques that are possible with Python. Future blog posts will cover these various techniques.
With rapidly evolving language models now available to SEO professionals, I see these techniques being enhanced to gain even deeper insights and analyze datasets on a level that was never possible before.
For example, Meta has announced an open-source language model called LLaMA that could make language models more accessible while also improving sustainability, given the high computational resource demands these models place on the environment. The recent release of GPT-4 with multimodal capabilities is another great example of how quickly language models are growing and evolving. There are extensive use cases for digital marketers and SEOs.
Four Free Python Scripts for Entity Analysis
Below we will be examining four Python scripts (in one Colab notebook) that use four separate Python libraries/APIs to perform entity analysis. These scripts are, for the most part, fairly basic examples and should be enhanced to fit your or your organization’s needs. You are free to build on them as you build out your own toolset to take your SEO data analysis and SEO data science skills to the next level.
How to Use TextRazor API and Python for Named Entity Recognition (NER)? (FREE SCRIPT)
TextRazor is a powerful API for entity extraction, disambiguation, classification, and more.
The TextRazor API output includes entity IDs, Wikipedia links, Wikidata and Freebase IDs, and much more. Entity analysis using the TextRazor API is extremely powerful thanks to the depth of insight the API provides into entities.
Let’s look at two separate scripts for analyzing entities using TextRazor.
Analyze a Single URL using TextRazor (TextRazor V1)
In version one of the TextRazor script, we will simply crawl a single URL and extract and analyze the entities using TextRazor. We will then visualize the most frequently mentioned entities using Matplotlib. This version of the TextRazor script can be helpful for examining a single URL or comparing multiple URLs. This can provide insight into gaps in topics, entity salience misalignment, and more.
Here are the steps to run the first script:
- Navigate to textrazor.com
- Click to set up your account and get your free API key
- Install TextRazor and other Python libraries and import
- Insert TextRazor API key
- Create a TextRazor client using textrazor.TextRazor
- Set up dataframes and loop through entities
- Display the first 25 rows of the dataframe
- Plot out entities using Matplotlib
1. Navigate to TextRazor.com
Go to textrazor.com so you can set up your account and grab your free API key.
2. Click free API key and set up account
Click “Free API Key” and go through the process of entering your information and verifying your email so you can grab your free API key.
Go ahead and enter your information to set up your TextRazor account, and then grab your free API key so you can use it in the script. TextRazor allows up to 500 requests per day for free; anything beyond that will require one of the monthly plans.
3. Install TextRazor and other Python Libraries
Run the first cell to install and import TextRazor and other needed Python libraries.
4. Insert TextRazor API Key
Here is where you will insert the TextRazor API key you grabbed from the URL shared above. Replace the brackets and text with your new API key.
5. Create TextRazor Client and Enter URL to Analyze
Here you will want to replace the URL inside the client.analyze_url() call.
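Putting steps 4 and 5 together, here is a minimal sketch (the API key and URL are placeholders; the notebook's cells may differ slightly):

```python
import textrazor

# Assumption: your own TextRazor API key goes here.
textrazor.api_key = "YOUR_API_KEY"

# Request the entities extractor so the response includes entity data.
client = textrazor.TextRazor(extractors=["entities"])

# Replace the placeholder with the URL you want to analyze.
response = client.analyze_url("https://example.com/")

for entity in response.entities():
    print(entity.id, entity.relevance_score, entity.confidence_score)
```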
6. Create Dataframe with TextRazor Entities for URL
Run the cell. This will take all the entities the TextRazor API identified and add them to a dataframe, which is a good format for further analysis, visualization, and exporting of results.
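Under the hood, that cell does something along these lines (a sketch, not the notebook's exact code):

```python
import pandas as pd

# Collect each entity's name, scores, and Wikipedia link into rows.
rows = [
    {
        "entity": entity.id,
        "relevance": entity.relevance_score,
        "confidence": entity.confidence_score,
        "wikipedia_link": entity.wikipedia_link,
    }
    for entity in response.entities()
]

df = pd.DataFrame(rows)
df.head(25)  # display the first 25 rows in the notebook
```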
7. Visualize TextRazor Top Entities using Matplotlib
Now, let’s visualize the entities using Matplotlib showing the most frequently mentioned entities.
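A minimal sketch, assuming the df dataframe built in the previous step:

```python
import matplotlib.pyplot as plt

# Count how often each entity appears and keep the top 25.
top_entities = df["entity"].value_counts().head(25)

top_entities.plot(kind="barh", figsize=(10, 8))
plt.gca().invert_yaxis()  # most frequent entity at the top
plt.xlabel("Mentions")
plt.title("Most frequently mentioned entities")
plt.tight_layout()
plt.show()
```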
Analyze SERP Results using TextRazor (TextRazor V2)
In the second version of our TextRazor script we will modify the previous script to analyze all URLs on page one of SERPs for any keyword you want to examine. We will use the Advertools library I previously highlighted in a blog post about performing a competitor analysis using Python and Advertools.
- Install and import Advertools
- Insert Keyword and Crawl Page 1 of SERPs
- Save Rankings to CSV file
- Crawl and Analyze Rankings with TextRazor API
- Group Entities by URL and Count Entity Mentions
- Visualize Entities Using Plotly
Let’s examine the script step by step:
1. Install and import Advertools
First, you will install and import the Advertools library. This is the library we will use to crawl Page 1 search results. Advertools is a powerful library for digital marketers and SEOs that has built-in tools or functions for analyzing websites, search results, and much more.
“A digital marketer is a data scientist. Your job is to manage, manipulate, visualize, communicate, understand, and make decisions based on data.”
The tools built into the library include sitemap and website architecture analysis, website crawlers, keyword research, and more.
We will be using the Advertools serp_goog function to generate search results for a query.
2. Insert Keyword and Crawl Page 1 of SERPs
Now enter your query, search engine ID, and Google Search API key. The video above covers the steps for you to get your own Search Engine ID and Custom Search API Key. Once you have updated these fields you are ready to run this cell to crawl and scrape Page 1 SERP results.
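For reference, a minimal sketch of that cell might look like the following (the query is only an example, and cx/key are placeholders for your own credentials):

```python
import advertools as adv

# Assumptions: cx is your Programmable Search Engine ID and key is your
# Custom Search API key; the query is only an example.
serp_df = adv.serp_goog(
    q="named entity recognition",
    cx="YOUR_SEARCH_ENGINE_ID",
    key="YOUR_CUSTOM_SEARCH_API_KEY",
)

print(serp_df[["rank", "title", "link"]].head(10))
```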
3. Save Rankings to CSV file
Once you run this cell, Pandas’ to_csv will write the results to a CSV file named data.csv. You will see it appear in the file explorer, and you can download the CSV file to your computer if you choose.
4. Crawl and Analyze Rankings with TextRazor API
The next block of code will crawl the list of URLs collected for the search query and extract all the entities the TextRazor API identifies on each page, saving the results to a CSV file.
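A minimal sketch of that loop, reusing the TextRazor client created in the first script (error handling added so one failed page doesn't stop the crawl):

```python
import pandas as pd
import textrazor

all_rows = []
for url in serp_df["link"].dropna().unique():
    try:
        response = client.analyze_url(url)
    except textrazor.TextRazorAnalysisException as error:
        # Some pages fail to fetch or parse; skip them and keep crawling.
        print(f"Skipping {url}: {error}")
        continue
    for entity in response.entities():
        all_rows.append({"url": url, "entity_id": entity.id})

entities_df = pd.DataFrame(all_rows)
entities_df.to_csv("serp_entities.csv", index=False)
```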
5. Group Entities by URL and Count Entity Mentions
The next bit of code will group the dataframe by URL and entity ID and then count the number of times each entity is mentioned on each URL. This will allow us to plot the most frequently mentioned entities in the final step.
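In pandas terms, that grouping looks something like this (a sketch, assuming the entities_df dataframe from the previous step):

```python
# Count mentions of each entity per URL, most-mentioned first.
entity_counts = (
    entities_df.groupby(["url", "entity_id"])
    .size()
    .reset_index(name="mentions")
    .sort_values("mentions", ascending=False)
)
print(entity_counts.head())
```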
6. Visualize Entities Using Plotly
We will now visualize each URL grouped by the frequency of entities mentioned. Using Plotly Express, we can interact with the bar plot to further analyze the results. The graph displays every URL ranking for a given query, with a bar for each URL broken down by the entities mentioned and their frequency. A lot can be done with this data. For example, looking at the Search Engine Journal result, we can see how many times it mentions the most important topics, like machine learning, Python, and SEO, and then compare that against our own page to identify gaps in entities or entity salience misalignment.
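A minimal Plotly Express sketch, assuming the entity_counts dataframe from the previous step:

```python
import plotly.express as px

# One bar per ranking URL, stacked by the entities it mentions.
fig = px.bar(
    entity_counts,
    x="url",
    y="mentions",
    color="entity_id",
    title="Entity mentions by ranking URL",
)
fig.show()
```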
Plotly can provide impressive visualizations that make it easy to gather insights. In my post on performing a competitor analysis with Python, the SERP Heatmap output is also generated using Plotly.
How to Use Dandelion.eu API and Python to Analyze Entities? (FREE SCRIPT)
Dandelion.eu is a semantic text analysis API with capabilities that include entity extraction, semantic similarity, concept extraction, and entity sentiment analysis. Like the TextRazor API, it provides free access for a specific number of queries each day: the free tier offers up to 1,000 units per day, or up to 30,000 units monthly. Dandelion.eu offers paid tiers if you need more than 1,000 daily units.
The API response is fairly similar to TextRazor’s. By slightly modifying the code, you can grab the Wikipedia ID, description, and much more. These knowledge bases can be helpful for SEOs when analyzing or optimizing websites, and semantic APIs are useful to digital marketers more broadly. Some use cases include competitor analysis, SERP ranking analysis, topical authority, and much more.
Here are the steps our script will cover:
- Install and import Dandelion.eu API and other Python libraries
- Add Dandelion.eu token to dataTXT API
- Create Pandas Dataframe and Extract Entities
- Call entity_analyzer function
1. Install and import Dandelion.eu API and other Python libraries
In the first step, we will install and import Dandelion.eu and other Python libraries for the script.
2. Add Dandelion.eu token to dataTXT API
Now you will add your Dandelion.eu API token that you created. You can reference the video above to find out how to generate your own Dandelion.eu API token.
3. Create Pandas Dataframe and Extract Entities
Update this block with the URL that you want to analyze using the Dandelion.eu API by replacing the URL in the page variable.
With this cell, you will create the dataframe to store the entities and then extract the entities from your URL. The dataframe will include the entity name, title, confidence, and entity URI (Wikipedia URL for the entity).
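For illustration, here is a minimal sketch that calls the dataTXT Named Entity Extraction (nex) endpoint directly with requests rather than a client library (the token and URL are placeholders; the notebook's version may differ):

```python
import requests
import pandas as pd

DANDELION_TOKEN = "YOUR_DANDELION_TOKEN"  # assumption: your own API token
page = "https://example.com/"             # replace with the URL to analyze

# Call the dataTXT Named Entity Extraction (nex) endpoint directly.
resp = requests.get(
    "https://api.dandelion.eu/datatxt/nex/v1/",
    params={"url": page, "token": DANDELION_TOKEN, "include": "types,abstract"},
)
resp.raise_for_status()
annotations = resp.json().get("annotations", [])

dandelion_df = pd.DataFrame(
    [
        {
            "spot": a["spot"],              # text as it appears on the page
            "title": a["title"],            # canonical entity name
            "confidence": a["confidence"],
            "uri": a["uri"],                # Wikipedia URL for the entity
        }
        for a in annotations
    ]
)
print(dandelion_df.head())
```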
4. Call entity_analyzer function
You will now call the entity_analyzer function, and it will extract the entities from the URL you updated in the cell above. The cell will output a list of the entities extracted, a list of the most frequently mentioned entities, and a Matplotlib graph showing the top 10 entities by frequency. It will also save the dataframe to a CSV file.
This is just the beginning of how you can use the Dandelion.eu API to analyze your website or competitors. Modify this script to meet your or your organization’s needs.
How to Use spaCy Python Library to Analyze Entities?
spaCy is an open-source Python library built for NLP. It is commonly used in the digital marketing and SEO community and has long been one of the most popular libraries for NLP analysis.
spaCy supports training for more than 70 languages and can perform POS tagging, named entity recognition, similarity analysis, text classification, entity linking, and more.
This Python script will show how to analyze a URL using spaCy.
- Install spaCy library
- Import spaCy and other Python libraries
- Run entity_analyzer function
- Add URL and call entity_analyzer function
Here is a video reviewing the script and going over the URL and HTML class modifications you will make to the script:
1. Install spaCy library
The first cell will install the spaCy library. This will install the components we will use to perform named entity recognition.
2. Import spaCy and other Python libraries
Now go ahead and import the necessary libraries. This will import BeautifulSoup so we can crawl and extract the content from a webpage. It will also import the Pandas library, which we will use to put the entities into a dataframe and a CSV file. Most importantly, this will import the spaCy library for entity analysis.
3. Run entity_analyzer function
The third cell is our spaCy entity analyzer Python function. The first few lines of code use requests and BeautifulSoup to crawl and extract the content. You may need to update the HTML class used here to match a valid HTML class on whatever URL you are trying to analyze.
The next few lines of code use spaCy to perform entity extraction. A for loop then collects the entities that were found into a Pandas dataframe and saves it to a CSV file. Finally, Matplotlib is used to visualize the top 10 most frequently mentioned entities.
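For reference, a function along these lines would do the job (a sketch, not the notebook's exact code; the html_class default is just an example):

```python
import requests
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from bs4 import BeautifulSoup

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def entity_analyzer(url, html_class="entry-content"):
    # Fetch the page and pull text from the main content container.
    # "entry-content" is only an example; use a class that exists on your page.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    container = soup.find(class_=html_class)
    text = container.get_text(" ", strip=True) if container else soup.get_text(" ", strip=True)

    # Run spaCy named entity recognition over the extracted text.
    doc = nlp(text)
    df = pd.DataFrame([{"entity": ent.text, "label": ent.label_} for ent in doc.ents])
    df.to_csv("spacy_entities.csv", index=False)

    # Plot the ten most frequently mentioned entities.
    df["entity"].value_counts().head(10).plot(kind="barh")
    plt.gca().invert_yaxis()
    plt.xlabel("Mentions")
    plt.title("Top 10 entities (spaCy)")
    plt.tight_layout()
    plt.show()
    return df
```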
4. Add URL and call entity_analyzer function
In the final cell, you call the spaCy entity analyzer function and you will see the output.
You can watch the video above for additional details on spaCy and the entity analyzer script.
How to Use OpenAI API and Python to Perform Entity Analysis
OpenAI and ChatGPT are two of the hottest topics in digital marketing right now. Since ChatGPT’s release, the digital marketing and SEO communities have been abuzz about the different use cases — from content generation to coding and tools.
Language models like ChatGPT appear to be in a position to have a significant impact on our everyday lives. The SEO, PPC, content marketing, and social media teams at Webfor have been innovating with ChatGPT and other AI-powered language models over the past year, continually testing and experimenting to understand the limitations and benefits of incorporating AI into our workflows.
OpenAI released the ChatGPT API recently, and it won’t be long before we start seeing it pulled into various SEO and digital marketing tools. ChatGPT is now incorporated into Bing, and you can sign up to beta test Bing Chat (nicknamed Sydney). Let’s look at one way to leverage this new technology via Python.
The final Python script we will review uses the OpenAI ChatGPT API to crawl a single URL, perform named entity recognition (NER), count the total number of entities extracted, and sort them from highest to lowest.
- Install OpenAI library
- Import OpenAI and other Python Libraries
- Input OpenAI API Key and Run Entity Analyzer Function
- Call Entity Analyzer Function and View Output
Let’s review the code in detail and walk through what is going on in the script.
1. Install OpenAI Library
Running the first cell installs the OpenAI library so you can query the ChatGPT API.
2. Import OpenAI and other Python Libraries
Now you will import the OpenAI library that you installed previously, along with any other relevant libraries. This script has been modified multiple times for the public version, so some imported libraries may not be used in the script.
3. Input OpenAI Key and Run Entity Analyzer Function
In the third cell, you will need to add the OpenAI API key that you generated. You can follow the video above to find out how to generate your own API key. In my experience, ChatGPT API costs are quite reasonable and a fraction of the cost of previous models.
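Here is a minimal sketch of what such a function can look like, using the chat completions interface from the openai library as it existed at the time of writing (the key, model choice, and prompt wording are assumptions you should adapt):

```python
import openai
import requests
from bs4 import BeautifulSoup

# Assumption: your own OpenAI API key.
openai.api_key = "YOUR_OPENAI_API_KEY"

def entity_analyzer(url):
    # Pull the page text and trim it so the prompt stays within context limits.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    text = soup.get_text(" ", strip=True)[:12000]

    prompt = (
        "Perform named entity recognition on the following text. "
        "List each entity, the number of times it is mentioned, and its "
        "Wikipedia URL if known, sorted from most to fewest mentions.\n\n" + text
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output as deterministic as possible
    )
    return response["choices"][0]["message"]["content"]
```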
4. Call Entity Analyzer Function and View Output
In the fourth cell, you will need to update your URL to match the URL that you want to analyze. This will call the entity analyzer function and return the output based on ChatGPT’s analysis of the content.
The output includes the entity name, frequency of entity mentions, and known Wikipedia URLs.
This prompt can be modified and tested to improve the quality of the output. I highly recommend testing the prompt to optimize the output as much as possible.
In my opinion, the capabilities and quality of ChatGPT and the OpenAI API will only continue to improve, especially with the release of the GPT-4 language model, which is multimodal and able to accept image as well as text input.
Conclusion
In this post, we touched on NLP entity SEO using Python. First, we covered how entities and NLP play a role in SEO. Then we reviewed how to use four Python APIs and libraries to perform named entity recognition and extraction for SEO. Here is a link to the Webfor Entity Analyzer Tool (Public) – Google Colab notebook that we reviewed. The Python scripts shared above are just a starting point, and much can be done to modify each one to extract deeper insights and perform more comprehensive data analysis. Feel free to take these scripts and modify them based on your organization’s or agency’s use case.
There are many other ways to use Python for SEO analysis and automation. Comment and let me know what type of Python scripts you or your agency use or how you plan to modify the script for your use case.
Want more content about Python and SEO? Check out my previous blog posts on using Python for SEO competitor analysis and Automating Screaming Frog SEO Analysis.
Is your business ready for the transition to Google Analytics 4? Read our Google Analytics 4 (GA4) Guide!
If you are interested in SEO, connect with me on LinkedIn. I like to share SEO tips and talk about the state of SEO.