Building an SEO competitor-analysis Actor with Crawlee and Apify
I run a small blog where I regularly publish comparison articles, the kind that rank for queries like “best project management tools” or “Windows OS alternatives.” After a while, I noticed a pattern: competitors ranking above me weren’t necessarily covering more topics; they were just structuring their content differently. I wanted to understand exactly what they were covering that I wasn’t.
My first instinct was to manually open each page, scan the headings, and take notes. That worked for one or two pages. It didn’t scale. So I built a Crawlee Actor that does the comparison for me, extracting heading structures from my page and competitor pages, cleaning and normalizing them, and producing a structured JSON output that shows shared topics, missing coverage, and unique sections.
This article walks through how I built it, what broke along the way, and how I ended up with something reliable enough to run on a monthly schedule.
Prerequisites
To follow along, you’ll need:
- Node.js v18 or higher
- The Apify CLI installed (npm install -g apify-cli)
- Basic familiarity with JavaScript and async/await
- An Apify account (free tier works fine)
What the Actor does
By the end of this guide, I’ll have a Content SEO Competitor Analysis Actor, built with Crawlee, that:
- Crawls my page and competitor pages targeting a given topic
- Extracts structured headings (H2, H3, H4) from real article content
- Removes noise from widgets, sidebars, and injected elements
- Normalizes headings into comparable values
- Extracts meaningful entities from headings
- Compares my page against competitors
- Returns a structured JSON output with:
  - shared sections
  - missing topics
  - unique coverage
This output gives me a usable baseline for content gap analysis, without opening a single competitor page manually.
Why this problem is harder than it looks
At first, comparing pages sounds straightforward. Extract headings from your page, extract headings from competitor pages, and compare them. In practice, this breaks quickly.
Real-world pages are not clean. They mix actual content with unrelated elements that interfere with extraction. Some of the most common issues I ran into:
- Noisy sections: Pages often include sidebars, “related articles”, newsletters, and promotional blocks. These elements contain headings that are not part of the core content but still get scraped.
- Inconsistent structure: One page may list tools under H2 headings, while another uses H3 or even plain text. The hierarchy is not reliable across sites.
- Injected or styled content: Some sites inject styles or scripts directly into heading elements. Instead of clean text, I was extracting CSS fragments or broken strings like .css-19a5n3-link{color:#0a0a23}.
- Ambiguous headings: Headings like “Best options for teams” or “Top picks” don’t clearly identify what they refer to. They are not usable for comparison without further processing.
- Different wording for the same concept: One page may use “Linux Mint”, another “Mint Linux”. Without normalization, these appear as different items.
If you stop at raw extraction, the output becomes noisy and misleading. The real challenge is not scraping — it is isolating the actual article content, removing irrelevant sections, normalizing headings, extracting meaningful entities, and making the data comparable across pages.
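To make the wording problem concrete, here is a minimal sketch of why raw string comparison fails and how a normalized key helps. The helper name toComparisonKey is hypothetical and is not part of the Actor built later in this article; it only illustrates the idea of lowercasing, stripping punctuation, and sorting tokens so that word order no longer matters.

```javascript
// Hypothetical helper: builds an order-insensitive key from a heading.
function toComparisonKey(heading) {
  return String(heading || '')
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ') // replace punctuation with spaces
    .split(/\s+/)
    .filter(Boolean) // drop empty tokens
    .sort() // make word order irrelevant
    .join(' ');
}

const first = 'Linux Mint';
const second = 'Mint Linux';

console.log(first === second); // false
console.log(toComparisonKey(first) === toComparisonKey(second)); // true
```

Sorting tokens is a blunt instrument: it would also merge headings that differ only in word order but mean different things, which is one reason the Actor ends up with a more careful matching step.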
Project setup
I started by creating a new Crawlee project and installing the required dependencies. If you’re starting from scratch:
npx crawlee create seo-content-gap-analyzer
cd seo-content-gap-analyzer
If you already have a project, install Crawlee:
npm install crawlee
Since this workflow is designed to run as an Actor, I also installed the Apify SDK:
npm install apify
All the main logic lives in a single entry file:
src/main.js
That’s all the setup I needed to get started.
Defining the Actor input
The Actor needs two main inputs: my page (the one I want to analyze) and competitor pages (used for comparison). I also included a topic label and a few optional settings. Here’s an example input for Windows OS alternatives:
{
"type": "comparison",
"topic": "windows os alternatives",
"myPage": "https://your-site.com/windows-os-alternatives",
"competitors": [
"https://example.com/windows-alternatives",
"https://example.com/best-windows-alternatives",
"https://example.com/linux-vs-windows-alternatives"
],
"maxRequests": 20,
"debug": true
}
Input breakdown
- type: Defines the output mode. Here, I’m running a comparison workflow.
- topic: A label describing what the analysis is about.
- myPage: The page I want to evaluate.
- competitors: A list of URLs ranking for the same topic. These serve as the comparison baseline.
- maxRequests (optional): Limits how many pages the crawler processes.
- debug (optional): Enables additional logging during execution.
At runtime, the Actor will crawl my page and each competitor page, extract heading structures from all of them, then compare everything to identify overlaps and gaps. This input structure keeps the workflow flexible and reusable across topics. If you want to add validation or a UI for your inputs, the Actor input schema specification covers how to set that up.
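Independently of a formal input schema, it can help to validate the input in code before crawling. The following is a minimal sketch, not the Actor's actual implementation: the function name normalizeInput, its error messages, and the default for maxRequests (one request per page) are all assumptions made for illustration.

```javascript
// Hypothetical helper: validates the raw Actor input and applies defaults.
function normalizeInput(raw) {
  const input = raw ?? {};

  // Accept only absolute http(s) URLs.
  const isHttpUrl = (value) => {
    try {
      const { protocol } = new URL(value);
      return protocol === 'http:' || protocol === 'https:';
    } catch {
      return false;
    }
  };

  if (!isHttpUrl(input.myPage)) {
    throw new Error('Input "myPage" must be a valid http(s) URL.');
  }

  // Keep valid, de-duplicated competitor URLs that differ from myPage.
  const rawCompetitors = Array.isArray(input.competitors) ? input.competitors : [];
  const competitors = [...new Set(rawCompetitors.filter(isHttpUrl))]
    .filter((url) => url !== input.myPage);
  if (competitors.length === 0) {
    throw new Error('Input "competitors" must contain at least one valid URL.');
  }

  const hasValidLimit = Number.isInteger(input.maxRequests) && input.maxRequests > 0;
  return {
    type: input.type ?? 'comparison',
    topic: String(input.topic ?? '').trim(),
    myPage: input.myPage,
    competitors,
    // Assumed default: one request for my page plus one per competitor.
    maxRequests: hasValidLimit ? input.maxRequests : competitors.length + 1,
    debug: Boolean(input.debug),
  };
}

const exampleInput = normalizeInput({
  topic: 'windows os alternatives',
  myPage: 'https://your-site.com/windows-os-alternatives',
  competitors: ['https://example.com/windows-alternatives', 'not a url'],
});
console.log(exampleInput.competitors.length); // 1
console.log(exampleInput.maxRequests); // 2
```

Failing early with a clear message is easier to debug than a crawl that silently processes zero pages.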
Crawling strategy
For this workflow, I used CheerioCrawler. This was a deliberate choice. I’m not interacting with pages, clicking buttons, or handling dynamic user flows. The goal is to extract structured content from article pages as efficiently as possible. CheerioCrawler gives me:
- fast HTML parsing
- low resource usage
- simple DOM traversal
- enough control to clean and process content
Using a browser-based crawler like Playwright would increase complexity and cost without adding real value for this use case.
Working with real-world comparison pages quickly exposed edge cases that don’t appear in simple demos. The main challenge was not crawling itself, but handling inconsistent and deeply nested HTML structures. Some pages exposed tools cleanly, while others embedded them inside component-based layouts, mixed with styling layers and editorial content.
To make the crawler reliable, I focused on three areas:
- Text normalization: to remove CSS fragments and presentation artifacts before processing
- Entity filtering: to separate actual tools from scores, features, and editorial sections
- Flexible matching: to handle variations in naming across pages
These adjustments didn’t change how the crawler works at a high level, but they significantly improved the quality of the extracted data.
What I actually need from each page
I’m not scraping the entire page. I only care about the main article content, heading structure (H2, H3, H4), and meaningful text inside those headings. Everything else is noise.
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

// normalizeHeading, uniqueList, removeWidgetBlocks, getArticleRoot, and the
// ignoredHeadings list are helpers defined elsewhere in src/main.js.

await Actor.init();

// Read the Actor input described in the previous section.
const {
    myPage,
    competitors: competitorUrls = [],
    maxRequests = 10,
} = (await Actor.getInput()) ?? {};

const crawler = new CheerioCrawler({
    // Cap the number of pages processed in a single run.
    maxRequestsPerCrawl: maxRequests,
    async requestHandler({ request, $, response }) {
        const url = request.url;

        // Page-level metadata, read before any elements are removed.
        const title = $('head > title').first().text().trim() || null;
        const metaDescription = $('meta[name="description"]').attr('content')?.trim() || null;
        const h1 = $('h1').first().text().trim() || null;
        const canonical = $('link[rel="canonical"]').attr('href') || null;
        const statusCode = response?.statusCode || null;

        // Drop common non-article containers so their headings are not collected.
        $('aside, .sidebar, .widget, .related, .latest, .recommend, .newsletter').remove();
        removeWidgetBlocks($);

        const articleRoot = getArticleRoot($);

        // Collect normalized, de-duplicated headings for one selector.
        const collectHeadings = (selector) => {
            const headings = [];
            articleRoot.find(selector).each((i, el) => {
                const normalized = normalizeHeading($(el).text());
                if (normalized && !ignoredHeadings.includes(normalized)) {
                    headings.push(normalized);
                }
            });
            return uniqueList(headings);
        };

        const h2List = collectHeadings('h2');
        const h3List = collectHeadings('h3');

        await Actor.pushData({
            url,
            // Tag each record so the comparison step can tell the pages apart.
            pageType: url === myPage ? 'myPage' : 'competitor',
            title,
            metaDescription,
            h1,
            h2List,
            h2Count: h2List.length,
            h3List,
            h3Count: h3List.length,
            canonical,
            statusCode,
            checkedAt: new Date().toISOString(),
        });
    },
});

await crawler.run([myPage, ...competitorUrls]);
await Actor.exit();
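The handler relies on uniqueList() and ignoredHeadings, which are not shown in this article. As an assumption about what they might look like (the real versions live in the repository and may differ), a minimal sketch could be:

```javascript
// Assumed contents: generic headings that rarely describe article content.
const ignoredHeadings = [
  'table of contents',
  'related articles',
  'leave a comment',
  'share this article',
];

// Assumed implementation: drops empty values and duplicates, keeping first-seen order.
function uniqueList(items) {
  return [...new Set(items.filter(Boolean))];
}

const sampleHeadings = ['linux mint', 'ubuntu', 'linux mint', '', 'related articles'];
const cleanedHeadings = uniqueList(sampleHeadings).filter(
  (heading) => !ignoredHeadings.includes(heading),
);
console.log(cleanedHeadings); // [ 'linux mint', 'ubuntu' ]
```

Because headings are normalized to lowercase before this check, the ignore list only needs lowercase entries.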
Here’s what the initial extraction looks like for three of the pages I tested:



Adding a comparison snapshot
At this point, each page was returning structured headings — useful, but not yet comparable. To get a global view, I introduced a comparison layer that aggregates all pages into a single snapshot. The goal is not to change extraction, but to summarize what appears across pages, what is missing from my page, and what is unique.
const result = {
type: "comparison",
topic,
sharedHeadings: [],
myPageOnlyHeadings: [],
competitorOnlyHeadings: []
};
I added a helper function just below uniqueList():
function compareHeadings(myPageData, competitorsData) {
const myItems = uniqueList(
[...myPageData.h3List, ...(myPageData.h4List || [])]
.filter(isMeaningfulHeading)
.map(extractLabel)
.map(normalizeComparisonLabel)
.filter(looksLikeEntityLabel),
);
const compItems = uniqueList(
competitorsData
.flatMap((c) => [...c.h3List, ...(c.h4List || [])])
.filter(isMeaningfulHeading)
.map(extractLabel)
.map(normalizeComparisonLabel)
.filter(looksLikeEntityLabel),
);
const sharedHeadings = [];
const myPageOnlyHeadings = [];
const competitorOnlyHeadings = [];
for (const myItem of myItems) {
const hasMatch = compItems.some((compItem) =>
areHeadingsSimilar(myItem, compItem)
);
if (hasMatch) {
sharedHeadings.push(myItem);
} else {
myPageOnlyHeadings.push(myItem);
}
}
for (const compItem of compItems) {
const hasMatch = myItems.some((myItem) =>
areHeadingsSimilar(compItem, myItem)
);
if (!hasMatch) {
competitorOnlyHeadings.push(compItem);
}
}
return {
sharedHeadings: uniqueList(sharedHeadings),
myPageOnlyHeadings: uniqueList(myPageOnlyHeadings),
competitorOnlyHeadings: uniqueList(competitorOnlyHeadings),
};
}
This doesn’t change the crawler itself. It adds a snapshot layer on top of the extracted data: one summary that shows how my page compares to the rest. After the crawl finishes (and before Actor.exit() is called), I open the default dataset and split results into myPage and competitor entries:
const dataset = await Actor.openDataset();
const { items } = await dataset.getData({ limit: 1000 });
const myPageData = items.find((item) => item.pageType === 'myPage');
const competitorsData = items.filter((item) => item.pageType === 'competitor');
Then I run the comparison and store the snapshot:
if (myPageData && competitorsData.length > 0) {
const comparison = compareHeadings(myPageData, competitorsData);
console.log('Comparison Result:', comparison);
await Actor.pushData({
type: 'comparison',
...comparison,
generatedAt: new Date().toISOString(),
});
}
This gives a first global view of the topic: sharedHeadings shows overlap between my page and competitors, myPageOnlyHeadings shows content covered only on my page, and competitorOnlyHeadings shows content competitors cover that my page doesn’t.
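To see the three buckets on a small input, here is a simplified sketch that uses exact matching instead of areHeadingsSimilar(). The function name bucketHeadings and the sample headings are illustrative only and do not come from a real run.

```javascript
// Simplified stand-in for compareHeadings(): exact matches only.
function bucketHeadings(myItems, competitorItems) {
  const mine = new Set(myItems);
  const theirs = new Set(competitorItems);
  return {
    sharedHeadings: [...mine].filter((item) => theirs.has(item)),
    myPageOnlyHeadings: [...mine].filter((item) => !theirs.has(item)),
    competitorOnlyHeadings: [...theirs].filter((item) => !mine.has(item)),
  };
}

const sampleSnapshot = bucketHeadings(
  ['ubuntu', 'linux mint', 'chrome os'],
  ['ubuntu', 'linux mint', 'zorin os', 'ubuntu'],
);

// sharedHeadings         -> ['ubuntu', 'linux mint']
// myPageOnlyHeadings     -> ['chrome os']
// competitorOnlyHeadings -> ['zorin os']
console.log(sampleSnapshot);
```

The competitorOnlyHeadings bucket is the one that drives content updates, since it lists what other pages cover and mine does not.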

Testing the crawler on another topic
At this point, the crawler worked for one dataset. The next step was to check whether the logic held when the topic changed. Instead of modifying the code, I only changed the input, this time using real project management pages I already had bookmarked:
{
"topic": "project management tools",
"myPage": "https://www.learn-dev-tools.blog/best-legal-project-management-software/",
"competitors": [
"https://www.paymoapp.com/blog/project-management-software/",
"https://project-management.com/top-10-project-management-software/",
"https://zapier.com/blog/free-project-management-software"
]
}
Results
Here’s what the individual page extractions looked like:




And here’s the comparison summary for this run — this is the output I actually use to plan my content updates:

This one was more telling. The mainstream tools were well covered on my end: monday.com, ClickUp, Trello, Asana, Wrike, Smartsheet, and Basecamp all showed up in the shared list. But the competitorOnlyHeadings list was long. Jira, Airtable, Zoho, Teamwork, Podio, and Hive were tools I hadn’t touched at all. The myPageOnlyHeadings list also caught something I hadn’t noticed: noise entries like “but what are” and “key features” were still leaking through, which meant the entity filtering still needed tuning for this type of content. The legal-specific tools I covered (Clio, Thomson Reuters) didn’t appear anywhere in the competitor set, which made sense given the article’s focus, but it also explained a big chunk of the gap.
Inspecting the Zapier page structure revealed exactly the kind of noise problem I mentioned earlier:

Extracting tools from comparison pages feels simple: I expected them to sit cleanly in H2 or H3 tags. But once I tried it across a few sites, that assumption broke. Some pages use H3 lists, others use H2 with long descriptions, and more complex ones don’t expose tools as clear headings at all.
One problem was noise. The same heading level mixed tool names, scores, and editorial sections. If I extracted everything blindly, I ended up treating all of it as the same type of data. Another issue was that the DOM isn’t just content. On modern pages, styled components wrap everything, so I started pulling CSS fragments or UI text alongside the actual tool names. The h3List for the Zapier page was polluted with text like .css-19a5n3-link{...} followed by the actual entity text.
And even when headings looked clean, their meaning wasn’t consistent. A tool might be in H3 on one page and in H2 on another, while lower levels handled features or pricing. So relying on a fixed heading level worked sometimes, but quietly failed on others.
Enhancing the extraction
I needed to refine the results to stop getting polluted text like .css-19a5n3-link{...}. To fix this:
First, I added a text-cleaning layer to strip CSS fragments and normalize headings before comparison:
function cleanExtractedText(text) {
return String(text || '')
.replace(/\.css-[^{\s]+(?:\[[^\]]+\])?\{[^}]*\}/g, ' ')
.replace(/@media[^{]*\{[^}]*\}/g, ' ')
.replace(/[a-z-]+\s*:\s*[^;{}]+;/gi, ' ')
.replace(/\u00a0/g, ' ')
.replace(/\s+/g, ' ')
.trim();
}
function normalizeHeading(text) {
return cleanExtractedText(text).toLowerCase().replace(/\s+/g, ' ').trim();
}
function normalizeForComparison(text) {
return cleanExtractedText(text)
.toLowerCase()
.replace(/\.com$/i, '')
.replace(/[–—]/g, '-')
.replace(/^\d+[\.]\s*/, '')
.replace(/\s*\([^)]*\)\s*$/g, '')
.replace(/[^\w\s:\-\.]/g, ' ')
.replace(/\s+/g, ' ')
.trim();
}
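A quick way to sanity-check the cleaning layer is to feed it the kind of polluted string described earlier. The snippet below repeats cleanExtractedText() unchanged so that it runs on its own; the sample headings are made up for illustration.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function cleanExtractedText(text) {
  return String(text || '')
    .replace(/\.css-[^{\s]+(?:\[[^\]]+\])?\{[^}]*\}/g, ' ') // .css-xyz{...} rules
    .replace(/@media[^{]*\{[^}]*\}/g, ' ') // @media blocks
    .replace(/[a-z-]+\s*:\s*[^;{}]+;/gi, ' ') // stray "property: value;" pairs
    .replace(/\u00a0/g, ' ') // non-breaking spaces
    .replace(/\s+/g, ' ')
    .trim();
}

const polluted = '.css-19a5n3-link{color:#0a0a23}Linux\u00a0Mint';
console.log(cleanExtractedText(polluted)); // Linux Mint

// Caveat: the "property: value;" rule also matches ordinary prose of that shape.
console.log(cleanExtractedText('Pricing: free; paid plans')); // paid plans
```

The second call shows a trade-off worth knowing about: a heading that happens to contain a colon followed later by a semicolon loses that part of its text. For headings this is rare, but it may be worth tightening the pattern if your pages use that punctuation.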
Second, I tightened entity detection instead of treating every heading as a tool. This removes scores, FAQs, pricing sections, action phrases, and other review scaffolding from the comparison layer:
function isEntityLike(text) {
const value = canonicalizeEntity(text);
if (!value) return false;
if (entityBlocklist.has(value)) return false;
if (isNumericLike(value)) return false;
if (value.includes('{') || value.includes('}')) return false;
if (isQuestionLike(text)) return false;
if (startsWithAction(value)) return false;
if (isDetailLike(value)) return false;
const tokens = value.split(/\s+/).filter(Boolean);
return tokens.length >= 1 && tokens.length <= 3;
}
function isMeaningfulHeading(text) {
const value = canonicalizeEntity(text);
if (!value) return false;
if (isBoilerplateHeading(value)) return false;
return true;
}
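isEntityLike() depends on several small predicates that are not listed here. The sketch below shows one way three of them could be written. All names match the calls above, but the bodies, the regular expressions, and the ACTION_WORDS list are assumptions for illustration, not the code from the repository.

```javascript
// Assumed list of verbs that typically start call-to-action headings.
const ACTION_WORDS = new Set(['choose', 'compare', 'get', 'try', 'read', 'find', 'download']);

// True for scores and ratings such as "4.5", "9/10", or "85%".
function isNumericLike(value) {
  return /^\d+([.,]\d+)?(\s*\/\s*\d+)?\s*%?$/.test(value.trim());
}

// True for headings phrased as questions, with or without a question mark.
function isQuestionLike(text) {
  const value = text.trim().toLowerCase();
  return value.endsWith('?')
    || /^(what|why|how|which|when|who|is|are|can|does|do)\b/.test(value);
}

// True when the first word is a known action verb.
function startsWithAction(value) {
  const [firstWord = ''] = value.trim().toLowerCase().split(/\s+/);
  return ACTION_WORDS.has(firstWord);
}

console.log(isNumericLike('9/10')); // true
console.log(isQuestionLike('What is Jira')); // true
console.log(startsWithAction('try clickup free')); // true
console.log(isQuestionLike('issue tracker')); // false
```

The word-boundary check in isQuestionLike() matters: without it, a tool name that merely begins with "is" or "do" would be discarded as a question.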
Third, I improved matching by canonicalizing entity labels and comparing them more flexibly. This is what merges things like monday work management into monday.com, or jira software cloud into jira, instead of treating them as separate tools:
function extractLabel(text) {
const entity = extractEntityFromHeading(text);
if (entity) return entity;
const normalized = canonicalizeEntity(text);
const tokens = getComparisonTokens(normalized);
if (tokens.length === 0) return normalized;
if (tokens.length <= 3) return tokens.join(' ');
return tokens.slice(0, 3).join(' ');
}
function areHeadingsSimilar(a, b) {
const na = canonicalizeEntity(a);
const nb = canonicalizeEntity(b);
if (!na || !nb) return false;
if (na === nb) return true;
if (na.includes(nb) || nb.includes(na)) return true;
const aCandidates = buildHeadingCandidates(a);
const bCandidates = buildHeadingCandidates(b);
if (!aCandidates.length || !bCandidates.length) return false;
const bSet = new Set(bCandidates);
const bCompactSet = new Set(bCandidates.map(compact));
for (const candidate of aCandidates) {
if (bSet.has(candidate)) return true;
if (bCompactSet.has(compact(candidate))) return true;
}
return false;
}
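canonicalizeEntity() and compact() carry most of the matching logic but are not shown in this article. One plausible shape, assuming a hand-maintained alias table, is sketched below. ENTITY_ALIASES is hypothetical and is seeded only with the two merges mentioned above; the actual implementation may work differently.

```javascript
// Hypothetical alias table: maps known variants to one canonical label.
const ENTITY_ALIASES = new Map([
  ['monday work management', 'monday.com'],
  ['monday', 'monday.com'],
  ['jira software cloud', 'jira'],
  ['jira software', 'jira'],
]);

// Assumed implementation: normalize the text, then resolve known aliases.
function canonicalizeEntity(text) {
  const value = String(text || '')
    .toLowerCase()
    .replace(/^\d+\.\s*/, '') // drop list numbering such as "3. "
    .replace(/\s+/g, ' ')
    .trim();
  return ENTITY_ALIASES.get(value) ?? value;
}

// Assumed implementation: keep only letters and digits so "click-up" equals "clickup".
function compact(value) {
  return String(value || '').toLowerCase().replace(/[^a-z0-9]/g, '');
}

console.log(canonicalizeEntity('3. Monday Work Management')); // monday.com
console.log(canonicalizeEntity('Jira Software Cloud')); // jira
console.log(compact('Click-Up') === compact('clickup')); // true
```

An explicit alias table is easy to audit and extend per niche. The substring check in areHeadingsSimilar() catches many variants without it, but it can also over-match short names, so the two approaches complement each other.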
Finally, I slimmed the comparison summary by comparing only cleaned items that survive both isMeaningfulHeading() and isEntityLike():
function compareHeadings(myPageData, competitorsData) {
const myRaw = getComparisonItems(myPageData).filter(isMeaningfulHeading).filter(isEntityLike);
const compRaw = competitorsData.flatMap((page) =>
getComparisonItems(page).filter(isMeaningfulHeading).filter(isEntityLike),
);
const sharedHeadings = [];
const myPageOnlyHeadings = [];
const competitorOnlyHeadings = [];
for (const myItem of myRaw) {
const hasMatch = compRaw.some((compItem) => areHeadingsSimilar(myItem, compItem));
if (hasMatch) sharedHeadings.push(extractLabel(myItem));
else myPageOnlyHeadings.push(extractLabel(myItem));
}
for (const compItem of compRaw) {
const hasMatch = myRaw.some((myItem) => areHeadingsSimilar(compItem, myItem));
if (!hasMatch) competitorOnlyHeadings.push(extractLabel(compItem));
}
return {
sharedHeadings: uniqueList(sharedHeadings),
myPageOnlyHeadings: uniqueList(myPageOnlyHeadings),
competitorOnlyHeadings: uniqueList(competitorOnlyHeadings),
};
}
Results
After these changes, the comparison output looked like this:

The difference is immediate: most of the noise is gone. The extracted lists now behave like actual entity sets, not raw headings. Shared tools are clean and consistent (clickup, asana, wrike), and both myPageOnlyHeadings and competitorOnlyHeadings are dominated by recognizable product names rather than mixed content. There are no scores, no editorial sections, and no CSS fragments.
The individual blog pages looked clean too, each tool clearly separated:




This output shows a clean separation between informational sections and the actual comparison block. I intentionally keep the full outline for each page instead of over-filtering it. The goal is not just extraction. Each individual page gives me a structural view of how competitors position their content: what sections they introduce, how they group tools, and what supporting topics they emphasize.
The comparison result serves a different purpose. It compresses everything into a single snapshot: what overlaps, what I’m missing, and what competitors are covering. The two outputs are complementary. The outline helps me understand how competitors structure their pages. The comparison helps me understand what I should add or remove.
Deploying the Actor on Apify
Once the crawler was working locally, I deployed it so it could run on demand or on a schedule. Since the project already follows the Actor structure, deployment is one command:
apify push
The push itself went fine. But the first run failed. I checked the logs and saw that my inputs weren’t present — which made sense once I thought about it. On the Apify platform, each Actor run gets its own isolated storage: a fresh dataset, key-value store, and request queue managed by the platform. The local INPUT.json I had been using doesn’t carry over after a push. I needed to define the input directly in the Console.
After a few seconds, the Actor appears in your Apify Console under Actors:

Open the Actor and set your input:

{
"topic": "project management tools",
"myPage": "https://www.learn-dev-tools.blog/best-legal-project-management-software/",
"competitors": [
"https://www.paymoapp.com/blog/project-management-software/",
"https://project-management.com/top-10-project-management-software/",
"https://zapier.com/blog/free-project-management-software"
]
}
Then click Save & Start to run the Actor. After the run completes, go to the Output tab to preview results:

You can also open the Dataset to inspect the full JSON output — the Apify Dataset storage docs explain how to export it in JSON, CSV, or Excel if you need it elsewhere.

Scheduling runs
I do competitor analysis monthly, so I set the Actor to run on a schedule instead of triggering it manually each time. Open your Actor in the Apify Console, click the three dots in the top-right corner, and select “Schedule Actor”:

Then configure the frequency:

Running the crawler on a schedule turns it into a monitoring tool instead of a one-time analysis. Each run gives me an updated snapshot of new tools added by competitors, structural changes in their pages, and gaps that appear over time.
If you want to make your Actor public and earn from it, follow the Apify Actor publishing guide to list it on the Apify Store.
Enhancing capabilities
The current version is reliable for extracting and comparing tools across different page structures. A few improvements that naturally follow from this setup:
- Handling dynamic pages: Some comparison pages rely heavily on JavaScript rendering. Switching to PlaywrightCrawler would allow extracting content that isn’t present in the initial HTML.
- Improving resilience against structural drift: As sites update their layouts, selectors and assumptions can break. Adding fallback strategies or scoring multiple candidate nodes would make extraction more robust over time.
- Tracking changes over time: Instead of a one-time comparison, storing snapshots and detecting changes across runs would turn this into a proper competitor monitoring tool.
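The change-tracking idea could start as a plain diff between two comparison snapshots. This is a minimal sketch under stated assumptions: diffSnapshots and its output field names are hypothetical, the sample data is invented, and it assumes the previous snapshot has been persisted somewhere between runs (for example with Actor.setValue() and Actor.getValue() in the key-value store).

```javascript
// Hypothetical helper: compares competitorOnlyHeadings across two runs.
function diffSnapshots(previous, current) {
  const before = new Set(previous?.competitorOnlyHeadings ?? []);
  const after = new Set(current?.competitorOnlyHeadings ?? []);
  return {
    // Topics competitors cover now that were not gaps in the previous run.
    newGaps: [...after].filter((item) => !before.has(item)),
    // Topics that were gaps before and are not anymore
    // (either I added them or competitors dropped them).
    noLongerMissing: [...before].filter((item) => !after.has(item)),
  };
}

const lastMonth = { competitorOnlyHeadings: ['jira', 'airtable', 'hive'] };
const thisMonth = { competitorOnlyHeadings: ['airtable', 'hive', 'notion'] };

const changes = diffSnapshots(lastMonth, thisMonth);
console.log(changes.newGaps); // [ 'notion' ]
console.log(changes.noLongerMissing); // [ 'jira' ]
```

On the very first run there is no previous snapshot, so every current gap shows up in newGaps, which is a reasonable baseline.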
The full code is available in the GitHub repository — the entity detection logic is the part most worth adapting if your content niche differs from comparison articles.
Conclusion
When I ran this against my Windows OS alternatives article, the output was immediately useful. Several tools my competitors were covering didn’t appear anywhere on my page. Pop!_OS was one of them: it showed up across multiple competitor pages while being completely absent from mine. I went back, added it, restructured a few sections, and added some supporting content based on what the comparison revealed.
The key lesson wasn’t about scraping. It was about the extraction layer. Raw headings are noisy and misleading. The real work is isolating actual content, stripping presentation artifacts, and making headings comparable across pages with completely different structures. Once that’s solid, the comparison itself is straightforward.
If you want to take it further, the natural next steps are switching to PlaywrightCrawler for JavaScript-heavy pages, adding a scheduled diff to track changes over time, and extending the entity detection logic to cover more content niches. The GitHub repository includes all the supporting functions referenced in this article.




