How to Build an SEO competitor-analysis Actor with Crawlee and Apify

Building an SEO competitor-analysis Actor with Crawlee and Apify

I run a small blog where I regularly publish comparison articles, the kind that rank for queries like “best project management tools” or “windows OS alternatives.” After a while, I noticed a pattern: competitors ranking above me weren’t necessarily covering more topics, they were just structuring their content differently. I wanted to understand exactly what they were covering that I wasn’t.

My first instinct was to manually open each page, scan the headings, and take notes. That worked for one or two pages. It didn’t scale. So I built a Crawlee Actor that does the comparison for me, extracting heading structures from my page and competitor pages, cleaning and normalizing them, and producing a structured JSON output that shows shared topics, missing coverage, and unique sections.

This article walks through how I built it, what broke along the way, and how I ended up with something reliable enough to run on a monthly schedule.

Prerequisites

To follow along, you’ll need:

Node.js v18 or higher
The Apify CLI installed (npm install -g apify-cli)
Basic familiarity with JavaScript and async/await
An Apify account (free tier works fine)

What the Actor does

By the end of this guide, I’ll have a Crawlee-based SEO competitor-analysis Actor that:

Crawls my page and competitor pages targeting a given topic
Extracts structured headings (H2, H3, H4) from real article content
Removes noise from widgets, sidebars, and injected elements
Normalizes headings into comparable values
Extracts meaningful entities from headings
Compares my page against competitors
Returns a structured JSON output with:
- shared sections
- missing topics
- unique coverage

This output gives me a usable baseline for content gap analysis, without opening a single competitor page manually.

Why this problem is harder than it looks

At first, comparing pages sounds straightforward. Extract headings from your page, extract headings from competitor pages, and compare them. In practice, this breaks quickly.

Real-world pages are not clean. They mix actual content with unrelated elements that interfere with extraction. Some of the most common issues I ran into:

Noisy sections: Pages often include sidebars, “related articles”, newsletters, and promotional blocks. These elements contain headings that are not part of the core content but still get scraped.
Inconsistent structure: One page may list tools under H2 headings, while another uses H3 or even plain text. The hierarchy is not reliable across sites.
Injected or styled content: Some sites inject styles or scripts directly into heading elements. Instead of clean text, I was extracting CSS fragments or broken strings like .css-19a5n3-link{color:#0a0a23}.
Ambiguous headings: Headings like “Best options for teams” or “Top picks” don’t clearly identify what they refer to. They are not usable for comparison without further processing.
Different wording for the same concept: One page may use “Linux Mint”, another “Mint Linux”. Without normalization, these appear as different items.

If you stop at raw extraction, the output becomes noisy and misleading. The real challenge is not scraping — it is isolating the actual article content, removing irrelevant sections, normalizing headings, extracting meaningful entities, and making the data comparable across pages.
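To make the wording problem concrete, here is a minimal sketch (not the Actor's actual matching logic) of why exact string comparison misses “Linux Mint” versus “Mint Linux”, while an order-insensitive key catches it. The tokenKey helper is a hypothetical name introduced only for this illustration.

```javascript
// Hypothetical helper: build an order-insensitive key from a heading.
function tokenKey(heading) {
    return String(heading || '')
        .toLowerCase()
        .split(/[^a-z0-9]+/) // split on anything that is not a letter or digit
        .filter(Boolean)     // drop empty tokens
        .sort()              // ignore word order
        .join(' ');
}

const a = 'Linux Mint';
const b = 'Mint Linux';

console.log(a.toLowerCase() === b.toLowerCase()); // false
console.log(tokenKey(a) === tokenKey(b));         // true
```

Sorting tokens is a blunt instrument, and it would also merge headings that only happen to share the same words, which is one reason the later sections rely on entity extraction rather than this shortcut alone.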

Project setup

I started by creating a new Crawlee project and installing the required dependencies. If you’re starting from scratch:

npx crawlee create seo-content-gap-analyzer
cd seo-content-gap-analyzer

If you already have a project, install Crawlee:

npm install crawlee

Since this workflow is designed to run as an Actor, I also installed the Apify SDK:

npm install apify

All the main logic lives in a single entry file:

src/main.js

That’s all the setup I needed to get started.

Defining the Actor input

The Actor needs two main inputs: my page (the one I want to analyze) and competitor pages (used for comparison). I also included a topic label and a few optional settings. Here’s an example input for Windows OS alternatives:

{
  "type": "comparison",
  "topic": "windows os alternatives",
  "myPage": "https://your-site.com/windows-os-alternatives",
  "competitors": [
    "https://example.com/windows-alternatives",
    "https://example.com/best-windows-alternatives",
    "https://example.com/linux-vs-windows-alternatives"
  ],
  "maxRequests": 20,
  "debug": true
}

Input breakdown

type: Defines the output mode. Here, I’m running a comparison workflow.
topic: A label describing what the analysis is about.
myPage: The page I want to evaluate.
competitors: A list of URLs ranking for the same topic. These serve as the comparison baseline.
maxRequests (optional): Limits how many pages the crawler processes.
debug (optional): Enables additional logging during execution.

At runtime, the Actor will crawl my page and each competitor page, extract heading structures from all of them, then compare everything to identify overlaps and gaps. This input structure keeps the workflow flexible and reusable across topics. If you want to add validation or a UI for your inputs, the Actor input schema specification covers how to set that up.
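Before wiring up an input schema, a small runtime check can catch malformed input early. The sketch below is one possible approach, assuming the field names from the example input above; validateInput is a hypothetical helper, not part of the Apify SDK.

```javascript
// Hypothetical validator for the input shape shown above.
// Returns a list of human-readable problems; an empty list means the input is usable.
function validateInput(input) {
    const errors = [];
    const isHttpUrl = (value) => {
        try {
            const { protocol } = new URL(value);
            return protocol === 'http:' || protocol === 'https:';
        } catch {
            return false; // not a parseable absolute URL
        }
    };

    if (!input || typeof input !== 'object') return ['input must be an object'];
    if (!isHttpUrl(input.myPage)) errors.push('myPage must be an http(s) URL');

    if (!Array.isArray(input.competitors) || input.competitors.length === 0) {
        errors.push('competitors must be a non-empty array');
    } else {
        input.competitors.forEach((url, i) => {
            if (!isHttpUrl(url)) errors.push(`competitors[${i}] is not an http(s) URL`);
        });
    }

    if (input.maxRequests !== undefined
        && !(Number.isInteger(input.maxRequests) && input.maxRequests > 0)) {
        errors.push('maxRequests must be a positive integer');
    }
    return errors;
}

console.log(validateInput({ myPage: 'not-a-url', competitors: [] }));
```

Returning a list instead of throwing on the first problem makes it easier to log everything that is wrong with an input in a single run.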

Crawling strategy

For this workflow, I used CheerioCrawler. This was a deliberate choice. I’m not interacting with pages, clicking buttons, or handling dynamic user flows. The goal is to extract structured content from article pages as efficiently as possible. CheerioCrawler gives me:

fast HTML parsing
low resource usage
simple DOM traversal
enough control to clean and process content

Using a browser-based crawler like Playwright would increase complexity and cost without adding real value for this use case.

Working with real-world comparison pages quickly exposed edge cases that don’t appear in simple demos. The main challenge was not crawling itself, but handling inconsistent and deeply nested HTML structures. Some pages exposed tools cleanly, while others embedded them inside component-based layouts, mixed with styling layers and editorial content.

To make the crawler reliable, I focused on three areas:

Text normalization: to remove CSS fragments and presentation artifacts before processing
Entity filtering: to separate actual tools from scores, features, and editorial sections
Flexible matching: to handle variations in naming across pages

These adjustments didn’t change how the crawler works at a high level, but they significantly improved the quality of the extracted data.

What I actually need from each page

I’m not scraping the entire page. I only care about the main article content, heading structure (H2, H3, H4), and meaningful text inside those headings. Everything else is noise.

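import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Read the Actor input described above, with safe defaults for optional fields.
const input = (await Actor.getInput()) ?? {};
const { myPage, competitors: competitorUrls = [], maxRequests = 20 } = input;

// removeWidgetBlocks(), getArticleRoot(), normalizeHeading(), uniqueList(), and
// ignoredHeadings are helpers defined elsewhere in src/main.js.

const crawler = new CheerioCrawler({
    // Cap the crawl with the optional maxRequests input.
    maxRequestsPerCrawl: maxRequests,
    async requestHandler({ request, $, response }) {
        // Page-level metadata.
        const url = request.url;
        const title = $('head > title').first().text().trim() || null;
        const metaDescription = $('meta[name="description"]').attr('content')?.trim() || null;
        const h1 = $('h1').first().text().trim() || null;
        const canonical = $('link[rel="canonical"]').attr('href') || null;
        const statusCode = response?.statusCode || null;

        // Drop obvious non-article blocks before looking for headings.
        $('aside').remove();
        $('.sidebar').remove();
        $('.widget').remove();
        $('.related').remove();
        $('.latest').remove();
        $('.recommend').remove();
        $('.newsletter').remove();
        removeWidgetBlocks($);

        // Only search for headings inside the main article container.
        const articleRoot = getArticleRoot($);

        let h2List = [];
        articleRoot.find('h2').each((i, el) => {
            const normalized = normalizeHeading($(el).text());
            if (normalized && !ignoredHeadings.includes(normalized)) {
                h2List.push(normalized);
            }
        });

        let h3List = [];
        articleRoot.find('h3').each((i, el) => {
            const normalized = normalizeHeading($(el).text());
            if (normalized && !ignoredHeadings.includes(normalized)) {
                h3List.push(normalized);
            }
        });

        h2List = uniqueList(h2List);
        h3List = uniqueList(h3List);

        // One dataset item per page, tagged as my page or a competitor.
        await Actor.pushData({
            url,
            pageType: url === myPage ? 'myPage' : 'competitor',
            title,
            metaDescription,
            h1,
            h2List,
            h2Count: h2List.length,
            h3List,
            h3Count: h3List.length,
            canonical,
            statusCode,
            checkedAt: new Date().toISOString(),
        });
    },
});

await crawler.run([myPage, ...competitorUrls]);

// Actor.exit() ends the run, so any post-crawl step has to come before it.
await Actor.exit();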
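The handler above leans on a few small helpers that are not shown in the article. As a rough sketch of what two of them might look like: the entries in ignoredHeadings below are invented examples, and this uniqueList is just one simple way to deduplicate while preserving order.

```javascript
// Hypothetical list of boilerplate headings to skip (already lowercased,
// because headings are normalized before the lookup).
const ignoredHeadings = ['related articles', 'table of contents', 'leave a reply'];

// Remove duplicates while keeping the first occurrence of each value.
function uniqueList(items) {
    return [...new Set(items)];
}

const raw = ['linux mint', 'related articles', 'ubuntu', 'linux mint'];
const kept = uniqueList(raw.filter((h) => !ignoredHeadings.includes(h)));
console.log(kept); // [ 'linux mint', 'ubuntu' ]
```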

Here’s what the initial extraction looks like for three of the pages I tested:

Initial scrape result from how2shout.com showing raw heading extraction

Initial scrape result from learn-dev-tools.blog

Initial scrape result from PCMag showing noisy heading output

Adding a comparison snapshot

At this point, each page was returning structured headings — useful, but not yet comparable. To get a global view, I introduced a comparison layer that aggregates all pages into a single snapshot. The goal is not to change extraction, but to summarize what appears across pages, what is missing from my page, and what is unique.

const result = {
  type: "comparison",
  topic,
  sharedHeadings: [],
  myPageOnlyHeadings: [],
  competitorOnlyHeadings: []
};

I added a helper function just below uniqueList():

function compareHeadings(myPageData, competitorsData) {
    const myItems = uniqueList(
        [...myPageData.h3List, ...(myPageData.h4List || [])]
            .filter(isMeaningfulHeading)
            .map(extractLabel)
            .map(normalizeComparisonLabel)
            .filter(looksLikeEntityLabel),
    );

    const compItems = uniqueList(
        competitorsData
            .flatMap((c) => [...c.h3List, ...(c.h4List || [])])
            .filter(isMeaningfulHeading)
            .map(extractLabel)
            .map(normalizeComparisonLabel)
            .filter(looksLikeEntityLabel),
    );

    const sharedHeadings = [];
    const myPageOnlyHeadings = [];
    const competitorOnlyHeadings = [];

    for (const myItem of myItems) {
        const hasMatch = compItems.some((compItem) =>
            areHeadingsSimilar(myItem, compItem)
        );

        if (hasMatch) {
            sharedHeadings.push(myItem);
        } else {
            myPageOnlyHeadings.push(myItem);
        }
    }

    for (const compItem of compItems) {
        const hasMatch = myItems.some((myItem) =>
            areHeadingsSimilar(compItem, myItem)
        );

        if (!hasMatch) {
            competitorOnlyHeadings.push(compItem);
        }
    }

    return {
        sharedHeadings: uniqueList(sharedHeadings),
        myPageOnlyHeadings: uniqueList(myPageOnlyHeadings),
        competitorOnlyHeadings: uniqueList(competitorOnlyHeadings),
    };
}

This doesn’t change the crawler itself. It adds a snapshot layer on top of the extracted data: one summary that shows how my page compares to the rest. After the crawl finishes, and before calling Actor.exit(), I open the default dataset and split results into myPage and competitor entries:

const dataset = await Actor.openDataset();
const { items } = await dataset.getData({ limit: 1000 });

const myPageData = items.find((item) => item.pageType === 'myPage');
const competitorsData = items.filter((item) => item.pageType === 'competitor');

Then I run the comparison and store the snapshot:

if (myPageData && competitorsData.length > 0) {
    const comparison = compareHeadings(myPageData, competitorsData);

    console.log('Comparison Result:', comparison);

    await Actor.pushData({
        type: 'comparison',
        ...comparison,
        generatedAt: new Date().toISOString(),
    });
}

This gives a first global view of the topic: sharedHeadings shows overlap between my page and competitors, myPageOnlyHeadings shows content covered only on my page, and competitorOnlyHeadings shows content competitors cover that my page doesn’t.
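To see the three buckets on a tiny input, here is a simplified sketch in which exact string equality stands in for areHeadingsSimilar(). The bucketHeadings name and the sample headings are made up for illustration.

```javascript
// Simplified bucket logic: exact equality replaces the fuzzy matcher.
function bucketHeadings(myItems, compItems) {
    const mySet = new Set(myItems);
    const compSet = new Set(compItems); // also removes duplicate competitor headings
    return {
        sharedHeadings: myItems.filter((item) => compSet.has(item)),
        myPageOnlyHeadings: myItems.filter((item) => !compSet.has(item)),
        competitorOnlyHeadings: [...compSet].filter((item) => !mySet.has(item)),
    };
}

const snapshot = bucketHeadings(
    ['ubuntu', 'linux mint', 'chromeos'],
    ['ubuntu', 'fedora', 'linux mint', 'fedora'],
);
console.log(snapshot);
// {
//   sharedHeadings: [ 'ubuntu', 'linux mint' ],
//   myPageOnlyHeadings: [ 'chromeos' ],
//   competitorOnlyHeadings: [ 'fedora' ]
// }
```

The competitorOnlyHeadings bucket is the one that drives content updates, since it lists what other pages cover and mine does not.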

First comparison summary result showing raw heading overlap before cleanup

Testing the crawler on another topic

At this point, the crawler worked for one dataset. The next step was to check whether the logic held when the topic changed. Instead of modifying the code, I changed only the input, this time using real project management pages I already had bookmarked:

{
  "topic": "project management tools",
  "myPage": "https://www.learn-dev-tools.blog/best-legal-project-management-software/",
  "competitors": [
    "https://www.paymoapp.com/blog/project-management-software/",
    "https://project-management.com/top-10-project-management-software/",
    "https://zapier.com/blog/free-project-management-software"
  ]
}

Results

Here’s what the individual page extractions looked like:

Scrape result from learn-dev-tools.blog for project management topic

Scrape result from paymoapp.com

Scrape result from project-management.com

And here’s the comparison summary for this run — this is the output I actually use to plan my content updates:

Comparison summary result for project management topic

This one was more telling. The mainstream tools were well covered on my end: monday.com, ClickUp, Trello, Asana, Wrike, Smartsheet, and Basecamp all showed up in the shared list. But the competitorOnlyHeadings list was long. It included Jira, Airtable, Zoho, Teamwork, Podio, and Hive, tools I hadn’t touched at all. The myPageOnlyHeadings list also caught something I hadn’t noticed: noise entries like “but what are” and “key features” were still leaking through, which meant the entity filtering still needed tuning for this type of content. The legal-specific tools I covered (Clio, Thomson Reuters) didn’t appear anywhere in the competitor set, which made sense given the article’s focus, but it also explained a big chunk of the gap.

Inspecting the Zapier page structure revealed exactly the kind of noise problem I mentioned earlier:

Zapier HTML structure showing CSS-polluted heading elements

Extracting tools from comparison pages feels simple. I expected them to sit cleanly in H2 or H3 tags, but once I tried it across a few sites, that assumption broke. Some pages use H3 lists, others use H2 with long descriptions, and more complex ones don’t expose tools as clear headings at all.

One problem was noise. The same heading level mixed tool names, scores, and editorial sections. If I extracted everything blindly, I ended up treating all of it as the same type of data. Another issue was that the DOM isn’t just content. On modern pages, styled components wrap everything, so I ended up pulling CSS fragments or UI text alongside the actual tool names. The h3List for Zapier pages was polluted with text like .css-19a5n3-link{...} followed by the actual entity text.

And even when headings looked clean, their meaning wasn’t consistent. A tool might be in H3 on one page and in H2 on another, while lower levels handled features or pricing. So relying on a fixed heading level worked sometimes, but quietly failed on others.

Enhancing the extraction

I needed to refine the results to stop getting polluted text like .css-19a5n3-link{...}. I made the following changes.

First, I added a text-cleaning layer to strip CSS fragments and normalize headings before comparison:

function cleanExtractedText(text) {
    return String(text || '')
        .replace(/\.css-[^{\s]+(?:\[[^\]]+\])?\{[^}]*\}/g, ' ')
        .replace(/@media[^{]*\{[^}]*\}/g, ' ')
        .replace(/[a-z-]+\s*:\s*[^;{}]+;/gi, ' ')
        .replace(/\u00a0/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();
}

function normalizeHeading(text) {
    return cleanExtractedText(text).toLowerCase().replace(/\s+/g, ' ').trim();
}

function normalizeForComparison(text) {
    return cleanExtractedText(text)
        .toLowerCase()
        .replace(/\.com$/i, '')
        .replace(/[–—]/g, '-')
        .replace(/^\d+[\.]\s*/, '')
        .replace(/\s*\([^)]*\)\s*$/g, '')
        .replace(/[^\w\s:\-\.]/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();
}
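A quick way to check the cleaning layer is to feed it a few polluted strings. The snippet below repeats cleanExtractedText() from above so that it runs on its own. The sample headings are made up to mimic the pollution described earlier, not taken from a real page.

```javascript
// cleanExtractedText() copied from the article so the snippet is self-contained.
function cleanExtractedText(text) {
    return String(text || '')
        .replace(/\.css-[^{\s]+(?:\[[^\]]+\])?\{[^}]*\}/g, ' ') // styled-component class rules
        .replace(/@media[^{]*\{[^}]*\}/g, ' ')                   // media query blocks
        .replace(/[a-z-]+\s*:\s*[^;{}]+;/gi, ' ')                // stray "property: value;" pairs
        .replace(/\u00a0/g, ' ')                                 // non-breaking spaces
        .replace(/\s+/g, ' ')                                    // collapse whitespace
        .trim();
}

// Made-up sample headings.
const samples = [
    '.css-19a5n3-link{color:#0a0a23}Trello',
    'color: red; Asana',
    'Clio\u00a0Manage',
];

for (const sample of samples) {
    console.log(JSON.stringify(cleanExtractedText(sample)));
}
// "Trello"
// "Asana"
// "Clio Manage"
```

Note that the property-value pattern is deliberately broad, so a legitimate heading containing a colon followed later by a semicolon would also lose that span.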

Second, I tightened entity detection instead of treating every heading as a tool. This removes scores, FAQs, pricing sections, action phrases, and other review scaffolding from the comparison layer:

function isEntityLike(text) {
    const value = canonicalizeEntity(text);
    if (!value) return false;
    if (entityBlocklist.has(value)) return false;
    if (isNumericLike(value)) return false;
    if (value.includes('{') || value.includes('}')) return false;
    if (isQuestionLike(text)) return false;
    if (startsWithAction(value)) return false;
    if (isDetailLike(value)) return false;

    const tokens = value.split(/\s+/).filter(Boolean);
    return tokens.length >= 1 && tokens.length <= 3;
}

function isMeaningfulHeading(text) {
    const value = canonicalizeEntity(text);
    if (!value) return false;
    if (isBoilerplateHeading(value)) return false;
    return true;
}
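isEntityLike() depends on several small predicates that live in the repository rather than in this article. The versions below are guesses at what such helpers could look like, written only to illustrate the idea. The regular expressions and the verb list are assumptions, not the author's actual rules.

```javascript
// Hypothetical: score-like values such as "9.5", "4/5", or "85%".
function isNumericLike(value) {
    return /^(?=.*\d)[\d\s.,/%+-]+$/.test(String(value || ''));
}

// Hypothetical: headings phrased as questions, e.g. FAQ entries.
function isQuestionLike(text) {
    const value = String(text || '').trim().toLowerCase();
    return value.endsWith('?')
        || /^(what|why|how|which|who|when|is|are|can|does|do)\b/.test(value);
}

// Hypothetical: headings that open with an action verb ("compare plans").
const actionVerbs = new Set(['compare', 'try', 'get', 'choose', 'read', 'see', 'download', 'sign']);
function startsWithAction(value) {
    const [first] = String(value || '').trim().toLowerCase().split(/\s+/);
    return actionVerbs.has(first);
}

console.log(isNumericLike('4/5'));             // true
console.log(isQuestionLike('What is Jira?'));  // true
console.log(startsWithAction('try it free'));  // true
console.log(startsWithAction('trello'));       // false
```

Keeping each rule in its own predicate makes it easier to tune one filter (for example, the verb list) when noise like “key features” leaks through for a new content type.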

Third, I improved matching by canonicalizing entity labels and comparing them more flexibly. This is what merges things like monday work management into monday.com, or jira software cloud into jira, instead of treating them as separate tools:

function extractLabel(text) {
    const entity = extractEntityFromHeading(text);
    if (entity) return entity;

    const normalized = canonicalizeEntity(text);
    const tokens = getComparisonTokens(normalized);
    if (tokens.length === 0) return normalized;
    if (tokens.length <= 3) return tokens.join(' ');
    return tokens.slice(0, 3).join(' ');
}

function areHeadingsSimilar(a, b) {
    const na = canonicalizeEntity(a);
    const nb = canonicalizeEntity(b);

    if (!na || !nb) return false;
    if (na === nb) return true;
    if (na.includes(nb) || nb.includes(na)) return true;

    const aCandidates = buildHeadingCandidates(a);
    const bCandidates = buildHeadingCandidates(b);
    if (!aCandidates.length || !bCandidates.length) return false;

    const bSet = new Set(bCandidates);
    const bCompactSet = new Set(bCandidates.map(compact));

    for (const candidate of aCandidates) {
        if (bSet.has(candidate)) return true;
        if (bCompactSet.has(compact(candidate))) return true;
    }
    return false;
}
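areHeadingsSimilar() calls canonicalizeEntity(), compact(), and buildHeadingCandidates(), none of which appear in this article. As an illustration only, here is one way such helpers could be sketched. The alias table and all three implementations are assumptions, and the real versions in the repository may differ substantially.

```javascript
// Hypothetical alias table: the variant on the left collapses to the name on the right.
const entityAliases = new Map([
    ['monday work management', 'monday.com'],
    ['jira software cloud', 'jira'],
]);

// Lowercase, collapse whitespace, trim, then apply the alias table.
function canonicalizeEntity(text) {
    const value = String(text || '').toLowerCase().replace(/\s+/g, ' ').trim();
    return entityAliases.get(value) ?? value;
}

// Keep only letters and digits so "Click Up" and "clickup" compare equal.
function compact(value) {
    return String(value || '').toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Candidate forms of a heading: the canonical label plus its compact form.
function buildHeadingCandidates(text) {
    const canonical = canonicalizeEntity(text);
    return canonical ? [...new Set([canonical, compact(canonical)])] : [];
}

console.log(canonicalizeEntity('Monday Work Management')); // monday.com
console.log(compact('Click Up') === compact('clickup'));   // true
console.log(buildHeadingCandidates('Monday.com'));         // [ 'monday.com', 'mondaycom' ]
```

An explicit alias table is easy to audit but has to be maintained by hand, so it suits a niche with a bounded set of product names.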

Finally, I slimmed the comparison summary by comparing only cleaned items that survive both isMeaningfulHeading() and isEntityLike():

function compareHeadings(myPageData, competitorsData) {
    const myRaw = getComparisonItems(myPageData).filter(isMeaningfulHeading).filter(isEntityLike);
    const compRaw = competitorsData.flatMap((page) =>
        getComparisonItems(page).filter(isMeaningfulHeading).filter(isEntityLike),
    );

    const sharedHeadings = [];
    const myPageOnlyHeadings = [];
    const competitorOnlyHeadings = [];

    for (const myItem of myRaw) {
        const hasMatch = compRaw.some((compItem) => areHeadingsSimilar(myItem, compItem));
        if (hasMatch) sharedHeadings.push(extractLabel(myItem));
        else myPageOnlyHeadings.push(extractLabel(myItem));
    }

    for (const compItem of compRaw) {
        const hasMatch = myRaw.some((myItem) => areHeadingsSimilar(compItem, myItem));
        if (!hasMatch) competitorOnlyHeadings.push(extractLabel(compItem));
    }

    return {
        sharedHeadings: uniqueList(sharedHeadings),
        myPageOnlyHeadings: uniqueList(myPageOnlyHeadings),
        competitorOnlyHeadings: uniqueList(competitorOnlyHeadings),
    };
}
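The revised compareHeadings() reads its items through getComparisonItems(), which is not shown here. A plausible sketch, assuming it simply merges whichever heading lists a page exposes, could look like the following. Including h2List is my assumption, made because tools sit at different heading levels on different sites.

```javascript
// Hypothetical helper: merge every heading level a page exposes into one
// de-duplicated list, tolerating pages that lack some of the lists.
function getComparisonItems(page) {
    const levels = ['h2List', 'h3List', 'h4List'];
    const merged = levels.flatMap((key) => (Array.isArray(page?.[key]) ? page[key] : []));
    return [...new Set(merged)];
}

// Made-up page record in the shape pushed by the crawler (no h4List here).
const page = { h2List: ['best tools'], h3List: ['asana', 'trello', 'asana'] };
console.log(getComparisonItems(page)); // [ 'best tools', 'asana', 'trello' ]
```

Merging levels pushes more noise into the pipeline, which is acceptable only because isMeaningfulHeading() and isEntityLike() filter the result afterwards.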

Results

After these changes, the comparison output looked like this:

Enhanced comparison summary showing clean tool names without noise

The difference is immediate: most of the noise is gone. The extracted lists now behave like actual entity sets, not raw headings. Shared tools are clean and consistent (clickup, asana, wrike), and both myPageOnlyHeadings and competitorOnlyHeadings are dominated by recognizable product names rather than mixed content. There are no scores, no editorial sections, and no CSS fragments.

The individual blog pages looked clean too, each tool clearly separated:

Clean extraction result from paymoapp.com after normalization

Clean extraction result from zapier.com after normalization

Clean extraction result from learn-dev-tools.blog

Clean extraction result from project-management.com

This output shows a clean separation between informational sections and the actual comparison block. I intentionally keep the full outline for each page instead of over-filtering it. The goal is not just extraction. Each individual page gives me a structural view of how competitors position their content: what sections they introduce, how they group tools, and what supporting topics they emphasize.

The comparison result serves a different purpose. It compresses everything into a single snapshot: what overlaps, what I’m missing, and what competitors are covering. The two outputs are complementary. The outline helps me understand how competitors structure their pages. The comparison helps me understand what I should add or remove.

Deploying the Actor on Apify

Once the crawler was working locally, I deployed it so it could run on demand or on a schedule. Since the project already follows the Actor structure, deployment is one command:

apify push

The push itself went fine. But the first run failed. I checked the logs and saw that my inputs weren’t present — which made sense once I thought about it. On the Apify platform, each Actor run gets its own isolated storage: a fresh dataset, key-value store, and request queue managed by the platform. The local INPUT.json I had been using doesn’t carry over after a push. I needed to define the input directly in the Console.

After a few seconds, the Actor appears in your Apify Console under Actors:

Apify Console home dashboard showing the deployed Actor

Open the Actor and set your input:

{
  "topic": "project management tools",
  "myPage": "https://www.learn-dev-tools.blog/best-legal-project-management-software/",
  "competitors": [
    "https://www.paymoapp.com/blog/project-management-software/",
    "https://project-management.com/top-10-project-management-software/",
    "https://zapier.com/blog/free-project-management-software"
  ]
}

Then click Save & Start to run the Actor. After the run completes, go to the Output tab to preview results:

Actor output tab showing crawl results in table format

You can also open the Dataset to inspect the full JSON output — the Apify Dataset storage docs explain how to export it in JSON, CSV, or Excel if you need it elsewhere.

Actor dataset view showing full JSON comparison output

Scheduling runs

I do competitor analysis monthly, so I set the Actor to run on a schedule instead of triggering it manually each time. Open your Actor in the Apify Console, click the three dots in the top-right corner, and select “Schedule Actor”:

Apify Console showing the Schedule Actor option in the dropdown menu

Then configure the frequency:

Schedule Actor settings showing monthly run configuration

Running the crawler on a schedule turns it into a monitoring tool instead of a one-time analysis. Each run gives me an updated snapshot of new tools added by competitors, structural changes in their pages, and gaps that appear over time.

If you want to make your Actor public and earn from it, follow the Apify Actor publishing guide to list it on the Apify Store.

Enhancing capabilities

The current version is reliable for extracting and comparing tools across different page structures. A few improvements that naturally follow from this setup:

Handling dynamic pages: Some comparison pages rely heavily on JavaScript rendering. Switching to PlaywrightCrawler would allow extracting content that isn’t present in the initial HTML.
Improving resilience against structural drift: As sites update their layouts, selectors and assumptions can break. Adding fallback strategies or scoring multiple candidate nodes would make extraction more robust over time.
Tracking changes over time: Instead of a one-time comparison, storing snapshots and detecting changes across runs would turn this into a proper competitor monitoring tool.
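For the change-tracking idea, a minimal sketch is to diff the competitorOnlyHeadings of two stored snapshots. The diffSnapshots function and the sample data are hypothetical, and how snapshots get persisted between runs (for example, in a key-value store) is left out.

```javascript
// Hypothetical diff between two comparison snapshots from consecutive runs.
function diffSnapshots(previous, current) {
    const prev = new Set(previous.competitorOnlyHeadings ?? []);
    const curr = new Set(current.competitorOnlyHeadings ?? []);
    return {
        // Topics competitors cover now that were not gaps last time.
        newGaps: [...curr].filter((item) => !prev.has(item)),
        // Former gaps that no longer appear in the competitor-only list.
        closedGaps: [...prev].filter((item) => !curr.has(item)),
    };
}

// Made-up snapshots.
const lastMonth = { competitorOnlyHeadings: ['jira', 'airtable', 'zoho'] };
const thisMonth = { competitorOnlyHeadings: ['airtable', 'zoho', 'hive'] };
console.log(diffSnapshots(lastMonth, thisMonth));
// { newGaps: [ 'hive' ], closedGaps: [ 'jira' ] }
```

A closed gap can mean either that my page now covers the topic or that competitors dropped it, so the diff is a prompt to look, not a verdict.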

The full code is available in the GitHub repository — the entity detection logic is the part most worth adapting if your content niche differs from comparison articles.

Conclusion

When I ran this against my Windows OS alternatives article, the output was immediately useful. Several tools my competitors were covering didn’t appear anywhere on my page. Pop!_OS was one of them: it showed up across multiple competitor pages while being completely absent from mine. I went back, added it, restructured a few sections, and added some supporting content based on what the comparison revealed.

The key lesson wasn’t about scraping. It was about the extraction layer. Raw headings are noisy and misleading. The real work is isolating actual content, stripping presentation artifacts, and making headings comparable across pages with completely different structures. Once that’s solid, the comparison itself is straightforward.

If you want to take it further, the natural next steps are switching to PlaywrightCrawler for JavaScript-heavy pages, adding a scheduled diff to track changes over time, and extending the entity detection logic to cover more content niches. The GitHub repository includes all the supporting functions referenced in this article.

Author: Learndevtools
