In 2025, web scraping (the automated extraction of data from webpages) is no longer simply “fetch HTML and parse it”. Most meaningful data is rendered dynamically in-browser. And most target websites assume (correctly) that people will attempt to scrape them.

As a result, the modern posture is adversarial:

  • websites defend
  • bots disguise
  • regulators intervene

How modern scrapers work: the two fundamentals

1. Fetch

The scraper requests the page or the data source behind it (HTML, a JSON API, a GraphQL endpoint, or a WebSocket payload).

2. Extract

The scraper isolates the specific fields (prices / names / detail blocks / timestamps). However, because most sites hydrate content with JavaScript, scrapers now use headless browsers (web browsers that operate without a graphical user interface).

These are effectively Chrome / Safari / Firefox, running invisibly.
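The two steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production scraper: the names `fetch`, `FieldExtractor`, `extract`, the sample markup, the `name` / `price` class names and the user-agent string are all assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Set
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Step 1 (fetch): return the raw response body as text."""
    req = Request(url, headers={"User-Agent": "example-fetcher/0.1"})  # hypothetical UA
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class FieldExtractor(HTMLParser):
    """Step 2 (extract): collect the text of elements whose class is wanted."""

    def __init__(self, wanted: Set[str]):
        super().__init__()
        self.wanted = wanted
        self.fields: Dict[str, str] = {}
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current = name

    def handle_data(self, data):
        # Store the first non-empty text that follows a wanted element.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None


def extract(html: str, wanted: Set[str]) -> Dict[str, str]:
    parser = FieldExtractor(wanted)
    parser.feed(html)
    return parser.fields


# Server-rendered markup: the fields are present in the HTML itself.
SAMPLE_HTML = """
<div class="product">
  <span class="name">Widget</span>
  <span class="price">19.99</span>
</div>
"""

# JavaScript-hydrated shell: the HTML arrives empty and is filled in later.
HYDRATED_SHELL = '<div id="root"></div><script src="/app.js"></script>'

print(extract(SAMPLE_HTML, {"name", "price"}))     # {'name': 'Widget', 'price': '19.99'}
print(extract(HYDRATED_SHELL, {"name", "price"}))  # {}
```

The second call returns nothing because the data only exists after the page's JavaScript has run, which is the gap a headless browser fills.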

Defensive evolution → behavioural mimicry is now a requirement

Modern sites don’t just block IPs. They analyse and correlate:

  • WebGL fingerprint
  • TLS handshake signature
  • browser entropy
  • scroll cadence
  • typing latency
  • resource loading heuristics

Modern scrapers therefore must simulate human behaviour signatures.

Identity ≠ IP address anymore.
Identity = IP + fingerprint + behaviour + timing.
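Seen from the defending site's side, the point can be illustrated with a small sketch. It assumes the site has already reduced each signal to a simple value; the `ClientSignals` fields, the bucketed timing values and the `device_key` function are hypothetical and far cruder than a real detection system.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClientSignals:
    ip: str
    tls_signature: str       # assumed: a digest of the TLS handshake parameters
    webgl_fingerprint: str   # assumed: a digest of WebGL rendering output
    scroll_cadence_ms: int   # assumed: bucketed gap between scroll events
    typing_latency_ms: int   # assumed: bucketed gap between keystrokes


def device_key(s: ClientSignals) -> str:
    """Identity without the IP: fingerprint + behaviour + timing."""
    raw = "|".join([
        s.tls_signature,
        s.webgl_fingerprint,
        str(s.scroll_cadence_ms),
        str(s.typing_latency_ms),
    ])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def same_client(a: ClientSignals, b: ClientSignals) -> bool:
    """Two requests are correlated when everything except the IP matches."""
    return device_key(a) == device_key(b)


first = ClientSignals("203.0.113.10", "tls-a1", "gl-77", 120, 80)
rotated = replace(first, ip="198.51.100.25")         # same client, new proxy IP
other = replace(first, webgl_fingerprint="gl-12")    # genuinely different device

print(same_client(first, rotated))  # True
print(same_client(first, other))    # False
```

Changing only the IP leaves the key untouched, which is why proxy rotation on its own does not produce a new identity.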

The new paradigm: Semantic and agentic extraction

We’ve now crossed a key threshold:

  • The bottleneck is not GET requests — it is meaning.
  • CSS/XPath selectors break when class names change.

So 2025 extraction is shifting to:

  • LLM-based semantic extraction
  • HTML chunking + embeddings
  • vector retrieval (RAG)
  • multi-agent navigation

The scraper becomes an agent, not a selector script.
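The chunk-and-retrieve idea can be sketched without any external service. In this illustration a bag-of-words vector stands in for a real embedding model, and the final step (handing the retrieved chunk to an LLM) is left out; `chunk_text`, `embed`, `cosine`, `retrieve` and the sample markup with its randomised class names are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(html: str) -> List[str]:
    """Strip tags and keep one chunk per run of visible text."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return [line.strip() for line in text.splitlines() if line.strip()]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str]) -> str:
    """Return the chunk whose vector is closest to the query vector."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


# Class names are deliberately meaningless, as on many obfuscated sites.
SAMPLE = """
<div class="x9f2"><p>Price: 19.99 USD</p></div>
<div class="k3a1"><p>Ships within 2 days</p></div>
<div class="q7z0"><p>Sold by Example Traders</p></div>
"""

print(retrieve("what is the price", chunk_text(SAMPLE)))  # Price: 19.99 USD
```

Because retrieval works on the visible text rather than on `class="x9f2"`, renaming the classes does not change the result, whereas a selector tied to that class name would stop matching.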

The legal realities in short

Scraping is not illegal by default.

But the two legal tests are:

  1. Access legality → how you got the data
  2. Processing legality → what you intend to do with it

Jurisdiction posture summary:

Region | Default stance
United States | Scraping public pages may be a civil rather than a criminal matter if no access barrier is bypassed.
UK / EU | Minor barrier bypass can trigger criminal unauthorised access + database right exposure.
GCC (UAE / KSA) | Broad cybercrime laws; scraping commercial competitive data can be per se criminal.
APAC (Singapore / Japan / AUS) | System interference or bypassing access controls can amount to criminal obstruction.
South Africa | POPIA applies to public data; Cybercrimes Act applies to intrusion/bypass; stance aligns closer to UK/EU.

And crucially:

Public data ≠ implied consent
Especially where AI training is the purpose.

High-risk scraping scenarios

Scenario | Risk
Scraping unprotected public pages | US: lower, EU/UK/SA: medium
Scraping behind login/paywall | High; often criminal
Bypassing CAPTCHA / JS challenges | Very high
Scraping personal data for AI | High regulatory exposure
Scraping after cease-and-desist | Risk escalates outside US

FAQs

Is scraping public data always legal?

No. Public ≠ free to repurpose. A lawful basis is still required.

Can I scrape LinkedIn profiles into my CRM or AI model?

This is extremely high-risk under GDPR / POPIA, especially where the purpose is AI training.

What if I don’t store the scraped data — is it legal?

Processing begins the moment you collect/use it. Storage isn’t the trigger.

If I only collect pricing data — is that personal information?

Possibly not, but product metadata can still be protected by intellectual property rights.

Is using residential proxies enough to avoid detection?

No. Identity now spans behaviour, fingerprint and network signals, not the IP address alone.

Can browser automation tools themselves be illegal?

Tools aren’t illegal; bypassing access barriers may be.

Is scraping behind a login always unlawful?

Almost always. Authentication = access control in nearly every jurisdiction.

Is scraping a competitor’s pricing considered “corporate espionage”?

Not necessarily — but if access controls are bypassed, it can become cybercrime.

Can AI models “inherit illegality” from scraped data?

Yes. If the training corpus was unlawfully obtained, the model is contaminated.

Can I claim “research” as a lawful basis?

Not automatically. Research is not a universal exemption.

How ITLawCo can help

Most organisations approach scraping backwards: they build the pipeline, then ask the lawyer for sign-off. In 2025 that is organisationally dangerous.

ITLawCo supports clients at the exact collision point where scraping now operates:

  • data engineering ↔ data governance
  • AI model training ↔ lawful basis strategy
  • extraction architecture ↔ cybercrime thresholds
  • POPIA / GDPR operationalisation ↔ minimisation controls

We help clients:

  • determine when an access method creates criminal rather than contractual risk
  • run POPIA + GDPR purpose and legitimate interest tests
  • implement delete-on-contact personal data filters
  • structure “no-criminal-threshold” access boundaries
  • produce defensible governance artefacts for regulators/audits
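As an illustration of what a delete-on-contact filter might look like in a pipeline, here is a minimal sketch. It assumes "delete-on-contact" means discarding personal data at the moment a record is ingested, before anything is stored; the field names, the allow-list and the two regular expressions are hypothetical and would miss many real-world formats.

```python
import re
from typing import Dict, Set

# Deliberately simple patterns: illustrative only, not exhaustive detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with a marker."""
    text = EMAIL_RE.sub("[removed]", text)
    return PHONE_RE.sub("[removed]", text)


def filter_record(record: Dict[str, str], allowed: Set[str]) -> Dict[str, str]:
    """Drop fields that are not allow-listed, then scrub what remains."""
    return {k: scrub(v) for k, v in record.items() if k in allowed}


scraped = {
    "product": "Widget",
    "price": "19.99",
    "seller_note": "Call +27 21 555 0100 or mail jo@example.com",
    "reviewer_name": "Jo Bloggs",
}
ALLOWED = {"product", "price", "seller_note"}

clean = filter_record(scraped, ALLOWED)
print(clean)
# {'product': 'Widget', 'price': '19.99', 'seller_note': 'Call [removed] or mail [removed]'}
```

The allow-list does most of the work here: fields that are not needed for the stated purpose never enter storage, and pattern scrubbing is only a second line of defence for free-text fields.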

We don’t block scraping.
We make it lawful.

We design compliance before the first request is sent.