Crawling Patterns
Patterns enable quick and easy restrictions on pages crawled or processed based on simple URL string matches. For instance, to crawl only pages within the “sports” category at http://www.example.com/sports/heres-a-sports-article.html, you can specify a crawl pattern of /sports/ (including the slashes ensures precision and avoids matching a “sports” string elsewhere in the URL).
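To make the idea concrete, here is a minimal sketch (in Python, with hypothetical URLs) of how a plain crawl pattern behaves like a substring filter over candidate URLs; it only illustrates the matching rule, not the crawler's actual implementation.

```python
# Illustration only: a plain crawl pattern acts like a substring filter over URLs.
pattern = "/sports/"

candidate_urls = [
    "http://www.example.com/sports/heres-a-sports-article.html",  # contains "/sports/": crawled
    "http://www.example.com/news/sports-budget-cuts.html",        # no "/sports/" segment: skipped
]

to_crawl = [url for url in candidate_urls if pattern in url]
print(to_crawl)
```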
By using a crawl pattern, you can limit crawling to a particular subdomain. For example, on a crawl starting at a specific domain, you can enter a crawl pattern to prevent the crawler from following unwanted links.
In a Web Crawl configuration, you can enter multiple patterns to match multiple strings by placing each individual pattern on a new line.
These are the available crawling patterns:
Limiting Matches to the Beginning of URLs
Use the caret character (^) to restrict pattern matches to the beginning of a URL. For example, a processing pattern of ^https://example.com will limit processing to pages whose URLs begin with https://example.com.
Negative-Match Patterns
Use the exclamation point (!) for a “negative match” to explicitly exclude pages from being crawled or processed.
With multiple patterns, negative matches override all other crawl patterns, except regex, which takes precedence over every other pattern type.
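As an illustration of these rules, the hypothetical sketch below combines caret and negative-match patterns, with negative matches overriding positive ones; it is not the crawler's actual code.

```python
# Hypothetical evaluation of simple crawl patterns:
# "^" anchors the match to the start of the URL, "!" excludes matching URLs,
# and a negative match overrides any positive match.
def allowed(url: str, patterns: list[str]) -> bool:
    positive = [p for p in patterns if not p.startswith("!")]
    negative = [p[1:] for p in patterns if p.startswith("!")]

    def matches(u: str, pattern: str) -> bool:
        if pattern.startswith("^"):
            return u.startswith(pattern[1:])
        return pattern in u

    if any(matches(url, p) for p in negative):   # negative matches win
        return False
    return not positive or any(matches(url, p) for p in positive)

print(allowed("https://example.com/sports/article.html", ["^https://example.com", "!/login"]))  # True
print(allowed("https://example.com/login", ["^https://example.com", "!/login"]))                # False
```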
Regular Expression Crawling and Processing
For precise control over your crawling or processing URL matches, you can create a regular expression (regex) to exclusively crawl or process URLs that match your defined expression. For example, to process pages at https://example.com/ under the “/crawl” path and containing the term “regex”, you could use a processing regex similar to \/crawl.*?regex.
The crawlbot employs a custom regular expression engine for optimal performance while evaluating pages. In terms of character-class syntax, commonly used in crawlbot parsing, it supports all ASCII character classes and most Perl/Tcl shortcuts.
Crawling and processing regex cannot be used simultaneously with other crawling patterns. If both are provided, regex will override other crawl patterns.
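For illustration, the example regex above can be tested against sample URLs with Python's standard re module; keep in mind that the crawlbot uses its own regex engine, so edge-case behavior may differ.

```python
import re

# The processing regex from the example above: URLs under "/crawl" that also contain "regex".
processing_regex = re.compile(r"\/crawl.*?regex")

urls = [
    "https://example.com/crawl/intro-to-regex.html",   # matches
    "https://example.com/crawl/getting-started.html",  # no "regex" after "/crawl": no match
    "https://example.com/docs/regex-tips.html",        # "regex" but no "/crawl" path: no match
]

for url in urls:
    print(url, bool(processing_regex.search(url)))
```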
HTML Processing Pattern
The crawler can limit processed pages based on HTML processing patterns, which examine only the raw page source and do not execute JavaScript/AJAX at crawl time. The downside of this option is slower crawl speed; for faster crawling, use regex crawling and processing instead.
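Conceptually, an HTML processing pattern is a match against the raw, unrendered page source. The hypothetical sketch below fetches a page and checks its raw HTML for a string; anything injected by JavaScript after load would not be visible to this kind of check.

```python
from urllib.request import urlopen

# Fetch the raw page source; no JavaScript/AJAX is executed.
url = "https://example.com/"
raw_html = urlopen(url).read().decode("utf-8", errors="replace")

# An HTML processing pattern is effectively a match against this raw source.
html_pattern = "product-listing"  # hypothetical pattern
print(html_pattern in raw_html)
```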
When to Use a Crawler
The crawler is designed for extracting structured, high-value information from the web. It works best in scenarios where data is publicly available and regularly updated.
Best Use Cases
- News and Media – Extract articles from sites like BBC News
- Financial Data – Collect market insights from stock sites such as NASDAQ
- E-Commerce – Gather product listings, reviews, and pricing from marketplaces
- Knowledge Resources – Capture blog posts, forum discussions, or FAQ content to enrich your Knowledge Base.
When Not to Use the Crawler
Avoid crawling in these situations to prevent errors or compliance issues:
- Authentication Required – Websites that require login credentials (e.g., LinkedIn).
- Highly Dynamic Dashboards – Real-time dashboards (such as live stock tickers) may not yield complete results.
- Excessive Site Load – Respect each site’s usage policies and always review the robots.txt file before crawling (see the sketch below).
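If you want to check a site’s crawling rules programmatically, Python’s standard urllib.robotparser offers one way to read robots.txt; this is a general-purpose sketch, not part of the crawler itself.

```python
from urllib.robotparser import RobotFileParser

# Read the site's robots.txt and check whether a URL may be fetched.
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

seed_url = "https://www.example.com/news/"
print(robots.can_fetch("*", seed_url))  # True if any user agent may crawl this path
```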
Create a New Crawler
- In your Odin AI project, navigate to the Knowledge Base.
- In the Knowledge Base, click the gear icon at the top-right corner.
- Go to the Crawlers tab.
- Click + Create New.
- In the Input section:
a. Enter a name for your crawler.
b. Provide the Seed URL, the link to the website you want to crawl. This is the initial link that is accessed by the crawler. From there, the crawler explores the web pages and links available within the website.
- In the Crawler Settings section:
- Limit to Root Domain – This sets the crawler to focus solely on the root domain, streamlining relevant information extraction.
- Download Files – Enable this option if you want the crawler to download and process files found during crawling.
- Max Pages to Crawl – Set the maximum number of pages to crawl, to optimize resource usage and efficiency.
- Max Depth – Set the maximum depth to crawl from the seed URL (i.e., how many links deep to follow).
- Crawl Strategy – Choose how the crawler prioritizes which pages to visit first.
- Best First – If you select this option, you need to configure the keywords the crawler should use to find these pages and the weight of these keywords.
- Keywords for Best First – Pages containing the keywords configured in this text box will be crawled first. Enter one keyword per line.
- Keywords Weight – Configure the keyword weight for keyword matching; the higher the value, the more it prioritizes keyword matching.
- Breadth First – This method prioritizes visiting all directly linked pages from a current page before delving deeper.
- Depth First – This method prioritizes exploring as deeply as possible along a single path of links before backtracking and exploring other paths.
- Limit to Domains – Restrict the crawler to specific domains instead of crawling all subdomains within the root domain. Enter one domain per line or leave blank to crawl all subdomains.
- Limit to Patterns – Enter the crawling patterns you’d like the crawler to use. Enter one crawling pattern per line. If you’re using multiple crawling patterns, this is the hierarchy:
1. Regex
2. Negative Matches
3. All other patterns.
- In the Scheduling section, define the frequency and time for scheduled crawls:
- Crawling Enabled – Enable this option to set the crawling schedule.
- Repeat Every n Days – Set the number of days after which you’d like the crawl to repeat. For example:
- Daily = 1
- Weekly = 7
- Bi-weekly = 14
- Monthly = 30
- Next Schedule – Depending on the number of days you enter, you’ll see the next crawl date.
- Once you’re done configuring your web crawl, click Crawl Now to start the first crawl. You’ll be redirected to the new crawler’s configuration page with the following tabs:
- Overview – This is where you can see the crawl request information and status.
- Settings – This is where you can edit the crawler’s settings. You can edit these settings even after a crawl has run.
- Crawled Report – This tab provides real-time updates on the pages being crawled.
Now your crawler is configured and running its first crawl!
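Purely as a hypothetical summary, the settings gathered in the steps above could be pictured as a configuration object like the one below; the field names are illustrative only and do not correspond to a documented Odin AI API.

```python
# Hypothetical, illustrative representation of the crawler settings described above.
crawler_config = {
    "name": "example-news-crawler",
    "seed_url": "https://www.example.com/news",
    "limit_to_root_domain": True,
    "download_files": False,
    "max_pages_to_crawl": 500,
    "max_depth": 3,
    "crawl_strategy": "best_first",                # or "breadth_first" / "depth_first"
    "best_first_keywords": ["sports", "finance"],  # one keyword per line in the UI
    "keywords_weight": 0.8,                        # higher values prioritize keyword matches more
    "limit_to_domains": ["blog.example.com"],      # leave empty to crawl all subdomains
    "limit_to_patterns": ["^https://www.example.com/news", "!/login"],
    "scheduling": {
        "crawling_enabled": True,
        "repeat_every_n_days": 7,                  # weekly
    },
}
```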
Best Practices
Following these best practices ensures your crawler runs efficiently, avoids unnecessary duplication, and brings only the most relevant data into your Knowledge Base.
1. Use Clear and Concise Seed URLs
Choose URLs that point directly to the section of a website you want to extract.
- Good Example: https://www.example.com/news (targets only the “news” section)
- Bad Example: https://www.example.com?user=1234 (using dynamic parameters can cause errors or redundant crawling)
- Ensure the URL is accessible and relevant to the data you want to extract.
- Use subpages when you want to restrict crawling to a specific section.
2. Restrict to Domains and Subdomains
Keep your crawl focused to avoid pulling in unnecessary or unrelated data.
- Limit to Root Domain – Crawling example.com captures only that domain and ignores external links (e.g., otherwebsite.com).
- Limit to Domains – Crawling blog.example.com won’t include other subdomains like shop.example.com.
Example
Crawling https://www.bbc.com with Limit to Root Domain enabled ensures only BBC content is included, not external news sources linked on the site.
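To illustrate the difference between the two settings above, the hypothetical sketch below uses Python’s urllib.parse to check whether a URL falls under a root domain or matches one exact domain.

```python
from urllib.parse import urlparse

def in_root_domain(url: str, root: str) -> bool:
    # True for the root domain and any of its subdomains.
    host = urlparse(url).netloc
    return host == root or host.endswith("." + root)

def in_exact_domain(url: str, domain: str) -> bool:
    # True only for this exact domain; other subdomains are excluded.
    return urlparse(url).netloc == domain

print(in_root_domain("https://shop.example.com/item/1", "example.com"))        # True
print(in_exact_domain("https://shop.example.com/item/1", "blog.example.com"))  # False
```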
3. Use Crawling and Processing Patterns
Patterns let you fine-tune what the crawler fetches and processes.
- Crawling Patterns – Define which URLs should be crawled.
- Processing Patterns – Specify which content should be extracted into your Knowledge Base.
Both use a simplified regex syntax:
- Implied Wildcard – Entering products matches any URL containing “products.”
- Negation (!) – !products matches any URL without “products.”
- Starts With (^) – ^https://example.com/products/ matches URLs beginning with that path.
- Ends With ($) – products/$ matches URLs ending in “products/.”
Example
- Crawling Pattern – https://example.com/products/* crawls all product pages.
- Processing Pattern – https://example.com/products/*reviews extracts only reviews from product pages.
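As a rough approximation of this simplified syntax (implied wildcard, !, ^, $), the sketch below evaluates a single pattern against a URL; it may not reproduce every detail of the crawler’s own matching.

```python
# Approximate evaluation of the simplified pattern syntax described above.
def pattern_matches(url: str, pattern: str) -> bool:
    if pattern.startswith("!"):                 # negation
        return not pattern_matches(url, pattern[1:])
    if pattern.startswith("^"):                 # starts with
        return url.startswith(pattern[1:])
    if pattern.endswith("$"):                   # ends with
        return url.endswith(pattern[:-1])
    return pattern in url                       # implied wildcard (substring)

print(pattern_matches("https://example.com/products/42", "products"))                        # True
print(pattern_matches("https://example.com/about", "!products"))                             # True
print(pattern_matches("https://example.com/products/42", "^https://example.com/products/"))  # True
print(pattern_matches("https://example.com/products/", "products/$"))                        # True
```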
4. Enable Scheduled Crawls (When Needed)
For dynamic websites, schedule crawls to keep your Knowledge Base up to date.
Example
Track stock prices on https://www.nasdaq.com by enabling daily crawls.