Record Extractor
info
The following content is for the new DocSearch infrastructure. If you haven't received an email to migrate your account yet, please refer to the legacy documentation.
Introduction
info
This documentation only covers the helpers.docsearch method. See the Algolia Crawler documentation for more information on the Algolia Crawler.
Pages are extracted by a recordExtractor. These extractors are assigned to actions via the recordExtractor parameter. This parameter links to a function that returns the data you want to index, organized in an array of JSON objects.
The helpers are a collection of functions to help you extract content and generate Algolia records.
Usage
The most common way to use the DocSearch helper is to return its result from the recordExtractor function.
recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},
Complex extractors
Using the Cheerio instance ($)
We provide a Cheerio instance ($) for you to retrieve or remove content from the DOM:
recordExtractor: ({ $, helpers }) => {
  // Removing DOM elements we don't want to crawl
  $(".my-warning-message").remove();

  return helpers.docsearch({
    recordProps: {
      lvl0: "header h1",
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},
Handling fallback DOM selectors
Fallback selectors can be useful when retrieving content that might not exist in some pages:
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      // `.exists h1` will be selected if `.exists-probably h1` does not exist.
      lvl0: {
        selectors: [".exists-probably h1", ".exists h1"],
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      // `.exists p, .exists li` will be selected.
      content: [
        ".does-not-exists p, .does-not-exists li",
        ".exists p, .exists li",
      ],
    },
  });
},
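The fallback behaviour described above can be pictured as "the first selector that matches wins". The sketch below is only an illustration of that rule, not the helper's actual implementation: pickFirstMatch, fakePage and queryFakePage are hypothetical names, and the page is simulated with a plain object instead of a real DOM.

```javascript
// Hypothetical helper, not part of the DocSearch API: it only illustrates
// the "first selector that matches wins" rule for fallback selectors.
function pickFirstMatch(selectors, queryPage) {
  // Accept either a single selector or a list of fallbacks.
  const candidates = Array.isArray(selectors) ? selectors : [selectors];
  for (const selector of candidates) {
    const matches = queryPage(selector);
    if (matches.length > 0) {
      return { selector, matches };
    }
  }
  // Nothing matched on this page.
  return { selector: null, matches: [] };
}

// Stand-in for a crawled page: selector -> extracted text (assumed data).
const fakePage = {
  ".exists h1": ["Getting started"],
  ".exists p, .exists li": ["First paragraph", "First list item"],
};
const queryFakePage = (selector) => fakePage[selector] ?? [];

const lvl0 = pickFirstMatch([".exists-probably h1", ".exists h1"], queryFakePage);
console.log(lvl0.selector); // ".exists h1"
```

Listing the most specific selector first keeps the preferred source of content in control, while later entries only apply to pages where it is absent.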
With custom variables
These selectors also support defaultValue and fallback selectors.
Custom variables are added to your Algolia records to be used as filters in the frontend (e.g. version, lang, etc.):
recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      foo: ".bar",
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".does-not-exists",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
      version: {
        // You can send raw values without `selectors`
        defaultValue: ["latest", "stable"],
      },
    },
  });
},
The foo, language and version attributes of these records will be:
foo: "valueFromBarSelector",
language: ["en", "en-US"],
version: ["latest", "stable"]
You can now use them to filter your search in the frontend.
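As a rough sketch of the frontend side, the custom variables can be turned into an Algolia facetFilters value, where entries of the outer array are combined with AND and entries of an inner array with OR. The buildFacetFilters function and the selection object below are hypothetical; how the result is passed to your search UI depends on your setup.

```javascript
// Hypothetical helper: turns the visitor's current selection into an
// Algolia `facetFilters` value. Inner arrays are OR-ed, outer entries AND-ed.
function buildFacetFilters(selection) {
  return Object.entries(selection).map(([attribute, values]) => {
    // Accept a single value or a list of accepted values.
    const list = Array.isArray(values) ? values : [values];
    return list.map((value) => `${attribute}:${value}`);
  });
}

// Assumed selection: English pages, in either the latest or stable version.
const facetFilters = buildFacetFilters({
  language: "en",
  version: ["latest", "stable"],
});

console.log(JSON.stringify(facetFilters));
// [["language:en"],["version:latest","version:stable"]]
```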
With raw text (defaultValue)
Only the lvl0 and custom variable selectors support this option.
You might want to structure your search results differently than your website, or provide a defaultValue to a potentially non-existent selector:
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably h1",
        defaultValue: "myRawTextIfDoesNotExists",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably .language",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
    },
  });
},
Boosting search results with pageRank
pageRank used to be an integer; it is now a string.
This parameter helps to boost records built from the current pathsToMatch. Pages with the highest pageRank will be returned before pages with a lower pageRank. Note that you can pass any numeric value as a string, including negative values:
{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: "header h1",
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
        pageRank: "30",
      },
      indexHeadings: true,
    });
  },
},
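Because pageRank is a string holding a numeric value, it helps to picture the ordering in numeric terms. The sketch below is illustrative only: the actual ranking is performed by Algolia, and the pages array and byPageRankDescending function are made-up names used to show how such values compare.

```javascript
// Illustrative only: the real ranking happens inside Algolia. This sketch
// just shows how string pageRank values compare once read as numbers.
const pages = [
  { path: "/blog/**", pageRank: "-5" },
  { path: "/api/**", pageRank: "30" },
  { path: "/docs/**", pageRank: "4" },
];

function byPageRankDescending(a, b) {
  // Convert to numbers first: a plain string comparison would rank "4" above "30".
  return Number(b.pageRank) - Number(a.pageRank);
}

// Copy before sorting so the original array is left untouched.
const ordered = [...pages].sort(byPageRankDescending).map((page) => page.path);
console.log(ordered.join(" > ")); // /api/** > /docs/** > /blog/**
```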
recordProps API Reference
lvl0
type: Lvl0 | required
type Lvl0 = {
  selectors: string | string[];
  defaultValue?: string;
};
lvl1, content
type: string | string[] | required
lvl2, lvl3, lvl4, lvl5, lvl6
type: string | string[] | optional
pageRank
type: string | optional
See the live example
Custom variables ([k: string])
type: string | string[] | CustomVariable | optional
type CustomVariable =
  | {
      defaultValue: string | string[];
    }
  | {
      selectors: string | string[];
      defaultValue?: string | string[];
    };
Contains values that can be used as facetFilters.
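The reference above can be turned into a quick pre-flight check before a configuration is deployed. This is a minimal sketch that only verifies what the reference states (lvl0, lvl1 and content are required, pageRank is a string); checkRecordProps and REQUIRED_PROPS are hypothetical names and are not part of the DocSearch helpers.

```javascript
// Hypothetical pre-flight check, not part of the DocSearch helpers.
// It only encodes the required/optional rules listed in the reference above.
const REQUIRED_PROPS = ["lvl0", "lvl1", "content"];

function checkRecordProps(recordProps) {
  const problems = [];
  for (const key of REQUIRED_PROPS) {
    if (recordProps[key] === undefined) {
      problems.push(`missing required property: ${key}`);
    }
  }
  // pageRank is optional, but when present it must be a string.
  if ("pageRank" in recordProps && typeof recordProps.pageRank !== "string") {
    problems.push("pageRank must be a string");
  }
  return problems;
}

const problems = checkRecordProps({
  lvl0: { selectors: "header h1" },
  content: "main p, main li",
  pageRank: 30, // number instead of string, reported below
});
console.log(problems);
// Reports the missing lvl1 and the non-string pageRank.
```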