Custom Parsers (JavaScript)
For more complex parsing jobs you can write custom parsers in JavaScript. To use it, set `engine` to `javascript` and put your JavaScript code in the `source` field of the root schema.
Note: Although the JavaScript engine is fairly performant, it doesn’t match the performance of native engines, so avoid using it for tasks that could be achieved with other parsers.
JavaScript API
The Optimus Mine JavaScript engine uses ES2015 with only minor additions.
Code | Description |
---|---|
`$input` | The `$input` variable contains the incoming data. |
`$callback` | The `$callback` function is used to pass data to the next task. |
`require` | Used to import modules. Imported modules need to be CommonJS modules using ES5 syntax. |
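To try a script body outside Optimus Mine, the two globals can be stubbed. The following is a minimal sketch for Node.js, not part of the product: `runScript` is a hypothetical helper that supplies `$input` as a value and `$callback` as a function that collects whatever the script passes on.

```javascript
// Hypothetical local harness: NOT part of Optimus Mine.
// It mimics the two globals so a script body can be tried in Node.js.
function runScript(source, input) {
  const results = [];
  // Each $callback call hands one value to the "next task"; here it is just collected.
  const callback = (value) => {
    results.push(value);
  };
  // Compile the script body with $input and $callback as its parameters.
  const script = new Function('$input', '$callback', source);
  script(input, callback);
  return results;
}

// Example script body: emits one value per comma-separated item.
const source = `
  const parts = $input.split(',');
  parts.forEach((part) => $callback(part.trim()));
`;

console.log(runScript(source, 'a, b ,c'));
// prints [ 'a', 'b', 'c' ]
```

The harness only approximates the engine: it does not apply schemas or filters, and `require` resolution inside Optimus Mine may differ from Node.js.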
Limitations
You can only have one script per schema, and it must be set on the `source` property of the root schema.
Examples
Generating URLs for scraping
In the following example we’re using the JavaScript engine to build a list of URLs to be consumed by a scrape task.
    tasks:
      - task: parse
        engine: javascript
        input: '1-10'
        schema:
          type: string
          source: |
            // Convert the bounds to numbers so the loop compares numerically.
            const [start, end] = $input.split('-').map(Number);
            for (let i = start; i <= end; i++) {
              $callback(`https://example.com/products/page-${i}.html`);
            }
Output:
https://example.com/products/page-1.html
https://example.com/products/page-2.html
https://example.com/products/page-3.html
https://example.com/products/page-4.html
https://example.com/products/page-5.html
https://example.com/products/page-6.html
https://example.com/products/page-7.html
https://example.com/products/page-8.html
https://example.com/products/page-9.html
https://example.com/products/page-10.html
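The range logic can also be checked on its own before it goes into a schema. The sketch below assumes a hypothetical helper, `buildPageUrls`, that is not part of Optimus Mine. It converts both bounds to numbers, because comparing the raw strings from `split` would be lexicographic and a range such as `2-10` would produce no URLs.

```javascript
// Hypothetical helper (not part of Optimus Mine): expands "start-end" into page URLs.
function buildPageUrls(range) {
  // Convert both bounds to numbers; comparing the raw strings would be
  // lexicographic ("2" <= "10" is false).
  const [start, end] = range.split('-').map(Number);
  if (!Number.isInteger(start) || !Number.isInteger(end)) {
    throw new Error(`Invalid range: ${range}`);
  }
  const urls = [];
  for (let i = start; i <= end; i++) {
    urls.push(`https://example.com/products/page-${i}.html`);
  }
  return urls;
}

console.log(buildPageUrls('2-4'));
```

Inside a schema's `source`, the same loop would call `$callback` once per URL instead of collecting the URLs in an array.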
Parsing more complex data structures
Since you can only use JavaScript on the root schema, the script needs to pass `$callback` an object containing the structure defined in the schema, and all sub-schemas must omit the `source` property. You can, however, still use filters on them as you would for any other parser.
The example below shows how to achieve this.
    tasks:
      - task: parse
        engine: javascript
        input: 'Bob:34:02071234567'
        schema:
          type: object
          source: |
            const [name, age, phone] = $input.split(':');
            $callback({ name, age, phone });
          properties:
            name:
              type: string
            age:
              type: integer
            phone:
              type: string
              filter:
                type: phone
                country: GB
Output:
name: Bob
age: 34
phone: +4402071234567
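The object-building step can be sketched as a standalone function to make the shape of the value explicit. `parseRecord` is a hypothetical name used only for illustration. The keys of the returned object match the property names in the schema, and the values are left as strings on the assumption that the schema's `type` and `filter` settings handle conversion, as the output above suggests.

```javascript
// Hypothetical helper (not part of Optimus Mine): splits "name:age:phone" into an
// object whose keys match the property names declared in the schema.
function parseRecord(input) {
  const parts = input.split(':');
  if (parts.length !== 3) {
    throw new Error(`Expected 3 fields, got ${parts.length}`);
  }
  const [name, age, phone] = parts;
  // Values stay as strings here; type conversion and the phone filter are
  // assumed to be applied by the schema.
  return { name, age, phone };
}

console.log(parseRecord('Bob:34:02071234567'));
// prints { name: 'Bob', age: '34', phone: '02071234567' }
```

In the schema's `source`, the returned object would be passed to `$callback` rather than logged.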