Optimus Configuration Reference
Optimus Mine pipelines are described in a YAML file. Each pipeline is made up of one or more tasks; a task is either a web scraper, a data parser, or a data exporter.
Root Parameters
These parameters are at the root of the YAML file.
defaults
Used to specify parameters which are applied to all tasks in the pipeline.
```yaml
defaults:
  timeout: 1m # all tasks below now have a 1m timeout
tasks:
  - name: task_1
    # ...
  - name: task_2
    # ...
```
tasks (required)
tasks contains an array of all your pipeline tasks. A pipeline requires at least one task.
```yaml
tasks:
  - name: task_1
    task: scrape
    engine: xpath
    # ...
  - name: task_2
    task: export
    # ...
```
Global Task Parameters
These parameters apply to all task types.
name (required)
Specify a unique task name to be used in log messages and for referencing data using input.
task (required)
task can be either scrape, parse or export.
| task | Description |
|---|---|
| scrape | Incoming data is treated as URLs; each URL is requested and its response parsed. |
| parse | Incoming data is parsed using your chosen engine. |
| export | Incoming data is exported using your chosen provider. |
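For example, a minimal pipeline could scrape a list of URLs and export the results to a local file (the task names and URLs below are illustrative):

```yaml
tasks:
  - name: fetch_pages
    task: scrape
    engine: xpath
    input:
      - https://example.com/page1
      - https://example.com/page2
    # schema omitted for brevity
  - name: export_results
    task: export
    engine: file
    format: ndjson
    path: ./results.ndjson
```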
engine
engine specifies the engine with which to parse incoming data in the case of scrape &
parse tasks, or the export provider in the case of export tasks.
If the current task is a parse or scrape task and engine is omitted, the task returns the
data as-is, without parsing it.
Parse Engines
| engine | Description |
|---|---|
| csv | Parse incoming CSV data. |
| javascript | Parse incoming data using custom JavaScript. |
| json | Parse incoming JSON data. |
| regex | Parse incoming data using regular expressions. |
| xpath | Parse incoming XML or HTML data using XPath. |
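For example, a parse task that parses the output of a previous task as JSON might look like this (the task names are illustrative; the schema is omitted for brevity):

```yaml
tasks:
  - name: parse_payload
    task: parse
    engine: json
    input: $task_1
    # schema: ...
```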
Export Engines
| engine | Description |
|---|---|
| bigquery | Export incoming data to Google BigQuery. |
| file | Export incoming data to a local file. |
| mysql | Export incoming data to a MySQL database. |
| s3 | Export incoming data to S3-compatible file storage providers such as AWS S3 or Google Cloud Storage. |
input
Specify data to be sent to the task. This can either be a string, an array of strings or the output
of another task.
To specify the output of a previous task as input, use a dollar sign ($) followed by the name of
that task, e.g. to use the output of task_1, set the input parameter of a later task to
$task_1. You can combine the output of multiple tasks by using an array of task variables.
If left blank the input will be the output of the preceding task.
```yaml
tasks:
  # the input of `task_1` will be `foo` & `bar`
  - name: task_1
    input:
      - foo
      - bar
  # `task_2` has no input parameter, so its input will be the output of `task_1`
  - name: task_2
  # the input of `task_3` will be the output of `task_1`
  - name: task_3
    input: $task_1
  # the input of `task_4` will be the outputs of both `task_2` and `task_3`
  - name: task_4
    input:
      - $task_2
      - $task_3
```
verbose
Set to true to enable verbose logging for debugging; every request made is then logged.
Note that this results in a very large number of log messages and should be disabled in
production.
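For example, to debug the requests made by a single scrape task:

```yaml
tasks:
  - name: task_1
    task: scrape
    verbose: true
    # ...
```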
Parse Parameters
These parameters apply to parse & scrape tasks.
schema
schema is a superset of JSON Schema, defining the data to scrape or parse. It supports all
standard JSON Schema types and adds two new properties: source & filter.
For more information see the parsing data documentation.
schema:filter
Set a filter to manipulate parsed data. See the filter documentation.
schema:source
Set the source from which to extract the current value. The format depends on the engine you're using, e.g.
if the engine is regex, then source must be a regular expression; if it's xpath,
source must be a valid XPath expression.
For more information see the parsing data documentation.
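As an illustrative sketch for an xpath task (the XPath expressions are examples and the trim filter name is hypothetical; see the parsing data documentation for the available filters):

```yaml
schema:
  type: array
  source: //div[@class="product"] # XPath selecting each item node
  items:
    type: object
    properties:
      title:
        type: string
        source: .//h2/text()
      price:
        type: string
        source: .//span[@class="price"]/text()
        filter: trim # hypothetical filter name
```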
Scrape Parameters
These parameters apply to scrape tasks only.
method
The HTTP method to use for requests, e.g. GET, POST, PUT. Defaults to GET.
delay
How long to wait after each request. Durations are formatted as integers with a unit e.g. 1m30s.
Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 0.
error_delay
How long to wait before retrying a failed request. Durations are formatted as integers with a unit
e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 500ms.
retry
How many times to retry a failed request. Defaults to 3.
threads
Specify how many concurrent requests you’d like to make. Defaults to 5.
timeout
Set a timeout for the HTTP requests. Durations are formatted as integers with a unit e.g. 1m30s.
Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 30s.
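Putting the request parameters together, a scrape task might look like this (the values are illustrative):

```yaml
tasks:
  - name: fetch_pages
    task: scrape
    engine: xpath
    method: GET # the default
    threads: 10 # up to 10 concurrent requests
    delay: 500ms # wait 500ms after each request
    retry: 5 # retry failed requests up to 5 times
    timeout: 45s
```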
header
Add headers to HTTP requests.

```yaml
header:
  Cache-Control: no-cache
  User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36
```
handle
handle allows you to set actions for specific HTTP response codes, timeouts etc. It is defined as
a key-value map in the following format:
```yaml
handle:
  <event>: <handler-definition>
```

For example, you could switch the current VPN whenever you get a 429 (Too Many Requests) response:

```yaml
handle:
  429:
    action: switch_vpn
```
Events
The event key can be any HTTP response code (i.e. a number between 200 and 599), timeout or
before.
| Event | Description |
|---|---|
| 200-599 | Perform the action when the response has the given status code. See the full list of response codes. |
| timeout | Perform the action if the request times out. |
| before | Perform the action before each request. |
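For example, you could retry timed-out requests and sleep when the server returns 503 (Service Unavailable):

```yaml
handle:
  timeout:
    action: retry
  503:
    action: sleep
    duration: 30s
```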
handle:action
The handler action defines what action to perform on the specified event.
| Action | Description |
|---|---|
| skip | Skip the current page. |
| sleep | Sleep for a specified amount of time. |
| switch_vpn | Switch to a different VPN server. |
| retry | Retry the current request. |
| shell | Run a custom shell script. |
handle:before
| Operator | Description |
|---|---|
| if | Perform the action if the condition is true. |
| unless | Perform the action if the condition is false. |

| Condition | Description |
|---|---|
| has_prefix | true if the URL starts with prefix. |
| has_suffix | true if the URL ends with suffix. |
| contains | true if the URL contains contains. |

For example, to skip all URLs starting with https://example.com/blog:

```yaml
handle:
  before:
    if: has_prefix
    prefix: https://example.com/blog
    action: skip
```

Or to skip all URLs that do not contain /blog:

```yaml
handle:
  before:
    unless: contains
    contains: /blog
    action: skip
```
handle:action:skip

```yaml
handle:
  403:
    action: skip
```

handle:action:sleep

```yaml
handle:
  403:
    action: sleep
    duration: 10s
```

handle:action:switch_vpn

```yaml
handle:
  403:
    action: switch_vpn
```

handle:action:retry

```yaml
handle:
  403:
    action: retry
```

handle:action:shell

```yaml
handle:
  403:
    action: shell
    script: /path/to/script
```
vpn
Setting vpn enables the VPN service.
```yaml
vpn:
  provider: hma # currently only HMA is supported
  username: your_username
  password: your_password
```
Export Parameters
Local File
Store incoming data in a local file.
| Parameter | Description |
|---|---|
| engine (required) | Must be file. |
| format (required) | The file format to use. Can be either json, ndjson or csv. |
| path (required) | The relative or absolute path. If the file already exists it will be overwritten. |
Formats
| Format | Description |
|---|---|
| ndjson (recommended) | Stores the result in a newline-delimited JSON file. See http://jsonlines.org/. |
| json | Stores the result in a JSON array. |
| csv | Stores the result in a CSV file. |
Example
```yaml
tasks:
  - name: example_export
    task: export
    engine: file
    format: json
    path: /path/to/my/data.json
```
MySQL
Export incoming data to a MySQL database.
| Parameter | Description |
|---|---|
| engine (required) | Must be mysql. |
| address (required) | The network address of the MySQL server, e.g. db.example.com:3306. |
| database (required) | The name of the MySQL database. |
| table (required) | The name of the MySQL table. |
| username | The username to authenticate to MySQL. |
| password | The password to authenticate to MySQL. Requires username. |
Example
```yaml
tasks:
  - name: example_export
    task: export
    engine: mysql
    address: localhost:3306
    database: mydatabase
    table: mytable
```
Google BigQuery
Export incoming data to Google BigQuery.
Important: You will need to set up a service account for Google Cloud. See the Google instructions on how to create a service account and set up the
GOOGLE_APPLICATION_CREDENTIALS environment variable.
| Parameter | Description |
|---|---|
| engine (required) | Must be bigquery. |
| project (required) | The name of the Google Cloud project. |
| dataset (required) | The name of the BigQuery dataset. |
| table (required) | The name of the BigQuery table. |
| credentials | The path to a local service account file. |
| replace | Delete existing data in the target table. Defaults to false. |
| max_errors | How many errors to allow before aborting. Defaults to 5. |
Example
```yaml
tasks:
  - name: example_export
    task: export
    engine: bigquery
    project: mycoolproject
    dataset: mydataset
    table: mytable
    credentials: '{{ userDir }}/Documents/mycoolproject-4b7b0af746ef.json'
```
S3
Export incoming data to a newline delimited JSON file & upload to an S3-compatible file storage service.
| Parameter | Description |
|---|---|
| engine (required) | Must be s3. |
| bucket (required) | Bucket name to store the file in. The bucket is created if it doesn't exist. |
| access_key_id (required) | Access key ID to authenticate with. |
| secret_key (required) | Secret key to authenticate with. |
| endpoint | The S3 endpoint to connect to. Defaults to s3.amazonaws.com. |
| use_ssl | Whether to connect over SSL. Defaults to true. |
Google Cloud Storage
To upload the data to GCS, set endpoint to storage.googleapis.com and use
HMAC keys for authentication.
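For example (the bucket name and HMAC credentials below are placeholders):

```yaml
tasks:
  - name: example_export
    task: export
    engine: s3
    endpoint: storage.googleapis.com
    bucket: mybucket
    access_key_id: GOOG1EXAMPLEKEYID
    secret_key: example_hmac_secret
```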