# Optimus Configuration Reference
Optimus Mine pipelines are described using a YAML file. Each pipeline is made up of one or more tasks, with a task being either a web scraper, a data parser or a data exporter.
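For example, a minimal pipeline might fetch a page and write the raw response to a local file (a sketch; the URL and path are placeholders):

```yaml
tasks:
  # fetch the page; since no `engine` is set, the response is passed on as-is
  - name: fetch
    task: scrape
    input: https://example.com # placeholder URL

  # write the raw response to a newline-delimited JSON file
  - name: save
    task: export
    engine: file
    format: ndjson
    path: ./output.ndjson # placeholder path
```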
## Root Parameters
These parameters are at the root of the YAML file.
### defaults
Used to specify parameters which are applied to all tasks in the pipeline.
```yaml
defaults:
  timeout: 1m # all tasks below now have a 1m timeout

tasks:
  - name: task_1
    # ...
  - name: task_2
    # ...
```
### tasks (required)

`tasks` contains an array of all your pipeline tasks. A pipeline requires at least one task.
```yaml
tasks:
  - name: task_1
    task: scrape
    engine: xpath
    # ...
  - name: task_2
    task: export
    # ...
```
## Global Task Parameters
These parameters apply to all task types.
### name (required)

Specify a unique task name, used in log messages and for referencing the task's output via `input`.
### task (required)

`task` can be either `scrape`, `parse` or `export`.

| task | Description |
| --- | --- |
| `scrape` | Incoming data is treated as URLs; each URL is requested and its response parsed. |
| `parse` | Incoming data is parsed using your chosen engine. |
| `export` | Incoming data is exported using your chosen provider. |
### engine

`engine` specifies the engine used to parse incoming data for `scrape` & `parse` tasks, or the export provider for `export` tasks.

If the current task is a `parse` or `scrape` task and `engine` is omitted, the task returns the data as-is, without parsing it.
#### Parse Engines

| engine | Description |
| --- | --- |
| `csv` | Parse incoming CSV data. |
| `javascript` | Parse incoming data using custom JavaScript. |
| `json` | Parse incoming JSON data. |
| `regex` | Parse incoming data using regular expressions. |
| `xpath` | Parse incoming XML or HTML data using XPath. |
#### Export Engines

| engine | Description |
| --- | --- |
| `bigquery` | Export incoming data to Google BigQuery. |
| `file` | Export incoming data to a local file. |
| `mysql` | Export incoming data to a MySQL database. |
| `s3` | Export incoming data to S3-compatible file storage providers such as AWS S3 or Google Cloud Storage. |
### input

Specify data to be sent to the task. This can be a string, an array of strings, or the output of another task.

To use the output of a previous task as input, use a dollar sign (`$`) followed by the name of that task, e.g. to use the output of `task_1`, set `input: $task_1`. You can combine the outputs of multiple tasks by using an array of task variables.

If left blank, the input will be the output of the preceding task.
```yaml
tasks:
  # the input of `task_1` will be `foo` & `bar`
  - name: task_1
    input:
      - foo
      - bar

  # `task_2` has no input parameter, so its input will be the output of `task_1`
  - name: task_2

  # the input of `task_3` will be the output of `task_1`
  - name: task_3
    input: $task_1

  # the input of `task_4` will be the outputs of both `task_2` and `task_3`
  - name: task_4
    input:
      - $task_2
      - $task_3
```
### verbose

Set to `true` to enable verbose logging for debugging; every single request made will be logged. Note that this results in a very large number of log messages and should be disabled in production.
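For example, you could enable it for every task while debugging by setting it in `defaults` (a sketch):

```yaml
defaults:
  verbose: true # log every request; disable in production

tasks:
  - name: task_1
    # ...
```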
## Parse Parameters

These parameters apply to `parse` & `scrape` tasks.
### schema

`schema` is a superset of JSON Schema, defining the data to scrape or parse. It supports all standard JSON Schema types and adds 2 new properties: `source` & `filter`.

For more information see the parsing data documentation.
### schema:filter

Set a filter to manipulate parsed data. See the filter documentation.
### schema:source

Set the source where to find the current value. The format depends on the engine you're using, e.g. if the engine is `regex`, then `source` needs to be a regular expression; if it's `xpath`, then `source` needs to be a valid XPath expression.

For more information see the parsing data documentation.
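As an illustration, a `parse` task using the `xpath` engine might look like the sketch below. The exact schema layout is covered in the parsing data documentation; the node structure, property names and XPath expressions here are hypothetical:

```yaml
tasks:
  - name: parse_products
    task: parse
    engine: xpath
    schema:
      type: array
      # hypothetical: one result per matching node
      source: //div[@class='product']
      items:
        type: object
        properties:
          title:
            type: string
            # hypothetical XPath, relative to the node selected above
            source: .//h2/text()
          price:
            type: string
            source: .//span[@class='price']/text()
```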
## Scrape Parameters

These parameters apply to `scrape` tasks only.
### method

The HTTP method to use for requests, e.g. `GET`, `POST`, `PUT` etc. Defaults to `GET`.
### delay

How long to wait after each request. Durations are formatted as integers with a unit, e.g. `1m30s`. Valid time units are `ns`, `us` (or `µs`), `ms`, `s`, `m`, `h`. Defaults to `0`.
### error_delay

How long to wait before retrying a failed request. Durations are formatted as integers with a unit, e.g. `1m30s`. Valid time units are `ns`, `us` (or `µs`), `ms`, `s`, `m`, `h`. Defaults to `500ms`.
### retry

How many times to retry failed requests. Defaults to `3`.
### threads

How many concurrent requests to make. Defaults to `5`.
### timeout

Set a timeout for HTTP requests. Durations are formatted as integers with a unit, e.g. `1m30s`. Valid time units are `ns`, `us` (or `µs`), `ms`, `s`, `m`, `h`. Defaults to `30s`.
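Putting these together, a throttled `scrape` task might look like the following sketch (values are illustrative, not recommendations):

```yaml
tasks:
  - name: polite_scraper
    task: scrape
    engine: xpath
    method: GET     # default
    threads: 2      # at most 2 concurrent requests
    delay: 500ms    # wait 500ms after each request
    timeout: 10s    # give up on a request after 10s
    retry: 5        # retry failed requests up to 5 times
    error_delay: 2s # wait 2s before each retry
    # ...
```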
### header

Add headers to HTTP requests.

```yaml
header:
  Cache-Control: no-cache
  User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36
```
### handle

`handle` allows you to set actions for specific HTTP response codes, timeouts etc. It is defined as a key-value map in the following format:

```yaml
handle:
  <event>: <handler-definition>
```

For example, you could switch the current VPN whenever you get a 429 (Too Many Requests) response:

```yaml
handle:
  429:
    action: switch_vpn
```
#### Events

The event key can be any HTTP response code (i.e. a number between `200` and `599`), `timeout` or `before`.

| Event | Description |
| --- | --- |
| `200`-`599` | Perform action on response code. See full list of response codes. |
| `timeout` | Perform action if the request times out. |
| `before` | Perform action before the request. |
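For instance, to retry a request whenever it times out (a sketch combining the `timeout` event with the `retry` action described below):

```yaml
handle:
  timeout:
    action: retry
```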
#### handle:action

The handler action defines what to do when the specified event occurs.

| Action | Description |
| --- | --- |
| `skip` | Skip the current page. |
| `sleep` | Sleep for a specified amount of time. |
| `switch_vpn` | Switch to a different VPN server. |
| `retry` | Retry the current request. |
| `shell` | Run a custom shell script. |
#### handle:before

| Operator | Description |
| --- | --- |
| `if` | Perform action if the condition is true. |
| `unless` | Perform action if the condition is false. |

| Condition | Description |
| --- | --- |
| `has_prefix` | True if the URL starts with `prefix`. |
| `has_suffix` | True if the URL ends with `suffix`. |
| `contains` | True if the URL contains `contains`. |

```yaml
handle:
  before:
    if: has_prefix
    prefix: https://example.com/blog
    action: skip
```

```yaml
handle:
  before:
    unless: contains
    contains: /blog
    action: skip
```
#### handle:action:skip

```yaml
handle:
  403:
    action: skip
```
#### handle:action:sleep

```yaml
handle:
  403:
    action: sleep
    duration: 10s
```
#### handle:action:switch_vpn

```yaml
handle:
  403:
    action: switch_vpn
```
#### handle:action:retry

```yaml
handle:
  403:
    action: retry
```
#### handle:action:shell

```yaml
handle:
  403:
    action: shell
    script: /path/to/script
```
### vpn

Setting `vpn` enables the VPN service.

```yaml
vpn:
  provider: hma # currently only HMA is supported
  username: your_username
  password: your_password
```
## Export Parameters
### Local File

Store incoming data in a local file.

| Parameter | Description |
| --- | --- |
| `engine` (required) | Must be `file`. |
| `format` (required) | The file format to use. Can be either `json`, `ndjson` or `csv`. |
| `path` (required) | The relative or absolute file path. If the file already exists it will be overwritten. |
#### Formats

| Format | Description |
| --- | --- |
| `ndjson` (recommended) | Stores the result in a newline-delimited JSON file. See http://jsonlines.org/. |
| `json` | Stores the result in a JSON array. |
| `csv` | Stores the result in a CSV file. |
#### Example

```yaml
tasks:
  - name: example_export
    task: export
    engine: file
    format: json
    path: /path/to/my/data.json
```
### MySQL

Export incoming data to a MySQL database.

| Parameter | Description |
| --- | --- |
| `engine` (required) | Must be `mysql`. |
| `address` (required) | The network address of the MySQL server, e.g. `db.example.com:3306`. |
| `database` (required) | The name of the MySQL database. |
| `table` (required) | The name of the MySQL table. |
| `username` | The username to authenticate to MySQL. |
| `password` | The password to authenticate to MySQL. Requires `username`. |
#### Example

```yaml
tasks:
  - name: example_export
    task: export
    engine: mysql
    address: localhost:3306
    database: mydatabase
    table: mytable
```
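If the server requires authentication, add the optional `username` & `password` parameters documented above (placeholder values below):

```yaml
tasks:
  - name: example_export
    task: export
    engine: mysql
    address: localhost:3306
    database: mydatabase
    table: mytable
    username: myuser     # placeholder
    password: mypassword # placeholder
```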
### Google BigQuery

Export incoming data to Google BigQuery.

Important: You will need to set up a service account for Google Cloud. See the Google instructions on how to create a service account and set up the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

| Parameter | Description |
| --- | --- |
| `engine` (required) | Must be `bigquery`. |
| `project` (required) | The name of the Google Cloud project. |
| `dataset` (required) | The name of the BigQuery dataset. |
| `table` (required) | The name of the BigQuery table. |
| `credentials` | The path to a local service account file. |
| `replace` | Delete existing data in the target table. Defaults to `false`. |
| `max_errors` | How many errors to allow before aborting. Defaults to `5`. |
#### Example

```yaml
tasks:
  - name: example_export
    task: export
    engine: bigquery
    project: mycoolproject
    dataset: mydataset
    table: mytable
    credentials: '{{ userDir }}/Documents/mycoolproject-4b7b0af746ef.json'
```
### S3

Export incoming data to a newline-delimited JSON file & upload it to an S3-compatible file storage service.

| Parameter | Description |
| --- | --- |
| `engine` (required) | Must be `s3`. |
| `bucket` (required) | The name of the bucket to store the file in. The bucket is created if it doesn't exist. |
| `access_key_id` (required) | Access key ID to authenticate with. |
| `secret_key` (required) | Secret key to authenticate with. |
| `endpoint` | Defaults to `s3.amazonaws.com`. |
| `use_ssl` | Defaults to `true`. |
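A minimal sketch using the parameters above (the bucket name and keys are placeholders):

```yaml
tasks:
  - name: example_export
    task: export
    engine: s3
    bucket: mybucket                                      # placeholder
    access_key_id: AKIAIOSFODNN7EXAMPLE                   # placeholder
    secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY  # placeholder
```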
### Google Cloud Storage

To upload the data to GCS, set `endpoint` to `storage.googleapis.com` and use HMAC keys for authentication.
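For example (a sketch; assuming the HMAC key pair is supplied via the same `access_key_id` & `secret_key` parameters, with placeholder values):

```yaml
tasks:
  - name: example_export
    task: export
    engine: s3
    endpoint: storage.googleapis.com
    bucket: mybucket                # placeholder
    access_key_id: GOOG1EXAMPLEID   # placeholder HMAC access ID
    secret_key: examplehmacsecret   # placeholder HMAC secret
```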