Optimus Configuration Reference

Optimus Mine pipelines are described using a YAML file. Each pipeline is made up of one or more tasks, with a task being either a web scraper, a data parser or a data exporter.

Root Parameters

These parameters are at the root of the YAML file.

defaults

Used to specify parameters which are applied to all tasks in the pipeline.

defaults:
  timeout: 1m # all tasks below now have a 1m timeout

tasks:
  - name: task_1
    # ...
  - name: task_2
    # ...

tasks (required)

tasks contains an array of all your pipeline tasks. A pipeline requires at least one task.

tasks:
  - name: task_1
    task: scrape
    engine: xpath
    # ...
  - name: task_2
    task: export
    # ...

Global Task Parameters

These parameters apply to all task types.

name (required)

Specify a unique task name to be used in log messages and for referencing data using input.

task (required)

task can be either scrape, parse or export.

task Description
scrape Incoming data will be treated as URLs; each URL will be requested and its response parsed.
parse Incoming data will be parsed using your chosen engine.
export Incoming data will be exported using your chosen provider.

engine

engine specifies which engine to parse incoming data with in the case of scrape & parse tasks, or which export provider to use in the case of export tasks.

If the current task is a parse or scrape task and engine is omitted, the task will return the data as-is, without parsing it.
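
For instance, a task that parses incoming JSON data would set engine accordingly (a minimal sketch; the task name is a placeholder):

tasks:
  - name: parse_response # placeholder name
    task: parse
    engine: json
    # ...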

Parse Engines

engine Description
csv The csv engine is used to parse incoming CSV data.
javascript Use custom JavaScript to parse incoming data.
json The json engine is used to parse incoming JSON data.
regex The regex engine is used to parse incoming data using regular expressions.
xpath The xpath engine is used to parse incoming XML or HTML data using xpath.

Export Engines

engine Description
bigquery Export incoming data to Google BigQuery.
mysql Export incoming data to a MySQL database.
s3 Export incoming data to S3 compatible file storage providers such as AWS S3 or Google Cloud Storage.

input

Specify data to be sent to the task. This can either be a string, an array of strings or the output of another task.
To specify the output of a previous task as input, use the dollar sign ($) followed by the name of the task, e.g. to get the output of task_1 you would set the input parameter of that task to $task_1. You can get the output of multiple tasks by using an array of task variables.
If left blank the input will be the output of the preceding task.

tasks:
  # the input of `task_1` will be `foo` & `bar`
  - name: task_1
    input:
      - foo
      - bar
  # `task_2` has no input parameter, so the input of `task_2` will be the output of `task_1`
  - name: task_2
  # the input of `task_3` will be the output of `task_1`
  - name: task_3
    input: $task_1
  # the input of `task_4` will be the outputs of both `task_2` and `task_3`
  - name: task_4
    input:
      - $task_2
      - $task_3

verbose

Set to true to enable verbose logging for debugging; every single request made will be logged. Please note this results in a very large number of log messages and should be disabled for production.
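
For example, to enable verbose logging on a single scrape task (a minimal sketch):

tasks:
  - name: task_1
    task: scrape
    verbose: true
    # ...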

Parse Parameters

These parameters apply to parse & scrape tasks.

schema

schema is a superset of JSON schema, defining the data to scrape or parse. It supports all standard types of JSON Schema and adds 2 new properties: source & filter.

For more information see the parsing data documentation.

schema:filter

Set a filter to manipulate parsed data. See the filter documentation.

schema:source

Set the source where to find the current value. The format depends on the engine you’re using, e.g. if the engine is regex, then source needs to be a regular expression; if it’s xpath, source needs to be a valid XPath expression.

For more information see the parsing data documentation.
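
As a rough sketch, assuming the standard JSON Schema layout of type and properties (the task name, field names and XPath expressions below are placeholders; the exact semantics are covered in the parsing data documentation):

tasks:
  - name: parse_page # placeholder name
    task: parse
    engine: xpath
    schema:
      type: object
      properties:
        title:
          type: string
          source: //h1/text() # placeholder XPath
        price:
          type: string
          source: //span[@class="price"]/text() # placeholder XPath
          # filter: ...  # see the filter documentation for filter syntax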

Scrape Parameters

These parameters apply to scrape tasks only.

method

HTTP method to make requests e.g. GET, POST, PUT etc. Defaults to GET.

delay

How long to wait after each request. Durations are formatted as integers with a unit e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 0.

error_delay

How long to wait before retrying a failed request. Durations are formatted as integers with a unit e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 500ms.

retry

How many times to retry a failed request. Defaults to 3.

threads

Specify how many concurrent requests you’d like to make. Defaults to 5.

timeout

Set a timeout for the HTTP requests. Durations are formatted as integers with a unit e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 30s.
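
Putting the above together, a scrape task with explicit request settings might look like this (a minimal sketch; the values are illustrative, not recommendations):

tasks:
  - name: scrape_listings # placeholder name
    task: scrape
    engine: xpath
    method: GET
    delay: 2s
    error_delay: 1s
    retry: 5
    threads: 10
    timeout: 45s
    # ...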

header

Add headers to HTTP requests.

header:
  Cache-Control: no-cache
  User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36

handle

handle allows you to set actions for specific HTTP response codes, timeouts etc. It is defined as a key-value map in the following format:

handle:
  <event>: <handler-definition>

For example you could switch the current VPN whenever you get a 429 (Too Many Requests) response:

handle:
  429:
    action: switch_vpn

Events

The event key can be any HTTP response code (i.e. a number between 200 and 599), timeout or before.

Event Description
200-599 Perform action on response code. See full list of response codes.
timeout Perform action if request times out.
before Perform action before request.
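
For example, a handler that retries whenever a request times out (a minimal sketch combining the documented timeout event and retry action):

handle:
  timeout:
    action: retry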

handle:action

The handler action defines what action to perform on the specified event.

Action Description
skip Skip the current page.
sleep Sleep for specified amount of time.
switch_vpn Switch to a different VPN server.
retry Retry the current request.
shell Run a custom shell script.

handle:before

The before event runs its action before each request and supports conditions, so you can limit the action to specific URLs.

Operator Description
if Perform action if condition is true.
unless Perform action if condition is false.

Condition Description
has_prefix true if the URL starts with prefix.
has_suffix true if the URL ends with suffix.
contains true if the URL contains contains.

handle:
  before:
    if: has_prefix
    prefix: https://example.com/blog
    action: skip

handle:
  before:
    unless: contains
    contains: /blog
    action: skip

handle:action:skip

handle:
  403:
    action: skip

handle:action:sleep

handle:
  403:
    action: sleep
    duration: 10s

handle:action:switch_vpn

handle:
  403:
    action: switch_vpn

handle:action:retry

handle:
  403:
    action: retry

handle:action:shell

handle:
  403:
    action: shell
    script: /path/to/script

vpn

Setting vpn enables the VPN service.

vpn:
  provider: hma # currently only HMA is supported
  username: your_username
  password: your_password

Export Parameters

Local File

Store incoming data in a local file.

Parameter Description
engine (required) Must be file.
format (required) The file format to use. Can be either json, ndjson or csv.
path (required) The relative or absolute path. If the file already exists it will be overwritten.

Formats

Parameter Description
ndjson (recommended) Stores the result in a newline delimited JSON file. See http://jsonlines.org/.
json Stores the result in a JSON array.
csv Stores the result in a CSV file.

Example

tasks:
  - name: example_export
    task: export
    engine: file
    format: json
    path: /path/to/my/data.json

MySQL

Export incoming data to a MySQL database.

Parameter Description
engine (required) Must be mysql.
address (required) The network address of the MySQL server e.g. db.example.com:3306.
database (required) The name of the MySQL database.
table (required) The name of the MySQL table.
username The username to authenticate to MySQL.
password The password to authenticate to MySQL. Requires username.

Example

tasks:
  - name: example_export
    task: export
    engine: mysql
    address: localhost:3306
    database: mydatabase
    table: mytable

Google BigQuery

Export incoming data to Google BigQuery.

Important: You will need to set up a service account for Google Cloud. See the Google instructions on how to create a service account and set up the GOOGLE_APPLICATION_CREDENTIALS environment variable.

Parameter Description
engine (required) Must be bigquery.
project (required) The name of the Google Cloud project.
dataset (required) The name of the BigQuery dataset.
table (required) The name of the BigQuery table.
credentials The path to a local service account file.
replace Delete existing data in target table. Defaults to false.
max_errors How many errors to allow before aborting. Defaults to 5.

Example

tasks:
  - name: example_export
    task: export
    engine: bigquery
    project: mycoolproject
    dataset: mydataset
    table: mytable
    credentials: '{{ userDir }}/Documents/mycoolproject-4b7b0af746ef.json'

S3

Export incoming data to a newline delimited JSON file & upload to an S3-compatible file storage service.

Parameter Description
engine (required) Must be s3.
bucket (required) Bucket name to store file in. Creates the bucket if it doesn’t exist.
access_key_id (required) Access key ID to authenticate with.
secret_key (required) Secret key to authenticate with.
endpoint Defaults to s3.amazonaws.com.
use_ssl Defaults to true.
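
A minimal sketch of an S3 export task (the bucket name and credentials are placeholders):

tasks:
  - name: example_export
    task: export
    engine: s3
    bucket: my-bucket # placeholder
    access_key_id: your_access_key_id # placeholder
    secret_key: your_secret_key # placeholder
    endpoint: s3.amazonaws.com # optional, this is the default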

Google Cloud Storage

To upload the data to GCS, set endpoint to storage.googleapis.com and use HMAC keys for authentication.
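
For example (a minimal sketch; the bucket name and HMAC key values are placeholders):

tasks:
  - name: example_export
    task: export
    engine: s3
    bucket: my-bucket # placeholder
    endpoint: storage.googleapis.com
    access_key_id: your_hmac_access_id # placeholder
    secret_key: your_hmac_secret # placeholder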