Optimus Configuration Reference

Optimus Mine pipelines are described using a YAML file. Each pipeline is made up of one or more tasks, with a task being either a web scraper, a data parser or a data exporter.

Root Parameters

These parameters are at the root of the YAML file.

defaults

Used to specify parameters which are applied to all tasks in the pipeline.

defaults:
  timeout: 1m # all tasks below now have a 1m timeout

tasks:
  - name: task_1
    # ...
  - name: task_2
    # ...

tasks (required)

tasks contains an array of all your pipeline tasks. A pipeline requires at least one task.

tasks:
  - name: task_1
    task: scrape
    engine: xpath
    # ...
  - name: task_2
    task: export
    # ...

Global Task Parameters

These parameters apply to all task types.

name (required)

Specify a unique task name to be used in log messages and for referencing data using input.

task (required)

task can be either scrape, parse or export.

task Description
scrape Incoming data will be treated as URLs; each URL will be requested and its response parsed.
parse Incoming data will be parsed using your chosen engine.
export Incoming data will be exported using your chosen provider.

engine

engine specifies which engine to parse incoming data with in the case of scrape & parse tasks, or which export provider to use in the case of export tasks.

If the current task is a parse or scrape task and engine is omitted, the task will return the data as-is, without parsing it.
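
For instance, a task that parses incoming JSON data would set engine accordingly (a minimal sketch; the task name is a placeholder):

tasks:
  - name: parse_response # placeholder name
    task: parse
    engine: json
    # ...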

Parse Engines

engine Description
csv The csv engine is used to parse incoming CSV data.
javascript Use custom JavaScript to parse incoming data.
json The json engine is used to parse incoming JSON data.
regex The regex engine is used to parse incoming data using regular expressions.
xpath The xpath engine is used to parse incoming XML or HTML data using xpath.

Export Engines

engine Description
bigquery Export incoming data to Google BigQuery.
mysql Export incoming data to a MySQL database.
s3 Export incoming data to S3 compatible file storage providers such as AWS S3 or Google Cloud Storage.

input

Specify data to be sent to the task. This can either be a string, an array of strings or the output of another task.
To specify the output of a previous task as input, use the dollar sign ($) followed by the name of the task, e.g. to get the output of task_1 you would set the input parameter of that task to $task_1. You can get the output of multiple tasks by using an array of task variables.
If left blank the input will be the output of the preceding task.

tasks:
  # the input of `task_1` will be `foo` & `bar`
  - name: task_1
    input:
      - foo
      - bar
  # `task_2` has no input parameter, so the input of `task_2` will be the output of `task_1`
  - name: task_2
  # the input of `task_3` will be the output of `task_1`
  - name: task_3
    input: $task_1
  # the input of `task_4` will be the outputs of both `task_2` and `task_3`
  - name: task_4
    input:
      - $task_2
      - $task_3

verbose

Set to true to enable verbose logging for debugging; every single request made will be logged. Please note this results in a very large number of log messages and should be disabled for production.
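
For example, to enable verbose logging on a single scrape task (a minimal sketch):

tasks:
  - name: task_1
    task: scrape
    verbose: true
    # ...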

Parse Parameters

These parameters apply to parse & scrape tasks.

schema

schema is a superset of JSON schema, defining the data to scrape or parse. It supports all standard types of JSON Schema and adds 2 new properties: source & filter.

For more information see the parsing data documentation.

schema:filter

Set a filter to manipulate parsed data. See the filter documentation.

schema:source

Set the source where to find the current value. The format depends on the engine you’re using, e.g. if the engine is regex, then source needs to be a regular expression; if it’s xpath, source needs to be a valid XPath expression.

For more information see the parsing data documentation.
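
As a rough sketch, assuming the standard JSON Schema layout of type and properties (the task name, field names and XPath expressions below are placeholders; the exact semantics are covered in the parsing data documentation):

tasks:
  - name: parse_page # placeholder name
    task: parse
    engine: xpath
    schema:
      type: object
      properties:
        title:
          type: string
          source: //h1/text() # placeholder XPath
        price:
          type: string
          source: //span[@class="price"]/text() # placeholder XPath
          # filter: ...  # see the filter documentation for filter syntax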

Scrape Parameters

These parameters apply to scrape tasks only.

method

HTTP method to make requests e.g. GET, POST, PUT etc. Defaults to GET.

delay

How long to wait after each request. Durations are formatted as integers with a unit e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 0.

error_delay

How long to wait before retrying a failed request. Durations are formatted as integers with a unit e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 500ms.

retry

How many times to retry a failed request. Defaults to 3.

threads

Specify how many concurrent requests you’d like to make. Defaults to 5.

timeout

Set a timeout for the HTTP requests. Durations are formatted as integers with a unit e.g. 1m30s. Valid time units are ns, us (or µs), ms, s, m, h. Defaults to 30s.
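
Putting the above together, a scrape task with explicit request settings might look like this (a minimal sketch; the values are illustrative, not recommendations):

tasks:
  - name: scrape_listings # placeholder name
    task: scrape
    engine: xpath
    method: GET
    delay: 2s
    error_delay: 1s
    retry: 5
    threads: 10
    timeout: 45s
    # ...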

header

Add headers to HTTP requests.

header:
  Cache-Control: no-cache
  User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36

handle

handle allows you to set actions for specific HTTP response codes, timeouts etc. It is defined as a key-value map in the following format:

handle:
  <event>: <handler-definition>

For example you could switch the current VPN whenever you get a 429 (Too Many Requests) response:

handle:
  429:
    action: switch_vpn

Events

The event key can be any HTTP response code (i.e. a number between 200 and 599), timeout or before.

Event Description
200-599 Perform action on response code. See full list of response codes.
timeout Perform action if request times out.
before Perform action before request.
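
For example, a handler that retries whenever a request times out (a minimal sketch combining the documented timeout event and retry action):

handle:
  timeout:
    action: retry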

handle:action

The handler action defines what action to perform on the specified event.

Action Description
skip Skip the current page.
sleep Sleep for specified amount of time.
switch_vpn Switch to a different VPN server.
retry Retry the current request.
shell Run a custom shell script.

handle:before

The before event runs its action before each request and supports conditions, so you can limit the action to specific URLs.

Operator Description
if Perform action if condition is true.
unless Perform action if condition is false.

Condition Description
has_prefix true if the URL starts with prefix.
has_suffix true if the URL ends with suffix.
contains true if the URL contains contains.

handle:
  before:
    if: has_prefix
    prefix: https://example.com/blog
    action: skip

handle:
  before:
    unless: contains
    contains: /blog
    action: skip

handle:action:skip

handle:
  403:
    action: skip

handle:action:sleep

handle:
  403:
    action: sleep
    duration: 10s

handle:action:switch_vpn

handle:
  403:
    action: switch_vpn

handle:action:retry

handle:
  403:
    action: retry

handle:action:shell

handle:
  403:
    action: shell
    script: /path/to/script

vpn

Setting vpn enables the VPN service.

vpn:
  provider: hma # currently only HMA is supported
  username: your_username
  password: your_password

Export Parameters

Local File

Store incoming data in a local file.

Parameter Description
engine (required) Must be file.
format (required) The file format to use. Can be either json, ndjson or csv.
path (required) The relative or absolute path. If the file already exists it will be overwritten.

Formats

Parameter Description
ndjson (recommended) Stores the result in a newline delimited JSON file. See http://jsonlines.org/.
json Stores the result in a JSON array.
csv Stores the result in a CSV file.

Example

tasks:
  - name: example_export
    task: export
    engine: file
    format: json
    path: /path/to/my/data.json

MySQL

Export incoming data to a MySQL database.

Parameter Description
engine (required) Must be mysql.
address (required) The network address of the MySQL server e.g. db.example.com:3306.
database (required) The name of the MySQL database.
table (required) The name of the MySQL table.
username The username to authenticate to MySQL.
password The password to authenticate to MySQL. Requires username.

Example

tasks:
  - name: example_export
    task: export
    engine: mysql
    address: localhost:3306
    database: mydatabase
    table: mytable

Google BigQuery

Export incoming data to Google BigQuery.

Important: You will need to set up a service account for Google Cloud. See the Google instructions on how to create a service account and set up the GOOGLE_APPLICATION_CREDENTIALS environment variable.

Parameter Description
engine (required) Must be bigquery.
project (required) The name of the Google Cloud project.
dataset (required) The name of the BigQuery dataset.
table (required) The name of the BigQuery table.
credentials The path to a local service account file.
replace Delete existing data in target table. Defaults to false.
max_errors How many errors to allow before aborting. Defaults to 5.

Example

tasks:
  - name: example_export
    task: export
    engine: bigquery
    project: mycoolproject
    dataset: mydataset
    table: mytable
    credentials: '{{ userDir }}/Documents/mycoolproject-4b7b0af746ef.json'

S3

Export incoming data to a newline delimited JSON file & upload to an S3-compatible file storage service.

Parameter Description
engine (required) Must be s3.
bucket (required) Bucket name to store file in. Creates the bucket if it doesn’t exist.
access_key_id (required) Access key ID to authenticate with.
secret_key (required) Secret key to authenticate with.
endpoint Defaults to s3.amazonaws.com.
use_ssl Defaults to true.
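
A minimal sketch of an S3 export task (the bucket name and credentials are placeholders):

tasks:
  - name: example_export
    task: export
    engine: s3
    bucket: my-bucket # placeholder
    access_key_id: your_access_key_id # placeholder
    secret_key: your_secret_key # placeholder
    endpoint: s3.amazonaws.com # optional, this is the default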

Google Cloud Storage

To upload the data to GCS, set endpoint to storage.googleapis.com and use HMAC keys for authentication.
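
For example (a minimal sketch; the bucket name and HMAC key values are placeholders):

tasks:
  - name: example_export
    task: export
    engine: s3
    bucket: my-bucket # placeholder
    endpoint: storage.googleapis.com
    access_key_id: your_hmac_access_id # placeholder
    secret_key: your_hmac_secret # placeholder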