Download & transform multiple XML files to a single CSV

In this tutorial you will learn how to extract data from the UK Food Standards Agency and transform 403 XML files into a single CSV file.

You will create two scrape tasks, one using the xpath engine and one using the xml engine, as well as an export task using the file engine.

Before you begin

If you do not have Optimus Mine installed already, follow the installation guide to download & install Optimus Mine on your machine.

Step one: Create a new project

  1. Launch the Optimus Mine application.
  2. On the top right hand side, click New Project.
  3. Enter a name for your new project in the Title field.

Step two: Configure your pipeline

Next you create the pipeline configuration. The following YAML document defines the pipeline you will use to download the XML files and create the CSV file. Copy & paste it into the editor below the title field and hit Save.

defaults:
  threads: 10
  timeout: 5m
  verbose: true

tasks:
  - name: scrape_urls
    task: scrape
    engine: xpath
    input: https://ratings.food.gov.uk/open-data/en-gb
    schema:
      type: array
      source: //table[@class="open-data-links"]//a/@href
      items:
        type: string
        source: .

  - name: scrape_data
    task: scrape
    engine: xml
    schema:
      type: array
      source: //EstablishmentDetail
      items:
        type: object
        properties:
          name:
            type: string
            source: ./BusinessName
          postcode:
            type: string
            source: ./PostCode
          type:
            type: string
            source: ./BusinessType
          rating:
            type: string
            source: ./RatingValue

  - name: export_data
    task: export
    engine: file
    format: csv
    path: '{{ userDir }}/Desktop/fsa-data.csv'

Configuration details

Let’s walk through each of the properties in your configuration to understand what they do.

defaults:

This pipeline has three tasks: two scrapes and an export. Defaults apply to all tasks in your project.

threads: 10

The threads property determines how many HTTP requests are made concurrently.

timeout: 5m

If an HTTP request has not completed within 5 minutes, the scraper considers it failed.

verbose: true

Enables detailed logging which can help you identify problems if your scraper is not working as intended. You would usually set this to false once your scraper is working.

tasks:

Contains the array of all your pipeline tasks.

The scrape_urls task

name: scrape_urls

This sets the name of the task to “scrape_urls”. The name will be displayed in the log messages relating to this task.

task: scrape

This tells Optimus Mine that this task is a scraper, meaning that input values will be treated as URLs and requested via HTTP.

engine: xpath

Tells the scraper to parse the data as HTML, using XPath to extract values. We use xpath to begin with because we are scraping the download links from an HTML webpage.

input: https://ratings.food.gov.uk/open-data/en-gb

The input passes data to your task. In this example it tells the scraper to scrape the URL https://ratings.food.gov.uk/open-data/en-gb.

schema:

The schema contains the information on how to parse & structure the scraped data.

type: array

Setting the root schema type to array tells the task that there are multiple items on each page.

source: //table[@class="open-data-links"]//a/@href

Since engine is set to xpath, the source parameter contains the XPath expression to the data you extract. As type is set to array it expects an XPath that selects multiple nodes in the XML document. This will extract the href attribute of each a tag within a table with the open-data-links class.
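For intuition, here is a rough sketch of what this step does, written in Python using only the standard library. The HTML snippet and the file URLs in it are made up for illustration; the real page is fetched and parsed by Optimus Mine itself.

```python
# Illustrative sketch of the scrape_urls task: select the href attribute
# of every <a> inside a table with the "open-data-links" class.
# (Sample markup only; the real FSA page is larger.)
import xml.etree.ElementTree as ET

html = """
<html><body>
  <table class="open-data-links">
    <tr><td><a href="https://example.invalid/region-1.xml">Region 1</a></td></tr>
    <tr><td><a href="https://example.invalid/region-2.xml">Region 2</a></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(html)
urls = [
    a.get("href")                          # .../@href
    for table in root.iter("table")        # //table
    if table.get("class") == "open-data-links"
    for a in table.iter("a")               # //a anywhere inside the table
]
print(urls)
```

Each extracted URL then becomes one input value for the next task in the pipeline.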

items:

Since the schema type is set to array we need to define the schema of each element within the array.

type: string

This tells the parser to treat the scraped data as a string of text.

source: .

As above, the source property contains the XPath to the data we want to extract. A single dot selects the current node itself (i.e. each href attribute extracted above).

The scrape_data task

name: scrape_data

This sets the name of the task to “scrape_data”. The name will be displayed in the log messages relating to this task.

task: scrape

This tells Optimus Mine that this task is a scraper, meaning that input values will be treated as URLs and requested via HTTP. In this case we haven’t set input, meaning that the data extracted by the previous task will be passed to it as input.

engine: xml

Tells the scraper to parse the data as XML, using XPath expressions in source to extract the data.

schema:

As above, the schema contains the information on how to parse & structure the scraped data.

type: array

Again, setting the root schema type to array tells the task that there are multiple items on each page.

source: //EstablishmentDetail

Just like above the source property contains an XPath. As type is set to array it expects an XPath that selects multiple nodes in the XML document. In the XML documents of the FSA each item is wrapped in an EstablishmentDetail tag, so this will extract each EstablishmentDetail node within the XML document.

type: object

Setting type to object tells the parser that each item in the array is a key-value map.

properties:

Contains the schema for each key-value pair in the object. The keys nested directly under properties (i.e. name, postcode, type & rating) define the keys in the output data.

type: string

This tells the parser to treat the scraped data as a string of text.

source: ./BusinessName, source: ./PostCode, source: ./BusinessType, source: ./RatingValue

As with source: //EstablishmentDetail above, these contain the XPath expression to select the data to extract. Since they start with a . (dot) the parser will only look within the parent node (i.e. //EstablishmentDetail).
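The mapping from each EstablishmentDetail node to an object can be sketched in Python with the standard library. The XML fragment below is a made-up sample in the shape of the FSA files, not real data:

```python
# Illustrative sketch of the scrape_data task: for every
# //EstablishmentDetail node, build an object from its child elements.
import xml.etree.ElementTree as ET

xml_data = """
<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <BusinessName>Royal Teas</BusinessName>
      <PostCode>SE10 8RT</PostCode>
      <BusinessType>Restaurant/Cafe/Canteen</BusinessType>
      <RatingValue>5</RatingValue>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>
"""

root = ET.fromstring(xml_data)
rows = [
    {
        "name": node.findtext("BusinessName"),    # ./BusinessName
        "postcode": node.findtext("PostCode"),    # ./PostCode
        "type": node.findtext("BusinessType"),    # ./BusinessType
        "rating": node.findtext("RatingValue"),   # ./RatingValue
    }
    for node in root.iter("EstablishmentDetail")  # //EstablishmentDetail
]
print(rows)
```

Because the field paths start with a dot, each lookup stays inside its own EstablishmentDetail node, so values from different establishments never get mixed up.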

The export task

name: export_data

This sets the name of the task to “export_data”. The name will be displayed in the log messages relating to this task.

task: export

This tells Optimus Mine that this task is an export, meaning that incoming values from previous tasks will be exported.

engine: file

The file engine is used for storing the exported data on your local machine.

format: csv

This sets the format of the exported data to CSV.

path: '{{ userDir }}/Desktop/fsa-data.csv'

Used to specify the path of the resulting file on your local machine. The {{ userDir }} value points to C:\Users\YOUR_USERNAME on Windows and /Users/YOUR_USERNAME on Mac. In this case we’re saving the data in a file called fsa-data.csv on your desktop.
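The export step itself is straightforward: write one CSV row per object. A minimal Python sketch of the same idea (writing to the current directory rather than the desktop, so it is self-contained; the sample row is taken from the output shown below):

```python
# Illustrative sketch of the export task: write the extracted objects
# to a CSV file with one column per schema property.
import csv

rows = [
    {"name": "Royal Teas", "postcode": "SE10 8RT",
     "rating": "5", "type": "Restaurant/Cafe/Canteen"},
]

# The pipeline expands {{ userDir }} to your home directory; for this
# demo we simply write to the working directory instead.
with open("fsa-data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "postcode", "rating", "type"])
    writer.writeheader()
    writer.writerows(rows)
```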

Step three: Run the pipeline

You’re all set. Simply click on Start Job to run the scraper.

Once completed you should see a new file called fsa-data.csv on your desktop with the following kind of data:

name,postcode,rating,type
Rob's,SR1 3JP,5,Manufacturers/packers
Royal Teas,SE10 8RT,5,Restaurant/Cafe/Canteen
Fairfield Bowling Club,G51 4XA,Improvement Required,Restaurant/Cafe/Canteen
Bradbury Milk,NP26 5JH,4,Distributors/Transporters
Margaret Well- Furby,RG12 8UX,4,Hospitals/Childcare/Caring Premises
...

What’s next

  • To learn more about the schema used for parsing data, see the Parsing Data page.
  • To learn more about the filters used for manipulating data, see the Filters page.
  • To dive into detail about all the other configuration properties, see the Configuration Reference.