Download & transform multiple XML files to a single CSV
In this tutorial you will learn how to extract data from the UK Food Standards Agency and transform 403 XML files into a single CSV file.
You will use two scrape tasks, using the xpath & xml engines, as well as an export task using the file engine.
Before you begin
If you do not have Optimus Mine installed already, follow the installation guide to download & install Optimus Mine on your machine.
Step one: Create a new project
- Launch the Optimus Mine application.
- On the top right-hand side, click New Project.
- Enter a name for your new project in the Title field.
Step two: Configure your pipeline
Next you create the pipeline configuration. The following YAML document is used to create the pipeline you will use to scrape the XML files and create the CSV file. Copy & paste it into the editor below the title field and hit Save.
defaults:
  threads: 10
  timeout: 5m
  verbose: true

tasks:
- name: scrape_urls
  task: scrape
  engine: xpath
  input: https://ratings.food.gov.uk/open-data/en-gb
  schema:
    type: array
    source: //table[@class="open-data-links"]//a/@href
    items:
      type: string
      source: .

- name: scrape_data
  task: scrape
  engine: xml
  schema:
    type: array
    source: //EstablishmentDetail
    items:
      type: object
      properties:
        name:
          type: string
          source: ./BusinessName
        postcode:
          type: string
          source: ./PostCode
        type:
          type: string
          source: ./BusinessType
        rating:
          type: string
          source: ./RatingValue

- name: export_data
  task: export
  engine: file
  format: csv
  path: '{{ userDir }}/Desktop/fsa-data.csv'
Configuration details
Let’s walk through each of the properties in your configuration to understand what they do.
defaults:
In this scraper we have three tasks: two scrapes and an export. Defaults apply to all tasks in your project.
threads: 10
The threads setting determines how many HTTP requests you would like to make concurrently.
timeout: 5m
If an HTTP request has not completed within 5 minutes the scraper will consider it failed.
verbose: true
Enables detailed logging, which can help you identify problems if your scraper is not working as intended. You would usually set this to false once your scraper is working.
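
To make these defaults concrete, here is a rough standalone sketch in Python of what threads: 10 combined with a 5-minute timeout amounts to. This is not Optimus Mine's internals; the requests library, the fetch helper and the one-element URL list are assumptions for illustration only.

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # timeout=300 seconds mirrors timeout: 5m (here applied per request)
    return requests.get(url, timeout=300)

urls = ["https://ratings.food.gov.uk/open-data/en-gb"]  # illustrative input

# max_workers=10 mirrors threads: 10 -- at most ten requests in flight at once
with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(fetch, urls))

print([r.status_code for r in responses])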
tasks:
Contains the array of all your pipeline tasks.
The scrape_urls task
name: scrape_urls
This sets the name of the task to “scrape_urls”. The name will be displayed in the log messages relating to this task.
task: scrape
This tells Optimus Mine that this task is a scraper, meaning that input values will be treated as URLs and requested via HTTP.
engine: xpath
Tells the scraper to parse the data as HTML, using XPath to extract values. We use xpath to begin with because we are scraping the download links from an HTML webpage.
input: https://ratings.food.gov.uk/open-data/en-gb
The input passes data to your task. In this example it tells the scraper to scrape the URL https://ratings.food.gov.uk/open-data/en-gb
schema:
The schema contains the information on how to parse & structure the scraped data.
type: array
Setting the root schema type to array tells the task that there are multiple items on each page.
source: //table[@class="open-data-links"]//a/@href
Since engine is set to xpath, the source parameter contains the XPath expression to the data you extract. As type is set to array it expects an XPath that selects multiple nodes in the HTML document. This will extract the href attribute of each a tag within a table with the open-data-links class.
items:
Since the schema type is set to array we need to define the schema of each element within the array.
type: string
This tells the parser to treat the scraped data as a string of text.
source: .
As above, the source property contains the XPath to the data we want to extract. A dot means get the current node of the parent (i.e. the href attribute we extracted above).
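
If you want to sanity-check these two XPath expressions outside Optimus Mine, the following Python sketch does the equivalent extraction with requests and lxml (both assumed installed; the snippet is an illustration, not part of the pipeline itself).

import requests
from lxml import html

page = requests.get("https://ratings.food.gov.uk/open-data/en-gb", timeout=30)
tree = html.fromstring(page.content)

# The root schema's source: the href attribute of every a tag inside the
# table with the open-data-links class.
urls = tree.xpath('//table[@class="open-data-links"]//a/@href')

# The items' source of "." simply keeps each matched value, i.e. each href.
for url in urls:
    print(url)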
The scrape_data task
name: scrape_data
This sets the name of the task to “scrape_data”. The name will be displayed in the log messages relating to this task.
task: scrape
This tells Optimus Mine that this task is a scraper, meaning that input values will be treated as URLs and requested via HTTP. In this case we haven't set input, meaning that the data extracted by the previous task will be passed to this task as input.
engine: xml
Tells the scraper to parse the data as XML, using XPath expressions in source to extract the data.
schema:
As above, the schema contains the information on how to parse & structure the scraped data.
type: array
Again, setting the root schema type to array tells the task that there are multiple items on each page.
source: //EstablishmentDetail
Just like above, the source property contains an XPath. As type is set to array it expects an XPath that selects multiple nodes in the XML document. In the FSA's XML documents each item is wrapped in an EstablishmentDetail tag, so this will extract each EstablishmentDetail node within the XML document.
type: object
Setting type to object tells the parser that each item in the array is a key-value map.
properties:
Contains the schema for each key-value pair in the object. The keys nested directly under properties (i.e. name, postcode, type & rating) define the keys in the outputted data.
type: string
This tells the parser to treat the scraped data as a string of text.
source: ./BusinessName, source: ./PostCode, source: ./BusinessType, source: ./RatingValue
As with source: //EstablishmentDetail above, these contain the XPath expressions to select the data to extract. Since they start with a . (dot) the parser will only look within the parent node (i.e. //EstablishmentDetail).
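
To see how the absolute and relative XPaths fit together, here is a small Python sketch using lxml against a made-up XML fragment. Only EstablishmentDetail and its four child tags come from the schema above; the Establishments wrapper element and the sample values are illustrative assumptions.

from lxml import etree

sample = b"""
<Establishments>
  <EstablishmentDetail>
    <BusinessName>Royal Teas</BusinessName>
    <BusinessType>Restaurant/Cafe/Canteen</BusinessType>
    <RatingValue>5</RatingValue>
    <PostCode>SE10 8RT</PostCode>
  </EstablishmentDetail>
</Establishments>
"""

root = etree.fromstring(sample)
for detail in root.xpath("//EstablishmentDetail"):  # the root schema's source
    print({
        "name": detail.findtext("./BusinessName"),  # dot paths resolve inside
        "postcode": detail.findtext("./PostCode"),  # the parent node only
        "type": detail.findtext("./BusinessType"),
        "rating": detail.findtext("./RatingValue"),
    })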
The export task
name: export_data
This sets the name of the task to “export_data”. The name will be displayed in the log messages relating to this task.
task: export
This tells Optimus Mine that this task is an export, meaning that incoming values from previous tasks will be exported.
engine: file
The file engine is used for storing the exported data on your local machine.
format: csv
This sets the format of the exported data to CSV.
path: '{{ userDir }}/Desktop/fsa-data.csv'
Used to specify the path of the resulting file on your local machine. The {{ userDir }} value points to C:\Users\YOUR_USERNAME on Windows and /Users/YOUR_USERNAME on Mac. In this case we're saving the data in a file called fsa-data.csv on your desktop.
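
As a rough picture of what the file engine's CSV export boils down to, here is a hedged Python sketch writing one row (taken from the sample output below) to the same location. os.path.expanduser("~") stands in for {{ userDir }}, and the sketch assumes a Desktop folder exists in your home directory.

import csv
import os

# One row shaped like the scrape_data output
rows = [{"name": "Royal Teas", "postcode": "SE10 8RT",
         "rating": "5", "type": "Restaurant/Cafe/Canteen"}]

out_path = os.path.join(os.path.expanduser("~"), "Desktop", "fsa-data.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "postcode", "rating", "type"])
    writer.writeheader()
    writer.writerows(rows)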
Step three: Run the pipeline
You’re all set. Simply click on Start Job to run the scraper.
Once completed you should see a new file called fsa-data.csv on your desktop with the following kind of data:
name,postcode,rating,type
Rob's,SR1 3JP,5,Manufacturers/packers
Royal Teas,SE10 8RT,5,Restaurant/Cafe/Canteen
Fairfield Bowling Club,G51 4XA,Improvement Required,Restaurant/Cafe/Canteen
Bradbury Milk,NP26 5JH,4,Distributors/Transporters
Margaret Well- Furby,RG12 8UX,4,Hospitals/Childcare/Caring Premises
...
What's next
- To learn more about the schema used for parsing data, see the Parsing Data page.
- To learn more about the filters used for manipulating data, see the Filters page.
- To dive into detail about all the other configuration properties, see the Configuration Reference.