SMG Insight delivers B2B and B2C research and consulting solutions for the world's leading sports governing bodies, sponsors and sports investors.
Skyron developed a solution for SMG Insight which gathers sports programming TV scheduling information globally, allowing SMG Insight to know which sports are being covered, on which channel, when, and for how long each day. Coupled to this data-capture functionality is a powerful search application which allows the data to be queried in multiple views, audit reports to be generated and supporting screenshots viewed.
In terms of the development journey, the overriding consideration was to minimise the load we placed on the servers of the sites we were scraping. Part of that consideration is, of course, self-serving, i.e. not getting blacklisted, but part of it is about doing the job well.
Modern browsers are so capable at handling client-side code that, as part of delivering a good user experience, many websites make extensive use of client-side code and AJAX callbacks to load data dynamically, on the fly.
Unfortunately for us, this blocks traditional scraping techniques, which download the raw HTML content but are incapable of executing client-side code after the download.
Add to that the use of static URLs. Historically, each page on a website had a unique URL. Today, however, the URL often remains static even though the user has clicked a menu and moved to another page.
Many of the websites that we scrape contain data only accessible after clicking on various menus, scrolling, clicking on arrows and generally behaving like a human being. In effect, we had to replicate programmatically a human interacting with a modern browser.
This repeatable set of interactions had to be executed by an automated service at a user-definable frequency, since some sites show more data and some less, depending on their time window.
We also converted scraped text from the local language to English using Google Translate.
In terms of setting it up: a semi-technical operator views the site in Google Chrome and, using the Developer Tools, produces XPath queries to obtain the relevant data. These queries are entered into our custom-built application.
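To illustrate the idea, here is a minimal sketch of XPath-driven extraction in Python. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset; the sample markup, field names and queries are all invented for illustration and are not the actual queries our operators produce.

```python
# Hypothetical XPath-based extraction of TV schedule rows.
# The HTML snippet and class names below are invented for illustration.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <table class="schedule">
    <tr><td class="sport">Football</td><td class="channel">Sky Sports 1</td>
        <td class="start">19:00</td><td class="duration">120</td></tr>
    <tr><td class="sport">Tennis</td><td class="channel">Eurosport</td>
        <td class="start">14:00</td><td class="duration">90</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(html)

def scrape(root, row_xpath, field_xpaths):
    """Apply one row-level XPath, then one field-level XPath per column."""
    rows = []
    for tr in root.findall(row_xpath):
        rows.append({name: tr.find(xp).text for name, xp in field_xpaths.items()})
    return rows

listings = scrape(
    root,
    ".//table[@class='schedule']/tr",
    {"sport": "td[@class='sport']",
     "channel": "td[@class='channel']",
     "start": "td[@class='start']",
     "duration": "td[@class='duration']"},
)
```

In practice a scrape is just configuration: one row-level query plus a set of field-level queries, which is what makes it possible for an operator to set up a new site without writing code.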
The application allows a user to configure user interactions. We did this by building a command-driven mechanism, allowing a page to have as many interactions as required. The commands are fairly simple, e.g. wait for 15 seconds; press the button with class name “classname”; wait 15 seconds; press the button with ID “Id”.
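A command-driven mechanism like this can be sketched as a tiny parser that turns an operator's script into a list of actions. The command vocabulary and syntax below are assumptions for illustration; a real implementation would dispatch the parsed actions to a browser-automation layer rather than just returning them.

```python
# Hypothetical parser for operator-written interaction scripts.
# Command names ("wait", "click", "scroll") are invented for illustration.
from typing import List, Tuple

def parse_commands(script: str) -> List[Tuple]:
    """Parse one interaction command per line into (action, *args) tuples."""
    commands = []
    for line in script.strip().splitlines():
        parts = line.strip().split()
        action = parts[0].lower()
        if action == "wait":                    # e.g. "wait 15" (seconds)
            commands.append(("wait", int(parts[1])))
        elif action == "click":                 # e.g. "click class classname"
            commands.append(("click", parts[1], parts[2]))
        elif action == "scroll":                # e.g. "scroll 500" (pixels)
            commands.append(("scroll", int(parts[1])))
        else:
            raise ValueError(f"unknown command: {line!r}")
    return commands

script = """
wait 15
click class classname
wait 15
click id Id
"""
plan = parse_commands(script)
```

Keeping the commands this simple is what lets a semi-technical operator, rather than a developer, define the interactions a page needs.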
The operator then clicks a test button that displays a browser view (IE11-equivalent capabilities, with client-side executable code being run), shows the user interactions being performed in real time, and scrapes the XPath-defined data from the site.
Once the operator is happy the data is correct, they save the new scrape to the system. That scrape is then run programmatically at set intervals by a server-side scrape engine.
The engine was built with concurrency in mind, meaning multiple scrape engines can run simultaneously. Each scrape communicates with the database server via a secure RESTful Web API service layer. The scrape engine attempts each scrape, as defined in the tool, as and when it is deemed necessary (determined by the scrape frequency the operator configured).
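The frequency-driven scheduling described above can be sketched as a simple "due" check: a scrape runs when its operator-configured interval has elapsed since its last run. The Scrape record and field names here are assumptions for illustration; the real engine reads its configuration via the RESTful Web API service layer rather than from in-memory objects.

```python
# Hypothetical due-scrape check for a frequency-based scheduler.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class Scrape:
    name: str
    frequency: timedelta                 # operator-configured interval
    last_run: Optional[datetime] = None  # None => never run yet

def due_scrapes(scrapes: List[Scrape], now: datetime) -> List[Scrape]:
    """Return every scrape whose frequency window has elapsed."""
    return [s for s in scrapes
            if s.last_run is None or now - s.last_run >= s.frequency]

now = datetime(2020, 1, 1, 12, 0)
scrapes = [
    Scrape("uk-listings", timedelta(hours=1), now - timedelta(hours=2)),
    Scrape("de-listings", timedelta(hours=6), now - timedelta(hours=1)),
    Scrape("fr-listings", timedelta(hours=4)),   # never run, so due now
]
pending = [s.name for s in due_scrapes(scrapes, now)]
```

Because the check is stateless apart from each scrape's last-run timestamp, multiple engine instances can poll it concurrently, which fits the concurrency design described above.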
All scrape attempts are logged, with any issues shown on-screen in the web-admin portal. Issues are also sent via e-mail to selected recipients.
Consolidating the data was only the first part of the overall journey. The data needed to be exposed in a consumer-facing website in one instance and a complex reporting portal in another.