
Crawling is about visiting webpages
Scraping turns webpages into data

Scraping and crawling: obviously, one can’t go without the other. We crawl sites to get a broad picture of how a site is structured, how its pages link to each other, and to estimate how much time we need to visit all the pages we are interested in. Scraping is often harder to implement, but it is the essence of data extraction. Think of scraping as covering a website with a sheet of paper that has a few rectangles cut out: we now see only the things we need, completely ignoring the parts of the website that are common to all pages (navigation, footer, ads) as well as extraneous information such as comments or breadcrumbs.

This article aims to illustrate what steps are required in order to find and extract data from websites.

How to crawl a website?

1. Specify start URLs

The crawler needs to know where to begin its job. In most cases we provide the URL of the main page, e.g. http://www.example.com/index.html, and let the crawler decide where to go next. In more complex cases more than one URL may be provided, e.g. if we only want to crawl a few categories or a given set of articles.

2. Specify crawl rules

This is the moment when we wish to say: dear crawler, please go to each category, then go to each article found there. Crawling logic implementations vary from framework to framework, but usually specific <a href="..."> elements are selected using XPath or CSS queries, and the href attribute tells the crawler where to go next. Another technique is to extract all <a href="…"> attributes and filter the URLs using regular expressions or plain if statements. The crawler also needs to skip URLs it has already visited in order to avoid endless crawling.

As mentioned, crawling is relatively simple, and most crawlers get their job done with just these two rules.
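Here is a minimal sketch of both steps using Scrapy, a popular Python crawling framework; the domain, URL patterns and extracted fields are illustrative assumptions, not taken from a real site:

    # Minimal CrawlSpider sketch: start URLs plus two crawl rules.
    # The domain and URL patterns below are illustrative assumptions.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class BlogSpider(CrawlSpider):
        name = "blog"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/index.html"]

        rules = (
            # Follow category links, but do not scrape them.
            Rule(LinkExtractor(restrict_css="nav.categories a")),
            # Follow article links and hand each response to parse_article.
            Rule(LinkExtractor(allow=r"/articles/\d+"), callback="parse_article"),
        )

        def parse_article(self, response):
            # Scrapy filters out already-visited URLs by default,
            # which prevents endless crawling.
            yield {"url": response.url, "title": response.css("#title::text").get()}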

How to scrape a website?

Let’s assume our crawler is visiting only pages we are interested in. Now it’s time to extract some data!

In this example we will use CSS selectors. CSS (Cascading Style Sheets) is the web standard that lets web developers assign style properties (font, border, background) to any HTML element matched by a selector, which is exactly what makes selectors so useful for web scraping as well. Moreover, they are almost the same as jQuery selectors, so experience with either technology can really speed up the “getting started” phase.

More sophisticated selectors can be written using XPath. A nice tutorial to get started can be found here: https://www.liquid-technologies.com/xpath-tutorial.

For example, let’s consider the following HTML representing a blog article.
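A simplified article page could look like this (the markup below is illustrative; only the id and class attributes matter for the selectors that follow):

    <h1 id="title">How to scrape a website</h1>
    <article>
      <p>Article text goes here...</p>
    </article>
    <ul class="tags">
      <li><a href="/tags/scraping">scraping</a></li>
      <li><a href="/tags/crawling">crawling</a></li>
    </ul>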

The title can be selected by id, so the CSS query is #title, and the XPath selector is //h1[@id="title"].

The article text is contained in the <article> element, so the CSS query is article, and //article for XPath.

Tags are represented by <a> elements inside an unordered list <ul> with a class attribute equal to tags, so the CSS query is: ul.tags li a

Simple, isn’t it?
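Applying these selectors from Python could look like the sketch below, which uses the parsel library (the selector engine Scrapy is built on); the HTML string is just the illustrative snippet from above:

    # Extracting the title, article text and tags with parsel.
    from parsel import Selector

    html = """
    <h1 id="title">How to scrape a website</h1>
    <article><p>Article text goes here...</p></article>
    <ul class="tags">
      <li><a href="/tags/scraping">scraping</a></li>
      <li><a href="/tags/crawling">crawling</a></li>
    </ul>
    """

    sel = Selector(text=html)

    title = sel.css("#title::text").get()                           # by id, CSS
    title_via_xpath = sel.xpath('//h1[@id="title"]/text()').get()   # same element, XPath
    article_text = " ".join(sel.css("article ::text").getall()).strip()
    tags = sel.css("ul.tags li a::text").getall()                   # ["scraping", "crawling"]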

Of course, there is a tremendous number of techniques for scraping more sophisticated pages, which may contain:

  • lists of items
  • AJAX popups
  • star ratings
  • infinite scrolls
  • forms
  • images
  • tables
  • pages without characteristic HTML elements, e.g. where no id or class attributes are present

In our experience, CSS queries cover most cases, XPath queries the slightly harder ones, and programmatic solutions the hardest.

Formatting

Raw data extracted using CSS or XPath selectors often contains extraneous information such as whitespace or embedded HTML. Data from several sites must also be normalized to a single format. For example, if we scrape multiple forex sites, we need to keep the currency data consistent across all of them. Another example is e-mail addresses: foo[at]bar[dot]com and FOO@BAR.COM should both be canonicalized to a common form, foo@bar.com.
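A small sketch of such normalization in Python; the whitespace rule and the e-mail obfuscation patterns handled here are just the examples mentioned above:

    import re

    def normalize_whitespace(text: str) -> str:
        """Collapse runs of whitespace and strip leading/trailing spaces."""
        return re.sub(r"\s+", " ", text).strip()

    def canonicalize_email(raw: str) -> str:
        """Turn variants like 'foo[at]bar[dot]com' or 'FOO@BAR.COM'
        into the canonical form 'foo@bar.com'."""
        email = raw.strip().lower()
        email = email.replace("[at]", "@").replace("[dot]", ".")
        return re.sub(r"\s+", "", email)

    print(normalize_whitespace("  EUR / USD \n 1.0842 "))  # EUR / USD 1.0842
    print(canonicalize_email("foo[at]bar[dot]com"))        # foo@bar.com
    print(canonicalize_email("FOO@BAR.COM"))               # foo@bar.com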

Validation

The last step before the data is ready to use is to make sure that all required fields (columns) have been extracted (e.g. every data row must contain a non-empty title and description). Some fields can also be checked against constraints like “the url must begin with https://” or “age must be greater than 0”.

In the Python ecosystem, validation can be implemented using JSON Schema or one of the ORMs (Django ORM, SQLAlchemy).
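For example, a JSON Schema check of a single scraped row could look like this sketch (field names and constraints are illustrative; it requires the jsonschema package):

    from jsonschema import validate, ValidationError

    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string", "minLength": 1},
            "description": {"type": "string", "minLength": 1},
            "url": {"type": "string", "pattern": "^https://"},
            "age": {"type": "integer", "exclusiveMinimum": 0},  # age must be greater than 0
        },
        "required": ["title", "description", "url"],
    }

    row = {"title": "Example", "description": "...", "url": "https://example.com/a/1", "age": 3}

    try:
        validate(instance=row, schema=schema)
    except ValidationError as err:
        print("Invalid row:", err.message)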

Finally, when our data passes all validation checks, it is ready to be stored somewhere for quality assurance or analysis.

Export

Let’s assume we have extracted the data we were looking for and stored it in a file. Common file formats for crawled data are XML, CSV, and JSON. We can use the data right away by implementing a script or application that loads and processes it, but some projects need to store the data in an RDBMS like MySQL, Postgres, MS SQL, etc.

Exporting is about transforming and transferring data. The exporter’s job is to create relationships, set foreign keys, set default values, skip existing rows, set the updated_at field to the current time, and so on.
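A minimal exporter sketch using SQLAlchemy and Postgres; the table layout, connection URL and conflict rule are assumptions made for illustration:

    # Upsert scraped rows into Postgres, setting updated_at on every write.
    from datetime import datetime, timezone

    from sqlalchemy import (Column, DateTime, Integer, MetaData, String,
                            Table, create_engine)
    from sqlalchemy.dialects.postgresql import insert

    engine = create_engine("postgresql://user:password@localhost/scraping")
    metadata = MetaData()

    articles = Table(
        "articles", metadata,
        Column("id", Integer, primary_key=True),
        Column("url", String, unique=True, nullable=False),
        Column("title", String, nullable=False),
        Column("updated_at", DateTime(timezone=True), nullable=False),
    )
    metadata.create_all(engine)

    def export(rows):
        now = datetime.now(timezone.utc)
        with engine.begin() as conn:
            for row in rows:
                stmt = insert(articles).values(
                    url=row["url"], title=row["title"], updated_at=now)
                # Refresh rows that already exist instead of duplicating them.
                stmt = stmt.on_conflict_do_update(
                    index_elements=["url"],
                    set_={"title": stmt.excluded.title, "updated_at": now},
                )
                conn.execute(stmt)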

About Tarantoola

Tarantoola is a team of the best engineers, experienced in web scraping and web crawling. We started in 2009 as co-workers implementing web crawlers for a large file and video search engine, filestube.com (now defunct). Some of us were later hired by Scrapinghub and were involved in one of DARPA’s projects, Memex, helping to set up crawling infrastructure and, of course, writing spiders.

After several years, we reunited to achieve our most important goal: enabling people to use the Internet as structured data.

Are you interested in professional web-scraping services? Contact us!

We offer:

  • Extracting structured data from any website (no matter how complicated)
  • Regular updates
  • Reports
  • REST API
  • Custom exporters
  • Custom anything

Limited offer: we will extract data from a single website for you for free ($0)! Once you have received the initial dataset and are satisfied, we can talk about money. We would really appreciate having your logo and testimonial on our website to help us grow.

And again. Really. Contact us. It’s free and we don’t bite 🙂

 


