
Web Scraping in Ruby with Watir


Introduction – what will we use?

Have you ever needed something from a service that offers no API, so that you had to click through its pages manually? Or have you wanted to automate such a process?

This is where Watir comes in. Watir is an open-source Ruby library built for automated testing, but that is not its only use. We can also use it to build a web scraper that simulates a human clicking through a page to perform an action: log in, post a comment, download some data, and much more.

One of its key features is that it drives a real browser through Selenium WebDriver, which means we can scrape rich, dynamic pages built with JavaScript. In the past I tried to build a web scraper with the Mechanize gem. It is perfect for simple, static pages that don't use much JavaScript or AJAX, but it does not execute JavaScript, so dynamic pages are out of its reach.

Another advantage of Watir is that it can take screenshots. Why is that helpful? Imagine that your application tries to parse a page and fails somewhere. How do you find out what went wrong? With Watir we can take a screenshot and upload it to S3 or save it locally. With Mechanize that would be impossible: it parses HTML with Nokogiri and never renders the page, so there is nothing to capture.
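A minimal sketch of the screenshot-on-failure idea could look like the helper below. The helper name with_failure_screenshot, the tmp/screenshots directory and the file naming are assumptions made for illustration; the only Watir call it relies on is browser.screenshot.save(path). Uploading to S3 is left out.

```ruby
require 'fileutils'

# Runs the given block; if it raises, saves a screenshot first and re-raises.
# `browser` is expected to behave like a Watir::Browser, i.e. respond to
# `screenshot.save(path)`. Directory and file name pattern are assumptions.
def with_failure_screenshot(browser, dir: 'tmp/screenshots')
  yield
rescue StandardError => e
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "failure-#{Time.now.strftime('%Y%m%d-%H%M%S')}.png")
  browser.screenshot.save(path)
  warn "#{e.class}: #{e.message} (screenshot saved to #{path})"
  raise
end

# Usage with a real browser (requires the watir gem and a browser driver):
#   browser = Watir::Browser.new(:chrome)
#   with_failure_screenshot(browser) { browser.goto('https://www.facebook.com/') }
```

The original error is re-raised after the screenshot is saved, so the caller still decides how to handle the failure.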

Two modes

There are two ways to run Watir: in a regular browser, e.g. Chrome or Firefox, or in “headless” mode. What does that mean?

“Headless” mode lets you process a page without a display. On most UNIX systems (for example Ubuntu), this requires Xvfb to be installed on the machine. In this mode, Watir uses PhantomJS to simulate a web browser and load the page. If you want to drive Chrome instead, you need to install chromedriver.

Another great feature is mobile/device emulation. It lets you load a page the way an iPhone, iPad or other mobile device would, which is a good way to check that a page is responsive and scales well.

In this article, I'll show some of Watir's features. I built a simple Ruby gem that lets us sign in, sign up, invite a friend or like a page on Facebook. I'll go through each part of the gem and explain how it works.

The full source code can be found here.

Source Code

Let’s start.

def initialize(email, password)
  @email = email
  @password = password
end

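There is not much to comment on here: we just store the email and password on our class instance. The methods below read them back through email and password readers (for example via attr_reader), which the class is assumed to define.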

def browser
  @_browser ||= Watir::Browser.new(:chrome)
end

The browser method memoizes the Watir instance. Here you can specify which browser to run: chrome, firefox, etc. If you pass phantomjs, it runs in headless mode.
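If you would rather not hard-code the browser, one option is to read it from the environment. The sketch below is only an illustration: the SCRAPER_BROWSER variable, the SUPPORTED_BROWSERS list and the browser_name helper are hypothetical and not part of the gem.

```ruby
# Browsers this sketch accepts; the list and the SCRAPER_BROWSER variable
# name are assumptions made for illustration.
SUPPORTED_BROWSERS = %i[chrome firefox].freeze

# Returns the browser symbol to use, defaulting to :chrome.
# Raises ArgumentError for anything outside SUPPORTED_BROWSERS.
def browser_name(env = ENV)
  name = env.fetch('SCRAPER_BROWSER', 'chrome').strip.downcase.to_sym
  return name if SUPPORTED_BROWSERS.include?(name)

  raise ArgumentError, "Unsupported browser: #{name}"
end

# In the scraper class this could replace the hard-coded symbol:
#   @_browser ||= Watir::Browser.new(browser_name)
```

Failing early on an unknown name gives a clearer error than letting the driver fail to start later.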

def login
  return true if @logged_in

  browser.goto('https://www.facebook.com/')
  form = browser.form(id: 'login_form')

  return false unless form.exist?

  form.text_field(name: 'email').set(email)
  form.text_field(name: 'pass').set(password)
  form.input(value: 'Log In').click

  sleep(2)
  @logged_in = main_page?
end

The login method logs into Facebook with the credentials passed when the instance was initialized.
As you can see, we use goto, which navigates the browser to the URL passed as a parameter.
The form method looks for a form matching the passed parameters; in this case, a form with the id login_form.

One important thing: if you call a method on an element that doesn't exist, your script waits for that element (30 seconds by default) and everything is blocked in the meantime. Before calling any method on an element, it is a good idea to call exist? to check that the element is really there.

The text_field method looks for an input in the selected form matching the passed parameters, and the set method fills that input with the given value.

As you can guess, the click method clicks on an element.

Why do I call sleep to wait for 2 seconds? To give the page time to load all of its elements, JavaScript and other assets.
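A fixed pause either wastes time or turns out too short on a slow connection. As an alternative, here is a rough sketch of a polling helper in plain Ruby; wait_until is a hypothetical name, and the timeout and interval defaults are arbitrary. Watir also ships its own waiting helpers (for example Watir::Wait.until), which may be a better fit in a real scraper.

```ruby
# Polls the block until it returns a truthy value or the timeout expires.
# Returns true on success and false on timeout.
def wait_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep(interval)
  end
end

# In the login method this could replace the fixed two-second pause:
#   @logged_in = wait_until(timeout: 10) { main_page? }
```

The helper returns as soon as the condition holds, so a fast page load no longer costs the full two seconds.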

def main_page?
  browser.element(id: 'userNavigationLabel').exist?
end

The main_page? method checks whether the user navigation element exists. If it does, we have logged in successfully!

def registration_params_valid?(params)
  return false unless params.keys.uniq.sort == REGISTRATION_INPUTS.uniq.sort
  return false if params.values.map(&:blank?).include?(true)
  return false if EMAIL_REGEX.match(params[:email]).nil?

  true
end

The registration_params_valid? method checks that all of the sign-up form's fields have been filled in and that the passed email address is valid.
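The article does not show REGISTRATION_INPUTS or EMAIL_REGEX, so the values below are guesses chosen only to make the check runnable on its own. This sketch also swaps blank? (which comes from ActiveSupport) for a plain-Ruby equivalent.

```ruby
# Hypothetical values: the real constants live in the gem's source.
REGISTRATION_INPUTS = %i[first_name last_name email day month year sex].freeze
EMAIL_REGEX = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

def registration_params_valid?(params)
  # Exactly the expected keys must be present.
  return false unless params.keys.sort == REGISTRATION_INPUTS.sort
  # No value may be nil, empty or whitespace only.
  return false if params.values.any? { |value| value.to_s.strip.empty? }

  EMAIL_REGEX.match?(params[:email].to_s)
end

# Example:
#   registration_params_valid?(first_name: 'Jan', last_name: 'Kowalski',
#                              email: 'jan@example.com', day: '1',
#                              month: 'Jan', year: '1990', sex: 'male')
```

The regular expression is deliberately loose; it only rejects obviously malformed addresses.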

def create_account(**args)
  # Fail with a descriptive error instead of a bare RuntimeError.
  raise ArgumentError, 'Invalid registration params' unless registration_params_valid?(args)

  browser.goto('https://www.facebook.com/')
  form = browser.form(id: 'reg')
  form.text_field(name: 'firstname').set(args[:first_name])
  form.text_field(name: 'lastname').set(args[:last_name])
  form.text_field(name: 'reg_email__').set(email)
  form.text_field(name: 'reg_email_confirmation__').set(email)
  form.text_field(name: 'reg_passwd__').set(password)
  form.select_list(name: 'birthday_day').select(args[:day])
  form.select_list(name: 'birthday_month').select(args[:month])
  form.select_list(name: 'birthday_year').select(args[:year])
  form.radio(name: 'sex', value: sex(args[:sex])).set
  form.button(name: 'websubmit').click
end

The create_account method tries to sign up on Facebook. It first runs registration_params_valid? to check the parameters and raises an error if they are invalid. It then goes to Facebook's main page and fills in the sign-up form.

def sex(value)
  value.downcase.strip == 'male' ? '2' : '1'
end

This method normalizes the parameter and returns the value expected by the radio input in the sign-up form.
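A few calls make the mapping easier to see. The method is copied from above so the snippet runs on its own; note that anything other than "male" falls back to '1'.

```ruby
def sex(value)
  value.downcase.strip == 'male' ? '2' : '1'
end

sex('Male')     # => "2"
sex(' male ')   # => "2" (case and surrounding whitespace are ignored)
sex('female')   # => "1"
sex('unknown')  # => "1" (any other value also maps to "1")
```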

def search(query)
  login unless logged_in

  form = browser.form(action: '/search/web/direct_search.php')
  form.inputs.last.to_subtype.clear
  sleep(0.5)
  form.inputs.last.to_subtype.set(query)
  form.button(type: 'submit').click
end

This method searches for the given query, but first checks whether we're logged in. If not, it logs in and then runs the search. I use sleep here because Watir sometimes clicked too fast, before all the elements had loaded.

def perform(query, options = {})
  login unless logged_in

  search(query)
  browser.link(href: "/search/#{options[:name]}/?q=#{query}&ref=top_filter").click
  button = browser.button(class_name: options[:class_name])
  button.click if button.exist?
end

This method performs an action: inviting a friend, liking a page, etc. We pass it a query and options: the name of the results tab to switch to and the class name of the button to click. Keep in mind that only the first matching button is clicked.
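To make the options concrete, the link that perform looks for can be reproduced with a one-line helper. filter_href is a hypothetical name used only for this illustration; it builds the same string as the interpolation in perform.

```ruby
# Builds the href of the results tab link, as interpolated in `perform`.
def filter_href(tab, query)
  "/search/#{tab}/?q=#{query}&ref=top_filter"
end

filter_href('pages', 'nopio')   # => "/search/pages/?q=nopio&ref=top_filter"
filter_href('people', 'nopio')  # => "/search/people/?q=nopio&ref=top_filter"
```

A query containing spaces or other special characters may need URL encoding to match the link on the page; neither this sketch nor the original method handles that.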

def like_page(name)
  perform(name, name: 'pages', class_name: 'PageLikeButton')
end

The like_page method uses perform: it passes the query, switches to the pages tab and clicks the like button.

def invite_friend(name)
  perform(name, name: 'people', class_name: 'FriendRequestAdd')
end

It works the same way as like_page, but it sends a friend request instead.

Testing

So, those are all the methods. You can download the source and test it yourself. How?

Just clone the gem into a directory, run bundle install, and then:

$ bundle console
> scraper = NopioScraper::Facebook.new('your_email', 'your_password')
> scraper.like_page('nopio')

And that's all! Remember that I didn't cover unexpected cases such as browser popups or alerts. Every browser behaves differently, so it's hard to predict how yours will behave.
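As a starting point for one of those cases, a JavaScript alert can be checked for and closed before continuing. This is only a sketch: dismiss_alert is a hypothetical helper, and it assumes the object passed in behaves like a Watir::Browser, whose alert responds to exists?, text and ok.

```ruby
# Closes a JavaScript alert if one is open.
# Returns the alert text, or nil when no alert was present.
# `browser` is expected to behave like a Watir::Browser.
def dismiss_alert(browser)
  alert = browser.alert
  return nil unless alert.exists?

  text = alert.text
  alert.ok
  text
end

# Possible use inside the scraper, e.g. before clicking a button:
#   message = dismiss_alert(browser)
#   warn "Dismissed alert: #{message}" if message
```

Native browser popups, such as permission prompts, are a different matter and are not covered by this helper.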

As you can see, web scraping and browser automation have few limits. You can write code that does almost anything; it's up to you!

Here you can find the full documentation, with more information and examples:

I hope that you liked this article and that it might be useful to you! Happy web scraping!
