Scrapinghub Platform is the most advanced platform for deploying and running web crawlers.
- Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
- shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.
- A Scrapinghub account
NOTE: make sure you comply with a website's rules (e.g. robots.txt and terms of service) before scraping it.
Let's imagine that we want to get a list of articles from the Codica website. I'll use typhoeus for HTTP requests and nokogiri as an HTML parser.
app/crawler.rb
# require libraries
require 'typhoeus'
require 'nokogiri'
require 'json'

# determine where to write the result:
# ENV.fetch raises KeyError when SHUB_FIFO_PATH is not set,
# so we fall back to STDOUT when running locally
begin
  outfile = File.open(ENV.fetch('SHUB_FIFO_PATH'), mode: 'w')
rescue KeyError
  outfile = STDOUT
end

# fetch and parse the blog page
response = Typhoeus.get('https://www.codica.com/blog/').response_body
doc = Nokogiri::HTML(response)

# select all titles and write each one as a JSON line
doc.css('.post-title').each do |title|
  result = JSON.generate(title: title.text.split.join(' '))
  outfile.write result
  outfile.write "\n"
end
Notes:
...
begin
  outfile = File.open(ENV.fetch('SHUB_FIFO_PATH'), mode: 'w')
rescue KeyError
  outfile = STDOUT
end
...
Here we set up where to write the results. Scrapinghub provides a SHUB_FIFO_PATH environment variable pointing to where items must be written so that they are stored on the platform. When running locally, you can set this variable to a filename to write the results to disk instead.
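For example, to write the items to a local file instead of STDOUT (the items.jl filename here is just an example):
$> SHUB_FIFO_PATH=items.jl ruby app/crawler.rb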
$> ruby app/crawler.rb
#=>
{"title":"How To Start Your Own Online Marketplace"}
{"title":"MVP and Prototype: What’s Best to Validate Your Business Idea?"}
{"title":"5 Key Principles for a User-Friendly Website"}
{"title":"Building a Slack Bot for Internal Time Tracking"}
{"title":"4 Main JavaScript Development Trends in 2019"}
...
The Docker image should be runnable via the start-crawl command without arguments, and start-crawl must be executable. In our project, start-crawl is app/crawler.rb. The second required file is shub-image-info.rb. Let's create it.
app/shub-image-info.rb
require 'json'
puts JSON.generate(project_type: 'other', spiders: ['c-spider'])
exit
Just change the c-spider name to your own.
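Running the script locally should print the metadata Scrapinghub expects:
$> ruby app/shub-image-info.rb
#=>
{"project_type":"other","spiders":["c-spider"]}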
Add #!/usr/bin/env ruby as the first line of both app/shub-image-info.rb and app/crawler.rb, so they can be run directly as scripts (the Dockerfile below also marks them executable with chmod).
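The Dockerfile below runs bundle install, so the project also needs a Gemfile declaring the gems we use. A minimal sketch:
Gemfile
source 'https://rubygems.org'

gem 'typhoeus'
gem 'nokogiri'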
Dockerfile
FROM ruby:2.5.1-stretch
ENV LANG=C.UTF-8
RUN apt-get update
# install the project and its dependencies
COPY . /app
WORKDIR /app
RUN bundle install
# expose our scripts under the command names Scrapinghub looks for
RUN ln -sfT /app/shub-image-info.rb /usr/sbin/shub-image-info && \
    ln -sfT /app/crawler.rb /usr/sbin/start-crawl
RUN chmod +x /app/shub-image-info.rb /app/crawler.rb
CMD /bin/bash
It's a basic Dockerfile. We install the project's dependencies and symlink our scripts to the file names Scrapinghub will look for when it starts a crawl.
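If you want to smoke-test the image locally before deploying it, you can build and run it with plain Docker (the c-spider image tag is just an example):
$> docker build -t c-spider .
$> docker run --rm c-spider start-crawl
Without SHUB_FIFO_PATH set, the crawler falls back to STDOUT, so the scraped items are printed straight to the terminal.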
After you have installed shub and logged in, you need to create a project on Scrapinghub. Copy the project ID and create a scrapinghub.yml file. You can read more about scrapinghub.yml in the shub documentation.
app/scrapinghub.yml
projects:
  c-spider:
    id: YOUR_PROJECT_ID
    image: images.scrapinghub.com/project/YOUR_PROJECT_ID
version: spider-1
apikey: YOUR_API_KEY
And upload your spider.
$> shub image upload c-spider
After the spider is deployed, go to the Scrapinghub dashboard and run it. As a result, you will get something like this.
And now you can access your scraped data with the Items API.
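For example, you can fetch the items of a finished job with curl (the project, spider, and job IDs below are placeholders; check the Items API documentation for the exact endpoint):
$> curl -u YOUR_API_KEY: https://storage.scrapinghub.com/items/PROJECT_ID/SPIDER_ID/JOB_ID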
Copyright © 2015-2019 Codica. It is released under the MIT License.
We love open source software! See our other projects or hire us to design, develop, and grow your product.