morph.io: A scraping platform

  • A Heroku for Scrapers
  • All code and collaboration through GitHub
  • Write your scrapers in Ruby, Python, PHP, Perl or JavaScript (NodeJS, PhantomJS)
  • Simple API to grab data
  • Schedule scrapers or run manually
  • Process isolation via Docker
  • Trivial to move scraper code and data from ScraperWiki Classic
  • Email alerts for broken scrapers

Dependencies

Ruby 2.3.1, Docker, MySQL, SQLite 3, Redis, mitmproxy and Elasticsearch.
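As a quick pre-flight check, a sketch like the one below reports which tools are missing from your PATH. `missing_tools` is a hypothetical helper, not part of this project. The executable names in the example line are assumptions about how the packages are installed on your system (Elasticsearch in particular is often not on the PATH), so adjust them to suit.

```shell
#!/bin/sh
# missing_tools: print the name of every given command that is not on the PATH.
# Returns 0 if everything was found, 1 otherwise.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example (executable names are assumptions; adjust for your system):
#   missing_tools ruby docker mysql sqlite3 redis-server mitmdump
```

This only checks that a command exists. It does not check versions, so you still need to confirm that Ruby is 2.3.1.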

For development on OS X you also need Docker Toolbox. See below for more.

On Linux your user account should be able to manipulate Docker (just add your user to the docker group).
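On most Linux distributions, adding your account to the docker group looks something like the sketch below. `usermod -aG` is the standard tool for this. The `in_group` helper is a hypothetical name introduced here only to check the result. You will need to log out and back in before the new membership takes effect.

```shell
#!/bin/sh
# in_group USER GROUP: succeed if USER is a member of GROUP.
in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# Add the current user to the docker group (needs root, so left commented out):
#   sudo usermod -aG docker "$(id -un)"

# After logging back in, verify the membership:
#   in_group "$(id -un)" docker && echo "ok: member of the docker group"
```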

Repositories

User-facing:

Docker images:

To Install

Using a Ruby environment manager such as rbenv is recommended to manage Ruby versions. The required version is 2.3.1.

Running this on OS X? Read the OS X instructions below BEFORE doing any of this. Some additional issues with OS X system tools (in particular readline and openssl) may arise when trying to install Ruby 2.3.1 using rbenv. The workaround is to tell rbenv to build against the Homebrew versions:

RUBY_CONFIGURE_OPTS="--with-readline-dir=$(brew --prefix readline) --with-openssl-dir=$(brew --prefix openssl)" rbenv install 2.3.1

To install via bundler:

bundle install

If bundle install fails to build the mysql2 gem on OS X, use the following flags to point the compiler at the Homebrew version of openssl, then run bundle install again:

bundle config build.mysql2 --with-ldflags=-L/usr/local/opt/openssl/lib --with-cppflags=-I/usr/local/opt/openssl/include

If bundle install fails on nokogiri, try troubleshooting with these steps.

Configure db

cp config/database.yml.example config/database.yml
cp env-example .env

Edit config/database.yml with your database settings

Create an application on GitHub so that morph.io can talk to GitHub. Fill in the following values

Note the use of 127.0.0.1 rather than localhost. Use this or it won't work.

In the .env file, fill in the Client ID and Client Secret details provided by GitHub for the application you've just created.

Now set up the databases:

bundle exec dotenv rake db:setup

Now you can start the server

bundle exec dotenv foreman start

and point your browser at http://127.0.0.1:3000

To get started, log in with GitHub. There is a simple admin interface accessible at http://127.0.0.1:3000/admin. To access this, run the following to give your account admin rights:

bundle exec rake app:promote_to_admin

Installing Docker on OSX

If you're doing your development on Linux you're in luck because installing Docker is pretty straightforward. Just follow the instructions on the Docker site.

On OS X, install Docker Toolbox. Its installer should also offer to install VirtualBox if you don't already have it.

Then, from the command-line

docker-machine env

Paste the output of that command into the docker section of the .env file. Now the application will know how to contact the docker server running on a VM on your machine.
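If you'd rather not paste by hand, a filter along these lines keeps only the variable assignments from the command's output. This is a sketch: `docker_env_lines` is a hypothetical helper, and it assumes the output is in the usual `export NAME="value"` shell form followed by a couple of comment lines.

```shell
#!/bin/sh
# docker_env_lines: read `docker-machine env` output on stdin and print
# only the NAME="value" assignments, dropping `export ` and comment lines.
docker_env_lines() {
  sed -n 's/^export //p'
}

# Example (appends to .env, so check the file for duplicates afterwards):
#   docker-machine env | docker_env_lines >> .env
```

Appending is the simplest option, but it puts the variables at the end of the file rather than in the docker section. Move them, or remove any placeholder entries already there, if that matters to you.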

Running tests

If you're running guard (see the Guard Livereload section below), the tests will also run automatically when you change a file.

By default, RSpec will skip tests that have been tagged as being slow. To change this behaviour, add the following to your .env:

RUN_SLOW_TESTS=1

By default, RSpec will run certain tests against a running Docker server. These tests are quite slow, but have not been tagged as slow. To stop RSpec from running these tests, add the following to your .env:

DONT_RUN_DOCKER_TESTS=1

Guard Livereload

We use Guard and Livereload so that whenever you edit a view in development the web page gets automatically reloaded. It's a massive time saver when you're doing design or lots of work in the view. To make it work, run:

bundle exec guard

Guard will also run tests when needed. Some of these are integration tests that run against a running Docker server, and they are very slow. To disable them, run:

DONT_RUN_DOCKER_TESTS=1 bundle exec guard

Mail in development

By default in development, mail is sent to Mailcatcher. To install it:

gem install mailcatcher

Deploying to production

This section will not be relevant to most people. It only applies if you're deploying to a production server.

git-encrypt

We're using git-encrypt to encrypt certain files, like the private key for the SSL certificate. To make this work you have to do some special things before you clone the morph repository.

Production devops development

Install Vagrant and Ansible.

Install the hostsupdater plugin: vagrant plugin install vagrant-hostsupdater

Run vagrant up local. This will build and provision a box that looks and acts like production at dev.morph.io.

Once the box is created and provisioned, deploy the application to your Vagrant box:

cap local deploy

Now visit https://dev.morph.io/

Production provisioning and deployment

To deploy morph.io to production, normally you'll just want to deploy using Capistrano:

cap production deploy

When you've changed the Ansible playbooks to modify the infrastructure you'll want to run:

ansible-playbook --user=root --inventory-file=provisioning/hosts provisioning/playbook.yml

How to contribute

If you find what looks like a bug:

  • Check the GitHub issue tracker to see if anyone else has reported the issue.
  • If you don't see anything, create an issue with information on how to reproduce it.

If you want to contribute an enhancement or a fix:

  • Fork the project on GitHub.
  • Make your changes with tests.
  • Commit the changes without making changes to any files that aren't related to your enhancement or fix.
  • Send a pull request.

We maintain a list of issues that are easy fixes. Fixing one of these is a great way to get started while you get familiar with the codebase.

Copyright & License

Copyright OpenAustralia Foundation Limited. Licensed under the Affero GPL. See the LICENSE file for more details.