Get ready to query Google like a pro and make awesome Google searches with PHP.
The library is still being improved to parse pages better. Feel free to send feedback or feature requests through the issue tracker.
- Google SERP URL generation
- Natural results parsing
- AdWords results parsing
- Proxy usage
PLEASE READ ALL THE FOLLOWING SECTIONS BEFORE USING IT. They contain important information about usage.
...that scraping Google is forbidden (what an irony for the biggest scraper ever written)... But who cares?
Google does. And it will stop you with a captcha if you submit too many requests in a short time.
I usually delay each query by 30 seconds, but if you send a lot of requests even that is too short.
How to counter this:
- Optimize your delays (this package contains a query delayer that does it for you with different delays, but I'm still trying to figure out the best timing to use).
- If you want to send a very high number of requests in a short time, you will have to use one or more proxies. I'm looking for the best way to implement this in the library.
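As an illustration of the first point, here is a minimal sketch of a randomised pause between queries. It is not the delayer shipped with this package: `nextDelaySeconds()` and `runWithDelays()` are hypothetical names, and the 30 to 45 second range is only an assumption based on the 30 seconds mentioned above.

```php
<?php
// Hypothetical helper, not part of the library: pick a randomised pause
// so that consecutive queries are not evenly spaced.
function nextDelaySeconds(int $base = 30, int $jitter = 15): int
{
    // random_int() is inclusive on both ends: the result is in [$base, $base + $jitter].
    return $base + random_int(0, $jitter);
}

// Hypothetical helper: call $search for each keyword, pausing between calls.
// $sleep is injectable so the loop can be exercised without really waiting.
function runWithDelays(array $keywords, callable $search, ?callable $sleep = null): array
{
    $sleep = $sleep ?? 'sleep';
    $keywords = array_values($keywords);
    $last = count($keywords) - 1;
    $results = [];
    foreach ($keywords as $i => $keyword) {
        $results[$keyword] = $search($keyword);
        // No pause is needed after the last keyword.
        if ($i < $last) {
            $sleep(nextDelaySeconds());
        }
    }
    return $results;
}
```

With the library, `$search` could be a closure that calls `$googleUrl->search($keyword)`. Whether these delays are long enough still has to be tested.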
The library is available on Packagist: "sneakybobito/google-url": "dev-master"
If you are not familiar with Packagist, you can also use the loader packaged in the repo. To do so, download the library (e.g. as a zip from GitHub) and just include the file named `autoload.php`:
<?php
include("path/to/googleurl/autoload.php");
$g = new \GoogleUrl();
// ......
When you use this library, keep in mind that querying Google is something you have to control.
You can't run a query every time someone loads a page on your web server. A query may take a long time, which means a long page load. You also have to control the number of queries you send over time, or else Google will consider you a bot and block you with a captcha.
Instead, you can use it in a CLI program that stores the results in a database, and then query that database from the web page script.
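The CLI-plus-database approach could look roughly like the sketch below. It only uses PDO with SQLite. The `serp_results` table and the `openStore()`, `storeResults()` and `loadResults()` helpers are assumptions made for this example, not something the library provides.

```php
<?php
// Hypothetical schema and helpers; only standard PDO calls are used.
function openStore(string $dsn = 'sqlite::memory:'): PDO
{
    $pdo = new PDO($dsn);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec(
        'CREATE TABLE IF NOT EXISTS serp_results (
            keyword TEXT NOT NULL,
            position INTEGER NOT NULL,
            title TEXT,
            url TEXT,
            fetched_at INTEGER NOT NULL
        )'
    );
    return $pdo;
}

// Called from the CLI job: replace the stored rows for one keyword.
function storeResults(PDO $pdo, string $keyword, array $rows): void
{
    $pdo->beginTransaction();
    $pdo->prepare('DELETE FROM serp_results WHERE keyword = ?')->execute([$keyword]);
    $insert = $pdo->prepare(
        'INSERT INTO serp_results (keyword, position, title, url, fetched_at)
         VALUES (?, ?, ?, ?, ?)'
    );
    foreach ($rows as $row) {
        $insert->execute([$keyword, $row['position'], $row['title'], $row['url'], time()]);
    }
    $pdo->commit();
}

// Called from the web page: read the stored rows without any request to Google.
function loadResults(PDO $pdo, string $keyword): array
{
    $select = $pdo->prepare(
        'SELECT position, title, url FROM serp_results WHERE keyword = ? ORDER BY position'
    );
    $select->execute([$keyword]);
    return $select->fetchAll(PDO::FETCH_ASSOC);
}
```

In the CLI job, each row could be built from a result object with `getPosition()`, `getTitle()` and `getUrl()`. A real setup would use a file or server DSN instead of the in-memory default.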
Once again, think about using delays between queries. This is very important to keep Google from adding your server to its blacklist. There is no universal rule for which delays to apply: finding good values is hard and requires many tests. That's why people tend to keep theirs secret.
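Besides delays, the number of queries sent over time can be capped. The class below is a small sliding-window sketch of that idea. `QueryBudget` is a hypothetical name, it is not part of the library, and the limits you pass in are guesses you would have to tune yourself. The state lives in memory only, so it is only meaningful inside one long-running CLI process.

```php
<?php
// Hypothetical helper: allow at most $max queries per $window seconds.
final class QueryBudget
{
    private $max;
    private $window;
    private $timestamps = [];

    public function __construct(int $max, int $window)
    {
        $this->max = $max;
        $this->window = $window;
    }

    // Returns true and records the query if the budget allows one at time $now.
    // A refused query is not recorded.
    public function tryAcquire(?int $now = null): bool
    {
        $now = $now ?? time();
        // Drop the timestamps that have left the window.
        $this->timestamps = array_values(array_filter(
            $this->timestamps,
            function (int $t) use ($now): bool {
                return $t > $now - $this->window;
            }
        ));
        if (count($this->timestamps) >= $this->max) {
            return false;
        }
        $this->timestamps[] = $now;
        return true;
    }
}
```

A job would call `tryAcquire()` before each search and skip or postpone the query when it returns `false`.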
<?php
$googleUrl=new \GoogleUrl();
$googleUrl->setLang('en') // lang adapts the query (TLD and Google local params)
->setNumberResults(10); // 10 results per page
$acdcPage1=$googleUrl->setPage(0)->search("acdc"); // acdc results page 1 (results 1-10)
$acdcPage2=$googleUrl->setPage(1)->search("acdc"); // acdc results page 2 (results 11-20)
$googleUrl->setNumberResults(20);
$simpsonPage1=$googleUrl->setPage(0)->search("simpson"); // simpson results page 1 (results 1-20)
// GET NATURAL RESULTS
$positions=$simpsonPage1->getPositions();
echo "results for " . $simpsonPage1->getKeywords();
echo "<ul>";
foreach($positions as $result){
echo "<li>";
echo "<ul>";
echo "<li>position : " . $result->getPosition() . "</li>";
echo "<li>title : " . utf8_decode($result->getTitle()) . "</li>";
echo "<li>website : " . $result->getWebsite() . "</li>";
echo "<li>URL : <a href='" . $result->getUrl() ."'>" . $result->getUrl() . "</a></li>";
echo "</ul>";
echo "</li>";
}
echo "</ul>";
// GET ADWORDS RESULTS
$commercialSearch = $googleUrl->setLang("fr")->setPage(0)->search("simpson tshirt");
$adwordsPositions = $commercialSearch->getAdwords();
echo "adwords for " . $commercialSearch->getKeywords();
echo "<ul>";
foreach($adwordsPositions as $result){
echo "<li>";
echo "<ul>";
echo "<li>location : " . $result->getLocation() . "</li>"; // adwords can be displayed in body or in column
echo "<li>position : " . $result->getPosition() . "</li>";
echo "<li>title : " . utf8_decode($result->getTitle()) . "</li>";
echo "<li>fake url : " . $result->getVisurl() . "</li>";
echo "<li>URL :" . $result->getAdwordsUrl() . "</li>";
echo "<li>Text : " . $result->getText() . "</li>";
echo "</ul>";
echo "</li>";
}
echo "</ul>";
// we can also get only results in body
echo "adwords <b>IN BODY</b> for " . $commercialSearch->getKeywords();
echo "<ul>";
foreach($adwordsPositions->getBodyResults() as $result){
echo "<li>" . $result->getPosition() . " : " . utf8_decode($result->getTitle()) . "</li>";
}
echo "</ul>";
// and obviously results in the right column
echo "adwords <b>IN COLUMN</b> for " . $commercialSearch->getKeywords();
echo "<ul>";
foreach($adwordsPositions->getColumnResults() as $result){
echo "<li>" . $result->getPosition() . " : " . utf8_decode($result->getTitle()) . "</li>";
}
echo "</ul>";
<?php
include __DIR__ . "/../autoload.php";
$proxy1 = new \GoogleUrl\SimpleProxy("localhost", "3128");
$proxy2 = new \GoogleUrl\SimpleProxy("someproxyAddress", "8080");
$googleUrl=new \GoogleUrl();
$googleUrl->setLang('fr')->setNumberResults(10)->search("simpson",$proxy1);
$googleUrl->setLang('fr')->setNumberResults(10)->search("simpsons",$proxy2);
Please refer to https://github.com/SneakyBobito/google-url/blob/master/doc/proxy_rotation.md for more complete usage of proxies.
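If you only need to alternate between a few proxies, a plain round-robin picker may be enough. The sketch below is independent of the rotation described in the linked document: `RoundRobin` is a hypothetical class written for this example and works on any list of items.

```php
<?php
// Hypothetical helper, unrelated to the library's own rotation code:
// hand out the items of a list in round-robin order.
final class RoundRobin
{
    private $items;
    private $index = 0;

    public function __construct(array $items)
    {
        if ($items === []) {
            throw new InvalidArgumentException('At least one item is required.');
        }
        // Re-index so that positions run from 0 to count - 1.
        $this->items = array_values($items);
    }

    // Returns the current item and advances, wrapping around at the end.
    public function next()
    {
        $item = $this->items[$this->index];
        $this->index = ($this->index + 1) % count($this->items);
        return $item;
    }
}
```

With the proxies from the example above, this could be used as `$rotation = new RoundRobin([$proxy1, $proxy2]);` followed by `$googleUrl->search("simpson", $rotation->next());` for each query.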
Implemented Languages
- en (english)
- fr (french)
- de (german)
- nl (dutch)
- cs (czech)
- dk (danish)
- jp (japanese)
- es (spanish)
- ru (russian)
More will be added over time...
Feel free to open an issue for any request/question
- Page parsing improvement (images results, website results...)
- Delayer/query queue