Web Scraping – Introduction and Example (Part 1)

Web scraping is a technique for extracting useful information from a website. It does not rely on an API. Instead, you pull all the HTML from the target website or web page and grab the part of the markup you need. I became interested in web scraping after I came across an airline API provider, skyscanner(dot)com. In the search results of this website you can see the prices of flights gathered from many different websites. There is no API involved, purely web scraping.

There is high demand for web scraping in the IT field, and I would encourage budding developers to have at least a minimal knowledge of it. But remember, web scraping is sometimes illegal. Several websites prohibit others from scraping their content, and they can easily find out if you do. Various websites have been blacklisted by Google for stealing copyrighted data. So let us proceed by promising ourselves that we will never indulge in data theft 😉

Markup

Now send an Ajax request to process.php with the user-supplied URL as a parameter (this step is optional; you can also run the PHP code directly).

Javascript

As an example, I have scraped the Alexa website ranking. This code is for educational purposes, and http://jqueryajaxphp.com is not responsible in any way if it is used for illegal purposes.

PHP

First, get the URL from the Ajax request and validate it. If the user input is not a valid URL, throw an error.
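The validation step could look roughly like the sketch below. The helper name isValidUrl() and the request field name url are my assumptions for illustration, not necessarily what the original process.php uses. It relies on PHP's built-in filter_var() and additionally restricts the scheme to http or https.

```php
<?php
// Hypothetical helper: true only when the input parses as an http(s) URL.
function isValidUrl(string $input): bool
{
    $input = trim($input);

    // FILTER_VALIDATE_URL rejects strings that are not syntactically URLs.
    if (filter_var($input, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // The filter also accepts schemes such as ftp, so check the scheme too.
    $scheme = strtolower((string) parse_url($input, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

// Assumed field name: the Ajax request is expected to send the URL as "url".
$url = $_POST['url'] ?? '';
$urlIsValid = isValidUrl($url);
```

In process.php, the error branch would run when $urlIsValid is false and the scraping code would go in the else block.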

If the URL is valid, scrape the website or web page in the else block.

In Line 1 and Line 2, http, https and www. are removed from the URL if they are present. The URL structure for Alexa is http://alexa.com/siteinfo/example.com, so we need to build a URL of that form before scraping.
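A minimal sketch of this clean-up step is shown here. The function name stripSchemeAndWww() is hypothetical and the regular expressions are one possible way to do it; the original code may use a different approach such as str_replace().

```php
<?php
// Hypothetical helper: remove a leading http:// or https:// and a leading www.
function stripSchemeAndWww(string $url): string
{
    $host = preg_replace('#^https?://#i', '', trim($url));
    $host = preg_replace('#^www\.#i', '', $host);
    return $host;
}

// Build the Alexa URL in the form http://alexa.com/siteinfo/example.com
$parsedurl = stripSchemeAndWww('https://www.example.com');
$alexaUrl  = 'http://alexa.com/siteinfo/' . $parsedurl;
```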

In Line 3, $parsedurl is appended to http://alexa.com/siteinfo/ to form the Alexa URL, which is then passed to the scrape() function to be processed. The output of this function is stored in $page. At this point $page contains all the HTML returned for that URL.

In the above code, file_get_contents($url) fetches all the HTML returned by Alexa. The website rank is somewhere in the middle of that HTML, and our objective is to retrieve it.
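For reference, a scrape() function built on file_get_contents() might be as small as the sketch below. The null return on failure is my own choice for illustration. Note that fetching http URLs this way requires allow_url_fopen to be enabled in php.ini.

```php
<?php
// Sketch of scrape(): return the fetched content as a string, or null on failure.
function scrape(string $url): ?string
{
    // file_get_contents() returns false and raises a warning on failure;
    // the @ operator silences the warning so the caller can check for null.
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}
```

A call such as $page = scrape('http://alexa.com/siteinfo/' . $parsedurl); would then leave either the HTML or null in $page.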

In Line 4, I've written a fetchdata() function that takes three parameters: fetchdata(htmlcode, startposition, endposition)

htmlcode -> Contains all the HTML code generated from alexa
startposition -> The starting point of HTML from where the data has to be scraped
endposition -> The end point of the HTML where the scraping has to stop

To scrape the website rank from Alexa, you need to understand the structure of their HTML code. Below is a screenshot of their code.

You can see that the rank of http://google.com is 1, enclosed between <strong class="metrics-data align-vmiddle"> (startposition) and </strong> (endposition).
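One way fetchdata() could be written is sketched below, using strpos() and substr() to cut out the text between the two markers. This is an illustration under my own assumptions (for instance, returning an empty string when a marker is missing), and the sample HTML is a made-up fragment that only mimics the structure described above.

```php
<?php
// Sketch of fetchdata(): return the text between $startposition and $endposition.
function fetchdata(string $htmlcode, string $startposition, string $endposition): string
{
    $start = strpos($htmlcode, $startposition);
    if ($start === false) {
        return ''; // start marker not found
    }
    $start += strlen($startposition); // move past the start marker

    $end = strpos($htmlcode, $endposition, $start);
    if ($end === false) {
        return ''; // end marker not found
    }

    return trim(substr($htmlcode, $start, $end - $start));
}

// Made-up sample markup with the same wrapper element as in the article.
$page = '<div><strong class="metrics-data align-vmiddle"> 1 </strong></div>';
$rank = fetchdata($page, '<strong class="metrics-data align-vmiddle">', '</strong>');
```

This kind of string matching breaks as soon as the site changes its markup, so the markers need to be checked against the live HTML from time to time.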

In Line 5, return the output in JSON format. In the Ajax success function, show the response in the element with the .rank class. That's it!
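Returning the result as JSON can be sketched with json_encode(). The helper name buildResponse() and the key name rank are assumptions made for this example; the original response may be shaped differently.

```php
<?php
// Hypothetical helper: wrap the scraped rank in a JSON object.
function buildResponse(string $rank): string
{
    return json_encode(['rank' => $rank]);
}

// In process.php this string would be echoed back to the Ajax caller.
$json = buildResponse('1');
```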

About Ashik

I am a Full Stack Developer and love to work on APIs and Apps. Hardcore lover of Ionic and Laravel <3

  • Jeet Kalariya

    Hi, can you please share your email ID so I can discuss some Ajax selection issues related to this kind of project?

  • hilary

    I have a question: when I send an ajax request to a site, the response is
    {"readyState":0,"responseText":"","status":0,"statusText":"error"}

    • The URL you are accessing might be blocking your request. If this happens with every URL you try, check your syntax.

      • hilary

        that’s my code. plz take a look at it

        var url = 'http://exploregreensboro.com/';
        var param = {
            data: {
                api_key: 4,
                start_date: '09/06/2016',
                per_page: 18,
                page: 2,
                sort_by: 'upcoming',
                view_type: 'list',
                show_featured: false,
                search_domain: 'http://exploregreensboro.com/',
                date_listing: 'list-by-month',
                enable_city: false,
                enable_regions: false,
                radius: 20,
                view_type_default: 'grid'
            }
        };
        $.ajax({
            url: url,
            type: 'POST',
            data: param,
            success: function (e) {
                alert(JSON.stringify(e));
            },
            error: function (e) {
                alert(JSON.stringify(e));
            }
        });