Friday, August 12, 2016

Crawling web sites with HtmlAgilityPack

Introduction

This is the first post of a small series in which I'm going to describe the implementation and design of the crawler that I've built recently for TDD demand analysis. I've split it up into several parts, each covering one of its major architectural pieces.
  • Part 1 - Crawling web sites with HtmlAgilityPack
  • Part 2 - Regex to match words from a dictionary in the page body
  • Part 3 - EF4 Code First approach to store data

For reference, you can use the source code: http://github.com/alexbeletsky/tdd.demand
Warning: it's quite a long post, because it contains code examples. If you understand the basic ideas I put here, the best way is to go directly to the repository and look at the code, as that is the best explanation material.

Using HtmlAgilityPack

HtmlAgilityPack is one of the greatest open source projects I have ever worked with. It is an HTML parser for .NET applications that performs very well and supports malformed HTML. I have successfully used it in one of my projects and really liked it. It comes with very little documentation, but it is designed so well that you can get a basic understanding just by browsing it in the Visual Studio Object Browser.
So, when you need to deal with HTML in .NET, HtmlAgilityPack is definitely the framework of choice.
I've downloaded the latest version and was very pleased that it now supports Linq to Objects, which makes using HtmlAgilityPack simpler and more fun. I'll give you just a basic idea of how it works. The task of every crawler is to extract some information from a particular HTML page. Say we need to get the inner text of a div element with class “required”. We have two options here: the classical one, using XPATH, and the brand new one, using Linq to Objects.

XPATH approach

public string GetInnerTextWithXpath()
{
    var document = new HtmlDocument();
    document.Load(new FileStream("test.html", FileMode.Open));
    var node = document.DocumentNode.SelectSingleNode(@"//div[@class=""required""]");
    return node.InnerText;
}

Linq to Objects approach

public string GetInnerTextWithLinq()
{
    var document = new HtmlDocument();
    document.Load(new FileStream("test.html", FileMode.Open));
    var node = document.DocumentNode.Descendants("div")
        .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("required"))
        .SingleOrDefault();
    return node.InnerText;
}
While I personally prefer the Linq to Objects approach, sometimes XPATH is more convenient and elegant (especially in cases where you refer to page elements without ids or special attributes).
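For instance, to pull the text of the second cell out of every row of a plain table that exposes no ids or classes at all, a single XPATH expression does the job. This is a hypothetical snippet (the table structure and method name are made up for illustration, and it assumes the usual System.Linq namespace is imported):

public IEnumerable<string> GetSecondColumnWithXpath(HtmlAgilityPack.HtmlDocument document)
{
    // XPATH describes the structural selection in one string:
    // the second cell of every table row, no ids or classes required.
    var cells = document.DocumentNode.SelectNodes("//table/tr/td[2]");

    // SelectNodes returns null when nothing matches
    return cells == null
        ? Enumerable.Empty<string>()
        : cells.Select(td => td.InnerText);
}

The equivalent Linq to Objects query would need nested Descendants() calls plus an index check, which quickly becomes noisy.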

Loading pages using WebRequest

In the previous example I loaded the page content from a file located on disk. Now our goal is to load pages by URL over HTTP. The .NET framework has a special WebRequest class for that. I've created a separate class, HtmlDocumentLoader (which implements the IHtmlDocumentLoader interface), that keeps all the details inside.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Threading;

namespace Crawler.Core.Model
{
    public class HtmlDocumentLoader : IHtmlDocumentLoader
    {
        private WebRequest CreateRequest(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 5000;
            request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
            return request;
        }

        public HtmlAgilityPack.HtmlDocument LoadDocument(string url)
        {
            var document = new HtmlAgilityPack.HtmlDocument();

            try
            {
                using (var responseStream = CreateRequest(url).GetResponse().GetResponseStream())
                {
                    document.Load(responseStream, Encoding.UTF8);
                }
            }
            catch (Exception)
            {
                // just do a second try
                Thread.Sleep(1000);

                using (var responseStream = CreateRequest(url).GetResponse().GetResponseStream())
                {
                    document.Load(responseStream, Encoding.UTF8);
                }
            }

            return document;
        }
    }
}
Several comments here. First, you can see that we set the UserAgent property of the WebRequest. We are making our request look the same as if it came from a Firefox web browser; some web servers could reject requests from “unknown” agents, so this is a kind of preventive action. Second, note how the document object is initialized: we have a try/catch block and simply repeat the same initialization steps in the catch block. It might happen that the web server fails to process the request (for different reasons), so the WebRequest object will throw an exception. We just wait for one second and retry. I've noticed that such a simple approach can really improve the robustness of the crawler.
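If you wanted to go a bit further than a single hard-coded retry, the same idea could be wrapped into a small helper. This is just a sketch, not part of the actual project; the attempt count and delay are arbitrary:

private static T Retry<T>(Func<T> action, int maxAttempts, int delayMs)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return action();
        }
        catch (Exception)
        {
            if (attempt >= maxAttempts)
            {
                throw;   // give up and let the caller see the last exception
            }

            Thread.Sleep(delayMs);   // wait a bit before the next attempt
        }
    }
}

With something like this, LoadDocument could run its loading code through a single Retry(...) call instead of duplicating it in the catch block.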

Generic Crawler

So, now we know how to load HTML documents with WebRequest by specifying the document URL, and we know how to use HtmlAgilityPack to extract data from a document. Next, we have to create an engine that automatically goes through the documents, extracts the links to the next portion of data, processes the data and stores it. That is what is called a crawler.
As I implemented and tested several crawlers, I've seen that all of them have the same structure and operations and differ only in the particular details of how data is extracted from pages. So I came up with a generic crawler, implemented as an abstract class. If you need to build the next crawler, you just inherit from the generic crawler and implement all the abstract operations. Let's look at the heart of the crawler, the StartCrawling() method.
    protected virtual void StartCrawling()
    {
        Logger.Log(BaseUrl + " crawler started...");

        CleanUp();

        for (var nextPage = 1; ; nextPage++)
        {
            var url = CreateNextUrl(nextPage);
            var document = Loader.LoadDocument(url);

            Logger.Log("processing page: [" + nextPage.ToString() + "] with url: " + url);

            var rows = GetJobRows(document);
            var rowsCount = rows.Count();

            Logger.Log("extracted " + rowsCount + " vacancies on page");

            if (rowsCount == 0)
            {
                Logger.Log("no more vacancies to process, breaking main loop");
                break;
            }

            Logger.Log("starting to process all vacancies");

            foreach (var row in rows)
            {
                Logger.Log("starting processing div, extracting vacancy href...");

                var vacancyUrl = GetVacancyUrl(row);
                if (vacancyUrl == null)
                {
                    Logger.Log("FAILED to extract vacancy href, not stopped, proceed with next one");
                    continue;
                }

                Logger.Log("started to process vacancy with url: " + vacancyUrl);

                var vacancyBody = GetVacancyBody(Loader.LoadDocument(vacancyUrl));
                if (vacancyBody == null)
                {
                    Logger.Log("FAILED to extract vacancy body, not stopped, proceed with next one");
                    continue;
                }

                var position = GetPosition(row);
                var company = GetCompany(row);
                var technology = GetTechnology(position, vacancyBody);
                var demand = GetDemand(vacancyBody);

                var record = new TddDemandRecord()
                {
                    Site = BaseUrl,
                    Company = company,
                    Position = position,
                    Technology = technology,
                    Demand = demand,
                    Url = vacancyUrl
                };

                Logger.Log("new record has been created and initialized");

                Repository.Add(record);
                Repository.SaveChanges();

                Logger.Log("record has been successfully stored to database.");
                Logger.Log("finished to process vacancy");
            }

            Logger.Log("finished to process page");
        }

        Logger.Log(BaseUrl + " crawler has successfully finished");
    }
It uses the abstract members Loader, Logger and Repository. We have already reviewed the Loader functionality; Logger is a simple interface with a Log method (I've created one implementation that writes log messages to the console, which is enough for me), and Repository is something we will review next time.
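The logging part is trivial. Here is a minimal sketch of what the interface and a console-backed implementation could look like (ILogger is the interface the crawler actually takes; the ConsoleLogger name and exact shape are my assumption, the repository version may differ slightly):

public interface ILogger
{
    void Log(string message);
}

public class ConsoleLogger : ILogger
{
    public void Log(string message)
    {
        // prefix every message with a timestamp and dump it to the console
        Console.WriteLine(DateTime.Now + " " + message);
    }
}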
The GetTechnology and GetDemand methods are the same for all crawlers, so they are part of the generic crawler; the rest of the operations are “site-dependent”, so each crawler overrides them.
    protected abstract IEnumerable<HtmlAgilityPack.HtmlNode> GetJobRows(HtmlAgilityPack.HtmlDocument document);
    protected abstract string CreateNextUrl(int nextPage);
    protected abstract string GetVacancyUrl(HtmlAgilityPack.HtmlNode row);
    protected abstract string GetVacancyBody(HtmlAgilityPack.HtmlDocument htmlDocument);
    protected abstract string GetPosition(HtmlAgilityPack.HtmlNode row);
    protected abstract string GetCompany(HtmlAgilityPack.HtmlNode row);
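For completeness, here is a rough idea of what the shared GetTechnology and GetDemand methods could look like, assuming a naive keyword match and guessing the return types (the keyword lists are invented, and the real matching logic, regex against a dictionary of terms, is the topic of Part 2):

    protected virtual string GetTechnology(string position, string vacancyBody)
    {
        // invented keyword list, for illustration only
        var technologies = new[] { ".NET", "Java", "Ruby", "Python" };
        return technologies.FirstOrDefault(t =>
            position.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0 ||
            vacancyBody.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    protected virtual bool GetDemand(string vacancyBody)
    {
        // invented keyword list, for illustration only
        var keywords = new[] { "TDD", "unit test", "test-driven" };
        return keywords.Any(k =>
            vacancyBody.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
    }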
Here we'll review one of the crawlers and how it implements all the methods required by the CrawlerImpl class.
namespace Crawler.Core.Crawlers
{
    public class RabotaUaCrawler : CrawlerImpl, ICrawler
    {
        private string _baseUrl = @"http://rabota.ua";
        private string _searchBaseUrl = @"http://rabota.ua/jobsearch/vacancy_list?rubricIds=8,9&keyWords=&parentId=1";

        public RabotaUaCrawler(ILogger logger)
        {
            Logger = logger;
        }

        public void Crawle(IHtmlDocumentLoader loader, ICrawlerRepository context)
        {
            Loader = loader;
            Repository = context;

            StartCrawling();
        }

        protected override string BaseUrl
        {
            get { return _baseUrl; }
        }

        protected override string SearchBaseUrl
        {
            get { return _searchBaseUrl; }
        }

        protected override IEnumerable<HtmlAgilityPack.HtmlNode> GetJobRows(HtmlAgilityPack.HtmlDocument document)
        {
            var vacancyDivs = document.DocumentNode.Descendants("div")
                .Where(d =>
                    d.Attributes.Contains("class") &&
                    d.Attributes["class"].Value.Contains("vacancyitem"));

            return vacancyDivs;
        }

        protected override string GetVacancyUrl(HtmlAgilityPack.HtmlNode div)
        {
            var vacancyHref = div.Descendants("a").Where(
                d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("vacancyDescription"))
                .Select(d => d.Attributes["href"].Value).SingleOrDefault();

            return BaseUrl + vacancyHref;
        }

        private static string GetVacancyHref(HtmlAgilityPack.HtmlNode div)
        {
            var vacancyHref = div.Descendants("a").Where(
                d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("vacancyDescription"))
                .Select(d => d.Attributes["href"].Value).SingleOrDefault();

            return vacancyHref;
        }

        protected override string CreateNextUrl(int nextPage)
        {
            return SearchBaseUrl + "&pg=" + nextPage;
        }

        protected override string GetVacancyBody(HtmlAgilityPack.HtmlDocument vacancyPage)
        {
            if (vacancyPage == null)
            {
                //TODO: log event here and skip this page
                return null;
            }

            var description = vacancyPage.DocumentNode.Descendants("div")
                .Where(
                    d => d.Attributes.Contains("id") && d.Attributes["id"].Value.Contains("ctl00_centerZone_vcVwPopup_pnlBody"))
                .Select(d => d.InnerHtml).SingleOrDefault();

            return description;
        }

        protected override string GetPosition(HtmlAgilityPack.HtmlNode div)
        {
            return div.Descendants("a").Where(
                d => d.Attributes.Contains("class") &&
                (d.Attributes["class"].Value.Contains("vacancyName") || d.Attributes["class"].Value.Contains("jqKeywordHighlight"))
                ).Select(d => d.InnerText).First();
        }

        protected override string GetCompany(HtmlAgilityPack.HtmlNode div)
        {
            return div.Descendants("div").Where(
                d => d.Attributes.Contains("class") &&
                d.Attributes["class"].Value.Contains("companyName")).Select(d => d.FirstChild.InnerText).First();
        }
    }
}
To make the picture complete, just review the implementation of the rest of the crawlers: http://github.com/alexbeletsky/tdd.demand/tree/master/src/Crawler/Core/Crawlers/
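And this is roughly how everything is wired together. The repository part is simplified here; CreateRepository() is an assumed placeholder, since ICrawlerRepository and its EF-based implementation are the topic of Part 3:

var logger = new ConsoleLogger();                    // the console ILogger sketched above
var loader = new HtmlDocumentLoader();
ICrawlerRepository repository = CreateRepository();  // assumed factory, see Part 3

var crawler = new RabotaUaCrawler(logger);
crawler.Crawle(loader, repository);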