Foundations and Trends in Information Retrieval Vol. 4, No. 3 (2010) 175–246
© 2010 C. Olston and M. Najork
DOI: 10.1561/1500000017

Web Crawling
By Christopher Olston and Marc Najork

Contents
1 Introduction
1.1 Challenges
1.2 Outline
2 Crawler Architecture
2.1 Chronology
2.2 Architecture Overview
2.3 Key Design Points
3 Crawl Ordering Problem
3.1 Model
3.2 Web Characteristics
3.3 Taxonomy of Crawl Ordering Policies
4 Batch Crawl Ordering
4.1 Comprehensive Crawling
4.2 Scoped Crawling
4.3 Efficient Large-Scale Implementation
5 Incremental Crawl Ordering
5.1 Maximizing Freshness
5.2 Capturing Updates
5.3 Efficient Large-Scale Implementation
6 Avoiding Problematic and Undesirable Content
6.1 Redundant Content
6.2 Crawler Traps
6.3 Web Spam
6.4 Cloaked Content
7 Deep Web Crawling
7.1 Types of Deep Web Sites
7.2 Problem Overview
7.3 Content Extraction
8 Future Directions
References
1 Yahoo! Research, 701 First Avenue, Sunnyvale, CA 94089, USA, olston@
2 Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA, najork@

Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
1 Introduction

A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages. Web crawlers are used for a variety of purposes. Most prominently, they are one of the main components of web search engines, systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index and find the web pages that match the queries. A related use is web archiving (a service provided by, e.g., the Internet Archive [77]), where large sets of web pages are periodically collected and archived for posterity. A third use is web data mining, where web pages are analyzed for statistical properties, or where data analytics is performed on them (an example would be Attributor [7], a company that monitors the web for copyright and trademark infringements). Finally, web monitoring services allow their clients to submit standing queries, or triggers, and they continuously crawl the web and notify clients of pages that match those queries (an example would be GigaAlert [64]).

The raison d'être for web crawlers lies in the fact that the web is not a centrally managed repository of information, but rather consists of hundreds of millions of independent web content providers, each one providing their own services, and competing with one another. In other words, the web can be viewed as a federated information repository, held together by a set of agreed-upon protocols and data formats, such as the Transmission Control Protocol (TCP), the Domain Name Service (DNS), the Hypertext Transfer Protocol (HTTP), the Hypertext Markup Language (HTML), and the robots exclusion protocol. So, content aggregators (such as search engines or web data miners) have two choices: They can either adopt a pull model, where they proactively scour the web for new or updated information, or they could try to establish a convention and a set of protocols enabling content providers to push content of interest to the aggregators. Indeed, the Harvest system [24], one of the earliest search services, adopted such a push model. However, this approach did not succeed, and virtually all content aggregators adopted the pull approach, with a few provisos to allow content providers to exclude all or part of their content from being crawled (the robots exclusion protocol) and to provide hints about their content, its importance, and its rate of change (the Sitemaps protocol [110]).

There are several reasons why the push model did not become the primary means of acquiring content for search engines and other content aggregators: The fact that web servers are highly autonomous means that the barrier of entry to becoming a content provider is quite low, and the fact that the web protocols were at least initially extremely simple lowered the barrier even further; in fact, this simplicity is viewed by many as the reason why the web succeeded where earlier hypertext systems had failed. Adding push protocols would have complicated the set of web protocols and thus raised the barrier of entry for content providers, while the pull model does not require any extra protocols. By the same token, the pull model lowers the barrier of entry for content aggregators as well: Launching a crawler does not require any a priori buy-in from content providers, and indeed there are over 1,500 operating crawlers [47], extending far beyond the systems employed by the big search engines. Finally, the push model requires a trust relationship between content provider and content aggregator, something that is not given on the web at large; indeed, the relationship between content providers and search engines is characterized by both mutual dependence and adversarial dynamics (see Section 6).
1.1 Challenges

The basic web crawling algorithm is simple: Given a set of seed Uniform Resource Locators (URLs), a crawler downloads all the web pages addressed by the URLs, extracts the hyperlinks contained in the pages, and iteratively downloads the web pages addressed by these hyperlinks (a minimal code sketch appears at the end of Section 1.2). Despite the apparent simplicity of this basic algorithm, web crawling has many inherent challenges:

• Scale. The web is very large and continually evolving. Crawlers that seek broad coverage and good freshness must achieve extremely high throughput, which poses many difficult engineering problems. Modern search engine companies employ thousands of computers and dozens of high-speed network links.

• Content selection tradeoffs. Even the highest-throughput crawlers do not purport to crawl the whole web, or keep up with all the changes. Instead, crawling is performed selectively and in a carefully controlled order. The goals are to acquire high-value content quickly, ensure eventual coverage of all reasonable content, and bypass low-quality, irrelevant, redundant, and malicious content. The crawler must balance competing objectives such as coverage and freshness, while obeying constraints such as per-site rate limitations. A balance must also be struck between exploration of potentially useful content, and exploitation of content already known to be useful.

• Social obligations. Crawlers should be "good citizens" of the web, i.e., not impose too much of a burden on the websites they crawl. In fact, without the right safety mechanisms a high-throughput crawler can inadvertently carry out a denial-of-service attack.

• Adversaries. Some content providers seek to inject useless or misleading content into the corpus assembled by the crawler. Such behavior is often motivated by financial incentives, for example (mis)directing traffic to commercial websites.

1.2 Outline

Web crawling is a many-faceted topic, and as with most interesting topics it cannot be split into fully orthogonal subtopics. Bearing that in mind, we structure the survey according to five relatively distinct lines of work that occur in the literature:

• Building an efficient, robust and scalable crawler (Section 2).
• Selecting a traversal order of the web graph, assuming content is well-behaved and is interconnected via HTML hyperlinks (Section 4).
• Scheduling revisitation of previously crawled content (Section 5).
• Avoiding problematic and undesirable content (Section 6).
• Crawling so-called "deep web" content, which must be accessed via HTML forms rather than hyperlinks (Section 7).

Section 3 introduces the theoretical crawl ordering problem studied in Sections 4 and 5, and describes structural and evolutionary properties of the web that influence crawl ordering. Section 8 gives a list of open problems.
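To make the basic algorithm of Section 1.1 concrete, the following is a minimal single-machine sketch using only the Python standard library. All identifiers are illustrative rather than taken from any of the systems surveyed, and it deliberately omits the politeness, robustness, and scale mechanisms that Section 2 develops.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)   # URLs yet to be downloaded
        seen = set(seeds)         # URLs discovered so far
        downloaded = 0
        while frontier and downloaded < max_pages:
            url = frontier.popleft()
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue          # unreachable page: skip it
            downloaded += 1
            extractor = LinkExtractor()
            extractor.feed(page)
            for link in extractor.links:
                absolute = urljoin(url, link)   # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return seen

The deque gives a breadth-first traversal; as the remainder of this survey shows, nearly every line of this sketch (the frontier, the seen set, the fetch, the ordering) becomes a substantial engineering problem at web scale.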
2 Crawler Architecture

This section first presents a chronology of web crawler development, and then describes the general architecture and key design points of modern scalable crawlers.

2.1 Chronology

Web crawlers are almost as old as the web itself. In the spring of 1993, shortly after the launch of NCSA Mosaic, Matthew Gray implemented the World Wide Web Wanderer [67]. The Wanderer was written in Perl and ran on a single machine. It was used until 1996 to collect statistics about the evolution of the web. Moreover, the pages crawled by the Wanderer were compiled into an index (the "Wandex"), thus giving rise to the first search engine on the web. In December 1993, three more crawler-based search engines became available: JumpStation (implemented by Jonathan Fletcher; the design has not been written up), the World Wide Web Worm [90], and the RBSE spider [57]. WebCrawler [108] joined the field in April 1994, and MOMspider [61] was described the same year. This first generation of crawlers identified some of the defining issues in web crawler design. For example, MOMspider considered politeness policies: It limited the rate of requests to each site, it allowed web sites to exclude themselves from purview through the nascent robots exclusion protocol [83], and it provided a "black-list" mechanism that allowed the crawl operator to exclude sites. WebCrawler supported parallel downloading of web pages by structuring the system into a central crawl manager and 15 separate downloading processes. However, the design of these early crawlers did not focus on scalability, and several of them (RBSE spider and WebCrawler) used general-purpose database management systems to store the state of the crawl. Even the original Lycos crawler [89] ran on a single machine, was written in Perl, and used Perl's associative arrays (spilt onto disk using the DBM database manager) to maintain the set of URLs to crawl.

The following few years saw the arrival of commercial search engines (Lycos, Infoseek, Excite, AltaVista, and HotBot), all of which used crawlers to index tens of millions of pages; however, the design of these crawlers remains undocumented.

Mike Burner's description of the Internet Archive crawler [29] was the first paper that focused on the challenges caused by the scale of the web. The Internet Archive crawling system was designed to crawl on the order of 100 million URLs. At this scale, it is no longer possible to maintain all the required data in main memory. The solution proposed by the IA paper was to crawl on a site-by-site basis, and to partition the data structures accordingly. The list of URLs to be crawled was implemented as a disk-based queue per web site. To avoid adding multiple instances of the same URL to the queue, the IA crawler maintained an in-memory Bloom filter [20] of all the site's URLs discovered so far. The crawl progressed by dequeuing a URL, downloading the associated page, extracting all links, enqueuing freshly discovered on-site links, writing all off-site links to disk, and iterating. Each crawling process crawled 64 sites in parallel, using non-blocking input/output (I/O) and a single thread of control. Occasionally, a batch process would integrate the off-site link information into the various queues. The IA design made it very easy to throttle requests to a given host, thereby addressing politeness concerns, and DNS and robot exclusion lookups for a given web site were amortized over all the site's URLs crawled in a single round. However, it is not clear whether the batch process of integrating off-site links into the per-site queues would scale to substantially larger web crawls.
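The IA bookkeeping just described can be sketched as follows: one queue of pending URLs per site, plus a per-site Bloom filter of URLs discovered so far. This is a minimal in-memory sketch; the real system kept the queues on disk, crawled 64 sites per process, and sized its filters for millions of URLs. The bit-array size and hash count below are illustrative.

    import hashlib
    from collections import deque

    class BloomFilter:
        """Approximate set membership in a fixed amount of memory."""
        def __init__(self, bits=1 << 20, hashes=4):
            self.bits, self.hashes = bits, hashes
            self.array = bytearray(bits // 8)
        def _positions(self, item):
            digest = hashlib.sha1(item.encode()).digest()
            for i in range(self.hashes):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits
        def add(self, item):
            for p in self._positions(item):
                self.array[p // 8] |= 1 << (p % 8)
        def __contains__(self, item):
            return all(self.array[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    class SiteState:
        """Per-site crawl state: a queue of pending URLs (disk-based in
        the IA design) and a Bloom filter of every URL seen on the site."""
        def __init__(self):
            self.queue = deque()
            self.seen = BloomFilter()
        def enqueue_if_new(self, url):
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)

A Bloom filter can report false positives (a never-seen URL judged as seen), so a few legitimate URLs are skipped; it never reports false negatives, so no URL is enqueued twice. That asymmetry is exactly the right tradeoff for duplicate elimination.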
Brin and Page's 1998 paper outlining the architecture of the first-generation Google [25] system contains a short description of their crawler. The original Google crawling system consisted of a single URL server process that maintained the state of the crawl, and around four crawling processes that downloaded pages. Both URL server and crawlers were implemented in Python. The crawling process used asynchronous I/O and would typically perform about 300 downloads in parallel. The peak download rate was about 100 pages per second, with an average size of 6 KB per page. Brin and Page identified social aspects of crawling (e.g., dealing with webmasters' complaints) as a major challenge in operating a crawling system.

With the Mercator web crawler, Heydon and Najork presented a "blueprint design" for web crawlers [75, 94]. Mercator was written in Java, highly scalable, and easily extensible. The first version [75] was non-distributed; a later distributed version [94] partitioned the URL space over the crawlers according to host name, and avoided the potential bottleneck of a centralized URL server. The second Mercator paper gave statistics of a 17-day, four-machine crawl that covered 891 million pages. Mercator was used in a number of web mining projects [27, 60, 71, 72, 95], and in 2001 replaced the first-generation AltaVista crawler.

Shkapenyuk and Suel's Polybot web crawler [111] represents another "blueprint design." Polybot is a distributed system, consisting of a crawl manager process, multiple downloader processes, and a DNS resolver process. The paper describes scalable data structures for the URL frontier and the "seen-URL" set used to avoid crawling the same URL multiple times; it also discusses techniques for ensuring politeness without slowing down the crawl. Polybot was able to download 120 million pages over 18 days using four machines.

The IBM WebFountain crawler [56] represented another industrial-strength design. The WebFountain crawler was fully distributed. Its three major components were multi-threaded crawling processes ("Ants"), duplicate detection processes responsible for identifying downloaded pages with near-duplicate content, and a central controller process responsible for assigning work to the Ants and for monitoring the overall state of the system. WebFountain featured a very flexible crawl scheduling mechanism that allowed URLs to be prioritized, maintained a politeness policy, and even allowed the policy to be changed on the fly. It was designed from the ground up to support incremental crawling, i.e., the process of recrawling pages regularly based on their historical change rate. The WebFountain crawler was written in C++ and used MPI (the Message Passing Interface) to facilitate communication between the various processes. It was reportedly deployed on a cluster of 48 crawling machines [68].

UbiCrawler [21] is another scalable distributed web crawler. It uses consistent hashing to partition URLs according to their host component across crawling machines, leading to graceful performance degradation in the event of the failure of a crawling machine. UbiCrawler was able to download about 10 million pages per day using five crawling machines. UbiCrawler has been used for studies of properties of the African web [22] and to compile several reference collections of web pages [118].

Recently, Lee et al. described IRLbot [84], a single-process web crawler that is able to scale to extremely large web collections without performance degradation. IRLbot features a "seen-URL" data structure that uses only a fixed amount of main memory, and whose performance does not degrade as it grows. The paper describes a crawl that ran over two months and downloaded about 6.4 billion web pages. In addition, the authors address the issue of crawler traps (web sites with a large, possibly infinite number of low-utility pages, see Section 6.2), and propose ways to ameliorate the impact of such sites on the crawling process.
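Consistent hashing, the host-assignment scheme UbiCrawler uses, can be sketched briefly. Each machine owns several points on a hash ring, and a host is assigned to the machine owning the first point at or after the host's own hash (wrapping around); when a machine is added or removed, only the hosts it owned move. The replica count and hash function below are illustrative choices, not UbiCrawler's.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Maps host names to crawling machines via a hash ring."""
        def __init__(self, machines, replicas=100):
            points = [(self._hash(f"{m}#{i}"), m)
                      for m in machines for i in range(replicas)]
            points.sort()
            self.keys = [h for h, _ in points]
            self.owners = [m for _, m in points]
        @staticmethod
        def _hash(s):
            return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
        def machine_for(self, host):
            i = bisect.bisect_left(self.keys, self._hash(host)) % len(self.keys)
            return self.owners[i]

For example, ConsistentHashRing(["crawler0", "crawler1", "crawler2"]).machine_for("example.com") always yields the same machine, and if crawler1 fails, rebuilding the ring without it reassigns only the hosts crawler1 owned; hosts assigned to the surviving machines are untouched, which is the graceful-degradation property cited above.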
Finally, there are a number of open-source crawlers, two of which deserve special mention. Heritrix [78, 93] is the crawler used by the Internet Archive. It is written in Java and highly componentized, and its design is quite similar to that of Mercator. Heritrix is multi-threaded, but not distributed, and as such suitable for conducting moderately sized crawls. The Nutch crawler [62, 81] is written in Java as well. It supports distributed operation and should therefore be suitable for very large crawls; but as of the writing of [81] it has not been scaled beyond 100 million pages.

2.2 Architecture Overview

Figure 2.1 shows the high-level architecture of a prototypical distributed web crawler. The crawler consists of multiple processes running on different machines connected by a network. Each crawling process consists of multiple worker threads, and each worker thread performs repeated work cycles.

At the beginning of each work cycle, a worker obtains a URL from the Frontier data structure, which dispenses URLs according to their priority and to politeness policies. The worker thread then invokes the HTTP fetcher. The fetcher first calls a DNS sub-module to resolve the host component of the URL into the IP address of the corresponding web server (using cached results of prior resolutions if possible), and then connects to the web server, checks for any robots exclusion rules (which typically are cached as well), and attempts to download the web page.

If the download succeeds, the web page may or may not be stored in a repository of harvested web pages (not shown). In either case, the page is passed to the Link extractor, which parses the page's HTML content and extracts hyperlinks contained therein. The corresponding URLs are then passed to a URL distributor, which assigns each URL to a crawling process. This assignment is typically made by hashing the URL's host component, its domain, or its IP address (the latter requires additional DNS resolutions). Since most hyperlinks refer to pages on the same web site, assignment to the local crawling process is the common case.

Next, the URL passes through the Custom URL filter (e.g., to exclude URLs belonging to "black-listed" sites, or URLs with particular file extensions that are not of interest) and into the Duplicate URL eliminator, which maintains the set of all URLs discovered so far and passes on only never-before-seen URLs. Finally, the URL prioritizer selects a position for the URL in the Frontier, based on factors such as estimated page importance or rate of change.¹

¹ Change rates play a role in incremental crawlers (Section 2.3.5), which route fetched URLs back to the prioritizer and Frontier.

Fig. 2.1 Basic crawler architecture.
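The URL distributor's assignment rule described above amounts to a hash of the URL's host component (hashing by domain or IP address follows the same pattern). A minimal sketch, with hash function and naming chosen for illustration:

    import hashlib
    from urllib.parse import urlsplit

    def assign_process(url, num_processes):
        """Pick the index of the crawling process responsible for a URL."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_processes

Because most extracted links are relative, assign_process(link, n) usually equals the local process's own index, and forwarding a URL to another node is the exception rather than the rule.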
2.3 Key Design Points

Web crawlers download web pages by starting from one or more seed URLs, downloading each of the associated pages, extracting the hyperlink URLs contained therein, and recursively downloading those pages. Therefore, any web crawler needs to keep track both of the URLs that are to be downloaded, as well as those that have already been downloaded (to avoid unintentionally downloading the same page repeatedly). The required state is a set of URLs, each associated with a flag indicating whether the page has been downloaded. The operations that must be supported are: Adding a new URL, retrieving a URL, marking a URL as downloaded, and testing whether the set contains a URL. There are many alternative in-memory data structures (e.g., trees or sorted lists) that support these operations. However, such an implementation does not scale to web corpus sizes that exceed the amount of memory available on a single machine.

To scale beyond this limitation, one could either maintain the data structure (e.g., the tree or sorted list) on disk, or use an off-the-shelf database management system. Either solution allows maintaining set sizes that exceed main memory; however, the cost of accessing items in the set (particularly for the purpose of set membership testing) typically involves a disk seek, making it a fairly expensive operation. To achieve high performance, a more specialized approach is needed.

Virtually every modern web crawler splits the crawl state into two major data structures: One data structure for maintaining the set of URLs that have been discovered (whether downloaded or not), and a second data structure for maintaining the set of URLs that have yet to be downloaded. The first data structure (sometimes called the "URL-seen test" or the "duplicate URL eliminator") must support set addition and set membership testing, while the second data structure (usually called the frontier) must support adding URLs, and selecting a URL to fetch next.

2.3.1 Frontier Data Structure and Politeness

A straightforward implementation of the frontier data structure is a First-in-First-out (FIFO) queue. Such an implementation results in a breadth-first traversal of the web graph. However, this simple approach has drawbacks: Most hyperlinks on the web are "relative" (i.e., refer to another page on the same web server). Therefore, a frontier realized as a FIFO queue contains long runs of URLs referring to pages on the same web server, resulting in the crawler issuing many consecutive HTTP requests to that server. A barrage of requests in short order is considered "impolite," and may be construed as a denial-of-service attack on the web server. On the other hand, it would be wasteful for the web crawler to space out requests to the same server without doing other useful work in the meantime. This problem is compounded in a multithreaded or distributed crawler that issues many HTTP requests in parallel.

Most web crawlers obey a policy of not issuing multiple overlapping requests to the same server. An easy way to realize this is to maintain a mapping from web servers to crawling threads, e.g., by hashing the host component of each URL.² In this design, each crawling thread has a separate FIFO queue, and downloads only URLs obtained from that queue.

A more conservative politeness policy is to space out requests to each web server according to that server's capabilities. For example, a crawler may have a policy to delay subsequent requests to a server by a multiple (say 10×) of the time it took to download the last page from that server. This policy ensures that the crawler consumes a bounded fraction of the web server's resources. It also means that in a given time interval, fewer pages will be downloaded from slow or poorly connected web servers than from fast, responsive web servers. In other words, this crawling policy is biased toward well-provisioned web sites. Such a policy is well-suited to the objectives of search engines, since large and popular web sites tend also to be well-provisioned.

² To amortize hardware cost, many web servers use virtual hosting, meaning that multiple symbolic host names resolve to the same IP address. Simply hashing the host component of each URL to govern politeness has the potential to overload such web servers. A better scheme is to resolve the URL's symbolic host name to an IP address and use a hash of that address to assign URLs to a queue. The drawback of that approach is that the latency of DNS resolution can be high (see Section 2.3.3), but fortunately there tends to be a high amount of locality in the stream of discovered host names, thereby making caching effective.
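The adaptive delay rule just described (wait k times the last observed download time before contacting the same server again) can be captured in a few lines. This is a minimal sketch assuming a single-process crawler; the class name and interface are illustrative.

    import time

    class PolitenessGate:
        """After a page from `server` took x seconds to download, block
        further requests to that server for k*x seconds (k is the
        politeness parameter, say 10)."""
        def __init__(self, k=10.0):
            self.k = k
            self.next_allowed = {}    # server -> earliest next request time
        def wait_turn(self, server):
            delay = self.next_allowed.get(server, 0.0) - time.monotonic()
            if delay > 0:
                time.sleep(delay)
        def record(self, server, download_seconds):
            self.next_allowed[server] = (
                time.monotonic() + self.k * download_seconds)

Sleeping while a server's delay elapses is wasteful if there is other work to do; the Mercator design described next avoids that by always servicing whichever server becomes eligible soonest.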
The Mercator web crawler implemented such an adaptive politeness policy. It divided the frontier into two parts, a "front end" and a "back end." The front end consisted of a single queue Q, and URLs were added to the frontier by enqueuing them into that queue. The back end consisted of many separate queues; typically three times as many queues as crawling threads. Each queue contained URLs belonging to a single web server; a table T on the side maintained a mapping from web servers to back-end queues. In addition, associated with each back-end queue q was a time t at which the next URL from q may be processed. These (q, t) pairs were organized into an in-memory priority queue, with the pair with lowest t having the highest priority. Each crawling thread obtained a URL to download by removing the highest-priority entry (q, t) from the priority queue, waiting if necessary until time t had been reached, dequeuing the next URL u from q, downloading it, and finally reinserting the pair (q, t_now + k·x) into the priority queue, where t_now is the current time, x is the amount of time it took to download u, and k is a "politeness parameter"; typically 10. If dequeuing u from q left q empty, the crawling thread would remove the mapping from host(u) to q from T, and then repeatedly dequeue a URL u′ from Q and enqueue u′ into the back-end queue identified by T(host(u′)), until it found a u′ such that host(u′) was not contained in T. At this point, it would enqueue u′ in q and update T to map host(u′) to q.

In addition to obeying politeness policies that govern the rate at which pages are downloaded from a given web site, web crawlers may also want to prioritize the URLs in the frontier. For example, it may be desirable to prioritize pages according to their estimated usefulness (based for example on their PageRank [101], the traffic they receive, the reputation of the web site, or the rate at which the page has been updated in the past). The page ordering question is discussed in Section 4.

Assuming a mechanism for assigning crawl priorities to web pages, a crawler can structure the frontier (or in the Mercator design described above, the front-end queue) as a disk-based priority queue ordered by usefulness. The standard implementation of a priority queue is a heap, and insertions into a heap of n elements require log(n) element accesses, each access potentially causing a disk seek, which would limit the data structure to a few hundred insertions per second, far less than the URL ingress required for high-performance crawling.

An alternative solution is to "discretize" priorities into a fixed number of priority levels (say 10 to 100 levels), and maintain a separate URL FIFO queue for each level. A URL is assigned a discrete priority level, and inserted into the corresponding queue. To dequeue a URL, either the highest-priority nonempty queue is chosen, or a randomized policy biased toward higher-priority queues is employed.
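The Mercator frontier described above translates naturally into a heap of (t, host) pairs. The following is a simplified single-threaded sketch: it creates back-end queues on demand instead of keeping a fixed pool (Mercator used about three queues per crawling thread), and locking, persistence, and error handling are omitted.

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlsplit

    class MercatorFrontier:
        """Front end: one queue Q. Back end: one queue per host, a table
        T from hosts to queues, and a heap of (t, host) pairs, where t is
        the earliest time the host may be contacted again."""
        def __init__(self, k=10.0):
            self.k = k            # politeness parameter
            self.Q = deque()      # front-end queue
            self.T = {}           # host -> back-end queue
            self.heap = []        # (t, host), smallest t first

        def add(self, url):
            self.Q.append(url)
            if not self.heap:
                self._refill()

        def _refill(self):
            # Drain Q until a URL whose host has no back-end queue is
            # found; that host gets a fresh, immediately schedulable queue.
            while self.Q:
                url = self.Q.popleft()
                host = urlsplit(url).hostname
                if host in self.T:
                    self.T[host].append(url)
                else:
                    self.T[host] = deque([url])
                    heapq.heappush(self.heap, (time.monotonic(), host))
                    return

        def next_url(self):
            """Waits until some host is eligible, then returns (url, host)."""
            t, host = heapq.heappop(self.heap)
            time.sleep(max(0.0, t - time.monotonic()))
            return self.T[host].popleft(), host

        def done(self, host, x):
            """Report that the download from `host` took x seconds."""
            if self.T[host]:
                heapq.heappush(self.heap, (time.monotonic() + self.k * x, host))
            else:
                del self.T[host]
                self._refill()

A worker's cycle is then: url, host = frontier.next_url(); download; frontier.done(host, elapsed_seconds). Because the heap orders hosts by earliest eligible time, a thread never idles on a slow server while another server is ready to be crawled.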
2.3.2 URL Seen Test

As outlined above, the second major data structure in any modern crawler keeps track of the set of URLs that have been previously discovered and added to the frontier. The purpose of this data structure is to avoid adding multiple instances of the same URL to the frontier; for this reason, it is sometimes called the URL-seen test (UST) or the duplicate URL eliminator (DUE). In a simple batch crawling setting in which pages are downloaded only once, the UST needs to support insertion and set membership testing; in a continuous crawling setting in which pages are periodically re-downloaded (see Section 2.3.5), it must also support deletion, in order to cope with URLs that no longer point to a valid page.

There are multiple straightforward in-memory implementations of a UST, e.g., a hash table or Bloom filter [20]. As mentioned above, in-memory implementations do not scale to arbitrarily large web corpora; however, they scale much further than in-memory implementations of the frontier, since each URL can be compressed to a much smaller token (e.g., a 10-byte hash value). Commercial search engines employ distributed crawlers (Section 2.3.4), and a hash table realizing the UST can be partitioned across the machines in the crawling cluster, further increasing the limit of how far such an in-memory implementation can be scaled out.

If memory is at a premium, the state of the UST must reside on disk. In a disk-based hash table, each lookup requires a disk seek, severely limiting the throughput. Caching popular URLs can increase the throughput by about an order of magnitude [27] to a few thousand lookups per second, but given that the average web page contains on the order of a hundred links and that each link needs to be tested for novelty, the crawling rate would still be limited to tens of pages per second under such an implementation.

While the latency of disk seeks is poor (a few hundred seeks per second), the bandwidth of disk reads and writes is quite high (on the order of 50–100 MB per second in modern disks). So, implementations performing random file accesses perform poorly, but those that perform streaming sequential reads or writes can achieve reasonable throughput. The Mercator crawler leveraged this observation by aggregating many set lookup and insertion operations into a single large batch, and processing this batch by sequentially reading a set of sorted URL hashes from disk and writing them (plus the hashes of previously undiscovered URLs) out to a new file [94].

This approach implies that the set membership test is delayed: We only know whether a URL is new after the batch containing the URL has been merged with the disk file. Therefore, we cannot decide whether to add the URL to the frontier until the merge occurs, i.e., we need to retain all the URLs in a batch, not just their hashes. However, it is possible to store these URLs temporarily on disk and read them back at the conclusion of the merge (again using purely sequential I/O), once it is known that they had not previously been encountered and should thus be added to the frontier. Adding URLs to the frontier in a delayed fashion also means that there is a lower bound on how soon they can be crawled; however, given that the frontier is usually far larger than a DUE batch, this delay is imperceptible except for the most high-priority URLs.

The IRLbot crawler [84] uses a refinement of the Mercator scheme, where the batch of URLs arriving at the DUE is also written to disk, distributed over multiple files keyed by the prefix of each hash. Once the size of the largest file exceeds a certain threshold, the files that together hold the batch are read back into memory one by one and merge-sorted into the main URL hash file on disk. At the conclusion of the merge, URLs are forwarded to the frontier as in the Mercator scheme. Because IRLbot stores the batch on disk, the size of a single batch can be much larger than Mercator's in-memory batches, so the cost of the merge-sort with the main URL hash file is amortized over a much larger set of URLs.
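The batched, sequential-I/O merge at the heart of the Mercator scheme can be sketched as follows, under illustrative assumptions: URL hashes are 10-byte SHA-1 prefixes, and the on-disk set is a single flat file of sorted hashes. The real implementation streams far larger files and overlaps merging with crawling.

    import hashlib

    HASH_BYTES = 10   # a compact per-URL token, as discussed above

    def url_hash(url):
        return hashlib.sha1(url.encode()).digest()[:HASH_BYTES]

    def merge_batch(batch_urls, old_file, new_file):
        """Merge one batch into the sorted hash file using sequential I/O
        only. Returns the batch URLs that were not already known."""
        # Sort the batch by hash; the dict also dedupes within the batch.
        batch = sorted({url_hash(u): u for u in batch_urls}.items())
        novel = []
        with open(old_file, "rb") as old, open(new_file, "wb") as out:
            disk = iter(lambda: old.read(HASH_BYTES), b"")  # old hashes, in order
            current = next(disk, None)
            for h, u in batch:
                while current is not None and current < h:
                    out.write(current)          # copy smaller old hashes through
                    current = next(disk, None)
                if current == h:                # hash already on disk: not novel
                    out.write(current)
                    current = next(disk, None)
                else:                           # previously unseen URL
                    out.write(h)
                    novel.append(u)
            while current is not None:          # copy the remaining tail
                out.write(current)
                current = next(disk, None)
        return novel

Both files are read and written strictly front to back, so the merge runs at disk bandwidth rather than seek latency, which is precisely the observation the Mercator design exploits.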
In the Mercator scheme and its IRLbot variant, merging a batch of URLs into the disk-based hash file involves reading the entire old hash file and writing out an updated version. Hence, the time requirement is proportional to the number of discovered URLs. A modification of this design is to store the URL hashes on disk in sorted order as before, but sparsely packed rather than densely packed. The k highest-order bits of a hash determine the disk block where this hash resides (if it is present). Merging a batch into the disk file is done in place, by reading a block for which there are hashes in the batch, checking which hashes are not present in that block, and writing the updated block back to disk. Thus, the time requirement for merging a batch is proportional to the size of the batch, not the number of discovered URLs (albeit with a high constant due to disk seeks resulting from skipping disk blocks). Once any block in the file fills up completely, the disk file is rewritten to be twice as large, and each block contains hashes that now share their k+1 highest-order bits.

2.3.3 Auxiliary Data Structures

In addition to the two main data structures discussed in Sections 2.3.1 and 2.3.2 (the frontier and the UST/DUE), web crawlers maintain various auxiliary data structures. We discuss two: The robots exclusion rule cache and the DNS cache.

Web crawlers are supposed to adhere to the Robots Exclusion Protocol [83], a convention that allows a web site administrator to bar web crawlers from crawling their site, or some pages within the site. This is done by providing a file at URL /robots.txt containing rules that specify which pages the crawler is allowed to download. Before attempting to crawl a site, a crawler should check whether the site supplies a /robots.txt file, and if so, adhere to its rules. Of course, downloading this file constitutes crawling activity in itself. To avoid repeatedly requesting /robots.txt, crawlers typically cache the results of previous requests of that file. To bound the size of that cache, entries must be discarded through some cache eviction policy (e.g., least-recently used); additionally, web servers can specify an expiration time for their /robots.txt file (via the HTTP Expires header), and cache entries should be discarded accordingly.

URLs contain a host name component, which is "resolved" using the Domain Name Service (DNS), a protocol that exposes a globally distributed mapping from symbolic host names to IP addresses. DNS requests can take quite a long time due to the request-forwarding nature of the protocol. Therefore, crawlers often maintain their own DNS caches. As with the robots exclusion rule cache, entries are expired according to both a standard eviction policy (such as least-recently used), and to expiration directives.

2.3.4 Distributed Crawling

Web crawlers can be distributed over multiple machines to increase their throughput. This is done by partitioning the URL space, such that each crawler machine or node is responsible for a subset of the URLs on the web. The URL space is best partitioned across web site boundaries [40] (where a "web site" may refer to all URLs with the same symbolic host name, same domain, or same IP address). Partitioning the URL space across site boundaries makes it easy to obey politeness policies, since each crawling process can schedule downloads without having to communicate with other crawler nodes. Moreover, all the major data structures can easily be partitioned across site boundaries, i.e., the frontier, the DUE, and the DNS and robots exclusion caches of each node contain URLs, robots exclusion rules, and name-to-address mappings associated with the sites assigned to that node, and nothing else.
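A robots rule cache of the kind described above might look as follows. This is a minimal sketch: Python's standard urllib.robotparser supplies the fetching and rule evaluation, the capacity and default TTL are illustrative, and a fixed TTL stands in for honoring the per-response Expires header; network failures during the fetch are also left unhandled.

    import time
    from collections import OrderedDict
    from urllib import robotparser

    class RobotsCache:
        """LRU-bounded cache of parsed /robots.txt rules, with expiry."""
        def __init__(self, capacity=10_000, default_ttl=3600):
            self.capacity, self.default_ttl = capacity, default_ttl
            self.entries = OrderedDict()    # host -> (expires, parser)

        def allowed(self, agent, url, host):
            entry = self.entries.get(host)
            if entry is None or entry[0] < time.time():
                parser = robotparser.RobotFileParser(
                    f"http://{host}/robots.txt")
                parser.read()               # one crawl request, amortized
                entry = (time.time() + self.default_ttl, parser)
                self.entries[host] = entry
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)   # evict LRU entry
            else:
                self.entries.move_to_end(host)         # refresh LRU order
            return entry[1].can_fetch(agent, url)

A DNS cache follows the same pattern, with the resolved IP address (and its TTL from the DNS response) stored in place of the parsed rule set.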
Crawling processes download web pages and extract URLs, and thanks to the prevalence of relative links on the web, they will be themselves responsible for the large majority of extracted URLs. When a process extracts a URL u that falls under the responsibility of another crawler node, it forwards u to that node. Forwarding of URLs can be done through peer-to-peer TCP connections [94], a shared file system [70], or a central coordination process [25, 111]. The amount of communication with other crawler nodes can be reduced by maintaining a cache of popular URLs, used to avoid repeat forwardings [27].

Finally, a variant of distributed web crawling is peer-to-peer crawling [10, 87, 100, 112, 121], which spreads crawling over a loosely collaborating set of crawler nodes. Peer-to-peer crawlers typically employ some form of distributed hash table scheme to assign URLs to crawler nodes, enabling them to cope with the sporadic arrival and departure of crawling nodes.

2.3.5 Incremental Crawling

Web crawlers can be used to assemble one or more static snapshots of a web corpus (batch crawling), or to perform incremental or continuous crawling, where the resources of the crawler are divided between downloading newly discovered pages and re-downloading previously crawled pages. Efficient incremental crawling requires a few changes to the major data structures of the crawler. First, as mentioned in Section 2.3.2, the DUE should support the deletion of URLs that are no longer valid (e.g., that result in a 404 HTTP return code). Second, URLs are retrieved from the frontier and downloaded as in batch crawling, but they are subsequently reentered into the frontier. If the frontier allows URLs to be prioritized, the priority of a previously downloaded URL should be dependent on a model of the page's temporal behavior based on past observations (see Section 5). This functionality is best facilitated by augmenting URLs in the frontier with additional information, in particular previous priorities and compact sketches of their previous content. This extra information allows the crawler to compare the sketch of the just-downloaded page to that of the previous version, for example raising the priority if the page has changed and lowering it if it has not. In addition to content evolution, other factors such as page quality are also often taken into account; indeed there are many fast-changing "spam" web pages.
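The sketch-and-reprioritize step just described can be illustrated as follows. The content sketch here is a min-hash-style set of shingle (word k-gram) hashes, one of several possibilities; the similarity threshold and the multiplicative raise/lower rule are invented for illustration and are not a policy from the literature (Section 5 covers principled revisitation policies).

    import hashlib

    def content_sketch(text, k=5, size=64):
        """Keep the `size` smallest hashes of the page's word k-grams."""
        words = text.split()
        grams = {" ".join(words[i:i + k])
                 for i in range(max(1, len(words) - k + 1))}
        hashes = sorted(
            int.from_bytes(hashlib.sha1(g.encode()).digest()[:8], "big")
            for g in grams)
        return set(hashes[:size])

    def updated_priority(old_priority, old_sketch, new_text):
        """Compare the new page to the stored sketch; nudge the revisit
        priority up if the page changed, down if it did not."""
        new_sketch = content_sketch(new_text)
        overlap = (len(old_sketch & new_sketch)
                   / max(1, len(old_sketch | new_sketch)))
        changed = overlap < 0.9        # illustrative change threshold
        priority = old_priority * (1.5 if changed else 0.7)
        return priority, new_sketch

Only the small sketch, not the previous page content, needs to travel with the URL through the frontier, which keeps the per-URL state compact.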
3 Crawl Ordering Problem

Aside from the intra-site politeness considerations discussed in Section 2, a crawler is free to visit URLs in any order. The crawl order is extremely significant, because for the purpose of crawling the web can be considered infinite, due to the growth rate of new content, and especially due to dynamically generated content [8]. Indeed, despite their impressive capacity, modern commercial search engines only index (and likely only crawl) a fraction of discoverable web pages [11]. The crawler ordering question is even more crucial for the countless smaller-scale crawlers that perform scoped crawling of targeted subsets of the web.

Sections 3–5 survey work on selecting a good crawler order, with a focus on two basic considerations:

• Coverage. The fraction of desired pages that the crawler acquires successfully.
• Freshness. The degree to which the acquired page snapshots remain up-to-date, relative to the current "live" web copies.

Issues related to redundant, malicious or misleading content are covered in Section 6. Generally speaking, techniques to avoid unwanted content can be incorporated into the basic crawl ordering approaches without much difficulty.

3.1 Model

Most work on crawl ordering abstracts away the architectural details of a crawler (Section 2), and assumes that URLs in the frontier data structure can be reordered freely. The resulting simplified crawl ordering model is depicted in Figure 3.1. At a given point in time, some historical crawl order has already been executed (P1, P2, P3, P4, P5 in the diagram), and some future crawl order has been planned (P6, P7, P4, P8, ...).¹

In the model, all pages require the same amount of time to download; the (constant) rate of page downloading is called the crawl rate, typically measured in pages/second. (Section 2 discussed how to maximize the crawl rate; here it is assumed to be fixed.) The crawl rate is not relevant to batch crawl ordering methods, but it is a key factor when scheduling page revisitations in incremental crawling.

Fig. 3.1 Crawl ordering model.

¹ Some approaches treat the crawl ordering problem hierarchically, e.g., select a visitation order for web sites, and within each site select a page visitation order. This approach helps mitigate the complexity of managing a crawl ordering policy, and is well aligned with policies that rely primarily on site-level metrics such as site-level PageRank to drive crawl ordering decisions. Many of the insights about page-level crawl ordering also apply at the site level.

Pages downloaded by the crawler are stored in a repository. The future crawl order is determined, at least in part, by analyzing the repository. For example, one simple policy mentioned earlier, breadth-first search, extracts hyperlinks from pages entering the repository, identifies linked-to pages that are not already part of the (historical or planned) crawl order, and adds them to the end of the planned crawl order.

The content of a web page is subject to change over time, and it is sometimes desirable to re-download a page that has already been downloaded, to obtain a more recent snapshot of its content. As mentioned in Section 2.3.5, two approaches exist for managing repeated downloads:

• Batch crawling. The crawl order does not contain duplicate occurrences of any page, but the entire crawling process is periodically halted and restarted as a way to obtain more recent snapshots of previously crawled pages. Information gleaned from previous crawl iterations (e.g., page importance score estimates) may be fed to subsequent ones.

• Incremental crawling. Pages may appear multiple times in the crawl order, and crawling is a continuous process that conceptually never terminates.

It is believed that most commercial crawlers perform incremental crawling, which is more powerful because it allows re-visitation of pages at different rates. (A comparison between incremental and batch crawling is made by Cho and García-Molina [39].)
3.1.1 Limitations

This model has led to a good deal of research with practical implications. However, as with all models, it simplifies reality. For one thing, as discussed in Section 2, a large-scale crawler maintains its frontier data structure on disk, which limits opportunities for reordering. Generally speaking, the approach of maintaining a prioritized ensemble of FIFO queues (see Section 2.3.1) can be used to approximate a desired crawl order. We revisit this issue in Sections 4.3 and 5.3.

Other real-world considerations that fall outside the model include:

• Some pages (or even versions of a page) take longer to download than others, due to differences in size and network latency.
• Crawlers take special care to space out downloads of pages from the same server, to obey politeness constraints, see Section 2.3.1. Crawl ordering policies that assume a single crawl rate constraint can, at least in principle, be applied on a per-server basis, i.e., run n independent copies of the policy for n servers.
• As described in Section 2, commercial crawlers utilize many simultaneous page downloader threads, running on many independent machines. Hence rather than a single totally ordered list of pages to download, it is more accurate to think of a set of parallel lists, encoding a partial order.
• Special care must be taken to avoid crawling redundant and malicious content; we treat these issues in Section 6.
• If the page repository runs out of space, and expanding it is not considered worthwhile, it becomes necessary to retire some of the pages stored there (although it may make sense to retain some metadata about the page, to avoid recrawling it). We are not aware of any scholarly work on how to select pages for retirement.
3.2 Web Characteristics

Before proceeding, we describe some structural and evolutionary properties of the web that are relevant to the crawl ordering question. The findings presented here are drawn from studies that used data sets of widely varying size and scope, taken at different dates over the span of a decade, and analyzed via a wide array of methods. Hence, caution is warranted in their interpretation.

3.2.1 Static Characteristics

Several studies of the structure of the web graph, in which pages are encoded as vertices and hyperlinks as directed edges, have been conducted. One notable study is by Broder et al. [26], which uncovered a "bowtie" structure consisting of a central strongly connected component (the core), a component that can reach the core but cannot be reached from the core, and a component that can be reached from the core but cannot reach the core. (In addition to these three components there are a number of small, irregular structures such as disconnected components and long "tendrils.")

Hence there exist many ordered pairs of pages (P1, P2) such that there is no way to reach P2 by starting at P1 and repeatedly following hyperlinks. Even in cases where P2 is reachable from P1, the distance can vary greatly, and in many cases hundreds of links must be traversed. The implications for crawling are: (1) one cannot simply crawl to depth N, for a reasonable value of N like N = 20, and be assured of covering the entire web graph; (2) crawling "seeds" (the pages at which a crawl commences) should be selected carefully, and multiple seeds may be necessary to ensure good coverage.

In an earlier study, Broder et al. [28] showed that there is an abundance of near-duplicate content on the web. Using a corpus of 30 million web pages collected by the AltaVista crawler, they used the shingling algorithm to cluster the corpus into groups of similar pages, and found that 29% of the pages were more than 50% similar to other pages in the corpus, and 11% of the pages were exact duplicates of other pages. Sources of near-duplication include mirroring of sites (or portions of sites) and URL synonymy, see Section 6.1.

Chang et al. [35] studied the "deep web," i.e., web sites whose content is not reachable via hyperlinks and instead can only be retrieved by submitting HTML forms. The findings include: (1) there are over one million deep web sites; (2) more deep web sites have structured (multi-field) query interfaces than unstructured (single-field) ones; and (3) most query interfaces are located within a few links of the root of a web site, and are thus easy to find by shallow crawling from the root page.

3.2.2 Temporal Characteristics

One of the objectives of crawling is to maintain freshness of the crawled corpus. Hence it is important to understand the temporal characteristics of the web, both in terms of site-level evolution (the appearance and disappearance of pages on a site) and page-level evolution (changing content within a page).

3.2.2.1 Site-Level Evolution

Dasgupta et al. [48] and Ntoulas et al. [96] studied creation and retirement of pages and links inside a number of web sites, and found the following characteristics (these represent averages across many sites):

• New pages are created at a rate of 8% per week.
• Pages are retired at a rapid pace, such that during the course of one year 80% of pages disappear.
• New links are created at the rate of 25% per week, which is significantly faster than the rate of new page creation.
• Links are retired at about the same pace as pages, with 80% disappearing in the span of a year.
• It is possible to discover 90% of new pages by monitoring links spawned from a small, well-chosen set of old pages (for most sites, five or fewer pages suffice, although for some sites hundreds of pages must be monitored for this purpose). However, discovering the remaining 10% requires substantially more effort.

3.2.2.2 Page-Level Evolution

Some key findings about the frequency with which an individual web page undergoes a change are:

• Page change events are governed by a Poisson process, which means that changes occur randomly and independently, at least in the case of pages that change less frequently than once a day [39].²
• Page change frequencies span multiple orders of magnitude (sub-hourly, hourly, daily, weekly, monthly, annually), and each order of magnitude includes a substantial fraction of pages on the web [2, 39]. This finding motivates the study of non-uniform page revisitation schedules.
• Change frequency is correlated with visitation frequency, URL depth, domain and topic [2], as well as page length [60].
• A page's change frequency tends to remain stationary over time, such that past change frequency is a fairly good predictor of future change frequency [60].

² A Poisson change model was originally postulated by Coffman et al. [46].

Unfortunately, it appears that there is no simple relationship between the frequency with which a page changes and the cumulative amount of content that changes over time. As one would expect, pages with moderate change frequency tend to exhibit a higher cumulative amount of changed content than pages with a low change frequency. However, pages with high change frequency tend to exhibit less cumulative change than pages with moderate change frequency. On the encouraging side, the amount of content that changed on a page in the past is a fairly good predictor of the amount of content that will change in the future (although the degree of predictability varies from web site to web site) [60, 96].

Many changes are confined to a small, contiguous region of a web page [60, 85], and/or only affect transient words that do not characterize the core, time-invariant theme of the page [2]. Much of the "new" content added to web pages is actually taken from other pages [96].

The temporal behavior of (regions of) web pages can be divided into three categories: Static (no changes), churn (new content supplants old content, e.g., quote of the day), and scroll (new content is appended to old content, e.g., blog entries). Simple generative models for the three categories collectively explain nearly all observed temporal web page behavior [99].
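The Poisson change model cited above has a simple quantitative form. If a page changes according to a Poisson process with rate λ (changes per day), the probability that it has changed within t days of the last download is 1 − e^(−λt), and λ can be estimated from observed visits. A short sketch; the estimator shown is the naive one (visits with an observed change divided by total observation time), which underestimates λ when several changes fall between consecutive visits.

    import math

    def prob_changed(lam, t_days):
        """Probability a page with change rate `lam` (changes/day) has
        changed within t_days of the last download: 1 - exp(-lam * t)."""
        return 1.0 - math.exp(-lam * t_days)

    def estimate_rate(visits_with_change, total_days_observed):
        """Naive Poisson rate estimate from revisit observations."""
        return visits_with_change / total_days_observed

    # Example: a page observed to have changed on 6 of the visits spread
    # over 30 days has an estimated rate of 0.2/day; two days after the
    # last crawl it has changed with probability 1 - exp(-0.4), about 33%.

Estimates of this kind are the raw input to the revisitation policies of Section 5, which decide how to spend a fixed crawl rate across pages with very different change rates.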
Most web pages include at least some static content, resulting in an upper bound on the divergence between an old snapshot of a page and the live copy. The shape of the
