Foundations and Trends in Information Retrieval Vol. 4, No. 3 (2010) 175–246
© 2010 C. Olston and M. Najork
DOI: 10.1561/1500000017

Web Crawling
By Christopher Olston and Marc Najork

Contents
1 Introduction
1.1 Challenges
1.2 Outline
2 Crawler Architecture
2.1 Chronology
2.2 Architecture Overview
2.3 Key Design Points
3 Crawl Ordering Problem
3.1 Model
3.2 Web Characteristics
3.3 Taxonomy of Crawl Ordering Policies
4 Batch Crawl Ordering
4.1 Comprehensive Crawling
4.2 Scoped Crawling
4.3 Efficient Large-Scale Implementation
5 Incremental Crawl Ordering
5.1 Maximizing Freshness
5.2 Capturing Updates
5.3 Efficient Large-Scale Implementation
6 Avoiding Problematic and Undesirable Content
6.1 Redundant Content
6.2 Crawler Traps
6.3 Web Spam
6.4 Cloaked Content
7 Deep Web Crawling
7.1 Types of Deep Web Sites
7.2 Problem Overview
7.3 Content Extraction
8 Future Directions
References
1 Yahoo! Research, 701 First Avenue, Sunnyvale, CA 94089, USA, olston@
2 Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA, najork@

Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
1 Introduction

A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages. Web crawlers are used for a variety of purposes. Most prominently, they are one of the main components of web search engines, systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index and find the web pages that match the queries. A related use is web archiving (a service provided by, e.g., the Internet Archive [77]), where large sets of web pages are periodically collected and archived for posterity. A third use is web data mining, where web pages are analyzed for statistical properties, or where data analytics is performed on them (an example would be Attributor [7], a company that monitors the web for copyright and trademark infringements). Finally, web monitoring services allow their clients to submit standing queries, or triggers, and they continuously crawl the web and notify clients of pages that match those queries (an example would be GigaAlert [64]).

The raison d'être for web crawlers lies in the fact that the web is not a centrally managed repository of information, but rather consists of hundreds of millions of independent web content providers, each one providing their own services, and competing with one another. In other words, the web can be viewed as a federated information repository, held together by a set of agreed-upon protocols and data formats, such as the Transmission Control Protocol (TCP), the Domain Name Service (DNS), the Hypertext Transfer Protocol (HTTP), the Hypertext Markup Language (HTML), and the robots exclusion protocol. So, content aggregators (such as search engines or web data miners) have two choices: They can either adopt a pull model, where they proactively scour the web for new or updated information, or they could try to establish a convention and a set of protocols enabling content providers to push content of interest to the aggregators. Indeed, the Harvest system [24], one of the earliest search services, adopted such a push model. However, this approach did not succeed, and virtually all content aggregators adopted the pull approach, with a few provisos to allow content providers to exclude all or part of their content from being crawled (the robots exclusion protocol) and to provide hints about their content, its importance, and its rate of change (the Sitemaps protocol [110]).

There are several reasons why the push model did not become the primary means of acquiring content for search engines and other content aggregators: The fact that web servers are highly autonomous means that the barrier of entry to becoming a content provider is quite low, and the fact that the web protocols were at least initially extremely simple lowered the barrier even further; in fact, this simplicity is viewed by many as the reason why the web succeeded where earlier hypertext systems had failed. Adding push protocols would have complicated the set of web protocols and thus raised the barrier of entry for content providers, while the pull model does not require any extra protocols. By the same token, the pull model lowers the barrier of entry for content aggregators as well: Launching a crawler does not require any a priori buy-in from content providers, and indeed there are over 1,500 operating crawlers [47], extending far beyond the systems employed by the big search engines. Finally, the push model requires a trust relationship between content provider and content aggregator, something that is not given on the web at large; indeed, the relationship between content providers and search engines is characterized by both mutual dependence and adversarial dynamics (see Section 6).
1.1 Challenges

The basic web crawling algorithm is simple: Given a set of seed Uniform Resource Locators (URLs), a crawler downloads all the web pages addressed by the URLs, extracts the hyperlinks contained in the pages, and iteratively downloads the web pages addressed by these hyperlinks (a minimal code sketch appears at the end of Section 1.2). Despite the apparent simplicity of this basic algorithm, web crawling has many inherent challenges:

• Scale. The web is very large and continually evolving. Crawlers that seek broad coverage and good freshness must achieve extremely high throughput, which poses many difficult engineering problems. Modern search engine companies employ thousands of computers and dozens of high-speed network links.

• Content selection tradeoffs. Even the highest-throughput crawlers do not purport to crawl the whole web, or keep up with all the changes. Instead, crawling is performed selectively and in a carefully controlled order. The goals are to acquire high-value content quickly, ensure eventual coverage of all reasonable content, and bypass low-quality, irrelevant, redundant, and malicious content. The crawler must balance competing objectives such as coverage and freshness, while obeying constraints such as per-site rate limitations. A balance must also be struck between exploration of potentially useful content, and exploitation of content already known to be useful.

• Social obligations. Crawlers should be "good citizens" of the web, i.e., not impose too much of a burden on the websites they crawl. In fact, without the right safety mechanisms a high-throughput crawler can inadvertently carry out a denial-of-service attack.

• Adversaries. Some content providers seek to inject useless or misleading content into the corpus assembled by the crawler. Such behavior is often motivated by financial incentives, for example (mis)directing traffic to commercial websites.

1.2 Outline

Web crawling is a many-faceted topic, and as with most interesting topics it cannot be split into fully orthogonal subtopics. Bearing that in mind, we structure the survey according to five relatively distinct lines of work that occur in the literature:

• Building an efficient, robust and scalable crawler (Section 2).
• Selecting a traversal order of the web graph, assuming content is well-behaved and is interconnected via HTML hyperlinks (Section 4).
• Scheduling revisitation of previously crawled content (Section 5).
• Avoiding problematic and undesirable content (Section 6).
• Crawling so-called "deep web" content, which must be accessed via HTML forms rather than hyperlinks (Section 7).

Section 3 introduces the theoretical crawl ordering problem studied in Sections 4 and 5, and describes structural and evolutionary properties of the web that influence crawl ordering. Section 8 gives a list of open problems.
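To make the basic algorithm of Section 1.1 concrete, the following is a minimal single-machine sketch using only the Python standard library. All identifiers are illustrative rather than taken from any of the systems surveyed, and it deliberately omits the politeness, robustness, and scale mechanisms that Section 2 develops.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)   # URLs yet to be downloaded
        seen = set(seeds)         # URLs discovered so far
        downloaded = 0
        while frontier and downloaded < max_pages:
            url = frontier.popleft()
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue          # unreachable page: skip it
            downloaded += 1
            extractor = LinkExtractor()
            extractor.feed(page)
            for link in extractor.links:
                absolute = urljoin(url, link)   # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return seen

The deque gives a breadth-first traversal; as the remainder of this survey shows, nearly every line of this sketch (the frontier, the seen set, the fetch, the ordering) becomes a substantial engineering problem at web scale.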
2 Crawler Architecture

This section first presents a chronology of web crawler development, and then describes the general architecture and key design points of modern scalable crawlers.

2.1 Chronology

Web crawlers are almost as old as the web itself. In the spring of 1993, shortly after the launch of NCSA Mosaic, Matthew Gray implemented the World Wide Web Wanderer [67]. The Wanderer was written in Perl and ran on a single machine. It was used until 1996 to collect statistics about the evolution of the web. Moreover, the pages crawled by the Wanderer were compiled into an index (the "Wandex"), thus giving rise to the first search engine on the web. In December 1993, three more crawler-based search engines became available: JumpStation (implemented by Jonathan Fletcher; the design has not been written up), the World Wide Web Worm [90], and the RBSE spider [57]. WebCrawler [108] joined the field in April 1994, and MOMspider [61] was described the same year. This first generation of crawlers identified some of the defining issues in web crawler design. For example, MOMspider considered politeness policies: It limited the rate of requests to each site, it allowed web sites to exclude themselves from purview through the nascent robots exclusion protocol [83], and it provided a "black-list" mechanism that allowed the crawl operator to exclude sites. WebCrawler supported parallel downloading of web pages by structuring the system into a central crawl manager and 15 separate downloading processes. However, the design of these early crawlers did not focus on scalability, and several of them (RBSE spider and WebCrawler) used general-purpose database management systems to store the state of the crawl. Even the original Lycos crawler [89] ran on a single machine, was written in Perl, and used Perl's associative arrays (spilt onto disk using the DBM database manager) to maintain the set of URLs to crawl.

The following few years saw the arrival of commercial search engines (Lycos, Infoseek, Excite, AltaVista, and HotBot), all of which used crawlers to index tens of millions of pages; however, the design of these crawlers remains undocumented.

Mike Burner's description of the Internet Archive crawler [29] was the first paper that focused on the challenges caused by the scale of the web. The Internet Archive crawling system was designed to crawl on the order of 100 million URLs. At this scale, it is no longer possible to maintain all the required data in main memory. The solution proposed by the IA paper was to crawl on a site-by-site basis, and to partition the data structures accordingly. The list of URLs to be crawled was implemented as a disk-based queue per web site. To avoid adding multiple instances of the same URL to the queue, the IA crawler maintained an in-memory Bloom filter [20] of all the site's URLs discovered so far. The crawl progressed by dequeuing a URL, downloading the associated page, extracting all links, enqueuing freshly discovered on-site links, writing all off-site links to disk, and iterating. Each crawling process crawled 64 sites in parallel, using non-blocking input/output (I/O) and a single thread of control. Occasionally, a batch process would integrate the off-site link information into the various queues. The IA design made it very easy to throttle requests to a given host, thereby addressing politeness concerns, and DNS and robot exclusion lookups for a given web site were amortized over all the site's URLs crawled in a single round. However, it is not clear whether the batch process of integrating off-site links into the per-site queues would scale to substantially larger web crawls.
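The IA bookkeeping just described can be sketched as follows: one queue of pending URLs per site, plus a per-site Bloom filter of URLs discovered so far. This is a minimal in-memory sketch; the real system kept the queues on disk, crawled 64 sites per process, and sized its filters for millions of URLs. The bit-array size and hash count below are illustrative.

    import hashlib
    from collections import deque

    class BloomFilter:
        """Approximate set membership in a fixed amount of memory."""
        def __init__(self, bits=1 << 20, hashes=4):
            self.bits, self.hashes = bits, hashes
            self.array = bytearray(bits // 8)
        def _positions(self, item):
            digest = hashlib.sha1(item.encode()).digest()
            for i in range(self.hashes):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits
        def add(self, item):
            for p in self._positions(item):
                self.array[p // 8] |= 1 << (p % 8)
        def __contains__(self, item):
            return all(self.array[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    class SiteState:
        """Per-site crawl state: a queue of pending URLs (disk-based in
        the IA design) and a Bloom filter of every URL seen on the site."""
        def __init__(self):
            self.queue = deque()
            self.seen = BloomFilter()
        def enqueue_if_new(self, url):
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)

A Bloom filter can report false positives (a never-seen URL judged as seen), so a few legitimate URLs are skipped; it never reports false negatives, so no URL is enqueued twice. That asymmetry is exactly the right tradeoff for duplicate elimination.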
Brin and Page's 1998 paper outlining the architecture of the first-generation Google [25] system contains a short description of their crawler. The original Google crawling system consisted of a single URL server process that maintained the state of the crawl, and around four crawling processes that downloaded pages. Both URL server and crawlers were implemented in Python. The crawling process used asynchronous I/O and would typically perform about 300 downloads in parallel. The peak download rate was about 100 pages per second, with an average size of 6 KB per page. Brin and Page identified social aspects of crawling (e.g., dealing with webmasters' complaints) as a major challenge in operating a crawling system.

With the Mercator web crawler, Heydon and Najork presented a "blueprint design" for web crawlers [75, 94]. Mercator was written in Java, highly scalable, and easily extensible. The first version [75] was non-distributed; a later distributed version [94] partitioned the URL space over the crawlers according to host name, and avoided the potential bottleneck of a centralized URL server. The second Mercator paper gave statistics of a 17-day, four-machine crawl that covered 891 million pages. Mercator was used in a number of web mining projects [27, 60, 71, 72, 95], and in 2001 replaced the first-generation AltaVista crawler.

Shkapenyuk and Suel's Polybot web crawler [111] represents another "blueprint design." Polybot is a distributed system, consisting of a crawl manager process, multiple downloader processes, and a DNS resolver process. The paper describes scalable data structures for the URL frontier and the "seen-URL" set used to avoid crawling the same URL multiple times; it also discusses techniques for ensuring politeness without slowing down the crawl. Polybot was able to download 120 million pages over 18 days using four machines.

The IBM WebFountain crawler [56] represented another industrial-strength design. The WebFountain crawler was fully distributed. Its three major components were multi-threaded crawling processes ("Ants"), duplicate detection processes responsible for identifying downloaded pages with near-duplicate content, and a central controller process responsible for assigning work to the Ants and for monitoring the overall state of the system. WebFountain featured a very flexible crawl scheduling mechanism that allowed URLs to be prioritized, maintained a politeness policy, and even allowed the policy to be changed on the fly. It was designed from the ground up to support incremental crawling, i.e., the process of recrawling pages regularly based on their historical change rate. The WebFountain crawler was written in C++ and used MPI (the Message Passing Interface) to facilitate communication between the various processes. It was reportedly deployed on a cluster of 48 crawling machines [68].

UbiCrawler [21] is another scalable distributed web crawler. It uses consistent hashing to partition URLs according to their host component across crawling machines, leading to graceful performance degradation in the event of the failure of a crawling machine. UbiCrawler was able to download about 10 million pages per day using five crawling machines. UbiCrawler has been used for studies of properties of the African web [22] and to compile several reference collections of web pages [118].

Recently, Lee et al. described IRLbot [84], a single-process web crawler that is able to scale to extremely large web collections without performance degradation. IRLbot features a "seen-URL" data structure that uses only a fixed amount of main memory, and whose performance does not degrade as it grows. The paper describes a crawl that ran over two months and downloaded about 6.4 billion web pages. In addition, the authors address the issue of crawler traps (web sites with a large, possibly infinite number of low-utility pages, see Section 6.2), and propose ways to ameliorate the impact of such sites on the crawling process.
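Consistent hashing, the host-assignment scheme UbiCrawler uses, can be sketched briefly. Each machine owns several points on a hash ring, and a host is assigned to the machine owning the first point at or after the host's own hash (wrapping around); when a machine is added or removed, only the hosts it owned move. The replica count and hash function below are illustrative choices, not UbiCrawler's.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Maps host names to crawling machines via a hash ring."""
        def __init__(self, machines, replicas=100):
            points = [(self._hash(f"{m}#{i}"), m)
                      for m in machines for i in range(replicas)]
            points.sort()
            self.keys = [h for h, _ in points]
            self.owners = [m for _, m in points]
        @staticmethod
        def _hash(s):
            return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
        def machine_for(self, host):
            i = bisect.bisect_left(self.keys, self._hash(host)) % len(self.keys)
            return self.owners[i]

For example, ConsistentHashRing(["crawler0", "crawler1", "crawler2"]).machine_for("example.com") always yields the same machine, and if crawler1 fails, rebuilding the ring without it reassigns only the hosts crawler1 owned; hosts assigned to the surviving machines are untouched, which is the graceful-degradation property cited above.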
Finally, there are a number of open-source crawlers, two of which deserve special mention. Heritrix [78, 93] is the crawler used by the Internet Archive. It is written in Java and highly componentized, and its design is quite similar to that of Mercator. Heritrix is multi-threaded, but not distributed, and as such suitable for conducting moderately sized crawls. The Nutch crawler [62, 81] is written in Java as well. It supports distributed operation and should therefore be suitable for very large crawls; but as of the writing of [81] it has not been scaled beyond 100 million pages.

2.2 Architecture Overview

Figure 2.1 shows the high-level architecture of a prototypical distributed web crawler. The crawler consists of multiple processes running on different machines connected by a network. Each crawling process consists of multiple worker threads, and each worker thread performs repeated work cycles.

At the beginning of each work cycle, a worker obtains a URL from the Frontier data structure, which dispenses URLs according to their priority and to politeness policies. The worker thread then invokes the HTTP fetcher. The fetcher first calls a DNS sub-module to resolve the host component of the URL into the IP address of the corresponding web server (using cached results of prior resolutions if possible), and then connects to the web server, checks for any robots exclusion rules (which typically are cached as well), and attempts to download the web page.

If the download succeeds, the web page may or may not be stored in a repository of harvested web pages (not shown). In either case, the page is passed to the Link extractor, which parses the page's HTML content and extracts hyperlinks contained therein. The corresponding URLs are then passed to a URL distributor, which assigns each URL to a crawling process. This assignment is typically made by hashing the URL's host component, its domain, or its IP address (the latter requires additional DNS resolutions). Since most hyperlinks refer to pages on the same web site, assignment to the local crawling process is the common case.

Next, the URL passes through the Custom URL filter (e.g., to exclude URLs belonging to "black-listed" sites, or URLs with particular file extensions that are not of interest) and into the Duplicate URL eliminator, which maintains the set of all URLs discovered so far and passes on only never-before-seen URLs. Finally, the URL prioritizer selects a position for the URL in the Frontier, based on factors such as estimated page importance or rate of change.¹

¹ Change rates play a role in incremental crawlers (Section 2.3.5), which route fetched URLs back to the prioritizer and Frontier.

Fig. 2.1 Basic crawler architecture.
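The URL distributor's assignment rule described above amounts to a hash of the URL's host component (hashing by domain or IP address follows the same pattern). A minimal sketch, with hash function and naming chosen for illustration:

    import hashlib
    from urllib.parse import urlsplit

    def assign_process(url, num_processes):
        """Pick the index of the crawling process responsible for a URL."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_processes

Because most extracted links are relative, assign_process(link, n) usually equals the local process's own index, and forwarding a URL to another node is the exception rather than the rule.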
2.3 Key Design Points

Web crawlers download web pages by starting from one or more seed URLs, downloading each of the associated pages, extracting the hyperlink URLs contained therein, and recursively downloading those pages. Therefore, any web crawler needs to keep track both of the URLs that are to be downloaded, as well as those that have already been downloaded (to avoid unintentionally downloading the same page repeatedly). The required state is a set of URLs, each associated with a flag indicating whether the page has been downloaded. The operations that must be supported are: Adding a new URL, retrieving a URL, marking a URL as downloaded, and testing whether the set contains a URL. There are many alternative in-memory data structures (e.g., trees or sorted lists) that support these operations. However, such an implementation does not scale to web corpus sizes that exceed the amount of memory available on a single machine.

To scale beyond this limitation, one could either maintain the data structure (e.g., the tree or sorted list) on disk, or use an off-the-shelf database management system. Either solution allows maintaining set sizes that exceed main memory; however, the cost of accessing items in the set (particularly for the purpose of set membership testing) typically involves a disk seek, making it a fairly expensive operation. To achieve high performance, a more specialized approach is needed.

Virtually every modern web crawler splits the crawl state into two major data structures: One data structure for maintaining the set of URLs that have been discovered (whether downloaded or not), and a second data structure for maintaining the set of URLs that have yet to be downloaded. The first data structure (sometimes called the "URL-seen test" or the "duplicate URL eliminator") must support set addition and set membership testing, while the second data structure (usually called the frontier) must support adding URLs, and selecting a URL to fetch next.

2.3.1 Frontier Data Structure and Politeness

A straightforward implementation of the frontier data structure is a First-in-First-out (FIFO) queue. Such an implementation results in a breadth-first traversal of the web graph. However, this simple approach has drawbacks: Most hyperlinks on the web are "relative" (i.e., refer to another page on the same web server). Therefore, a frontier realized as a FIFO queue contains long runs of URLs referring to pages on the same web server, resulting in the crawler issuing many consecutive HTTP requests to that server. A barrage of requests in short order is considered "impolite," and may be construed as a denial-of-service attack on the web server. On the other hand, it would be wasteful for the web crawler to space out requests to the same server without doing other useful work in the meantime. This problem is compounded in a multithreaded or distributed crawler that issues many HTTP requests in parallel.

Most web crawlers obey a policy of not issuing multiple overlapping requests to the same server. An easy way to realize this is to maintain a mapping from web servers to crawling threads, e.g., by hashing the host component of each URL.² In this design, each crawling thread has a separate FIFO queue, and downloads only URLs obtained from that queue.

A more conservative politeness policy is to space out requests to each web server according to that server's capabilities. For example, a crawler may have a policy to delay subsequent requests to a server by a multiple (say 10×) of the time it took to download the last page from that server. This policy ensures that the crawler consumes a bounded fraction of the web server's resources. It also means that in a given time interval, fewer pages will be downloaded from slow or poorly connected web servers than from fast, responsive web servers. In other words, this crawling policy is biased toward well-provisioned web sites. Such a policy is well-suited to the objectives of search engines, since large and popular web sites tend also to be well-provisioned.

² To amortize hardware cost, many web servers use virtual hosting, meaning that multiple symbolic host names resolve to the same IP address. Simply hashing the host component of each URL to govern politeness has the potential to overload such web servers. A better scheme is to resolve the URL's symbolic host name to an IP address and use a hash of that address to assign URLs to a queue. The drawback of that approach is that the latency of DNS resolution can be high (see Section 2.3.3), but fortunately there tends to be a high amount of locality in the stream of discovered host names, thereby making caching effective.
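The adaptive delay rule just described (wait k times the last observed download time before contacting the same server again) can be captured in a few lines. This is a minimal sketch assuming a single-process crawler; the class name and interface are illustrative.

    import time

    class PolitenessGate:
        """After a page from `server` took x seconds to download, block
        further requests to that server for k*x seconds (k is the
        politeness parameter, say 10)."""
        def __init__(self, k=10.0):
            self.k = k
            self.next_allowed = {}    # server -> earliest next request time
        def wait_turn(self, server):
            delay = self.next_allowed.get(server, 0.0) - time.monotonic()
            if delay > 0:
                time.sleep(delay)
        def record(self, server, download_seconds):
            self.next_allowed[server] = (
                time.monotonic() + self.k * download_seconds)

Sleeping while a server's delay elapses is wasteful if there is other work to do; the Mercator design described next avoids that by always servicing whichever server becomes eligible soonest.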
The Mercator web crawler implemented such an adaptive politeness policy. It divided the frontier into two parts, a "front end" and a "back end." The front end consisted of a single queue Q, and URLs were added to the frontier by enqueuing them into that queue. The back end consisted of many separate queues; typically three times as many queues as crawling threads. Each queue contained URLs belonging to a single web server; a table T on the side maintained a mapping from web servers to back-end queues. In addition, associated with each back-end queue q was a time t at which the next URL from q may be processed. These (q, t) pairs were organized into an in-memory priority queue, with the pair with lowest t having the highest priority. Each crawling thread obtained a URL to download by removing the highest-priority entry (q, t) from the priority queue, waiting if necessary until time t had been reached, dequeuing the next URL u from q, downloading it, and finally reinserting the pair (q, t_now + k·x) into the priority queue, where t_now is the current time, x is the amount of time it took to download u, and k is a "politeness parameter"; typically 10. If dequeuing u from q left q empty, the crawling thread would remove the mapping from host(u) to q from T, and then repeatedly dequeue a URL u′ from Q and enqueue u′ into the back-end queue identified by T(host(u′)), until it found a u′ such that host(u′) was not contained in T. At this point, it would enqueue u′ in q and update T to map host(u′) to q.

In addition to obeying politeness policies that govern the rate at which pages are downloaded from a given web site, web crawlers may also want to prioritize the URLs in the frontier. For example, it may be desirable to prioritize pages according to their estimated usefulness (based for example on their PageRank [101], the traffic they receive, the reputation of the web site, or the rate at which the page has been updated in the past). The page ordering question is discussed in Section 4.

Assuming a mechanism for assigning crawl priorities to web pages, a crawler can structure the frontier (or in the Mercator design described above, the front-end queue) as a disk-based priority queue ordered by usefulness. The standard implementation of a priority queue is a heap, and insertions into a heap of n elements require log(n) element accesses, each access potentially causing a disk seek, which would limit the data structure to a few hundred insertions per second, far less than the URL ingress required for high-performance crawling.

An alternative solution is to "discretize" priorities into a fixed number of priority levels (say 10 to 100 levels), and maintain a separate URL FIFO queue for each level. A URL is assigned a discrete priority level, and inserted into the corresponding queue. To dequeue a URL, either the highest-priority nonempty queue is chosen, or a randomized policy biased toward higher-priority queues is employed.
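The Mercator frontier described above translates naturally into a heap of (t, host) pairs. The following is a simplified single-threaded sketch: it creates back-end queues on demand instead of keeping a fixed pool (Mercator used about three queues per crawling thread), and locking, persistence, and error handling are omitted.

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlsplit

    class MercatorFrontier:
        """Front end: one queue Q. Back end: one queue per host, a table
        T from hosts to queues, and a heap of (t, host) pairs, where t is
        the earliest time the host may be contacted again."""
        def __init__(self, k=10.0):
            self.k = k            # politeness parameter
            self.Q = deque()      # front-end queue
            self.T = {}           # host -> back-end queue
            self.heap = []        # (t, host), smallest t first

        def add(self, url):
            self.Q.append(url)
            if not self.heap:
                self._refill()

        def _refill(self):
            # Drain Q until a URL whose host has no back-end queue is
            # found; that host gets a fresh, immediately schedulable queue.
            while self.Q:
                url = self.Q.popleft()
                host = urlsplit(url).hostname
                if host in self.T:
                    self.T[host].append(url)
                else:
                    self.T[host] = deque([url])
                    heapq.heappush(self.heap, (time.monotonic(), host))
                    return

        def next_url(self):
            """Waits until some host is eligible, then returns (url, host)."""
            t, host = heapq.heappop(self.heap)
            time.sleep(max(0.0, t - time.monotonic()))
            return self.T[host].popleft(), host

        def done(self, host, x):
            """Report that the download from `host` took x seconds."""
            if self.T[host]:
                heapq.heappush(self.heap, (time.monotonic() + self.k * x, host))
            else:
                del self.T[host]
                self._refill()

A worker's cycle is then: url, host = frontier.next_url(); download; frontier.done(host, elapsed_seconds). Because the heap orders hosts by earliest eligible time, a thread never idles on a slow server while another server is ready to be crawled.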
2.3.2 URL Seen Test

As outlined above, the second major data structure in any modern crawler keeps track of the set of URLs that have been previously discovered and added to the frontier. The purpose of this data structure is to avoid adding multiple instances of the same URL to the frontier; for this reason, it is sometimes called the URL-seen test (UST) or the duplicate URL eliminator (DUE). In a simple batch crawling setting in which pages are downloaded only once, the UST needs to support insertion and set membership testing; in a continuous crawling setting in which pages are periodically re-downloaded (see Section 2.3.5), it must also support deletion, in order to cope with URLs that no longer point to a valid page.

There are multiple straightforward in-memory implementations of a UST, e.g., a hash table or Bloom filter [20]. As mentioned above, in-memory implementations do not scale to arbitrarily large web corpora; however, they scale much further than in-memory implementations of the frontier, since each URL can be compressed to a much smaller token (e.g., a 10-byte hash value). Commercial search engines employ distributed crawlers (Section 2.3.4), and a hash table realizing the UST can be partitioned across the machines in the crawling cluster, further increasing the limit of how far such an in-memory implementation can be scaled out.

If memory is at a premium, the state of the UST must reside on disk. In a disk-based hash table, each lookup requires a disk seek, severely limiting the throughput. Caching popular URLs can increase the throughput by about an order of magnitude [27] to a few thousand lookups per second, but given that the average web page contains on the order of a hundred links and that each link needs to be tested for novelty, the crawling rate would still be limited to tens of pages per second under such an implementation.

While the latency of disk seeks is poor (a few hundred seeks per second), the bandwidth of disk reads and writes is quite high (on the order of 50–100 MB per second in modern disks). So, implementations performing random file accesses perform poorly, but those that perform streaming sequential reads or writes can achieve reasonable throughput. The Mercator crawler leveraged this observation by aggregating many set lookup and insertion operations into a single large batch, and processing this batch by sequentially reading a set of sorted URL hashes from disk and writing them (plus the hashes of previously undiscovered URLs) out to a new file [94].

This approach implies that the set membership test is delayed: We only know whether a URL is new after the batch containing the URL has been merged with the disk file. Therefore, we cannot decide whether to add the URL to the frontier until the merge occurs, i.e., we need to retain all the URLs in a batch, not just their hashes. However, it is possible to store these URLs temporarily on disk and read them back at the conclusion of the merge (again using purely sequential I/O), once it is known that they had not previously been encountered and should thus be added to the frontier. Adding URLs to the frontier in a delayed fashion also means that there is a lower bound on how soon they can be crawled; however, given that the frontier is usually far larger than a DUE batch, this delay is imperceptible except for the most high-priority URLs.

The IRLbot crawler [84] uses a refinement of the Mercator scheme, where the batch of URLs arriving at the DUE is also written to disk, distributed over multiple files keyed by the prefix of each hash. Once the size of the largest file exceeds a certain threshold, the files that together hold the batch are read back into memory one by one and merge-sorted into the main URL hash file on disk. At the conclusion of the merge, URLs are forwarded to the frontier as in the Mercator scheme. Because IRLbot stores the batch on disk, the size of a single batch can be much larger than Mercator's in-memory batches, so the cost of the merge-sort with the main URL hash file is amortized over a much larger set of URLs.
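The batched, sequential-I/O merge at the heart of the Mercator scheme can be sketched as follows, under illustrative assumptions: URL hashes are 10-byte SHA-1 prefixes, and the on-disk set is a single flat file of sorted hashes. The real implementation streams far larger files and overlaps merging with crawling.

    import hashlib

    HASH_BYTES = 10   # a compact per-URL token, as discussed above

    def url_hash(url):
        return hashlib.sha1(url.encode()).digest()[:HASH_BYTES]

    def merge_batch(batch_urls, old_file, new_file):
        """Merge one batch into the sorted hash file using sequential I/O
        only. Returns the batch URLs that were not already known."""
        # Sort the batch by hash; the dict also dedupes within the batch.
        batch = sorted({url_hash(u): u for u in batch_urls}.items())
        novel = []
        with open(old_file, "rb") as old, open(new_file, "wb") as out:
            disk = iter(lambda: old.read(HASH_BYTES), b"")  # old hashes, in order
            current = next(disk, None)
            for h, u in batch:
                while current is not None and current < h:
                    out.write(current)          # copy smaller old hashes through
                    current = next(disk, None)
                if current == h:                # hash already on disk: not novel
                    out.write(current)
                    current = next(disk, None)
                else:                           # previously unseen URL
                    out.write(h)
                    novel.append(u)
            while current is not None:          # copy the remaining tail
                out.write(current)
                current = next(disk, None)
        return novel

Both files are read and written strictly front to back, so the merge runs at disk bandwidth rather than seek latency, which is precisely the observation the Mercator design exploits.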
In the Mercator scheme and its IRLbot variant, merging a batch of URLs into the disk-based hash file involves reading the entire old hash file and writing out an updated version. Hence, the time requirement is proportional to the number of discovered URLs. A modification of this design is to store the URL hashes on disk in sorted order as before, but sparsely packed rather than densely packed. The k highest-order bits of a hash determine the disk block where this hash resides (if it is present). Merging a batch into the disk file is done in place, by reading a block for which there are hashes in the batch, checking which hashes are not present in that block, and writing the updated block back to disk. Thus, the time requirement for merging a batch is proportional to the size of the batch, not the number of discovered URLs (albeit with a high constant due to disk seeks resulting from skipping disk blocks). Once any block in the file fills up completely, the disk file is rewritten to be twice as large, and each block contains hashes that now share their k+1 highest-order bits.

2.3.3 Auxiliary Data Structures

In addition to the two main data structures discussed in Sections 2.3.1 and 2.3.2 (the frontier and the UST/DUE), web crawlers maintain various auxiliary data structures. We discuss two: The robots exclusion rule cache and the DNS cache.

Web crawlers are supposed to adhere to the Robots Exclusion Protocol [83], a convention that allows a web site administrator to bar web crawlers from crawling their site, or some pages within the site. This is done by providing a file at URL /robots.txt containing rules that specify which pages the crawler is allowed to download. Before attempting to crawl a site, a crawler should check whether the site supplies a /robots.txt file, and if so, adhere to its rules. Of course, downloading this file constitutes crawling activity in itself. To avoid repeatedly requesting /robots.txt, crawlers typically cache the results of previous requests of that file. To bound the size of that cache, entries must be discarded through some cache eviction policy (e.g., least-recently used); additionally, web servers can specify an expiration time for their /robots.txt file (via the HTTP Expires header), and cache entries should be discarded accordingly.

URLs contain a host name component, which is "resolved" using the Domain Name Service (DNS), a protocol that exposes a globally distributed mapping from symbolic host names to IP addresses. DNS requests can take quite a long time due to the request-forwarding nature of the protocol. Therefore, crawlers often maintain their own DNS caches. As with the robots exclusion rule cache, entries are expired according to both a standard eviction policy (such as least-recently used), and to expiration directives.

2.3.4 Distributed Crawling

Web crawlers can be distributed over multiple machines to increase their throughput. This is done by partitioning the URL space, such that each crawler machine or node is responsible for a subset of the URLs on the web. The URL space is best partitioned across web site boundaries [40] (where a "web site" may refer to all URLs with the same symbolic host name, same domain, or same IP address). Partitioning the URL space across site boundaries makes it easy to obey politeness policies, since each crawling process can schedule downloads without having to communicate with other crawler nodes. Moreover, all the major data structures can easily be partitioned across site boundaries, i.e., the frontier, the DUE, and the DNS and robots exclusion caches of each node contain URLs, robots exclusion rules, and name-to-address mappings associated with the sites assigned to that node, and nothing else.
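A robots rule cache of the kind described above might look as follows. This is a minimal sketch: Python's standard urllib.robotparser supplies the fetching and rule evaluation, the capacity and default TTL are illustrative, and a fixed TTL stands in for honoring the per-response Expires header; network failures during the fetch are also left unhandled.

    import time
    from collections import OrderedDict
    from urllib import robotparser

    class RobotsCache:
        """LRU-bounded cache of parsed /robots.txt rules, with expiry."""
        def __init__(self, capacity=10_000, default_ttl=3600):
            self.capacity, self.default_ttl = capacity, default_ttl
            self.entries = OrderedDict()    # host -> (expires, parser)

        def allowed(self, agent, url, host):
            entry = self.entries.get(host)
            if entry is None or entry[0] < time.time():
                parser = robotparser.RobotFileParser(
                    f"http://{host}/robots.txt")
                parser.read()               # one crawl request, amortized
                entry = (time.time() + self.default_ttl, parser)
                self.entries[host] = entry
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)   # evict LRU entry
            else:
                self.entries.move_to_end(host)         # refresh LRU order
            return entry[1].can_fetch(agent, url)

A DNS cache follows the same pattern, with the resolved IP address (and its TTL from the DNS response) stored in place of the parsed rule set.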
Crawling processes download web pages and extract URLs, and thanks to the prevalence of relative links on the web, they will be themselves responsible for the large majority of extracted URLs. When a process extracts a URL u that falls under the responsibility of another crawler node, it forwards u to that node. Forwarding of URLs can be done through peer-to-peer TCP connections [94], a shared file system [70], or a central coordination process [25, 111]. The amount of communication with other crawler nodes can be reduced by maintaining a cache of popular URLs, used to avoid repeat forwardings [27].

Finally, a variant of distributed web crawling is peer-to-peer crawling [10, 87, 100, 112, 121], which spreads crawling over a loosely collaborating set of crawler nodes. Peer-to-peer crawlers typically employ some form of distributed hash table scheme to assign URLs to crawler nodes, enabling them to cope with the sporadic arrival and departure of crawling nodes.

2.3.5 Incremental Crawling

Web crawlers can be used to assemble one or more static snapshots of a web corpus (batch crawling), or to perform incremental or continuous crawling, where the resources of the crawler are divided between downloading newly discovered pages and re-downloading previously crawled pages. Efficient incremental crawling requires a few changes to the major data structures of the crawler. First, as mentioned in Section 2.3.2, the DUE should support the deletion of URLs that are no longer valid (e.g., that result in a 404 HTTP return code). Second, URLs are retrieved from the frontier and downloaded as in batch crawling, but they are subsequently reentered into the frontier. If the frontier allows URLs to be prioritized, the priority of a previously downloaded URL should be dependent on a model of the page's temporal behavior based on past observations (see Section 5). This functionality is best facilitated by augmenting URLs in the frontier with additional information, in particular previous priorities and compact sketches of their previous content. This extra information allows the crawler to compare the sketch of the just-downloaded page to that of the previous version, for example raising the priority if the page has changed and lowering it if it has not. In addition to content evolution, other factors such as page quality are also often taken into account; indeed there are many fast-changing "spam" web pages.
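The sketch-and-reprioritize step just described can be illustrated as follows. The content sketch here is a min-hash-style set of shingle (word k-gram) hashes, one of several possibilities; the similarity threshold and the multiplicative raise/lower rule are invented for illustration and are not a policy from the literature (Section 5 covers principled revisitation policies).

    import hashlib

    def content_sketch(text, k=5, size=64):
        """Keep the `size` smallest hashes of the page's word k-grams."""
        words = text.split()
        grams = {" ".join(words[i:i + k])
                 for i in range(max(1, len(words) - k + 1))}
        hashes = sorted(
            int.from_bytes(hashlib.sha1(g.encode()).digest()[:8], "big")
            for g in grams)
        return set(hashes[:size])

    def updated_priority(old_priority, old_sketch, new_text):
        """Compare the new page to the stored sketch; nudge the revisit
        priority up if the page changed, down if it did not."""
        new_sketch = content_sketch(new_text)
        overlap = (len(old_sketch & new_sketch)
                   / max(1, len(old_sketch | new_sketch)))
        changed = overlap < 0.9        # illustrative change threshold
        priority = old_priority * (1.5 if changed else 0.7)
        return priority, new_sketch

Only the small sketch, not the previous page content, needs to travel with the URL through the frontier, which keeps the per-URL state compact.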
3 Crawl Ordering Problem

Aside from the intra-site politeness considerations discussed in Section 2, a crawler is free to visit URLs in any order. The crawl order is extremely significant, because for the purpose of crawling the web can be considered infinite, due to the growth rate of new content, and especially due to dynamically generated content [8]. Indeed, despite their impressive capacity, modern commercial search engines only index (and likely only crawl) a fraction of discoverable web pages [11]. The crawler ordering question is even more crucial for the countless smaller-scale crawlers that perform scoped crawling of targeted subsets of the web.

Sections 3–5 survey work on selecting a good crawler order, with a focus on two basic considerations:

• Coverage. The fraction of desired pages that the crawler acquires successfully.
• Freshness. The degree to which the acquired page snapshots remain up-to-date, relative to the current "live" web copies.

Issues related to redundant, malicious or misleading content are covered in Section 6. Generally speaking, techniques to avoid unwanted content can be incorporated into the basic crawl ordering approaches without much difficulty.

3.1 Model

Most work on crawl ordering abstracts away the architectural details of a crawler (Section 2), and assumes that URLs in the frontier data structure can be reordered freely. The resulting simplified crawl ordering model is depicted in Figure 3.1. At a given point in time, some historical crawl order has already been executed (P1, P2, P3, P4, P5 in the diagram), and some future crawl order has been planned (P6, P7, P4, P8, ...).¹

In the model, all pages require the same amount of time to download; the (constant) rate of page downloading is called the crawl rate, typically measured in pages/second. (Section 2 discussed how to maximize the crawl rate; here it is assumed to be fixed.) The crawl rate is not relevant to batch crawl ordering methods, but it is a key factor when scheduling page revisitations in incremental crawling.

Fig. 3.1 Crawl ordering model.

¹ Some approaches treat the crawl ordering problem hierarchically, e.g., select a visitation order for web sites, and within each site select a page visitation order. This approach helps mitigate the complexity of managing a crawl ordering policy, and is well aligned with policies that rely primarily on site-level metrics such as site-level PageRank to drive crawl ordering decisions. Many of the insights about page-level crawl ordering also apply at the site level.

Pages downloaded by the crawler are stored in a repository. The future crawl order is determined, at least in part, by analyzing the repository. For example, one simple policy mentioned earlier, breadth-first search, extracts hyperlinks from pages entering the repository, identifies linked-to pages that are not already part of the (historical or planned) crawl order, and adds them to the end of the planned crawl order.

The content of a web page is subject to change over time, and it is sometimes desirable to re-download a page that has already been downloaded, to obtain a more recent snapshot of its content. As mentioned in Section 2.3.5, two approaches exist for managing repeated downloads:

• Batch crawling. The crawl order does not contain duplicate occurrences of any page, but the entire crawling process is periodically halted and restarted as a way to obtain more recent snapshots of previously crawled pages. Information gleaned from previous crawl iterations (e.g., page importance score estimates) may be fed to subsequent ones.

• Incremental crawling. Pages may appear multiple times in the crawl order, and crawling is a continuous process that conceptually never terminates.

It is believed that most commercial crawlers perform incremental crawling, which is more powerful because it allows re-visitation of pages at different rates. (A comparison between incremental and batch crawling is made by Cho and García-Molina [39].)
3.1.1 Limitations

This model has led to a good deal of research with practical implications. However, as with all models, it simplifies reality. For one thing, as discussed in Section 2, a large-scale crawler maintains its frontier data structure on disk, which limits opportunities for reordering. Generally speaking, the approach of maintaining a prioritized ensemble of FIFO queues (see Section 2.3.1) can be used to approximate a desired crawl order. We revisit this issue in Sections 4.3 and 5.3.

Other real-world considerations that fall outside the model include:

• Some pages (or even versions of a page) take longer to download than others, due to differences in size and network latency.
• Crawlers take special care to space out downloads of pages from the same server, to obey politeness constraints, see Section 2.3.1. Crawl ordering policies that assume a single crawl rate constraint can, at least in principle, be applied on a per-server basis, i.e., run n independent copies of the policy for n servers.
• As described in Section 2, commercial crawlers utilize many simultaneous page downloader threads, running on many independent machines. Hence rather than a single totally ordered list of pages to download, it is more accurate to think of a set of parallel lists, encoding a partial order.
• Special care must be taken to avoid crawling redundant and malicious content; we treat these issues in Section 6.
• If the page repository runs out of space, and expanding it is not considered worthwhile, it becomes necessary to retire some of the pages stored there (although it may make sense to retain some metadata about the page, to avoid recrawling it). We are not aware of any scholarly work on how to select pages for retirement.
3.2 Web Characteristics

Before proceeding, we describe some structural and evolutionary properties of the web that are relevant to the crawl ordering question. The findings presented here are drawn from studies that used data sets of widely varying size and scope, taken at different dates over the span of a decade, and analyzed via a wide array of methods. Hence, caution is warranted in their interpretation.

3.2.1 Static Characteristics

Several studies of the structure of the web graph, in which pages are encoded as vertices and hyperlinks as directed edges, have been conducted. One notable study is by Broder et al. [26], which uncovered a "bowtie" structure consisting of a central strongly connected component (the core), a component that can reach the core but cannot be reached from the core, and a component that can be reached from the core but cannot reach the core. (In addition to these three components there are a number of small, irregular structures such as disconnected components and long "tendrils.")

Hence there exist many ordered pairs of pages (P1, P2) such that there is no way to reach P2 by starting at P1 and repeatedly following hyperlinks. Even in cases where P2 is reachable from P1, the distance can vary greatly, and in many cases hundreds of links must be traversed. The implications for crawling are: (1) one cannot simply crawl to depth N, for a reasonable value of N like N = 20, and be assured of covering the entire web graph; (2) crawling "seeds" (the pages at which a crawl commences) should be selected carefully, and multiple seeds may be necessary to ensure good coverage.

In an earlier study, Broder et al. [28] showed that there is an abundance of near-duplicate content on the web. Using a corpus of 30 million web pages collected by the AltaVista crawler, they used the shingling algorithm to cluster the corpus into groups of similar pages, and found that 29% of the pages were more than 50% similar to other pages in the corpus, and 11% of the pages were exact duplicates of other pages. Sources of near-duplication include mirroring of sites (or portions of sites) and URL synonymy, see Section 6.1.

Chang et al. [35] studied the "deep web," i.e., web sites whose content is not reachable via hyperlinks and instead can only be retrieved by submitting HTML forms. The findings include: (1) there are over one million deep web sites; (2) more deep web sites have structured (multi-field) query interfaces than unstructured (single-field) ones; and (3) most query interfaces are located within a few links of the root of a web site, and are thus easy to find by shallow crawling from the root page.

3.2.2 Temporal Characteristics

One of the objectives of crawling is to maintain freshness of the crawled corpus. Hence it is important to understand the temporal characteristics of the web, both in terms of site-level evolution (the appearance and disappearance of pages on a site) and page-level evolution (changing content within a page).

3.2.2.1 Site-Level Evolution

Dasgupta et al. [48] and Ntoulas et al. [96] studied creation and retirement of pages and links inside a number of web sites, and found the following characteristics (these represent averages across many sites):

• New pages are created at a rate of 8% per week.
• Pages are retired at a rapid pace, such that during the course of one year 80% of pages disappear.
• New links are created at the rate of 25% per week, which is significantly faster than the rate of new page creation.
• Links are retired at about the same pace as pages, with 80% disappearing in the span of a year.
• It is possible to discover 90% of new pages by monitoring links spawned from a small, well-chosen set of old pages (for most sites, five or fewer pages suffice, although for some sites hundreds of pages must be monitored for this purpose). However, discovering the remaining 10% requires substantially more effort.

3.2.2.2 Page-Level Evolution

Some key findings about the frequency with which an individual web page undergoes a change are:

• Page change events are governed by a Poisson process, which means that changes occur randomly and independently, at least in the case of pages that change less frequently than once a day [39].²
• Page change frequencies span multiple orders of magnitude (sub-hourly, hourly, daily, weekly, monthly, annually), and each order of magnitude includes a substantial fraction of pages on the web [2, 39]. This finding motivates the study of non-uniform page revisitation schedules.
• Change frequency is correlated with visitation frequency, URL depth, domain and topic [2], as well as page length [60].
• A page's change frequency tends to remain stationary over time, such that past change frequency is a fairly good predictor of future change frequency [60].

² A Poisson change model was originally postulated by Coffman et al. [46].

Unfortunately, it appears that there is no simple relationship between the frequency with which a page changes and the cumulative amount of content that changes over time. As one would expect, pages with moderate change frequency tend to exhibit a higher cumulative amount of changed content than pages with a low change frequency. However, pages with high change frequency tend to exhibit less cumulative change than pages with moderate change frequency. On the encouraging side, the amount of content that changed on a page in the past is a fairly good predictor of the amount of content that will change in the future (although the degree of predictability varies from web site to web site) [60, 96].

Many changes are confined to a small, contiguous region of a web page [60, 85], and/or only affect transient words that do not characterize the core, time-invariant theme of the page [2]. Much of the "new" content added to web pages is actually taken from other pages [96].

The temporal behavior of (regions of) web pages can be divided into three categories: Static (no changes), churn (new content supplants old content, e.g., quote of the day), and scroll (new content is appended to old content, e.g., blog entries). Simple generative models for the three categories collectively explain nearly all observed temporal web page behavior [99].
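The Poisson change model cited above has a simple quantitative form. If a page changes according to a Poisson process with rate λ (changes per day), the probability that it has changed within t days of the last download is 1 − e^(−λt), and λ can be estimated from observed visits. A short sketch; the estimator shown is the naive one (visits with an observed change divided by total observation time), which underestimates λ when several changes fall between consecutive visits.

    import math

    def prob_changed(lam, t_days):
        """Probability a page with change rate `lam` (changes/day) has
        changed within t_days of the last download: 1 - exp(-lam * t)."""
        return 1.0 - math.exp(-lam * t_days)

    def estimate_rate(visits_with_change, total_days_observed):
        """Naive Poisson rate estimate from revisit observations."""
        return visits_with_change / total_days_observed

    # Example: a page observed to have changed on 6 of the visits spread
    # over 30 days has an estimated rate of 0.2/day; two days after the
    # last crawl it has changed with probability 1 - exp(-0.4), about 33%.

Estimates of this kind are the raw input to the revisitation policies of Section 5, which decide how to spend a fixed crawl rate across pages with very different change rates.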
Most web pages include at least some static content, resulting in an upper bound on the divergence between an old snapshot of a page and the live copy. The shape of the
