Report

An EfficientTopology-Adaptive Membership Protocol for Large-Scale

To view this page ensure that Adobe Flash Player version 9.0.124 or greater is installed.

Get Adobe Flash player
Please login or register to make a comment!

...Description...... more. less.

19 0 5 10 15 20 25 30 35 40 45 50 Elapsed Time (sec) Throughput (req/sec)  Replica #1 disconnected  Failure detected Online Auction Service (Throughput) Throughput Request Rate Figure2: Serviceavailabilityduringthefailuredetectiontime.<br><br> Onestraightforwardapproachofamembershipservice istoleteverynodeperiodicallysenditsheartbeatsto othernodesandmeanwhilecollectheartbeatsfromother nodes[28].Thisisanall-to-allapproach.Eachheartbeat packetcontainsserviceinformationandcurrentloadstatus ofanode.Everynodebuildsitsownmembershipdirectory basedontheseheartbeatpackets.Thisisafullydistributed approachinthesensethateverynodemaintainsitsmem- bershipdirectoryindependently.Thisschemeworksfairly wellforaclusterofasmallormediumsizeanditcantoler- ateswitchfailureandunexpectednetworkpartitions.Butit isnotscalableforalargeclusterwiththousandsofnodesfor whichcommunicationandcomputationoverheadinmain- tainingalocalyellowpagedirectorycanbesigniLcant 1 . 1Itispossibletomodifythisalgorithmwithalazybroadcastap- proachifonlycertaininformationisneeded,suchasadetectedfailure Weperformedanexperimentthatmeasuresthemember- shipserviceoverheadimposedonaLinuxmachinewith dual1.4GHzP-IIIprocessorswhenvaryingthenumberof heartbeatpacketsreceivedbythismachine.Theresultof thisexperimentisshowninFigure3.Ifeachnodesendsa 1024-byteheartbeatpacketpersecond,theheartbeatpack- etscanconsume4MB/sbandwidthforeachnodeinaclus- terwith4000nodes,whichis32%oftherawbandwidthof aFastEthernetlinkfromthisnode.Itshouldbenotedthat GigabitcardsforPCmachinesandaGigabitswitchwith24 Gigabitportsorlessarecheapintoday 9smarket,howevera high-endsingleGigabitEthernetswitchwithalargenum- berofGigabitports(e.g. > 40 )isstillveryexpensive.As aresult,manylarge-scaleclusterswithhundredsorthou- sandsofmachinesinaproductionenvironmentuseGiga- bitswitcheswithfastEthernetports.Itispossibletolinka fewGigabitswitchesof24Gigabitports,andthencommu- nicationbandwidthamongtheseswitchesislimitedwithin fewGigabitsandtheoverallall-to-allcommunicationvol- umewilleasilysurpasssuchalimit.<br><br> 0 500 1000 1500 2000 2500 3000 3500 4000 0 1000 2000 3000 4000 Cluster Size (in number of nodes) Received Packets Bandwidth Overhead 0 500 1000 1500 2000 2500 3000 3500 4000 0 1 2 3 4 Bandwidth (MB/s) Figure3: Communicationoverheadoftheall-to-allapproachas theclustersizeincreases. Analternativeapproachusedinwide-areanetworkap- plicationsisgossipstylemembershipservice[24].Inagos- sipstyleapproach,eachnoderandomlyselectsasetof neighbornodesandsendsthemitscurrentmembershipdi- rectory.Therecipientnodesthenupdatetheirowndirec- toriesbasedonthenewinformation.Givenarateoferror tolerance,thisapproachcancontroltheamountofheart- beattrafLcbyselectinganappropriatenumberofneigh- bornodes.Thiscapabilityisessentialinwide-areaservices wherebandwidthislimited,networklatencyishighandfast multicastisusuallyunavailable.Whenitcomestolowla- (namely,broadcastonlywhenthereisafailure).Thensuchascheme maynotbegeneralforprovidingaperiodicupdateofotherservicesta- tusrequiredbyapplications. tencyandhighbandwidthsystemareanetworks,agossip stylemembershipservicehastheproblemofslowconver- gencetimeandhighcommunicationoverheadformaintain- ingacompleteviewoftheservicecluster.Furthermore,its probabilisticpropertydoesnotguarantee100%accuracy, whichcanbeunacceptableforsomehighreliableservices.<br><br> Ourintentionistopursueahierarchicalapproachwitha treestructureforcommunicatingmembershipinformation inalarge-scaleclustersocommunicationcostcanbere- ducedwhilechangedetectiontimeisstillcompetitive.Tree communicationhasbeenstudiedinthepreviousworkin variousdifferentcontextsandthechallengeistodevisea schemewhichistopology-awareandadaptivesothatnet- workcommunicationislocalizedandminimizedwhilenet- workfailurecanbeeffectivelytolerated. 3.Topology-AdaptiveHierarchicalMember- shipService Formachinesinalarge-scalecluster,weformahierar- chicaltreeamongnodes.Notethattheunderlyingnetwork topologydoesnotnecessarilyneedtobeatree.However, eachnodeisassumedtojointhemembershipgroupthrough onlyonenetworkinterface,whichisgenerallytrueinprac- tice.Ingeneral,aninternaltreenodepassesthemember- shipinformationfromitsparenttochildrennodes,andalso itcollectstheinformationfromitschildrenandpropagates toitsparent.Themembershipinformationfromonechild ofanodewillbepropagatedtootherchildrenofthisnode. Inourprotocol,aninternaltreenodeanditschildren formacommunicationgroup.Multicastisusedwithin thegroupforcommunication.Ahierarchicaltreeiscre- atedbasedonthenetworktopologyoflevel-3switchesby exploitingtheTime-To-Live(TTL)LeldinanIPpacket header.TheoriginalpurposeofTTListopreventpackets fallingintoinLniteroutingloops.WhenanIPpacketpasses arouter,therouterdecreasestheTTLofthispacketbyone andforwardsthispackettotherightsubnet.Whenthecount reacheszero,thepacketwillbediscarded.Weexploitthis featuretolimitthescopeofmulticastpacketswithineach communicationgroup.<br><br> 3.1.Topology-adaptiveGroupFormationfor NodeJoining Thekeyideaofourtopologyadaptivestrategyistoform anumberofsmallmulticastgroupsamongclusternodesus- ingthetopologyinformation.Theoverlappingofthesemul- ticastgroupsformsahierarchicalstructureamongnodes. ThemulticastwithineachgroupishighlyefLcientandthe scalabilityofthemembershipprotocolisachievedbydivid- ingnodesintosmallgroups.Inthisscheme,eachnodejoins join(local_addr) { ttl=1; channel=BASE_MULTICAST_CHANNEL; while (ttl<MAX_TTL){ join_group(channel,ttl); if (channelhasaleader){ bootstrapwiththeleader; break ; } else { elect_leader(channel); /*ifthereisonlyonenodein thisgroup,thisnodeistheleader*/ } if (is_leader(local_addr)){ ttl=ttl+1 channel=next_channel(channel); continue ; } break ; } } Figure4: Pseudo-codeforanodetojoinacommunicationgroup. amulticastgroupwiththesameTTLvalueandeachmul- ticastgrouphasacommunicationchannel(followingthe UDPprotocol).Aslevel3switchesseparatenodemulti- castcommunicationwithdifferentTTLvalue,atopology- awarehierarchycanbeformed,whichisalsoadaptiveto anytopologychangelateron.Figure4illustratesthegroup joiningprocedureperformedbyeachnodeatstartuptime.<br><br> Initiallywhenanodejoinsin,itsTTLvalueissetto oneanditusesthebasechannelofthemembershipproto- col.Bylisteningtotheselectedmulticastchannel,thenode canLndifthereisaleaderforthegroup.Ifthereisaleader, thenodewilluseabootstrapprotocoltoquicklybuilditslo- calyellow-pagedirectory.Otherwise,anelectionprocessis performed.Ifthisnewly-joinednodeisaleaderinthismul- ticastgroup,itincreasesitsTTLvalueandjoinsthemem- bershipchannelatahigherlevel.Thisprocesscontinuesun- tilthemaximumTTLcountisreached,whichisthelargest possiblehopcountaccordingtoitsnetworktopologyofthe cluster. Thebootstrapprotocolallowsanewlyjoinednodeto quicklybuilditslocalyellow-pagedirectory.Afterknow- ingtheleader,thenodecontactstheleadertoretrievethe membershipinformationthattheleaderknows.Meanwhile, theleaderofthiscommunicationgroupalsoqueriesthis newlyjoinednodeforitsmembershipinformationincase thatthenewnodeisaleaderforanothergroupinalower level.Whenthenewinformationisobtained,thegroup leaderofthisnewnodepropagatestheinformationfurther toallgroupmembersusinganupdatepropagationproto- col,whichwillbediscussedinSection3.3. Groupelectiondeterminesaleaderforagroupusingthe bullyalgorithm[5].EachnodeisassignedauniqueID(e.g., IPaddress).ThenodewiththelowestIDbecomesthegroup leader.Ifthereisalreadyagroupleader,anodewillnot participatetheleaderelection.Toreducethechanceofre- electionduetofailureofaleader,eachgroupmaintainsa groupleaderandabackupleader.Thebackupleaderisran- domlychosenbythegroupleaderanditwilltakeoverthe leadershipiftheprimaryleaderfails.Thisallowsquickre- coveryifonlytheprimaryleaderfails.Whenboththepri- maryandthebackupleaderfail,theelectionalgorithmis performedtoselectanewleader,whichwilldesignatea backupleaderthereafter.<br><br> Itshouldbenotedthatifagrouponlyhasonemem- ber(i.e.aLrstjoinednode),thenthisnodeisthedefault leader.Therecouldbeanumberofmulticastgroupswith onemember,especiallywhenthejoiningprocessreaches ahighTTLvalue.Westillkeepsuchgroupsfortopology adaptivitybecausesomenewnodesmayjoininthefuture. SincethemaximumTTLvalueisnormallysmallinalarge cluster,thereshouldnotbemanysuchgroups.Thecommu- nicationcostisnegligibleforgroupswithonlyonemem- ber,becausethecorrespondingroutermulticasttreeisvery small. 3.2.PropertiesandExamples Theabovegroupformationprocesscreatesahierarchical treewhereleavesareclusternodesandinternaltreenodes areleaderselected.WedeLnethelevelofagroupasthe TTLvalueofthegroupminusone,andwecanprovethata hierarchicaltreederivedbasedonthetopologyhasfollow- ingproperties: " Allalivenodesintheclusterwillbeeventuallyin- cludedinthehierarchicaltreeifthereisnonet- workpartition.Withnetworkpartitions,forestswill beformedandeverynodewillbeinoneoftrees.That meansthatthestatuschangeofanodewillbepropa- gatedtoallnodesconnectedinthesametree.<br><br> " Ifanodeispresentinagroupofcertainlevel,itis awareofthegroupleaderwhenthegroupisinthe steadystate.Therefore,itcaninformitsgroupleader whenachangeisdetected,i.e.messagescanbeprop- agatedupwards. " Ifanodeispresentatlevel i ,itmustjoinasleadersin lowerlevelgroups, 0 , 1 ,..., i 2 1 .Thismeansachange messageatlevel i willbepropagatedtolowerlevels, i.e.messagescanbepropagateddownwards. Weillustratethenodejoiningprotocolwithtwoexam- ples.TheLrstexampleisbasedonCaliforniadatacenterof Figure1inSection2.Intotal,fourmulticastgroupshave beenformedtobuildahierarchicalmembershippropaga- tiontreewiththemaximumTTLvalueas2.Groups 0 a , 0 b ,and 0 c areformedwithTTLvalueone.WithTTLvalue one,packetsfromonememberofthesethreegroupscan- notpassthelevel-3switchtoreachanothergroup.Eachof thethreegroupselectsaleader,whichjoinsGroup 1 a with TTLvaluetwo.<br><br> Group 2a 239.255.0.22 TTL=3 Group 2b 239.255.0.22 TTL=3 A B C A B C Group 1a 239.255.0.21 TTL=2 A B C Group 1b 239.255.0.21 TTL=2 Group 1c 239.255.0.21 TTL=2 Group 0a 239.255.0.20 TTL=1 A B C Group 0b 239.255.0.20 TTL=1 Group 0c 239.255.0.20 TTL=1 B Group 3a 239.255.0.23 TTL=4 Figure5: TheleftandtherightoftheLgureshowphysicalnet- worklayoutandmembershipgroupsrespectively. ThesecondexampleisillustratedinFigure5withnine multicastgroupsformedinthiscaseforhierarchicalcom- munication.Nodes A , B and C areleadersforgroups 0 a, 0 b, and 0 c respectively.Inthenextlevel,node A forms itsowngroupwithTTL=2.Thereasonisthatnode A can- notreachnode B withtwohops.Thennodes A and B form agroupcalled2awithTTL=3becausetheycanreacheach otherwith3hops.Similarlynode B and C formgroup2b withTTL=3.Weassumethat B isselectedasaleaderin group2a.Ingroup2b, B isalsoelected.Then B formsa groupbyitselfwithTTL=4. 3.3.UpdateProtocolforClusterChange Membershipchangeshouldbemadeawaretothenodes intheentirecluster.Inourscheme,amulticastgroup leaderpropagatessuchinformationpromptlybynotifying itsmembersanditsparentgroup.<br><br> Wehavediscussedthenodejoiningprocessintheabove subsection.Fordetectingthedepartureofanodedueto failureoranoperationdecision,weusetheheartbeating method.Anodealwayscontinuouslymulticastsitsavail- ability(heartbeatmessages)ineachmulticastgroupitre- sides.Sincemulticastheartbeatpacketsmaygetlost,anode isconsidereddead only whennoheartbeatpacketisre- ceivedfromthenodeafterapre-deLnedtimeperiod. Whenamulticastgroupleaderreceivesanupdatefrom itschildgroup,itneedstofurthermulticasttheupdatein- formationinitsparentmulticastgroup.Similarlywhena leaderreceivesanupdatemessagefromitsparentmulticast group,itneedstomulticastthenewinformationtoitschild multicastgroup.Inthisway,anupdateofnodestatuscan bepropagatedtotheentireclusterquickly.Figure6illus- tratesthepropagationofanupdatemessage.Nodes B , E and H aretheleadersofgroups 0 a, 0 b, and 0 c respectively. Node E istheleaderofgroup 1 a .Assumenode C isdead ingroup 0 a ,andnode B detectsthisfailureandremoves C fromitsmembershipdirectory.Thennode B alsomulti- caststhisupdatetoitsparentgroup 1 a atStep2.Atstep3, node E forwardsthisinformationtoallnodesingroup 0 b throughmulticastafteritreceivesthisupdateatgroup1a.<br><br> Atstep4,allnodesatgroup 0 b updateitslocalmember- shipdirectory.Concurrently,allnodesatgroup 0 c update itsmembershipdirectoryalsoandexcludenode C . Level 0 Level 1 Level 2 Figure6: Propagationofanupdatemessage.Thecirclednumbers showthepropagationorderoftheupdatemessage. Wewanttoaddtwotechniquesinthedesignoftheabove updateprotocol.<br><br> " Handlingunexpectednetworkpartitionsorswitch failurewithhierarchicaltimeout .Whenaleader bringsinnewinformationonthegrouporsubgroup itmanages,othergroupsneedtorememberthatthis leaderisinchargeforthespeciLcgroup.Ifagroup leaderisconsideredtobedead,thenallnodesman- agedbythisleaderareconsidereddeadtentativelyby othergroups,mainlyfordetectingaswitchfailure. ForexampleinFigure6,ifnode B isdead,itis possiblethatitiscausedbyanunexpectednetwork partitionorswitchfailuresothatallnodesingroup 0acanbenon-accessiblefromgroups 0 b and 0 c .The othermulticastgroupsinthesystemLrstassumethat allnodesingroup 0 a aredeadandpurgethemfromthe membershipdirectoryasapossibleswitchfailure,and thenletthenewleaderingroup 0 a re-announcetheir availabilityinthesubsequentiteration. Sinceittakestimetoremoveassumeddeadnodes fromthemembershiptableofanodeandthenadd themback,tominimizetheimpactofhandlingthefail- ureofaleader,differenttimeoutvaluesareassigned formulticastgroupsatdifferentlevels.Higherlevel groupsareassignedwithlargertimeoutvalue,namely alongerheartbeatwaitingperiodindeterminingifa nodeisdeadornot.Inthisway,whenagroupleader fails,anewleadercanbequicklyselectedtoreplace theoldleaderbeforethehigherlevelgroupdetects thefailure.Thiscanavoidtheunnecessarypurgingof nodesfromtheavailablemembershipdirectory.<br><br> " Deltainformationupdatingandrecoveryoflost messages. Whenaleaderupdatesnewinformationto amulticastgroup,itonlyannouncesthechangedpor- tiontominimizecommunicationsize.BecauseUDP multicastpacketscanbelostduringnetworktransmis- sion,tohelpdetectapacketloss,eachhostassignsa sequencenumberforanupdatemessage.Thusthere- ceivercanusethesequencenumbertodetectlostup- dates.Sinceeachupdateaboutanodedepartureor joinhasaveryshortmessage,weletanupdatemes- sagepiggybacklastthreeupdatessothatthereceiver cantolerateuptothreeconsecutivepacketlosses.If morethanthreeconsecutivepacketsarelost,there- ceiverwillpollthesendertosynchronizeitsmember- shipdirectory. 4.ScalabilityAnalysis Inthissection,wepresentanalyticresultstocompare thehierarchicalapproachwithtwoalternatives:theall-to- allapproachandthegossipapproachinaclusterenviron- ment.Weassumemulticastisusedtodisseminatemessages toagroupofnodes,andunicastisusedforonetoonecom- munication,suchasgossipmessages.<br><br> Weusethefollowingthreemetricsforacomparison: " Failuredetectiontime ( T fail )istheearliesttimethat afailureisdetectedbyanyofothernodes.Noticethat whenafailureisdetectedbyonenode,itmaynotbe knowntoothers. " Viewconvergencetime isthelengthofaninterval fromthetimeofastatuschangetothetimethatall nodeshaveaconsistentviewofthechange. " Communicationcost inmaintainingaprotocol.We measureitusingcommunicationbandwidthconsump- tionrequirementpersecond.<br><br> Sincefastdetectioncanrequiremorecommuni- cationvolumetomakesureaneweventisdetected promptlywhilewealsopreferlowcommunication cost,weusethemetricBDPthatcombinesbandwidth consumptionandfailuredetectiontime.Assumethe bandwidthconsumptionofaschemeis B bytesper secondwhentheclusterstatusisstable,themetricis calculatedas BDP = B × T fail ,where T fail isthe detectiontimeofanodefailure.Protocolswithlower BDP valuesarebetter,becausetheyuselesstimeto detectafailurewithaLxedbandwidth. 4.1.FailureDetectionTimeandCommunication Cost Let m betheaveragesizeofaprotocolmessage, n bethe totalnumberofnodesinacluster,and B bethetotalband- widthallowed.Thefailuredetectiontimeand BDP arecal- culatedasfollows. " All-to-all.<br><br> Themulticastfrequencyislimitedby f = B m × n 2 sinceeachnodereceivesheartbeatsfromall othernodesandsendsoutoneheartbeatpermulticast cycle.Thuseachnodeconsumes m × n bandwidth percycle.Iftheprotocolassumesanodeisdeadaf- ternothearing p consecutiveheartbeatsfromthenode, thefailuredetectiontimeis T fail = p f = p × m × n 2 B , and BDP = B × T fail = p × m × n 2 = O ( n 2 ) . Inpractice,eachnodeoftenLxesitsmulticastfre- quency,whichisindependentofthenumberofnodes. Thismakesthefailuredetectiontimeasaconstantand thenthetotalamountofnetworkbandwidthconsump- tion B willbecome O ( n 2 ) .<br><br> " Gossip. Eachgossipmessagecontainsalocalview ofclustermembership.Aseachnodeaccumu- latestheglobalviewincrementallyfollowingaran- domizedmanner,thesizeofthelocalviewreaches m × n byteseventually.Thenbandwidthconsump- tionis O ( mn 2 ) .Wedonotcountthebroadcast messageinthegossipschemesinceitcanbeelimi- natedunderoptimization.Thefrequencyofgossipcan becalculatedas f = B m × n 2 .Forthisscheme,thepre- viousresearchshowsthatthenumberoffailurede- tectionstepsis O (log n ) [24].Thuswecomputethe failuredetectiontimeandBDPas: T fail = O (log n ) f = O (log n ) × m × n 2 B , and BDP = B × T fail = O ( n 2 log n ) . Ifthegossipapproachconsumes O ( n 2 ) amountof bandwidth,thenthefailuredetectiontimewillbe O (log n ) .<br><br> " Hierarchical. Assumethesizeofeachmulticastgroup withavaryingTTLvalueislimitedtoaconstantof k nodes,theheightofthemembershiptreeislimitedby log k n .Addingupthenumberofgroupsateachlevel, wegetthenumberoftotalgroups g = n k + n k 2 + ... + n k log k n = n 2 1 k 2 1 Themulticastfrequencyis f = B ( g × m × k 2 ) sinceeach grouphas k nodeswhichconsume m × k 2 bandwidth.<br><br> Iftheprotocolassumesanodeisdeadafternothear- ing p consecutiveheartbeatsfromthenode,thefailure detectiontimeis T fail = p f = p × g × m × k 2 B , and BDP = B × T fail = p × n 2 1 k 2 1 × m × k 2 = O ( n ) . IfeachnodeLxesitsmulticastfrequencyastheall- to-allapproachdoes,thefailuredetectiontimeisa constantandthetotalamountofnetworkbandwidth willbecome O ( n ) ,whichismorescalablethanthe othertwomethods. 4.2.ViewConvergenceTime Viewconvergencetimeincludesthefailuredetection timeandthetimetodisseminatethisinformationtoallother nodes.Similarly,wecandeLneametricBCPwhichcom- binesbandwidthconsumptionwithconvergencetime,mea- suretheeffectivenessofthethreeapproaches: BCP = B × T converge .<br><br> FortheGossipandtheMatall-to-allscheme,theview convergencetimeisthesameastheirfailuredetectiontime sinceallnodesmaintaintheirviewsindependently.Thus, theirBCPvaluesare O ( n 2 ) and O ( n 2 log n ) ,respectively. Forthehierarchicalscheme,theviewconvergencetime isthefailuredetectiontimeplusthetimetodisseminate thisinformationalongthehierarchicaltreewhoseheightis log k n .AnupdatemessagewillLrsttraveluptotherootof thetreeandpropagatedowntothebottomofthetree.As- sumethenetworktransmissiontimeofanupdatemessageis » ,thewholepropagationwilltake 2 » log k n .Thus,thecon- vergencetimeis T converge = T fail +2 » log k n, and BCP = B × T converge = O ( n )+ O ( B × log k n ) . Intheworstcase, B canbe O ( n ) ,thehierarchicalscheme hasthebestscalabilityintermsofBCP.<br><br> Ifwejustlookattheviewconvergencetimeassum- ingthathighcommunicationcostisallowed,theall-to- allmethodhasaconstanttimewhilethegossipmethod has O (log n ) delayandthehierarchicalapproachhave O ( log k n ) delay.Inpractice,theconvergencerateforthehi- erarchicalmethodcanbeveryfastbecause » isverysmall and k canbechosenquitelarge.Thegossipmethodcanbe slowinviewconvergencesinceupdatingrandomlydoesnot followadeterministicnotiLcationpath. Insummary,theall-to-allmethodhasashortconver- gencerateandfailuredetectiontime,butitisnotscalablein termsofcommunicationcost.Thecommunicationscheme ingossipmethodisalsonotscalableanditsfailuredetec- tiontimeisslowerthanothers.Thehierarchicalmethodcan haveareasonablefailuredetectiontimeandconvergence ratewithascalablecommunicationscheme. 5.ImplementationandEvaluation Inthissection,weLrstillustratetheimplementationof thehierarchicalmembershipservice,whichisthenevalu- atedandcomparedwithtwoalternativeapproaches.The mainobjectiveistostudythescalabilityofthehierarchi- calapproachintermsoffailuredetectiontime,viewcon- vergencetimeandnetworkoverhead.<br><br> 5.1.Implementation Wehaveimplementedthemembershipserviceinthe Neptune framework 3programmingandruntimesupport forbuildingcluster-basedInternetservices[28].However, itcanbeeasilycoupledintootherclusteringframeworks. SHM Local Service Status Data Structure /proc File System Annoucer Multicast Channels Receiver Status Tracker Informer SHM External Service Code Client Code Contender Figure7: Implementationofthehierarchicalmembership(Cir- clesandeclipsesrepresentactiveentities,rectanglesrepresentdata structuresandarrowsshowtheinformationMow). Figure7showsthecomponentsinourimplementation andtherelatedexternalentities.Thecomponentsincircles arethreadsassociatedwithspeciLctasks.The Announcer threadcollectsthemachineinformationfromLinux/proc LlesystemandtheserviceinformationfromtheIPCchan- neloftheservice.Thenitpublishesthisinformationtothe multicastchannelsthenodehasjoined.<br><br> The Receiver threadsubscribestothemulticastchannels thatthenodehasjoinedandupdatesthesharedmemory structuretoreMectnewlyreceivedinformation.Theshared memoryblockisdividedintotwoparts:(1)alocalpartthat containstheinformationaboutallthedirectlyconnected nodesviathemulticastchannel.(2)anexternalpartwhich containsinformationofexternalgroupsrelayedbyagroup leader.Thedifferenceisthatanodeisresponsibleforde- tectingafailureinthelocalpartwhileitdependsonits groupleadertotelltheavailabilityofanexternalnode. The StatusTracker threadperiodicallycheckstheen- triesinthesharedmemoryblockandpurgesanyexpired entries.Whenthereisanexpirationandthefailednode istheleaderofthelocalgroup,theTrackerwillassume thebackupleaderasthenewleader.Ifthereisnobackup leader,theTrackerwakesupthe Contender threadtoiniti- ateanelectionprocessforanewgroupleader.Ifthenodeit- selfisagroupleader,itneedstopropagateastatuschange tohigherlevelgroups.ThisisdonethroughtheInformer thread,whichpropagatesachangetoothergroups. Besidesrelayingchangestoothergroupmembers,the Informer threadofagroupleaderalsolistenstoawell knownUDPport.Thus,theReceiverthreadonanewly joinednodecanpollInformertogettheentireyellowpage.<br><br> Furthermore,theReceiverisalsoresponsiblefordetecting anylossofupdatepackets.Eachupdatepacketcontainslast threestatuschangestoimprovethetoleranceofthepacket loss.Ifthereisanunrecoverableloss,theReceiverwillalso pollthesourcenodetogetacompleteimage. 5.2.ExperimentSettings Alltheexperimentalevaluationswereconductedona rack-mountedLinuxclusterwith100dual1.4GHzPentium IIInodes.EachnoderunsRedHatLinux(kernelversion 2.4.20).TherearetwoLayer-3switcheswith100Mblinks. Oneaccommodates50nodeseach.Thesetwoswitchesare connectedbyaGigabitlink.<br><br> Forourhierarchicalprotocol,wemanuallydesignate multicastchannelstoemulatemultiplenetworks.Eachmul- ticastchannelhosts20nodes.Therefore,thereareLvenet- worksfor100nodesandtheseLvenetworksformasecond levelnetwork. ForGossipscheme,mistakeprobabilityissetto0.1%, whichrepresentstheboundthatanynodemaymakeaner- roneousfailuredetection.Thisisarelativelylooserequire- mentfortheGossipscheme.Asdiscussedbefore,eachgos- sipmessageaccountsforanumberofnetworkpacketsand thebroadcastpacketsarenotcounted. Inthefollowingexperiments,weLxthemulticastor gossipfrequencyasonepacketpersecondforallthree schemes.Fortheall-to-allschemeandthehierarchical scheme,wesetthemaximumpacketlossesas5beforea nodeisconsideredasdead.Wevarythenumberofnodes from20to100withthenumberofnetworksfrom1to5.<br><br> Theaveragepacketsizecarryingthemembershipinforma- tionofeachnodeismeasuredas228bytesforallthree schemes. 5.3.BandwidthConsumption First,wecomparethebandwidthconsumptionforthree schemesinFigure8.Bandwidthconsumptionismeasured oneachnodebycountingtheincomingheartbeatpack- ets.Thenallnumbersareaddeduptogettheaggregated bandwidthconsumption.Whenthereare20nodes,allthe schemesusethesameamountofbandwidth.Whenthe numberofnodesgrows,thehierarchicalschemehasthe leasttotalbandwidthconsumptionandhasclosetolinear growth.Onthecontrary,thebandwidthusageforboththe all-to-allschemeandthegossipschemegrowsquadratically withthenumberofnodes.Theseresultsareinlinewithour analysisresultsinSection4,whereweshowthattheav- eragebandwidthconsumptionforeachnoderemainscon- stantinthehierarchicalapproachandgrowslinearlyforthe othertwoapproaches.Thissuggeststhatthehierarchicalap- proachismorescalableintermsofnetworkbandwidthus- age. 5.4.FailureDetectionTime Figure9showsthefailuredetectiontimeforthree schemes.Duringtheexperiments,amembershipser- vicedaemonprocessonanodeiskilledtoemulatethe nodefailure.WecanseefromtheLgurethatasthenum- berofnodesgrows,thehierarchalschemeandtheall-to-all schemehavethesameconstantdetectiontimewhichis around5seconds,themaximumnumberofpacketlosses timesthemulticastperiod.Ontheotherhand,thedetec- tiontimeofthegossipschemeincreaseslogarithmically alongwiththenumberofnodes.Italsohasthelongestde- tectiontimewhenthereareonly20nodes.Boththe hierarchicalandtheall-to-allschemeshaveshorterfail- uredetectiontimethanthegossipscheme.Thisexperiment resultsarealsoinaccordancewithouranalysisinSec- tion4.<br><br> 5.5.ViewConvergenceTime Figure10comparestheviewconvergencetime.Thehi- erarchicalschemehasthesimilarviewconvergencetimeas theall-to-allscheme.Thisisbecausetheyhavethesame 20 30 40 50 60 70 80 90 10 0 0 0.5 1 1.5 2 2.5 Number of Nodes Bandwidth Overhead (MB per second) Communication Cost All 2to 2all Gossip Hierarchical Figure8:Bandwidthconsumption 20 30 40 50 60 70 80 90 10 0 0 5 10 15 20 Number of Nodes Detection Time (seconds) Failure Detection Time ALl 2to 2all Gossip Hierarchical Figure9:Failuredetectiontime 20 30 40 50 60 70 80 90 10 0 0 5 10 15 20 25 30 Number of Nodes Convergence Time (seconds) View Convergence Time All 2to 2all Gossip Hierarchical Figure10:Viewconvergencetime failuredetectiontimeandwhenafailureisdetected,group leaderscanquicklypropagatethisinformationtoallnodes. Figure10alsoshowstheviewconvergencetimeofthegos- sipschemeisthebiggestamongthethreeschemesandit growsalongwiththenumberofnodes.Again,thehierarchi- calandtheall-to-allschemesperformbetterthanthegossip scheme. 5.6.Discussion Fromtheaboveexperiments,wecanseethatgossip schemeperformstheworst.Usingthesameamountofnet- worktrafLc,ithasthelongestdetectionandconvergence time.Whenthenumberofnodesscalesup,thegossip schemeincreasesbandwidthusagequadratically.Therea- sonisthateachgossiphastocarryahost 9slocalviewof thegroupmembership,whiletheothertwoprotocolsuse muchsmallermessagesize.Ifthegossipprotocolisorga- nizedintoahierarchicalfashion[24,11],thedetectiontime andconvergencetimewillbelongerduetocrossgroupgos- sips.However,itisshownthatGossipprotocolisusefulfor largegroupswhereeachmemberonlyneedsapartialview ofgroupmembership[11].Gossipstyleapproachesarealso attractivewhenefLcientmulticastisnotavailable,forex- ampleinwide-areaapplications.Wefocusonprotocols, suchastheall-to-allschemeandthehierarchicalscheme, whichalloweachmembertoquicklymanageaglobalview.<br><br> Thehierarchicalschemeisbetterthantheall-to-allscheme foritslessbandwidthconsumptionandcomparableperfor- mance.Theexperimentresultscloselymatchouranalysisin Section4.Thisvalidatesouranalysisandallowsustopre- dictthatourLndingswillremainvalidforlargerclusters. 6.RelatedWork Previousresearchonmembershiporfailuredetection protocolsforfault-tolerantdistributedapplicationsrequires precisemembershipservicestosupportotherdistributed protocolssuchasatomicbroadcastprotocols[4,8,22,9]. Stok etal.<br><br> [29]describedahierarchicalmembershipproto- colandtheirprotocolrequiresallnodeshavesynchronized clockssothattheexecutionstepsoftheprotocolcanbesyn- chronized.Ourworkisfocusedonnetworktopology-aware techniquesatthesoftwareapplicationlayer. Gossipstylemembershipservicesaredifferentfrom heartbeat-basedmembershipservicesintheaspectthatthey arebasedonprobabilities[24,16,11].Theseprotocols aremostattractivewhenfullgroupmembershipisnotre- quired,especiallyinwide-areaapplications.Forinstance, SCAMP[11]isahierarchicalvariationofgossipprotocols wheregroupmembersonlyhavepartialknowledgeofthe group.Wefocusontheclusterapplicationswhichrequire fullmembershipknowledge. Therearesimilarideasofhierarchicallyorganiz- inggroupmembersintheresearchofoverlaynetworks[3].<br><br> Mostofworkisfocusedonwideareanetworkappli- cations,wherenetworklatencyishighandlinkband- widthvariesfromnodetonode.Failuredetectiontimeand viewconvergencetimeareoftennottheprimarygoals. Manyhigh-availabilitysystems[17,14]alsoinclude membershipservicesasanessentialcomponent.TheLinux- HAprojectprovidesaheartbeatbasedmembershipser- vice[18].Itenablesahot-standbyfeaturethatonemachine cantakeoveranothermachine 9sIPwhenitfails.Butitonly worksinsmallscale.Thestudyof[15]proposesafailure detectorprotocolforGridenvironment.Ourstudyaimsat large-scaleclusterswithlowlatencyandhighthroughput systemareanetworks. Cluster-basednetworkserviceshavebeenstudiedin [10,30,26,28]andthemembershipserviceispartofinfras- tructure.Resourcemonitoringtools,suchasGanglia[25], theNetworkWeatherService[31],andtheMonitoringand DiscoveryService(MDS2)[19,7]ofGlobusproject,pro- videinformationofmachineresourceandnetworkresource oflarge-scaleclustersorcomputergrids.Theseprojectsdo notemphasizefastfailuredetectiontime,convergencetime, andtopologyawarenessindetails.Itshouldalsobenoted thatourmembershipserviceisgeneralandcanprovideape- riodicalupdateofservicestatusinformationforeachnode inadditiontofailuredetection.<br><br> 7.ConcludingRemarks Thecontributionofthisworkisatopology-adaptivehi- erarchicalmembershipserviceforlarge-scaleclusters.The keystrategyistoformanumberofsmallTTLbasedmul- ticastgroupswithanupdateprotocoltoachievefasthier- archicalcommunication.Ittoleratesswitchandnodefail- ureadaptivelywhilelocalizingprotocolcommunicationfol- lowingthenetworktopology.Ourevaluationshowsthehi- erarchicalmembershipserviceisscalableandefLcientin large-scaleclusters. Acknowledgments Wethanktheanonymousrefereesfortheirhelpfulcom- mentsonearlierdraftsofthispaper.Thisworkwassup- portedinpartbyAskJeevesandNSFgrantsCCF-0234346. References [1]AmericaOnline.<br><br> http://www.aol.com . [2]AskJeevesSearch. http://www.ask.com .<br><br> [3]S.Banerjee,B.Bhattacharjee,andC.Kommareddy.Scal- ableApplicationLayerMulticast.In Proc.ofACMSIG- COMM 902 ,Aug.2002. [4]T.D.Chandra,V.Hadzilacos,S.Toueg,andB.Charron- Bost.Ontheimpossibilityofgroupmembership.In Proc.of the15thACMSymposiumonPrinciplesofDistributedCom- puting(PODC 996) ,pages322 3330,NewYork,1996. [5]R.ChowandT.Johnson.<br><br> DistributedOperatingSystemsand Algorithms .Addison-Wesley,1997. [6]L.Chu,K.Shen,H.Tang,T.Yang,andJ.Zhou.Dependency IsolationforThread-basedMulti-tierInternetServices.In Proc.oftheIEEEINFOCOM ,MiamiFL,Mar.2005. [7]K.Czajkowski,S.Fitzgerald,I.Foster,andC.Kesselman.<br><br> GridInformationServicesforDistributedResourceShar- ing.In IEEEInternationalSymposiumonHigh-Performanc eDistributedComputing(HPDC) ,Aug.2001. [8]C.Fetzer.Enforcingperfectfailuredetection.In 21st ProceedingsoftheInternationalConferenceonDistributed ComputingSystems(ICDCS2001) ,Phoenix,AZ,2001. [9]C.FetzerandF.Cristian.Ahighlyavailablelocalleaderelec- tionservice.<br><br> SoftwareEngineering ,25(5):603 3618,1999. [10]A.Fox,S.D.Gribble,Y.Chawathe,E.A.Brewer,and P.Gauthier.Cluster-BasedScalableNetworkServices.In ACMSOSP ,SaintMalo,Oct.1997. [11]A.J.Ganesh,A.-M.Kermarrec,andL.Massoulie.Peer-to- peermembershipmanagementforgossip-basedprotocols.<br><br> IEEETransactionsonComputers ,52(2),February2003. [12]S.Ghemawat,H.Gobioff,andS.-T.Leung.TheGoogleFile System.In ACMSOSP ,2003. [13]GoogleSearch.<br><br> http://www.google.com . [14]ClusterInfrastructureforLinux. http://sourceforge.net/projects/ci-linux .<br><br> [15]A.JainandR.K.Shyamasundar.FailureDetectionand MembershipManagementinGridEnvironments.In Fifth IEEE/ACMInternationalWorkshoponGridComputing (GRID 904) ,pages44 352,Pittsburgh,PA,Nov.2004. [16]D.Kempe,J.M.Kleinberg,andA.J.Demers.Spatialgos- sipandresourcelocationprotocols.In ACMSymposiumon TheoryofComputing ,pages163 3172,2001. [17]High-AvailabilityLinuxProject.http://www.linux-ha.org.<br><br> [18]LinuxHeartbeat.http://www.linux-ha.org/heartbeat. [19]MDS2. http://www.globus.org/mds .<br><br> [20]MSNGroupsService.http://groups.msn.com. [21]K.Nagaraja,X.Li,R.Bianchini,R.P.Martin,andT.D. Nguyen.Usingfaultinjectionandmodelingtoevaluatethe performabilityofcluster-basedservices.In Proc.ofthe4th USENIXSymposiumonInternetTechnologiesandSystems (USITS 903) ,2003.<br><br> [22]G.Neiger.Anewlookatmembershipservices.In Proceed- ingsofthe=fteenthAnnualACMSymposiumonPrinciples ofDistributedComputing ,Philadelphia,PA,1996. [23]D.Oppenheimer,A.Ganapathi,andD.A.Patterson.Why doInternetservicesfail,andwhatcanbedoneaboutit?In Proc.ofthe4thUSENIXSymposiumonInternetTechnolo- giesandSystems(USITS 903) ,Seattle,WA,Mar.2003. [24]R.V.Renesse,Y.Minsky,andM.Hayden.Agossip-style failuredetectionservice.In Proc.Middleware98 ,1998.<br><br> [25]F.D.Sacerdoti,M.J.Katz,M.L.Massie,andD.E.Culler. Wideareaclustermonitoringwithganglia.In Proc.ofthe IEEECluster2003Conference ,HongKong,2003. [26]Y.Saito,B.N.Bershad,andH.M.Levy.Manageability, AvailabilityandPerformanceinPorcupine:AHighlyScal- able,Cluster-basedMailService.In Proc.ofthe17thSOSP , pages1 315,1999.<br><br> [27]K.Shen,T.Yang,andL.Chu.ClusterLoadBalancingfor Fine-grainNetworkServices.In Proc.ofInternationalPar- allel&DistributedProcessingSymposium ,Apr.2002. [28]K.Shen,T.Yang,L.Chu,J.L.Holliday,D.A.Kuschner, andH.Zhu.Neptune:ScalableReplicationManagement andProgrammingSupportforCluster-basedNetworkSer- vices.In Proc.of3rdUSENIXSymposiumonInternetTech- nologiesandSystems ,SanFrancisco,CA,Mar.2001. [29]P.Stok,M.Claessen,andD.Alstein.AHierarchicalMem- bershipProtocolforSynchronousDistributedSystems.In 1stEuropenDependableComputingConference ,LNCS852, pages597 3616.Springer-Verlag,Oct.1994.<br><br> [30]J.R.vonBehren,E.A.Brewer,N.Borisov,M.Chen, M.Welsh,J.MacDonald,J.Lau,S.Gribble,andD.Culler. Ninja:AFrameworkforNetworkServices.In Proc.of2002 AnnualUSENIXTechnicalConf. ,Monterey,CA,June2002.<br><br> [31]R.Wolski,N.Spring,andJ.Hayes.TheNetworkWeather Service:ADistributedResourcePerformanceForecasting ServiceforMetacomputing. JournalofFutureGeneration ComputingSystems ,1998. [32]Yahoo!<br><br> http://www.yahoo.com .

less

Copyright © 2010 beepdf.com. All rights reserved.