您的当前位置：首页 Cluster Computing Environment Supporting Single System Image

Cluster Computing Environment Supporting Single System Image

来源：纷纭教育

ClusterComputingEnvironmentSupportingSingleSystemImage

MinChoi,DaeWooLee,andSeungRyoulMaeng

DivisionofComputerScience,DepartmentofElectricalEngineeringandComputerScience

KoreaAdvancedInstituteofScienceandTechnology

mchoi,dwlee,maeng@camars.kaist.ac.kr

Abstract

Singlesystemimage(SSI)systemshavebeenthemain-stayofhigh-performancecomputingformanyyears.SSIrequirestheintegrationandaggregationofalltypesofre-sourcesinaclustertopresentasingleinterfacetousers.Inthispaper,wedescribeaclustercomputingenvironmentsupportingSSI,constructedthroughthreecomponents:sin-gleprocessspace(SPS),processmigration,anddynamicloadbalancing.Thesecomponentsattempttoshareallavailableresourcesintheclusteramongallexecutingpro-cesses,sothattheclusteroperateslikeasinglenodewithmuchmorecomputingpower.Themostimportantgoalistocombinetheseconstructsininnovativewaysforbuild-ingclustercomputingenvironmentforSSI,aswellasin-dividuallytakeanovelapproachetoimproveperformanceorfunctionality.Ourimplementationofprocessmigrationhasthecapabilityofresolvingbrokenpipeproblemsandbinderrorsonserversocketreconstruction.WerealizeSPSbasedonblockPIDallocation.Wealsodesignedandim-plementedadynamicloadbalancingschemewhichresolvesthelimitationsofourpreviousworkbycontinuouslytrac-ingthejobresourceusageatruntime.TheexperimentalresultsshowthatthesethreeconstructsforSSIclustersre-alizedscalability,newfunctionalityandperformanceim-provement.Theclustercomputingenvironmentallowstheseconstructstocooperateimplicitlysothattheycreateasyn-ergyeffectattheSSIclustersystemlevelandsuccessfullyprovideasinglesystemimagetousersandadministrators.

1Introduction

SSIenablesaclusterofPCsorworkstationstobeusedasasinglecomputingunitinanefﬁcientandscalablemanner[10].Inthispaper,weintroduceaclustercomputingen-vironmentforSSI,consistingofthreemajorcomponents:

ThisresearchissupportedbytheNationalResearchLaboratoryGrant4-207.

SPS,processmigrationanddynamicloadbalancing.TheseconstructscooperateimplicitlytocreateasynergyeffectattheSSIclustersystemlevel.Individually,eachtakesalsoanovelapproachetoimprovetheirownperformanceandfunctionality.

SSIillusioncanberealizedinawayasaprovisionofanSPS.Inthiswork,SPSisbuiltwiththesupportofcluster-wideuniquePIDs,signalforwarding,andremoteexecu-tion.WeemployedanewdesignofSPSbasedonablockPIDallocation,inwhichtheentirePIDspaceisbrokenupintoblocksandthebrokenPIDblocksareassignedtoclientnodes.Therefore,thecommunicationoverheadre-quirementsofanewPIDissigniﬁcantlyreduced.

InadditiontotheSPS,weimplementprocessmigrationanddynamicloadbalancingtoattempttoshareallavailableresourcesinthecluster,sothattheclusteroperateslikeasinglenodewithmuchmorecomputingpower.Withthespreadinguseofnetworkinterconnection,migrationshouldbeabletomaintainthenetworkconnectionofaprocessaf-terbeingmigrated.Previousresearchrarelydidnotsupportthemigrationofaprocesswithanopennetworkconnection[16,17],andifitdid,itonlymigratedprocesseswhichoper-ateonclientsockets[4].Thereasonwhypreviousresearchdidnotofferserversocketmigrationisthattherearesig-niﬁcantproblemwithbinderrorsonsocketreconstructionandbrokenpipeproblems.Toovercometheselimitations,wemakeuseoftwokeymechanisms:TCPportinheritanceandcommunicationchannelclearingbyzerowindowad-vertisement.

Wehavediscussedanimportantissueofhowtoeffec-tivelydistributejobstoalargeclustermachinebypropos-ingasolutiontermedthe”numberofeffectivetasks”[6].ThenumberofeffectivetasksisametricfordeterminingtheratiobetweenthenumberofCPU-boundjobsandI/O-boundjobs.Themetriccanbeusedtoimprovethecomplextaskofjobschedulingandhastheadvantageoverotherjobschedulersthatitdoesnotrequirealargetaskhistoryaboutresourceusage.However,alimitationofthisloadbalanc-ingapproachisthepossibilityofmakingwrongdecisionduetothefactthatjobsmaynotbeeasilyclassiﬁedinto

CPU-boundorI/O-bound,orsomejobmayvarybetweenCPU-boundandI/O-boundfromtimetotime.Weresolvethisproblembyruntimemeasurementsofjobresourceus-ageandprocessmigrationinadditiontothedynamicloadbalancingbyusingthenumberofeffectivetasks.

Therestofthispaperisorganizedasfollows.Section2brieﬂydescribesthedesignobjectivesforourclustercom-putingenvironmentsupportingSSI.Section3presentsthedesignandimplementationofourcomputingenvironmentforSSIclusters.Section4reportstheexperimentalresultsoftheclustercomputingenvironment.Section5describesrelatedwork.WeconcludebysummarizingourresultsinSection6.

Aclustersystemhasthetendencytoconcentratethesystemloadoncertainnodes,resultinginbothover-loadednodesandidleresources.Theoverloadedexe-cutionofonenodegivesrisetoacceleratedconsump-tionofitsallocatedPIDspace.Thus,thePIDblockre-distributionofSPSoccurswhentheavailablePIDsareexhausted.Thisisnotsigniﬁcantproblemintermsofperformance,butcreatesunneccessaryandunwantedoverhead.Therefore,thedevelopmentofaloadbal-ancingsystemforutilizingcomputingresourcesoflightlyloadednodesencouragesSPStominimizethechanceofPIDblockredistribution.

2DesignObjectives

Indesigningtheclustercomputingenvironmentsupport-ingSSI,themostimportantgoalisimplicitcooperationamongdifferentSSIfeatures.Weproposethreecompo-nents:processmigration,SPS,anddynamicloadbalancing.TheseconstructsarecombinedininnovativewaysforourgoalofbuildingacomputingenvironmentforSSIclusters,aswellasindividuallytakeanovelapproachestoimproveperformanceorfunctionality.Thesynergyeffectsfromthisimplicitcooperationareasfollows.

3DesignandImplementationofComputingEnvironmentforSSIClusters

3.1ProcessMigrationofNetworkedApplications

Processmigrationistheactoftransferringaprocessbe-tweentwomachinesduringitsexecution.Withtheincreas-ingdeploymentofdistributedsystemsandmobilecomput-ing,theinterestinprocessmigrationisagainontherise.Wedevelopedanew,workingsystemwhichsupportsmigrationofnetworkedapplications.Nowadays,thetechniqueusedtomigratealegacyprocessistrivial,sincealmostallre-searcheffortsinthepastseveraldecadesconductedtoop-timizethesetechniques.Thus,weusedtheexistingbasiccheckpoint/restartpackage,Epckpt,asbasedevelopmentplatform.WeextendedtheEpckptfacilitytobeabletomigratenetworkedapplications.Thisisthemostchalleng-inganddifﬁcultareabecauseitinvolvesmorethantwoma-chinesinadistributedsystem.Inthissection,wedescribethereasonwhyprocessmigrationwithanopennetworkconnectionisdifﬁcultandweproposesolutionapproaches.3.1.1BrokenPipeProblems

Themajorproblemofmigratingaprocesswithasocketisthemessageswhichareintransitwhenaprocessneedstobemigrated.Simplycheckpointingtheprocessstate,regard-lessofthenetworkstate,resultsinaninconsistentapplica-tionstateafterrestoration.Moreover,brokenpipeproblemshappenwhenthesenderwritesapacket,eventhoughtheoperationofthereceiverprocesshadbeeninterruptedandthesocketwasclosed(Figure1-a).Inordertoresolvethisproblemwithoutcapturingnetworkstate,wemakeuseofcommunicationchannelclearingbeforemigration.

Inthiswork,communicationchannelclearingisachievedbyzerowindowadvertisement.Itinterruptthedatatransmissionbyadjustingthewindowofsenderpro-cessto0astheﬁgure1-b.Sincethezerowindowadver-tisementisanetworklevelapproach,thepeerapplicationcantrytowritedatacontinuouslywithoutknowingthefact2

Migration+SPS:resolvingthenamingconﬂict.

InourSPSimplementation,eachPIDforeverynodeistrulyunique.ThemaintenanceofmappinglocalPIDanduniquePIDbymasquerading[9]iscomplexanditrequiresadditionaloverheadforadjustmentofmap-pinginformationateverynewprocesscreation.Usingcluster-wideuniquePIDsdoesntrequiretheoverheadanditalsohelpstoresolvethenamingconﬂict,whichoccursofteninmigrationofaprocessfromonenodetoanothernodealreadyhavingthesamePID.Migration+DynamicLoadBalancing:enablingSSIClusterstotakeamoresophisticatedloadbalancingappoachusingprocessmigration.

Ourpreviousloadbalancingsystemtakesinitialjobplacement,whichassignsanewlyarrivedjobtonodethatbestmeetsthetaskrequirementsbeforeexecu-tion[6].Theimplementationofprocessmigrationen-ablesourdynamicloadbalancingsystemtoemployamoresophisticatedapproach,migratingarunningpro-cessfromanoverloadednodetoalessloadedone.TheLinuxkernelcontinuouslymonitorsthetasksbyrecordingtheresourceusageatruntime,andperiod-icallycheckswhethertheresourcerequirementonanodeisseriouslyimbalancedornot.

SPS+DynamicLoadBalancing:preventingSPSfromunwantedPIDblockredistribution.

Figure1.Brokenpipeproblemandthesolu-tion

thatthesendwindowisclosed.Thus,wesuspendtheap-plicationprocessrightafterzerowindowadvertisement.Asaresult,atthemomentofprocessmigration,wecanguar-anteecorrectprocessmigrationwithouthavingthetransitdataonthecommunicationchannel.Whencommunicationchannelclearingoccurs,weconsiderdifferentrolesthatthesocketoftheprocesstobemigratedcanplay.

1.Thesocketassender

Itisquitesimpletomovethesocketofamigratedpro-cesswhenitactsasasender.Itdoesnotmerelywritedata.Thatis,itisjustdothemigratedwithoutanyspe-cialpreprocessingsuchaszerowindowadvertisementandapplicationprocesssuspending.2.Thesocketasreceiver

Whenthesocketofamigratedprocessisthereceiver,asinﬁgure1-b,thereceiver(Node2)advertisesazerowindowtothesender(Node1).Evenifitdoes,thereisnoprobleminreceivingapacketintransit.3.Thesocketasbothsenderandreceiver

Thisisthetypicalcaseindistributedsystemsinwhichprocessescommunicatebysendingandreceivingdata.Inthiscase,wetaketheapproachof.oneachnode.Thus,theyinterruptdatasendingofcorrespondingnodesbyzerowindowadvertisement.Figure2showsthemessageexchangewhichhappensatdifferentnodesduringtheprocessmigration.Forsuccess-fulprocessmigration,someconditionsmustbefulﬁlledasfollows:

1.TheeventthatnodeAadvertisesitswindowsizeaszerohappensbeforetheeventthatacorrespond-ingprocessonnodeBsuspendsitsexecutionwithSIGSTOP(

bytransitivityofahappensbefore

relation).

Figure2.Eventsandtheorderintheprocessmigrationwithsocketsupport

resumesthesuspendedprocessonnodeB(

3.Itmakesnodifferencewhichofthetwonodessus-pendstheprocessﬁrstsincetherearenopacketsintransmitbetweentheprocessesofnodeAandnodeB

andbyzerowindowadvertisement(Theorderof

doesnotmatter).

4.TheeventwhichresumesthesuspendedprocessonnodeBhasnothingtodowiththeeventwhichresumesthetransferredprocessimageonnodeC.Becauseanyprocessdoesnotsenddatabyzerowindowadvertise-anddoesnotmatter).ment(Theorderof

5.Thereisalsonorelevancetotheorderofzerowindow

advertisement.Thoughitssocketisfreezedbyzerowindowadvertisement,nodeAcanreceivedatafromthecorrespondingsendernodeB,andviceversa(The

anddoesnotmatter).orderof

3.1.2BindErrorsonSocketReconstruction

Aprocessoperatesonserverhasalistenersocket,andin

thecaseofanincomingconnectionrequestfromaclient,theservercreatesachildsocketandacceptstheconnectionrequestfromtheclient.Thereasonwhypreviousresearchdidnotofferserversocketmigrationisasigniﬁcantbinderrorproblemonsocketreconstruction.

TheLinuxkernelmakesuseofahashtableasinﬁgure3tofacilitatemaintenanceandlookupsofsocketobjects.Theinformationofsourceanddestinationofasocketob-jectareusedasthehashkey.Forabindingandlistening3

2.TheeventthatnodeAtransfersthecheckpointedpro-cessimagetonodeChappensbeforetheeventwhich

3.2.1BlockPIDAllocation

InblockPIDallocation,theentirePIDspaceisbrokenupintoblocks.ThePIDblocksareassignedtoslavenodesondemand.WhenanodesendsarequestforaPID,thepoolmanagerrepliestotheincomingrequestwithaPIDblock,notanindividualPID.Thenodereceivingthisre-plycanforksomeprocesseswithinitsblocksizewithoutgettingpermissionforusingsomeotherPID,orcheckingwhetherthenewPIDisgloballyunique,ateveryprocessforktime.Eventhoughthemaster-slavearchitecturehasbeenchosen,theblockPIDallocationlightenstheheavyburdenofmasternodes,whichresultsingreaterscalability.Moreover,thisofferstheultimateinﬂexiblityinthatanodewhowantstousemorePIDscangetmorePIDblocks.EveryslavenodegetaninitialPIDblockonsystembootup,andiftheyuseupalltheirallocatedPIDs,thentheycangetanotherPIDblockfromthemasternode.ThisistheoriginalconceptofblockPIDallocation.ThereisanenhancedwayforreducingthePIDblockallocationover-headwhichoccurseveninthisblockscheme.Fromsystemstartup,everynodegetsdisjointPIDrangesfromtheentirePIDspace.TheycanfreelyusethegloballyuniquePIDintheirpreallocatedrange.Thereafter,thePIDspaceadjust-mentoccursinblockfashion,onlywhentheavailablePIDsofanodeareexhaustedoranewslavenodeparticipatesinSPS.Thisenhancedschemeshowsbetterscalableperfor-mancethantheoriginalone,sinceitdoesnotneedtosendPIDblockrequest/replymessagesaslongasPIDblockre-distributionoraddingnewnodesintoSPSdoesnotoccur.

Figure3.ThedifferencebetweenTCPPortBindingandTCPPortInheritance

socket,thesourceportisusedasthehashkey,andforaconnectedsocket,atupleofsaddr,sports,daddranddportareusedtocreatethehashkey.Whenitcomestoasocketbindrequest,thehashfunctiontcpbhashfnisusedtotrans-formthehashkeyintoahashbucketaddressinthehashtabletcpbhash.Ifthesocketinformationisidenticaltoanal-readybindedsocket(intheeventofthisisthesamesourceportbeingusedbybothalisteningandchildsocket),abinderroroccurs.Becausetheygetthesamehashbucketaddressfromthehashfunction.Inordertoresolvethisproblem,wemakeuseofTCPportinheritancetoinheritthesourceportfromtheparentsocket.Thus,morethantwosocketobjectsarechainedonthesamehashbucketandsharesthesameport.Bothparentandchildsocketsusethesamesourceportof8,000.Inordertoreconstructasocketduringmi-gration,theparentsocketmustbecreatedﬁrst,andmustentertheTCPLISTENstate.Thenthechildsocketiscre-atedandinheritstheparent’sport.Thechildsocketgetstheaddressofthetb-¿ownerstoredinprevoftheparentsocket,andsavestheaddressoftheparentsocketinthebindnextﬁeldofthechildsocket.Therefore,withoutbinding,itispossibleforanestablishedchildsocketobjecttoputitselfintothehashtablestructure.

3.2SingleProcessSpace

SPSallowsalltheprocessesintheclustertoshareanuni-formprocessidentiﬁcationscheme.Aprocessintheclustercanaccessandcommunicatealltheotherprocessescluster-wide,usingasinglenamespace.Thus,providingcluster-wideuniquePIDsandanIPCfacilityisoneofthebiggestchallengesinbuildingSPS.Wedesignascalableandﬂexi-bleSPS,builtwiththesupportofuniquecluster-widePIDs,aswellaswiththesupportofsignalforwarding.Forthescalablityandﬂexiblityingeneratingcluster-wideuniquePIDs,weintroduceanewPIDallocationscheme,whichwecallblockPIDallocation.

Figure4.PIDlook-upinblockPIDallocationFigure4givestheoverviewofaPIDlook-up,witheachstepoftheprocesslabeled.Theprocessbegins(step1)bysendingalook-uprequesttothemasterdaemon.Thedae-monlooksuptherequestedPIDintheblockallocationtablebycheckingwhetherornotthePIDisincludedinthePIDblockstartingfromStartPID(step2).Thefoundentrycon-tainsthefollowinginformation:thePIDblockstartnumber,thenodenumber,andtheblockusage.Thus,wecanknow

whichblocktherequestedPIDison,whichnode(step3),andhowmuchattheblockisutilized.Theblockusagebitmapexiststorepresenttheblockusageinformation;onebitinthebitmapcorrespondstoaPID.Thispreventsthestarvationproblembycollectingunder-utilizedPIDblocks.

3.3DynamicLoadBalancing

Wehavediscussedtheimportantissueofhowtoeffec-tivelydistributejobstoalargeclustermachinebyproposingasolutiontermedthe”numberofeffectivetasks”[6].Thenumberofeffectivetasksisametricfordeterminingthera-tiobetweenthenumberofCPU-boundjobsandthenumberofI/O-boundjobs.Themetriccanbeusedtoimprovethecomplextaskofjobschedulingandhastheadvantageoverotherjobschedulersthatitdoesnotrequirealargetaskhis-tory.However,alimitationofthisloadbalancingapproachisdescribedasfollows.

execution,butalsoonexecution.BecausethejobmayvarybetweenCPU-boundandI/O-boundfromtimetotime,andthepreemptivemigrationmayhappenindynamicloadbalancingsystematanytime.

Inthiswork,theanalyzingprocessisdoneonkernelatruntime.InordertodistinguishwhetherthejobisCPU-boundorI/O-bound,wemakeuseofbothCPUfactorandproportionoftimequantumconsumedbythejob.Thekernelrecordsnewlyallocatedtimequantumofajobatthebeginningofanewepochandtimequantumconsumeduntiltheendoftheepochinprocessdescriptor.Forthispurposeweaddedsomenewﬁeldsinprocessdescriptorasfollows:

structtaskstruct

longquantumonepochstart;longquantumused;

doubleppquantumused;

JobsmaynotbeclassiﬁedintoCPU-boundorI/O-boundcategories,someofthemcannotbeclassiﬁedasbeingstrictlyCPU-boundorI/O-bound.Moreover,somejobsmayvarybetweenCPU-boundandI/O-boundfromtimetotime.

Fortheformercase,wedonothavetoconsiderpro-cesses,whichmaynotbeclassiﬁableaseitherCPU-boundandI/Obound.Theloadbalancingsystem,whichusesthenumberofeffectivetasksasaloadmetricjusttreatssuchprocessesasusual(referCase1of[6]).Next,weresolvethelatterproblembyprocessmigrationthroughruntimemea-surementofjobresourceusage.Eventhoughtheinitialjobplacementwasdonebasedonthenumberofeffectivetasks,theunbalancedresourceusagemayoccurbecauseacertainjobvariesbetweenCPU-boundandI/O-boundfromtimetotime.Therefore,theLinuxkernelcontinuouslytracesthetasksbyrecordingtheirresourceusageatruntime,andperi-odicallycheckswhethertheresourcerequirementonanodeisseriouslyimbalancedornot.Ifitis,theloadbalancingsystemdynamicallymigratestheprocesstoanotherappro-priatenode.Asaresult,wetakeamoresophisticatedap-proachtoloadbalancing,exploitingbothprocessmigrationandinitialjobplacementsimultaneously,whileourprevi-ousapproachonlyinitialjobplacement.Theexperimentalresultsshowthatthethisapproachgivesbetterperformanceoverthatofourpreviousapproach.

3.3.1ProcessMigrationthroughtheRuntimeMea-surementofJobResourceUsageThereisasimpleandpopularwayforanalyzingthejobresourcerequirementtocomputetheamountofeachresourceusageafterthejobisdone.However,theana-lyzingprocessmustbeabletobedonenotonlyafterthe

Quantumonepochstartkeepstrackofthetimequan-tumallocatedwhenanewepochisstarted.Quantumusedistheamountoftimequantumconsumedinanepoch.Finally,ppquantumusedistheproportionofthetimequantumconsumedfromtheinitiallyallocatedtimequan-tum.

foreachtask(p)

p-quantumused=p-quantumonepochstart-p-counter;p-ppquantumused=(double)p-quantumused/p-quantumonepochstart;

p-quantumonepochstart=(p-counter1)+

NICETOTICKS(p-nice);

p-counter=(p-counter1)+NICETOTICKS(p-nice);

TheLinuxschedulergivesanewquantumtothewholeprocesseswhentheentirerunnableprocesseshaveusedupalloftheirquantum.Operationonthreevariablesnewlyintroducedaboveisperformedatthistime.Thus,theinfor-mationforuseincharacterizingjobresourcerequirementiskepttrackofbytheLinuxkernel.Aglobaljobschedulerutilizestheinformationgatheredfromalltheothernodestodecidethepolicyofdynamicloadbalancingsystem.

4ExperimentalResults

4.1SystemArchitecture

Weconstructedoursystemon16nodeLinuxcluster,ofPentiumIV1.8GHzmachineseachwith512MBRAM.Themachinesareconnectedby100MbpsEthernet.Fig-ure5showsanoverviewofourclustercomputingenviron-mentforSSI,designedintwoparts:theﬁrstoneiskernel

modiﬁcationforimplementingprocessmigration,SPS,andloadbalancing,andthesecondoneisuser-leveldaemonsrunningonallthenodesinaclusterofcomputers.Thedae-monskeepinformationofjobsandnodesup-to-dateoneachnode.Thedaemonstransfertheseinformationtoaglobaljobschedulerperiodically.ThekernelsendsamessagetoadaemonthroughaSimpleCharacterDeviceDriver(SCDD).TheSCDDhasaﬁxed-sizebuffertostoremessagesfromthekernelandallowsathreadofuser-leveldaemontoreadthem.

plicationspresentedinthissectionaresummarizedinTable1.

Table1.Lmbenchapplicationsanddescrip-tions

Applfork+/bin/shfork+exit

Description

processwithalonglifetime,whichforksandwaitsforchildtorun/bin/sh

processhavingashortlifetime,whichforksandwaitsforchildprocesseswhichcallexit()immediately

IPCsharedmemorysegmentholdinganinte-geriscreatedandremoved

Parallelcompilewithmultipleprocessesac-tiveatonetime

signalhandlingbyinstallingasignalhandlerandthenrepeatedlysendingitselfthesignalcreatingapairofpipes,forkingachildpro-cess,andpassingawordbackandforth

semaphorepmakesignalpipe

Figure5.PIDlook-upinblockPIDallocationWhenanewjobissubmittedtosystem,aglobaljobschedulerselectsthemostpreferablenodeforthejobtobeexecuted.Theinformationmanagermonitorsjobbykeep-ingtracksofchangeofjobsresourceusageduringtheexe-cution.Therefore,jobcanbemigratedfromnodetonodebasedonrequiredresourcesandavailableresources.

4.2PerformanceofSingleProcessSpace

BythedeﬁnitionofSPS,itdoesnothelptoimproveper-formancebyparallelcomputation.SPSisakindofserviceofSSI,notaparallelprogrammingmodelsuchassharedaddressspaceormessagepassing.Thus,itisnotpossi-bletoevaluatetheSPSperformanceusingparallelbench-marksintermsofspeeduporscalabilityasafunctionofthenumberofnode.Instead,wemeasurehowtheSPSimple-mentationaffectstheOSperformanceoneachnode.Thus,weusetheterm’scalability’inthissectiontodescribeOSscalabilityonasinglenode.WecarriedoutperformancetestforSPSwithsomelmbench[14]application.Lmbenchisamicrobenchmarksuitedesignedtofocusattentiononthebasicbuildingblocksofmanycommonsystemappli-cations.Theapplicationsweuseinourstudyareasubsetofthelmbenchmicrobenchmarksuite.Someofthemwerenotchosenbecausetheymeasureasystem’sabilitytotrans-ferdatabetweenprocessor,cache,andmemory.Theydon’thavemuchrelevancewiththeSPSperformance.Theap-6

Figure6.SPSperformancecomparisonsFigure6showsthenormalizedexecutiontimeofrunninglmbenchmicrobenchmark.Inthegraph,ORIGINALrepe-sentsthevaluesobtainedwhenpureblockPIDallocationisused,whileENHANCEDrepresentsthevaluesobtainedwhenblockPIDallocationisusedwithinitialPIDspacepartitioning.TheﬁgureindicatesthatENHANCEDoutper-formstheORIGINALbyupto7.1%,andBprocshowsaperformancedegradationinfork+sbin/shandpmake.TheBprocperformancedegradationinforkintensiveapplica-tionsstemsfromthefactthateachslaveshouldgetpermis-siontouseanuniquePIDfromamasterateveryforktime.TheENHANCED,ORIGINAL,andBprocdeliveredessen-tiallythesameperformanceinsemaphor,signal,andpipe.TheseresultsdemonstratethatourSPSimplementationis

morescalablethanBprocespeciallyinfork-intensiveappli-cations.

4.4OverallSystemPerformance

Weconductedatrace-drivensimulationcomparabletoMOSIX,andhistorybasedestimationsystem[3]inordertoevaluatetheoverallsystemperformance.AstheinputweselectedajobexecutionlogcollectedbytheCornellThe-oryCenter(CTC)duringtheperiodJuly1996throughMay1997[7].WeﬁlteredtherawCTCjobexecutionlogrecordbyremovingthejobsthatsatisfythefollowingcondition.

4.3MeasurementofOverheadforProcessMigra-tionwithOpenNetworkConnection

Tomeasurethecostofourprocessmigration,weper-formedcheckpointing/restartingtoprocesswhichrepeti-tivelyexchanges’ping-pong’messagesbetweenprocessesindifferenthosts.

Figure7.Socketmigrationoverhead

Figure7showsthatthevastmajorityoftimeisspentdo-ingmemorydump.Notethatthey-axisdoesnotstartfrom0.Thetimetosaveandrestoresocketstructuretakesonly0.7%ofcheckpointtimeand1.1%ofrestarttime,respec-tively.Thisimpliesthattheoverheadduetosocketstructuremigrationisalmostnegligiblewhencomparedwithtimetotakethebasicprocesscheckpoint/restart.InthisﬁgureFilerepresentstheoverheadtosavetheprocess’sﬁledescrip-tors,andOthersrepresentstheremainingtimeconsumedbyothertaskslike:initializeglobalvariables,runaloopincontrollogic,anssoon.

Thejobswhichareshorterthan10secondsareruledout,sothatthesimulationisnotaffectedbysystemcommands,suchas’ls’or’cd’.Theconditionthattheexecutiontimemustnotexceed10000secondsexistsbecausethetargetsystemforthesimulationdoesnothaveenoughprocessingpoweroftheoriginalsystem,IBMSP2512node.

Oursystemassignsanewjobtoanodewhichhasthelowestvalueofthenumberofeffectivetasks,andmigratesthejobifloadimbalanceoccurs.Thehistorybasedesti-mationsystemestimatestherequiredresourcestoassignajobexperiencedbyotherjobwhichhassamematchingnameandparameters[3].Ifthenewlyarrivedjobdoesnotmatchthecharacteristicsofthepreviouslyexecutedjob,thesystemcannotcorrectlyestimatetherequiredresourceforthisjobandevenifthenewjobmatchesthepreviousjob,it’spossiblethatthejobrequiresdifferentresourcesthanoneexecutedpreviously.However,weusedthenumberofeffectivetaskstoavoidperformancedegradationthroughanon-optimalplacementofajobeventhoughthejobhasnotbeenexecutedbefore.Moreover,ourproposedprocessmi-grationpreventsthedynamicloadbalancingfromsufferingfromimbalancedresourceusagewhichoccurswhensomejobsvarybetweenCPU-boundandI/O-boundcategories.Thus,theprocessmigrationprotectsthesystemfromexces-siveexecutiontimecausedbyimbalancedresourceusage.Anincomingjobisusuallyplacedonthemostpreferablenodeinthenumberofeffectivetasksbasedapproachandalsoitmaybeplacedonaninappropriatenodewhenincor-rectestimationoccursinestimationbasedapproach.Figure8showsoursystemreducesthenumberofjobswhichsuf-fersfromunwantedincreaseofexecutiontimeto15throughruntimemeasurementsandprocessmigration.Evenifthe15jobsstillresultinagreaterexecutiontime,thegaphasbeensigniﬁcantlyreduced.MOSIXisdesignedtorespondtoloadaveragevariationamongthenodesbymigratingpro-cessesfromonenodetoanother.Theloadaverageimpliestheweightedaverageofnumberofjobsanditdoesnotcontainthejobresourceusage,asdifferentfromourloadmetric.Anarrivalofnewjobmaycausethesystemtobeloadimbalanceatinitialstage,becauseMOSIXdoesnotexploitinitialjobplacement.Moreover,I/Oboundandnet-workboundprocessessufferbigdrawbacksandpenalized

Figure8.ExecutiontimeofeachjobinCTCIBMSP2trace

byprocessmigrationduetothestrongresidualdependencyinMOSIX.ThesearethereasonwhyoursystemperformsbetterthanMOSIXasshownintable2.

ﬁedcommercialoperatingsystems.ThesessystemsincludeCondor[2],CoCheck[19],libckpt[17],andMPVM[5].Sincethereisnokernelsupportforprocessmigration,thesesystemsrequirestheprocessesnottousesomecommonop-eratingsystemservicessuchasnotallowingprocessestoforkorexec.Moreover,migrationofnetworkedapplica-tionsisinherentlyimpossiblewithoutkernelsupport.

Kernel-levelapproachesforprocessmigrationincludeCRAK[4],Epckpt[16],MOSIX[1],OpenSSI[8],SolarisMC[12].MOSIX,OpenSSI,andCRAKprovideprocessmigrationwithsocketsuppport.However,CRAKisonlyabletomigratetheprocesswhichoperatesonclientsocket,notserversocket.MOSIX,OpenSSIandSolarisMCpro-videmigrationbyredirectinglocationdependentoperationssuchas,systemcallsorsocketconnections.Thisapproachdegradesperformance,faultresilienceandadverselyaffectsreliability,becauseamigratedforeignprocesswillstillde-pendonitshomenode.

BProc[9,18],SolarisMCareapproachesforprovi-sionofSPS.BProc,orBeowulfDistributedProcessSpace,wasdevelopedatScyldSoftwareandreleasedundertheGNUPublicLicense(GPL).BprocshouldnotbeconfusedwiththeolderglobalPIDspacedoneforBeowulfsystems.Bprocisnotaderivativeofthatscheme.Bprocadoptsthemaster-slavearcihtecture.Allrequestsforanewallocation

Table2.Comparisonsoftotalexecutiontime

Performanceimprovement(%)ToestimationToMOSIXbasedapproach

-6.06

5.7111.245.85

14.15

8.95

Estimationbased

MOSIX

The#ofeffectivetasksbased

[2]withruntimemea-surement

Table2showsthatourdynamicloadbalancingsystemgivesperformanceimprovementof14.15%toestimationbasedand8.95%toMOSIX.

5RelatedWorks

Severalsystemshavebeendevelopedtosupportpro-cessmigrationattheuser-levelandcanberunonunmodi-8

ofuniquePIDarecentralizedtothemasterserver.SolarisReferences

MCprovidesaglobalprocessspaceforglobalprocessman-agement,sothatthelocationofaprocessistransparenttotheuser.Processeslivingwithinthisglobalprocessspacecanbeuniquelyidentiﬁed.Theglobalprocessspacesup-portsremotecreationofprocessesandoperating-system-relatedmessagesaretransparentlyredirectedtothenodewheretheprocessesreside.

6Conclusion

Weintroducedaclustercomputingenvironmentsupport-ingSSI,constructedusingthreeconstructs:singleprocessspace(SPS),processmigration,anddynamicloadbalanc-ing.WeproposedblockPIDallocationasanewdesignforscalableSPS,especiallyforgeneratingcluster-wideuniquePIDs.WhenaclientrequestsmorePIDs,aPIDblockisassignedtotheclientinsteadofindividualPIDs.Therefore,thisapproachlightenstheheavyburdenofthemasternodesforPIDallocation.

Ourprocessmigrationprovidessolutionstothetwoma-jorproblemsinsocketmigration:brokenpipeproblemsandbinderrorsduringsocketreconstruction.Communicationchannelclearingbyzerowindowadvertisementhelpstore-solvebrokenpipeproblemsandTCPportinheritancepre-ventsbinderrorsduringsocketreconstruction.Thus,ourprocessmigrationfacilityisnowabletopreserveopennet-workconnections,andevenserversocketconnections.Thisisahighlytransparentapproach,inthatwedidnotintro-duceadditionalmessagesforchannelclearinganddidnotmakeanymodiﬁcationtotheTCPprotocolstack.

Wealsoprovideamoresophisticateddynamicloadbal-ancingapproachwhichresolvesthelimitationsofourprevi-ousone.Thelimitationsincludeincorrectinitialjobplace-mentmayarisefromthefactthatjobsmaynotbeclassiﬁedintoCPU-boundorI/O-boundorthatsomejobsmayvarybetweenCPU-boundandI/O-bound.Inordertoresolvethisproblem,theLinuxkernelcontinuouslytracesrunningjobsbymonitoringtheirresourceusageatruntime.Theloadbalancingdaemonperiodicallycheckstheneedformigra-tion,anddynamicallymigrateswhenanincorrectdecisionhasoccurred.Thisapproachgivesa14.15%improvementcomparedwithhistory-basedone,anda8.95%improve-mentoverourpreviousapproach.

Themaincontributionofthisworkistobuildacomput-ingenvironmentforSSIclustersbycombiningtheabovethreecomponentsininnovativeways.Forthispurpose,wedesigntheseconstructstoimplicitlycooperatewitheachotheraswellastakeanovelapproachestoimprovetheirownperformanceandfunctionality.

[1]L.Amar,A.Barak,andA.Shiloh.Themosixdirectﬁlesys-temaccessmethodforsupportingscalableclusterﬁlesys-tems.ClusterComputing,pages141–150,2004.

[2]J.Basney,M.Livny,andR.Buyya.DeployingaHigh

ThroughputComputingCluster.PrenticeHall,1999.

[3]A.M.Bond.Adaptivetaskallocationinadistributed

workstationenvironment,phdthesis.VictoriaUniversityofWellington,1993.

[4]K.Bubendorfer.Resourcebasedpoliciesforloaddistri-bution,master’sthesis.VictoriaUniversityofWellington,1996.

[5]J.Casas,D.L.Clark,R.Konuru,S.W.Otto,R.M.Prouty,

andJ.Walpole.Mpvm:Amigrationtransparentversionofpvm.ComputingSystems,1995.

[6]M.Choi,J.R.Yu,H.J.Kim,andS.R.Maeng.Improving

performanceofadynamicloadbalancingsystembyusingnumberofeffectivetasks.IEEEInternationalConferenceonClusterComputing(CLUSTER),December01-04,HongKong,2003.

[7]D.G.Feitelson.Parallelworkloadarchive.

http://www.cs.huji.ac.il/labs/parallel/workload/.

[8]S.S.I.C.forLinux(OpenSSI).http://www.openssi.org.[9]E.Hendriks.Bproc:Thebeowulfdistributedprocessspace.

ACMInternationalConferenceonSupercomputing(ICS),June22-26,NewYork,USA,2002.

[10]R.S.C.Ho,K.Hwang,andH.Jin.Designandanalysisof

clusterswithsinglei/ospace.IEEEInternationalConfer-enceonDistributedComputingSystems(ICDCS),April10-13,Taipei,Taiwan,2000.

[11]S.Hotovy,D.Scheider,andT.O’Donnell.Analysisofthe

earlyworkloadonthecornelltheorycenteribmsp2.Pro-ceedingsofthe1996ACMSIGMETRICSConferenceon-MeasurementandModelingofComputerSystems,pages272–273,1996.

[12]Y.A.Khalidi,J.M.Bernabeu,V.Matena,K.Shirriff,and

M.Thadani.Solarismc:Amulticomputeros.USENIX1996AnnualTechnicalConference,SanDiego,CA,1996.[13]P.KruegerandR.Chawla.Thestealthdistributedsched-uler.ProceedingsofIEEE11thInternationalConferenceonDistributedComputingSystems,pages336–343,1991.[14]L.McVoy.lmbench:Portabletoolsforperformanceanal-ysis.ProceedingsoftheUSENIX1996AnnualTechnicalConference,SanDiego,California,1996.

[15]J.Pasquale,B.Bittel,andD.Kraiman.Astaticanddy-namicworkloadcharacterizationstudyofthesandiegosu-percomputercentercrayx-mp.Proceedingsofthe1991ACMSIGMETRICSConferenceonMeasurementandMod-elingofComputerSystems,pages218–219,1991.

[16]E.Pinheiro.Truly-transparentcheckpointingofparallel

applications(workingdraft).FederalUniversityofRiodeJaneiroUFRJ,2001.

[17]J.S.Plank,M.Beck,G.Kingsley,andK.Li.Libckpt:

Transparentcheckpointingunderunix.ProceedingofUSENIXAnnualTechnicalConference,1995.[18]S.Software.Apenguincomputingcompany,

http://www.scyld.com/bproc.html.

[19]G.Stellner.Cocheck:Checkpointingandprocessmigration

formpi.Proceedingsofthe10thInternationalParallelPro-cessingSymposium,Honolulu,Hawaii,April15-19,1996.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文