Compiler-Driven Data Layout Transformation for Heterogeneous Platforms

Deepak Majeti1, Rajkishore Barik2, Jisheng Zhao1, Max Grossman1, and Vivek Sarkar1
1 Rice University
2 Intel Labs

Abstract. Modern heterogeneous systems comprise CPU cores, GPU cores, and in some cases, accelerator cores. Each of these computational cores has a very different memory hierarchy, making it challenging to efficiently map the data structures of an application to these memory hierarchies automatically. In this paper, we present a compiler-driven data layout transformation framework for heterogeneous platforms. We integrate our data layout framework with the data-parallel construct, forasync, of Habanero-C and enable the same source code to be compiled with different data layouts for various architectures. The programmer or an auto-tuner specifies a schema of the data layout. The compiler infrastructure generates efficient code for different architectures based on the meta information provided in the schema. Our experimental results show significant benefits from the compiler-driven data layout transformation, and demonstrate that the best data layout for a program varies with different heterogeneous platforms.

1 Introduction
Recent hardware trends have seen the adoption of heterogeneous systems consisting of standard processor cores, graphics processing cores, and accelerator cores. While the memory hierarchy of standard CPU cores consists of L1, L2 and L3 caches, recent discrete GPU cores have also been embedded with their own L1 and L2 caches. Integrated GPU cores, on the other hand, share the same physical memory with the CPU while using a private L3 cache for the GPU cores. With such differing memory hierarchies within the same system, determining the best data layout can be challenging, since the optimal layout for a computational kernel depends on whether the kernel executes on a CPU core, a discrete GPU, or an integrated GPU (along with other factors). Additionally, GPU memory performance is impacted by the number of coalesced memory accesses and by control-flow divergence, whereas CPU memory performance is impacted by factors such as false sharing, prefetching and data reuse. This implies that changes in data layout can have a major impact on CPU vs. GPU core performance. In general, the programmer has to write different versions of CPU and GPU kernels for different architectures and has to select optimal memory layouts for each data structure. This places a severe constraint on code portability. At the same time, performing these layout transformations automatically for a wide range of applications, including irregular applications, is a daunting task.
D. an Mey et al. (Eds.): Euro-Par 2013 Workshops, LNCS 8374, pp. 188–197, 2014. © Springer-Verlag Berlin Heidelberg 2014

CUDA and OpenCL are the two primary languages targeting heterogeneous systems for GPGPU programming. OpenCL can also be used for targeting CPU cores. Many high-level programming models that deal with heterogeneity have also evolved in the last few years [14,21,13,22,6,12]. Choosing a language involves trade-offs between programmer productivity and performance. However, none of these languages provides mechanisms to specify the data layout. Some recent works [20,10,19] provide a library-based mechanism, but are limited in scope. For example, Kunkel et al. [24] emphasize the need for a data layout abstraction. Recently, Wu et al. [23] have proved that finding the optimal data layout to maximize the number of coalesced accesses on a GPU is NP-complete. Thus, manually writing high-performance portable programs, or automatically generating efficient code via compilers without any domain knowledge, is challenging given the proliferation of device technologies on heterogeneous architectures and their differing memory hierarchies. We believe that a compiler-driven data layout transformation framework can help bridge this gap.

In this paper, we present a compiler-driven meta-data framework that allows both programmers and tuning experts to specify architecture-specific and domain-specific information for parallel-for loops of a program. A meta-data file is created for an application and is populated with entries on the data layout to be used for a device on the heterogeneous system. The data layouts we focus on in this paper include structure-of-array (SOA), array-of-structure (AOS) and the intermediate structure-of-array-of-structures (SOAOS). Any high-level language which has parallel-for loops can be extended to accommodate the meta-data framework. In our work, we target the data-parallel forasync construct in Habanero-C [2] and integrate our meta-data framework with the compiler and runtime. The compiler compiles the forasync construct to generate OpenCL device and host code for the target heterogeneous architecture. The meta-data information is very useful in guiding compiler optimization passes for the generation of efficient code for a device.
Our paper makes the following contributions:

– A meta-data framework that allows both the programmer and the tuning expert to specify the underlying architecture and domain-specific knowledge for parallel-for loops;
– A compiler and runtime framework to automatically generate efficient code based on the meta-data information. We currently focus on the AOS, SOA and SOAOS data layouts in our compiler;
– An experimental evaluation of our system using a wide variety of heterogeneous architectures, which shows the impact of data layout on 5 distinct applications. On average, the data layout transformation alone impacted the performance by 7.33× (up to 27.11×) on the AMD 4-core A10-5800K CPU, 2.84× (up to 5.57×) on the AMD Radeon integrated GPU, 8.32× (up to 29.5×) on the NVIDIA Tesla M2050 GPU, 2.19× (up to 5.32×) on the Intel 12-core X5660 CPU and 1.9× (up to 3.89×) on the Intel integrated i7-3770 GPU.

The rest of this paper is organized as follows. Section 2 presents our meta-data framework. Section 3 discusses the details of our compiler code generation and runtime. Section 4 presents the experimental results on a wide variety of processors. Related work is discussed in Section 5, and finally, Section 6 concludes.

2 Programming Model

Our meta-data framework is built on top of the Habanero-C (HC) compiler and runtime infrastructure [9]. The details of the parallel constructs supported by HC can be found at [2]. Our paper focuses on the data-parallel forasync construct¹. The syntax of the forasync construct is as follows.

    forasync index(args) size(args) optional {
        // forasync body
    }

The semantics of the forasync construct are similar to those of a program loop which exhibits parallel-for parallelism. The index clause is used to specify the loop iterators. The number of variables in the index clause gives the dimensionality of the loop. The size clause specifies the number of iterations of the loop in each dimension. There are 2 optional clauses, scratchpad and seq. The HC language model takes advantage of the different memory regions available on most GPU hardware with the help of the scratchpad and seq clauses.

For each host or device on a heterogeneous system, it is possible to specify the desired data layout for array-based or structure-based data structures of a given forasync loop. The data layouts that we focus on are:
(1) AOS: array-of-structure;
(2) SOA: structure-of-array; and
(3) SOAOS: structure-of-array-of-structures.

Our compiler (described in Section 3), with the help of the meta-data file, is able to transform HC code to one of the SOA, AOS and SOAOS layouts.

The grammar for the meta-data and an example are shown in Figure 1.

    archname      -> Arch name metadata
    metadata      -> (structdef)* (scratchpaddef)*
    structdef     -> Struct name (fielddef)*
    scratchpaddef -> Scratchpad name (fielddef tilesize linenum)*
    fielddef      -> Field type name length
    type          =  fp | dp | ip
    length        -> (digit)*
    tilesize      -> (digit)*
    linenum       -> (digit)*
    name          =  (letter)(letter | digit)*
    letter        -> A | B | C | ... | Z | a | b | c | ... | z
    digit         -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0

    Arch IntelGPU
    Struct bodypos
    Field fp posx
    Field fp posy
    Field fp posz
    Struct
    Field fp x
    Field fp y
    Field fp z
    Scratchpad local Field fp posx 256 64
    Scratchpad local Field fp posy 256 64
    Scratchpad local Field fp posz 256 64

    Arch AMDGPU
    Struct bodypos
    Field fp posx
    Field fp posy
    Field fp posz
    Field fp x
    Field fp y
    Field fp z
    Scratchpad local Field fp x 1024 64

Fig. 1. Meta-data grammar and meta-data file example

The meta-data file consists of a set of architecture-specific optimization information. The architectural details consist of the data layout information and the scratchpad memory allocation information for a given program. Each struct definition has a label Struct, a name for the struct and a set of fields. Each field in turn has a label Field, the type of the field and the name of the field. The type of a field can be fp (a pointer to an array of float values), dp (a pointer to an array of double values) or ip (a pointer to an array of integer values). The scratchpad memory allocation information consists of a set of buffer descriptions. Each description begins with a label Scratchpad, followed by the name of the special memory region, the field, the amount of data which must be cached and the line number of the forasync.

¹ Our framework is also applicable to other data-parallel programming languages with a parallel-for like construct.

Restrictions of our meta-data framework. The user cannot alias the fields specified in the meta-data file. We plan to resolve this issue with the help of an alias analysis. Another limitation in the programming model is that a variable name cannot be repeated in different scopes anywhere in the program. This limitation can be avoided by a clever variable renaming mechanism. Also, all fields in a struct must be of the same type. We currently do not support complex data layouts such as AOSOA (array-of-structure-of-arrays). We leave this for future work.
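To make the three layouts concrete, the following is a minimal sketch in plain C of how the six float fields from the Figure 1 example could be declared under AOS, SOA and SOAOS. It is an illustration only, not the code the HC compiler emits: the struct names body_aos, body_soa, body_soaos and rest, and the three sum_posx_* helpers, are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* AOS: the six fields of one body are adjacent in memory. */
struct body_aos { float posx, posy, posz, x, y, z; };

/* SOA: one array per field; element i of every array belongs to body i. */
struct body_soa { float *posx, *posy, *posz, *x, *y, *z; };

/* SOAOS: fields are split into groups, and each group is stored as an
   array of small structs. The group name "rest" is hypothetical. */
struct pos  { float posx, posy, posz; };
struct rest { float x, y, z; };
struct body_soaos { struct pos *pos; struct rest *rest; };

/* Sum posx over n bodies in each layout; only the addressing differs. */
float sum_posx_aos(const struct body_aos *b, size_t n) {
    assert(b != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += b[i].posx;       /* stride: whole struct */
    return s;
}

float sum_posx_soa(const struct body_soa *b, size_t n) {
    assert(b != NULL && b->posx != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += b->posx[i];      /* stride: one float */
    return s;
}

float sum_posx_soaos(const struct body_soaos *b, size_t n) {
    assert(b != NULL && b->pos != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += b->pos[i].posx;  /* stride: one group */
    return s;
}
```

The three helpers compute the same value; what changes is the distance in memory between the posx of body i and the posx of body i+1, which is the property the later evaluation ties to coalescing on GPUs and cache-line use on CPUs.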
3 Implementation

Our overall meta-data framework is shown in Figure 2. The application user writes a program in Habanero-C (HC) using the forasync construct. Following this, either the developer or the tuning expert specifies the meta-data information for the application. We extend the HC compiler infrastructure to perform data layout transformation based on the meta information. The compiler pass is implemented in the ROSE compiler framework [17]. The compiler generates OpenCL code from the program with the specified data layout, together with the corresponding host code.

Fig. 2. Compilation flow. Boxes in the figure: Habanero-C source (.hc files); Metafile; HC compiler passes, Layout + HC-OpenCL (ROSE); C program (.c files) + OpenCL (.cl files); Host program (.c files); Runtime: OpenCL SDK + memory manager; C compiler (GCC); Executable binary files; GPU; CPU.

3.1 Data Layout Transformation

The compiler pass first parses the specified meta-data file and creates a meta-data map for each architecture. The mapping is between the fields and the name of the struct they belong to. The mapping is done for each such struct meta information. If the pass finds any scratchpad meta information, it is recorded.

The data layout transformation (DLT) compiler pass generates code based on the data layout specified in the meta-data file. The generated code includes new struct definitions and the code that operates on them. Figure 3 shows the algorithm for transforming the program with a given data layout. DLT takes the input program and a meta-data file. createStructDefinitions(M) adds the struct definitions specified in the meta-data file to the AST. These structs are defined only once, in the global scope. The DLT pass then iterates over all the functions and performs the steps described in lines 4-7. tryAddStructInstances(f) analyzes the function parameters. If any of the parameter names appear in the metafile, an instance of the corresponding struct is declared in the function call. If we abstract a struct as a group of field names, then one struct instance is declared per group. In the next step, updateInst(I) checks all pointer or array references in the function body. If any of those references is made via one of the fields in the metafile, then the access is replaced with the corresponding struct instance.

    1 function DLT()
        Input: Metafile M and input program P
        Output: Transformed program P'
    2   createStructDefinitions(M);
    3   foreach function F in P do
    4     foreach formal f in function parameter list do
    5       tryAddStructInstances(f);
    6     foreach instruction I in function body do
    7       updateInst(I);

Fig. 3. Data Layout Transformation

An important factor here is that the type of the function in the original program remains the same. Keeping the function types intact avoids rewriting the direct and indirect calls to the function from sequential code.

3.2 Memory Management

In the HC programming model, the programmer allocates heap memory for the fields via standard malloc and calloc calls. We replace these calls with our specialized memory allocators, hcmetamalloc and hcmetacalloc. The syntax of the allocators is shown in Figure 4.

    void *hcmetamalloc(char *fldname, size_t numbytes);
    void *hcmetacalloc(char *fldname, size_t numelem, size_t sizeelem);

Fig. 4. Memory Allocators

hcmetamalloc and hcmetacalloc are wrappers around the standard malloc and calloc calls. The allocators also pass the name of the field to the memory manager. The field name is required by the memory manager, as explained next.

The memory manager handles the different layouts and also creates device buffers. The memory manager has two components, the memory allocator and the layout handler. During the program initialization phase, the layout handler reads the metafile and creates a map of the data layout. The memory manager, with the help of the field name, looks into the layout map and allocates the memory.

Figure 5 shows an example of code generation for a single kernel with meta-data information for an Intel architecture.

    // forasync program
    int main() {
        forasync point(i, j) size(M, N) seq(tilesize, tilesize) {
            a[i*M+j] = b[i*M+j] + c[i*M+j];
        }
    }

    // metadata
    Arch Intel_CPU
    Struct BC
    Field fp b
    Field fp c

    // Compiler generated code
    struct BC { float b; float c; };
    void offload(float *a, struct BC *bc,
                 char *kernel_name, char *ocl_kernel) {
        // OpenCL Host Code
        ..........
    }
    int main() {
        struct BC *bc = ...;
        offload(a, bc, "kernel_1", Kernel_string);
    }

    // OpenCL Kernel Code
    Kernel_string = "
        struct BC { float b; float c; };
        __kernel void kernel_1(__global float *a, __global struct BC *bc,
                               int M, int N) {
            int i = get_global_id(1);
            int j = get_global_id(0);
            a[i*M+j] = bc[i*M+j].b + bc[i*M+j].c;
        }";

Fig. 5. Example of code generation using a meta-data file

4 Evaluation

The goal of the experimental evaluation is to study the performance of different data structure layouts for various programs on multiple architectures.

Table 1. Benchmarks

    Name     Description                                  Original  Num of  Input
                                                          Layout    Fields
    NBody    N-Body Simulation                            SOA       7       32K nodes
    Medical  Medical Image Registration                   SOA       6       256×256×256
    SRAD     Speckle Reducing Anisotropic Diffusion       SOA       4       5020×4580
    Seismic  Seismic Wave Simulation                      SOA       6       4096×4096
    MRIQ     Matrix Q Computation for 3D Magnetic         SOA       6       64×64×64
             Resonance Image Reconstruction in
             Non-Cartesian Space

4.1 Experimental Setup

Table 1 describes the benchmarks used in this evaluation. The N-Body particle simulation benchmark was written from scratch for this work. We focus on the compute-intensive kernel which calculates the forces between the bodies.

The Medical Imaging benchmark includes kernels from a medical imaging pipeline used to analyze different types of medical images for defects or abnormalities [15]. This application consists of three main phases: denoising, registration, and segmentation. For our evaluation, we focus on the most computationally significant kernel of the three, registration.

The SRAD benchmark from the Rodinia benchmark suite [11] is also used. SRAD is used to "remove locally correlated noise" in "ultrasonic and radar imaging applications based on partial differential equations" [18].

The Seismic benchmark was created based on the example included in the Intel TBB benchmark suite [4]. Seismic simulates the propagation of waves during seismic activity.

The MRIQ benchmark from the Parboil benchmark suite [7] computes a Q matrix. The Q matrix represents the scanner configuration used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space. The MRIQ code has been converted to the SOA layout by hand.

Table 2 shows the different meta-data files used for each benchmark. Since the default layout is SOA, there is no need for a metafile in that case. All OpenCL kernels, glue code, and different layouts for each of these applications were generated from an HC array-based implementation.

Table 2. Application meta-data files

    NBody    AOS:    Struct body  Field fp posx  Field fp posy  Field fp posz
                     Field fp x  Field fp y  Field fp z
             SOAOS:  Struct pos  Field fp posx  Field fp posy  Field fp posz
                     Struct  Field fp x  Field fp y  Field fp z

    Seismic  AOS:    Struct params  Field fp S  Field fp T  Field fp V
                     Field fp D  Field fp L  Field fp M
             SOAOS:  N.A

    SRAD     AOS:    Struct direction  Field fp N  Field fp S  Field fp E  Field fp W
             SOAOS:  Struct direction1  Field fp N  Field fp S
                     Struct direction2  Field fp E  Field fp W

    Medical  AOS:    Struct disp  Field fp U1  Field fp U2  Field fp U3
                     Struct velocity  Field fp V1  Field fp V2  Field fp V3
             SOAOS:  Struct disp  Field fp U1  Field fp U2  Field fp U3

    MRIQ     AOS:    Struct body  Field fp kx  Field fp ky  Field fp kz  Field fp phiMag
             SOAOS:  N.A

Table 3. Hardware architectures

    Vendor  Type            Model           Freq              Cores    Local Mem               L1$     L2$
    Intel   CPU             X5660           2.8 GHz           12 (HT)  N.A                     192 KB  1.5 MB
    Intel   Integrated GPU  i7-3770U        350 MHz-1.15 GHz  14       64 KB (per half-slice)  N.A     N.A
    NVIDIA  Discrete GPU    Tesla M2050     575 MHz           8        8x48 KB                 16 KB   768 KB
    AMD     CPU             A10-5800K       1.4 GHz           4 (HT)   N.A                     16 KB   32 MB
    AMD     Integrated GPU  Radeon HD 7660  800 MHz           6        6x32 KB                 N.A     4 MB

Table 3 lists the hardware architectures used in our evaluation. We use a variety of CPU and GPU systems with differing memory hierarchies in order to demonstrate the benefit of our data layout transformation. The compiler used for the sequential versions of each application is GCC 4.4.6 (with the flags -g -O2). All OpenCL kernels were compiled with their default optimizations enabled. Intel GPU tests were run using the 2013 release of the Intel OpenCL SDK [3]. Intel CPU tests were performed using the 2011 release of the Intel OpenCL SDK, v1.5 [3]. NVIDIA GPU tests were performed using NVIDIA SDK v5.0 [5]. AMD CPU and GPU tests were performed using AMD APP SDK v2.8 [1].

4.2 CPU and GPU Performance

Figure 6 contains results for all the benchmarks. We compare relative execution time for the various data layouts on different CPU and GPU platforms. For a given architecture, we normalize every layout with respect to the fastest executing layout, so smaller bars imply better performance. Every column is stacked in 2 levels. The bottom level represents the fraction of total execution time spent in the kernel. This information is retrieved from the OpenCL API. The top stack represents the fraction of total execution time for the remaining execution. This includes communication and OpenCL initialization overheads. The top stack is negligible for the Intel GPU. This is because the GPU is integrated on the same die as the CPU and there is no data copying overhead. The NVIDIA GPU and the AMD architectures show copying overheads. For all the workloads, the AMD CPU/GPU exhibits a large amount of overhead. On further inspection, we discovered that the majority of the overhead was due to a significant time difference between OpenCL kernel enqueue and kernel execution. This could be an implementation error in AMD's OpenCL implementation. We could not find any tools which profile OpenCL code to analyze the performance differences, so we make the following analysis based on code and machine characteristics.
For the N-Body benchmark, we see that the SOA and AOS versions perform similarly on the CPU. Since the number of fields is small, all the loads in an iteration fit into the cache and consecutive iterations do not incur any penalty. The array layout performs better on GPUs because the SOA layout helps in memory coalescing.

For the Seismic kernel, the SOA layout shows better performance on the AMD CPU, whereas the AOS layout is better on the Intel CPU. This can be attributed to the difference in cache associativity and sizes between AMD and Intel. On the GPU side, SOA performs well on all 3 GPUs due to coalescing.

The SRAD kernel shows improved performance for the SOAOS layout relative to the SOA and AOS layouts on most of the architectures. Surprisingly, even on the GPU the AOS and SOAOS layouts perform better than the SOA layout. This is contrary to GPU best practices. The memory access functions in the SRAD kernel are non-affine and irregular. It is difficult for a compiler or programmer to analyze them and determine the right layout. Our framework enables rapid prototyping and testing of different layouts for performance on multiple architectures.

MRIQ exhibits little or no variation across layouts. MRIQ is a compute-bound kernel, so the data structure layout has little or no effect.

The medical image benchmark shows some interesting properties for different layouts. The AOS layout is better on the CPU whereas the SOA layout is better on the GPU. The medical image kernel is similar to a 3D Jacobi (stencil) computation. The computation is performed separately on three input buffers and the results are written into corresponding output buffers. Keeping the input buffers in a single struct is helpful for the CPU. This is because loading a point for one of the stencils automatically loads the points for the other 2 stencils (multiple points fit in a cache line). The array layout would have caused 3 loads for the same point, one in each of the three stencils. On the GPU side, the array layout is better, as expected.
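The cache-line argument for the medical image kernel can be checked with a little arithmetic. The sketch below counts how many distinct cache lines hold the three input values of one stencil point under each layout. It rests on assumptions that are not taken from the paper: a 64-byte cache line, 4-byte floats, buffers that start on a line boundary, and, for SOA, three arrays of n floats placed back to back. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u  /* assumed line size in bytes; not stated in the paper */

/* Number of cache lines covered by the byte range [off, off + len). */
size_t lines_spanned(size_t off, size_t len) {
    assert(len > 0);
    return (off + len - 1) / CACHE_LINE - off / CACHE_LINE + 1;
}

/* AOS: the three inputs of point i sit together in one 3-float struct. */
size_t lines_touched_aos(size_t i) {
    return lines_spanned(i * 3 * sizeof(float), 3 * sizeof(float));
}

/* SOA: the three inputs of point i live in three arrays of n floats each,
   assumed here to be laid out back to back. Counts distinct lines. */
size_t lines_touched_soa(size_t i, size_t n) {
    size_t seen[3];
    size_t count = 0;
    for (size_t a = 0; a < 3; a++) {
        size_t line = (a * n + i) * sizeof(float) / CACHE_LINE;
        int dup = 0;
        for (size_t k = 0; k < count; k++) {
            if (seen[k] == line) dup = 1;
        }
        if (!dup) seen[count++] = line;
    }
    return count;
}
```

Under these assumptions an AOS point occupies one line (two when the 12-byte struct straddles a line boundary), while an SOA point with large arrays occupies three lines, one per array, which matches the "3 loads for the same point" reasoning above. The same arithmetic says nothing about GPUs, where adjacent work-items reading adjacent floats of one array is what enables coalescing.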
Best practices generally dictate the use of array data layouts on GPUs due to improved coalescence of global memory accesses. However, our SRAD and MRIQ results contradict this knowledge. Our meta-data framework enables rapid prototyping and optimization of different data layouts, allowing tuning experts to rapidly discover optimal layouts for complex and irregular applications. For the CPU, the layout often depends upon the kernel features and memory access patterns. Our programming model can easily port such applications to different architectures.

5 Related Work

Recently, data layouts have been studied in the context of GPUs. DL [20] uses a mapping function and runtime library support to enable architecture-specific data layouts. DL does in-place data marshaling on the GPU. Like DL, Dymaxion [10] proposes a set of index mapping functions which are used to optimize memory mappings, with data marshaling done on the CPU side. Sung et al. [19] used techniques similar to DL to perform data layout transformations for structured grid applications. The compiler automatically changes the order of n-dimensional array references to maximize memory access coalescing. With the help of micro-benchmarks, low-latency strides and an optimal index map are discovered. This technique requires manual host code changes. The main disadvantage of the techniques listed in this paragraph is that the overhead of
