Compiler-Driven Data Layout Transformation for Heterogeneous Platforms

Deepak Majeti¹, Rajkishore Barik², Jisheng Zhao¹, Max Grossman¹, and Vivek Sarkar¹
¹ Rice University  ² Intel Labs

D. an Mey et al. (Eds.): Euro-Par 2013 Workshops, LNCS 8374, pp. 188–197, 2014. © Springer-Verlag Berlin Heidelberg 2014

Abstract. Modern heterogeneous systems comprise CPU cores, GPU cores, and in some cases, accelerator cores. Each of these computational cores has a very different memory hierarchy, making it challenging to efficiently map the data structures of an application to these memory hierarchies automatically. In this paper, we present a compiler-driven data layout transformation framework for heterogeneous platforms. We integrate our data layout framework with the data-parallel construct, forasync, of Habanero-C and enable the same source code to be compiled with different data layouts for various architectures. The programmer or an auto-tuner specifies a schema of the data layout. The compiler infrastructure generates efficient code for different architectures based on the meta information provided in the schema. Our experimental results show significant benefits from the compiler-driven data layout transformation, and demonstrate that the best data layout for a program varies with different heterogeneous platforms.

1 Introduction

Recent hardware trends have seen the adoption of heterogeneous systems consisting of standard processor cores, graphics processing cores, and accelerator cores. While the memory hierarchy of standard CPU cores consists of L1, L2 and L3 caches, recent discrete GPU cores have also been embedded with their own L1 and L2 caches. Integrated GPU cores, on the other hand, share the same physical memory with the CPU while using a private L3 cache for the GPU cores. With such differing memory hierarchies within the same system, determining the best data layout can be challenging, since the optimal layout for a computational kernel depends on whether the kernel executes on a CPU core, a discrete GPU, or an integrated GPU (along with other factors). Additionally, GPU memory performance is impacted by the number of coalesced memory accesses and control-flow divergence, whereas CPU memory performance is impacted by factors such as false sharing, prefetching and data reuse. This implies that changes in data layout can have a major impact on CPU vs. GPU core performance. In general, the programmer has to write different versions of CPU and GPU kernels for different architectures and has to select optimal memory layouts for each data structure. This places a severe constraint on code portability. At the same time, performing these layout transformations automatically for a wide range of applications, including irregular applications, is a daunting task.

CUDA and OpenCL are the two primary languages targeting heterogeneous systems for GPGPU programming. OpenCL can also be used for targeting CPU cores. Many high-level programming models that deal with heterogeneity have also evolved in the last few years [14,21,13,22,6,12]. Choosing a language involves trade-offs between programmer productivity and performance. However, none of these languages provides mechanisms to specify the data layout. Some recent works [20,10,19] provide a library-based mechanism, but are limited in scope. For example, Kunkel et al. [24] emphasize the need for a data layout abstraction. Recently, Wu et al. [23] have proved that finding the optimal data layout to maximize the number of coalesced accesses on a GPU is NP-complete. Thus, manually writing high-performance portable programs, or automatically generating efficient code via compilers without any domain knowledge, is challenging given the proliferation of device technologies on heterogeneous architectures and their differing memory hierarchies. We believe that a compiler-driven data layout transformation framework can help bridge this gap.
In this paper, we present a compiler-driven meta-data framework that allows both programmers and tuning experts to specify architecture-specific and domain-specific information for parallel-for loops of a program. A meta-data file is created for an application and is populated with entries on the data layout to be used for a device on the heterogeneous system. The data layouts we focus on in this paper include structure-of-arrays (SOA), array-of-structures (AOS) and the intermediate structure-of-array-of-structures (SOAOS). Any high-level language which has parallel-for loops can be extended to accommodate the meta-data framework. In our work, we target the data-parallel forasync construct in Habanero-C [2] and integrate our meta-data framework with the compiler and runtime. The compiler compiles the forasync construct to generate OpenCL device and host code for the target heterogeneous architecture. The meta-data information is very useful in guiding compiler optimization passes for the generation of efficient code for a device.

Our paper makes the following contributions:

– A meta-data framework that allows both the programmer and the tuning expert to specify the underlying architecture and domain-specific knowledge for parallel-for loops;
– A compiler and runtime framework to automatically generate efficient code based on the meta-data information. We currently focus on AOS, SOA and SOAOS data layouts in the compiler;
– An experimental evaluation of our system using a wide variety of heterogeneous architectures, which shows the impact of data layout on 5 distinct applications. On average, the data layout transformation alone impacted the performance by 7.33× (up to 27.11×) on the AMD 4-core A10-5800K CPU, 2.84× (up to 5.57×) on the AMD Radeon integrated GPU, 8.32× (up to 29.5×) on the NVIDIA Tesla M2050 GPU, 2.19× (up to 5.32×) on the Intel 12-core X5660 CPU and 1.9× (up to 3.89×) on the Intel integrated i7-3770 GPU.

The rest of this paper is organized as follows. Section 2 presents our meta-data framework. Section 3 discusses the details of the compiler code generation and runtime. Section 4 presents the experimental results on a wide variety of processors. Related work is discussed in Section 5, and finally, Section 6 concludes.

2 Programming Model

Our meta-data framework is built on top of the Habanero-C (HC) compiler and runtime infrastructure [9]. The details of the parallel constructs supported by HC can be found at [2]. Our paper focuses on the data-parallel forasync construct¹. The syntax of the forasync construct is as follows.

    forasync index(args) size(args) optional {
        // forasync body
    }

The semantics of the forasync construct are similar to those of a program loop which exhibits parallel-for parallelism. The index clause is used to specify the loop iterators. The number of variables in the index clause gives the dimensionality of the loop. The size clause specifies the number of iterations of the loop in each dimension. There are 2 optional clauses, scratchpad and seq. The HC language model takes advantage of the different memory regions available on most GPU hardware with the help of the scratchpad and seq clauses.

For each host or device on a heterogeneous system, it is possible to specify the desired data layout for array-based or structure-based data structures of a given forasync loop. The data layouts that we focus on are:
(1) AOS: array-of-structures;
(2) SOA: structure-of-arrays; and
(3) SOAOS: structure-of-array-of-structures.

The compiler (described in Section 3), with the help of the meta-data file, is able to transform HC code to one of the SOA, AOS and SOAOS layouts. The grammar for the meta-data and an example are shown in Figure 1.

    archname      -> Arch name metadata
    metadata      -> (structdef)* (scratchpaddef)*
    structdef     -> Struct name (fielddef)*
    scratchpaddef -> Scratchpad name (fielddef tilesize linenum)*
    fielddef      -> Field type name length
    type          =  fp | dp | ip
    length        -> (digit)*
    tilesize      -> (digit)*
    linenum       -> (digit)*
    name          =  (letter)(letter | digit)*
    letter        -> A | B | C | ... | Z | a | b | c | ... | z
    digit         -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0

    Arch IntelGPU
    Struct bodypos
    Field fp posx
    Field fp posy
    Field fp posz
    Struct
    Field fp x
    Field fp y
    Field fp z
    Scratchpad local Field fp posx 256 64
    Scratchpad local Field fp posy 256 64
    Scratchpad local Field fp posz 256 64

    Arch AMDGPU
    Struct bodypos
    Field fp posx
    Field fp posy
    Field fp posz
    Field fp x
    Field fp y
    Field fp z
    Scratchpad local Field fp x 1024 64

Fig. 1. Meta-data grammar and meta-data file example

The meta-data file consists of a set of architecture-specific optimization information. The architectural details consist of the data layout information and scratchpad memory allocation information for a given program. Each struct definition has a label Struct, a name for the struct and a set of fields. Each field in turn has a label Field, the type of the field and the name of the field. The type of a field can be fp: a pointer to an array of float values, dp: a pointer to an array of double values, or ip: a pointer to an array of integer values. The scratchpad memory allocation information consists of a set of buffer descriptions. Each description begins with a label Scratchpad, followed by the name of the special memory region, the field, the amount of data which must be cached and the line number of the forasync.

¹ Our framework is also applicable to other data-parallel programming languages with a parallel-for-like construct.

Restrictions of our meta-data framework. The user cannot alias the fields specified in the meta-data file. We plan to resolve this issue with the help of an alias analysis. Another limitation in the programming model is that a variable name cannot be repeated in different scopes anywhere in the program. This limitation can be avoided by a clever variable renaming mechanism. Also, all fields in a struct must be of the same type. We currently do not support complex data layouts such as AOSOA (array-of-structure-of-arrays). We leave this for future work.

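As a rough illustration of the three layouts named in this section, the sketch below declares the same six float fields (the posx, posy, posz, x, y and z fields of the Figure 1 example) in SOA, AOS and SOAOS form. The type names, the group name xyz_group, the element count N and the helper functions are assumptions made for this sketch; it is not the code emitted by the HC compiler.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024  /* hypothetical number of bodies, for illustration only */

/* SOA: one separate array per field. */
struct soa_bodies { float posx[N], posy[N], posz[N], x[N], y[N], z[N]; };

/* AOS: all six fields of one body are stored next to each other. */
struct aos_body { float posx, posy, posz, x, y, z; };

/* SOAOS: fields are split into groups; each group is its own array of structs. */
struct pos_group { float posx, posy, posz; };
struct xyz_group { float x, y, z; };          /* hypothetical group name */
struct soaos_bodies { struct pos_group pos[N]; struct xyz_group xyz[N]; };

/* Read posx of body i under each layout. */
float posx_soa(const struct soa_bodies *b, size_t i)     { assert(i < N); return b->posx[i]; }
float posx_aos(const struct aos_body *b, size_t i)       { assert(i < N); return b[i].posx; }
float posx_soaos(const struct soaos_bodies *b, size_t i) { assert(i < N); return b->pos[i].posx; }

/* Byte distance between posx of body i and posx of body i+1. */
size_t posx_stride_soa(void)   { return sizeof(float); }
size_t posx_stride_aos(void)   { return sizeof(struct aos_body); }
size_t posx_stride_soaos(void) { return sizeof(struct pos_group); }
```

The stride helpers make the trade-off visible: consecutive posx values are adjacent in SOA, one full struct apart in AOS, and one group apart in SOAOS, which is why the same loop can favor different layouts on different memory hierarchies.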
3 Implementation

Our overall meta-data framework is shown in Figure 2. The application user writes a program in Habanero-C (HC) using the forasync construct. Following this, either the developer or the tuning expert specifies the meta-data information for the application. We extend the compiler infrastructure to perform data layout transformation based on the meta information. The compiler pass is implemented in the ROSE compiler framework [17]. The compiler generates OpenCL code from the program with the specified data layout and the corresponding host code.

Fig. 2. Compilation flow. [Diagram: Habanero-C source (.hc files) and meta file; HC compiler passes, Layout + HC-OpenCL (ROSE); C program (.c files) + OpenCL C (.cl files); host program (.c files); runtime: OpenCL SDK + memory manager; C compiler (GCC); executable binary files; GPU, CPU]

3.1 Data Layout Transformation

The compiler pass first parses the specified meta-data file and creates a meta-data map for each architecture. The mapping is between the fields and the name of the struct they belong to. The mapping is done for each such struct meta information. If the pass finds any scratchpad meta information, it is recorded.

The data layout transformation (DLT) compiler pass generates the code based on the data layout specified in the meta-data file. It generates code which includes new struct definitions and the code that operates on them. Figure 3 shows the algorithm for transforming the program with a given data layout. DLT takes the input program and a meta-data file. createStructDefinitions(M) adds the struct definitions as specified in the meta-data file to the AST. These structs are defined only once in the global scope. The DLT pass then iterates over all the functions and performs the steps described in lines 4-7. tryAddStructInstances(f) analyzes the function parameters. If any of the parameter names appear in the meta file, an instance of the corresponding struct is declared in the function call. If we abstract the struct as a group of field names, then one struct instance is declared per group. In the next step, updateInst(I) checks all pointer or array references in the function body. If any of those references are via any of the fields in the meta file, then the access is replaced with the corresponding struct instance.

    1 function DLT()
        Input: Meta file M and input program P
        Output: Transformed program P'
    2   createStructDefinitions(M);
    3   foreach function F in P do
    4     foreach formal f in function parameter list do
    5       tryAddStructInstances(f);
    6     foreach instruction I in function body do
    7       updateInst(I);

Fig. 3. Data Layout Transformation

An important factor here is that the type of the function in the original program remains the same. Keeping the function types intact avoids rewriting the direct and indirect calls to the function from sequential code.

3.2 Memory Management

In the HC programming model, the programmer allocates heap memory for the fields via standard malloc and calloc calls. We replace these calls with our specialized memory allocators, named hcmetamalloc and hcmetacalloc. The syntax of the allocators is shown in Figure 4.

    void *hcmetamalloc(char *fldname, size_t numbytes);
    void *hcmetacalloc(char *fldname, size_t numelem, size_t sizeelem);

Fig. 4. Memory Allocators

hcmetamalloc and hcmetacalloc are wrappers around the standard malloc and calloc calls. The wrappers also pass the name of the field to the memory allocator. The field name is required by the memory manager, as explained below.

The memory manager handles the different layouts and also creates device buffers. The memory manager has two components, the memory allocator and the layout handler. During the program initialization phase, the layout handler reads the meta file and creates a map of the data layout. The memory manager, with the help of the field name, looks into the layout map and allocates the memory.

Figure 5 shows an example code generation of a single kernel with meta-data information for an Intel architecture.

    // forasync program
    int main() {
        forasync point(i,j) size(M,N) seq(tilesize,tilesize) {
            a[i*M+j] = b[i*M+j] + c[i*M+j];
        }
    }

    // metadata
    Arch Intel_CPU
    Struct BC
    Field fp b
    Field fp c

    // Compiler generated code
    struct BC { float b; float c; };
    void offload(float *a, struct BC *bc,
                 char *kernel_name, char *ocl_kernel) {
        // OpenCL Host Code
        ..........
    }
    int main() {
        struct BC *bc = ...;
        offload(a, bc, "kernel_1", Kernel_string);
    }

    // OpenCL Kernel Code
    Kernel_string = "
    struct BC { float b; float c; };
    __kernel void kernel_1(__global float *a, __global struct BC *bc, int M, int N) {
        int i = get_global_id(1);
        int j = get_global_id(0);
        a[i*M+j] = bc[i*M+j].b + bc[i*M+j].c;
    }";

Fig. 5. Example of code generation using a meta-data file

4 Evaluation

The goal of the experimental evaluation is to study the performance of different data structure layouts for various programs on multiple architectures.

Table 1. Benchmarks

    Name     Description                                     Original Layout  Num of Fields  Input
    NBody    N-Body Simulation                               SOA              7              32K nodes
    Medical  Medical Image Registration                      SOA              6              256×256×256
    SRAD     Speckle Reducing Anisotropic Diffusion          SOA              4              5020×4580
    Seismic  Seismic Wave Simulation                         SOA              6              4096×4096
    MRIQ     Matrix Q Computation for 3D Magnetic Resonance  SOA              6              64×64×64
             Image Reconstruction in Non-Cartesian Space

4.1 Experimental Setup

Table 1 describes the benchmarks used in this evaluation. The N-Body particle simulation benchmark was written from scratch for this work. We focus on the compute-intensive kernel which calculates the forces between the bodies.

The Medical Imaging benchmark includes kernels from a medical imaging pipeline used to analyze different types of medical images for defects or abnormalities [15]. This application consists of three main phases: denoising, registration, and segmentation. For our evaluation, we focus on the most computationally significant kernel of the three, registration.

The SRAD benchmark from the Rodinia benchmark suite [11] is also used. SRAD is used to "remove locally correlated noise" in "ultrasonic and radar imaging applications based on partial differential equations" [18].

The Seismic benchmark was created based on the example included in the Intel TBB benchmark suite [4]. Seismic simulates the propagation of waves during seismic activity.

The MRIQ benchmark from the Parboil benchmark suite [7] computes a Q matrix. The Q matrix represents the scanner configuration used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space. The MRIQ code has been converted to SOA layout by hand.

Table 2 shows the different meta-data files used for each benchmark. Since the default layout is SOA, there is no need for a meta file for that layout. All OpenCL kernels, glue code, and different layouts for each of these applications were generated from an HC array-based implementation.

Table 2. Application meta-data files

    Application  Layout  Meta-data
    NBody        AOS     Struct body  Field fp posx  Field fp posy  Field fp posz  Field fp x  Field fp y  Field fp z
                 SOAOS   Struct pos  Field fp posx  Field fp posy  Field fp posz
                         Struct  Field fp x  Field fp y  Field fp z
    Seismic      AOS     Struct params  Field fp S  Field fp T  Field fp V  Field fp D  Field fp L  Field fp M
                 SOAOS   N.A
    SRAD         AOS     Struct direction  Field fp N  Field fp S  Field fp E  Field fp W
                 SOAOS   Struct direction1  Field fp N  Field fp S
                         Struct direction2  Field fp E  Field fp W
    Medical      AOS     Struct disp  Field fp U1  Field fp U2  Field fp U3
                         Struct velocity  Field fp V1  Field fp V2  Field fp V3
                 SOAOS   Struct disp  Field fp U1  Field fp U2  Field fp U3
    MRIQ         AOS     Struct body  Field fp kx  Field fp ky  Field fp kz  Field fp phiMag
                 SOAOS   N.A

Table 3. Hardware architectures

    Vendor  Type            Model           Freq              Cores    Local Mem                L1$     L2$
    Intel   CPU             X5660           2.8 GHz           12 (HT)  N.A                      192 KB  1.5 MB
    Intel   Integrated GPU  i7-3770U        350 MHz-1.15 GHz  14       64 KB (per half-slice)   N.A     N.A
    NVIDIA  Discrete GPU    Tesla M2050     575 MHz           8        8x48 KB                  16 KB   768 KB
    AMD     CPU             A10-5800K       1.4 GHz           4 (HT)   N.A                      16 KB   32 MB
    AMD     Integrated GPU  Radeon HD 7660  800 MHz           6        6x32 KB                  N.A     4 MB

Table 3 lists the hardware architectures used in our evaluation. We use a variety of CPU and GPU systems with differing memory hierarchies in order to demonstrate the benefit of our data layout transformation. The compiler used for the sequential versions of each application is GCC 4.4.6 (with the flags -g -O2). All OpenCL kernels were compiled with their default optimizations enabled. Intel GPU tests were run using the 2013 Release of the Intel OpenCL SDK [3]. Intel CPU tests were performed using the 2011 Release of the Intel OpenCL SDK, v1.5 [3]. NVIDIA GPU tests were performed using NVIDIA SDK v5.0 [5]. AMD CPU and GPU tests were performed using AMD APP SDK v2.8 [1].

4.2 CPU and GPU Performance

Figure 6 contains results for all the benchmarks. We compare relative execution time for the various data layouts on different CPU and GPU platforms. For a given architecture, we normalize every layout with respect to the fastest executing layout. In this case, smaller bars imply better performance. Every column is stacked in 2 levels. The bottom level represents the fraction of total execution time spent in the kernel. This information is retrieved from the OpenCL API. The top level represents the fraction of total execution time for the remaining execution. This includes communication and OpenCL initialization overheads. The top level is negligible for the Intel GPU. This is because the GPU is integrated on the same die as the CPU and there is no data copying overhead. The NVIDIA GPU and the AMD architectures show copying overheads. For all the workloads, the AMD CPU/GPU exhibit a large amount of overhead. On further inspection, we discovered that the majority of the overhead was due to a significant time difference between OpenCL kernel enqueue and kernel execution. This could be an implementation error in AMD's OpenCL implementation. We could not find any tools which profile OpenCL code to analyze the performance differences. We therefore make the following analysis based on code and machine characteristics.

For the N-Body benchmark, we see that the SOA and AOS versions perform similarly on the CPU. Since the number of fields is small, all the loads in an iteration fit into the cache and consecutive iterations do not incur any penalty. The array layout performs better on GPUs because the SOA layout helps in memory coalescing.

For the Seismic kernel, the SOA layout shows better performance on the AMD CPU, whereas the AOS layout is better on the Intel CPU. This can be attributed to the difference in cache associativity and sizes between AMD and Intel. On the GPU side, SOA performs well on all 3 GPUs due to coalescing.

The SRAD kernel shows improved performance for the SOAOS layout relative to the SOA and AOS layouts on most of the architectures. Surprisingly, even on the GPU the AOS and SOAOS layouts perform better than the SOA layout. This is contrary to GPU best practices. The memory access functions in the SRAD kernel are non-affine and irregular. It is difficult for a compiler or programmer to analyze and determine the right layout. Our framework enables rapid prototyping and testing of different layouts for performance on multiple architectures.

MRIQ exhibits little or no variation across layouts. MRIQ is a compute-bound kernel, so the data structure layout has little or no effect.

The medical image benchmark shows some interesting properties for different layouts. The AOS layout is better on the CPU whereas the SOA layout is better on the GPU. The medical image kernel is similar to a 3D Jacobi (stencil) computation. The computation is performed separately on three input buffers and the results are written into corresponding output buffers. Keeping the input buffers in a single struct is helpful for the CPU. This is because loading a point for one of the stencils automatically loads the points for the other 2 stencils (multiple points fit in a cache line). The array layout would have caused 3 loads for the same point, one in each of the three stencils. On the GPU side, the array layout is better, as expected.

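The cache-line argument for the medical imaging kernel can be checked with a little arithmetic. The sketch below counts how many distinct cache lines are touched when one point is read from three float input buffers, once with the buffers interleaved in a single struct (AOS) and once as three separate arrays (SOA). The 64-byte line size, the base addresses, the assumption that the three SOA arrays are laid out consecutively, and all function names are choices made for illustration; they are not measurements or code from the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache-line size */
#define NFIELDS 3u      /* three input buffers, as in the medical kernel */

/* Index of the cache line that a byte address falls into. */
static uint64_t line_of(uint64_t addr) { return addr / LINE_BYTES; }

/* Number of distinct values among n line indices (n is tiny, so O(n^2) is fine). */
static unsigned count_distinct(const uint64_t *lines, unsigned n) {
    unsigned distinct = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned seen = 0;
        for (unsigned j = 0; j < i; j++)
            if (lines[j] == lines[i]) seen = 1;
        if (!seen) distinct++;
    }
    return distinct;
}

/* AOS: point i stores its three floats back to back, starting at base. */
unsigned lines_touched_aos(uint64_t base, size_t i) {
    uint64_t lines[NFIELDS];
    for (unsigned f = 0; f < NFIELDS; f++)
        lines[f] = line_of(base + (i * NFIELDS + f) * sizeof(float));
    return count_distinct(lines, NFIELDS);
}

/* SOA: field f is its own array of n floats; the arrays follow each other from base. */
unsigned lines_touched_soa(uint64_t base, size_t n, size_t i) {
    uint64_t lines[NFIELDS];
    assert(i < n);
    for (unsigned f = 0; f < NFIELDS; f++)
        lines[f] = line_of(base + ((uint64_t)f * n + i) * sizeof(float));
    return count_distinct(lines, NFIELDS);
}
```

With a base address of 0, lines_touched_aos(0, 0) is 1 while lines_touched_soa(0, 1024, 0) is 3, which matches the "three loads for the same point" observation. An AOS point can still straddle two lines (point 5 does here), so the benefit depends on alignment. On a GPU the relevant quantity is instead whether neighboring work-items read neighboring addresses, which is where the SOA layout tends to win.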
Best practices generally dictate the use of array data layouts on GPUs due to improved coalescing of global memory accesses. However, our SRAD and MRIQ results contradict this knowledge. Our meta-data framework enables rapid prototyping and optimization of different data layouts, allowing tuning experts to rapidly discover optimal layouts for complex and irregular applications. For the CPU, the layout often depends upon the kernel features and memory access patterns. Our programming model can easily port such applications to different architectures.

5 Related Work

Recently, data layouts have been studied in the context of GPUs. DL [20] uses a mapping function and runtime library support to enable architecture-specific data layouts. DL does in-place data marshaling on the GPU. Like DL, Dymaxion [10] proposes a set of index mapping functions which are used to optimize memory mappings, with data marshaling done on the CPU side. Sung et al. [19] used techniques similar to DL to perform data layout transformations for structured grid applications. The compiler automatically changes the order of n-dimensional array references to maximize memory access coalescing. With the help of micro-benchmarks, low-latency strides and an optimal index map are discovered. This technique requires manual host code changes. The main disadvantage of the techniques listed in this paragraph is that the overhead of
