Databaseextensions,Database

转换成 1
extensionsforfunandprofit AndrewDalkeAndrewDalkeScientific,AB Gothenburg,Sweden Tuesday,December8,2009 SettheWaybackMachineto1997!
Tuesday,December8,2009 Relationaldatabases -stringsandnumbers-SQL-clientswritten inanylanguage Objectdatabases plexdatatypes-nostandardquery language-singleclientlanguage (Smalltalk) Object-relationaldatabases Tuesday,December8,2009 Object-Relationaldatabase Domain-specificdatabaseextensions SelectcidFromstructuresWheresmarts_match(smiles,"1O") CreateFunctiontanimoto(bit,bit)ReturnsRealAs'Selectnbits_set($1&$2)::real/ (nbits_set($1)+nbits_set($2)nbits_set($1&$2))::real;' LanguageSQL;SecondexamplefromTJO’Donnell Tuesday,December8,2009 Tuesday,December8,2009 UsesPostgreSQL TeachesSQLforqueries,constraintsandtriggers. ExamplesschemasforLIMSandPubChem substructurematches,fingerprints,3Dandreactions PostgreSQLextensionsusingC,pgsql,Perl,Python Clientexamples CHORDdatabase/ OracleDataCartridgeInformixDataBlade Tuesday,December8,2009 Andthewinneris......?
MySQL “Goodenough”Fast,free,well-understood,widelyused,scalable UsedbyYouTube,Flickr,Digg,Wikipediaandotherpopularsites Supportedbyeverywebapplicationframework. (On20April,OracleagreedtobuySunforUS$7.4billion.On16Jan2008,SunacquiredMySQLforUS$1billion.) Tuesday,December8,2009 ClientScenario -Internalwebapplication-Severaldatasets,eachO(10,000)pounds-Searchbasedonsimpleproperties(MW,source)and chemistry(similarity,substructure)-Otherin-housetoolsbasedonOEChem(aromaticity)-In-housesimilarityfingerprints,availableasalibrary -CorporatedatabaseusingOracle,controlledbyIS-OkaytouseMySQL;existingin-houseexperience Iasked:nootherRDBMSwaseptable-Didnotwanttopayannualsupportfees Tuesday,December8,2009 MySQL “Goodenough”Supportsuser-definedfunctionsandaggregates. Doesnotsupportuser-defineddatatypes. SelectcidFromstructuresWheresmarts_match(smiles,"1O")OrderBytanimoto(fp,similarity_fp("1O"))Limit25 Tuesday,December8,2009 valid_smiles(smiles)Return1ifitwasvalid,otherwise0 Tuesday,December8,2009 mysql_1/3 #include#includeusingnamespaceOEChem; //OEChemrequiresC++butMySQLcanonlycallCfunctions//TelltheC++pilertogenerateaCfunctioninterface extern"C"{//MySQL'ssecuritymodelrequiresthatthefunctionand//atleastoneof"*_init"or"*_deinit"mustbedefined.my_boolvalid_smiles_init(UDF_INIT*initid,UDF_ARGS*args,char*message);longlongvalid_smiles(UDF_INIT*initid,UDF_ARGS*args,char*is_null,char*error);voidvalid_smiles_deinit(UDF_INIT*initid); } Differentprototypesifthefunctionreturnsastringoradouble. Tuesday,December8,2009 mysql_2/3 //Thisiscalledonce,atthestartofatablesearchmy_boolvalid_smiles_init(UDF_INIT*initid,UDF_ARGS*args, char*message){if(args->arg_count!
=1||args->arg_type[0]!
=STRING_RESULT){ strcpy(message,"valid_smilestakesasingleSMILESstring");return1;}return0;}//Thisiscalledonce,attheendofatablesearchvoidvalid_smiles_deinit(UDF_INIT*initid){//Idon'tneedtodoanything.} Tuesday,December8,2009 mysql_3/3 longlongvalid_smiles(UDF_INIT*initid,UDF_ARGS*args,char*is_null,char*error){ OEGraphMolmol; //GettheSMILESstringanditslength.constchar*smiles=args->args[0];intsmiles_len=args->lengths[0]; if(smiles==NULL){//ThisrowcontainsaSQLNULL.It'snotvalid.return0; } //TheMySQLstringmightnotbeNUL-terminated.//OEParseSmilestakeseithera'char*'ora'string&'//butnota(char*,intlength).Converttoastring.std::stringsmiles_s(smiles,smiles_len); //ParsethestringandreturnitsresultreturnOEParseSmiles(mol,smiles_s);} Tuesday,December8,2009 Howpile g++-Wall-O3-I/usr/local/openeye/toolkits/include\`/usr/local/mysql/bin/mysql_config--cflags`\-
I.-c-omysql_oechem.omysql_ Linuxsharedlibrary: g++-shared-omysql_oechem.somysql_oechem.o\-L/usr/local/openeye/toolkits/lib\-loeiupac-loechem-loesystem-loeplatform-lz Macsharedlibrary: g++-bundle-bundle_loader/usr/local/mysql/bin/mysqld\-omysql_oechem.somysql_oechem.o\-L/usr/local/openeye/toolkits/lib\-loeiupac-loechem-loesystem-loeplatform-lz Theresultisthesharedlibrary“mysql_oechem.so” Tuesday,December8,2009 Howtoinstall InMySQL4:intheserver’sLD_LIBRARY_PATHInMySQL5:intheserver’splug-indirectory %mysql etotheMySQLmonitor.Commandsendwith;or\g. ... mysql>SHOWVARIABLESLIKE'plugin_dir'; +---------------+-------------------------------------------------+ |Variable_name|Value | +---------------+-------------------------------------------------+ |plugin_dir|/usr/local/mysql-5.1.30-osx10.4-i686/lib/plugin| +---------------+-------------------------------------------------+ 1rowinset(0.90sec) mysql> cpmysql_oechem.so/usr/local/mysql-5.1.30-osx10.4-i686/lib/plugin Tuesday,December8,2009 mysqldneedsOEChemsettings MakesurethatitsLD_LIBRARY_PATHincludestheOEChemsharedlibrariesandthatOE_DIRpointsto theOEChemdirectorywiththelicensefile. Tuesday,December8,2009 TellMySQLtoloadthenewfunctions mysql>Selectvalid_smiles("C");ERROR1305(42000):FUNCTIONvalid_smilesdoesnotexist mysql>CreateFunctionvalid_smilesReturnsINTSONAME"mysql_oechem.so";QueryOK,0rowsaffected(0.03sec) mysql>Selectvalid_smiles("C"); +-------------------+ |valid_smiles("C")| +-------------------+ | 1| +-------------------+ 1rowinset(0.01sec) CanbeINT,STRINGorFLOAT mysql>Selectvalid_smiles("Q"); +-------------------+ |valid_smiles("Q")| +-------------------+ | 0| +-------------------+ 1rowinset(0.00sec) Tuesday,December8,2009 Whichfunctionsareloaded?
Availablefromthespecial“mysql.func”table. mysql>Select*Frommysql.func; +--------------+-----+-----------------+----------+ |name |ret|dl |type | +--------------+-----+-----------------+----------+ |valid_smiles|2|mysql_oechem.so|function| +--------------+-----+-----------------+----------+ 1rowinset(0.00sec) Tuesday,December8,2009 CreateanewdatabaseusingtheMySQLshell mysql>CreateDatabasepubchem;mysql>UseDatabasepubchem;mysql>CreateTableCompounds( ->cidINTEGERPRIMARYKEY,->mwFLOAT,->smilesTEXT);QueryOK,0rowsaffected(0.76sec) Tuesday,December8,2009 importMySQLdbfromopeneye.oechemimport* PopulatethedatabaseusingaPythonclient db=MySQLdb.connect(user="skroob",passwd="12345",db="pubchem")cur=db.cursor() ifs=oemolistream()ifnotifs.open("/Users/dalke/databases/Compound_24200001_24225000.sdf.gz"): raiseIOError("Cannotopenfile") formolinifs.GetOEGraphMols(): cid=OEGetSDData(mol,"PUBCHEM_COMPOUND_CID")mw=OEGetSDData(mol,"PUBCHEM_EXACT_MASS") 24,pounds ifnotcidornotmw:#anymissingvalues?
continue 19stoload OEAssignAromaticFlags(mol)smiles=OECreateCanSmiString(mol) cur.execute(""" (14ssending1000records/insert) INSERTINTOCompounds(cid,mw,smiles) VALUES(%s,%s,%s)""",(cid,mw,smiles)) cur.execute("SELECTcount(*)FROMCompounds")n=cur.fetchone()[0] print"Databasetablecontains",n,pounds" Tuesday,December8,2009 SearchingforinvalidSMILES mysql>Selectcid,smilesFromCompoundsWherevalid_smiles(smiles)=0;Emptyset(1.08sec) mysql>InsertIntoCompounds(cid,mw,smiles)Values(12345,0.0,"Q");QueryOK,1rowaffected(0.00sec) mysql>Selectcid,smilesFromCompoundsWherevalid_smiles(smiles)=0; +-------+--------+ |cid|smiles| +-------+--------+ |12345|
Q | +-------+--------+ 1rowinset(0.97sec) mysql> Bestsearchtimeafter5testswas0.94seconds Tuesday,December8,2009 Tuesday,December8,2009 NeedforSpeed!
Parsing27,000/secisgreat. Iwantitfaster. EachrowtestcreatesanewOEGraphMol. WhataboutreusingthesameOEGraphMol,calling Clear()eachtime?
Storearbitrarydataininitid->ptr my_boolvalid_smiles_init(UDF_INIT*initid,UDF_ARGS*args,char*message){ if(args->arg_count!
=1||args->arg_type[0]!
=STRING_RESULT){strcpy(message,"valid_smilestakesasingleSMILESstring");return1; }//Createamoleculeoncefortheentirequeryinitid->ptr=reinterpret_cast(newOEGraphMol);return0;} voidvalid_smiles_deinit(UDF_INIT*initid){deletereinterpret_cast(initid->ptr); } Tuesday,December8,2009 ReusetheOEGraphMolfrominitid->ptr longlongvalid_smiles(UDF_INIT*initid,UDF_ARGS*args,char*is_null,char*error){ OEGraphMol*mol=reinterpret_cast(initid->ptr); constchar*smiles=args->args[0];intsmiles_len=args->lengths[0];if(smiles==NULL){ return0;}std::stringsmiles_s(smiles,smiles_len); //Remember,eachSMILESaddstotheexistingmoleculemol->Clear();returnOEParseSmiles(*mol,smiles_s);} Searchtimegoesto0.87s,or29,000/second;7%faster. (Beforegoingfurther:DeleteFromStructuresWherecid=12345;) Tuesday,December8,2009 SMARTSsearches Selectcount(*)fromCompoundsWheresmarts_match(smiles,"1O"); #include#includeusingnamespaceOEChem;extern"C"{ my_boolsmarts_match_init(UDF_INIT*initid,UDF_ARGS*args,char*message); longlongsmarts_match(UDF_INIT*initid,UDF_ARGS*args,char*is_null,char*error); voidsmarts_match_deinit(UDF_INIT*initid);} Tuesday,December8,2009 my_boolsmarts_match_init(UDF_INIT*initid,UDF_ARGS*args,char*message){ if(args->arg_count!
=2){strcpy(message,"smarts_matchtaketwoarguments(smiles,smarts)");return1; }if(args->arg_type[0]!
=STRING_RESULT){ strcpy(message,"smarts_match:SMILESparametermustbeastring");return1;}if(args->arg_type[1]!
=STRING_RESULT){strcpy(message,"smarts_match:SMARTSparametermustbeastring");return1;}return0;} voidsmarts_match_deinit(UDF_INIT*initid){} Tuesday,December8,2009 longlongsmarts_match(UDF_INIT*initid,UDF_ARGS*args,char*is_null,char*error){ //GettheSMILESandSMARTSstringsconstchar*smiles=args->args[0];intsmiles_len=args->lengths[0];constchar*smarts=args->args[1];intsmarts_len=args->lengths[1]; //CheckforNULLsandconverttostringsif(smiles==NULL||smarts==NULL){ return0;}std::stringsmiles_s(smiles,smiles_len);std::stringsmarts_s(smarts,smarts_len); OEGraphMolmol;OESubSearchpat; //Ifthestringisn'tavalidSMILESor//thenregarditasanerror.if(!
OEParseSmiles(mol,smiles_s)|| !
pat.Init(smarts_s.c_str(),true)){*error=1;return0;} SMARTS returnpat.SingleMatch(mol)!
=0;} Tuesday,December8,2009 Doesitwork?
mysql>selectsmarts_match("C#N", +----------------------------+ |smarts_match("C#N","*#*")| +----------------------------+ | 1| +----------------------------+ 1rowinset(0.00sec) "*#*"); mysql>selectsmarts_match("C#N", +----------------------------+ |smarts_match("C#N","O#*")| +----------------------------+ | 0| +----------------------------+ 1rowinset(0.00sec) "O#*"); mysql>selectcount(*)from+----------+|count(*)|+----------+|10007|+----------+1rowinset(2.90sec) Compoundswheresmarts_match(smiles,"1O"); 8,600SMARTStests/second Tuesday,December8,2009 Ifeeltheneed—theneedforspeed!
-ReusetheOEGraphMol-TheSMARTSpatternisconstant;pileitonce?
The*_initfunctionknowsaboutallparameterswhichareconstantvalues my_boolsmarts_match_init(UDF_INIT*initid,UDF_ARGS*args,char*message){ ...verifythatthereare2inputsandbotharestrings...constchar*smarts=args->args[1];if(smarts){ ...thisisconstantforallrows...}else{ ...theSMARTSmaybedifferentforeachrow...}...return0;} Tuesday,December8,2009 structSmartsMatchState{OEGraphMolmol;intpattern_is_constant;OESubSearchpattern; }; my_boolsmarts_match_init(UDF_INIT*initid,UDF_ARGS*args,char*message){ ...verifythatthetwoinputfieldsarestrings...SmartsMatchState*state=newSmartsMatchState; constchar*smarts=args->args[1];if(smarts){ //usethesameSMARTSforeverymatchintsmarts_len=args->lengths[1];std::stringsmarts_s(smarts,smarts_len);if(!
state->pattern.Init(smarts_s.c_str(),true)){ deletestate;strcpy(message,"CouldnotparseSMARTS");return1;}state->pattern_is_constant=1;}else{state->pattern_is_constant=0;}initid->ptr=reinterpret_cast(state);return0;} voidsmarts_match_deinit(UDF_INIT*initid){deletereinterpret_cast(initid->ptr); } Tuesday,December8,2009 longlongsmarts_match(UDF_INIT*initid,UDF_ARGS*args,char*is_null,char*error){ SmartsMatchState*state=reinterpret_cast(initid->ptr); //GettheSMILESstringsconstchar*smiles=args->args[0];intsmiles_len=args->lengths[0];std::stringsmiles_s(smiles,smiles_len);state->mol.Clear();if(!
OEParseSmiles(state->mol,smiles_s)){ *error=1;return0;} if(!
state->pattern_is_constant){//CompilethenewSMARTSstringconstchar*smarts=args->args[1];intsmarts_len=args->lengths[1];std::stringsmarts_s(smarts,smarts_len);if(!
state->pattern.Init(smarts_s.c_str(),true)){*error=1;return0;} }returnstate->pattern.SingleMatch(state->mol)!
=0;} Tuesday,December8,2009 Isitfaster?
mysql>Selectcount(*)fromCompounds ->Wheresmarts_match(smiles,"1O"); +----------+ |count(*)| +----------+ |10007| +----------+1rowinset(1.11sec) (was2.90sec) 22,500SMARTStests/sec(was8,600) About2.6xfaster Tuesday,December8,2009 GoingtoPlaid Tuesday,December8,2009 Nexttrickupourcollectivesleeves... FingerprintScreens Myclientwantedsubstructuretests,notSMARTSmatching. (SMILES→canonicalSMILEStreatedasSMARTSisgoodenough) -Computeandstoresubstructurefingerprintsforpoundinthedatabase. -Compoundthequerysubstructurefingerprint-Screenouttargetswhere target_fpdoesnotcontainquery_fp-DotheSMARTStestontheremainingtargets Screeningseemstoremoveabout85%ofthetargets.6-7xfaster Tuesday,December8,2009 Whichfingerprints?
Substructurekeys MACCSSMARTSfromTJO’DonnellandfromRDKitCACTVSdefinitionsfromPubChem Hashkeys ImplementationinRDKitandOpenBabel Iimplementedasubsetofthe881CACTVSkeys. Tuesday,December8,2009 Assumea‘substructure_fp(smiles)’function Dependsonhow AlterTableCompoundsAddfpText; youimplement UpdateCompoundsSetfp=substructure_fp(smiles); fingerprints Selectcount(*)fromCompoundsWherefp_contains(fp,substructure_fp("1O"))Andsmarts_match(smiles,"1O"); substructure_fp(fixed_smarts)mon.The*_initfunctionmustcheckforitandcachetheresult. Tuesday,December8,2009 Fingerprintimplementations ☞PostgreSQLhasabuilt-inbitsetdatatype(SeeTJ’sbook) CreateFunctiontanimoto(bit,bit)ReturnsRealAs'Selectnbits_set($1&$2)::real/ (nbits_set($1)+nbits_set($2)nbits_set($1&$2))::real;' LanguageSQL; Tuesday,December8,2009 Fingerprintimplementations PostgreSQLhasabuilt-inbitsetdatatype☞Usemultipleintegerfields CreateTableCompounds(cidINTEGERPRIMARYKEY,mwFLOAT,smilesTEXT,fp1INTEGER,fp2INTEGER,fp3INTEGER,fp4INTEGER,...fp28INTEGER); SelectcidfromCompoundsWhere(qfp1&fp1=qfp1andqfp2&fp2=qfp2andqfp3&fp3=qfp3and...qfp28&fp28=qfp28); Need2832-bitintegerstostore881bits Tuesday,December8,2009 Fingerprintimplementations PostgreSQLhasabuilt-inbitsetdatatypeUsemultipleintegerfields☞Storethefingerprintsasrawbytes Textfieldsstorecharacters,notbytes!
Unicodeencodingproblems. Coulduseablob,butPythonlibrariesessthemthroughfile-likeinterface. Tuesday,December8,2009 Fingerprintimplementations PostgreSQLhasabuilt-inbitsetdatatypeUsemultipleintegerfieldsStorethefingerprintsasrawbytes☞Encodethefingerprintascharacters Iusedahexencoding:881bits→222hexchars00000000000...00020009CDE00Requirestwiceasmuchmemory Tuesday,December8,2009 substructure_fp,fp_contains,andtanimotoworkinhexspace mysql>Selectcount(*)fromCompounds->Wherefp_contains(fp,substructure_fp("1O")); +----------+|count(*)|+----------+|21136|+----------+1rowinset(0.20sec) mysql>Selectcount(*)fromCompounds ->Wherefp_contains(fp,substructure_fp("(1O")); +----------+ |count(*)| +----------+ | 3636| +----------+ 1rowinset(0.19sec) 130,000testspersecond Tuesday,December8,2009 Estimatedaverageperformance (Arethereanygoodbenchmarkdatasets?
) pounds(1.11sfullsearch)*20%+(0.19sscreening) ≅60,pounds/sec≅6minutestosearchPubChem(assumingyouhad~7GBofmemory) Tuesday,December8,2009 Availability MyclientownsthecodeIwroteforthem.Ireimplementedtheplugin,tobereleased“soon.” Itwillbeannouncedonmywritingsblogat/writings/ I’llincludeasectiononusingrelationaldatabasesatmytrainingcourse,nextweekinLeipzig:/training/ Tuesday,December8,2009 Conclusions Databasesmodities;whatyoudowithdataisnot.GetTJ’sbooktolearnhowtousedatabasesinchemistry Modernmachines+OpenEye=fastIfonlyOEParseSmilesandOESubSearch.Init took:(char*s,ints_len)...AhashfingerprintforOEMolswouldbesouseful. WhypayalotofmoneywhenDIY(orDIWithAConsultant)issomuchfun?
Tuesday,December8,2009 SQLite In-memoryrelationaldatabase.ComeswithrecentPythons. UsedinApple’sMail.appandotherapplications.Supportsuser-definedextensions. Tuesday,December8,2009

标签: #怎么看 #换行 #邮箱 #cdr #平局 #雷暴 #csgo #枪法