Bingo Elastic¶
Overview¶
Bingo Elastic is a set of APIs, available in Java and Python, for efficient chemistry search in Elasticsearch-based engines. It provides methods for storing chemical information and searching through it. With this API you can create indices, read SDF and mol files, index them, and later run different types of search such as substructure, exact, and similarity.
Matchers¶
This SDK is intended to:
Read molecules from SDF, SMILES, Mol, CML files, etc.
Index them into Elasticsearch
Search molecules efficiently with different similarity metrics (Tanimoto, Tversky, Euclid)
Additionally filter results by text or number fields attached to the records
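To make the similarity metrics concrete, here is a minimal sketch of the textbook Tanimoto and Tversky definitions on fingerprints represented as sets of "on" bit positions. It is an illustration only: the fingerprints are made up, and this document does not state the exact formulas Bingo Elastic uses internally (the Euclid variant is left out for that reason).

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity: shared bits over all set bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(a: Set[int], b: Set[int], alpha: float, beta: float) -> float:
    """Tversky index: bits unique to each side get their own weight."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical fingerprints, written as sets of "on" bit positions.
query = {1, 2, 3}
candidate = {2, 3, 4}

print(tanimoto(query, candidate))           # 2 shared bits out of 4 distinct bits
print(tversky(query, candidate, 1.0, 1.0))  # alpha = beta = 1 reduces to Tanimoto
```

With alpha and beta both set to 1 the Tversky index equals Tanimoto; unequal weights make the score asymmetric between query and candidate.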
Scalability¶
Bingo Elastic relies on the scalability options of Elastic-based engines, such as adding nodes to the cluster and increasing the number of shards and replicas to cope with additional load.
Python API¶
Installation¶
Dependency¶
Install dependency using pip
pip install bingo-elastic
Install async version
pip install bingo-elastic[async]
The async version of bingo-elastic supports the same methods to index and search molecules as the sync version. To use it, instantiate AsyncElasticRepository instead of ElasticRepository.
Usage¶
Create ElasticRepository¶
Sync
repository = ElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
Async
repository = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
...
await repository.close()
Async version also supports async context manager to auto close connections:
async with AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200) as rep:
    ...
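The auto-close behaviour relies on Python's async context manager protocol. The sketch below shows that protocol with a hypothetical stand-in class, FakeAsyncRepository, which holds no real connection; it is not the library's implementation, only an illustration of why the connection is released when the block exits.

```python
import asyncio


class FakeAsyncRepository:
    """Hypothetical stand-in for a repository; it opens no real connection."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True

    async def __aenter__(self) -> "FakeAsyncRepository":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and also when the body raises.
        await self.close()


async def main() -> FakeAsyncRepository:
    async with FakeAsyncRepository() as rep:
        assert not rep.closed  # still open inside the block
    return rep


repo = asyncio.run(main())
print(repo.closed)
```

Because `__aexit__` also runs when the body raises, the context manager form is usually safer than calling close() by hand.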
Using with FastAPI:
from fastapi import FastAPI
from bingo_elastic.elastic import AsyncElasticRepository, IndexName
app = FastAPI()
rep = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
# This gets called once the app is shutting down.
@app.on_event("shutdown")
async def app_shutdown():
    await rep.close()
Other customisations such as SSL, a custom number of shards/replicas, the refresh interval, and many more are supported by ElasticRepository and AsyncElasticRepository.
Read Indigo records from file¶
IndigoRecord can be created from IndigoObject.
Full usage example:
from bingo_elastic.model.record import IndigoRecord
from indigo import Indigo
indigo = Indigo()
compound = indigo.loadMoleculeFromFile("composition.mol")
indigo_record = IndigoRecord(indigo_object=compound)
bingo_elastic provides helpers to load sdf, cml, smiles and smi files:
from bingo_elastic.model import helpers
sdf = helpers.iterate_sdf("compounds.sdf")
cml = helpers.iterate_cml("compounds.cml")
smi = helpers.iterate_smiles("compounds.smi")
The function helpers.iterate_file(file: Path) is also available. It selects the correct iterate function based on the file extension. The file argument must be a pathlib.Path instance.
from bingo_elastic.model import helpers
from pathlib import Path
sdf = helpers.iterate_file(Path("compounds.sdf"))
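Selecting a reader by file extension can be pictured as a simple lookup on the path suffix. The sketch below is an assumption about the general technique, not the library's code: READERS and pick_reader are hypothetical names, and the function returns a format label instead of actually reading the file.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to a format label.
READERS = {".sdf": "sdf", ".cml": "cml", ".smi": "smiles", ".smiles": "smiles"}


def pick_reader(file: Path) -> str:
    """Return the format label for a file, based on its suffix."""
    suffix = file.suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return READERS[suffix]


print(pick_reader(Path("compounds.sdf")))
print(pick_reader(Path("compounds.smi")))
```

Taking a pathlib.Path rather than a plain string makes the suffix lookup straightforward, which may be one reason the helper requires it.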
Index records into Elasticsearch¶
Full usage example sync:
from bingo_elastic.model import helpers
from bingo_elastic.elastic import ElasticRepository, IndexName
from pathlib import Path
repository = ElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
sdf = helpers.iterate_file(Path("compounds.sdf"))
repository.index_records(sdf)
Full usage example async:
Async indexing and search must run inside an event loop, for example one started with asyncio.run:
import asyncio
from bingo_elastic.model import helpers
from bingo_elastic.elastic import AsyncElasticRepository, IndexName
from pathlib import Path
async def index_compounds():
    repository = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
    sdf = helpers.iterate_file(Path("compounds.sdf"))
    await repository.index_records(sdf)
    await repository.close()  # release the connection when done
asyncio.run(index_compounds())
CAVEAT: Elasticsearch has no strict notion of a commit, so newly indexed records may become searchable only after the next index refresh. Read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html#index-refresh-interval-setting
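One way to cope with this delay in tests or scripts is to poll until the records become visible. The helper below is a generic sketch and not part of bingo-elastic: wait_until and records_visible are hypothetical names, and the check function here is a stand-in that succeeds on its third call instead of querying a real index.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "are my records searchable yet?": succeeds on the third call.
calls = {"n": 0}


def records_visible() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(records_visible, timeout=2.0, interval=0.01))
```

In real code the check would run a search or count against the index; the timeout keeps the loop from hanging if the records never appear.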
To index a single record, use the ElasticRepository.index_record method.
Retrieve similar records from Elasticsearch¶
Sync:
from bingo_elastic.predicates import SimilarityMatch
alg = SimilarityMatch(target, 0.9)
similar_records = repository.filter(similarity=alg, limit=20)
Async:
from bingo_elastic.predicates import SimilarityMatch
alg = SimilarityMatch(target, 0.9)
similar_records = await repository.filter(similarity=alg, limit=20)
In this case we requested the top 20 molecules most similar to target, based on the Tanimoto similarity metric.
Supported similarity algorithms:
SimilarityMatch or TanimotoSimilarityMatch
EuclidSimilarityMatch
TverskySimilarityMatch
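Conceptually, a similarity filter with a threshold and a limit scores every record, drops those below the threshold, and keeps the best few. The sketch below does this in plain Python on made-up fingerprints. It assumes the second argument of SimilarityMatch is a minimum similarity, and top_similar is a hypothetical helper, not a library function; the real search runs inside Elasticsearch.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity on fingerprints given as sets of bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_similar(
    target: Set[int], library: Dict[str, Set[int]], threshold: float, limit: int
) -> List[Tuple[str, float]]:
    """Score every entry, keep those at or above threshold, best first."""
    scored = [(name, tanimoto(target, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:limit]


# Hypothetical fingerprints keyed by record name.
library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3},
    "mol_c": {7, 8, 9},
}
target = {1, 2, 3, 4}

print(top_similar(target, library, threshold=0.7, limit=20))
```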
Find exact records from Elasticsearch¶
Sync:
exact_records = repository.filter(exact=target, limit=20)
Async:
exact_records = await repository.filter(exact=target, limit=20)
In this case we requested the top 20 candidate molecules with exactly the same fingerprint as target. target should be an instance of the IndigoRecord class.
Substructure match of the records from Elasticsearch¶
Sync:
submatch_records = repository.filter(substructure=target)
Async:
submatch_records = await repository.filter(substructure=target)
In this case we requested the top 10 candidate molecules that may contain target as a substructure, selected by fingerprint.
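Fingerprint matching returns candidates rather than confirmed hits, because different structures can share the same fingerprint features. The toy sketch below shows the usual two-stage idea with strings instead of molecules: a cheap fingerprint screen followed by an exact check. Everything here is an illustrative assumption (character pairs as fingerprint features, substring search as the "substructure" test); it is not how Indigo computes fingerprints.

```python
from typing import List, Set


def fingerprint(text: str) -> Set[str]:
    """Toy fingerprint: the set of adjacent character pairs in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def screen(query: str, library: List[str]) -> List[str]:
    """Stage 1: keep entries whose fingerprint contains every query feature."""
    query_fp = fingerprint(query)
    return [item for item in library if query_fp <= fingerprint(item)]


def verify(query: str, candidates: List[str]) -> List[str]:
    """Stage 2: run the exact, more expensive check on the survivors."""
    return [item for item in candidates if query in item]


library = ["CCOC", "COCC", "CCN"]
query = "CCO"

candidates = screen(query, library)   # may include false positives
confirmed = verify(query, candidates)

print(candidates)
print(confirmed)
```

Here "COCC" passes the screen because it contains both character pairs of the query, yet it does not contain the query itself, so only the second stage removes it.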
Custom fields for molecule records¶
The async API works exactly the same way; just don’t forget to await the calls.
Indexing records with custom fields
indigo_record = IndigoRecord(indigo_object=compound)
indigo_record.chembl_id = "CHEMBL2063090"
indigo_record.compound_key = "GRAZOPREVIR"
indigo_record.internal_id = 10001
Searching for molecules similar to the target and keeping only those whose chembl_id equals CHEMBL2063090:
from bingo_elastic.predicates import TanimotoSimilarityMatch
from bingo_elastic.queries import KeywordQuery
alg = TanimotoSimilarityMatch(target)
result = elastic_repository.filter(similarity=alg,
                                   chembl_id=KeywordQuery("CHEMBL2063090"))
Or you can just pass the value directly:
result = elastic_repository.filter(similarity=alg,
                                   chembl_id="CHEMBL2063090")
Wildcard and range queries can be used in the same way:
from bingo_elastic.queries import WildcardQuery
result = elastic_repository.filter(chembl_id=WildcardQuery("CHEMBL2063*"))
from bingo_elastic.queries import RangeQuery
result = elastic_repository.filter(internal_id=RangeQuery(1000, 100000))
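To clarify what the three query types select, the sketch below applies equivalent conditions to a small in-memory list of records. It is only a model of the semantics: keyword, wildcard, in_range and select are hypothetical helpers, the range bounds are assumed to be inclusive, and fnmatch patterns are a close but not identical stand-in for Elasticsearch wildcard syntax.

```python
from fnmatch import fnmatchcase
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Condition = Callable[[Record], bool]

# Made-up records carrying the custom fields used in the examples above.
records: List[Record] = [
    {"chembl_id": "CHEMBL2063090", "internal_id": 10001},
    {"chembl_id": "CHEMBL2063123", "internal_id": 500},
    {"chembl_id": "CHEMBL1000000", "internal_id": 99999},
]


def keyword(field: str, value: str) -> Condition:
    """Exact match on a field value."""
    return lambda rec: rec.get(field) == value


def wildcard(field: str, pattern: str) -> Condition:
    """Pattern match where * stands for any run of characters."""
    return lambda rec: fnmatchcase(str(rec.get(field, "")), pattern)


def in_range(field: str, lower: int, upper: int) -> Condition:
    """Numeric range match; bounds are assumed inclusive here."""
    return lambda rec: field in rec and lower <= rec[field] <= upper


def select(recs: List[Record], *conditions: Condition) -> List[Record]:
    """Keep the records that satisfy every condition."""
    return [rec for rec in recs if all(cond(rec) for cond in conditions)]


print(select(records, keyword("chembl_id", "CHEMBL2063090")))
print(select(records, wildcard("chembl_id", "CHEMBL2063*")))
print(select(records, in_range("internal_id", 1000, 100000)))
```

Passing several conditions to select mirrors passing several keyword arguments to filter: a record must satisfy all of them.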
Java API¶
Installation¶
Dependency¶
Add dependency to your Maven POM file like this:
<dependency>
<groupId>com.epam.indigo</groupId>
<artifactId>bingo-elastic</artifactId>
<version>VERSION</version>
</dependency>
Gradle:
implementation group: 'com.epam.indigo', name: 'bingo-elastic', version: 'VERSION'
The same coordinates work with other major dependency managers.
Usage¶
Create ElasticRepository¶
ElasticRepositoryBuilder<IndigoRecord> builder = new ElasticRepositoryBuilder<>();
repository = builder
        .withHostName("localhost")
        .withPort(9200)
        .withScheme("http")
        .build();
Other customisations like SSL, custom number of shards/replicas, refresh interval, and many more are supported
Read Indigo records from file¶
List<IndigoRecord> records = Helpers.loadFromCmlFile("/tmp/file.cml");
Index records into Elasticsearch¶
repository.indexRecords(records);
CAVEAT: Elasticsearch has no strict notion of a commit, so newly indexed records may become searchable only after the next index refresh.
Read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html#index-refresh-interval-setting
Retrieve similar records from Elasticsearch¶
List<IndigoRecord> similarRecords = repository.stream()
        .filter(new SimilarityMatch<>(target))
        .limit(20)
        .collect(Collectors.toList());
In this case we requested the top 20 molecules most similar to target, based on the Tanimoto similarity metric.
Find exact records from Elasticsearch¶
List<IndigoRecord> exactRecords = repository.stream()
        .filter(new ExactMatch<>(target))
        .limit(20)
        .collect(Collectors.toList())
        .stream()
        .filter(ExactMatch.exactMatchAfterChecker(target, indigo))
        .collect(Collectors.toList());
In this case we requested the top 20 candidate molecules with exactly the same fingerprint as target. After that we used ExactMatch.exactMatchAfterChecker, which double-checks the exact match against the actual molecule.
Substructure match of the records from Elasticsearch¶
List<IndigoRecord> substructureMatchRecords = repository.stream()
        .filter(new SubstructureMatch<>(target))
        .limit(20)
        .collect(Collectors.toList())
        .stream()
        .filter(SubstructureMatch.substructureMatchAfterChecker(target, indigo))
        .collect(Collectors.toList());
In this case we requested the top 20 candidate molecules that may contain target as a substructure, selected by fingerprint. After that we used SubstructureMatch.substructureMatchAfterChecker, which double-checks the substructure match against the actual molecule and its graph representation.
Custom fields for molecule records¶
Indexing records with custom text tag
List<IndigoRecord> indigoRecordList = Helpers.loadFromSdf("src/test/resources/rand_queries_small.sdf");
IndigoRecord indigoRecord = indigoRecordList.get(0);
indigoRecord.addCustomObject("tag", "test");
repository.indexRecord(indigoRecord);
Searching for molecules similar to the target and keeping only those whose tag equals test:
List<IndigoRecord> similarRecords = repository.stream()
        .filter(new TanimotoSimilarityMatch<>(target))
        .filter(new KeywordQuery<>("tag", "test"))
        .collect(Collectors.toList());
Wildcard and range queries can be used in the same way.