biojava-spark
Algorithms that are built around BioJava and run on Apache Spark
Starting up
Some initial instructions can be found in the mmtf-spark project:
https://github.com/sbl-sdsc/mmtf-spark
First, download and untar a Hadoop sequence file of the PDB (~7 GB download):
wget http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
Or you can get a reduced version containing only C-alpha atoms, phosphates, and ligands (~800 MB download):
wget http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
Second, add the biojava-spark dependency to your pom:
<dependency>
    <groupId>org.biojava</groupId>
    <artifactId>biojava-spark</artifactId>
    <version>0.2.1</version>
</dependency>
Extra BioJava examples
Do some simple quality filtering
float maxResolution = 3.0f;
float maxRfree = 0.3f;
StructureDataRDD structureData = new StructureDataRDD("/path/to/file")
        .filterResolution(maxResolution)
        .filterRfree(maxRfree);
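To make the intent of the two filters concrete, here is a minimal plain-Java sketch that uses neither Spark nor biojava-spark. It assumes an entry is kept when both its resolution and its R-free are at or below the given limits; whether the library treats these bounds as inclusive is an assumption. `StructureEntry`, `QualityFilterSketch`, `keepIds`, and the sample values are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one structure entry; not a biojava-spark class.
class StructureEntry {
    final String id;
    final float resolution;
    final float rFree;

    StructureEntry(String id, float resolution, float rFree) {
        this.id = id;
        this.resolution = resolution;
        this.rFree = rFree;
    }
}

class QualityFilterSketch {
    // Keep the ids of entries whose resolution and R-free are both within
    // the limits (inclusive bounds are an assumption of this sketch).
    static List<String> keepIds(List<StructureEntry> entries,
                                float maxResolution, float maxRfree) {
        List<String> kept = new ArrayList<>();
        for (StructureEntry e : entries) {
            if (e.resolution <= maxResolution && e.rFree <= maxRfree) {
                kept.add(e.id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up values, not real PDB entries.
        List<StructureEntry> entries = Arrays.asList(
                new StructureEntry("entryA", 1.8f, 0.22f),  // passes both filters
                new StructureEntry("entryB", 3.5f, 0.25f),  // resolution too poor
                new StructureEntry("entryC", 2.5f, 0.35f)); // R-free too high
        System.out.println(keepIds(entries, 3.0f, 0.3f));   // prints [entryA]
    }
}
```

In the real API the same two conditions are chained on the RDD, so they are evaluated across the cluster rather than in a local loop.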
Summarising the elements in the PDB
Map<String, Long> elementCountMap = BiojavaSparkUtils.findAtoms(structureData).countByElement();
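The result has the same shape as a per-key tally. As a rough local analogue, the following plain-Java sketch counts element symbols in an in-memory list with the standard streams API. It does not use biojava-spark; `ElementCountSketch` and the sample list are hypothetical, and the sorted `TreeMap` is chosen here only to make the printed order predictable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class ElementCountSketch {
    // Group identical element symbols and count the members of each group.
    static Map<String, Long> countByElement(List<String> elements) {
        return elements.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up element symbols, one per atom.
        List<String> elements = Arrays.asList("C", "N", "C", "O", "C", "N");
        System.out.println(countByElement(elements)); // prints {C=3, N=2, O=1}
    }
}
```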
Finding inter-atomic contacts from the PDB
// cutoff is not defined in this snippet; declare it before this call
Double mean = BiojavaSparkUtils.findContacts(structureData,
        new AtomSelectObject()
                .groupNameList(new String[] {"PRO","LYS"})
                .elementNameList(new String[] {"C"})
                .atomNameList(new String[] {"CA"}),
        cutoff)
        .getDistanceDistOfAtomInts("CA", "CA")
        .mean();
System.out.println("\nMean PRO-LYS CA-CA distance: " + mean);
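For readers unfamiliar with contact statistics, the plain-Java sketch below shows one way such a mean could be computed locally: take every pair of points from two sets, keep the pairs closer than a cutoff, and average their distances. This is an illustration only and not the biojava-spark implementation. Treating a contact as "distance strictly below the cutoff" is an assumption, and `ContactDistanceSketch`, its method names, and the coordinates are hypothetical.

```java
class ContactDistanceSketch {
    // Euclidean distance between two 3D points given as {x, y, z}.
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Mean distance over all pairs (one point from each set) that lie closer
    // than the cutoff; returns NaN when no pair qualifies.
    static double meanContactDistance(double[][] first, double[][] second, double cutoff) {
        double sum = 0.0;
        int count = 0;
        for (double[] a : first) {
            for (double[] b : second) {
                double d = distance(a, b);
                if (d < cutoff) {
                    sum += d;
                    count++;
                }
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        // Made-up coordinates standing in for two groups of CA atoms.
        double[][] first = {{0, 0, 0}};
        double[][] second = {{3, 4, 0}, {0, 0, 6}, {0, 0, 20}};
        // Distances are 5, 6 and 20; only the first two are below the cutoff.
        System.out.println(meanContactDistance(first, second, 10.0)); // prints 5.5
    }
}
```

The nested loop is quadratic in the number of points, which is acceptable for a small illustration but is one reason to distribute this kind of calculation over the whole PDB with Spark.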