SUPPORT VECTOR MACHINE FOR CLASSIFICATION OF HIV, PLANT AND ANIMAL miRNAs

MicroRNAs (miRNAs) constitute a large family of non-coding RNAs that regulate gene expression. The wet-lab experiments usually used to classify plant and animal miRNAs are expensive, labor intensive and time consuming. There is therefore a need for a computational approach to this classification, one that is fast and economical compared to wet-lab techniques. Here a machine learning approach is used to classify the miRNAs of HIV, plants and animals. The SVM learning tool Weka LibSVM was used for the classification of plant, animal and HIV miRNAs. The model was tested on the available data and gave 95% accuracy.


INTRODUCTION
MicroRNAs (miRNAs) are small RNAs of 21-25 nucleotides that specifically regulate cellular gene expression at the post-transcriptional level. miRNAs are derived from imperfect stem-loop structures of ~70 nucleotides, which are matured by cellular RNases III. Evidence for hundreds of miRNAs and their corresponding targets has been reported in the literature for plants, insects, invertebrate animals, and mammals. The miRNA-encoding potential of the human immunodeficiency virus (HIV) has also been studied [1]. Using computer-directed analyses, it was found that HIV putatively encodes five candidate pre-miRNAs [Fig 1]. Fig 1 lists the folded pre-miRNAs and their corresponding predicted mature viral miRNAs (red); nucleotide positions in the pNL4-3 genome (where 1 is the initiation of transcription) are presented in its right column. The mature miRNA sequences deduced from these five pre-miRNAs were then matched against a database of 3' untranslated region (UTR) sequences from the human genome. These searches revealed a large number of cellular transcripts that could potentially be targeted by these viral miRNA (vmiRNA) sequences [1].
Correct identification of the miRNAs that regulate cellular processes and impact economically important traits is needed by industry. This requires a better understanding of the characteristics of miRNAs, which can be gained by understanding the differences between the miRNAs of different organisms [2], [8]. Such classification has applications in forensic science, where the organism to which an miRNA belongs can be identified, and as the classification is extended to incorporate all organisms in the miRBase registry, more specific analysis becomes possible. The classified miRNAs can be related to the sequence, structure and function of the genes lying nearby, and the upstream and downstream genomic regions can be identified from the miRNA classification and signature.
Various attempts have been made to discover novel miRNAs in species of plants, animals and viruses using both in-vivo and in-silico techniques, and to elucidate their role in regulatory processes. From a literature survey, however, it appears that no attempt has been made to develop computational approaches for the classification of plant, animal [15] and HIV miRNAs. There is thus a need for newer algorithms that are robust, fast and economical, given the financial and time constraints of existing lab techniques. For the classification to be successful, each class must show some distinct properties or characteristics. There are many similarities between the plant and animal miRNA systems: both play a fundamental role in development and appear to exert their influence predominantly by controlling regulatory genes. Many dissimilarities with HIV exist, however; these are listed in Table 1 and are used in the classifier.

MODEL AND METHOD
SVMs are a set of related supervised learning methods used for classification and regression. Viewing the input data as two sets of vectors in an n-dimensional space, an SVM constructs a separating hyperplane in that space, one that maximizes the margin between the two data sets. To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets [11,12]. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since in general the larger the margin, the lower the generalization error of the classifier [8], [9], [10]. A support vector machine (SVM) is a useful technique for data classification. A classification task usually involves training and testing data, each consisting of data instances. Each instance in the training set contains one target value (class label) and several attributes (features). The goal of an SVM is to produce a model that predicts the target values of the instances in the testing set, given only their attributes [4].
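The margin idea can be illustrated with a small sketch. It assumes a hyperplane written as w·x + b = 0 and class labels +1 and -1; the function names, the two candidate hyperplanes and the toy points are illustrative choices, not taken from the paper. The geometric margin of a hyperplane is the smallest distance from it to any correctly classified point, and an SVM prefers the hyperplane for which this value is largest.

```python
import math

def decision_value(w, b, x):
    """Return w.x + b for one point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points.

    A positive result means the hyperplane separates the two classes.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(points, labels))

# Toy 2-D data, made up for illustration only.
points = [(1.0, 0.0), (3.0, 1.0), (-2.0, 1.0), (-1.0, -1.0)]
labels = [+1, +1, -1, -1]

# Two candidate separating hyperplanes (vertical lines in this toy case).
margin_a = geometric_margin((1.0, 0.0), 0.0, points, labels)   # line x1 = 0
margin_b = geometric_margin((1.0, 0.0), -0.5, points, labels)  # line x1 = 0.5

# Both lines separate the classes, but the first leaves the larger margin.
print(margin_a, margin_b)
```

Both candidates classify the toy points correctly; the first is the better choice under the maximum-margin criterion because its nearest point is farther away.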

A. Software
The miRNA target registry software is used to extract the properties of the animal and plant miRNAs [13]. Weka and LibSVM are two efficient software tools for building SVM classifiers, each with its own strengths and weaknesses. Weka has a GUI and produces many useful statistics (e.g. confusion matrix, precision, recall, F-measure, and ROC scores) [6]. LibSVM runs much faster than Weka SMO and supports several SVM methods (e.g. one-class SVM, nu-SVM, and R-SVM). Weka LibSVM (WLSVM) combines the merits of the two tools and can be viewed as an implementation of LibSVM running under the Weka environment [2].

B. Evaluation
There are two parameters when using RBF kernels: C and γ (radial basis function: exp(-γ*|u-v|^2)). It is not known beforehand which C and γ are best for a given problem; consequently some kind of model selection (parameter search) must be done. The goal is to identify good values so that the classifier can accurately predict unknown data (i.e., testing data). To this end, we perform a grid search on C and γ using 10-fold cross-validation. Figure 2 shows the result of the grid search on one of our training sets to find the best C and γ for the RBF kernel function. The values C = 2 and γ = 0.0078125 obtained from the grid search were used in Weka LibSVM, which gave 95% accuracy. In the two-class case with classes yes and no, a single prediction has four possible outcomes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). True positives and true negatives are correct classifications. The overall success rate is the number of correct classifications divided by the total number of classifications.
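The kernel and the parameter search described above can be sketched as follows. The kernel function follows the formula in the text. The search ranges are an assumption, since the paper does not state them: the sketch uses powers of two in steps of two, a commonly recommended starting grid for LibSVM. The reported values C = 2 (2^1) and γ = 0.0078125 (2^-7) both lie on such a grid. The helper names are hypothetical, and the sketch only enumerates candidate pairs and fold indices; training and scoring a model on each fold is left out.

```python
import math
from itertools import product

def rbf_kernel(u, v, gamma):
    """RBF kernel exp(-gamma * |u - v|^2) for two equal-length vectors."""
    squared_distance = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * squared_distance)

def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a power-of-two grid."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in product(c_exponents, gamma_exponents)]

def k_fold_indices(n_instances, k=10):
    """Split indices 0..n-1 into k disjoint test folds of near-equal size."""
    return [list(range(i, n_instances, k)) for i in range(k)]

# Assumed search ranges (not stated in the paper).
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))

# Ten test folds for a data set of 100 instances; each instance is held
# out exactly once, and the other nine folds form the training data.
folds = k_fold_indices(100, k=10)

# The pair reported in the text is one of the grid candidates.
print((2.0, 0.0078125) in grid)
```

In a full search, each candidate pair would be scored by its mean accuracy over the ten folds, and the best-scoring pair would be used to train the final model.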
Finally, the error rate is one minus the success rate. Because the available HIV miRNA data are not sufficient for the software, they are excluded and only plant and animal miRNAs are classified. In a multiclass prediction, the result on a test set is often displayed as a two-dimensional confusion matrix with a row and a column for each class. In each matrix element, the row is the actual class and the column is the predicted class. In this case the test set has 100 instances, of which 81 + 14 = 95 are predicted correctly, so the success rate is 95%.
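The success-rate calculation can be written out as a short sketch. The diagonal counts (81 and 14) and the total of 100 instances come from the text. How the 5 errors are split between the two off-diagonal cells is not given, so the split used below is an assumption made only to complete the matrix; it does not affect the success rate.

```python
def success_rate(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows = actual class, columns = predicted class.
# Diagonal (81, 14) and total (100) are from the text; the off-diagonal
# split of the 5 errors (3 and 2) is assumed for illustration only.
confusion = [
    [81, 3],
    [2, 14],
]

rate = success_rate(confusion)
error_rate = 1 - rate
print(rate, error_rate)
```

The same function works unchanged for more than two classes, since it only sums the diagonal and the whole matrix.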

C. Training set
For the classification of animal, plant and HIV miRNAs we select the dissimilarities between animal and plant miRNAs from Table 1. Because the HIV miRNA data set is very small, the software considers only the plant and animal miRNA data. The first characteristic selected is cluster members, which can take the values common/uncommon: clusters are generally found in animals, whereas very few plant species contain them. The number of mismatches is the second feature used in our classifier, with different numeric values for animals and plants: it is less than or equal to 3 in plants and 4 or more in animals. The number of target genes is a further feature included in the classifier [7]. Animals generally show a large number of targets belonging to different families, whereas plants have fewer targets, generally belonging to one family. The size of the fold-back loop is greater than 100 nucleotides for plants, with variations up to 303 nucleotides, and less than 100 nucleotides for animals [13]. We trained the WEKA classifier with these characteristics; the values obtained are given in Table 2.
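One way to hand such features to Weka is an ARFF file, Weka's native input format. The sketch below is a minimal illustration under stated assumptions: the attribute names, the `MirnaInstance` type and the two example rows are hypothetical and only respect the value ranges described above; they are not the paper's actual data or file layout.

```python
from dataclasses import dataclass

@dataclass
class MirnaInstance:
    cluster_members: str   # "common" or "uncommon"
    mismatches: int        # number of mismatches
    target_genes: int      # number of target genes
    foldback_size: int     # size of the fold-back loop in nucleotides
    label: str             # "plant" or "animal"

def to_arff(instances, relation="mirna"):
    """Render instances as ARFF text (attribute names are hypothetical)."""
    lines = [
        f"@relation {relation}",
        "@attribute cluster_members {common,uncommon}",
        "@attribute mismatches numeric",
        "@attribute target_genes numeric",
        "@attribute foldback_size numeric",
        "@attribute class {plant,animal}",
        "@data",
    ]
    for inst in instances:
        lines.append(
            f"{inst.cluster_members},{inst.mismatches},{inst.target_genes},"
            f"{inst.foldback_size},{inst.label}"
        )
    return "\n".join(lines) + "\n"

# Two made-up rows that merely respect the ranges described in the text.
examples = [
    MirnaInstance("uncommon", 2, 1, 150, "plant"),
    MirnaInstance("common", 5, 12, 80, "animal"),
]
arff_text = to_arff(examples)
print(arff_text)
```

The nominal attribute declares its allowed values in braces and the numeric features are declared as `numeric`, so a classifier loaded in Weka sees the class as the last attribute.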

RESULTS AND DISCUSSION
A set of attributes was collected from the miRNA registry [13], and the corresponding attribute values were fed to the classifier for each transcript of all plant and animal miRNA genes. Not all attributes, however, are fit for use in a classifier. Some attributes are clearly not independent and do not provide any additional advantage when evaluated together [5]; for example, complementarity is a feature that depends on the number of mismatches with the target. The results show 95% correctly classified instances and only 5% incorrectly classified instances. The detailed characteristics of the classification are given in Table 2, which [...] improvement. However, as soon as sufficient information about additional characteristics becomes available in the literature, the authors intend to include it in the above classifier to improve its performance and accuracy.
Bioinfo Publications