ANALYSIS AND PREDICTION OF MAJOR BLOOD PROTEINS BASED ON THEIR AMINO ACID AND DIPEPTIDE COMPOSITION

Abstract
A method has been developed for predicting blood proteins using an SVM-based machine learning approach. In this prediction method, a two-step strategy was deployed to predict blood proteins and their subclasses. We developed models of blood proteins and achieved maximum accuracies of 90.57% and 91.39%, with Matthews correlation coefficients (MCC) of 0.89 and 0.90, using single amino acid and dipeptide composition respectively. Furthermore, the method is able to predict the major subclasses of blood proteins: models based on amino acid composition (AC) and dipeptide composition (DC) reached maximum accuracies of 90.38%, 92.83%, 87.41%, 92.52% and 85.27%, 89.07%, 94.82%, 86.31% for albumin, globulin, fibrinogen, and regulatory proteins respectively. All modules were trained, tested, and evaluated using the five-fold cross-validation technique.

Introduction
Determining protein function is one of the most challenging problems in the genomic era. Enormous numbers of protein sequences are available in databases as raw sequence data, and various computational approaches are being used to turn these data into meaningful biological information. Support vector machine (SVM) based prediction systems are fully automatic and reliable, and have been used in many applications, including subcellular localization, protein secondary structure prediction, and microarray data analysis of proteins [1][2][3][4]. However, no direct method is currently available to predict blood proteins, so analysis and prediction studies of blood proteins are important for researchers.
Blood proteins found in blood plasma are also called serum proteins. The major blood proteins are albumin, globulin, fibrinogen, and regulatory proteins [5,6]. Albumins make up sixty percent of plasma proteins [7]; they are major contributors to the osmotic pressure of plasma and assist in the transport of lipids and steroid hormones. Globulins make up thirty-five percent of plasma proteins; they are used in the transport of ions, hormones, and lipids and assist in immune function [8]. Fibrinogen makes up four percent and is essential for blood clotting, during which it is converted into insoluble fibrin [9]. Regulatory proteins, such as enzymes, proenzymes, and hormones, make up less than one percent of plasma proteins. The main functions of blood proteins are transporting lipids, hormones, vitamins, and metal molecules. These proteins therefore play an important role in the regulation of cellular activity and in many functions of the immune system. Because of their important functions in blood, a classification and prediction system has been developed to facilitate a better understanding of their roles.
The advantage of classifying proteins with a machine learning approach rather than with experimental techniques is apparent. Currently, no classification of blood proteins based on amino acid composition (AC) and dipeptide composition (DC) is available. The support vector machine (SVM) is a promising kernel-based machine learning method for building effective models that predict the class labels of unknown protein data. Therefore, in this study, we developed an integrative SVM-based predictor with a two-step approach that first predicts blood proteins and then classifies them into different classes. The method presented is highly specific and sensitive in predicting blood proteins.

Results
To develop a prediction platform for blood proteins, an amino acid composition dataset of all blood proteins was prepared and evaluated using the five-fold cross-validation technique. The SVM module was developed using blood protein and non-blood protein training sets. The dataset was divided randomly into five equal sets; four sets were used for training and the remaining set was used for testing [10,11]. This process was repeated five times so that each protein was tested once [12]. We labeled all blood proteins as positive examples and used non-blood proteins as negative examples. The results demonstrated that the method can differentiate blood proteins from non-blood proteins with an accuracy of 90.57% and an MCC of 0.89 at the default cutoff score of 0. The best result was obtained using an optimal RBF kernel with parameters g = 3 and C = 375. The dipeptide composition method was also tested and achieved 91.39% accuracy with an MCC of 0.90. On average, the amino acid and dipeptide compositions of blood proteins were significantly different from those of non-blood proteins [Table-1].
We applied the same method to classify the major blood proteins. Here we took one class of blood proteins as positive examples and all other classes as negative examples, and repeated this for every class. We prepared a model for each blood protein class based on its amino acid as well as dipeptide composition, with separately optimized SVM kernel parameters. The results indicate that each class of blood proteins can be discriminated from the other classes based on amino acid and dipeptide composition [13][14][15][16]. With amino acid composition we achieved the maximum accuracy, as shown in the [

Table 2-Performance of the SVM modules for blood protein classification (albumin, globulin, fibrinogen, and regulatory) developed using two types of composition: amino acids (AC) and dipeptides (DC).
Dipeptide-based prediction improved accuracy significantly over single amino acid prediction, which suggests that prediction accuracy can be increased by using a wider range of information about a protein. Sensitivity and specificity were also calculated for blood proteins and their subclasses, as shown in [

Analysis of Amino Acids
Determining the relative amino acid composition of a protein gives a characteristic profile for that protein, and this profile provides enough information to identify the major blood proteins. The composition of each amino acid was computed as the number of occurrences of that amino acid divided by the total number of amino acids in the protein. The average amino acid composition of blood proteins was calculated and is shown in [Fig-1] and [Fig-2] alongside that of non-blood proteins. The analysis shows that Cys, Pro, Ser, and Tyr are more abundant in blood proteins than in non-blood proteins. Among the subclasses of blood proteins, Leu content is high in all classes, while Cys, Asp, Gly, and Asn are more abundant in fibrinogen than in the other classes. Overall, regulatory proteins and globulins have similar percentages of all residues.

Discussion
Blood proteins serve many different functions, such as acting as circulatory transport molecules (for lipids, hormones, vitamins, and metals), serving as protease inhibitors, and regulating cellular activities, including those of the immune system. Given their importance, we developed an SVM-based method for predicting these proteins. Here, we have described a method based on amino acid and dipeptide composition that helps differentiate the various blood proteins. The method is useful for determining whether a newly discovered protein sequence belongs to the blood proteins and for identifying its subfamily. It is highly accurate and separates the classes well. Finally, our results demonstrate that combining amino acid and dipeptide composition with SVM is a successful approach for predicting these proteins.

Methods

Dataset
The final dataset of blood proteins, including subfamilies, consists of 717 sequences (albumin 91, globulin 33, fibrinogen 564, and regulatory 29). As the negative set, 899 sequences belonging to the protease family were selected randomly. These protein sequences were obtained from UniProt and the ExPASy server. Entries annotated as "fragments", "isoforms", "potentials", "similarity", or "probables" in the comment field were removed to avoid bias in the classifier. We used a 90% cutoff to generate a non-redundant dataset of both blood and non-blood sequences.

Support Vector Machine (SVM)
In the present study, SVM_light, a freely downloadable SVM package, was used to classify the major blood protein sequences. This software lets users define a number of parameters and choose a built-in kernel, such as a radial basis function (RBF) or a polynomial kernel of a given degree [17]. In this study, all kernel parameters were kept constant except for the regularization parameter C. Experiments were conducted with several kernel types, including polynomial and radial basis function kernels. SVMs require a fixed number of inputs for training, which necessitates a strategy for encapsulating the global information about proteins of variable length in a fixed-length format. This fixed-length format was obtained from the variable-length protein sequences using amino acid and dipeptide composition. SVM has been successfully applied to numerous classification and pattern recognition problems, such as classification of microarray data, protein secondary structure prediction, and subcellular localization [3,4,18].
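As a rough illustration of the fixed-length encoding described above, the sketch below writes one feature vector as a line in the sparse text format read by SVM_light (`<target> <index>:<value> ...`, with feature indices starting at 1) and evaluates an RBF kernel of the form exp(-g * ||x - y||^2). The function names `to_svmlight_line` and `rbf_kernel` are hypothetical and are not taken from the original study, and the six-decimal formatting is an arbitrary choice.

```python
import math


def to_svmlight_line(label, features):
    """Format one example as an SVM_light input line.

    label: +1 for a positive example, -1 for a negative example.
    features: fixed-length list of floats (for example 20 amino acid
    fractions or 400 dipeptide fractions).
    Zero-valued features are skipped, which the sparse format allows.
    """
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):  # indices start at 1
        if value != 0:
            parts.append(f"{index}:{value:.6f}")
    return " ".join(parts)


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    # Toy three-feature vector, used only to show the line format.
    print(to_svmlight_line(1, [0.5, 0.0, 0.25]))
    print(rbf_kernel([0.5, 0.0], [0.25, 0.25], gamma=3.0))
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, at a rate set by g.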

Amino Acid Composition
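Amino acid composition is the fraction of each amino acid in a protein. The fraction of each of the 20 natural amino acids was calculated using [Eq-1] [19,20]:
\[ \text{Fraction of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in the protein}} \qquad (1) \]
where i can be any amino acid.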
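The calculation can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `amino_acid_composition` is hypothetical, and the decision to ignore non-standard residue codes (such as X) in both the counts and the total is an assumption.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids, in a fixed order so that
# every protein maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each amino acid."""
    seq = sequence.upper()
    # Assumption: residues outside the 20 standard codes are ignored.
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy sequence used only for illustration.
    vector = amino_acid_composition("MKWVTFISLL")
    for aa, fraction in zip(AMINO_ACIDS, vector):
        if fraction > 0:
            print(aa, round(fraction, 2))
```

Whatever the protein length, the result is a 20-dimensional vector whose entries sum to 1.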

Dipeptide Composition
Dipeptide composition is used to encapsulate the global information about each protein sequence and gives a fixed pattern length of 400 (20 x 20). The fraction of each dipeptide was calculated using [Eq-2] [19]:
\[ \text{Fraction of dep}(i+1) = \frac{\text{total number of dep}(i+1)}{\text{total number of dipeptides in the protein}} \qquad (2) \]
where dep(i+1) is one of the 400 dipeptides.
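A minimal sketch of this encoding follows. It assumes that overlapping pairs of adjacent residues are counted and that the denominator is the number of counted dipeptides in the sequence; the function name `dipeptide_composition` and the handling of non-standard residues are hypothetical choices, not details given in the text.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dipeptide: k for k, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-element list with the fraction of each dipeptide."""
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    # Slide a window of width 2 along the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        k = DIPEPTIDE_INDEX.get(seq[i:i + 2])
        if k is not None:  # assumption: pairs with non-standard residues are skipped
            counts[k] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [count / total for count in counts]


if __name__ == "__main__":
    vector = dipeptide_composition("ACAC")  # dipeptides: AC, CA, AC
    print(len(vector), vector[DIPEPTIDE_INDEX["AC"]], vector[DIPEPTIDE_INDEX["CA"]])
```

Unlike single amino acid composition, this vector retains information about the local order of neighbouring residues.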

Evaluation of Performance
In this study, the 5-fold cross-validation technique was adopted, in which the dataset is partitioned randomly into five equal subsets. Training and testing were carried out five times, each time using one subset for testing and the remaining four subsets for training. The performance of each classifier was measured in terms of accuracy (ACC), sensitivity (SN), specificity (SP), and Matthews correlation coefficient (MCC) using the standard [Eq-3] to [Eq-6] [20]:
\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3) \]
\[ \mathrm{SN} = \frac{TP}{TP + FN} \qquad (4) \]
\[ \mathrm{SP} = \frac{TN}{TN + FP} \qquad (5) \]
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6) \]
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives respectively. TP and TN are the numbers of correctly classified blood proteins and non-blood proteins respectively. FP and FN are the numbers of proteins incorrectly classified as blood proteins and as non-blood proteins respectively.
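The evaluation procedure can be sketched as follows, under stated assumptions: the helper names `k_fold_splits` and `evaluate` are hypothetical, the random seed is arbitrary, and the metrics are the standard definitions of ACC, SN, SP, and MCC computed from confusion-matrix counts. Both classes are assumed to be present in the test data, otherwise SN or SP is undefined.

```python
import math
import random


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds near-equal subsets.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def evaluate(tp, tn, fp, fn):
    """Return ACC, SN, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)  # fraction of positives that were recovered
    sp = tn / (tn + fp)  # fraction of negatives that were recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


if __name__ == "__main__":
    # 717 blood + 899 non-blood sequences, as in the Dataset section.
    for train, test in k_fold_splits(717 + 899):
        print(len(train), len(test))
    # Toy counts used only to show the metric calculation.
    print(evaluate(tp=40, tn=45, fp=5, fn=10))
```

Because the folds are disjoint and together cover the whole dataset, every sequence is used for testing exactly once across the five rounds.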

Prediction System
The prediction of blood and blood-related proteins is a multi-class classification problem. To handle this multi-class situation, we designed a series of binary SVMs. For N-class classification, N SVMs were constructed. The ith SVM is trained with all samples of the ith subfamily labeled as positive and the samples of all other subfamilies labeled as negative. SVMs trained in this way are referred to as 1-v-r (one-versus-rest) SVMs. In this classification approach, each unknown protein receives four scores, one from each SVM. An unknown protein is classified into the subfamily that corresponds to the 1-v-r SVM with the highest output score.
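The one-versus-rest scheme described above can be sketched as below. This is an illustration only: the function names `one_vs_rest_labels` and `assign_subfamily` are hypothetical, and the scores in the usage example are made-up numbers standing in for the decision values that four trained binary SVMs would return.

```python
SUBFAMILIES = ("albumin", "globulin", "fibrinogen", "regulatory")


def one_vs_rest_labels(sample_classes, positive_class):
    """Label samples of one subfamily as +1 and all other samples as -1."""
    return [1 if name == positive_class else -1 for name in sample_classes]


def assign_subfamily(scores):
    """Pick the subfamily whose binary SVM gave the highest output score.

    scores: dict mapping each subfamily name to the decision value of the
    corresponding 1-v-r SVM for one unknown protein.
    """
    missing = set(SUBFAMILIES) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return max(SUBFAMILIES, key=lambda name: scores[name])


if __name__ == "__main__":
    classes = ["albumin", "fibrinogen", "globulin", "fibrinogen"]
    print(one_vs_rest_labels(classes, "fibrinogen"))
    # Hypothetical decision values for one unknown protein.
    example_scores = {"albumin": -0.8, "globulin": -0.2,
                      "fibrinogen": 1.3, "regulatory": -1.1}
    print(assign_subfamily(example_scores))
```

Taking the maximum score always assigns a subfamily, even when all four scores are negative, so in the two-step strategy this step is meaningful only for sequences already predicted to be blood proteins.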