PREDICTION OF DOWNSTREAM INTERACTION OF TRANSCRIPTION FACTORS WITH MAPK3 IN ARABIDOPSIS THALIANA USING PROTEIN SEQUENCE INFORMATION

INTRODUCTION
Signalling through mitogen-activated protein kinases (MAPKs) is a fundamental and conserved process in eukaryotes that transduces external signals through protein phosphorylation [1][2]. Signal transduction is the means by which cells respond to extracellular information, and it proceeds through interactions among proteins. Knowledge of these interactions is essential for understanding the molecular mechanisms inside the cell [3]. In plants, signal transduction is mediated by a special class of kinases known as MAP kinases. The MAP kinase pathway is one of the main phosphorylation pathways that plants use in biotic and abiotic stress resistance. It is a cascade consisting of several classes of kinases, each having a different role in signal integration and divergence. The Arabidopsis genome contains 23 MAPKs, 10 MAPK2 and 60 MAPK3. These kinases are involved in a variety of functions, including growth, development, and responses to environmental and endogenous stimuli as well as to plant hormones. The role of these kinases in how plants respond to pathogen attack and other abiotic and biotic factors is currently poorly understood. This is primarily because many MAPK-substrate interactions are transient and unstable, which facilitates rapid signalling. It is therefore difficult to identify the downstream interacting partners or substrates of these kinases, which complicates the understanding of how pathogenesis develops in plants. Many experimental methods have been developed to study protein-protein interactions, including yeast two-hybrid systems, affinity purification followed by mass spectrometry, and phage display libraries, but each of these methods has its own limitations and suffers from a high false-positive rate [4]. These limitations highlight the need for in silico interaction prediction. Transcription factors regulate the expression of genes at the transcriptional level and act as transcriptional activators or repressors.
Transcription factors and nuclear proteins represent the largest group of MPK targets and are regulated by MAPK phosphorylation [5]. To date, a number of transcription factors, including TGA, Myb, bZIP, WRKY, Myb-related proteins, AP2/EREBP and homeobox proteins, have been shown to act downstream of, or as substrates of, various MAPKs [6][7][8][9][10][11][5]. WRKY factors are involved in several diverse pathways [12] and perform central functions in the regulation of developmental processes, senescence and defence responses against pathogens [13][14][15][16]. WRKY proteins contain at least one conserved DNA-binding region, called the WRKY domain, which is about 60 amino acids long and includes the highly conserved WRKYGQK sequence [16]. Several biochemical, molecular and genetic studies show that bZIP transcription factors play regulatory roles in diverse functions such as cell elongation [17], development and stress responses [18]. The bZIP transcription factor superfamily consists of 75 members in Arabidopsis [19]. NAC, one of the largest plant-specific transcription factor families, comprises more than 100 members [20][21], which have a characteristic conserved domain at the N-terminus named the NAC domain [22]. NAC proteins are known to possess diverse roles and bind specifically to the CATGTG motif both in vitro and in vivo. MYB factors represent a family of proteins that include the conserved MYB DNA-binding domain. Plants contain a MYB-protein subfamily that is characterised by the R2R3-type MYB domain. Remarkably, the myb-domain includes only one of the typical two or three tryptophan repeats found in other myb-like proteins [23]. AP2 and EREBPs are transcription factors that contain the AP2 DNA-binding domain.
This family of transcription factor genes constitutes a multigene family whose members play a variety of roles throughout the plant life cycle, from floral organ identity determination and control of leaf epidermal cell identity to forming part of the mechanisms used by plants to respond to various types of biotic and environmental stress [20]. Thus it is evident that the transcription factors discussed above, viz. Myb, bZIP, WRKY, Myb-related proteins, AP2/EREBP and NAC, play crucial roles in signal transduction, defence responses against pathogens, development and stress responses in the plant life cycle. The MAPK cascade is one of the major, evolutionarily conserved signalling pathways and plays a key role in the regulation of stress and developmental signals in plants. In view of the above, the present study predicts the interaction of MAPK3 with the NAC, Myb, bZIP, WRKY, Myb-related and AP2/EREBP transcription factor families. In this work we employed the Support Vector Machine (SVM) technique to build a supervised learning model that can subsequently be used to classify query sequences. Different types of features were employed as domain knowledge. In supervised learning, objects in a given collection are classified using a set of attributes, or features. The result of the classification process is a set of rules that prescribe assignments of objects to classes based solely on the values of features. SVMs are a useful technique for data classification. In the simplest setting, support vector machines consider a two-class, linearly separable classification problem. Many decision boundaries exist that are capable of separating all the training samples into two classes; among these, SVMs find the one that achieves the maximum margin between the two classes. The margin is defined as the distance between a planar decision surface that separates the two classes and the training samples closest to that surface.
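Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$, the SVM [24][25] requires the solution of the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

Here the training vectors $x_i$ are mapped into a higher dimensional space by the function $\phi$. The SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term [26].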
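The margin and the soft-margin objective described above can be illustrated with a minimal sketch. It assumes a linear SVM (the mapping phi is taken to be the identity), and the toy data, the two candidate hyperplanes and all function names are hypothetical and not part of the study. It only evaluates the margin and the objective for given hyperplanes; it does not solve the optimization problem.

```python
import math


def decision_value(w, b, x):
    """Return w.x + b for a linear SVM (phi assumed to be the identity map)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, X, y):
    """Distance from the hyperplane to the closest training sample."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(yi * decision_value(w, b, xi) / norm for xi, yi in zip(X, y))


def primal_objective(w, b, X, y, C):
    """Soft-margin objective: 0.5 * w.w + C * (sum of slack variables)."""
    slacks = [max(0.0, 1.0 - yi * decision_value(w, b, xi))
              for xi, yi in zip(X, y)]
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)


# Hypothetical, linearly separable toy data (not taken from the study).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Two hyperplanes (w, b) that both separate the toy data.
wide = ((0.5, 0.5), -1.0)    # the line x1 + x2 = 2
narrow = ((1.0, 0.0), -1.5)  # the line x1 = 1.5

for name, (w, b) in (("wide", wide), ("narrow", narrow)):
    print(name,
          round(geometric_margin(w, b, X, y), 4),
          round(primal_objective(w, b, X, y, C=1.0), 4))
```

Both hyperplanes classify every toy sample correctly, but the first leaves a larger margin and a lower objective value, which is the criterion by which an SVM prefers one separating boundary over another.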

MATERIALS AND METHODS
The approach used here is supervised learning, through which we predict whether MAPK3 proteins interact with other proteins, viz. Myb, bZIP, WRKY, Myb-related proteins, AP2/EREBP and NAC, for which interaction with MAPK3 is almost unknown.
Based on examples of interacting and non-interacting pairs, we train a binary classifier to predict the class (interacting or non-interacting) of a set of protein sequences. The classification model for predicting protein-protein interactions (PPIs) is based on SVM.

Collection of data set, Feature extraction and Model Construction
PPI can be described in terms of four interaction modes: electrostatic interaction, hydrophobic interaction, steric interaction and hydrogen bonding [27]. Here seven physicochemical feature groups of amino acids are selected to reflect these interaction modes and they are : 1) Amino acid, dipeptide composition,  [28] and was downloaded from TAIR (http://www.arabidopsis.org/). The negative data set is formed from artificial protein sequences generated by shuffling the sequences of the positive data set at k-let count one. It has been demonstrated that if the sequence of one member of an interacting pair is shuffled, then the two proteins can be deemed not to interact with each other. The shuffling is based on the k-let shuffling algorithm [29]. The positive and negative data sets must meet these requirements: (i) no non-interacting pair may appear in the set of interacting pairs, and (ii) the number of negative pairs is equal to the number of positive pairs. The testing sets are composed of protein sequences of the Myb, bZIP, WRKY, Myb-related, AP2/EREBP and NAC transcription factor families, containing 149, 72, 72, 49, 146 and 104 gene loci respectively, which were downloaded from DATF (http://datf.cbi.pku.edu.cn/). The model is generated by the web server "gist-classify" [30] (Gist, version 2.2, http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi), which is based on SVM. For model generation, the SVM is trained on a labelled training set, and the trained SVM is then used to predict the classification of an unlabelled test set, i.e. the Myb, bZIP, WRKY, Myb-related, AP2/EREBP and NAC transcription factor families.
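As an illustration of two of the steps described above, the following sketch computes amino acid and dipeptide composition for a sequence and builds a negative example by shuffling one member of a pair so that single-residue (1-let) counts are preserved. Several points are assumptions made only for this sketch: the sequences are invented placeholders, a plain random permutation stands in for the k-let shuffling program of [29] at k = 1, and a pair is encoded by concatenating the features of its two proteins, which the text does not specify. The remaining feature groups are not covered.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues (20 values)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 overlapping residue pairs (400 values)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


def shuffle_1let(seq, rng):
    """Randomly permute residues; single-residue counts are preserved."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def pair_features(seq_a, seq_b):
    """Encode a pair by concatenating per-protein features (an assumption)."""
    feats = []
    for s in (seq_a, seq_b):
        feats += amino_acid_composition(s) + dipeptide_composition(s)
    return feats


# Hypothetical placeholder sequences, not real Arabidopsis proteins.
rng = random.Random(0)
kinase = "MKTAYIAKQRQISFVKSHFSRQ"
partner = "MDSWLLRRKSSEKSPGETESTGSC"
decoy = shuffle_1let(partner, rng)

positive = (pair_features(kinase, partner), 1)   # interacting pair
negative = (pair_features(kinase, decoy), -1)    # shuffled, non-interacting
print(len(positive[0]), len(negative[0]))
```

Because shuffling at k = 1 leaves the amino acid composition unchanged, the positive and negative examples in this encoding differ only in their dipeptide terms.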

RESULTS
There is very little work in the literature on predicting the interaction of MAPK3 with other proteins in Arabidopsis thaliana. The MAPK cascade is one of the major, evolutionarily conserved signalling pathways and plays a crucial role in the regulation of stress and developmental signals in plants. In this work we employed an SVM-based predictive modelling approach to predict the interaction of MAPK3 with the Myb, bZIP, WRKY, Myb-related, AP2/EREBP and NAC transcription factor families. In the Myb transcription factor family, 101 gene loci are interacting " Fig. (7)" and 48 are non-interacting out of 149 gene loci " Table ( Fig. (1)".

DISCUSSION
Networks of PPIs provide a framework for understanding biological processes and can give insights into the mechanisms of diseases. Understanding biological mechanisms therefore requires knowledge of protein-protein interactions. MAPK is a conserved link between cell receptor and cell response, and this link is mediated through gene expression, which is regulated by transcription factors. This paper focuses on identifying the interacting partners of an important MAP kinase of Arabidopsis involved in the disease signalling process. The identification of such partners will further enhance our knowledge of the mechanisms of disease resistance and help in developing novel strategies for combating pathogens. To our knowledge, this is the first report that attempts to identify a large number of downstream interacting partners of MAPK3. In the present work, we predicted the downstream interaction of transcription factors with MAPK3 in Arabidopsis thaliana using only protein sequence information through SVM. The results of our study reveal the complexity of MAPK3 interaction with several variants of the same or different transcription factors triggered in response to diverse upstream stimuli. Interaction networks can aid in designing signal transduction pathways, help to find disease-suppressive agents, and uncover the key genes that are responsible for senescence and defence responses against pathogens. These predictions can be verified by wet-lab experimentation using phosphoproteomics, yeast two-hybrid systems and affinity immuno-chip-based systems. The results of the present study further suggest validating these predictions using physicochemical features.
International Journal of Bioinformatics Research, ISSN: 0975-3087, E-ISSN: 0975-9115, Vol. 3, Issue 1, 2011

The need for bioinformatics methods to find protein partners is being driven by the generation of sequences at a rate far beyond our ability to carry out experimental functional analysis. At present, predictions of protein interactions are based on docking studies, which are computationally intensive and therefore unlikely to scale to large numbers of interactions [3]. The main motivation behind this work is to perform a large-scale screening of pairs of proteins likely to interact, before validating the predictions by more expensive computational methods such as docking, or by experimental methods for pairs of particular interest. The specificity, precision and accuracy of the PPI predictions obtained in the present study can also be compared in order to select the best prediction model. The non-interacting partners predicted in the present study prompt us to look for alternative MAPK proteins that interact with these non-interacting gene loci of the various transcription factors.
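For reference, the specificity, precision and accuracy mentioned above can be computed from the counts of true and false positives and negatives, as in the following minimal sketch. The label vectors are hypothetical and serve only to exercise the formulas; they are not results of this study, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels in {1, -1}."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == -1:
            fp += 1
        elif p == -1 and t == -1:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def specificity(tp, fp, tn, fn):
    """Fraction of true non-interacting pairs predicted as non-interacting."""
    return tn / (tn + fp)


def precision(tp, fp, tn, fn):
    """Fraction of predicted interacting pairs that truly interact."""
    return tp / (tp + fp)


def accuracy(tp, fp, tn, fn):
    """Fraction of all pairs that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical labels (1 = interacting, -1 = non-interacting).
y_true = [1, 1, 1, 1, 1, -1, -1, -1]
y_pred = [1, 1, 1, 1, -1, -1, -1, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(specificity(*counts), precision(*counts), accuracy(*counts))
```

Computing all three values from the same held-out labelled pairs is one way to compare alternative models on an equal footing.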