Description

Early detection of cancer facilitates treatment and improves patient survival. We hypothesized that molecular biomarkers of cancer could be rationally predicted from even partial knowledge of transcriptional regulation, functional pathways and gene co-expression networks. To test this data mining approach, we focused on breast cancer, one of the best-studied models of the disease. We were particularly interested in whether such a 'guilt by association' approach would lead to pan-cancer markers generally known in the field, or whether molecular subtype-specific 'seed' markers would yield subtype-specific extended sets of breast cancer markers. The key challenge was to use a small number of well-characterized, largely intracellular, breast cancer-related proteins to uncover similarly regulated and functionally related genes and proteins, with a view to predicting a much-expanded range of disease markers, especially extracellular molecular markers potentially suitable for early non-invasive detection of the disease.
We selected 23 previously characterized proteins specific to three major molecular subtypes of breast cancer and analyzed their established transcription factor networks, their known metabolic and functional pathways, and the existing experimentally derived protein co-expression data. Starting from largely intracellular and transmembrane marker 'seeds', we predicted as many as 150 novel biomarker genes associated with the three selected subtypes, all coding for extracellularly targeted or secreted proteins and therefore potentially the most suitable for molecular diagnosis of the disease. Of these 150 predicted protein markers, 114 were predicted to be linked through the combination of regulatory networks to basal breast cancer, 48 to luminal and 7 to Her2-positive breast cancer.
The reported approach is not limited to breast cancer and therefore offers a widely applicable strategy for biomarker mining.
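The 'guilt by association' expansion described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `expand_markers`, the `min_layers` threshold, the representation of each evidence source as a dictionary, and all gene names are hypothetical placeholders chosen for illustration. The sketch assumes that each evidence layer (transcription factor network, pathway membership, co-expression) maps a seed to the genes linked to it, and that a separate annotation lists which genes encode secreted or extracellular proteins.

```python
from collections import Counter


def expand_markers(seeds, layers, extracellular, min_layers=1):
    """Return candidate markers associated with the seeds of one subtype.

    seeds         -- set of seed gene names
    layers        -- list of dicts, each mapping a gene to the set of genes
                     linked to it in one evidence source
    extracellular -- set of genes annotated as secreted or extracellular
    min_layers    -- number of evidence layers that must support a candidate
                     (an assumption; the source does not state such a rule)
    """
    support = Counter()
    for layer in layers:
        linked = set()
        for seed in seeds:
            linked |= layer.get(seed, set())
        # Each layer contributes at most one vote per candidate; seeds are excluded.
        support.update(linked - seeds)
    return {
        gene
        for gene, votes in support.items()
        if votes >= min_layers and gene in extracellular
    }


# Toy, entirely hypothetical data: the names are placeholders, not real genes.
tf_network = {"SEED_A": {"GENE_1", "GENE_2"}, "SEED_B": {"GENE_3"}}
pathways = {"SEED_A": {"GENE_2", "GENE_4"}, "SEED_B": {"GENE_3", "SEED_A"}}
coexpression = {"SEED_A": {"GENE_2"}, "SEED_B": {"GENE_5"}}
extracellular = {"GENE_2", "GENE_3", "GENE_5"}

seeds_by_subtype = {"subtype_1": {"SEED_A"}, "subtype_2": {"SEED_B"}}
layers = [tf_network, pathways, coexpression]

candidates = {
    subtype: expand_markers(seeds, layers, extracellular)
    for subtype, seeds in seeds_by_subtype.items()
}
```

Raising `min_layers` keeps only candidates supported by several independent evidence sources, trading a smaller candidate list for stronger support. Because each subtype is expanded independently, the same candidate may appear under more than one subtype.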