In many real-world applications, data contain heterogeneous input modalities (e.g., web pages include images, text, etc.). Moreover, data such as images are usually described using different views (i.e., different sets of features). Learning a distance metric or similarity measure that draws on all input modalities or views is essential for many tasks, such as content-based retrieval. In these cases, similar and dissimilar pairs of data can be used to find a better representation of the data, one in which the similarity and dissimilarity constraints are better satisfied. In this paper, we incorporate supervision in the form of pairwise similarity and/or dissimilarity constraints into multi-modal deep networks to combine different modalities into a shared latent space. Using properties of multi-modal data, we design multi-modal deep networks and propose a pre-training algorithm for these networks. The proposed network is able to learn intra- and inter-modal high-order statistics from raw features, and we control its high flexibility via an efficient multi-stage pre-training phase that reflects the properties of multi-modal data. Experimental results show that the proposed method outperforms recent methods on image retrieval tasks.