This paper studies the haplotype assembly problem from an
information-theoretic perspective. A haplotype is a sequence of nucleotide bases on a
chromosome, often conveniently represented by a binary string, that differs from
the bases in the corresponding positions on the other chromosome in a
homologous pair. Information about the order of bases in a genome is readily
inferred using short reads provided by high-throughput DNA sequencing
technologies. In this paper, the recovery of the target pair of haplotype
sequences using short reads is rephrased as a joint source-channel coding
problem. Two messages, representing haplotypes and chromosome memberships of
reads, are encoded and transmitted over a channel with erasures and errors,
where the channel model reflects salient features of high-throughput
sequencing. The focus of this paper is on the required number of reads for
reliable haplotype reconstruction, and both the necessary and sufficient
conditions are presented with order-wise optimal bounds.

Comment: 30 pages, 5 figures, 1 table, journa
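
The channel view described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's exact model: the two haplotypes are taken to be binary complements of each other, each read is drawn from one of the two chromosomes uniformly at random, and each position is independently erased (not covered) or flipped (sequencing error). The names `simulate_reads`, `cover_prob`, and `error_prob` are hypothetical and chosen for illustration only.

```python
import random

ERASED = None  # marker for a position that a read does not cover


def complement(h):
    """Return the other haplotype, assuming the pair is binary-complementary."""
    return [1 - b for b in h]


def simulate_reads(h, num_reads, cover_prob, error_prob, rng):
    """Pass a haplotype pair through a toy erasure-and-error channel.

    Assumptions (illustrative only): memberships are uniform over the two
    chromosomes, and erasures and errors are independent across positions.
    Returns (reads, memberships), where memberships[i] is 0 or 1.
    """
    pair = (h, complement(h))
    reads, memberships = [], []
    for _ in range(num_reads):
        c = rng.randrange(2)  # chromosome membership of this read
        read = []
        for bit in pair[c]:
            if rng.random() >= cover_prob:
                read.append(ERASED)    # position not covered: erasure
            elif rng.random() < error_prob:
                read.append(1 - bit)   # sequencing error: flipped bit
            else:
                read.append(bit)       # correctly observed bit
        reads.append(read)
        memberships.append(c)
    return reads, memberships


# Example usage with hypothetical parameter values.
example_rng = random.Random(0)
example_reads, example_memberships = simulate_reads(
    [1, 0, 1, 1, 0], num_reads=4, cover_prob=0.5, error_prob=0.05,
    rng=example_rng,
)
```

In this picture the decoder observes only `reads`. Both the haplotype `h` and the `memberships` are unknown, which is why the abstract speaks of two messages being transmitted.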