In this dissertation, we propose a new rhetorical structure modeling approach as a critical step in understanding extractive summarization of spoken documents. Previous work has shown that explicit rhetorical structure markers, such as paragraph delimiters, titles and subtitles, sentence boundaries, fonts and styles, are essential in helping the reader understand text documents. However, such structural cues are absent in spoken documents. Summaries in which salient utterances are extracted from an entire spoken document and simply grouped together have been shown to be difficult to understand. This challenge also makes it difficult to compile gold standard reference summaries. On the other hand, it is evident from presentation slides and meeting minutes that humans, when summarizing the content of a talk or a meeting, use explicit rhetorical labels such as titles, subtitles, bullet points, and paragraph and page breaks. We suggest that a good extractive summary should be clearly structured like presentation slides or meeting minutes, with explicit structural markers. We show that this type of rhetorical structure is rendered by both acoustic and linguistic features in spoken documents. We present Principal Component Analysis (PCA) graphs derived from such features, showing clear self-clustering of speech utterances according to the underlying rhetorical state: for example, acoustic and linguistic feature vectors from the “introduction”, “methodology”, and “conclusion” of a conference presentation speech; the “bill”, “question and answer”, or “motion” of a parliamentary speech; or the business items of a meeting minute are grouped together. We then propose different machine learning methods to model these rhetorical states for a structure-based summarization approach.
Extracted salient utterances are grouped under the labels of each rhetorical state, in a hierarchical fashion, emulating presentation slides or meeting minutes, to yield summaries that are more easily understood by humans. We investigate how to automatically model the inherent rhetorical structure of presentation speech and parliamentary speech. We present one approach that uses Rhetorical State Hidden Markov Models (RSHMM) with segmental summarization, and another that uses Hidden Markov Support Vector Machine (HMSVM) classifiers, to generate structured extractive summaries from presentation speech. We further investigate how to combine rhetorical structure modeling with the summarization process into one step to improve summarization performance. Finally, we show an approach that uses a single Conditional Random Field (CRF) classifier to perform both rhetorical structure modeling and extractive summarization in one step, by chunking, parsing and extraction of salient utterances, to generate meeting minutes from parliamentary speech. One important challenge to reliable extractive summarization is the lack of agreement between humans on gold standard summaries. However, humans agree more on what the correct flow of a summary should be when given explicit structural labels. We show empirical results in which human annotators obtain higher inter-labeler agreement when drafting gold standard summaries if they are guided by these automatically extracted rhetorical labels. We further find that applying an active learning approach to train our summarizer to a given accuracy noticeably reduces human annotation effort. We also show that higher-quality extractive summaries are obtained under our proposed framework.
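As a minimal sketch of the structured output described above, the following Python fragment groups extracted utterances under rhetorical state labels. It assumes each utterance already carries a rhetorical state label and a salience score, which in practice would come from models such as the RSHMM, HMSVM, or CRF; here they are hard-coded, hypothetical values. The function names, the top-k selection rule, and the example data are illustrative assumptions and not the dissertation's implementation.

```python
from collections import OrderedDict
from typing import List, Tuple

# (rhetorical_state, salience_score, text); all values below are hypothetical.
Utterance = Tuple[str, float, str]


def structured_summary(utterances: List[Utterance], per_state: int = 1):
    """Group utterances by rhetorical state and keep the top-scoring ones.

    States appear in order of first occurrence; the kept utterances stay
    in their original spoken order within each state.
    """
    grouped = OrderedDict()
    for position, (state, score, text) in enumerate(utterances):
        grouped.setdefault(state, []).append((position, score, text))

    summary = OrderedDict()
    for state, items in grouped.items():
        # Pick the highest-scoring utterances, then restore spoken order.
        top = sorted(items, key=lambda item: item[1], reverse=True)[:per_state]
        summary[state] = [text for _, _, text in sorted(top)]
    return summary


def render(summary) -> str:
    """Render the summary like slide headings with bullet points."""
    lines = []
    for state, texts in summary.items():
        lines.append(state.upper())
        lines.extend(f"  - {text}" for text in texts)
    return "\n".join(lines)


# Hypothetical labelled and scored utterances from a presentation.
EXAMPLE_TALK: List[Utterance] = [
    ("introduction", 0.9, "We study summarization of lectures."),
    ("introduction", 0.2, "Thanks for coming."),
    ("methodology", 0.4, "We record the audio."),
    ("methodology", 0.8, "We train a sequence model."),
    ("conclusion", 0.7, "Structured summaries are easier to read."),
]

print(render(structured_summary(EXAMPLE_TALK)))
```

Keeping the selection step separate from the rendering step means the same grouped result could be rendered as slide-like headings or as meeting-minute items; that separation is a design choice of this sketch only.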