The JPU Corpus: R 142 F

While writing my thesis on Computer Assisted Language Learning (CALL) in EFL teaching, I worked with a corpus built from the texts of the twelve units of Headway Advanced Student's Book (John & Liz Soars 1989). One result of that research was a set of interesting data drawn from the wordlists and frequency tables of the corpus. Having finished the thesis, I found it a worthwhile challenge to run the same analysis of wordlists and frequency tables with the text of the thesis itself as the corpus. Below I describe the results of this analysis and compare the two corpora.

To prepare the corpus, I deleted all the screen prints, tables, appendices and graphs from the thesis to leave only the running text, and saved it in text-only format as plain ASCII. I then used Longman Mini Concordancer to produce the wordlists and frequency tables. As the next screen shows, the thesis corpus (TC) consists of 9,257 occurrences (tokens) of 1,590 different words (types), giving a token/type ratio of 5.82. In addition, 17 types of punctuation mark occur 1,861 times.
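The counting step can be sketched in a few lines of Python. This is only an illustration, not a description of how Longman Mini Concordancer works internally: the tokenisation rule (a word is a run of letters, with optional internal apostrophes) and the function names `wordlist` and `token_type_ratio` are my own assumptions.

```python
import re
from collections import Counter

def wordlist(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumption: a word is a run of letters, optionally with internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)

def token_type_ratio(counts):
    """Occurrences (tokens) divided by different words (types)."""
    tokens = sum(counts.values())
    types = len(counts)
    return tokens / types

# Made-up sample text, only to show the calls.
sample = "The corpus is small. The corpus is plain text."
counts = wordlist(sample)
print(counts.most_common(3))
print(token_type_ratio(counts))

# The ratio reported for the thesis corpus follows from its two totals.
print(round(9257 / 1590, 2))
```

A different tokenisation rule (for example, one that treats hyphenated words as single items) would give slightly different totals, so figures from two tools are only comparable if the rule is the same.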

Though the two corpora differ in size (the Headway Advanced corpus (HAC) has 15,515 occurrences of 3,636 words), the token/type ratio (4.26 in the HAC) and the amount of punctuation tell us something about the character of the texts. A token/type ratio of 5.82 means that each different word occurs nearly six times on average in the thesis, while in the HAC each occurs only about four times, so the coursebook has the more varied vocabulary. The following tables show the different types of punctuation and their frequency values in both the TC and the HAC. The marked types are those that occur in only one of the corpora; given the different kinds of text, this is quite understandable (the signs @ * \ / are used in computer texts).

Type of punctuation and its frequency:

Assuming that a full stop ('.') usually marks the end of a sentence, we can work out that a sentence in the TC consists of about 10 words on average (the ratio is 10.2), while in the HAC it has about 15 words (the ratio is 14.6). This suggests that the sentences in Headway Advanced are longer, and probably more complex, than those in the thesis.
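The estimate above is simply the number of words divided by the number of full stops. A minimal Python sketch of that calculation follows; the function name `mean_sentence_length` and the sample text are made up for illustration, and the same simplifying assumption applies (every '.' ends a sentence, so abbreviations and decimal points would distort the result).

```python
import re

def mean_sentence_length(text):
    """Estimate words per sentence as word count divided by full-stop count."""
    # Assumption: a word is a run of letters, optionally with internal apostrophes.
    words = len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text))
    full_stops = text.count(".")
    if full_stops == 0:
        raise ValueError("text contains no full stop")
    return words / full_stops

# Made-up sample: 10 words and 3 full stops.
sample = "I read the book. It was long. We liked it."
print(round(mean_sentence_length(sample), 1))
```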

The next table gives the word frequency values and the number of words with each frequency. It shows that 757 of the 1,590 words (47.61%) occur only once in the thesis. This value is about the same as the average mentioned by other authors. Compared with the HAC value of 60.34%, it shows that words are repeated less often in an advanced-level coursebook than in a thesis concentrating on a quite narrow topic.
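Both the table (how many words have each frequency) and the share of words occurring once can be derived from a wordlist. The sketch below assumes the wordlist is held as a Python `Counter`; the function names `frequency_spectrum` and `once_only_share` are hypothetical, and the toy wordlist is invented.

```python
from collections import Counter

def frequency_spectrum(counts):
    """Map each frequency value to the number of words that have it."""
    return Counter(counts.values())

def once_only_share(counts):
    """Percentage of different words (types) that occur exactly once."""
    once = sum(1 for freq in counts.values() if freq == 1)
    return 100 * once / len(counts)

# Invented toy wordlist: two of the four words occur once.
toy = Counter({"a": 1, "b": 2, "c": 3, "d": 1})
print(frequency_spectrum(toy))
print(once_only_share(toy))

# The percentage reported for the thesis corpus, from its two counts.
print(round(100 * 757 / 1590, 2))
```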

Examining the wordlist with frequency values in more detail (see Appendix), we find that the first content word (that is, not a pronoun, article, auxiliary or preposition) is "word", twelfth in the list with a frequency of 89. If we add the frequency of the other form of the same word, "words" (66), the combined total of 155 moves it up to seventh place in the list. This is a very high position compared with the HAC, where the first content word, "said", was only 37th in the corresponding list. The reason for the difference is again natural: in a coursebook the topics are highly varied (extracts from literature, articles from different types of magazines, newspaper reports, autobiographies), so content words are repeated less often.
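Adding the counts of two forms of a word and re-ranking the list can be sketched as follows. The helper names `merge_forms` and `rank` are hypothetical, and the toy frequencies are invented (the full thesis wordlist is in the Appendix and is not reproduced here); only the sum 89 + 66 comes from the figures above.

```python
from collections import Counter

def merge_forms(counts, forms, lemma):
    """Return a copy of counts with all listed forms summed under one entry."""
    merged = Counter(counts)
    total = sum(merged.pop(form, 0) for form in forms)
    merged[lemma] = total
    return merged

def rank(counts, word):
    """1-based position of word when the list is sorted by falling frequency."""
    ordered = [w for w, _ in counts.most_common()]
    return ordered.index(word) + 1

# Invented toy frequencies, not the real thesis wordlist.
toy = Counter({"the": 10, "of": 6, "a": 5, "word": 4, "words": 3})
merged = merge_forms(toy, ["word", "words"], "word")
print(rank(toy, "word"), rank(merged, "word"))

# Combined frequency of "word" and "words" in the thesis corpus.
print(89 + 66)
```

The same helper could merge other pairs of forms, which matters for a fair comparison: if only one word is merged, its rank rises relative to words whose forms are still counted separately.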

We might propose that the most frequent content word in any text indicates its theme, that is, what the text is about. In the HAC, if we set aside the reporting verb "said", this is "life", in 48th place, and in the TC it is "word(s)", in 7th place. This seems plausible, as a good authentic coursebook is about life, while the thesis analysed here is about words. The difference in rank reflects how much narrower the topic of a thesis is than that of a coursebook.

The next graph shows the relative frequencies of the eleven most frequent words and demonstrates that the definite article "the" is more than twice as frequent as the next most common word in the TC, "of".

This result is basically the same as what I found in the HAC for the nine most frequent words. Compare the first nine words in each corpus. TC: the, of, a, in, to, and, is, as, can. HAC: the, of, a, and, to, in, he, was, it. There is no personal pronoun among the TC words, and the present tense is characteristic, while in the coursebook the past tense is used most regularly, in conventional narration with the personal pronoun "he".
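The comparison of the two top-nine lists can also be made mechanically. The short Python sketch below uses the two lists exactly as given above; the variable names are my own.

```python
# The nine most frequent words in each corpus, in rank order, as listed above.
tc_top = ["the", "of", "a", "in", "to", "and", "is", "as", "can"]
hac_top = ["the", "of", "a", "and", "to", "in", "he", "was", "it"]

tc_set, hac_set = set(tc_top), set(hac_top)

# List comprehensions keep the rank order of the source list.
shared = [w for w in tc_top if w in hac_set]
only_tc = [w for w in tc_top if w not in hac_set]
only_hac = [w for w in hac_top if w not in tc_set]

print(shared)    # ['the', 'of', 'a', 'in', 'to', 'and']
print(only_tc)   # ['is', 'as', 'can']
print(only_hac)  # ['he', 'was', 'it']
```

Six of the nine words are shared, and the words that differ are the ones that carry the contrast discussed above: present-tense "is" and "can" in the thesis, and "he" and "was" in the coursebook.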

I think even this short analysis is enough to show that wordlists and frequency tables offer many ways of getting information about texts, even unknown ones. The pieces of information obtained in this way are consistent with one another, and this makes them relevant and reliable.

Thursday, May 10, 2007
