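1. Write a WordCount program in the Eclipse environment that counts every word occurring at least k times, excluding stop words (such as a, an, of, in, on, the, this, that, ...). The final result must be output sorted by word frequency from high to low.
2. Run the program to process the Shakespeare collection text data.
3. You may create your own stop-word list file. It only needs to contain some of the stop words, not all of them. The parameter k is specified dynamically as an input argument (e.g. k=10).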
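The three requirements amount to a filter, a count, a threshold and a sort. As a rough sketch of that logic in plain Java, without any Hadoop classes: the names `WordCountSketch` and `countFrequentWords`, and the tokenization rule (lower-case the text, then split on anything that is not a letter or an apostrophe), are assumptions made for illustration and are not part of the assignment.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: illustrates the counting logic only, not a MapReduce job.
final class WordCountSketch {

    // Returns (word, count) pairs for non-stop-words seen at least k times,
    // sorted by count descending; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> countFrequentWords(
            String text, Set<String> stopWords, int k) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenization: lower-case, split on non-letter, non-apostrophe runs.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // skip empty tokens and stop words
            }
            counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= k) { // keep words that occur k times or more
                result.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        result.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopWords =
                new HashSet<>(Arrays.asList("a", "an", "of", "in", "on", "the", "this"));
        String text = "The text of the plays is available now. The plays and the poetry.";
        for (Map.Entry<String, Integer> e : countFrequentWords(text, stopWords, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a MapReduce version the filter would typically sit in the mapper, the count and threshold in the reducer, and the sort in a second job, but that split is a design choice rather than something the assignment prescribes.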
In run I set the parameter:
job1conf.set("k", args[2]);
In the Map I read it back with:
Integer.parseInt(context.getConfiguration().get("k"))
but the value comes back as null, and the job fails with
java.io.IOException: Spill failed
and
Caused by: java.lang.NumberFormatException: null
The program arguments are:
hdfs://localhost:9000/user/huang/Shakespeare_Text hdfs://localhost:9000/user/huang/Shakes ... _step1_new 1000
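One way to make this failure easier to diagnose is to validate the raw string before parsing it, so that a missing "k" produces an explicit message instead of `NumberFormatException: null`. The sketch below is plain Java with no Hadoop dependency. `ThresholdParser` and `parseThreshold` are hypothetical names, and in the mapper the raw value would be whatever `context.getConfiguration().get("k")` returns. A common cause of a null here, which cannot be confirmed without the attached source, is that `job1conf.set("k", ...)` runs after the `Job` has already been created from that `Configuration`: the job keeps its own copy, so later changes never reach the tasks.

```java
// Hypothetical helper for reading the threshold k defensively.
final class ThresholdParser {

    // Parses raw into an int >= 1. Throws IllegalArgumentException with an
    // explicit message when the value is missing, malformed, or too small.
    static int parseThreshold(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "Parameter \"k\" is missing from the job configuration");
        }
        int k;
        try {
            k = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("k is not an integer: " + raw, e);
        }
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1, got " + k);
        }
        return k;
    }

    public static void main(String[] args) {
        System.out.println(parseThreshold("1000"));
        try {
            parseThreshold(null); // simulates get("k") returning null
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Hadoop's `Configuration.getInt("k", defaultValue)` is an alternative when a silent fallback is acceptable, although failing loudly makes a misconfigured job easier to spot.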
Attachments:
Shakespeare.java [6.27 KiB]
Util.java [3.21 KiB]
Stop words: a, an, of, in, on, the, this
Shakespeare.txt:
The Complete Works of William Shakespeare
Welcome to the Web's first edition of
the Complete Works of William
Shakespeare. This site has offered
Shakespeare's plays and poetry to the
Internet community since 1993.
Announcement: The restoration of the site
following a disk failure has been delayed. The
text of the plays is available now. The poetry
and other services, including the search engine
and forums, will return shortly. (Nov. 13, 2000)
For other Shakespeare resources, visit the Mr.
William Shakespeare and the Internet Web site.
The original electronic source for this server is
the Complete Moby(tm) Shakespeare, which is
freely available online. The HTML versions of
the plays provided here are placed in the public
domain.