[Mahout Part 3] Walking through the 20newsgroups example with Mahout's CBayes algorithm
Programming · posted by houtizong, 3 years ago
1. Download Mahout
http://mirror.bit.edu.cn/apache/mahout/0.10.0/mahout-distribution-0.10.0.tar.gz
2. Unpack Mahout
3. Configure environment variables
vim /etc/profile

export HADOOP_HOME=/home/hadoop/software/hadoop-2.5.2
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/home/hadoop/software/mahout-distribution-0.10.0
export PATH=$MAHOUT_HOME/bin:$PATH
4. Mahout relies on Hadoop to run its MapReduce jobs. Unlike the usual way of wiring in a Hadoop dependency, Mahout requires no changes to its own configuration files; exporting HADOOP_HOME in the environment is enough.
1. Create Sequence Files from the sample documents
2. Convert the Sequence Files into TF-IDF vector files
3. Split the TF-IDF vector file into a training vector file and a test vector file
4. Train the Naive Bayes model: the input is the training vector file, the output is the trained model file
5.1 Test the model against the training vector file
5.2 Test the model against the test vector file
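Step 3 above is performed by Mahout's split job (`mahout split --randomSelectionPct ...`), which routes each vector to the test set with a fixed probability. A minimal Python sketch of that random-holdout idea follows; the 20% figure is illustrative, not necessarily what the demo script uses:

```python
import random

def holdout_split(doc_ids, test_pct=20, seed=42):
    """Randomly route each document to the test set with probability
    test_pct/100, mirroring the idea behind `mahout split --randomSelectionPct`."""
    rng = random.Random(seed)
    train, test = [], []
    for doc_id in doc_ids:
        (test if rng.random() * 100 < test_pct else train).append(doc_id)
    return train, test

train, test = holdout_split(["doc%d" % i for i in range(1000)])
```

Because the selection is per-document rather than an exact quota, the test set size only approximates test_pct of the corpus.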
1. Run the script ./examples/bin/classify-20newsgroups.sh
1. The data file (20news-bydate.tar.gz) is downloaded into /tmp/mahout-work-hadoop and extracted locally into /tmp/mahout-work-hadoop/20news-all; the files in that directory are the raw text documents.
2. Upload the data to HDFS under /tmp/mahout-work-hadoop/20news-all with the following HDFS command:
/home/hadoop/software/hadoop-2.5.2/bin/hdfs dfs -put /tmp/mahout-work-hadoop/20news-all /tmp/mahout-work-hadoop/
At this point HDFS contains the path /tmp/mahout-work-hadoop/20news-all, which holds the original 20newsgroups files.
1. Create Sequence Files from the raw 20newsgroups data by running the following command:
./bin/mahout seqdirectory -i /tmp/mahout-work-hadoop/20news-all -o /tmp/mahout-work-hadoop/20news-seq -ow
As the command shows, both the input and output directories live on HDFS; the resulting Sequence Files are placed under /tmp/mahout-work-hadoop/20news-seq, as shown below:
[hadoop@hadoop ~]$ hdfs dfs -ls /tmp/mahout-work-hadoop/20news-seq
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-05-22 07:13 /tmp/mahout-work-hadoop/20news-seq/_SUCCESS
-rw-r--r--   1 hadoop supergroup   19202391 2015-05-22 07:13 /tmp/mahout-work-hadoop/20news-seq/part-m-00000
This listing shows that mahout seqdirectory ran a map-only job; the resulting part-m-00000 file is roughly 19 MB.
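seqdirectory packs every document into one SequenceFile of (key, value) Text pairs: the key is the file's path below the input root (which for 20newsgroups encodes the category label) and the value is the raw document body. Ignoring the on-disk SequenceFile encoding, the result looks conceptually like this hypothetical sketch:

```python
def to_sequence_pairs(docs):
    """Mimic what `mahout seqdirectory` stores: each document becomes a
    (Text key, Text value) pair whose key is the path below the input
    root (category/doc id) and whose value is the raw document body."""
    return {"/%s/%s" % (category, doc_id): body
            for category, doc_id, body in docs}

pairs = to_sequence_pairs([("rec.motorcycles", "104423", "I ride a CB750 ...")])
```

Keeping the category in the key is what later lets the trainer recover the class label of each document.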
1. Convert the Sequence Files into vectors by running the following command:
./bin/mahout seq2sparse -i /tmp/mahout-work-hadoop/20news-seq -o /tmp/mahout-work-hadoop/20news-vectors -lnorm -nv -wt tfidf
As the command shows, the generated vector files are stored under /tmp/mahout-work-hadoop/20news-vectors on HDFS.
2. The contents of the HDFS directory /tmp/mahout-work-hadoop/20news-vectors are:
drwxr-xr-x   - hadoop supergroup          0 2015-05-22 07:18 /tmp/mahout-work-hadoop/20news-vectors/df-count
-rw-r--r--   1 hadoop supergroup    1937084 2015-05-22 07:15 /tmp/mahout-work-hadoop/20news-vectors/dictionary.file-0
-rw-r--r--   1 hadoop supergroup    1890053 2015-05-22 07:18 /tmp/mahout-work-hadoop/20news-vectors/frequency.file-0
drwxr-xr-x   - hadoop supergroup          0 2015-05-22 07:20 /tmp/mahout-work-hadoop/20news-vectors/tf-vectors
drwxr-xr-x   - hadoop supergroup          0 2015-05-22 07:21 /tmp/mahout-work-hadoop/20news-vectors/tfidf-vectors
drwxr-xr-x   - hadoop supergroup          0 2015-05-22 07:14 /tmp/mahout-work-hadoop/20news-vectors/tokenized-documents
drwxr-xr-x   - hadoop supergroup          0 2015-05-22 07:15 /tmp/mahout-work-hadoop/20news-vectors/wordcount
The 20news-vectors directory contains two files (dictionary.file-0 and frequency.file-0) and five subdirectories.
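dictionary.file-0 maps each term to an integer index, and the vector directories store each document as a sparse set of (term index, weight) entries. A rough, hypothetical Python illustration of that representation:

```python
def build_dictionary(tokenized_docs):
    """Assign each distinct term an integer index, like dictionary.file-0."""
    dictionary = {}
    for tokens in tokenized_docs:
        for term in tokens:
            dictionary.setdefault(term, len(dictionary))
    return dictionary

def to_sparse_tf(tokens, dictionary):
    """Encode one document as {term index: term frequency}, the sparse
    form conceptually stored under tf-vectors."""
    vec = {}
    for term in tokens:
        idx = dictionary[term]
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

Storing only the nonzero entries is what keeps the vector files small even though the 20newsgroups vocabulary runs to tens of thousands of terms.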
2.1 The df-count directory
[hadoop@hadoop ~]$ hdfs dfs -ls /tmp/mahout-work-hadoop/20news-vectors/df-count
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-05-22 07:18 /tmp/mahout-work-hadoop/20news-vectors/df-count/_SUCCESS
-rw-r--r--   1 hadoop supergroup    1890073 2015-05-22 07:18 /tmp/mahout-work-hadoop/20news-vectors/df-count/part-r-00000
2.2 The tf-vectors directory
[hadoop@hadoop ~]$ hdfs dfs -ls /tmp/mahout-work-hadoop/20news-vectors/tf-vectors
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-05-22 07:20 /tmp/mahout-work-hadoop/20news-vectors/tf-vectors/_SUCCESS
-rw-r--r--   1 hadoop supergroup   28689283 2015-05-22 07:20 /tmp/mahout-work-hadoop/20news-vectors/tf-vectors/part-r-00000
2.3 The tfidf-vectors directory
[hadoop@hadoop ~]$ hdfs dfs -ls /tmp/mahout-work-hadoop/20news-vectors/tfidf-vectors
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-05-22 07:21 /tmp/mahout-work-hadoop/20news-vectors/tfidf-vectors/_SUCCESS
-rw-r--r--   1 hadoop supergroup   28689283 2015-05-22 07:21 /tmp/mahout-work-hadoop/20news-vectors/tfidf-vectors/part-r-00000
2.4 The tokenized-documents directory
[hadoop@hadoop ~]$ hdfs dfs -ls /tmp/mahout-work-hadoop/20news-vectors/tokenized-documents
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-05-22 07:14 /tmp/mahout-work-hadoop/20news-vectors/tokenized-documents/_SUCCESS
-rw-r--r--   1 hadoop supergroup   27503580 2015-05-22 07:14 /tmp/mahout-work-hadoop/20news-vectors/tokenized-documents/part-m-00000
2.5 The wordcount directory
[hadoop@hadoop ~]$ hdfs dfs -ls /tmp/mahout-work-hadoop/20news-vectors/wordcount
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-05-22 07:15 /tmp/mahout-work-hadoop/20news-vectors/wordcount/_SUCCESS
-rw-r--r--   1 hadoop supergroup    2315037 2015-05-22 07:15 /tmp/mahout-work-hadoop/20news-vectors/wordcount/part-r-00000
3. How the two files and five directories above are generated
3.1 tokenized-documents is generated by a map-only job
SparseVectorsFromSequenceFiles: Tokenizing documents in /tmp/mahout-work-hadoop/20news-seq
3.2 Generating Term Frequency Vectors from tokenized-documents
This is a MapReduce job; as the log below shows, the dictionary it builds is saved under the wordcount directory, while the term-frequency vectors themselves end up under tf-vectors.
15/05/22 07:14:54 INFO SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
15/05/22 07:14:54 INFO DictionaryVectorizer: Creating dictionary from /tmp/mahout-work-hadoop/20news-vectors/tokenized-documents and saving at /tmp/mahout-work-hadoop/20news-vectors/wordcount
3.3 Calculating the IDF
This is a MapReduce job; its output is placed in the df-count directory.
15/05/22 07:17:34 INFO SparseVectorsFromSequenceFiles: Calculating IDF
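The document frequencies in df-count feed the TF-IDF weighting that produces the final tfidf-vectors. Mahout delegates the exact formula to a Lucene-style similarity; the sketch below uses the common tf * log(N/df) form purely as an illustration, not Mahout's precise weighting:

```python
import math

def tfidf(tf, df, num_docs):
    """Illustrative TF-IDF weight: term frequency scaled by the inverse
    document frequency log(num_docs / df). Mahout's actual formula
    (a Lucene similarity) differs in detail."""
    return tf * math.log(num_docs / df)
```

Note the two limiting behaviors: a term appearing in every document gets weight zero, and rarer terms are weighted more heavily than common ones.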
Generate the Bayes model with the following command:
./bin/mahout trainnb -i /tmp/mahout-work-hadoop/20news-train-vectors -o /tmp/mahout-work-hadoop/model -li /tmp/mahout-work-hadoop/labelindex -ow -c
As the command shows, Bayes training takes /tmp/mahout-work-hadoop/20news-train-vectors as input and writes the model to /tmp/mahout-work-hadoop/model; the -c flag selects the complement (CBayes) training variant.
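Complement naive Bayes, the CBayes variant of the title, models how poorly a document matches the complement of each class (all the other classes) rather than the class itself, which behaves better when class sizes are skewed. A toy, hypothetical sketch of complement training and scoring, not Mahout's implementation:

```python
import math
from collections import defaultdict

def train_cnb(docs):
    """docs: list of (label, {term: count}). For each class c, accumulate
    term counts over every *other* class (the complement), with add-one
    smoothing. Returns per-class complement log-weights."""
    labels = {label for label, _ in docs}
    vocab = {t for _, vec in docs for t in vec}
    comp_counts = {c: defaultdict(int) for c in labels}
    for label, vec in docs:
        for c in labels:
            if c != label:
                for term, n in vec.items():
                    comp_counts[c][term] += n
    weights = {}
    for c in labels:
        total = sum(comp_counts[c].values()) + len(vocab)
        weights[c] = {t: math.log((comp_counts[c][t] + 1) / total) for t in vocab}
    return weights

def classify_cnb(vec, weights):
    """Pick the class whose complement explains the document *worst*
    (lowest complement log-likelihood)."""
    return min(weights, key=lambda c: sum(n * weights[c].get(t, 0.0)
                                          for t, n in vec.items()))
```

Because each class's statistics are pooled from all the other classes, no single small class suffers from sparse counts, which is the main motivation for the complement formulation.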