【Mahout一】基于Mahout 命令参数含义

编程技术  /  houtizong 发布于 3年前   112

1. mahout seqdirectory

 

    $ mahout seqdirectory         --input (-i) input               Path to job input directory(原始文本文件).        --output (-o) output             The directory pathname for output.(<Text,Text>Sequence File)        -ow

 

 

功能: 将原始文本数据集转换为< Text, Text > SequenceFile

 

 

 

2. mahout seq2sparke

 

功能: Convert and preprocesses the dataset(<Text,Text> SequenceFile) into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.

即根据Sequence File转换为tfidf向量文件

说明:If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization

 

 

    mahout seq2sparse                               --output (-o) output             The directory pathname for output.              --input (-i) input               Path to job input directory.                    --weight (-wt) weight            The kind of weight to use. Currently TF                                              or TFIDF. Default: TFIDF                        --norm (-n) norm                 The norm to use, expressed as either a                                               float or "INF" if you want to use the                                                Infinite norm.  Must be greater or equal                                             to 0.  The default is not to normalize          --overwrite (-ow)                If set, overwrite the output directory          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should                                             be SequentialAccessVectors. If set true                                              else false                                      --namedVector (-nv)              (Optional) Whether output vectors should                                             be NamedVectors. If set true else false

 

-i Sequence File文件目录

-o 向量文件输出目录

-wt 权重类型,支持TF或者TFIDF两种选项,默认TFIDF

-n 使用的正规化,使用浮点数或者"INF"表示,

-ow 指定该参数,将覆盖已有的输出目录

-seq 指定该参数,那么输出的向量是SequentialAccessVectors

-nv 指定该参数,那么输出的向量是NamedVectors

 

3. mahout split

 

功能:Split the preprocessed dataset into training and testing sets.

将预处理的tfidf向量集转换为training和testing向量集

 

    $ mahout split         -i ${WORK_DIR}/20news-vectors/tfidf-vectors         --trainingOutput ${WORK_DIR}/20news-train-vectors         --testOutput ${WORK_DIR}/20news-test-vectors          --randomSelectionPct 40         --overwrite --sequenceFiles -xm sequential

 

说明:如上是将向量数据集分为训练数据和检测数据,以随机40-60拆分

 

3. mahout trainnb

 

功能:训练分类器

 

mahout trainnb  --input (-i) input               Path to job input directory.                   --output (-o) output             The directory pathname for output.                      --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0  --trainComplementary (-c)        Train complementary? Default is false.                          --labelIndex (-li) labelIndex    The path to store the label index in           --overwrite (-ow)                If present, overwrite the output directory                                          before running job                             --help (-h)                      Print out help                                 --tempDir tempDir                Intermediate output directory                  --startPhase startPhase          First phase to run                             --endPhase endPhase              Last phase to run

 

-i 输入路径

-o 输出路径

-a

-c 补偿性训练

-li label index文件的目录

-ow 指定该参数,删除输出目录

tempDir MapReduce作业的中间结果

startPhase 运行的第一个阶段

endPhase 运行的最后一个阶段

 

4. mahout testnb

 

功能:检验Bayes分类器

mahout testnb     --input (-i) input               Path to job input directory.                    --output (-o) output             The directory pathname for output.              --overwrite (-ow)                If present, overwrite the output directory                                           before running job  --model (-m) model               The path to the model built during training     --testComplementary (-c)         Test complementary? Default is false.                            --runSequential (-seq)           Run sequential?                                 --labelIndex (-l) labelIndex     The path to the location of the label index     --help (-h)                      Print out help                                  --tempDir tempDir                Intermediate output directory                   --startPhase startPhase          First phase to run                              --endPhase endPhase              Last phase to run

-i 输入路径

-o 输出路径

-ow 覆盖输出目录

-c

 

 

 

请勿发布不友善或者负能量的内容。与人为善,比聪明更重要!

留言需要登陆哦

技术博客集 - 网站简介:
前后端技术:
后端基于Hyperf2.1框架开发,前端使用Bootstrap可视化布局系统生成

网站主要作用:
1.编程技术分享及讨论交流,内置聊天系统;
2.测试交流框架问题,比如:Hyperf、Laravel、TP、beego;
3.本站数据是基于大数据采集等爬虫技术为基础助力分享知识,如有侵权请发邮件到站长邮箱,站长会尽快处理;
4.站长邮箱:[email protected];

      订阅博客周刊 去订阅

文章归档

文章标签

友情链接

Auther ·HouTiZong
侯体宗的博客
© 2020 zongscan.com
版权所有ICP证 : 粤ICP备20027696号
PHP交流群 也可以扫右边的二维码
侯体宗的博客