【Mahout一】基于Mahout 命令参数含义-技术博客集

【Mahout一】基于Mahout 命令参数含义
编程技术 / houtizong 发布于 3年前 128

1. mahout seqdirectory

    $ mahout seqdirectory         --input (-i) input               Path to job input directory(原始文本文件).        --output (-o) output             The directory pathname for output.（<Text,Text>Sequence File）        -ow

功能：将原始文本数据集转换为< Text, Text > SequenceFile

2. mahout seq2sparke

功能： Convert and preprocesses the dataset（<Text,Text> SequenceFile） into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.

即根据Sequence File转换为tfidf向量文件

说明：If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization

    mahout seq2sparse                               --output (-o) output             The directory pathname for output.              --input (-i) input               Path to job input directory.                    --weight (-wt) weight            The kind of weight to use. Currently TF                                              or TFIDF. Default: TFIDF                        --norm (-n) norm                 The norm to use, expressed as either a                                               float or "INF" if you want to use the                                                Infinite norm.  Must be greater or equal                                             to 0.  The default is not to normalize          --overwrite (-ow)                If set, overwrite the output directory          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should                                             be SequentialAccessVectors. If set true                                              else false                                      --namedVector (-nv)              (Optional) Whether output vectors should                                             be NamedVectors. If set true else false

-i Sequence File文件目录

-o 向量文件输出目录

-wt 权重类型，支持TF或者TFIDF两种选项，默认TFIDF

-n 使用的正规化，使用浮点数或者"INF"表示，

-ow 指定该参数，将覆盖已有的输出目录

-seq 指定该参数，那么输出的向量是SequentialAccessVectors

-nv 指定该参数，那么输出的向量是NamedVectors

3. mahout split

功能：Split the preprocessed dataset into training and testing sets.

将预处理的tfidf向量集转换为training和testing向量集

    $ mahout split         -i ${WORK_DIR}/20news-vectors/tfidf-vectors         --trainingOutput ${WORK_DIR}/20news-train-vectors         --testOutput ${WORK_DIR}/20news-test-vectors          --randomSelectionPct 40         --overwrite --sequenceFiles -xm sequential

说明：如上是将向量数据集分为训练数据和检测数据，以随机40-60拆分

3. mahout trainnb

功能：训练分类器

mahout trainnb  --input (-i) input               Path to job input directory.                   --output (-o) output             The directory pathname for output.                      --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0  --trainComplementary (-c)        Train complementary? Default is false.                          --labelIndex (-li) labelIndex    The path to store the label index in           --overwrite (-ow)                If present, overwrite the output directory                                          before running job                             --help (-h)                      Print out help                                 --tempDir tempDir                Intermediate output directory                  --startPhase startPhase          First phase to run                             --endPhase endPhase              Last phase to run

-i 输入路径

-o 输出路径

-a

-c 补偿性训练

-li label index文件的目录

-ow 指定该参数，删除输出目录

tempDir MapReduce作业的中间结果

startPhase 运行的第一个阶段

endPhase 运行的最后一个阶段

4. mahout testnb

功能：检验Bayes分类器

mahout testnb     --input (-i) input               Path to job input directory.                    --output (-o) output             The directory pathname for output.              --overwrite (-ow)                If present, overwrite the output directory                                           before running job  --model (-m) model               The path to the model built during training     --testComplementary (-c)         Test complementary? Default is false.                            --runSequential (-seq)           Run sequential?                                 --labelIndex (-l) labelIndex     The path to the location of the label index     --help (-h)                      Print out help                                  --tempDir tempDir                Intermediate output directory                   --startPhase startPhase          First phase to run                              --endPhase endPhase              Last phase to run

-i 输入路径

-o 输出路径

-ow 覆盖输出目录

-c

上一篇：【Mahout三】基于Mahout CBayes算法的20newsgroup流程分析

下一篇：【Mahout二】基于Mahout CBayes算法的20newsgroup的脚本分析

请勿发布不友善或者负能量的内容。与人为善，比聪明更重要！

<div > 1. mahout seqdirectory &nbsp; <pre name="code" class="java"> $ mahout seqdirectory --input (-i) input Path to job input directory(原始文本文件). --output (-o) output The directory pathname for output.（&lt;Text,Text&gt;Sequence File） -ow</pre> &nbsp; &nbsp; 功能： 将原始文本数据集转换为&lt; Text, Text &gt; SequenceFile &nbsp; &nbsp; &nbsp; 2. mahout seq2sparke &nbsp; 功能： Convert and preprocesses the dataset（&lt;Text,Text&gt; SequenceFile） into a &lt; Text, VectorWritable &gt; SequenceFile containing term frequencies for each document. 即根据Sequence File转换为tfidf向量文件 说明：If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization &nbsp; &nbsp; <pre name="code" class="java"> mahout seq2sparse --output (-o) output The directory pathname for output. --input (-i) input Path to job input directory. --weight (-wt) weight The kind of weight to use. Currently TF or TFIDF. Default: TFIDF --norm (-n) norm The norm to use, expressed as either a float or &quot;INF&quot; if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize --overwrite (-ow) If set, overwrite the output directory --sequentialAccessVector (-seq) (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false --namedVector (-nv) (Optional) Whether output vectors should be NamedVectors. If set true else false</pre> &nbsp; -i Sequence File文件目录 -o 向量文件输出目录 -wt 权重类型，支持TF或者TFIDF两种选项，默认TFIDF -n 使用的正规化，使用浮点数或者&quot;INF&quot;表示， -ow 指定该参数，将覆盖已有的输出目录 -seq 指定该参数，那么输出的向量是SequentialAccessVectors -nv 指定该参数，那么输出的向量是NamedVectors &nbsp; 3. mahout split &nbsp; 功能：Split the preprocessed dataset into training and testing sets. 将预处理的tfidf向量集转换为training和testing向量集 &nbsp; <pre name="code" class="java"> $ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential</pre> &nbsp; 说明：如上是将向量数据集分为训练数据和检测数据，以随机40-60拆分 &nbsp; 3. mahout trainnb &nbsp; 功能：训练分类器 &nbsp; <pre name="code" class="java">mahout trainnb --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --alphaI (-a) alphaI Smoothing parameter. Default is 1.0 --trainComplementary (-c) Train complementary? Default is false. --labelIndex (-li) labelIndex The path to store the label index in --overwrite (-ow) If present, overwrite the output directory before running job --help (-h) Print out help --tempDir tempDir Intermediate output directory --startPhase startPhase First phase to run --endPhase endPhase Last phase to run</pre> &nbsp; -i 输入路径 -o 输出路径 -a -c 补偿性训练 -li label index文件的目录 -ow 指定该参数，删除输出目录 tempDir MapReduce作业的中间结果 startPhase 运行的第一个阶段 endPhase 运行的最后一个阶段 &nbsp; 4. mahout testnb &nbsp; 功能：检验Bayes分类器 <pre name="code" class="java">mahout testnb --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --overwrite (-ow) If present, overwrite the output directory before running job --model (-m) model The path to the model built during training --testComplementary (-c) Test complementary? Default is false. --runSequential (-seq) Run sequential? --labelIndex (-l) labelIndex The path to the location of the label index --help (-h) Print out help --tempDir tempDir Intermediate output directory --startPhase startPhase First phase to run --endPhase endPhase Last phase to run</pre> -i 输入路径 -o 输出路径 -ow 覆盖输出目录 -c &nbsp; &nbsp; &nbsp; </div>

留言需要登陆哦

技术博客集 - 网站简介：
前后端技术：
后端基于Hyperf2.1框架开发,前端使用Bootstrap可视化布局系统生成
网站主要作用：
1.编程技术分享及讨论交流，内置聊天系统;
2.测试交流框架问题，比如：Hyperf、Laravel、TP、beego;
3.本站数据是基于大数据采集等爬虫技术为基础助力分享知识，如有侵权请发邮件到站长邮箱，站长会尽快处理;
4.站长邮箱：[email protected];

文章归档

文章标签

友情链接

首页
关于我们

Auther ·HouTiZong: 侯体宗的博客