[Spark 61] Log analysis with Spark Streaming, Flume, and Kafka
Programming / posted by houtizong 3 years ago · 232

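Step one: connect Flume to Kafka, so that Flume collects the logs and writes them to Kafka.

Step two: Spark Streaming reads the data from Kafka and analyzes it in real time.

This post covers the first step only. It uses the console consumer script that ships with Kafka to read the messages, which is enough to verify the Flume-to-Kafka link end to end.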

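1. Flume configuration

1. Download the Flume-Kafka integration plugin from https://github.com/beyondj2ee/flumeng-kafka-plugin and copy flumeng-kafka-plugin.jar from its package directory into the lib directory of the Flume installation.

2. Copy the following jars from the libs directory of the Kafka installation into the lib directory of the Flume installation:

kafka_2.10-0.8.1.1.jar

scala-library-2.10.1.jar

metrics-core-2.2.0.jar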

3. Add the agent configuration

producer.sources = s
producer.channels = c
producer.sinks = r

# source section
#producer.sources.s.type = seq
producer.sources.s.type = netcat
producer.sources.s.bind = localhost
producer.sources.s.port = 44444
producer.sources.s.channels = c

# Each sink's type must be defined
producer.sinks.r.type = org.apache.flume.plugins.KafkaSink
producer.sinks.r.metadata.broker.list=127.0.0.1:9092
producer.sinks.r.partition.key=0
producer.sinks.r.partitioner.class=org.apache.flume.plugins.SinglePartition
producer.sinks.r.serializer.class=kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks=0
producer.sinks.r.max.message.size=1000000
producer.sinks.r.producer.type=sync
producer.sinks.r.custom.encoding=UTF-8
# Name of the Kafka topic that receives the messages
producer.sinks.r.custom.topic.name=test

# Specify the channel the sink should use
producer.sinks.r.channel = c

# Each channel's type is defined.
producer.channels.c.type = memory
producer.channels.c.capacity = 1000

3.1 The sink type is KafkaSink, which forwards the log events to the Kafka message queue. The partitioner class is SinglePartition.

3.2 The topic name is test.

3.3 The Flume source is netcat, listening on (localhost, 44444).
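The agent configuration only works if every source and sink is attached to a channel that the agent actually declares. A minimal sketch of that wiring rule, using only `java.util.Properties`: the class `AgentWiringCheck` and its methods are hypothetical helpers written for this illustration, not part of Flume, and the check covers channel names only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not part of Flume): checks the channel wiring of one agent.
class AgentWiringCheck {

    // Splits a space-separated value such as "c1 c2" into a sorted set of names.
    static Set<String> names(Properties props, String key) {
        String value = props.getProperty(key, "").trim();
        Set<String> result = new TreeSet<>();
        if (!value.isEmpty()) {
            result.addAll(Arrays.asList(value.split("\\s+")));
        }
        return result;
    }

    // Returns a list of wiring problems; an empty list means the wiring is consistent.
    static List<String> check(String agent, String config) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(config));
        Set<String> channels = names(props, agent + ".channels");
        List<String> problems = new ArrayList<>();

        // Sources use the plural key "channels" and may list several channels.
        for (String source : names(props, agent + ".sources")) {
            Set<String> used = names(props, agent + ".sources." + source + ".channels");
            if (used.isEmpty()) {
                problems.add("source " + source + " is not attached to any channel");
            }
            for (String channel : used) {
                if (!channels.contains(channel)) {
                    problems.add("source " + source + " uses undeclared channel " + channel);
                }
            }
        }

        // Sinks use the singular key "channel" and read from exactly one channel.
        for (String sink : names(props, agent + ".sinks")) {
            String channel = props.getProperty(agent + ".sinks." + sink + ".channel", "").trim();
            if (!channels.contains(channel)) {
                problems.add("sink " + sink + " uses undeclared channel '" + channel + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // A shortened copy of the agent configuration from this post.
        String config = String.join("\n",
                "producer.sources = s",
                "producer.channels = c",
                "producer.sinks = r",
                "producer.sources.s.type = netcat",
                "producer.sources.s.channels = c",
                "producer.sinks.r.type = org.apache.flume.plugins.KafkaSink",
                "producer.sinks.r.channel = c",
                "producer.channels.c.type = memory");
        System.out.println(check("producer", config));
    }
}
```

Note the asymmetry the sketch relies on: a source writes to one or more channels (`channels`), while a sink drains exactly one (`channel`).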

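4. Start Flume

./flume-ng agent -f ../conf/kafka.conf -c . -n producer

The -f option specifies the configuration file and -n specifies the agent name, which must match the prefix used in the configuration (producer).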

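Kafka configuration

5. Start Kafka

./kafka-server-start.sh ../config/server.properties

5.1 Kafka depends on ZooKeeper, so ZooKeeper has to be running before the broker is started. Once both are up, create a topic named test.

5.2 Start the Kafka console consumer

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

6. Start telnet and type the data that the netcat source should receive

telnet localhost 44444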
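For scripted tests, the telnet session can be replaced by a small client that writes newline-terminated lines to the same port. The following is a sketch under assumptions: the class name `NetcatLineSender` is made up for this example, the host and port are the ones from the agent configuration above, and the 5-second read timeout is an arbitrary choice. It reads back whatever the server replies until the server closes the connection or the timeout expires.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the telnet session: sends text lines over TCP.
class NetcatLineSender {

    // Sends each line terminated by '\n' and returns the text the server sent back.
    static String sendLines(String host, int port, String... lines) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(5000); // assumed timeout, so a silent server cannot block forever

            Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
            for (String line : lines) {
                out.write(line);
                out.write('\n'); // one newline-terminated line per event
            }
            out.flush();
            socket.shutdownOutput(); // tell the server that no more data will follow

            // Collect the server's reply until it closes the connection or goes quiet.
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[1024];
            try {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    reply.write(buffer, 0, n);
                }
            } catch (SocketTimeoutException e) {
                // The server kept the connection open; return what arrived so far.
            }
            return new String(reply.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes the Flume agent from this post is listening on localhost:44444.
        String reply = sendLines("localhost", 44444, "hello from java", "second event");
        System.out.println("server replied: " + reply);
    }
}
```

Each line sent this way should show up in the console consumer started in step 5.2, exactly as if it had been typed into telnet.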

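Data flow

1. Data typed into the telnet terminal is received by the Flume source.

2. Flume writes the data to the Kafka message queue. The logic that sends messages to Kafka lives in the Flume-Kafka plugin.

3. The Kafka consumer sees the new messages arrive in the queue and prints them on the console of the consumer process.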

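Open questions:

1. Here messages are sent to Kafka through SinglePartition. With a Kafka cluster, how does Flume write messages to multiple partitions of one topic?

2. The Flume source here listens on port 44444. How can it follow log files instead? Log files keep growing, and new log files are created automatically over time, so how should this be handled in the Flume-to-Kafka pipeline?

For log files, the way to go is Flume combined with Log4J. There is a Log4J appender written specifically for Flume: whatever the application logs to the file can also be sent through the appender to Flume as the data source. Once the Flume source has received the data, it passes it through the channel to the sink (here, KafkaSink).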

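About Kafka partitions

1. The first question: the implementation of SinglePartition

package org.apache.flume.plugins;

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SinglePartition implements Partitioner<String> {

    // Constructor taking the producer properties; there is nothing to configure here.
    public SinglePartition(VerifiableProperties props) {
    }

    // Always returns partition 0, so every message lands in the same partition.
    @Override
    public int partition(String key, int numberOfPartions) {
        return 0;
    }
}

So, to spread messages over several partitions, it is enough to have the partition method return a value derived from the key's hash, such as key.hashCode() % numberOfPartitions. Because hashCode() can be negative in Java, the result has to be forced into the range 0 to numberOfPartitions - 1, for example with (key.hashCode() & 0x7fffffff) % numberOfPartitions.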
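The arithmetic behind a hash-based partitioner can be tried out without the Kafka jars. The sketch below is a stand-alone illustration: the class `HashPartitionSketch` is hypothetical and does not implement Kafka's `Partitioner` interface, and mapping a null key to partition 0 is an assumption made for this example. It uses `Math.floorMod` (Java 8+), which always returns a non-negative result for a positive modulus.

```java
// Hypothetical, stand-alone sketch of hash partitioning (no Kafka dependency).
class HashPartitionSketch {

    // Maps a key to a partition index in the range [0, numPartitions).
    static int partition(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive: " + numPartitions);
        }
        if (key == null) {
            return 0; // assumption for this sketch: keyless messages go to partition 0
        }
        // floorMod keeps the result non-negative even when hashCode() is negative,
        // which a plain % would not, and Math.abs cannot fix Integer.MIN_VALUE.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"host-a", "host-b", "host-c"}) {
            System.out.println(key + " -> partition " + partition(key, 17));
        }
    }
}
```

In a real partitioner the same expression would go into the body of the `partition` method. Messages only spread out if their keys differ, so a constant key still sends everything to one partition.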

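2. The second question: how do you set the number of partitions of a Kafka topic?

The number of partitions is specified when the topic is created:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Next, create a topic with 17 partitions and a replication factor of 3, and look at what gets recorded in ZooKeeper:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 17 --topic test_many_partitions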

2.1 This fails with an error: the replication factor cannot be larger than the number of available brokers.

[hadoop@hadoop kafka_2.10-0.8.1.1]$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 17 --topic test_many_partitions
Error while executing topic command replication factor: 3 larger than available brokers: 1
kafka.admin.AdminOperationException: replication factor: 3 larger than available brokers: 1
    at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:70)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:155)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:86)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:50)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
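The reason for the limit is that every replica of a partition has to live on a different broker, so a partition cannot have more replicas than there are brokers. The sketch below illustrates this with a deliberately simplified round-robin placement. It is not Kafka's actual assignment algorithm (the real one in `AdminUtils.assignReplicasToBrokers` is more elaborate), and the class `ReplicaAssignmentSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of replica placement; not Kafka's real algorithm.
class ReplicaAssignmentSketch {

    // Returns, for each partition, the list of broker ids that hold its replicas.
    static List<List<Integer>> assign(List<Integer> brokers, int partitions, int replicationFactor) {
        if (partitions <= 0 || replicationFactor <= 0) {
            throw new IllegalArgumentException("partitions and replication factor must be positive");
        }
        if (replicationFactor > brokers.size()) {
            // Replicas of one partition must sit on distinct brokers, so this cannot work.
            throw new IllegalArgumentException("replication factor: " + replicationFactor
                    + " larger than available brokers: " + brokers.size());
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Walk the broker list starting at a per-partition offset; since
                // replicationFactor <= brokers.size(), no broker is picked twice.
                replicas.add(brokers.get((p + r) % brokers.size()));
            }
            assignment.add(replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three brokers, four partitions, two replicas each.
        System.out.println(assign(Arrays.asList(0, 1, 2), 4, 2));
        // A single broker with replication factor 3 is rejected, as in the error above.
        try {
            assign(Arrays.asList(0), 17, 3);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a single broker, as in this post, the only replication factor that can succeed is 1.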

2.2 Once the topic has been created (on this single-broker setup that requires a replication factor of 1), the Kafka server log shows:

[2015-02-14 02:53:53,526] INFO Completed load of log test_many_partitions-4 with log end offset 0 (kafka.log.Log)
[2015-02-14 02:53:53,526] INFO Created log for partition [test_many_partitions,4] in /tmp/kafka-logs with properties {segment.index.bytes -> 10485760, file.delete.delay.ms -> 60000, segment.bytes -> 536870912, flush.ms -> 9223372036854775807, delete.retention.ms -> 86400000, index.interval.bytes -> 4096, retention.bytes -> -1, cleanup.policy -> delete, segment.ms -> 604800000, max.message.bytes -> 1000012, flush.messages -> 9223372036854775807, min.cleanable.dirty.ratio -> 0.5, retention.ms -> 604800000}. (kafka.log.LogManager)
[2015-02-14 02:53:53,527] WARN Partition [test_many_partitions,4] on broker 0: No checkpointed highwatermark is found for partition [test_many_partitions,4] (kafka.cluster.Partition)
[2015-02-14 02:53:53,540] INFO Completed load of log test_many_partitions-13 with log end offset 0 (kafka.log.Log)
[2015-02-14 02:53:53,541] INFO Created log for partition [test_many_partitions,13] in /tmp/kafka-logs with properties {segment.index.bytes -> 10485760, file.delete.delay.ms -> 60000, segment.bytes -> 536870912, flush.ms -> 9223372036854775807, delete.retention.ms -> 86400000, index.interval.bytes -> 4096, retention.bytes -> -1, cleanup.policy -> delete, segment.ms -> 604800000, max.message.bytes -> 1000012, flush.messages -> 9223372036854775807, min.cleanable.dirty.ratio -> 0.5, retention.ms -> 604800000}. (kafka.log.LogManager)
[2015-02-14 02:53:53,541] WARN Partition [test_many_partitions,13] on broker 0: No checkpointed highwatermark is found for partition [test_many_partitions,13] (kafka.cluster.Partition)
[2015-02-14 02:53:53,554] INFO Completed load of log test_many_partitions-1 with log end offset 0 (kafka.log.Log)
[2015-02-14 02:53:53,555] INFO Created log for partition [test_many_partitions,1] in /tmp/kafka-logs with properties {segment.index.bytes -> 10485760, file.delete.delay.ms -> 60000, segment.bytes -> 536870912, flush.ms -> 9223372036854775807, delete.retention.ms -> 86400000, index.interval.bytes -> 4096, retention.bytes -> -1, cleanup.policy -> delete, segment.ms -> 604800000, max.message.bytes -> 1000012, flush.messages -> 9223372036854775807, min.cleanable.dirty.ratio -> 0.5, retention.ms -> 604800000}. (kafka.log.LogManager)
[2015-02-14 02:53:53,555] WARN Partition [test_many_partitions,1] on broker 0: No checkpointed highwatermark is found for partition [test_many_partitions,1] (kafka.cluster.Partition)

2.3 Inspect what ZooKeeper records for the multi-partition topic. The result:

17 partitions

[zk: localhost:2181(CONNECTED) 26] ls /brokers/topics
[test_many_partitions, test]
[zk: localhost:2181(CONNECTED) 27] ls /brokers/topics/test_many_partitions
[partitions]
[zk: localhost:2181(CONNECTED) 28] ls /brokers/topics/test_many_partitions/partitions
[15, 16, 13, 14, 11, 12, 3, 2, 1, 10, 0, 7, 6, 5, 4, 9, 8]

1 partition

[zk: localhost:2181(CONNECTED) 30] ls /brokers/topics/test
[partitions]
[zk: localhost:2181(CONNECTED) 31] ls /brokers/topics/test/partitions
[0]

References:

https://github.com/beyondj2ee/flumeng-kafka-plugin

http://blog.csdn.net/weijonathan/article/details/18301321

http://liyonghui160com.iteye.com/blog/2173235


