&figure&&img src=&https://pic2.zhimg.com/v2-acae3d65b5eab9be3a897_b.jpg& data-rawwidth=&3422& data-rawheight=&1781& class=&origin_image zh-lightbox-thumb& width=&3422& data-original=&https://pic2.zhimg.com/v2-acae3d65b5eab9be3a897_r.jpg&&&/figure&&h3&介绍&/h3&&p&ElasticSearch 是一个基于 Lucene 的搜索服务器。它提供了一个分布式多用户能力的全文搜索引擎,基于 RESTful web 接口。Elasticsearch 是用 Java 开发的,并作为 Apache 许可条款下的开放源码发布,是当前流行的企业级搜索引擎。设计用于云计算中,能够达到实时搜索,稳定,可靠,快速,安装使用方便。维基百科、Stack Overflow、Github 都采用它。&/p&&p&本文从零开始,讲解如何使用 Elasticsearch 搭建自己的全文搜索引擎。每一步都有详细的说明,大家跟着做就能学会。&/p&&h3&环境&/h3&&p&1、VMware&/p&&p&2、Centos 6.6&/p&&p&3、Elasticsearch 5.5.2&/p&&p&4、JDK 1.8&/p&&p&VMware 安装以及在 VMware 中安装 Centos 这个就不说了,环境配置直接默认就好,不过分配给机器的内存最好设置大点(建议 2G)。&/p&&p&使用 dhclient 命令来自动获取 IP 地址,查看获取到的 IP 地址可以使用命令 ip addr 或者 ifconfig,就能看到网卡信息和 lo 卡信息。&/p&&p&给虚拟机中的 linux 设置固定的
ip(因为后面发现每次机器重启后又要重新使用 dhclient 命令来自动获取 IP 地址)&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&vim
/etc/sysconfig/network-scripts/ifcfg-eth0
&/code&&/pre&&/div&&p&修改:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&onboot=yes
bootproto=static
&/code&&/pre&&/div&&p&增加:(下面可设置可不设置)&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&IPADDR=192.168.1.113
网卡IP地址
GATEWAY=192.168.1.1
NETMASK=255.255.255.0
&/code&&/pre&&/div&&p&设置好之后,把网络服务重启一下,service network restart&/p&&p&修改 ip 地址参考: &a href=&https://link.zhihu.com/?target=http%3A//jingyan.baidu.com/article/e4d08ffddf60d70.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&jingyan.baidu.com/artic&/span&&span class=&invisible&&le/e4d08ffddf60d70.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&大环境都准备好了,下面开始安装步骤:&/p&&h3&安装 JDK 1.8&/h3&&p&先卸载自带的 openjdk,查找
openjdk&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&rpm -qa | grep java
&/code&&/pre&&/div&&p&卸载 openjdk&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&yum -y remove java-1.7.0-openjdk-1.7.0.65-2.5.1.2.el6_5.x86_64
yum -y remove java-1.6.0-openjdk-1.6.0.0-11.1.13.4.el6.x86_64
&/code&&/pre&&/div&&p&&strong&解压 JDK 安装包:&/strong&&/p&&p&附上jdk1.8的下载地址:
&a href=&https://link.zhihu.com/?target=http%3A//www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&oracle.com/technetwork/&/span&&span class=&invisible&&java/javase/downloads/jdk8-downloads-2133151.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&解压完成后配置一下环境变量就 ok&/p&&p&1、在/usr/local/下创建Java文件夹&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&cd /usr/local/
mkdir java
新建java目录
&/code&&/pre&&/div&&p&2、文件夹创建完毕,把安装包拷贝到 Java 目录中,然后解压 jdk 到当前目录&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&cp /usr/jdk-8u144-linux-x64.tar.gz /usr/local/java/
**注意匹配你自己的文件名**
拷贝到java目录
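cd /usr/local/java
进入 java 目录后再执行下面的解压命令(补充的一步,假设此时还在 /usr/local 目录下)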
tar -zxvf jdk-8u144-linux-x64.tar.gz
解压到当前目录(Java目录)
&/code&&/pre&&/div&&p&3、解压完之后,Java目录中会出现一个jdk1.8.0_144的目录,这就解压完成了。之后配置一下环境变量。
编辑/etc/下的profile文件,配置环境变量&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&vi /etc/profile
进入profile文件的编辑模式
在最后边追加一下内容(**配置的时候一定要根据自己的目录情况而定哦!**)
JAVA_HOME=/usr/local/java/jdk1.8.0_144
CLASSPATH=$JAVA_HOME/lib/
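# 补充说明:JDK 8 一般可以不配置 CLASSPATH;若要配置,传统写法是 CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar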
PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME CLASSPATH
&/code&&/pre&&/div&&p&之后保存并退出文件。&/p&&p&让文件生效:source /etc/profile&/p&&p&在控制台输入 java 和 java -version 看有没有信息输出,如下:java -version&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&java version &1.8.0_144&
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
&/code&&/pre&&/div&&p&能显示以上信息,就说明 JDK 安装成功啦&/p&&hr&&h3&安装 Maven&/h3&&p&因为后面可能会用到 maven ,先装上这个。&/p&&p&1、下载 maven&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
&/code&&/pre&&/div&&p&2、解压至 /usr/local 目录&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&tar -zxvf apache-maven-3.2.5-bin.tar.gz
&/code&&/pre&&/div&&p&3、配置公司给的配置&/p&&p&替换成公司给的 setting.xml 文件,修改关于本地仓库的位置, 默认位置: ${user.home}/.m2/repository&/p&&p&4、配置环境变量etc/profile 最后添加以下两行&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&export MAVEN_HOME=/usr/local/apache-maven-3.2.5
export PATH=${PATH}:${MAVEN_HOME}/bin
&/code&&/pre&&/div&&p&5、测试&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&[root@localhost ~]# mvn -v
Apache Maven 3.2.5 (12a6b3acbb81fd8cea1; T09:29:23-08:00)
Maven home: /usr/local/apache-maven-3.2.5
&/code&&/pre&&/div&&p&VMware 虚拟机里面的三台机器 IP 分别是:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&192.168.153.133
192.168.153.134
192.168.153.132
&/code&&/pre&&/div&&h3&配置 hosts&/h3&&p&在 /etc/hosts下面编写:ip
node 节点的名字(域名解析)&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&vim
/etc/hosts
&/code&&/pre&&/div&&p&新增:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&192.168.153.133
192.168.153.134
192.168.153.132
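# 按上面“ip 节点名”的格式,每个 IP 后面还应跟上对应节点的主机名,例如(节点名为示例值,需与后文 elasticsearch.yml 中的 node.name 保持一致):
# 192.168.153.133 es1
# 192.168.153.134 es2
# 192.168.153.132 es3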
&/code&&/pre&&/div&&h3&设置 SSH 免密码登录&/h3&&p&安装expect命令 : yum -y install expect&/p&&p&将 ssh_p2p.jar 随便解压到任何目录下: (这个 jar 包可以去网上下载)&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&unzip ssh_p2p.zip
&/code&&/pre&&/div&&p&修改 resource 的 ip 值&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&vim /ssh_p2p/deploy_data/resource
(各个节点和账户名,密码,free代表相互都可以无密码登陆)
&/code&&/pre&&/div&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&#设置为你每台虚拟机的ip地址,用户名,密码
&192.168.153.133,root,123456,free&
&192.168.153.134,root,123456,free&
&192.168.153.132,root,123456,free&
&/code&&/pre&&/div&&p&修改 start.sh 的运行权限&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&chmod u+x start.sh
&/code&&/pre&&/div&&p&运行&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&./start.sh
&/code&&/pre&&/div&&p&测试:&/p&&p&ssh ip地址
(测试是否可以登录)&/p&&h3&安装 ElasticSearch&/h3&&p&下载地址: &a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/downloads/elasticsearch& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/downloads/el&/span&&span class=&invisible&&asticsearch&/span&&span class=&ellipsis&&&/span&&/a&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.tar.gz
cd /usr/local
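# 假设已把上一步下载的 elasticsearch-5.5.2.tar.gz 拷贝到了 /usr/local 目录下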
tar -zxvf elasticsearch-5.5.2.tar.gz
&/code&&/pre&&/div&&p&su tzs切换到 tzs 用户下 ( 默认不支持 root 用户)&/p&&p&sh /usr/local/elasticsearch/bin/elasticsearch -d其中 -d 表示后台启动&/p&&p&在 vmware 上测试是否成功:curl http://localhost:9200/&/p&&p&&/p&&p&出现如上图这样的效果,就代表已经装好了。&/p&&p&elasticsearch 默认 restful-api 的端口是 9200 不支持 IP 地址,也就是说无法从主机访问虚拟机中的服务,只能在本机用 http://localhost:9200 来访问。如果需要改变,需要修改配置文件 /usr/local/elasticsearch/config/elasticsearch.yml 文件,加入以下两行:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&network.bind_host: 0.0.0.0
network.publish_host: _nonloopback:ipv4
&/code&&/pre&&/div&&p&或去除 network.host 和 http.port 之前的注释,并将 network.host 的 IP 地址修改为本机外网 IP。然后重启,Elasticsearch&/p&&p&关闭方法(输入命令:ps -ef | grep elasticsearch,找到进程,然后 kill 掉就行了。&/p&&p&如果外网还是不能访问,则有可能是防火墙设置导致的 ( 关闭防火墙:service iptables stop)&/p&&p&修改配置文件:vim config/elasticsearch.yml&/p&&p&cluster.name : my-app
(集群的名字,名字相同的就是一个集群)&/p&&p&node.name : es1
(节点的名字, 和前面配置的 hosts 中的 name 要一致)&/p&&p&path.data: /data/elasticsearch/data
(数据的路径。没有要创建(mkdir -p /data/elasticsearch/{data,logs}),并且给执行用户权限chown tzs /data/elasticsearch/{data,logs} -R)
path.logs: /data/elasticsearch/logs
(数据 log 信息的路径,同上)
network.host: 0.0.0.0
//允许外网访问,也可以是自己的ip地址
http.port: 9200
//访问的端口
discovery.zen.ping.unicast.hosts: [&192.168.153.133&, &192.168.153.134&, &192.168.153.132&]
//各个节点的ip地址&/p&&p&记得需要添加上:(这个是安装 head 插件要用的, 目前不需要)
http.cors.enabled: true
http.cors.allow-origin: &*&&/p&&p&最后在外部浏览器的效果如下图:&/p&&p&&/p&&h3&安装 IK 中文分词&/h3&&p&可以自己下载源码使用 maven 编译,当然如果怕麻烦可以直接下载编译好的&/p&&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/medcl/elasticsearch-analysis-ik/releases/tag/v5.5.2& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/medcl/elasti&/span&&span class=&invisible&&csearch-analysis-ik/releases/tag/v5.5.2&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&注意下载对应的版本放在 plugins 目录下&/p&&p&解压&/p&&p&unzip elasticsearch-analysis-ik-5.5.2.zip&/p&&p&在 es 的 plugins 下新建 ik 目录&/p&&p&mkdir ik&/p&&p&将刚才解压的复制到ik目录下&/p&&p&cp -r elasticsearch/* ik&/p&&p&删除刚才解压后的&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&rm -rf elasticsearch
rm -rf elasticsearch-analysis-ik-5.5.2.zip
&/code&&/pre&&/div&&h4&IK 带有两个分词器&/h4&&p&&strong&ik_max_word&/strong&:会将文本做最细粒度的拆分;尽可能多的拆分出词语&/p&&p&&strong&ik_smart&/strong&:会做最粗粒度的拆分;已被分出的词语将不会再次被其它词语占有&/p&&p&安装完 IK 中文分词器后(当然不止这种中文分词器,还有其他的,可以参考我的文章&a href=&https://link.zhihu.com/?target=http%3A//www.54tianzhisheng.cn//Elasticsearch-analyzers/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Elasticsearch 默认分词器和中分分词器之间的比较及使用方法&/a&),测试区别如下:&/p&ik_max_word&p&curl -XGET '&a href=&https://link.zhihu.com/?target=http%3A//192.168.153.134%3A9200/_analyze%3Fpretty%26analyzer%3Dik_max_word& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&192.168.153.134:9200/_a&/span&&span class=&invisible&&nalyze?pretty&analyzer=ik_max_word&/span&&span class=&ellipsis&&&/span&&/a&' -d '联想是全球最大的笔记本厂商'&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&{
&tokens& : [
&token& : &联想&,
&start_offset& : 0,
&end_offset& : 2,
&type& : &CN_WORD&,
&position& : 0
&token& : &是&,
&start_offset& : 2,
&end_offset& : 3,
&type& : &CN_CHAR&,
&position& : 1
&token& : &全球&,
&start_offset& : 3,
&end_offset& : 5,
&type& : &CN_WORD&,
&position& : 2
&token& : &最大&,
&start_offset& : 5,
&end_offset& : 7,
&type& : &CN_WORD&,
&position& : 3
&token& : &的&,
&start_offset& : 7,
&end_offset& : 8,
&type& : &CN_CHAR&,
&position& : 4
&token& : &笔记本&,
&start_offset& : 8,
&end_offset& : 11,
&type& : &CN_WORD&,
&position& : 5
&token& : &笔记&,
&start_offset& : 8,
&end_offset& : 10,
&type& : &CN_WORD&,
&position& : 6
&token& : &本厂&,
&start_offset& : 10,
&end_offset& : 12,
&type& : &CN_WORD&,
&position& : 7
&token& : &厂商&,
&start_offset& : 11,
&end_offset& : 13,
&type& : &CN_WORD&,
&position& : 8
&/code&&/pre&&/div&ik_smart&p&curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '联想是全球最大的笔记本厂商'&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&{
&tokens& : [
&token& : &联想&,
&start_offset& : 0,
&end_offset& : 2,
&type& : &CN_WORD&,
&position& : 0
&token& : &是&,
&start_offset& : 2,
&end_offset& : 3,
&type& : &CN_CHAR&,
&position& : 1
&token& : &全球&,
&start_offset& : 3,
&end_offset& : 5,
&type& : &CN_WORD&,
&position& : 2
&token& : &最大&,
&start_offset& : 5,
&end_offset& : 7,
&type& : &CN_WORD&,
&position& : 3
&token& : &的&,
&start_offset& : 7,
&end_offset& : 8,
&type& : &CN_CHAR&,
&position& : 4
&token& : &笔记本&,
&start_offset& : 8,
&end_offset& : 11,
&type& : &CN_WORD&,
&position& : 5
&token& : &厂商&,
&start_offset& : 11,
&end_offset& : 13,
&type& : &CN_WORD&,
&position& : 6
&/code&&/pre&&/div&&h3&安装 head 插件&/h3&&p&elasticsearch-head 是一个 elasticsearch 的集群管理工具,它是完全由 html5 编写的独立网页程序,你可以通过插件把它集成到 es。&/p&&p&效果如下图:(图片来自网络)&/p&&h4&安装 git&/h4&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&yum remove git
yum install git
git clone git://github.com/mobz/elasticsearch-head.git
拉取 head 插件到本地,或者直接在 GitHub 下载 压缩包下来
&/code&&/pre&&/div&&h4&安装nodejs&/h4&&p&先去官网下载 node-v8.4.0-linux-x64.tar.xz&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&tar -Jxv -f
node-v8.4.0-linux-x64.tar.xz
mv node-v8.4.0-linux-x64 /opt/node
&/code&&/pre&&/div&&p&环境变量设置:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&vim
/etc/profile
&/code&&/pre&&/div&&p&新增:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&export NODE_HOME=/opt/node
export PATH=$PATH:$NODE_HOME/bin
export NODE_PATH=$NODE_HOME/lib/node_modules
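# 注意:NODE_HOME 要与 node 实际解压、移动后的目录一致(这里假设为 /opt/node)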
&/code&&/pre&&/div&&p&使配置文件生效(这步很重要,自己要多注意这步)&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&source /etc/profile
&/code&&/pre&&/div&&p&测试是否全局可用了:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&node -v
&/code&&/pre&&/div&&p&然后&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&mv elasticsearch-head head
npm install -g grunt-cli
npm install
grunt server
&/code&&/pre&&/div&&p&在 es 的配置文件中加:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&http.cors.enabled: true
http.cors.allow-origin: &*&
&/code&&/pre&&/div&&p&在浏览器打开&a href=&https://link.zhihu.com/?target=http%3A//192.168.153.133%3A9100/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&192.168.153.133:9100/&/span&&span class=&invisible&&&/span&&/a&就可以看到效果了,&/p&&h3&遇到问题&/h3&&p&把坑都走了一遍,防止以后再次入坑,特此记录下来&/p&&p&&strong&1、ERROR Could not register mbeans java.security.AccessControlException: access denied (&javax.management.MBeanTrustPermission& &register&)&/strong&&/p&&p&改变 elasticsearch 文件夹所有者到当前用户&/p&&p&sudo chown -R noroot:noroot elasticsearch&/p&&p&这是因为 elasticsearch 需要读写配置文件,我们需要给予 config 文件夹权限,上面新建了 elsearch 用户,elsearch 用户不具备读写权限,因此还是会报错,解决方法是切换到管理员账户,赋予权限即可:&/p&&p&sudo -i&/p&&p&chmod -R 775 config&/p&&p&&strong&2、[WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]&/strong&&strong&org.elasticsearch.bootstrap.StartupException: java.lang.RuntimeException: can not run elasticsearch as root&/strong&&/p&&p&原因是elasticsearch默认是不支持用root用户来启动的。&/p&&p&解决方案一:Des.insecure.allow.root=true&/p&&p&修改/usr/local/elasticsearch-2.4.0/bin/elasticsearch,&/p&&p&添加 ES_JAVA_OPTS=&-Des.insecure.allow.root=true&&/p&&p&或执行时添加: sh /usr/local/elasticsearch-2.4.0/bin/elasticsearch -d -Des.insecure.allow.root=true&/p&&p&注意:正式环境用root运行可能会有安全风险,不建议用root来跑。&/p&&p&解决方案二:添加专门的用户&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&useradd elastic
chown -R elastic:elastic elasticsearch-2.4.0
su elastic
sh /usr/local/elasticsearch-2.4.0/bin/elasticsearch -d
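# 说明:elasticsearch-2.4.0 只是示例目录名,请替换成自己实际安装的版本目录(例如本文安装的 elasticsearch-5.5.2)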
&/code&&/pre&&/div&&p&&strong&3、UnsupportedOperationException: seccomp unavailable: requires kernel 3.5+ with CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER compiled in&/strong&&/p&&p&只是警告,使用新的linux版本,就不会出现此类问题了。&/p&&p&&strong&4、ERROR: [4] bootstrap checks failed&/strong&&strong&[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]&/strong&&/p&&p&原因:无法创建本地文件问题,用户最大可创建文件数太小&/p&&p&解决方案:切换到 root 用户,编辑 limits.conf 配置文件, 添加类似如下内容:&/p&&p&vim /etc/security/limits.conf&/p&&p&添加如下内容:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&*
soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
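# 修改 limits.conf 后需要重新登录(或重新打开会话)才会生效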
&/code&&/pre&&/div&&p&&strong&[2]: max number of threads [1024] for user [tzs] is too low, increase to at least [2048]&/strong&&/p&&p&原因:无法创建本地线程问题,用户最大可创建线程数太小&/p&&p&解决方案:切换到root用户,进入limits.d目录下,修改90-nproc.conf 配置文件。&/p&&p&vim /etc/security/limits.d/90-nproc.conf&/p&&p&找到如下内容:&/p&&ul&&li&soft nproc 1024&/li&&/ul&&p&修改为&/p&&ul&&li&soft nproc 2048&/li&&/ul&&p&&strong&[3]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]&/strong&&/p&&p&原因:最大虚拟内存太小&/p&&p&root用户执行命令:&/p&&p&sysctl -w vm.max_map_count=262144&/p&&p&或者修改 /etc/sysctl.conf 文件,添加 “vm.max_map_count”设置
设置后,可以使用如下命令让其立即生效:
$ sysctl -p&/p&&p&&strong&[4]: system call filte check the logs and fix your configuration or disable system call filters at your own risk&/strong&&/p&&p&原因:Centos6不支持SecComp,而ES5.4.1默认bootstrap.system_call_filter为true进行检测,所以导致检测失败,失败后直接导致ES不能启动。
详见 :&a href=&https://link.zhihu.com/?target=https%3A//github.com/elastic/elasticsearch/issues/22899& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/elastic/elas&/span&&span class=&invisible&&ticsearch/issues/22899&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&解决方法:在elasticsearch.yml中新增配置bootstrap.system_call_filter,设为false,注意要在Memory下面:
bootstrap.memory_lock: false
bootstrap.system_call_filter: false&/p&&p&&strong&5、java.lang.IllegalArgumentException: property [elasticsearch.version] is missing for plugin [head]&/strong&&/p&&p&在 es 的配置文件中加:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&http.cors.enabled: true
http.cors.allow-origin: &*&
&/code&&/pre&&/div&&h3&最后&/h3&&p&整个搭建的过程全程自己手动安装,不易,如果安装很多台机器,是否可以写个脚本之类的自动搭建呢?可以去想想的。首发于:&a href=&https://link.zhihu.com/?target=http%3A//www.54tianzhisheng.cn//Elasticsearch-install/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&54tianzhisheng.cn/2017/&/span&&span class=&invisible&&09/09/Elasticsearch-install/&/span&&span class=&ellipsis&&&/span&&/a&,转载请注明出处,谢谢配合!&/p&
&figure&&img src=&https://pic2.zhimg.com/v2-acae3d65b5eab9be3a897_b.jpg& data-rawwidth=&3422& data-rawheight=&1781& class=&origin_image zh-lightbox-thumb& width=&3422& data-original=&https://pic2.zhimg.com/v2-acae3d65b5eab9be3a897_r.jpg&&&/figure&&p&介绍:ElasticSearch 是一个基于 Lucene 的搜索服务器。它提供了一个分布式多用户能力的全文搜索引擎,基于 RESTful web 接口。Elasticsearch 是用 Java 开发的,并作为Apache许可条款下的开放源码发布,是当前流行的企业级搜索引擎。设计用于云计算中,能够达到实时搜索,稳定,可靠,快速,安装使用方便。&/p&&p&Elasticsearch中,内置了很多分词器(analyzers)。下面来进行比较下系统默认分词器和常用的中文分词器之间的区别。&/p&&h2&系统默认分词器:&/h2&&h3&1、standard 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-standard-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&如何使用:&a href=&https://link.zhihu.com/?target=http%3A//www.yiibai.com/lucene/lucene_standardanalyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&yiibai.com/lucene/lucen&/span&&span class=&invisible&&e_standardanalyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&英文的处理能力同于StopAnalyzer.支持中文采用的方法为单字切分。他会将词汇单元转换成小写形式,并去除停用词和标点符号。&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&cm&&/**StandardAnalyzer分析器*/&/span&
&span class=&kd&&public&/span& &span class=&kt&&void&/span& &span class=&nf&&standardAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&String&/span& &span class=&n&&msg&/span&&span class=&o&&){&/span&
&span class=&n&&StandardAnalyzer&/span& &span class=&n&&analyzer&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&StandardAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&Version&/span&&span class=&o&&.&/span&&span class=&na&&LUCENE_36&/span&&span class=&o&&);&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&getTokens&/span&&span class=&o&&(&/span&&span class=&n&&analyzer&/span&&span class=&o&&,&/span& &span class=&n&&msg&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
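// 本文各个分析器示例都通过 getTokens(analyzer, msg) 输出分词结果,
// 下面给出该辅助方法的一个最小参考实现(示例性质,基于 Lucene 3.6 的 TokenStream API,
// 需要 import java.io.IOException、java.io.StringReader、org.apache.lucene.analysis.Analyzer、
// org.apache.lucene.analysis.TokenStream、org.apache.lucene.analysis.tokenattributes.CharTermAttribute)
public void getTokens(Analyzer analyzer, String msg) {
    try {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(msg));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print(term.toString() + " | ");
        }
        ts.end();
        ts.close();
        System.out.println();
    } catch (IOException e) {
        e.printStackTrace();
    }
}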
&/code&&/pre&&/div&&h3&2、simple 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-simple-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&如何使用: &a href=&https://link.zhihu.com/?target=http%3A//www.yiibai.com/lucene/lucene_simpleanalyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&yiibai.com/lucene/lucen&/span&&span class=&invisible&&e_simpleanalyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&功能强于WhitespaceAnalyzer, 首先会通过非字母字符来分割文本信息,然后将词汇单元统一为小写形式。该分析器会去掉数字类型的字符。&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&cm&&/**SimpleAnalyzer分析器*/&/span&
&span class=&kd&&public&/span& &span class=&kt&&void&/span& &span class=&nf&&simpleAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&String&/span& &span class=&n&&msg&/span&&span class=&o&&){&/span&
&span class=&n&&SimpleAnalyzer&/span& &span class=&n&&analyzer&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&SimpleAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&Version&/span&&span class=&o&&.&/span&&span class=&na&&LUCENE_36&/span&&span class=&o&&);&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&getTokens&/span&&span class=&o&&(&/span&&span class=&n&&analyzer&/span&&span class=&o&&,&/span& &span class=&n&&msg&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&/code&&/pre&&/div&&h3&3、Whitespace 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-whitespace-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&如何使用:&a href=&https://link.zhihu.com/?target=http%3A//www.yiibai.com/lucene/lucene_whitespaceanalyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&yiibai.com/lucene/lucen&/span&&span class=&invisible&&e_whitespaceanalyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&仅仅是去除空格,对字符没有lowcase化,不支持中文;
并且不对生成的词汇单元进行其他的规范化处理。&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&cm&&/**WhitespaceAnalyzer分析器*/&/span&
&span class=&kd&&public&/span& &span class=&kt&&void&/span& &span class=&nf&&whitespaceAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&String&/span& &span class=&n&&msg&/span&&span class=&o&&){&/span&
&span class=&n&&WhitespaceAnalyzer&/span& &span class=&n&&analyzer&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&WhitespaceAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&Version&/span&&span class=&o&&.&/span&&span class=&na&&LUCENE_36&/span&&span class=&o&&);&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&getTokens&/span&&span class=&o&&(&/span&&span class=&n&&analyzer&/span&&span class=&o&&,&/span& &span class=&n&&msg&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&/code&&/pre&&/div&&h3&4、Stop 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-stop-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&如何使用:&a href=&https://link.zhihu.com/?target=http%3A//www.yiibai.com/lucene/lucene_stopanalyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&yiibai.com/lucene/lucen&/span&&span class=&invisible&&e_stopanalyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&StopAnalyzer的功能超越了SimpleAnalyzer,在SimpleAnalyzer的基础上增加了去除英文中的常用单词(如the,a等),也可以更加自己的需要设置常用单词;不支持中文&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&cm&&/**StopAnalyzer分析器*/&/span&
&span class=&kd&&public&/span& &span class=&kt&&void&/span& &span class=&nf&&stopAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&String&/span& &span class=&n&&msg&/span&&span class=&o&&){&/span&
&span class=&n&&StopAnalyzer&/span& &span class=&n&&analyzer&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&StopAnalyzer&/span&&span class=&o&&(&/span&&span class=&n&&Version&/span&&span class=&o&&.&/span&&span class=&na&&LUCENE_36&/span&&span class=&o&&);&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&getTokens&/span&&span class=&o&&(&/span&&span class=&n&&analyzer&/span&&span class=&o&&,&/span& &span class=&n&&msg&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&/code&&/pre&&/div&&h3&5、keyword 分词器&/h3&&p&KeywordAnalyzer把整个输入作为一个单独词汇单元,方便特殊类型的文本进行索引和检索。针对邮政编码,地址等文本信息使用关键词分词器进行索引项建立非常方便。&/p&&h3&6、pattern 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-pattern-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&一个pattern类型的analyzer可以通过正则表达式将文本分成&terms&(经过token Filter 后得到的东西 )。接受如下设置:&/p&&p&一个 pattern analyzer 可以做如下的属性设置:&/p&lowercaseterms是否是小写. 默认为 true 小写.pattern正则表达式的pattern, 默认是 \W+.flags正则表达式的flagsstopwords一个用于初始化stop filter的需要stop 单词的列表.默认单词是空的列表&h3&7、language 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-lang-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&一个用于解析特殊语言文本的analyzer集合。( arabic,armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french,galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian,persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.)可惜没有中文。不予考虑&/p&&h3&8、snowball 分词器&/h3&&p&一个snowball类型的analyzer是由standard tokenizer和standard filter、lowercase filter、stop filter、snowball filter这四个filter构成的。&/p&&p&snowball analyzer 在Lucene中通常是不推荐使用的。&/p&&h3&9、Custom 分词器&/h3&&p&是自定义的analyzer。允许多个零到多个tokenizer,零到多个 Char Filters. custom analyzer 的名字不能以 &_&开头.&/p&&p&The following are settings that can be set for a custom analyzer type:&/p&SettingDescriptiontokenizer通用的或者注册的tokenizer.filter通用的或者注册的token filterschar_filter通用的或者注册的 character filtersposition_increment_gap距离查询时,最大允许查询的距离,默认是100&p&自定义的模板:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&index :
analysis :
analyzer :
myAnalyzer2 :
type : custom
tokenizer : myTokenizer1
filter : [myTokenFilter1, myTokenFilter2]
char_filter : [my_html]
position_increment_gap: 256
tokenizer :
myTokenizer1 :
type : standard
max_token_length : 900
myTokenFilter1 :
type : stop
stopwords : [stop1, stop2, stop3, stop4]
myTokenFilter2 :
type : length
max : 2000
char_filter :
type : html_strip
escaped_tags : [xxx, yyy]
read_ahead : 1024
&/code&&/pre&&/div&&h3&10、fingerprint 分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&elastic.co/guide/en/ela&/span&&span class=&invisible&&sticsearch/reference/current/analysis-fingerprint-analyzer.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&hr&&h2&中文分词器:&/h2&&h3&1、ik-analyzer&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/wks/ik-analyzer& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/wks/ik-analy&/span&&span class=&invisible&&zer&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&IKAnalyzer是一个开源的,基于java语言开发的轻量级的中文分词工具包。&/p&&p&采用了特有的“正向迭代最细粒度切分算法“,支持细粒度和最大词长两种切分模式;具有83万字/秒(1600KB/S)的高速处理能力。&/p&&p&采用了多子处理器分析模式,支持:英文字母、数字、中文词汇等分词处理,兼容韩文、日文字符&/p&&p&优化的词典存储,更小的内存占用。支持用户词典扩展定义&/p&&p&针对Lucene全文检索优化的查询分析器IKQueryParser(作者吐血推荐);引入简单搜索表达式,采用歧义分析算法优化查询关键字的搜索排列组合,能极大的提高Lucene检索的命中率。&/p&&p&Maven用法:&/p&&div class=&highlight&&&pre&&code class=&language-xml&&&span&&/span&&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&org.wltea.ik-analyzer&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&ik-analyzer&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&3.2.8&span class=&nt&&&/version&&/span&
&span class=&nt&&&/dependency&&/span&
&/code&&/pre&&/div&&p&在IK Analyzer加入Maven Central Repository之前,你需要手动安装,安装到本地的repository,或者上传到自己的Maven repository服务器上。&/p&&p&要安装到本地Maven repository,使用如下命令,将自动编译,打包并安装:
mvn install -Dmaven.test.skip=true&/p&&h4&Elasticsearch添加中文分词&/h4&安装IK分词插件&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/medcl/elasticsearch-analysis-ik& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/medcl/elasti&/span&&span class=&invisible&&csearch-analysis-ik&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&进入elasticsearch-analysis-ik-master&/p&&p&更多安装请参考博客:&/p&&p&1、&a href=&https://link.zhihu.com/?target=http%3A//blog.csdn.net/dingzfang/article/details/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&为elastic添加中文分词&/a&: &a href=&https://link.zhihu.com/?target=http%3A//blog.csdn.net/dingzfang/article/details/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&blog.csdn.net/dingzfang&/span&&span class=&invisible&&/article/details/&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&2、&a href=&https://link.zhihu.com/?target=http%3A//www.cnblogs.com/xing901022/p/5910139.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&如何在Elasticsearch中安装中文分词器(IK+pinyin)&/a&:&a href=&https://link.zhihu.com/?target=http%3A//www.cnblogs.com/xing901022/p/5910139.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&cnblogs.com/xing901022/&/span&&span class=&invisible&&p/5910139.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&3、&a href=&https://link.zhihu.com/?target=http%3A//blog.csdn.net/jam00/article/details/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Elasticsearch 中文分词器 IK 配置和使用&/a&: &a href=&https://link.zhihu.com/?target=http%3A//blog.csdn.net/jam00/article/details/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&blog.csdn.net/jam00/art&/span&&span class=&invisible&&icle/details/&/span&&span class=&ellipsis&&&/span&&/a&&/p&&h4&ik 带有两个分词器&/h4&&p&&strong&ik_max_word&/strong&:会将文本做最细粒度的拆分;尽可能多的拆分出词语&/p&&p&&strong&ik_smart&/strong&:会做最粗粒度的拆分;已被分出的词语将不会再次被其它词语占有&/p&&p&区别:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&# ik_max_word
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '联想是全球最大的笔记本厂商'
&tokens& : [
&token& : &联想&,
&start_offset& : 0,
&end_offset& : 2,
&type& : &CN_WORD&,
&position& : 0
&token& : &是&,
&start_offset& : 2,
&end_offset& : 3,
&type& : &CN_CHAR&,
&position& : 1
&token& : &全球&,
&start_offset& : 3,
&end_offset& : 5,
&type& : &CN_WORD&,
&position& : 2
&token& : &最大&,
&start_offset& : 5,
&end_offset& : 7,
&type& : &CN_WORD&,
&position& : 3
&token& : &的&,
&start_offset& : 7,
&end_offset& : 8,
&type& : &CN_CHAR&,
&position& : 4
&token& : &笔记本&,
&start_offset& : 8,
&end_offset& : 11,
&type& : &CN_WORD&,
&position& : 5
&token& : &笔记&,
&start_offset& : 8,
&end_offset& : 10,
&type& : &CN_WORD&,
&position& : 6
&token& : &本厂&,
&start_offset& : 10,
&end_offset& : 12,
&type& : &CN_WORD&,
&position& : 7
&token& : &厂商&,
&start_offset& : 11,
&end_offset& : 13,
&type& : &CN_WORD&,
&position& : 8
# ik_smart
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '联想是全球最大的笔记本厂商'
&tokens& : [
&token& : &联想&,
&start_offset& : 0,
&end_offset& : 2,
&type& : &CN_WORD&,
&position& : 0
&token& : &是&,
&start_offset& : 2,
&end_offset& : 3,
&type& : &CN_CHAR&,
&position& : 1
&token& : &全球&,
&start_offset& : 3,
&end_offset& : 5,
&type& : &CN_WORD&,
&position& : 2
&token& : &最大&,
&start_offset& : 5,
&end_offset& : 7,
&type& : &CN_WORD&,
&position& : 3
&token& : &的&,
&start_offset& : 7,
&end_offset& : 8,
&type& : &CN_CHAR&,
&position& : 4
&token& : &笔记本&,
&start_offset& : 8,
&end_offset& : 11,
&type& : &CN_WORD&,
&position& : 5
&token& : &厂商&,
&start_offset& : 11,
&end_offset& : 13,
&type& : &CN_WORD&,
&position& : 6
&/code&&/pre&&/div&&p&下面我们来创建一个索引,使用 ik
创建一个名叫 iktest 的索引,设置它的分析器用 ik ,分词器用 ik_max_word,并创建一个 article 的类型,里面有一个 subject 的字段,指定其使用 ik_max_word 分词器&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{
    &settings& : {
        &analysis& : {
            &analyzer& : {
                &ik& : {
                    &tokenizer& : &ik_max_word&
                }
            }
        }
    },
    &mappings& : {
        &article& : {
            &dynamic& : true,
            &properties& : {
                &subject& : {
                    &type& : &string&,
                    &analyzer& : &ik_max_word&
                }
            }
        }
    }
}'
&/code&&/pre&&/div&&p&批量添加几条数据,这里我指定元数据 _id 方便查看,subject 内容为我随便找的几条新闻的标题&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&curl -XPOST http://localhost:9200/iktest/article/_bulk?pretty -d '
{ &index& : { &_id& : &1& } }
{&subject& : &"闺蜜"崔顺实被韩检方传唤 韩总统府促彻查真相& }
{ &index& : { &_id& : &2& } }
{&subject& : &韩举行"护国训练" 青瓦台:决不许国家安全出问题& }
{ &index& : { &_id& : &3& } }
{&subject& : &媒体称FBI已经取得搜查令 检视希拉里电邮& }
{ &index& : { &_id& : &4& } }
{&subject& : &村上春树获安徒生奖 演讲中谈及欧洲排外问题& }
{ &index& : { &_id& : &5& } }
{&subject& : &希拉里团队炮轰FBI 参院民主党领袖批其“违法”& }
&/code&&/pre&&/div&&p&查询 “希拉里和韩国”&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&curl -XPOST http://localhost:9200/iktest/article/_search?pretty -d '
{
    &query& : { &match& : { &subject& : &希拉里和韩国& }},
    &highlight& : {
        &pre_tags& : [&&font color='red'&&],
        &post_tags& : [&&/font&&],
        &fields& : {
            &subject& : {}
        }
    }
}'

# 返回结果:
{
&took& : 113,
&timed_out& : false,
&_shards& : {
&total& : 5,
&successful& : 5,
&failed& : 0
&hits& : {
&total& : 4,
&max_score& : 0.,
&hits& : [ {
&_index& : &iktest&,
&_type& : &article&,
&_id& : &2&,
&_score& : 0.,
&_source& : {
&subject& : &韩举行"护国训练" 青瓦台:决不许国家安全出问题&
&highlight& : {
&subject& : [ &&font color=red&韩&/font&举行"护&font color=red&国&/font&训练" 青瓦台:决不许国家安全出问题& ]
&_index& : &iktest&,
&_type& : &article&,
&_id& : &3&,
&_score& : 0.,
&_source& : {
&subject& : &媒体称FBI已经取得搜查令 检视希拉里电邮&
&highlight& : {
&subject& : [ &媒体称FBI已经取得搜查令 检视&font color=red&希拉里&/font&电邮& ]
&_index& : &iktest&,
&_type& : &article&,
&_id& : &5&,
&_score& : 0.,
&_source& : {
&subject& : &希拉里团队炮轰FBI 参院民主党领袖批其“违法”&
&highlight& : {
&subject& : [ &&font color=red&希拉里&/font&团队炮轰FBI 参院民主党领袖批其“违法”& ]
&_index& : &iktest&,
&_type& : &article&,
&_id& : &1&,
&_score& : 0.,
&_source& : {
&subject& : &"闺蜜"崔顺实被韩检方传唤 韩总统府促彻查真相&
&highlight& : {
&subject& : [ &"闺蜜"崔顺实被&font color=red&韩&/font&检方传唤 &font color=red&韩&/font&总统府促彻查真相& ]
&/code&&/pre&&/div&&p&这里用了高亮属性 highlight,直接显示到 html 中,被匹配到的字或词将以红色突出显示。若要用过滤搜索,直接将 match 改为 term 即可&/p&&h4&热词更新配置&/h4&&p&网络词语日新月异,如何让新出的网络热词(或特定的词语)实时的更新到我们的搜索当中呢&/p&&p&先用 ik 测试一下&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '
成龙原名陈港生
&tokens& : [ {
&token& : &成龙&,
&start_offset& : 1,
&end_offset& : 3,
&type& : &CN_WORD&,
&position& : 0
&token& : &原名&,
&start_offset& : 3,
&end_offset& : 5,
&type& : &CN_WORD&,
&position& : 1
&token& : &陈&,
&start_offset& : 5,
&end_offset& : 6,
&type& : &CN_CHAR&,
&position& : 2
&token& : &港&,
&start_offset& : 6,
&end_offset& : 7,
&type& : &CN_WORD&,
&position& : 3
&token& : &生&,
&start_offset& : 7,
&end_offset& : 8,
&type& : &CN_CHAR&,
&position& : 4
&/code&&/pre&&/div&&p&ik 的主词典中没有”陈港生” 这个词,所以被拆分了。
现在我们来配置一下&/p&&p&修改 IK 的配置文件 :ES 目录/plugins/ik/config/ik/IKAnalyzer.cfg.xml&/p&&p&修改如下:&/p&&div class=&highlight&&&pre&&code class=&language-properties&&&span&&/span&&span class=&na&&&?xml version&/span&&span class=&o&&=&/span&&span class=&s&&&1.0& encoding=&UTF-8&?&&/span&
&span class=&na&&&!DOCTYPE properties SYSTEM &http&/span&&span class=&o&&:&/span&&span class=&s&&//java.sun.com/dtd/properties.dtd&&&/span&
&span class=&err&&&properties&&/span&
&span class=&err&&&comment&IK&/span& &span class=&err&&Analyzer&/span& &span class=&err&&扩展配置&/comment&&/span&
&span class=&err&&&!--用户可以在这里配置自己的扩展字典&/span& &span class=&err&&--&&/span&
&span class=&na&&&entry key&/span&&span class=&o&&=&/span&&span class=&s&&&ext_dict&&custom/mydict.dic;custom/single_word_low_freq.dic&/entry&&/span&
&span class=&err&&&!--用户可以在这里配置自己的扩展停止词字典--&&/span&
&span class=&na&&&entry key&/span&&span class=&o&&=&/span&&span class=&s&&&ext_stopwords&&custom/ext_stopword.dic&/entry&&/span&
&span class=&err&&&!--用户可以在这里配置远程扩展字典&/span& &span class=&err&&--&&/span&
&span class=&na&&&entry key&/span&&span class=&o&&=&/span&&span class=&s&&&remote_ext_dict&&http://192.168.1.136/hotWords.php&/entry&&/span&
&span class=&err&&&!--用户可以在这里配置远程扩展停止词字典--&&/span&
&span class=&na&&&!-- &entry key&/span&&span class=&o&&=&/span&&span class=&s&&&remote_ext_stopwords&&words_location&/entry& --&&/span&
&span class=&err&&&/properties&&/span&
&/code&&/pre&&/div&&p&这里我是用的是远程扩展字典,因为可以使用其他程序调用更新,且不用重启 ES,很方便;当然使用自定义的 mydict.dic 字典也是很方便的,一行一个词,自己加就可以了&/p&&p&既然是远程词典,那么就要是一个可访问的链接,可以是一个页面,也可以是一个txt的文档,但要保证输出的内容是 utf-8 的格式&/p&&p&hotWords.php 的内容&/p&&div class=&highlight&&&pre&&code class=&language-php&&&span&&/span&&span class=&x&&$s = &&&'EOF'&/span&
&span class=&x&&陈港生&/span&
&span class=&x&&元楼&/span&
&span class=&x&&蓝瘦&/span&
&span class=&x&&EOF;&/span&
&span class=&x&&header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);&/span&
&span class=&x&&header('ETag: &&');&/span&
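&span class=&x&&// 可选:显式声明 UTF-8 纯文本输出(补充示例)&/span&
&span class=&x&&header('Content-Type: text/plain; charset=utf-8');&/span&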
&span class=&x&&echo $s;&/span&
&/code&&/pre&&/div&&p&ik 接收两个返回的头部属性 Last-Modified 和 ETag,只要其中一个有变化,就会触发更新,ik 会每分钟获取一次
重启 Elasticsearch ,查看启动记录,看到了三个词已被加载进来&/p&&p&再次执行上面的请求,返回, 就可以看到 ik 分词器已经匹配到了 “陈港生” 这个词,同理一些关于我们公司的专有名字(例如:永辉、永辉超市、永辉云创、云创 .... )也可以自己手动添加到字典中去。&/p&&h3&2、结巴中文分词&/h3&&h4&特点:&/h4&&p&1、支持三种分词模式:&/p&&ul&&li&精确模式,试图将句子最精确地切开,适合文本分析;&/li&&li&全模式,把句子中所有的可以成词的词语都扫描出来, 速度非常快,但是不能解决歧义;&/li&&li&搜索引擎模式,在精确模式的基础上,对长词再次切分,提高召回率,适合用于搜索引擎分词。&/li&&/ul&&p&2、支持繁体分词&/p&&p&3、支持自定义词典&/p&&h3&3、THULAC&/h3&&p&THULAC(THU Lexical Analyzer for Chinese)由清华大学自然语言处理与社会人文计算实验室研制推出的一套中文词法分析工具包,具有中文分词和词性标注功能。THULAC具有如下几个特点:&/p&&p&能力强。利用我们集成的目前世界上规模最大的人工分词和词性标注中文语料库(约含5800万字)训练而成,模型标注能力强大。&/p&&p&准确率高。该工具包在标准数据集Chinese Treebank(CTB5)上分词的F1值可达97.3%,词性标注的F1值可达到92.9%,与该数据集上最好方法效果相当。&/p&&p&速度较快。同时进行分词和词性标注速度为300KB/s,每秒可处理约15万字。只进行分词速度可达到1.3MB/s。&/p&&p&中文分词工具thulac4j发布&/p&&p&1、规范化分词词典,并去掉一些无用词;&/p&&p&2、重写DAT(双数组Trie树)的构造算法,生成的DAT size减少了8%左右,从而节省了内存;&/p&&p&3、优化分词算法,提高了分词速率。&/p&&div class=&highlight&&&pre&&code class=&language-xml&&&span&&/span&&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&io.github.yizhiru&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&thulac4j&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&${thulac4j.version}&span class=&nt&&&/version&&/span&
&span class=&nt&&&/dependency&&/span&
&/code&&/pre&&/div&&p&&a href=&https://link.zhihu.com/?target=http%3A//www.cnblogs.com/en-heng/p/6526598.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&cnblogs.com/en-heng/p/6&/span&&span class=&invisible&&526598.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&thulac4j支持两种分词模式:&/p&&p&SegOnly模式,只分词没有词性标注;&/p&&p&SegPos模式,分词兼有词性标注。&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&c1&&// SegOnly mode&/span&
&span class=&n&&String&/span& &span class=&n&&sentence&/span& &span class=&o&&=&/span& &span class=&s&&&滔滔的流水,向着波士顿湾无声逝去&&/span&&span class=&o&&;&/span&
&span class=&n&&SegOnly&/span& &span class=&n&&seg&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&SegOnly&/span&&span class=&o&&(&/span&&span class=&s&&&models/seg_only.bin&&/span&&span class=&o&&);&/span&
&span class=&n&&System&/span&&span class=&o&&.&/span&&span class=&na&&out&/span&&span class=&o&&.&/span&&span class=&na&&println&/span&&span class=&o&&(&/span&&span class=&n&&seg&/span&&span class=&o&&.&/span&&span class=&na&&segment&/span&&span class=&o&&(&/span&&span class=&n&&sentence&/span&&span class=&o&&));&/span&
&span class=&c1&&// [滔滔, 的, 流水, ,, 向着, 波士顿湾, 无声, 逝去]&/span&
&span class=&c1&&// SegPos mode&/span&
&span class=&n&&SegPos&/span& &span class=&n&&pos&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&SegPos&/span&&span class=&o&&(&/span&&span class=&s&&&models/seg_pos.bin&&/span&&span class=&o&&);&/span&
&span class=&n&&System&/span&&span class=&o&&.&/span&&span class=&na&&out&/span&&span class=&o&&.&/span&&span class=&na&&println&/span&&span class=&o&&(&/span&&span class=&n&&pos&/span&&span class=&o&&.&/span&&span class=&na&&segment&/span&&span class=&o&&(&/span&&span class=&n&&sentence&/span&&span class=&o&&));&/span&
&span class=&c1&&//[滔滔/a, 的/u, 流水/n, ,/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]&/span&
&/code&&/pre&&/div&&h3&4、NLPIR&/h3&&p&中科院计算所 NLPIR:&a href=&https://link.zhihu.com/?target=http%3A//ictclas.nlpir.org/nlpir/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&ictclas.nlpir.org/nlpir&/span&&span class=&invisible&&/&/span&&span class=&ellipsis&&&/span&&/a&
(可直接在线分析中文)&/p&&p&下载地址:&a href=&https://link.zhihu.com/?target=https%3A//github.com/NLPIR-team/NLPIR& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/NLPIR-team/N&/span&&span class=&invisible&&LPIR&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&中科院分词系统(NLPIR)JAVA简易教程: &a href=&https://link.zhihu.com/?target=http%3A//www.cnblogs.com/wukongjiuwo/p/4092480.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&cnblogs.com/wukongjiuwo&/span&&span class=&invisible&&/p/4092480.html&/span&&span class=&ellipsis&&&/span&&/a&&/p&&h3&5、ansj分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/NLPchina/ansj_seg& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/NLPchina/ans&/span&&span class=&invisible&&j_seg&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&这是一个基于n-Gram+CRF+HMM的中文分词的java实现.&/p&&p&分词速度达到每秒钟大约200万字左右(mac air下测试),准确率能达到96%以上&/p&&p&目前实现了.中文分词. 中文姓名识别 .&/p&&p&用户自定义词典,关键字提取,自动摘要,关键字标记等功能
可以应用到自然语言处理等方面,适用于对分词效果要求高的各种项目.&/p&&p&maven 引入:&/p&&div class=&highlight&&&pre&&code class=&language-xml&&&span&&/span&&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&org.ansj&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&ansj_seg&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&5.1.1&span class=&nt&&&/version&&/span&
&span class=&nt&&&/dependency&&/span&
&/code&&/pre&&/div&&p&&strong&调用demo&/strong&&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&n&&String&/span& &span class=&n&&str&/span& &span class=&o&&=&/span& &span class=&s&&&欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!&&/span& &span class=&o&&;&/span&
&span class=&n&&System&/span&&span class=&o&&.&/span&&span class=&na&&out&/span&&span class=&o&&.&/span&&span class=&na&&println&/span&&span class=&o&&(&/span&&span class=&n&&ToAnalysis&/span&&span class=&o&&.&/span&&span class=&na&&parse&/span&&span class=&o&&(&/span&&span class=&n&&str&/span&&span class=&o&&));&/span&
&span class=&n&&欢迎&/span&&span class=&o&&/&/span&&span class=&n&&v&/span&&span class=&o&&,&/span&&span class=&n&&使用&/span&&span class=&o&&/&/span&&span class=&n&&v&/span&&span class=&o&&,&/span&&span class=&n&&ansj&/span&&span class=&o&&/&/span&&span class=&n&&en&/span&&span class=&o&&,&/span&&span class=&n&&_&/span&&span class=&o&&,&/span&&span class=&n&&seg&/span&&span class=&o&&/&/span&&span class=&n&&en&/span&&span class=&o&&,,,(,&/span&&span class=&n&&ansj&/span&&span class=&o&&/&/span&&span class=&n&&en&/span&&span class=&o&&,&/span&&span class=&n&&中文&/span&&span class=&o&&/&/span&&span class=&n&&nz&/span&&span class=&o&&,&/span&&span class=&n&&分词&/span&&span class=&o&&/&/span&&span class=&n&&n&/span&&span class=&o&&,),&/span&&span class=&n&&在&/span&&span class=&o&&/&/span&&span class=&n&&p&/span&&span class=&o&&,&/span&&span class=&n&&这里&/span&&span class=&o&&/&/span&&span class=&n&&r&/span&&span class=&o&&,&/span&&span class=&n&&如果&/span&&span class=&o&&/&/span&&span class=&n&&c&/span&&span class=&o&&,&/span&&span class=&n&&你&/span&&span class=&o&&/&/span&&span class=&n&&r&/span&&span class=&o&&,&/span&&span class=&n&&遇到&/span&&span class=&o&&/&/span&&span class=&n&&v&/span&&span class=&o&&,&/span&&span class=&n&&什么&/span&&span class=&o&&/&/span&&span class=&n&&r&/span&&span class=&o&&,&/span&&span class=&n&&问题&/span&&span class=&o&&/&/span&&span class=&n&&n&/span&&span class=&o&&,&/span&&span class=&n&&都&/span&&span class=&o&&/&/span&&span class=&n&&d&/span&&span class=&o&&,&/span&&span class=&n&&可以&/span&&span class=&o&&/&/span&&span class=&n&&v&/span&&span class=&o&&,&/span&&span class=&n&&联系&/span&&span class=&o&&/&/span&&span class=&n&&v&/span&&span class=&o&&,&/span&&span class=&n&&我&/span&&span class=&o&&/&/span&&span class=&n&&r&/span&&span class=&o&&,./&/span&&span class=&n&&m&/span&&span class=&o&&,&/span&&span class=&n&&我&/span&&span class=&o&&/&/span&&span class=&n&&r&/span&&span class=&o&&,&/span&&span class=&n&&一定&/span&&span class=&o&&/&/span&&span class=&n&&d&/span&&span class=&o&&,&/span&&span class=&n&&尽我所能&/span&&span class=&o&&/&/span&&span class=&n&&l&/span&&span class=&o&&,./&/span&&span class=&n&&m&/span&&span class=&o&&,&/span&&span class=&n&&帮助&/span&&span class=&o&&/&/span&&span class=&n&&v&/span&&span class=&o&&,&/span&&span class=&n&&大家&/span&&span class=&o&&/&/span&&span class=&n&&r&/span&&span class=&o&&,./&/span&&span class=&n&&m&/span&&span class=&o&&,&/span&&span class=&n&&ansj&/span&&span class=&o&&/&/span&&span class=&n&&en&/span&&span class=&o&&,&/span&&span class=&n&&_&/span&&span class=&o&&,&/span&&span class=&n&&seg&/span&&span class=&o&&/&/span&&span class=&n&&en&/span&&span class=&o&&,&/span&&span class=&n&&更快&/span&&span class=&o&&/&/span&&span class=&n&&d&/span&&span class=&o&&,,,&/span&&span class=&n&&更&/span&&span class=&o&&/&/span&&span class=&n&&d&/span&&span class=&o&&,&/span&&span class=&n&&准&/span&&span class=&o&&/&/span&&span class=&n&&a&/span&&span class=&o&&,,,&/span&&span class=&n&&更&/span&&span class=&o&&/&/span&&span class=&n&&d&/span&&span class=&o&&,&/span&&span class=&n&&自由&/span&&span class=&o&&/&/span&&span class=&n&&a&/span&&span class=&o&&,!&/span&
&/code&&/pre&&/div&&h3&6、哈工大的LTP&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/HIT-SCIR/ltp& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/HIT-SCIR/ltp&/span&&span class=&invisible&&&/span&&/a&&/p&&p&LTP制定了基于XML的语言处理结果表示,并在此基础上提供了一整套自底向上的丰富而且高效的中文语言处理模块(包括词法、句法、语义等6项中文处理核心技术),以及基于动态链接库(Dynamic Link Library, DLL)的应用程序接口、可视化工具,并且能够以网络服务(Web Service)的形式进行使用。&/p&&p&关于LTP的使用,请参考:
&a href=&https://link.zhihu.com/?target=http%3A//ltp.readthedocs.io/zh_CN/latest/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&ltp.readthedocs.io/zh_C&/span&&span class=&invisible&&N/latest/&/span&&span class=&ellipsis&&&/span&&/a&&/p&&h3&7、庖丁解牛&/h3&&p&下载地址:&a href=&https://link.zhihu.com/?target=http%3A//pan.baidu.com/s/1eQ88SZS& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&pan.baidu.com/s/1eQ88SZ&/span&&span class=&invisible&&S&/span&&span class=&ellipsis&&&/span&&/a&&/p&&p&使用分为如下几步:&/p&&ol&&li&配置dic文件:
修改paoding-analysis.jar中的paoding-dic-home.properties文件,将“#paoding.dic.home=dic”的注释去掉,并配置成自己dic文件的本地存放路径。eg:/home/hadoop/work/paoding-analysis-2.0.4-beta/dic&/li&&li&把Jar包导入到项目中:
将paoding-analysis.jar、commons-logging.jar、lucene-analyzers-2.2.0.jar和lucene-core-2.2.0.jar四个包导入到项目中,这时就可以在代码片段中使用庖丁解牛工具提供的中文分词技术,例如:&/li&&/ol&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&n&&Analyzer&/span& &span class=&n&&analyzer&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&PaodingAnalyzer&/span&&span class=&o&&();&/span& &span class=&c1&&//定义一个解析器&/span&
&span class=&n&&String&/span& &span class=&n&&text&/span& &span class=&o&&=&/span& &span class=&s&&&庖丁系统是个完全基于lucene的中文分词系统,它就是重新建了一个analyzer,叫做PaodingAnalyzer,这个analyzer的核心任务就是生成一个可以切词TokenStream。&&/span&&span class=&o&&;&/span& &span class=&c1&&//待分词的内容&/span&
&span class=&n&&TokenStream&/span& &span class=&n&&tokenStream&/span& &span class=&o&&=&/span& &span class=&n&&analyzer&/span&&span class=&o&&.&/span&&span class=&na&&tokenStream&/span&&span class=&o&&(&/span&&span class=&n&&text&/span&&span class=&o&&,&/span& &span class=&k&&new&/span& &span class=&n&&StringReader&/span&&span class=&o&&(&/span&&span class=&n&&text&/span&&span class=&o&&));&/span& &span class=&c1&&//得到token序列的输出流&/span&
&span class=&k&&try&/span& &span class=&o&&{&/span&
&span class=&n&&Token&/span& &span class=&n&&t&/span&&span class=&o&&;&/span&
&span class=&k&&while&/span& &span class=&o&&((&/span&&span class=&n&&t&/span& &span class=&o&&=&/span& &span class=&n&&tokenStream&/span&&span class=&o&&.&/span&&span class=&na&&next&/span&&span class=&o&&())&/span& &span class=&o&&!=&/span& &span class=&kc&&null&/span&&span class=&o&&)&/span&
&span class=&o&&{&/span&
&span class=&n&&System&/span&&span class=&o&&.&/span&&span class=&na&&out&/span&&span class=&o&&.&/span&&span class=&na&&println&/span&&span class=&o&&(&/span&&span class=&n&&t&/span&&span class=&o&&);&/span& &span class=&c1&&//输出每个token&/span&
&span class=&o&&}&/span&
&span class=&o&&}&/span& &span class=&k&&catch&/span& &span class=&o&&(&/span&&span class=&n&&IOException&/span& &span class=&n&&e&/span&&span class=&o&&)&/span& &span class=&o&&{&/span&
&span class=&n&&e&/span&&span class=&o&&.&/span&&span class=&na&&printStackTrace&/span&&span class=&o&&();&/span&
&span class=&o&&}&/span&
&/code&&/pre&&/div&&h3&8、sogo在线分词&/h3&&p&sogo在线分词采用了基于汉字标注的分词方法,主要使用了线性链链CRF(Linear-chain CRF)模型。词性标注模块主要基于结构化线性模型(Structured Linear Model)&/p&&p&在线使用地址为:
&a href=&https://link.zhihu.com/?target=http%3A//www.sogou.com/labs/webservice/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://www.&/span&&span class=&visible&&sogou.com/labs/webservi&/span&&span class=&invisible&&ce/&/span&&span class=&ellipsis&&&/span&&/a&&/p&&h3&9、word分词&/h3&&p&地址: &a href=&https://link.zhihu.com/?target=https%3A//github.com/ysc/word& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&github.com/ysc/word&/span&&span class=&invisible&&&/span&&/a&&/p&&p&word分词是一个Java实现的分布式的中文分词组件,提供了多种基于词典的分词算法,并利用ngram模型来消除歧义。能准确识别英文、数字,以及日期、时间等数量词,能识别人名、地名、组织机构名等未登录词。能通过自定义配置文件来改变组件行为,能自定义用户词库、自动检测词库变化、支持大规模分布式环境,能灵活指定多种分词算法,能使用refine功能灵活控制分词结果,还能使用词频统计、词性标注、同义标注、反义标注、拼音标注等功能。提供了10种分词算法,还提供了10种文本相似度算法,同时还无缝和Lucene、Solr、ElasticSearch、Luke集成。注意:word1.3需要JDK1.8&/p&&p&maven 中引入依赖:&/p&&div class=&highlight&&&pre&&code class=&language-xml&&&span&&/span&&span class=&nt&&&dependencies&&/span&
&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&org.apdplat&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&word&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&1.3&span class=&nt&&&/version&&/span&
&span class=&nt&&&/dependency&&/span&
&span class=&nt&&&/dependencies&&/span&
&/code&&/pre&&/div&&p&ElasticSearch插件:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&1、打开命令行并切换到elasticsearch的bin目录
cd elasticsearch-2.1.1/bin
2、运行plugin脚本安装word分词插件:
./plugin install http://apdplat.org/word/archive/v1.4.zip
安装的时候注意:
如果提示:
ERROR: failed to download
Failed to install word, reason: failed to download
ERROR: incorrect hash (SHA1)
则重新再次运行命令,如果还是不行,多试两次
如果是elasticsearch1.x系列版本,则使用如下命令:
./plugin -u http://apdplat.org/word/archive/v1.3.1.zip -i word
3、修改文件elasticsearch-2.1.1/config/elasticsearch.yml,新增如下配置:
index.analysis.analyzer.default.type : &word&
index.analysis.tokenizer.default.type : &word&
4、启动ElasticSearch测试效果,在Chrome浏览器中访问:
http://localhost:9200/_analyze?analyzer=word&text=杨尚川是APDPlat应用级产品开发平台的作者
5、自定义配置
修改配置文件elasticsearch-2.1.1/plugins/word/word.local.conf
6、指定分词算法
修改文件elasticsearch-2.1.1/config/elasticsearch.yml,新增如下配置:
index.analysis.analyzer.default.segAlgorithm : &ReverseMinimumMatching&
index.analysis.tokenizer.default.segAlgorithm : &ReverseMinimumMatching&
这里segAlgorithm可指定的值有:
正向最大匹配算法:MaximumMatching
逆向最大匹配算法:ReverseMaximumMatching
正向最小匹配算法:MinimumMatching
逆向最小匹配算法:ReverseMinimumMatching
双向最大匹配算法:BidirectionalMaximumMatching
双向最小匹配算法:BidirectionalMinimumMatching
双向最大最小匹配算法:BidirectionalMaximumMinimumMatching
全切分算法:FullSegmentation
最少词数算法:MinimalWordCount
最大Ngram分值算法:MaxNgramScore
如不指定,默认使用双向最大匹配算法:BidirectionalMaximumMatching
&/code&&/pre&&/div&&h3&10、jcseg分词器&/h3&&p&&a href=&https://link.zhihu.com/?target=https%3A//code.google.com/archive/p/jcseg/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&code.google.com/archive&/span&&span class=&invisible&&/p/jcseg/&/span&&span class=&ellipsis&&&/span&&/a&&/p&&h3&11、stanford分词器&/h3&&p&Stanford大学的一个开源分词工具,目前已支持汉语。&/p&&p&首先,去【1】下载Download Stanford Word Segmenter version 3.5.2,取得里面的 data 文件夹,放在maven project的 src/main/resources 里。&/p&&p&然后,maven依赖添加:&/p&&div class=&highlight&&&pre&&code class=&language-xml&&&span&&/span&&span class=&nt&&&properties&&/span&
&span class=&nt&&&java.version&&/span&1.8&span class=&nt&&&/java.version&&/span&
&span class=&nt&&&project.build.sourceEncoding&&/span&UTF-8&span class=&nt&&&/project.build.sourceEncoding&&/span&
&span class=&nt&&&corenlp.version&&/span&3.6.0&span class=&nt&&&/corenlp.version&&/span&
&span class=&nt&&&/properties&&/span&
&span class=&nt&&&dependencies&&/span&
&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&edu.stanford.nlp&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&stanford-corenlp&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&${corenlp.version}&span class=&nt&&&/version&&/span&
&span class=&nt&&&/dependency&&/span&
&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&edu.stanford.nlp&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&stanford-corenlp&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&${corenlp.version}&span class=&nt&&&/version&&/span&
&span class=&nt&&&classifier&&/span&models&span class=&nt&&&/classifier&&/span&
&span class=&nt&&&/dependency&&/span&
&span class=&nt&&&dependency&&/span&
&span class=&nt&&&groupId&&/span&edu.stanford.nlp&span class=&nt&&&/groupId&&/span&
&span class=&nt&&&artifactId&&/span&stanford-corenlp&span class=&nt&&&/artifactId&&/span&
&span class=&nt&&&version&&/span&${corenlp.version}&span class=&nt&&&/version&&/span&
&span class=&nt&&&classifier&&/span&models-chinese&span class=&nt&&&/classifier&&/span&
&span class=&nt&&&/dependency&&/span&
&span class=&nt&&&/dependencies&&/span&
&/code&&/pre&&/div&&p&测试:&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&kn&&import&/span& &span class=&nn&&java.util.Properties&/span&&span class=&o&&;&/span&
&span class=&kn&&import&/span& &span class=&nn&&edu.stanford.nlp.ie.crf.CRFClassifier&/span&&span class=&o&&;&/span&
&span class=&kd&&public&/span& &span class=&kd&&class&/span& &span class=&nc&&CoreNLPSegment&/span& &span class=&o&&{&/span&
&span class=&kd&&private&/span& &span class=&kd&&static&/span& &span class=&n&&CoreNLPSegment&/span& &span class=&n&&instance&/span&&span class=&o&&;&/span&
&span class=&kd&&private&/span& &span class=&n&&CRFClassifier&/span&
&span class=&n&&classifier&/span&&span class=&o&&;&/span&
&span class=&kd&&private&/span& &span class=&nf&&CoreNLPSegment&/span&&span class=&o&&(){&/span&
&span class=&n&&Properties&/span& &span class=&n&&props&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&Properties&/span&&span class=&o&&();&/span&
&span class=&n&&props&/span&&span class=&o&&.&/span&&span class=&na&&setProperty&/span&&span class=&o&&(&/span&&span class=&s&&&sighanCorporaDict&&/span&&span class=&o&&,&/span& &span class=&s&&&data&&/span&&span class=&o&&);&/span&
&span class=&n&&props&/span&&span class=&o&&.&/span&&span class=&na&&setProperty&/span&&span class=&o&&(&/span&&span class=&s&&&serDictionary&&/span&&span class=&o&&,&/span& &span class=&s&&&data/dict-chris6.ser.gz&&/span&&span class=&o&&);&/span&
&span class=&n&&props&/span&&span class=&o&&.&/span&&span class=&na&&setProperty&/span&&span class=&o&&(&/span&&span class=&s&&&inputEncoding&&/span&&span class=&o&&,&/span& &span class=&s&&&UTF-8&&/span&&span class=&o&&);&/span&
&span class=&n&&props&/span&&span class=&o&&.&/span&&span class=&na&&setProperty&/span&&span class=&o&&(&/span&&span class=&s&&&sighanPostProcessing&&/span&&span class=&o&&,&/span& &span class=&s&&&true&&/span&&span class=&o&&);&/span&
&span class=&n&&classifier&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&CRFClassifier&/span&&span class=&o&&(&/span&&span class=&n&&props&/span&&span class=&o&&);&/span&
&span class=&n&&classifier&/span&&span class=&o&&.&/span&&span class=&na&&loadClassifierNoExceptions&/span&&span class=&o&&(&/span&&span class=&s&&&data/ctb.gz&&/span&&span class=&o&&,&/span& &span class=&n&&props&/span&&span class=&o&&);&/span&
&span class=&n&&classifier&/span&&span class=&o&&.&/span&&span class=&na&&flags&/span&&span class=&o&&.&/span&&span class=&na&&setProperties&/span&&span class=&o&&(&/span&&span class=&n&&props&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&span class=&kd&&public&/span& &span class=&kd&&static&/span& &span class=&n&&CoreNLPSegment&/span& &span class=&nf&&getInstance&/span&&span class=&o&&()&/span& &span class=&o&&{&/span&
&span class=&k&&if&/span& &span class=&o&&(&/span&&span class=&n&&instance&/span& &span class=&o&&==&/span& &span class=&kc&&null&/span&&span class=&o&&)&/span& &span class=&o&&{&/span&
&span class=&n&&instance&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&CoreNLPSegment&/span&&span class=&o&&();&/span&
&span class=&o&&}&/span&
&span class=&k&&return&/span& &span class=&n&&instance&/span&&span class=&o&&;&/span&
&span class=&o&&}&/span&
public String[] doSegment(String data) {
    // 基于 CRFClassifier.segmentString() 的最小分词实现(示例性质,方法名 doSegment 为假设的补全)
    java.util.List<String> words = classifier.segmentString(data);
    return words.toArray(new String[words.size()]);
}
}
&/code&&/pre&&/div&
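&p&补充一个简单的调用示例(示例性质,doSegment 为上面假设补全的方法名):&/p&&div class=&highlight&&&pre&&code class=&language-java&&String[] words = CoreNLPSegment.getInstance().doSegment("联想是全球最大的笔记本厂商");
for (String w : words) {
    System.out.println(w);
}
&/code&&/pre&&/div&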
