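Search Module Revamp Plan (Tasks)
0. To improve keyword search efficiency, the search module is being rebuilt. The work breaks down into the following tasks:
- Technology selection: Lucene, Solr, Elasticsearch
- Installation and deployment: Solr on Windows
- Basic functionality tests: MySQL import, basic search behavior
- Incremental update / full update / deletion of the index
- Write the scheduler script: Python
- Write the application interface: PHP
- Write the business logic: PHP
- Iterate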
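1. Technology selection
- Solr
- Built on Lucene (Java); Lucene is an open-source library.
- Elasticsearch
- Also built on Lucene (Java). The difference, according to online sources (unverified): Solr searches faster while the index module is idle, but is slower than Elasticsearch while indexing is in progress.
- Bundled with Logstash and Kibana as the Elastic Stack.
- Online sources suggest that Solr's community is somewhat more mature: a Google search for "solr" returns about 7.78 million results, versus about 5.15 million for "elasticsearch".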
2. Installation and deployment, basic functionality tests, script writing
- base
http://localhost:8983/solr/
- install
- ready
- run/command
solr start -p [port]
solr stop -all
- core
- create
- create a new core via the admin GUI
- copy an existing core on the file system; some files then need to be modified
core_name/core.properties
core_name/data/* (clear)
core_name/conf/data-config.xml
core_name/conf/managed-schema
- ready-data-import
- The following JAR files must be available in the core_name/lib directory:
mysql-connector-java-5.1.43-bin.jar
solr-dataimporthandler-6.6.0.jar
solr-dataimporthandler-extras-6.6.0.jar
- Time zone issue
solr_root/bin/solr.in.sh (Linux)
solr_root/bin/solr.cmd (Windows)
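- Scheduled incremental/full updates
- Requires a column that records time, for example one that defaults to CURRENT_TIMESTAMP (see "Table structure" in the appendix).
- Requires a soft-delete column, is_delete.
- The system environment is WAMP, and the script was written in Python 2.7.8. In the script's configuration file model.json, brand is the core_name; adjusting brand and url.start to match the actual setup is enough to run it.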
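- Performance
- 3-core CPU
- 500,000 products, incremental update, with a new field added by modifying data-config.xml. From task start until the change took effect: [2017-08-01 14:10:23] - [2017-08-01 14:25:10], about 15 minutes.
- 500,000 products, full update: 5 minutes.
- Search time dropped from about 3 s to 100-200 ms. For comparison, FASTESP takes 700-900 ms and MySQL about 3 s. These figures cover keyword search only.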
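The scheduler script itself is not reproduced in these notes. As a rough sketch of what one run of such a script might look like, the Python 3 code below reads a model.json-style configuration and triggers a delta-import only when Solr does not report an import in progress. Only the key names brand and url.start come from the text; the JSON layout, the {brand} placeholder, the dataimport URL pattern, and the function names are assumptions made for illustration.

```python
import json

# Hypothetical model.json content. Only the key names `brand` and `url.start`
# are mentioned in the notes; their values and layout here are assumed.
SAMPLE_CONFIG = """
{
  "brand": "collec_incre",
  "url": {
    "start": "http://localhost:8983/solr/{brand}/dataimport"
  }
}
"""


def load_config(text):
    """Parse the config and expand {brand} inside url.start."""
    cfg = json.loads(text)
    cfg["url"]["start"] = cfg["url"]["start"].format(brand=cfg["brand"])
    return cfg


def run_once(cfg, fetch_json):
    """Trigger one delta-import unless an import is already running.

    fetch_json(url) -> dict is injected so the decision logic can be
    exercised without a live Solr instance.
    """
    base = cfg["url"]["start"]
    status = fetch_json(base + "?command=status&wt=json")
    if status.get("status") == "busy":
        return "skipped"
    fetch_json(base + "?command=delta-import&clean=false&commit=true&wt=json")
    return "triggered"


if __name__ == "__main__":
    cfg = load_config(SAMPLE_CONFIG)
    fake_fetch = lambda url: {"status": "idle"}  # stand-in for a real HTTP call
    print(cfg["url"]["start"], run_once(cfg, fake_fetch))
```

Passing the HTTP call in as a function keeps the "skip while busy" rule separate from networking, so it can be checked with canned responses before it is pointed at a real core.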
3. Appendix
- Solr could not load MySQL JDBC Driver
Code snippets
Table structure
    CREATE TABLE `NewTable` (
        `log_id` mediumint(8) UNSIGNED NOT NULL AUTO_INCREMENT,
        `test_time` int(10) UNSIGNED NOT NULL DEFAULT 0,
        `ip_address` varchar(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
        `vote_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        `is_delete` tinyint(1) UNSIGNED NOT NULL DEFAULT 0,
        PRIMARY KEY (`log_id`),
        INDEX `vote_id` (`test_time`) USING BTREE
    )
    ENGINE=InnoDB
    DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
    AUTO_INCREMENT=15
    ROW_FORMAT=COMPACT;
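To make the role of the time column and the soft-delete column concrete, here is a small sketch that uses Python's built-in sqlite3 module instead of MySQL. The table is a reduced stand-in for the one above, named solr_test as in the data-config.xml sample; the sample rows and the function name find_changes are hypothetical. The two queries mirror the idea of "rows changed since the last index run" and "rows flagged as deleted".

```python
import sqlite3


def find_changes(conn, last_index_time):
    """Return (changed_ids, deleted_ids) relative to last_index_time (Unix seconds)."""
    changed = [row[0] for row in conn.execute(
        "SELECT log_id FROM solr_test WHERE test_time > ? ORDER BY log_id",
        (last_index_time,))]
    deleted = [row[0] for row in conn.execute(
        "SELECT log_id FROM solr_test WHERE is_delete = 1 ORDER BY log_id")]
    return changed, deleted


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE solr_test (
        log_id     INTEGER PRIMARY KEY,
        test_time  INTEGER NOT NULL DEFAULT 0,
        ip_address TEXT    NOT NULL DEFAULT '',
        is_delete  INTEGER NOT NULL DEFAULT 0
    )""")
conn.executemany(
    "INSERT INTO solr_test (log_id, test_time, ip_address, is_delete) "
    "VALUES (?, ?, ?, ?)",
    [(1, 100, "10.0.0.1", 0),   # indexed earlier, unchanged since
     (2, 250, "10.0.0.2", 0),   # modified after the last index run
     (3, 260, "10.0.0.3", 1)])  # soft-deleted after the last index run

if __name__ == "__main__":
    print(find_changes(conn, last_index_time=200))
```

In this sketch the soft-delete query has no time filter, so row 3 is reported on every run, not only the first one after it was flagged.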
HTTP response (JSON)

    // An import is already in progress
    {
      "responseHeader": {
        "status": 0,
        "QTime": 1
      },
      "initArgs": [
        "defaults",
        ["config", "data-config.xml"]
      ],
      "command": "status",
      "status": "busy",
      "importResponse": "A command is still running...",
      "statusMessages": {
        "Time Elapsed": "0:0:0.23",
        "Total Requests made to DataSource": "1",
        "Total Rows Fetched": "0",
        "Total Documents Processed": "0",
        "Total Documents Skipped": "0",
        "Delta Dump started": "2017-07-31 02:25:04",
        "Identifying Delta": "2017-07-31 02:25:04"
      }
    }

    // Changes were detected
    {
      "responseHeader": {
        "status": 0,
        "QTime": 6
      },
      "initArgs": [
        "defaults",
        ["config", "data-config.xml"]
      ],
      "command": "delta-import",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {
        "Total Requests made to DataSource": "2",
        "Total Rows Fetched": "2",
        "Total Documents Processed": "1",
        "Total Documents Skipped": "0",
        "Delta Dump started": "2017-07-31 02:25:04",
        "Identifying Delta": "2017-07-31 02:25:04",
        "Deltas Obtained": "2017-07-31 02:25:04",
        "Building documents": "2017-07-31 02:25:04",
        "Total Changed Documents": "1",
        "": "Indexing completed. Added/Updated: 1 documents. Deleted 0 documents.",
        "Committed": "2017-07-31 02:25:04",
        "Time taken": "0:0:0.489"
      }
    }

    // No changes
    {
      "responseHeader": {
        "status": 0,
        "QTime": 40
      },
      "initArgs": [
        "defaults",
        ["config", "data-config.xml"]
      ],
      "command": "delta-import",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {}
    }
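A script that polls this endpoint mainly needs to tell the three cases apart. The sketch below reduces a response to a few fields; the function name summarize_dih_status and the shortened sample strings are illustrative, while the key names are taken from the responses shown above.

```python
import json


def summarize_dih_status(payload):
    """Reduce a dataimport status response to a few fields of interest."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    messages = data.get("statusMessages", {})
    return {
        "busy": data.get("status") == "busy",
        # The counters arrive as strings, so convert them explicitly.
        "processed": int(messages.get("Total Documents Processed", 0)),
        "changed": int(messages.get("Total Changed Documents", 0)),
        "committed": messages.get("Committed"),
    }


# Shortened versions of the three sample responses.
BUSY = '{"status": "busy", "statusMessages": {"Total Documents Processed": "0"}}'
DONE = ('{"status": "idle", "statusMessages": {'
        '"Total Documents Processed": "1", "Total Changed Documents": "1", '
        '"Committed": "2017-07-31 02:25:04"}}')
EMPTY = '{"status": "idle", "statusMessages": {}}'

if __name__ == "__main__":
    for sample in (BUSY, DONE, EMPTY):
        print(summarize_dih_status(sample))
```

An empty statusMessages object together with "idle" is treated here as "nothing changed", matching the third sample.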
HTTP GET data (search params)

    // order by id asc
    _:1501556039804
    indent:on
    q:加油口盖罩
    sort:id asc
    wt:json

    // order by id desc
    _:1501556039804
    indent:on
    q:加油口盖罩
    sort:id desc
    wt:json

    // goods_sn_tmp = XXX
    _:1501556039804
    fq:goods_sn_tmp:A2174700164
    indent:on
    q:加油口盖罩
    wt:json

    // between min TO max; the keyword must be `TO` (uppercase)
    _:1501556039804
    fq:market_price:[1.00 TO 10.00]
    indent:on
    q:加油口盖罩
    sort:market_price asc
    wt:json
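The parameter sets above can also be assembled in code. The following sketch builds a /select URL with the standard library; the host, the core name "brand", and the function name build_select_url are placeholders, not values confirmed by the notes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base, core, q, fq=None, sort=None):
    """Build a /select URL; base and core are placeholders for a real setup."""
    params = [("q", q), ("wt", "json"), ("indent", "on")]
    if fq:
        params.append(("fq", fq))
    if sort:
        params.append(("sort", sort))
    # urlencode percent-encodes the UTF-8 keyword, the brackets and the spaces.
    return "{}/{}/select?{}".format(base.rstrip("/"), core, urlencode(params))


if __name__ == "__main__":
    url = build_select_url(
        "http://localhost:8983/solr", "brand",  # assumed host and core name
        "加油口盖罩",
        fq="market_price:[1.00 TO 10.00]",      # range keyword TO stays uppercase
        sort="market_price asc")
    print(url)
    print(parse_qs(urlsplit(url).query))
```

Letting urlencode handle the escaping avoids hand-encoding the Chinese keyword and the range brackets.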
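Some details
- dataimporter.last_index_time is read from last_index_time in dataimport.properties. It can be edited by hand, though there is little point in doing so.
- Both ${dih.delta.id} and ${dataimporter.delta.id} are recognized.
- The SQL in deltaQuery must not contain a literal < character. The query is stored in an XML attribute of data-config.xml, where a raw < is not well-formed; escaping it as &lt; is the usual XML workaround (not verified here).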
data-config.xml sample

    <dataConfig>
        <dataSource name="solrDB" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://192.168.1.110:5635/shopcyw?useUnicode=true" user="root" password="pass@word12" batchSize="-1" autoCommit="false"/>
        <document>
            <entity dataSource="solrDB" name="incre" pk="id"
                    query="select log_id as id, ip_address from solr_test"
                    deltaImportQuery="select log_id as id,ip_address from solr_test where log_id = ${dataimporter.delta.id}"
                    deltaQuery="select log_id as id from solr_test where FROM_UNIXTIME(test_time, '%Y-%m-%d %H:%i:%s') > '${dataimporter.last_index_time}'"
                    deletedPkQuery="select log_id as id from solr_test where is_delete = 1">
                <field column="id" name="id" />
                <field column="ip_address" name="ip_address" />
            </entity>
        </document>
    </dataConfig>
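In this configuration, deltaQuery first returns the ids of changed rows, and deltaImportQuery is then run for each of those ids with ${dataimporter.delta.id} filled in. The sketch below imitates that placeholder substitution in plain Python to show what the resulting SQL looks like. It is a simplified illustration, not Solr's actual implementation, and render_query and the sample ids are hypothetical.

```python
import re


def render_query(template, variables):
    """Replace ${name} placeholders; unknown names are left untouched."""
    def lookup(match):
        return str(variables.get(match.group(1), match.group(0)))
    return re.sub(r"\$\{([\w.]+)\}", lookup, template)


# Query strings copied from the data-config.xml sample.
DELTA_QUERY = ("select log_id as id from solr_test where "
               "FROM_UNIXTIME(test_time, '%Y-%m-%d %H:%i:%s') > "
               "'${dataimporter.last_index_time}'")
DELTA_IMPORT_QUERY = ("select log_id as id,ip_address from solr_test "
                      "where log_id = ${dataimporter.delta.id}")

if __name__ == "__main__":
    # Step 1: find the changed ids since the last recorded index time.
    print(render_query(
        DELTA_QUERY, {"dataimporter.last_index_time": "2017-07-31 02:25:04"}))
    # Step 2: fetch each changed row (7 and 9 are hypothetical ids).
    for delta_id in (7, 9):
        print(render_query(DELTA_IMPORT_QUERY, {"dataimporter.delta.id": delta_id}))
```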
Time zone issue

    Windows: in solr_root/bin/solr.cmd, change `set SOLR_TIMEZONE=UTC` to `set SOLR_TIMEZONE=UTC+8`
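The setting matters because deltaQuery compares database times with last_index_time as plain strings. Assuming Solr records that value in UTC while the database stores local time in UTC+8, the recorded value lags eight hours behind, so rows changed up to eight hours before the last run would still look newer. The sketch below shows the size of that shift; the function name shift_index_time is hypothetical, and reading the sample timestamp as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def shift_index_time(text, offset_hours=8):
    """Read a 'YYYY-MM-DD HH:MM:SS' string as UTC and express it at a fixed offset."""
    utc_time = datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)
    local_time = utc_time.astimezone(timezone(timedelta(hours=offset_hours)))
    return local_time.strftime(TIME_FORMAT)


if __name__ == "__main__":
    # Timestamp taken from the sample responses, assumed to be UTC.
    print(shift_index_time("2017-07-31 02:25:04"))
```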
Deleting records manually via the admin GUI (XML)

    <delete><query>id:3</query></delete>
    <commit/>

Deleting records manually via the admin GUI (JSON), not verified

    { "delete": "1" }
    { "delete": ["id1", "id2"] }
    { "delete": { "id": 50, "_version_": 12345 } }
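When such payloads are generated by a script, building them with json.dumps avoids malformed JSON. The helper names below are hypothetical, and the payload shapes follow the JSON update format as shown above plus a delete-by-query form mirroring the XML example; like the JSON variant itself, they have not been verified against a live Solr instance here.

```python
import json


def delete_by_id(doc_id, version=None):
    """Delete one document, optionally guarded by an expected _version_."""
    if version is None:
        return json.dumps({"delete": str(doc_id)})
    return json.dumps({"delete": {"id": str(doc_id), "_version_": version}})


def delete_by_ids(doc_ids):
    """Delete several documents by id in one request body."""
    return json.dumps({"delete": [str(doc_id) for doc_id in doc_ids]})


def delete_by_query(query):
    """JSON counterpart of <delete><query>...</query></delete>."""
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    print(delete_by_id(1))
    print(delete_by_ids(["id1", "id2"]))
    print(delete_by_id(50, version=12345))
    print(delete_by_query("id:3"))
```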
A complete request (curl)

    curl "http://192.168.1.161:8983/solr/collec_incre/dataimport?_=1501465864269^&indent=on^&wt=json" -H "Pragma: no-cache" -H "Cookie: dataimport_autorefresh=null; ECS^[visit_times^]=1; session_id_ip=192.168.1.161_c3a85ab5054726b1e1a1192fcd552c12; UM_distinctid=15d30557a49457-0cbf02f98c05d2-1b1b7e58-1fa400-15d30557a4a79; city=1; district=2; CNZZDATA1260948496=1083174609-1499749671-^%^7C1499760989" -H "Origin: http://192.168.1.161:8983" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36" -H "Content-type: application/x-www-form-urlencoded" -H "Accept: application/json, text/plain, */*" -H "Cache-Control: no-cache" -H "Referer: http://192.168.1.161:8983/solr/" -H "Proxy-Connection: keep-alive" --data "command=delta-import^&verbose=false^&clean=false^&commit=true^&optimize=false^&core=collec_incre^&name=dataimport" --compressed &
    curl "http://192.168.1.161:8983/solr/collec_incre/dataimport?_=1501465864269^&command=status^&indent=on^&wt=json" -H "Pragma: no-cache" -H "Cookie: dataimport_autorefresh=null; ECS^[visit_times^]=1; session_id_ip=192.168.1.161_c3a85ab5054726b1e1a1192fcd552c12; UM_distinctid=15d30557a49457-0cbf02f98c05d2-1b1b7e58-1fa400-15d30557a4a79; city=1; district=2; CNZZDATA1260948496=1083174609-1499749671-^%^7C1499760989" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36" -H "Accept: application/json, text/plain, */*" -H "Cache-Control: no-cache" -H "Referer: http://192.168.1.161:8983/solr/" -H "Proxy-Connection: keep-alive" -H "doNotIntercept: true" --compressed

Response
    {
      "responseHeader": {
        "status": 0,
        "QTime": 6
      },
      "initArgs": [
        "defaults",
        ["config", "data-config.xml"]
      ],
      "command": "delta-import",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {
        "Total Requests made to DataSource": "2",
        "Total Rows Fetched": "2",
        "Total Documents Processed": "1",
        "Total Documents Skipped": "0",
        "Delta Dump started": "2017-07-31 02:25:04",
        "Identifying Delta": "2017-07-31 02:25:04",
        "Deltas Obtained": "2017-07-31 02:25:04",
        "Building documents": "2017-07-31 02:25:04",
        "Total Changed Documents": "1",
        "": "Indexing completed. Added/Updated: 1 documents. Deleted 0 documents.",
        "Committed": "2017-07-31 02:25:04",
        "Time taken": "0:0:0.489"
      }
    }
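Most headers in the captured commands come from the browser session (cookies, user agent, referer) and are probably not required by Solr; that is an assumption, not something tested here. The sketch below prepares an equivalent delta-import POST with only the form parameters, using the host and core name from the capture. The function name delta_import_request is hypothetical, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def delta_import_request(base_url, core):
    """Prepare (but do not send) the POST that starts a delta-import."""
    form = urlencode({
        "command": "delta-import",
        "clean": "false",
        "commit": "true",
        "optimize": "false",
        "verbose": "false",
    }).encode("ascii")
    url = "{}/{}/dataimport?wt=json".format(base_url.rstrip("/"), core)
    return Request(
        url, data=form, method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"})


if __name__ == "__main__":
    req = delta_import_request("http://192.168.1.161:8983/solr", "collec_incre")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```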