solr


Search Module Rework Plan (Tasks)


0. To improve keyword search performance, rebuild the search module. The work breaks down into:

  • Technology selection: Lucene, Solr, Elasticsearch
  • Installation and deployment: Solr, Windows
  • Basic feature testing: MySQL import, basic search behavior
  • Delta import / full import / index deletion
  • Write the scheduler script: Python
  • Write the application interface: PHP
  • Write the business logic: PHP
  • Iterate

1. Technology Selection

  • Solr
    • Built on Lucene (Java); Lucene is an open-source code library.
  • Elasticsearch
    • Also built on Lucene (Java). The reported difference: Solr searches faster while the index is idle, but worse than Elasticsearch during indexing (hearsay).
    • Teams up with Logstash and Kibana to form the Elastic Stack.
  • The Solr community is reportedly more mature: the keyword "solr" returns roughly 7.78M Google results, versus 5.15M for "elasticsearch".

2. Installation, Basic Feature Testing, Script Writing

  • base
    • http://localhost:8983/solr/
  • install
  • run/commands
    • solr start -p [port]
    • solr stop -all
  • core-related
    • create
      • create a new core via the admin GUI
      • copy from the file system; the following files need changes
        • core_name/core.properties
        • core_name/data/* (clear it)
        • core_name/conf/data-config.xml
        • core_name/conf/managed-schema
    • ready-data-import
      • The following JAR packages must be placed under core_name/lib
        • mysql-connector-java-5.1.43-bin.jar
        • solr-dataimporthandler-6.6.0.jar
        • solr-dataimporthandler-extras-6.6.0.jar
      • Timezone issue
        • core_root/bin/solr.in.sh (Linux)
        • core_root/bin/solr.cmd (Windows)
    • Scheduled delta/full imports
      • Requires a time column, e.g. CURRENT_TIMESTAMP. [See the table schema]
      • Requires a soft-delete column, is_delete
      • The environment is WAMP; a scheduler script was written in Python 2.7.8. In the script's config file model.json, brand is the core_name; adjust brand / url.start to your setup and it runs.
    • Performance
      • 3 cores
      • 500K products, delta import, adding a new field by modifying data-config.xml; from task start until the change took effect: [2017-08-01 14:10:23] - [2017-08-01 14:25:10], about 15 minutes
      • 500K products, full import: about 5 minutes
      • Search latency dropped from around 3s to 100ms~200ms; FAST ESP latency is 700ms~900ms, MySQL around 3s. Keyword search only.
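The scheduled delta import described above can be driven by a small script that hits the core's /dataimport handler and polls its status. A minimal Python sketch (the host, port, and core name collec_incre are assumptions; adjust to your deployment):

```python
# Minimal sketch of a scheduled delta-import driver for Solr's DataImportHandler.
# SOLR_BASE and CORE are assumptions -- adjust to your deployment.
import json
import time
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"
CORE = "collec_incre"  # hypothetical core name


def build_dataimport_url(base, core, command):
    """Build the /dataimport URL for a command (delta-import, full-import, status)."""
    return "%s/%s/dataimport?command=%s&clean=false&commit=true&wt=json" % (
        base, core, command)


def dataimport(command):
    """Call the handler and return the parsed JSON response."""
    with urllib.request.urlopen(build_dataimport_url(SOLR_BASE, CORE, command)) as resp:
        return json.load(resp)


def run_forever(interval_seconds=60):
    """Trigger a delta-import, wait until the handler is idle, then sleep and repeat."""
    while True:
        dataimport("delta-import")
        while dataimport("status").get("status") == "busy":
            time.sleep(1)
        time.sleep(interval_seconds)
```

For a full rebuild, send command=full-import instead (usually with clean=true so stale documents are dropped first).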

3. Appendix

  • Solr could not load MySQL JDBC Driver
  • Code snippets

    Table schema
    CREATE TABLE `NewTable` (
    `log_id` mediumint(8) UNSIGNED NOT NULL AUTO_INCREMENT ,
    `test_time` int(10) UNSIGNED NOT NULL DEFAULT 0 ,
    `ip_address` varchar(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '' ,
    `vote_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
    `is_delete` tinyint(1) UNSIGNED NOT NULL DEFAULT 0 ,
    PRIMARY KEY (`log_id`),
    INDEX `vote_id` (`test_time`) USING BTREE
    )
    ENGINE=InnoDB
    DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
    AUTO_INCREMENT=15
    ROW_FORMAT=COMPACT
    ;

    HTTP response (JSON)
    // a task is currently in progress
    {
      "responseHeader": {
        "status": 0,
        "QTime": 1
      },
      "initArgs": [
        "defaults", [
          "config", "data-config.xml"
        ]
      ],
      "command": "status",
      "status": "busy",
      "importResponse": "A command is still running...",
      "statusMessages": {
        "Time Elapsed": "0:0:0.23",
        "Total Requests made to DataSource": "1",
        "Total Rows Fetched": "0",
        "Total Documents Processed": "0",
        "Total Documents Skipped": "0",
        "Delta Dump started": "2017-07-31 02:25:04",
        "Identifying Delta": "2017-07-31 02:25:04"
      }
    }
    // changes were detected
    {
      "responseHeader": {
        "status": 0,
        "QTime": 6
      },
      "initArgs": [
        "defaults", [
          "config", "data-config.xml"
        ]
      ],
      "command": "delta-import",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {
        "Total Requests made to DataSource": "2",
        "Total Rows Fetched": "2",
        "Total Documents Processed": "1",
        "Total Documents Skipped": "0",
        "Delta Dump started": "2017-07-31 02:25:04",
        "Identifying Delta": "2017-07-31 02:25:04",
        "Deltas Obtained": "2017-07-31 02:25:04",
        "Building documents": "2017-07-31 02:25:04",
        "Total Changed Documents": "1",
        "": "Indexing completed. Added/Updated: 1 documents. Deleted 0 documents.",
        "Committed": "2017-07-31 02:25:04",
        "Time taken": "0:0:0.489"
      }
    }
    // no changes detected
    {
      "responseHeader": {
        "status": 0,
        "QTime": 40
      },
      "initArgs": [
        "defaults", [
          "config", "data-config.xml"
        ]
      ],
      "command": "delta-import",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {}
    }

    HTTP GET data (search params)
    // order by id asc
    _:1501556039804
    indent:on
    q:加油口盖罩
    sort:id asc
    wt:json
    // order by id desc
    _:1501556039804
    indent:on
    q:加油口盖罩
    sort:id desc
    wt:json
    // goods_sn_tmp = XXX
    _:1501556039804
    fq:goods_sn_tmp:A2174700164
    indent:on
    q:加油口盖罩
    wt:json
    // between min TO max, must be `TO`(uppercase)
    _:1501556039804
    fq:market_price:[1.00 TO 10.00]
    indent:on
    q:加油口盖罩
    sort:market_price asc
    wt:json
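The parameter sets above can be assembled into a /select query string programmatically. A minimal Python sketch (q, fq, sort, and wt are the parameters shown above; /select is Solr's default query handler):

```python
# Build the query string for a Solr /select request from the parameters above.
from urllib.parse import urlencode


def build_select_query(q, fq=None, sort=None, wt="json"):
    """Return the encoded query-string portion of a /select request."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))   # filter query: a field match or a range
    if sort:
        params.append(("sort", sort))
    params.append(("wt", wt))
    return urlencode(params)


# Range filter: market_price between 1.00 and 10.00 (TO must be uppercase).
qs = build_select_query("加油口盖罩", fq="market_price:[1.00 TO 10.00]",
                        sort="market_price asc")
```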

    Details
    • dataimporter.last_index_time is read from last_index_time in dataimport.properties. It can be edited by hand (though there is little point in doing so).
    • Both ${dih.delta.id} and ${dataimporter.delta.id} are recognized.
    • Solr parses the SQL in deltaQuery from the XML config, so it must not contain a raw < character (escape it as &lt; if needed).

    data-config.xml example

    <dataConfig>
      <dataSource name="solrDB" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://192.168.1.110:5635/shopcyw?useUnicode=true"
                  user="root" password="pass@word12" batchSize="-1" autoCommit="false"/>
      <document>
        <entity dataSource="solrDB" name="incre" pk="id"
                query="select log_id as id, ip_address from solr_test"
                deltaImportQuery="select log_id as id,ip_address from solr_test where log_id = ${dataimporter.delta.id}"
                deltaQuery="select log_id as id from solr_test where FROM_UNIXTIME(test_time, '%Y-%m-%d %H:%i:%s') > '${dataimporter.last_index_time}'"
                deletedPkQuery="select log_id as id from solr_test where is_delete = 1">
          <field column="id" name="id" />
          <field column="ip_address" name="ip_address" />
        </entity>
      </document>
    </dataConfig>
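The delta flow encoded in those attributes (deltaQuery finds the ids changed since last_index_time, deltaImportQuery fetches the indexed columns per id, deletedPkQuery collects soft-deleted ids) can be sketched in Python against an in-memory stand-in for solr_test. The rows are made-up sample data, and an integer stands in for the real timestamp comparison:

```python
# Simulate DIH's delta flow over an in-memory stand-in for the solr_test table.
SAMPLE_ROWS = [
    {"id": 1, "ip_address": "10.0.0.1", "test_time": 100, "is_delete": 0},
    {"id": 2, "ip_address": "10.0.0.2", "test_time": 200, "is_delete": 0},
    {"id": 3, "ip_address": "10.0.0.3", "test_time": 300, "is_delete": 1},
]


def delta_import(table, last_index_time):
    """Return (documents to add/update, ids to delete) since last_index_time."""
    # deltaQuery: ids of rows modified after the last import
    changed = {r["id"] for r in table if r["test_time"] > last_index_time}
    # deltaImportQuery: fetch only the indexed columns for each changed id
    docs = [{"id": r["id"], "ip_address": r["ip_address"]}
            for r in table if r["id"] in changed]
    # deletedPkQuery: soft-deleted rows are removed from the index
    deleted = [r["id"] for r in table if r["is_delete"] == 1]
    return docs, deleted
```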

    Timezone issue
    Windows: in solr_root/bin/solr.cmd, change `set SOLR_TIMEZONE=UTC` to `set SOLR_TIMEZONE=UTC+8`

    Manually delete records via admin GUI (XML)
    <delete><query>id:3</query></delete>
    <commit/>
    Manually delete records via admin GUI (JSON), unverified
    { "delete": "1" }
    { "delete": ["id1", "id2"] }
    {
      "delete": { "id": "50", "_version_": 12345 }
    }
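The JSON delete forms above are posted to the core's /update handler. A minimal Python sketch (host and core name are assumptions; note that the id/version form wraps both keys in the delete object):

```python
# Post a JSON delete-by-id to Solr's /update handler and commit.
import json
import urllib.request


def build_delete_body(doc_id, version=None):
    """Build the JSON body for a delete-by-id, optionally with optimistic locking."""
    delete = {"id": doc_id}
    if version is not None:
        delete["_version_"] = version
    return json.dumps({"delete": delete})


def delete_by_id(base_url, core, doc_id):
    """Send the delete and return Solr's parsed response."""
    req = urllib.request.Request(
        "%s/%s/update?commit=true" % (base_url, core),
        data=build_delete_body(doc_id).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```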

    A complete request (curl)
    curl "http://192.168.1.161:8983/solr/collec_incre/dataimport?_=1501465864269^&indent=on^&wt=json" -H "Pragma: no-cache" -H "Cookie: dataimport_autorefresh=null; ECS^[visit_times^]=1; session_id_ip=192.168.1.161_c3a85ab5054726b1e1a1192fcd552c12; UM_distinctid=15d30557a49457-0cbf02f98c05d2-1b1b7e58-1fa400-15d30557a4a79; city=1; district=2; CNZZDATA1260948496=1083174609-1499749671-^%^7C1499760989" -H "Origin: http://192.168.1.161:8983" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36" -H "Content-type: application/x-www-form-urlencoded" -H "Accept: application/json, text/plain, */*" -H "Cache-Control: no-cache" -H "Referer: http://192.168.1.161:8983/solr/" -H "Proxy-Connection: keep-alive" --data "command=delta-import^&verbose=false^&clean=false^&commit=true^&optimize=false^&core=collec_incre^&name=dataimport" --compressed &
    curl "http://192.168.1.161:8983/solr/collec_incre/dataimport?_=1501465864269^&command=status^&indent=on^&wt=json" -H "Pragma: no-cache" -H "Cookie: dataimport_autorefresh=null; ECS^[visit_times^]=1; session_id_ip=192.168.1.161_c3a85ab5054726b1e1a1192fcd552c12; UM_distinctid=15d30557a49457-0cbf02f98c05d2-1b1b7e58-1fa400-15d30557a4a79; city=1; district=2; CNZZDATA1260948496=1083174609-1499749671-^%^7C1499760989" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36" -H "Accept: application/json, text/plain, */*" -H "Cache-Control: no-cache" -H "Referer: http://192.168.1.161:8983/solr/" -H "Proxy-Connection: keep-alive" -H "doNotIntercept: true" --compressed
    Response
    {
      "responseHeader": {
        "status": 0,
        "QTime": 6
      },
      "initArgs": [
        "defaults", [
          "config", "data-config.xml"
        ]
      ],
      "command": "delta-import",
      "status": "idle",
      "importResponse": "",
      "statusMessages": {
        "Total Requests made to DataSource": "2",
        "Total Rows Fetched": "2",
        "Total Documents Processed": "1",
        "Total Documents Skipped": "0",
        "Delta Dump started": "2017-07-31 02:25:04",
        "Identifying Delta": "2017-07-31 02:25:04",
        "Deltas Obtained": "2017-07-31 02:25:04",
        "Building documents": "2017-07-31 02:25:04",
        "Total Changed Documents": "1",
        "": "Indexing completed. Added/Updated: 1 documents. Deleted 0 documents.",
        "Committed": "2017-07-31 02:25:04",
        "Time taken": "0:0:0.489"
      }
    }