
[Search Engine Selection] Solr vs. Elasticsearch: What to Consider When Choosing an Open Source Search Engine

Observations that take cloud, analytics, and cognitive search into account


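Solr vs. Elasticsearch is a frequent topic in our client projects and in the enterprise search community. As traditional enterprise search has evolved into what Gartner calls "Insight Engines", we are revisiting the topic with up-to-date observations that take cloud, analytics, and cognitive search capabilities into account, to help you evaluate Solr and Elasticsearch.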

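When we help clients evaluate open source search engines for their enterprise solutions, the same question usually comes up: "Solr or Elasticsearch, which is better?" Although there may be preconceived notions that one is better than the other, the question becomes far more relevant when it is framed as "Which is better for me?"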

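Many search engine technologies are available, but the most popular open source variants are those that rely on the underlying core functionality of Apache Lucene, which is essentially the part that makes the search engine work. Solr and Elasticsearch are built on top of this search library, and each provides its own implementation and features to form a complete search product. Because both rely on Lucene, their basic search functionality behaves the same way; the differentiation comes from how each is implemented around Lucene.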

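The role of the search engine has shifted from simply finding information efficiently to playing a key part in content analytics, predictive modeling, and integration with cognitive/intelligent search capabilities such as natural language processing (NLP), machine learning (ML), and relevancy scoring. We have explored and implemented these intelligent capabilities in our client work.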
 

Solr vs. Elasticsearch: Which Is Better for My Organization?


It depends.

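There are many use cases for adopting one technology over the other. When asked this question, though, I usually answer with an analogy from an operations management perspective: "Solr is like Linux. You can customize and tailor Solr extensively to your needs, but managing and deploying it takes more effort and more resources than Elasticsearch does. Elasticsearch has a very well-designed user interface (Kibana) for data exploration and analytic visualizations, and it is very easy to deploy, manage, and monitor (with X-Pack), but customizing its functionality is limited and goes through a plugin framework."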

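Elasticsearch may be a good fit if you want to:

  • get your search engine up and running quickly with very little overhead;
  • start exploring your data as soon as possible; and
  • treat analytics and visualization as core components of your use case.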


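Solr may be a good fit if you:

  • need to index and reprocess large amounts of data at scale;
  • have resources available to invest in managing Solr and in the tools used to interact with it; and
  • have an existing enterprise framework that works with Solr, such as other Apache products (for example, Hadoop) or enterprise platforms (for example, Cloudera, Hortonworks, or the Hadoop-based HDInsight).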


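This is not to say that Hadoop platforms cannot work with Elasticsearch (we have proposed this approach to clients), but some platforms, notably Cloudera and Hortonworks, provide additional tools and methods for indexing data within the ecosystem and for managing Solr (in particular the upcoming Cloudera CDH 6 release, which supports Solr 7).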

Observations: Performance, Features, and Use Cases


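In our experience, an assessment can provide tremendous value in helping clients define their strategy and implementation roadmap. During an assessment, we use a search engine comparison matrix with a weighted scoring mechanism, based on the specific client's priorities, to evaluate how well each search engine fits that client's needs and use cases. Based on this analysis, some common features and use cases serve as focal points when making an overall search engine recommendation.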
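A weighted scoring mechanism of the kind described above can be sketched in a few lines of Python. This is only an illustration: the criteria names, the weights, and the 1-to-5 ratings are hypothetical placeholders, not values from an actual client assessment.

```python
from typing import Dict


def weighted_score(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Return the weighted average of per-criterion scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


# Hypothetical client priorities (a higher weight means more important).
weights = {"ease_of_operations": 3, "customization": 2, "analytics": 1}

# Hypothetical 1-5 ratings per engine; real ratings come out of the assessment.
ratings = {
    "Solr": {"ease_of_operations": 2, "customization": 5, "analytics": 3},
    "Elasticsearch": {"ease_of_operations": 5, "customization": 3, "analytics": 5},
}

results = {engine: weighted_score(weights, r) for engine, r in ratings.items()}
for engine, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{engine}: {score:.2f}")
```

With these placeholder numbers Elasticsearch scores higher, but the ranking is driven by the weights: raising the customization weight from 2 to 6 puts Solr ahead (3.90 vs. 3.80), which is why the weights have to reflect the individual client's priorities.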

 

 

The chart below captures some of the observations about Solr and Elasticsearch:

  SOLR ELASTICSEARCH
Use Cases
  Solr:
  • Search for large bulk data sets, for example, healthcare (payer / provider), biopharma research, finance, and government
  • Native unformatted record filter and search, such as e-commerce or customer-facing search
  • Static data set searching
  • Large bulk reprocessing
  Elasticsearch:
  • Log analytics: enterprise log consumption and analysis or a replacement option for commercial off-the-shelf log analytics products
  • Real-time dashboards for operational timeline or sales and marketing insights
  • High-volume data streams with natural language content from social media and IoT streams
  • Native unformatted record filter and search (e-commerce, customer)
Visualization Tools
  Solr:
  • Banana (Kibana port) can provide support up to Solr 6.x
  • Apache Hue (mostly used in Hadoop deployments); emerging functionality with Hue Search App
  Elasticsearch:
  • Robust visualization development framework with Kibana
  • Maintained and version-matched by Elastic
  • Well-integrated with Grafana for analytics and monitoring
Cloud and Big Data
  Solr:
  • Cloud-based deployments rely heavily on management tools like Cloudera and Hortonworks
  • Fully-hosted options are available through third-party vendors
  • As an Apache project, Solr integrates well with other Apache products, especially those supported in Hadoop
  Elasticsearch:
  • Fully-hosted and managed solutions are provided by all the major cloud infrastructure providers (Microsoft Azure, AWS, Google Cloud)
  • Management tools are provided by the cloud hosting provider
  • Elasticsearch Hadoop libraries allow for the integration of Hadoop components with Elasticsearch natively
Cognitive Search Capabilities and Integration
  Solr:
  • Learning to Rank (LTR) module is supported in Solr 6.4 or later
  • As an Apache project, Solr integrates well with OpenNLP (but not as an embedded component) for entity extraction and tagging to feed concept-based search
  Elasticsearch:
  • Includes a Machine Learning component (with X-Pack)
  • Allows for pattern recognition and time series forecasting (ML and Kibana)
  • Learning to Rank (LTR) plugin supports machine-learning-driven relevancy tuning exercises
  • OpenNLP can be utilized in a similar fashion to Solr as an external component supporting cognitive search functions
Management and Operations
  Solr:
  • Overall, more difficult to manage (though Cloudera Manager helps with this in a Hadoop environment)
  • APIs are not available (though Solr 7 supports metrics APIs, requires JMX)
  • Scaling requires manual intervention for shard rebalancing (Solr 7 has an auto-scaling API giving some control over shard allocation and distribution)
  Elasticsearch:
  • Easy to set up and scale
  • Automatic shard rebalancing after node addition
  • APIs provide ease of monitoring and state evaluation
  • X-Pack provides out-of-the-box resource dashboards (requires licensing from Elastic)
Development Architecture
  Solr:
  • Excellent pluggable architecture
  • Plugins can be easily developed and integrated
  • Fully open source with vast community support
  • Tight integration with Lucene development
  Elasticsearch:
  • More restrictive plugin architecture
  • Plugins are not supported in hosted environments
  • Recently became fully open source with Elasticsearch core and X-Pack (X-Pack code has been released as open source, but still requires commercial licensing to implement)
  • Lags slightly in implementing new Lucene features
  • Frequent point releases with feature additions
Cluster State Management
  Solr:
  • ZooKeeper Quorum: minimum 3 nodes required; 5 to 7 recommended depending on the overall size of the cluster
  Elasticsearch:
  • Master Nodes (proprietary solution): minimum 3 nodes required. They can exist as independent nodes or dual-role nodes with data nodes
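The node counts above follow from majority-quorum arithmetic: a group of n voting nodes needs floor(n/2) + 1 of them to agree, so it can lose the remainder and keep operating. A minimal sketch (the function names are illustrative and not part of ZooKeeper or Elasticsearch):

```python
def quorum_size(voting_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if voting_nodes < 1:
        raise ValueError("at least one voting node is required")
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """How many voting nodes can fail while a majority remains."""
    return voting_nodes - quorum_size(voting_nodes)


for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

Three nodes tolerate one failure, five tolerate two, and seven tolerate three. Four nodes still tolerate only one, which is why odd-sized groups are the usual choice.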
Security
  Solr:
  • Implemented in 3 flavors: basic (username/password in ZooKeeper), Hadoop authentication (LDAP), or Kerberos
  • LDAP / Active Directory is not supported directly
  • Custom plugins can be developed
Bulk Indexing Tools
  Solr:
  • Batch API operations
  • Within Cloudera Hadoop: MapReduceIndexerTool (Solr 4.x); Lily HBase batch indexing; and Spark CrunchIndexerTool
  • MapReduceIndexerTool (5.x) from Lucidworks
  Elasticsearch:
  • Bulk API operations only
  • Configuration modifications can be made to speed up initial bulk indexing
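To make the Elasticsearch Bulk API entry concrete: the `_bulk` endpoint takes a newline-delimited JSON body in which each document is preceded by an action line, and the body must end with a newline. The sketch below only builds that body. The index name `articles`, the `id` field, and the sample documents are hypothetical, and the action line assumes a recent Elasticsearch version where no document type is specified.

```python
import json
from typing import Any, Dict, Iterable


def build_bulk_body(index: str, docs: Iterable[Dict[str, Any]], id_field: str = "id") -> str:
    """Build an NDJSON body for the _bulk endpoint: one action line, then one source line per document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Every line, including the last one, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical sample documents.
docs = [{"id": 1, "title": "Solr"}, {"id": 2, "title": "Elasticsearch"}]
body = build_bulk_body("articles", docs)
print(body, end="")
```

The resulting string would be sent as a POST to `/_bulk` with the `Content-Type: application/x-ndjson` header; sending many documents per request is what makes bulk indexing efficient.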
Near Real Time (NRT) Indexing
(not a comprehensive list)
  • Beats framework
  • Logstash
  • Ingest Nodes
  • Kafka Connect Elasticsearch Sink
  • Spark Streaming
  • Apache NiFi/MiNiFi
  • Accenture Aspire for unstructured data processing and enrichment
Analytics
  Solr:
  • Strong facet-based analytics
  • JSON facets added to support more dynamic aggregations with analytic functions
  • Stream Expressions are added in Solr 7 to support a streaming framework for parallel computation and result emissions for downstream processing
  Elasticsearch:
  • Strong analytic capabilities with aggregations
  • Supports analysis on top of aggregations (e.g. moving averages)
  • Provides time-series analysis of continually added data (like logs or social media streams) for trend and efficacy insights
Nested Data Structures
  Solr:
  • Has the notion of parent-child document relationships
  • These exist as separate documents within the index, limiting their aggregation functionality in deeply-nested data structures
  Elasticsearch:
  • Deep nesting is well-supported
  • Fully-structured JSON documents can be directly persisted into Elasticsearch
  • Aggregations can be performed against nested structures easily
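As an illustration of aggregating over nested structures in Elasticsearch, the sketch below builds a mapping with a `nested` field and a `nested` aggregation over it as plain Python dictionaries. The `reviews` field and its sub-fields are hypothetical, and the typeless mapping layout assumes Elasticsearch 7 or later; nothing is sent to a server.

```python
import json

# Hypothetical mapping: each product document carries a list of review objects.
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "reviews": {
                "type": "nested",
                "properties": {
                    "author": {"type": "keyword"},
                    "stars": {"type": "integer"},
                },
            },
        }
    }
}


def nested_metric_agg(path: str, field: str, metric: str = "avg") -> dict:
    """Build a search body that applies a metric aggregation inside a nested path."""
    return {
        "size": 0,  # return only aggregation results, no hits
        "aggs": {
            path: {
                "nested": {"path": path},
                "aggs": {f"{metric}_{field}": {metric: {"field": f"{path}.{field}"}}},
            }
        },
    }


request = nested_metric_agg("reviews", "stars")
print(json.dumps(request, indent=2))
```

The `nested` aggregation switches the aggregation context to the nested objects, so the average is computed per review rather than per product.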
Query Operations
  Solr:
  • Mostly limited to query URI parameters, leading to complex queries (debuggable in Solr Admin)
  • JSON API (Solr 7) introduced to allow for JSON-based query expressions
  • Request handlers can be simply defined in Solr configuration and Java to perform specific and complex tasks related to a given query use case
  Elasticsearch:
  • Full-featured Query DSL for writing and expressing complex queries
  • Limited to only JSON
  • Custom request handlers require the development of a plugin. There is no notion of jar references from a custom endpoint as there is in Solr
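The difference in query style can be seen by expressing the same search both ways. The sketch below builds a Solr select URL from URI parameters and an equivalent Elasticsearch Query DSL body; it does not contact either server. The host, the collection name `articles`, and the field names are hypothetical.

```python
import json
from urllib.parse import urlencode


def solr_select_url(base_url: str, collection: str, params: dict) -> str:
    """Encode query parameters onto Solr's /select request handler."""
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Solr: the query is expressed as URI parameters.
solr_url = solr_select_url(
    "http://localhost:8983",
    "articles",
    {"q": "title:search", "fq": "category:books", "fl": "id,title", "rows": 10},
)

# Elasticsearch: a comparable query expressed as a JSON Query DSL body,
# which would be sent to /articles/_search.
es_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "search"}}],
            "filter": [{"term": {"category": "books"}}],
        }
    },
    "_source": ["id", "title"],
    "size": 10,
}

print(solr_url)
print(json.dumps(es_body, indent=2))
```

The Solr form stays compact for simple searches but grows hard to read as parameters accumulate, whereas the JSON body is more verbose up front and nests cleanly as the query becomes more complex.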
API Interaction
  Solr:
  • SolrJ (Java) is the most well maintained and up-to-date version and is maintained as part of the Apache project
  • Other Apache maintained APIs: Flare, PHP, Python, Perl
  • Other language APIs exist but are community maintained, and often lag in functionality behind SolrJ (most notably the .NET API)
  Elasticsearch:
  • Many APIs are developed and supported directly by Elastic (Java, JavaScript, Groovy, .NET, PHP, Perl, Python, Ruby)
  • Other community APIs exist for Elasticsearch (e.g. C++, Erlang, Go, Haskell, Lua, Perl, R)

Choosing Between Solr and Elasticsearch? Consider These Points


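Deciding which search engine best fits your particular use case and needs should not be based on an "either/or" assumption. The overall importance of a specific feature in Solr may outweigh the operational advantages of Elasticsearch. For example: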

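In one client's case, the overhead associated with a Solr deployment, and with having to use the (at the time) outdated SolrNet client, was offset by Solr's pluggability. Custom encrypting update and request handlers were needed to encrypt index content with rotating data encryption keys, which made Solr necessary over Elasticsearch. The functionality required for the index encryption process could not be implemented effectively in Elasticsearch.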

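Conversely, when search engine options are evaluated for a general search use case, with no big data or analytics considerations, Elasticsearch becomes the more favored option because of its lower maintenance and deployment overhead and the availability of fully hosted and managed environments.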

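In some cases, depending on the factors that matter most to the client, it remains unclear which search engine (including commercial engines) best meets the client's needs, even after the scoring rubric has been applied. In such cases, a "bake-off" can be run with a sample data set to evaluate how each engine performs on a specific set of use cases.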
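One way such a bake-off could be scored is sketched below. It assumes relevance is measured with precision at k over a small set of hand-judged queries, which is a choice made for this illustration and not a method prescribed here; the engine names, queries, document IDs, and judgments are all hypothetical.

```python
from typing import Dict, List, Set


def precision_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k returned documents that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in results[:k] if doc_id in relevant) / k


def mean_precision(run: Dict[str, List[str]], judgments: Dict[str, Set[str]], k: int) -> float:
    """Average precision@k across all judged queries."""
    return sum(precision_at_k(run[q], judgments[q], k) for q in judgments) / len(judgments)


# Hypothetical relevance judgments per query for the sample data set.
judgments = {"q1": {"d1", "d2", "d3"}, "q2": {"d7"}}

# Hypothetical ranked results returned by each candidate engine.
runs = {
    "engine_a": {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8", "d5"]},
    "engine_b": {"q1": ["d9", "d1", "d8"], "q2": ["d4", "d7", "d6"]},
}

scores = {engine: mean_precision(run, judgments, k=3) for engine, run in runs.items()}
for engine, score in scores.items():
    print(f"{engine}: mean precision@3 = {score:.2f}")
```

Relevance is only one dimension; a real bake-off would typically also compare indexing throughput, query latency, and operational effort for the same sample data.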

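In the end, both Solr and Elasticsearch are powerful, flexible, scalable, and highly capable open source search engines. The overall use case and business needs, together with the required functionality, operational considerations, and integration with new cognitive search and analytics capabilities, will determine whether you choose Solr or Elasticsearch.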

 

Original article: https://www.accenture.com/us-en/blogs/search-and-content-analytics-blog/solr-elasticsearch-open-source-search-engines

This article: http://jiagoushi.pro/node/906
