Tuesday, December 4, 2012

Integrating Nutch with Solr

I tried hooking Nutch up to Solr, and wow, it was a lot more painful than I expected. So I'm writing it down.
Versions: Nutch 1.5.1, Solr 4.0

The reference I started from is this:
http://wiki.apache.org/nutch/NutchTutorial

Get Nutch from here. I used 1.5.1.
http://www.apache.org/dyn/closer.cgi/nutch/
It's better to grab the src package and build it yourself. Why does the -bin package have nothing in bin? If anyone knows, please tell me.

Unpack it and run ant, and folders such as runtime get created.

I don't really get why, but bin/nutch doesn't get created.
The one to use seems to be the one under runtime:
runtime/local/bin/nutch
If running this prints a bunch of help text, you're OK.

Next, nutch-site.xml needs a small edit.

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

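The tutorial says to add this, but the file it means is
runtime/local/conf/nutch-site.xml.
I had been editing the conf directly under the top-level directory and got stuck on that. (Put it in the wiki! grr)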
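
For reference, here is a minimal sketch of writing that property into the right file from the shell. NUTCH_HOME is a variable I'm assuming for the ant-built source tree, and the <configuration> wrapper is the usual root element of these Hadoop-style config files. Note that it overwrites any existing nutch-site.xml, so merge by hand if yours already has settings. With NUTCH_HOME unset it falls back to a scratch directory, so it can be tried without touching a real install.

```shell
# Assumption: NUTCH_HOME is the unpacked, ant-built Nutch source tree.
# If unset, use a scratch directory so the sketch can be dry-run safely.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"
agent_conf_dir="$NUTCH_HOME/runtime/local/conf"   # NOT the top-level conf/
mkdir -p "$agent_conf_dir"

# Overwrites any existing nutch-site.xml in that directory.
cat > "$agent_conf_dir/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF

# Show which file was written and confirm the property is in it.
echo "wrote $agent_conf_dir/nutch-site.xml"
grep 'http.agent.name' "$agent_conf_dir/nutch-site.xml"
```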

Put the seed URL in urls/seed.txt:

http://nutch.apache.org/

In conf/regex-urlfilter.txt, replace the rule under

# accept anything else

with the following (the + has to be the first character on the line):

+^http://([a-z0-9]*\.)*nutch.apache.org/
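
To get a feel for what that rule lets through, here is a rough sketch that tries a few URLs against the same pattern. It uses grep -E as a stand-in for Nutch's Java regex engine, which is only an approximation, and every test URL except the seed is made up for illustration.

```shell
# The rule from regex-urlfilter.txt, minus the leading "+" (which means "accept").
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Succeeds (exit 0) when the URL matches the pattern.
# grep -E is only a stand-in for Nutch's Java regex engine.
accepts() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Hypothetical sample URLs (only the first is the real seed).
for url in \
  'http://nutch.apache.org/' \
  'http://wiki.nutch.apache.org/page' \
  'https://nutch.apache.org/' \
  'http://lucene.apache.org/' \
  'http://nutchXapache.org/'
do
  if accepts "$url"; then echo "accept  $url"; else echo "reject  $url"; fi
done
```

The https URL and the lucene host are rejected. The last URL is accepted because the dots in nutch.apache.org are not escaped, so each one matches any character. That looseness comes straight from the tutorial's pattern, and it hardly matters for a test crawl.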


runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 5

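Running this prints a lot of log output and creates a crawl folder.
Its contents are probably binary and not human-readable, though. I don't really understand that part, so I'll skip it.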

(You may have needed to install Hadoop somewhere along the way. I was sleepy and don't remember, so if it comes up, do your best to get it installed.)


runtime/local/bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5


That didn't work as-is.
Copy Nutch's conf/schema-solr4.xml into Solr's conf directory and rename it to schema.xml.

Then, on the Solr side, running
java -jar start.jar
also fails with an error.

Oh, and for the errors from here on, this
http://www.searchworkings.org/forum/-/message_boards/view_message/524077#_19_message_524077
mailing-list thread was a big help.


2012/12/04 13:50:55 org.apache.solr.core.CoreContainer create
致命的: Unable to create core: collection1
org.apache.solr.common.SolrException: Schema Parsing Failed: multiple points
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:571)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
at org.eclipse.jetty.server.Server.doStart(Server.java:263)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
at java.security.AccessController.doPrivileged(Native Method)
at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.eclipse.jetty.start.Main.invokeMain(Main.java:457)
at org.eclipse.jetty.start.Main.start(Main.java:602)
at org.eclipse.jetty.start.Main.main(Main.java:82)
Caused by: java.lang.NumberFormatException: multiple points
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1102)
at java.lang.Float.parseFloat(Float.java:439)
at org.apache.solr.core.Config.getFloat(Config.java:284)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:358)
... 45 more


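The "Caused by" part gives it away: Solr reads the version attribute of the <schema> element as a float (Config.getFloat calling Float.parseFloat), so 1.5.1, with its two dots, fails with "multiple points". In schema.xml I changed the version from 1.5.1 to 1.5.

Before:
<schema name="nutch" version="1.5.1">
After:
<schema name="nutch" version="1.5">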
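
If you'd rather not edit by hand, the same change can be sketched with sed. The function name fix_schema_version is made up for this post, and the demo below runs against a two-line stand-in file rather than the real schema. On a real install the target would be the schema.xml copied into Solr's conf directory (per the later error message, solr/collection1/conf/ under the example directory).

```shell
# Rewrite version="1.5.1" to version="1.5" on the <schema> element,
# keeping the original as schema.xml.bak.
# (fix_schema_version is a hypothetical helper name, not part of Solr or Nutch.)
fix_schema_version() {
  version_target="$1"
  cp "$version_target" "$version_target.bak" &&
    sed 's/<schema name="nutch" version="1\.5\.1">/<schema name="nutch" version="1.5">/' \
      "$version_target.bak" > "$version_target"
}

# Dry run on a stand-in file; the real target would be Solr's conf/schema.xml.
version_demo="$(mktemp -d)/schema.xml"
printf '%s\n' '<schema name="nutch" version="1.5.1">' '</schema>' > "$version_demo"
fix_schema_version "$version_demo"
cat "$version_demo"
```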

java -jar start.jar

I ran it again and got another error.


致命的: Unable to create core: collection1
java.lang.RuntimeException: java.io.IOException: Can't find resource 'stopwords_en.txt' in classpath or 'solr/collection1/conf/', cwd=/home/ec2-user/apache-solr-4.0.0/example

I created a completely empty stopwords_en.txt in collection1/conf and started Solr again.
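
As a small sketch, this creates the empty file only when it is missing, so an existing stop-word list is never clobbered. ensure_stopwords is a name I made up, and the demo uses a scratch directory in place of the real collection1/conf. As far as I understand, an empty file just means no English stop words get filtered, which is fine for a first test.

```shell
# Create an empty stopwords_en.txt in the given conf directory,
# leaving any existing file untouched.
# (ensure_stopwords is a hypothetical helper name.)
ensure_stopwords() {
  stopwords_dir="$1"
  [ -e "$stopwords_dir/stopwords_en.txt" ] || : > "$stopwords_dir/stopwords_en.txt"
}

# Dry run in a scratch directory standing in for solr/collection1/conf.
stopwords_demo="$(mktemp -d)"
ensure_stopwords "$stopwords_demo"
ls "$stopwords_demo"
```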

And yet another error.

2012/12/04 14:05:54 org.apache.solr.update.UpdateLog init
致命的: Unable to use updateLog: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
org.apache.solr.common.SolrException: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)

Find the <fields> ... </fields> block in schema.xml and add this inside it:

<field name="_version_" type="long" indexed="true" stored="true"/>
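
The same edit as a rough script: it inserts the field right after the first line containing <fields> and does nothing if the field is already there. add_version_field is a made-up name, the demo file is a tiny stand-in for the real schema, and it assumes the opening <fields> tag is written exactly like that and that a field type named long is defined in the schema.

```shell
# Insert the _version_ field after the first <fields> line,
# keeping a .bak copy. Does nothing if the field already exists.
# (add_version_field is a hypothetical helper name.)
add_version_field() {
  fields_target="$1"
  if grep -q 'name="_version_"' "$fields_target"; then
    return 0
  fi
  cp "$fields_target" "$fields_target.bak" &&
    awk '
      { print }
      /<fields>/ && !added {
        print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
        added = 1
      }
    ' "$fields_target.bak" > "$fields_target"
}

# Dry run on a stand-in schema.
fields_demo="$(mktemp -d)/schema.xml"
printf '%s\n' \
  '<schema name="nutch" version="1.5">' \
  '  <fields>' \
  '    <field name="id" type="string" stored="true" indexed="true"/>' \
  '  </fields>' \
  '</schema>' > "$fields_demo"
add_version_field "$fields_demo"
add_version_field "$fields_demo"   # second call is a no-op
cat "$fields_demo"
```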

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1375)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1260)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

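This helped with that one (it is the http.agent.name setting in runtime/local/conf/nutch-site.xml from earlier):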
http://stackoverflow.com/questions/6582934/nutch-no-agents-listed-in-http-agent-name

So I tried again:
runtime/local/bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

A ton of log output scrolled by and it looks like it's working.

I'm running this on Amazon Web Services, so my Solr is at
http://ec2-75-101-215-247.compute-1.amazonaws.com:8983/solr/

I ran a query and the Nutch crawl results are in Solr!!!!
So happy (^o^)/
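
For anyone wondering what "ran a query" could look like, here is a hedged sketch using curl. It assumes the stock Solr 4.0 example on localhost:8983 (swap in your own host via SOLR_URL), that requests to /solr/select reach the default collection1 core, and that a match-all query with five rows is enough for a sanity check. The exact query I typed isn't recorded here, so treat this as one plausible way to do it.

```shell
# Assumption: Solr is the stock 4.0 example, reachable at the same base URL
# that was passed to "nutch crawl ... -solr".
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Match-all query, first 5 documents, JSON response.
query_url="$SOLR_URL/select?q=*:*&rows=5&wt=json"
echo "GET $query_url"

# Only attempt the request when curl exists, and report instead of
# failing when Solr is not reachable.
if command -v curl >/dev/null 2>&1; then
  curl -s --max-time 5 "$query_url" || echo "Solr not reachable at $SOLR_URL"
else
  echo "curl not found; open the URL in a browser instead"
fi
```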

Ahh, so sleepy.
