Saturday, December 8, 2012

Nutch and Solr



I tried to connect Nutch and Solr, and it was quite troublesome.
Versions: Nutch 1.5.1 and Solr 4.0.

The original tutorial page is here.
http://wiki.apache.org/nutch/NutchTutorial
You can download Nutch 1.5.1 from here.
http://www.apache.org/dyn/closer.cgi/nutch/

You should download the source (src) release and build it. With the binary (bin) release, I found nothing in the bin folder. Does anyone know why?

Unzip it and run ant; you will then see the runtime folder and so on.

Then I wondered why the build does not create bin/nutch. I do not know the reason.

It seems you should use runtime/local/bin/nutch instead.

If the terminal shows the usage help after you type runtime/local/bin/nutch, it is OK.


Next, you have to modify nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

The wiki tutorial says you have to put the property above in nutch-site.xml,
but the file it means is runtime/local/conf/nutch-site.xml.

I was stuck for a long time because I had written it in apache-nutch-1.5.1/conf/nutch-site.xml instead.
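
To avoid the same mistake, a quick check like the one below can show which copy of nutch-site.xml actually carries the setting. This is only a sketch: the helper name has_agent_name is my own, it assumes you run it from the apache-nutch-1.5.1 folder, and it only looks for the <name> element, not for a non-empty value.

```shell
# Hypothetical helper: succeeds if the given file declares http.agent.name.
has_agent_name() {
  grep -q '<name>http.agent.name</name>' "$1" 2>/dev/null
}

# Check both copies, relative to the Nutch source folder (assumed cwd).
for f in conf/nutch-site.xml runtime/local/conf/nutch-site.xml; do
  if has_agent_name "$f"; then
    echo "$f: http.agent.name is set"
  else
    echo "$f: http.agent.name is missing"
  fi
done
```

For the crawl commands in this post, the runtime/local/conf copy is the one that has to report "is set".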

And in urls/seed.txt:

http://nutch.apache.org/

The line above should be written in it.
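
As a small sketch, the seed file can be created from the shell like this. I am assuming you run it in the folder where you will start the crawl; urls is the folder name that the crawl command in this post is given.

```shell
# Create the seed list that the crawl reads its start URLs from.
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt

# Show the result: one URL per line.
cat urls/seed.txt
```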

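conf/regex-urlfilter.txt

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/

The line above should be written in conf/regex-urlfilter.txt, in place of the default +. rule under the "# accept anything else" comment. The line must start with the + character, with no space before it.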
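
If you want to see what this rule lets through before crawling, here is a rough sketch using grep. Nutch applies Java regular expressions and grep -E is only an approximation, but for a simple pattern like this one the two should agree. The helper name url_allowed is my own, and the second and third URLs are only there as test inputs.

```shell
# The filter rule without its leading "+" (the "+" only means "accept").
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical helper: succeeds if the URL matches the pattern.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try a few URLs against the rule.
for url in http://nutch.apache.org/ http://wiki.nutch.apache.org/page http://lucene.apache.org/; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the third is rejected, because the rule allows any subdomain in front of nutch.apache.org but nothing else.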


Then type this:

runtime/local/bin/nutch crawl urls -dir crawl -depth 3 -topN 5

You will see a lot of log output, and a crawl folder is created.
You may not be able to read its contents, because they are binary files.


(Sorry, you might need to install Hadoop. I do not remember, because I was so sleepy....)


Then I typed this:
runtime/local/bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

I got an error.
You should copy conf/schema-solr4.xml from Nutch into conf/ in Solr and rename it to schema.xml.
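
The copy step can be sketched as a small function. The function name is my own, and the backup of Solr's original schema.xml is an extra step I am adding for safety; it is not something the tutorial asks for.

```shell
# Hypothetical helper: install Nutch's Solr 4 schema as Solr's schema.xml.
#   $1 = Nutch folder, $2 = Solr core conf folder
install_nutch_schema() {
  nutch_home=$1
  solr_conf=$2
  # Keep one backup of the schema that shipped with Solr, if there is one.
  if [ -f "$solr_conf/schema.xml" ] && [ ! -e "$solr_conf/schema.xml.orig" ]; then
    cp "$solr_conf/schema.xml" "$solr_conf/schema.xml.orig"
  fi
  cp "$nutch_home/conf/schema-solr4.xml" "$solr_conf/schema.xml"
}
```

With the folder names from this post, the call would look like: install_nutch_schema apache-nutch-1.5.1 apache-solr-4.0.0/example/solr/collection1/conf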

Then, in Solr, after running
java -jar start.jar
you may get an error.

The message below helps to resolve this error.

http://www.searchworkings.org/forum/-/message_boards/view_message/524077#_19_message_524077



2012/12/04 13:50:55 org.apache.solr.core.CoreContainer create
SEVERE: Unable to create core: collection1
org.apache.solr.common.SolrException: Schema Parsing Failed: multiple points
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:571)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
at org.eclipse.jetty.server.Server.doStart(Server.java:263)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
at java.security.AccessController.doPrivileged(Native Method)
at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.eclipse.jetty.start.Main.invokeMain(Main.java:457)
at org.eclipse.jetty.start.Main.start(Main.java:602)
at org.eclipse.jetty.start.Main.main(Main.java:82)
Caused by: java.lang.NumberFormatException: multiple points
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1102)
at java.lang.Float.parseFloat(Float.java:439)
at org.apache.solr.core.Config.getFloat(Config.java:284)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:358)
... 45 more

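You should change the schema version in schema.xml from 1.5.1 to 1.5. The stack trace shows Solr reading it with Float.parseFloat, so a value with two dots fails with "multiple points". Change the first line below to the second.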

<schema name="nutch" version="1.5.1">
<schema name="nutch" version="1.5">
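
If you would rather not edit the file by hand, the same change can be sketched with sed. The function name is my own, and it assumes the <schema> tag is written exactly as in the first line above.

```shell
# Hypothetical helper: rewrite version="1.5.1" to version="1.5" on the <schema> tag.
fix_schema_version() {
  sed 's/<schema name="nutch" version="1\.5\.1">/<schema name="nutch" version="1.5">/' "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}
```

Call it with the path to the copied file, for example: fix_schema_version solr/collection1/conf/schema.xml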

Then try again:
java -jar start.jar

I got an error again.


SEVERE: Unable to create core: collection1
java.lang.RuntimeException: java.io.IOException: Can't find resource 'stopwords_en.txt' in classpath or 'solr/collection1/conf/', cwd=/home/ec2-user/apache-solr-4.0.0/example

Create an empty file with this command (run in the example/solr folder):
touch collection1/conf/stopwords_en.txt
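
stopwords_en.txt may not be the only file the schema refers to. As a rough sketch, assuming the schema names such files in words="..." attributes (which is how Solr's stop word filter is usually configured), the function below lists the ones missing from a conf folder. The function name is my own.

```shell
# Hypothetical helper: list files named in words="..." attributes of
# schema.xml that do not exist in the given conf folder.
missing_word_files() {
  conf_dir=$1
  grep -o ' words="[^"]*"' "$conf_dir/schema.xml" |
    sed 's/^ words="//; s/"$//' |
    sort -u |
    while read -r name; do
      [ -e "$conf_dir/$name" ] || echo "$name"
    done
}
```

Run from the example/solr folder, missing_word_files collection1/conf prints one file name per line, and each of those can then be created with touch as above.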

Then try again:
java -jar start.jar

I got another error.

2012/12/04 14:05:54 org.apache.solr.update.UpdateLog init
SEVERE: Unable to use updateLog: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
org.apache.solr.common.SolrException: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)

Find <fields> ... </fields> in schema.xml
and add the line below inside it.
<field name="_version_" type="long" indexed="true" stored="true"/>
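
The same edit can be sketched as a function, so that running it twice does not add the field twice. The function name is my own, and it assumes the opening <fields> tag sits on its own line and that the schema defines a field type called long.

```shell
# Hypothetical helper: add the _version_ field right after the <fields> line,
# unless the schema already declares it.
add_version_field() {
  if grep -q '<field name="_version_"' "$1"; then
    return 0
  fi
  awk '{ print }
       /<fields>/ { print "  <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>" }' "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}
```

Call it with the path to schema.xml, then start Solr again to check that the updateLog error is gone.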


Then the Nutch crawl failed with this error:
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1375)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1260)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)


This URL helped me a lot.
http://stackoverflow.com/questions/6582934/nutch-no-agents-listed-in-http-agent-name

Try this command again.
runtime/local/bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5


It seems to have succeeded!!!!

You can see my site here.
http://ec2-75-101-215-247.compute-1.amazonaws.com:8983/solr/


Ahhh!! I can get the results of the Nutch crawl in Solr!!!

so happy (^o^)/

.....so sleepy...zzzzz
