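If a task root script is configured, the entire collection flow of that task node is taken over by the configured script.
EXTRACT: the current collection engine [object type: extractor]
DATADB: the currently connected database [object type: dataBase]
RESULT: the current result set object [object type: result]
The current channel node object [channel], referenced as `this` in the scripts below.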
① The following script generates 31 links, from http://xjrb.xjrb.com/xjrb/20141201/index.htm through http://xjrb.xjrb.com/xjrb/20141231/index.htm:

    url u;
    for(i=1;i<=31;i++)
    {
        u.entryid = this.id;    // channel ID
        u.tmplid = 1;           // template ID
        u.urlname = "http://xjrb.xjrb.com/xjrb/201412" + i.Dim(2) + "/index.htm";    // link URL
        u.title = "test";
        RESULT.AddLink(u);      // add to the final result
    }
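To check the expected output outside the crawler, the same list can be sketched in Python. This is only an illustration, not ForeSpider script. It assumes that `i.Dim(2)` zero-pads the day to two digits, which is what the stated 20141201 to 20141231 range implies; the function name `december_2014_links` is hypothetical.

```python
def december_2014_links():
    """Build the 31 daily index URLs for December 2014."""
    base = "http://xjrb.xjrb.com/xjrb/201412"
    # f"{day:02d}" zero-pads the day; this is the assumed effect of i.Dim(2)
    return [f"{base}{day:02d}/index.htm" for day in range(1, 32)]


links = december_2014_links()
print(links[0])
print(links[-1])
```

Comparing the first and last entries against the two URLs named in the description is a quick way to confirm the padding before running the script in a real task.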
② The following script generates links for the ten days counting back from the current date:

    url u;
    time t1;
    for(i=0;i<10;i++)
    {
        u.title = "test";       // link title
        u.entryid = this.id;    // channel ID
        u.tmplid = 1;           // template ID
        pre = t1.Preday(i);     // compute the earlier date
        u.urlname = "http://www.cdrb.com.cn/html/" + pre.year + "-" + pre.month + "/" + pre.day + "/content_2155799.htm";    // link URL
        RESULT.AddLink(u);      // add to the final result
    }
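The date arithmetic can be sketched in Python for comparison. This is an illustration rather than ForeSpider script, and it rests on two assumptions that the text does not confirm: that `Preday(0)` returns the current date, and that the site expects zero-padded month and day values. The function name `recent_day_links` is hypothetical.

```python
from datetime import date, timedelta


def recent_day_links(today, days=10):
    """Return one URL per day, starting at `today` and stepping backwards."""
    links = []
    for offset in range(days):
        d = today - timedelta(days=offset)
        # Zero-padding month and day is an assumption about the site's URL scheme
        links.append(
            f"http://www.cdrb.com.cn/html/{d.year}-{d.month:02d}/{d.day:02d}/content_2155799.htm"
        )
    return links


# A fixed date keeps the example reproducible; pass date.today() in practice
links = recent_day_links(date(2015, 3, 2))
print(links[0])
print(links[2])
```

Using `timedelta` handles month and year boundaries automatically, which is the case worth testing when the window crosses the start of a month.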
③ The following script builds links from keywords:

    url u;
    var keys={"前嗅","爬虫"};
    for(i=0;i<keys.size;i++)
    {
        u.title = "检索";       // link title
        u.entryid = this.id;    // channel ID
        u.tmplid = 1;           // template ID
        u.urlname = "http://www.forenose.com/search?keywords=" + keys[i];    // link URL
        RESULT.AddLink(u);      // add to the final result
    }
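Keywords that contain non-ASCII characters normally need percent-encoding before they go into a query string. Whether the crawler applies that encoding automatically is not stated here, so the Python sketch below only shows what the encoded links would look like. It is an illustration, not ForeSpider script; the function name `search_links` is hypothetical and UTF-8 encoding is an assumption about what the target site expects.

```python
from urllib.parse import quote, unquote


def search_links(keywords):
    """Build one search URL per keyword, percent-encoding each keyword as UTF-8."""
    base = "http://www.forenose.com/search?keywords="
    # safe="" also encodes "/" so a keyword can never alter the URL path
    return [base + quote(keyword, safe="") for keyword in keywords]


links = search_links(["前嗅", "爬虫"])
for link in links:
    print(link)
```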
① The following script finds a table and extracts its data:

    gdoc = EXTRACT.OpenDoc(this,"http://gk.sjtu.edu.cn/index.php/list/fellow/2015-10-30-15-02-59/241-2015-11-18-02-21-01",0);
    if(gdoc)
    {
        dm = gdoc.GetDom();
        record rec;
        if(dm)
        {
            tab = dm.FindName("table");
            if(tab){
                tr = dm.FindName("tr", tab);
                while(tr){
                    name = dm.FindName("td", tr);
                    if(name){    // data found
                        posd = 0;
                        corp = 0;
                        fund = 0;
                        rec.name = dm.GetTextAll(name);    // name
                        posd = name.next;
                        if(posd){corp = posd.next; rec.position = dm.GetTextAll(posd);}
                        if(corp){fund = corp.next; rec.company = dm.GetTextAll(corp);}
                        if(fund){rec.fund = dm.GetTextAll(fund);}
                        RESULT.AddRec(rec,3);
                    }
                    tr = tr.next;
                }
            }
        }
        EXTRACT.CloseDoc(gdoc);
    }
② The following script requests JSON data from the server and stores it in a record:

    gdoc = EXTRACT.OpenDoc(this,"http://www.w3school.com.cn//example/jquery/demo_ajax_json.js",0);
    if(gdoc)
    {
        jScript js;
        record rec;
        data = js.RunJson(gdoc.GetDom().GetSource());
        rec.name = data.firstName;
        rec.family = data.lastName;
        rec.age = data.age;
        schea = EXTRACT.GetSchema("schemaName");    // get the form (schema) ID
        if(schea)
            sId = schea.id;
        else
            sId = 1;
        RESULT.AddRec(rec,sId);
        EXTRACT.CloseDoc(gdoc);
    }
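The field mapping in this script can be sketched in Python to make the expected record shape explicit. This is an illustration, not ForeSpider script. The sample payload is hypothetical (the real response of the demo URL may differ), and the function name `to_record` is also hypothetical; only the three field names `firstName`, `lastName` and `age` come from the script above.

```python
import json

# Hypothetical payload used only for illustration
SAMPLE = '{"firstName": "Ada", "lastName": "Lovelace", "age": 36}'


def to_record(text):
    """Parse a JSON object and map its fields to the record's column names."""
    data = json.loads(text)
    return {
        "name": data["firstName"],
        "family": data["lastName"],
        "age": data["age"],
    }


rec = to_record(SAMPLE)
print(rec)
```

Indexing with `data["firstName"]` raises `KeyError` when a field is missing, which surfaces a changed response format early instead of storing an incomplete record.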