Basic Information
Source name: A web spider (also called a "scraper") program written in C#
Source size: 0.57 MB
File format: .rar
Language: C#
Updated: 2015-09-11
Source Introduction
A "spider" is a very useful kind of program on the Internet. Search engines use spiders to collect Web pages into their databases; companies use spiders to monitor competitors' websites and track changes; individual users use spiders to download Web pages for offline reading; developers use spiders to scan their own sites for broken links... Different users put spiders to different uses. So how does a spider actually work?
A spider is a semi-automatic program. Just as a real spider travels across its web, a spider program travels across the web woven from hyperlinks. It is semi-automatic because it always needs an initial link (a starting point), but from then on it runs on its own: it scans the links contained in the starting page, visits the pages those links point to, then parses and follows the links contained in those pages in turn. In theory, a spider will eventually reach every page on the Internet, because almost every page is referenced by at least a few other pages.
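The crawl process described above is essentially a breadth-first traversal over a queue of pending URLs and a set of already-visited ones. The following is a minimal sketch of that idea, not code from the downloaded archive; GetLinks is a hypothetical stand-in for a real HTTP fetch plus HTML parse.

```csharp
using System;
using System.Collections.Generic;

class CrawlSketch
{
    // Hypothetical link extractor: a real spider would download the
    // page here and parse the href attributes out of its HTML.
    static IEnumerable<Uri> GetLinks(Uri page)
    {
        yield break; // stub: returns no links
    }

    static List<Uri> Crawl(Uri seed)
    {
        var visited = new HashSet<Uri>();  // URLs already processed
        var workload = new Queue<Uri>();   // URLs waiting to be processed
        visited.Add(seed);
        workload.Enqueue(seed);

        while (workload.Count > 0)
        {
            Uri current = workload.Dequeue();
            foreach (Uri link in GetLinks(current))
            {
                // HashSet.Add returns false for a URL we have seen before,
                // which is what keeps the traversal from looping forever.
                if (visited.Add(link))
                    workload.Enqueue(link);
            }
        }
        return new List<Uri>(visited);
    }

    static void Main()
    {
        var pages = Crawl(new Uri("http://example.com/"));
        Console.WriteLine(pages.Count);
    }
}
```

The real Spider class below follows the same pattern (m_workload as the queue, m_already as the visited set), with Monitor-based locking added so multiple worker threads can pull from the queue.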
When testing after download, copy the program to another directory first. The program has a couple of known bugs:
1. Closing the window does not fully exit the process. You can kill it in Task Manager, or change the code to call Application.Exit() when the form closes.
2. Copy the program to a different directory before running or debugging it: there is a path-handling bug, and directory names containing special symbols are not recognized (the program throws an internal error).
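A hedged sketch of the fix for bug 1, assuming the demo window is the SpiderForm named in the file list; the handler wiring shown here is an assumption, not code from the archive.

```csharp
using System;
using System.Windows.Forms;

// Sketch only: hook the form's FormClosed event so closing the window
// actually tears down the message loop and ends the process, instead
// of leaving it to be killed from Task Manager.
public class SpiderFormFix : Form
{
    public SpiderFormFix()
    {
        this.FormClosed += delegate { Application.Exit(); };
    }
}
```

An alternative (or complementary) fix is to set IsBackground = true on each worker thread when it is created, since background threads are terminated automatically when the main thread exits.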
using System;
using System.Collections;
using System.Net;
using System.IO;
using System.Threading;

namespace Spider
{
    /// <summary>
    /// The main class for the spider. This spider can be used with the
    /// SpiderForm form that has been provided. The spider is completely
    /// self-contained. If you would like to use the spider with your own
    /// application just remove the references to m_spiderForm from this file.
    ///
    /// The files needed for the spider are:
    ///
    /// Attribute.cs - Used by the HTML parser
    /// AttributeList.cs - Used by the HTML parser
    /// DocumentWorker - Used to "thread" the spider
    /// Done.cs - Allows the spider to know when it is done
    /// Parse.cs - Used by the HTML parser
    /// ParseHTML.cs - The HTML parser
    /// Spider.cs - This file
    /// SpiderForm.cs - Demo of how to use the spider
    ///
    /// This spider is copyright 2003 by Jeff Heaton. However, it is
    /// released under a Limited GNU Public License (LGPL). You may
    /// use it freely in your own programs. For the latest version visit
    /// http://www.jeffheaton.com.
    /// </summary>
    public class Spider
    {
        /// <summary>
        /// The URL's that have already been processed.
        /// </summary>
        private Hashtable m_already;

        /// <summary>
        /// URL's that are waiting to be processed.
        /// </summary>
        private Queue m_workload;

        /// <summary>
        /// The first URL to spider. All other URL's must have the
        /// same hostname as this URL.
        /// </summary>
        private Uri m_base;

        /// <summary>
        /// The directory to save the spider output to.
        /// </summary>
        private string m_outputPath;

        /// <summary>
        /// The form that the spider will report its progress to.
        /// </summary>
        private SpiderForm m_spiderForm;

        /// <summary>
        /// How many URL's has the spider processed.
        /// </summary>
        private int m_urlCount = 0;

        /// <summary>
        /// When did the spider start working.
        /// </summary>
        private long m_startTime = 0;

        /// <summary>
        /// Used to keep track of when the spider might be done.
        /// </summary>
        private Done m_done = new Done();

        /// <summary>
        /// Used to tell the spider to quit.
        /// </summary>
        private bool m_quit;

        /// <summary>
        /// The status for each URL that was processed.
        /// </summary>
        enum Status { STATUS_FAILED, STATUS_SUCCESS, STATUS_QUEUED };

        /// <summary>
        /// The constructor.
        /// </summary>
        public Spider()
        {
            reset();
        }

        /// <summary>
        /// Call to reset from a previous run of the spider.
        /// </summary>
        public void reset()
        {
            m_already = new Hashtable();
            m_workload = new Queue();
            m_quit = false;
        }

        /// <summary>
        /// Add the specified URL to the list of URI's to spider.
        /// This is usually only used by the spider, itself, as
        /// new URL's are found.
        /// </summary>
        /// <param name="uri">The URI to add</param>
        public void addURI(Uri uri)
        {
            Monitor.Enter(this);
            if (!m_already.Contains(uri))
            {
                m_already.Add(uri, Status.STATUS_QUEUED);
                m_workload.Enqueue(uri);
            }
            Monitor.Pulse(this);
            Monitor.Exit(this);
        }

        /// <summary>
        /// The URI that is to be spidered.
        /// </summary>
        public Uri BaseURI
        {
            get { return m_base; }
            set { m_base = value; }
        }

        /// <summary>
        /// The local directory to save the spidered files to.
        /// </summary>
        public string OutputPath
        {
            get { return m_outputPath; }
            set { m_outputPath = value; }
        }

        /// <summary>
        /// The object that the spider reports its results to.
        /// </summary>
        public SpiderForm ReportTo
        {
            get { return m_spiderForm; }
            set { m_spiderForm = value; }
        }

        /// <summary>
        /// Set to true to request the spider to quit.
        /// </summary>
        public bool Quit
        {
            get { return m_quit; }
            set { m_quit = value; }
        }

        /// <summary>
        /// Used to determine if the spider is done;
        /// this object is usually only used internally
        /// by the spider.
        /// </summary>
        public Done SpiderDone
        {
            get { return m_done; }
        }

        /// <summary>
        /// Called by the worker threads to obtain a URL to process.
        /// </summary>
        /// <returns>The next URL to process.</returns>
        public Uri ObtainWork()
        {
            Monitor.Enter(this);
            while (m_workload.Count < 1)
            {
                Monitor.Wait(this);
            }

            Uri next = (Uri)m_workload.Dequeue();
            if (m_spiderForm != null)
            {
                m_spiderForm.SetLastURL(next.ToString());
                m_spiderForm.SetProcessedCount("" + m_urlCount);
                long etime = (System.DateTime.Now.Ticks - m_startTime) / 10000000L;
                long urls = (etime == 0) ? 0 : m_urlCount / etime;
                m_spiderForm.SetElapsedTime(etime / 60 + " minutes (" + urls + " urls/sec)");
            }
            Monitor.Exit(this);
            return next;
        }

        /// <summary>
        /// Start the spider.
        /// </summary>
        /// <param name="baseURI">The base URI to spider</param>
        /// <param name="threads">The number of threads to use</param>
        public void Start(Uri baseURI, int threads)
        {
            // init the spider
            m_quit = false;
            m_base = baseURI;
            addURI(m_base);
            m_startTime = System.DateTime.Now.Ticks;
            m_done.Reset();

            // startup the threads
            for (int i = 1; i < threads; i++)
            {
                DocumentWorker worker = new DocumentWorker(this);
                worker.Number = i;
                worker.start();
            }

            // now wait to be done
            m_done.WaitBegin();
            m_done.WaitDone();
        }
    }
}