Programming robots/spiders using TCP

bytecode77(Posted 2008) [#1]
hello :)
i made a little (buggy) program which downloads an html file, searches it for links, and downloads those too. this way it's possible to recursively download a whole server. but since that program is awfully buggy, i thought i might wanna try proper robots/spiders instead.

if i'm not wrong, i simply have to know exactly what to write into an open TCP stream to the server to get the directory listing, right? but since i don't know what that is, i thought i'd ask here, if you don't mind :)
would appreciate any help :)
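
for reference, here's a minimal sketch of the fetch part i mean (host and path are just placeholder examples, using Blitz3D's TCP stream commands):

; minimal sketch: fetch one page over HTTP/1.0 through a raw TCP stream
; host$ and path$ are placeholder examples
host$ = "www.example.com"
path$ = "/index.html"

stream = OpenTCPStream(host$, 80)
If stream = 0 Then RuntimeError "could not connect to " + host$

; HTTP/1.0: one request per connection, the server closes the stream when done
WriteLine stream, "GET " + path$ + " HTTP/1.0"
WriteLine stream, "Host: " + host$
WriteLine stream, ""               ; blank line ends the request headers

; read headers + body until the server closes the connection
html$ = ""
While Not Eof(stream)
    html$ = html$ + ReadLine$(stream) + Chr$(10)
Wend
CloseTCPStream stream

the recursive part then searches html$ for links and repeats this for each of them.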


Paul "Taiphoz"(Posted 2008) [#2]
I can give some advice on this, as I've done something similar. I can't post my own code though, as it can be used for hacking, and that's not a subject liked much on these forums; your code to index a site could also be easily abused, so be careful. You're on the right track, just tweak your code and work out the bugs.

Google's spider does the exact same thing: it fetches the first index or linked page, then parses it for further links, which it then follows.

However, it also reads a file on the server called robots.txt, which lists the paths the server admin does not want Google to index.
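
That part is harmless enough to sketch, though (host is a placeholder example; a real check would also track the User-agent sections):

; sketch: fetch /robots.txt over a Blitz3D TCP stream and list excluded paths
; host$ is a placeholder example
host$ = "www.example.com"
stream = OpenTCPStream(host$, 80)
If stream = 0 Then End

WriteLine stream, "GET /robots.txt HTTP/1.0"
WriteLine stream, "Host: " + host$
WriteLine stream, ""               ; blank line ends the request

While Not Eof(stream)
    line$ = ReadLine$(stream)
    ; lines like "Disallow: /private/" name paths a spider should skip
    ; (this ignores the User-agent sections for brevity)
    If Lower$(Left$(line$, 9)) = "disallow:" Then
        Print "excluded: " + Trim$(Mid$(line$, 10))
    EndIf
Wend
CloseTCPStream stream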


bytecode77(Posted 2008) [#3]
hm, so getting a directory listing isn't possible... thanks anyway :)


nawi(Posted 2008) [#4]
I don't think it's possible to download a file listing; HTTP has no command for one (unless the server happens to serve an auto-generated index page). You can only search the html pages for links. I do remember creating a search engine crawler once, and if I remember correctly, it started from a random page and just followed links from there.
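
The link-searching part can be as simple as scanning for href attributes. A rough sketch (ExtractLinks is just an illustrative name; html$ is assumed to hold a fetched page, like in the sketch in the first post, and this misses single-quoted and unquoted hrefs):

Function ExtractLinks(html$)
    q$ = Chr$(34)                        ; Blitz strings can't embed a quote directly
    low$ = Lower$(html$)
    pos = Instr(low$, "href=" + q$)
    While pos > 0
        start = pos + 6                  ; skip past href="
        fin = Instr(html$, q$, start)    ; find the closing quote
        If fin = 0 Then Return
        Print Mid$(html$, start, fin - start)   ; the link URL
        pos = Instr(low$, "href=" + q$, fin)
    Wend
End Function

In a real crawler you'd push each URL onto a queue (and resolve relative links) instead of printing it.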