Code archives/Networking/Multi-threaded Code Archives downloader
This code has been declared by its author to be Public Domain code.
Download source code
| |||||
This program attempts to download all current Code Archives entries; it then sorts them all into category sub-folders and renames them according to the index number and title of each entry. At the time of writing, the results come to nearly 15 MB, so, rather than run the code below, just download this archive (5.3 MB), which contains the BlitzMax source code, the compiled (command-line) binary and all Code Archives up to the time of this post: http://www.hi-toro.com/blitz/codearc/codearc.zip The main program requests each Code Archives entry page in turn and spawns a thread (up to 8 threads at once) to download each actual source file. Because it's parsing 2000+ entry pages, it takes a long time if you start from scratch! If you start with the archive above, and just run it occasionally, it takes a few seconds. NB. Vista/7 users: Make sure you run the program in a non-protected folder, otherwise file system virtualisation will kick-in and you won't see the new files. If in doubt, extract the archive (or run the program) from the C:\ folder. Here's the code anyway, included in the above archive. 28 April 2015: Little bug-fix in page parsing. | |||||
' ' *** Rather than copy and paste this code (it takes about 20 minutes to download the entire Code Archives from scratch!), ' please download the code and complete archives up to 22 July 2012 from the link below, THEN run it occasionally ' to update your collection: ' http://www.hi-toro.com/blitz/codearc/codearc.zip ' ----------------------------------------------------------------------------- ' * IMPORTANT! TURN "THREADED BUILD" ON IN PROGRAM MENU -> BUILD OPTIONS! * ' ----------------------------------------------------------------------------- SuperStrict Const AppName:String = "Blitz Code Archiver" Const OVERWRITE_ALL:Int = False ' Only download new entries. True will download ' and overwrite all entries... ' ----------------------------------------------------------------------------- ' Multi-threaded Code Archives downloader for www.blitzbasic.com... ' ----------------------------------------------------------------------------- ' Sorts code into sub-folders: BlitzBasic/ for .bb files and BlitzMax/ for .bmx; ' also creates category sub-folders within each, eg. Graphics, Networking, etc. ' Stores code with entry number and title. ' Skips entries already downloaded/known missing. ' Code not declared as Public Domain will be flagged with a warning at the top. ' Works with site on 30 Nov 2011. Will fail if site layout changes, so page ' parsing code will need to be updated! ' ----------------------------------------------------------------------------- ' ----------------------------------------------------------------------------- ' How it works... ' ----------------------------------------------------------------------------- ' The main thread connects to blitzbasic.com and holds open the connection for ' as long as the server allows, using the same connection to request each ' Code Archives web page in turn. ' The server eventually says "go away" after a certain amount of time/data, so ' the program detects this, re-connects and then resumes its requests on the ' new connection. ' Each page is parsed to find the URL for the downloadable source. A thread is ' then spawned for each separate code downloaded. ' ----------------------------------------------------------------------------- ' Structures... ' ----------------------------------------------------------------------------- ' Handles connection... Type Session Field www:String Field ip:Int Field socket:TSocket Field stream:TSocketStream Field working_index:Int End Type ' Contains download URL and category. (Passed to DownloadCode thread by ' ParseCodePage function.) Type Download Field url:String Field category:String Field pd:Int End Type ' ----------------------------------------------------------------------------- ' Some constants... ' ----------------------------------------------------------------------------- Const CODE_PD:Int = 1 Const CODE_NOT_PD:Int = 2 Const CODE_NOT_FOUND:Int = 4 Const CONNECTION_DROPPED:Int = 8 Const URL_ERROR:Int = 16 Const BB_DOMAIN:String = "www.blitzbasic.com" Const CODE_ARCS:String = "/codearcs/codearcs.php?code=" ' ----------------------------------------------------------------------------- ' Couple of mutexes... ' ----------------------------------------------------------------------------- Global PRINT_MUTEX:TMutex = CreateMutex () Global DIR_MUTEX:TMutex = CreateMutex () ' ----------------------------------------------------------------------------- ' Thread stuff... ' ----------------------------------------------------------------------------- Const NUM_THREADS:Int = 8 ' Spawn (up to) this many download threads at once... Global Thread:TThread [NUM_THREADS] ' Array of threads Global ThreadTrack:Int = 0 ' Thread counter used later ' ----------------------------------------------------------------------------- ' Multi-threaded Print helper... ' ----------------------------------------------------------------------------- Function TPrint (message:String) LockMutex PRINT_MUTEX Print message UnlockMutex PRINT_MUTEX End Function ' ----------------------------------------------------------------------------- ' GO! ' ----------------------------------------------------------------------------- ' These entries are known to be missing (deleted) at time of writing... Local known_missing:String [] = ["4", "5", "36", "40", "57", "69", "70", "71", "72",.. "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86",.. "87", "88", "89", "90", "91", "122", "154", "186", "210", "234", "240", "256", "269",.. "270", "271", "272", "296", "297", "303", "304", "305", "306", "308", "309", "310",.. "311", "334", "344", "398", "405", "408", "409", "424", "425", "440", "484", "486",.. "496", "516", "522", "523", "524", "535", "536", "543", "572", "609", "623", "636",.. "637", "657", "666", "679", "722", "723", "736", "737", "738", "739", "740", "741",.. "742", "759", "760", "761", "762", "763", "767", "768", "770", "771", "772", "774",.. "775", "777", "780", "800", "834", "842", "885", "886", "887", "899", "900", "925",.. "941", "945", "953", "957", "964", "965", "978", "982", "984", "986", "1005", "1007",.. "1012", "1018", "1019", "1020", "1021", "1032", "1037", "1038", "1041", "1042", "1043",.. "1044", "1045", "1046", "1047", "1048", "1049", "1050", "1051", "1052", "1054", "1055",.. "1056", "1058", "1066", "1074", "1075", "1076", "1077", "1155", "1209", "1210", "1211",.. "1212", "1213", "1218", "1226", "1280", "1291", "1301", "1312", "1313", "1317", "1391",.. "1409", "1418", "1459", "1490", "1506", "1526", "1543", "1549", "1559", "1613", "1614",.. "1615", "1616", "1617", "1618", "1619", "1620", "1622", "1698", "1743", "1756", "1797",.. "1984", "2021", "2049", "2062", "2085", "2149", "2158", "2159", "2160", "2229", "2256",.. "2317", "2319", "2326", "2424", "2438", "2445", "2694", "2726", "2729", "2793", "2827",.. "2898", "2899", "2902", "2905", "3028", "3084", "3098", "3135", "3139", "3174"] ' List to be filled with missing entries and entries already downloaded... Global SkipList:TList = ListFromArray (known_missing) Local highest:Int = FindLatestEntry () TPrint "" TPrint "Requesting entries from #1 to #" + highest + ", skipping missing/already downloaded entries..." ' Check both sub-folders for downloaded entries... ParseDir "BlitzBasic" ParseDir "BlitzMax" Local downloader:Session = CreateSession (BB_DOMAIN) If downloader Local index:Int = 1 ' Code Archives index Local result:Int ' Result returned by RequestCode Local dropped:Int ' Flag for dropped connections For index = index To highest Local ignored:Int = False If Not OVERWRITE_ALL ' Check if already downloaded... For Local entry:String = EachIn SkipList If entry = index ignored = True ListRemove SkipList, entry Exit EndIf Next EndIf If Not ignored TPrint "" TPrint "Requesting Code Archives entry #" + index + "..." ' RequestCode does NOT guarantee that an individual request completed ' its download, just that the request was for a valid entry and a ' download thread was spawned... result = RequestCode (downloader, index) ' Note if connection was dropped by the server during download request... If result & CONNECTION_DROPPED dropped = True result = result & ~CONNECTION_DROPPED Else dropped = False EndIf Select result Case CODE_PD 'TPrint "Got code entry #" + index Case CODE_NOT_PD 'TPrint "Code entry #" + index + " is not PD" Case CODE_NOT_FOUND TPrint "" TPrint "~t~tNo code entry #" + index Case URL_ERROR TPrint "" TPrint "~t~tURL doesn't exist for #" + index ' Shouldn't happen! End Select ' Reconnect if session was dropped... ' Doesn't need to re-request: the bb.com server drops the connection ' after a certain amount of time/data but only after completing the ' file it was on. (Random accidental drops NOT tested/accounted for!) If dropped TPrint "" TPrint "Connection dropped" TPrint "" CloseSession downloader downloader = CreateSession ("www.blitzbasic.com") EndIf EndIf Next ' Wait for remaining download threads to finish... For ThreadTrack = 0 Until NUM_THREADS If Thread [ThreadTrack] TPrint "" TPrint "Waiting for thread #" + ThreadTrack + " to finish" WaitThread (Thread [ThreadTrack]) EndIf Next CloseSession downloader Else TPrint "Session failed" EndIf Print "" Input "Press Return/Enter to exit..." ' ----------------------------------------------------------------------------- ' The workers... ' ----------------------------------------------------------------------------- ' Find number of latest entry by parsing Code Archives pages... Function FindLatestEntry:Int () Function GetLatest:Int (url:String) TPrint "~tChecking " + "http://" + BB_DOMAIN + url Local readpage:TStream = ReadStream ("http::" + BB_DOMAIN + url) Local coderef:String = "<a href=codearcs.php?code=" Local latest:Int If readpage Repeat Local line:String = ReadLine (readpage) Local pos:Int = Instr (line, coderef) Local endref:Int = Instr (line, ">", pos + Len (coderef)) latest = Int (Mid (line, pos + Len (coderef), endref - (pos + Len (coderef)))) Until Eof (readpage) CloseStream readpage EndIf Return latest End Function TPrint "" TPrint "Finding latest Code Archives entry from category pages..." TPrint "" Local response:String Local length:Int Local caturl:String = "/codearcs/codearcs.php" Local catlink:String = "<a href=~qcodearcs.php?cat=" Local highest:Int Local readcats:TStream = ReadStream ("http::" + BB_DOMAIN + caturl) If readcats Repeat Local line:String = ReadLine (readcats) Local pos:Int = 1 Repeat pos = Instr (line, catlink, pos + 1) If pos Local endquote:Int = Instr (line, "~q", pos + Len (catlink)) Local url:String = Mid (line, pos + 21, (endquote - pos) - 21) Local hi:Int = GetLatest (caturl + url) If hi > highest Then highest = hi EndIf Until pos = 0 Until Eof (readcats) CloseStream readcats EndIf Return highest End Function ' Adds already downloaded entries to a list... Function MarkDownloaded (file:String) If Lower (ExtractExt (file)) = "bb" Or Lower (ExtractExt (file)) = "bmx" ' Find entry number from file name (eg. "24 - Title.bb")... file = StripAll (file) Local pos:Int = Instr (StripAll (file), " - ") Local entry:String = Left (file, pos - 1) If entry Then ListAddLast SkipList, entry EndIf End Function ' Folder parser... Function ParseDir (dir:String) If Right (dir, 1) <> "\" And Right (dir, 1) <> "/" dir = dir + "/" EndIf Local folder:Int = ReadDir (dir) If folder Repeat Local entry:String = NextFile (folder) If entry = "" Then Exit If entry <> "." And entry <> ".." If FileType (dir + entry) = FILETYPE_FILE Local file:String = entry Local full:String = dir If Right (full, 1) <> "\" And Right (full, 1) <> "/" full = full + "\" EndIf full = full + file MarkDownloaded (full) Else If FileType (dir + entry) = FILETYPE_DIR Local file:String = entry If file <> "." And file <> ".." Local ffolder:String = dir If Right (ffolder, 1) <> "\" And Right (ffolder, 1) <> "/" ffolder = ffolder + "\" EndIf ffolder = ffolder + file ParseDir (ffolder) EndIf EndIf EndIf EndIf Forever EndIf End Function ' Turned out to be pointless... Function CodeArcURL:String (index:Int) Return CODE_ARCS + index End Function ' Check still connected before sending request to server... Function SafeWriteLine (s:Session, message:String) If s If SocketConnected (s.socket) WriteLine s.stream, message EndIf EndIf End Function ' Check still connected before reading line from server... Function SafeReadLine:String (s:Session) If s If SocketConnected (s.socket) Return ReadLine (s.stream) EndIf EndIf End Function ' Check still connected before reading string from server... Function SafeReadString:String (s:Session, bytes:Int) If s If SocketConnected (s.socket) Return ReadString (s.stream, bytes) EndIf EndIf End Function ' Extract "404", "200", etc from server response... Function ReplyCode:String (message:String) If Left (message, 5) = "HTTP/" Local pos:Int = Instr (message, " ") Return Mid (message, pos + 1, 3) EndIf End Function ' Establish connection to server... Function CreateSession:Session (domain:String) TPrint "" TPrint "Creating new TCP session..." TPrint "" Local s:Session = New Session s.www = domain s.ip = HostIp (s.www) s.socket = CreateTCPSocket () If s.socket If ConnectSocket (s.socket, s.ip, 80) s.stream = CreateSocketStream (s.socket) If s.stream = Null TPrint "Couldn't create socket stream" EndIf Else TPrint "Couldn't connect to server " + HostName (s.ip) EndIf Else TPrint "Couldn't create TCP socket" EndIf If s.socket = Null Or s.stream = Null Then s = Null Return s End Function ' Close down connection to server... Function CloseSession (s:Session) If s TPrint "" TPrint "Closing session..." If s.stream Then CloseStream s.stream If s.socket Then CloseSocket s.socket s = Null EndIf End Function ' Wraps up page request into helper function... Function RequestCode:Int (s:Session, index:Int) If s s.working_index = index If SocketConnected (s.socket) Return GetPage (s:Session, CodeArcURL (s.working_index)) Else Return CONNECTION_DROPPED EndIf EndIf End Function ' Request Code Archives entry page... Function GetPage:Int (s:Session, path:String) If s SafeWriteLine s, "GET " + path + " HTTP/1.1" ' State file name and HTTP protocol SafeWriteLine s, "Host: " + s.www ' Not sure what the point of this is -- we're connected! SafeWriteLine s, "User-Agent: " + AppName ' Name of this program -- irrelevant SafeWriteLine s, "Accept: text/plain" ' Code Archives are plain text SafeWriteLine s, "" ' Tell server to go ahead... Local response:String Local found:Int Local chunked:Int = False Repeat response = SafeReadLine (s) If Lower (Left (response, 26)) = "transfer-encoding: chunked" Then chunked = True 'TPrint "HEADER: " + response If Left (response, 5) = "HTTP/" Select ReplyCode (response) Case "200" found = True Case "404" found = False Default found = False End Select EndIf Until response = "" If found If Not chunked Then CloseSession s; RuntimeError " Server has changed; not using chunked encoding. New download function needed!" Return ParseCodePage (s) Else Local error:Int = 0 Repeat response = SafeReadLine (s) TPrint "FLUSH: " + response Until response = "" error = error | URL_ERROR If Not SocketConnected (s.socket) error = error | CONNECTION_DROPPED EndIf Return error EndIf EndIf End Function ' Parses entry's HTML page to find URL to code, Public Domain status, etc, then ' spawns download thread if valid... Function ParseCodePage:Int (s:Session, debug:Int = False) ' Read HTML page and get the bits we need... Local response:String Local length:Int Local pd:Int = False Local url:String Local exists:Int = True Local category:String Local codestart:String = "<tr ><td class=~qcell~q><pre class=~qcode~q>" Local codeend:String = "</pre></td></tr>" Local incode:Int = False Repeat ' Length of data in bytes; hex string possibly followed ' by semi-colon and undefined data... response = SafeReadLine (s) If Instr (response, ";") response = Left (response, Instr (response, ";") - 1) EndIf length = Int ("$" + response) If length ' Read data... ' Can't use ReadLine as content can contain CR/LF... response = SafeReadString (s, length) SafeReadLine s ' Clear the CR/LF following content... Local line:String [] = response.Split ("<br>") For Local loop:Int = 0 Until line.length If Left (line [loop], Len (codestart)) = codestart And Right (line [loop], Len (codeend)) = codeend incode = True Else incode = False EndIf Next If Not incode ' Only interested in first codebox. Need to check what happens if there's a codebox in the description! ' Store category for later use... and figure out HOW to use it... Local seek:String = "<a href=codearcs.php>Code archives</a>/<a href=codearcs.php?cat=" Local catref:Int = Instr (response, seek) If catref ' Find next > and < symbols... Local startcat:Int = Instr (response, ">", catref + Len (seek)) Local endcat:Int = Instr (response, "<", startcat) - 1 ' Category name lies in-between... category = response [startcat..endcat] EndIf ' Non-existent entries generate custom page, NOT a 404... If Instr (response, "ERROR: Internal error") exists = False EndIf ' PD notice found... If Instr (response, "This code has been declared by its author to be Public Domain code.") pd = True EndIf ' Find URL from this line... If Instr (response, ">Download source code<") Local q1:Int = Instr (response, "~q") Local q2:Int = Instr (response, "~q", q1 + 1) url = Mid (response, q1 + 1, (q2 - 1) - q1) EndIf EndIf Else Repeat response = SafeReadLine (s) Until response = "" Exit EndIf Until response = "" ' The server disconnects the TCP session after a certain time or number of bytes, ' but only on completing the current file. This means we should return success but ' also that the server closed the session so we can re-establish it... Local result:Int = 0 If Eof (s.stream) If Not SocketConnected (s.socket) result = result | CONNECTION_DROPPED EndIf EndIf If exists Local dl:Download = New Download dl.url = url dl.category = category If pd result = result | CODE_PD dl.pd = CODE_PD Else result = result | CODE_NOT_PD dl.pd = CODE_NOT_PD EndIf ' --------------------------------------------------------------------- ' Dumb thread-spawner... ' --------------------------------------------------------------------- ' Note: dependent on the fact that this function is NOT multi-threaded! If Thread [ThreadTrack] 'TPrint "Waiting for thread #" + ThreadTrack + " to finish" WaitThread (Thread [ThreadTrack]) EndIf 'TPrint "Spawning thread # " + ThreadTrack Thread [ThreadTrack] = CreateThread (DownloadCode, dl) ThreadTrack = ThreadTrack + 1 If ThreadTrack = NUM_THREADS Then ThreadTrack = 0 ' --------------------------------------------------------------------- Else result = result | CODE_NOT_FOUND EndIf Return result End Function ' Code download child thread retrieves the standalone .bb or .bmx code... Function DownloadCode:Object (obj:Object) Local dl:Download = Download (obj) Local category:String = dl.category If category = "" Then category = "Unknown Category" Local file:String = "/codearcs/" + dl.url Local url:String = "http::" + BB_DOMAIN + file Local ip:Int = HostIp (BB_DOMAIN) Local socket:TSocket = CreateTCPSocket () If socket If ConnectSocket (socket, ip, 80) Local stream:TStream = CreateSocketStream (socket) If stream Local blitzcode:String = ExtractExt (url) Local blitzfolder:String Local comment_symbol:String Select blitzcode Case "bb" blitzfolder = "BlitzBasic" comment_symbol = ";" Case "bmx" blitzfolder = "BlitzMax" comment_symbol = "'" Default ' TPrint "Unexpected file type ~q" + blitzcode + "~q! " + dl.url + " (" + dl.category + ")" ' End RuntimeError "Unexpected file type ~q" + blitzcode + "~q! See DownloadCode Function..." End Select blitzfolder = blitzfolder + "/" + category ' Only have one thread creating folder... LockMutex DIR_MUTEX If FileType (blitzfolder) = 0 Then CreateDir blitzfolder, True UnlockMutex DIR_MUTEX Local saveas:String = blitzfolder + "/" + StripDir (url) Local save:TStream = WriteStream (saveas) If save WriteLine stream, "GET " + file + " HTTP/1.1" WriteLine stream, "Host: " + BB_DOMAIN WriteLine stream, "User-Agent: " + AppName WriteLine stream, "Accept: text/plain" WriteLine stream, "" Repeat ' Skip headers... Until ReadLine (stream) = "" Local linecount:Int = 0 TPrint "" If dl.pd = CODE_NOT_PD WriteLine save, "" WriteLine save, comment_symbol + " " + Replace (LSet (" ", 78), " ", "-") WriteLine save, comment_symbol + " WARNING: This code has NOT been declared as Public Domain!" WriteLine save, comment_symbol + " " + Replace (LSet (" ", 78), " ", "-") WriteLine save, "" EndIf TPrint "~tCategory: " + category Local title:String = "" Repeat Local line:String = ReadLine (stream) If linecount < 5 If Left (line, 2) = "; " Local colon:Int = Instr (line, ": ", 3) If colon Local comment:String = Mid (line, colon + 2) Local autocomment:Int = False ' Automatic comment inserted by server... Select Mid (line, 3, (colon - 1) - 2) Case "ID" Print "~tEntry: #" + comment autocomment = True Case "Author" Print "~tAuthor: " + comment autocomment = True Case "Date" Print "~tDate submitted: " + comment autocomment = True Case "Title" Print "~tTitle: " + comment title = comment autocomment = True Case "Description" Print "~tDescription: " + comment autocomment = True End Select If autocomment ' Swap ; symbol for ' symbol in BlitzMax code! If blitzcode = "bmx" Then line = comment_symbol + Right (line, Len (line) - 1) EndIf EndIf EndIf EndIf WriteLine save, line linecount = linecount + 1 Until Eof (stream) CloseStream save ' Replace invalid characters (based on Windows) with underscore... Local invalid:String[] = ["\", "/", ":", "*", "?", "~q", "<", ">", "|"] For Local char:String = EachIn invalid title = Replace (title, char, "_") Next ' Rename from "24.bb" to "24 - Title.bb"... If Not RenameFile (saveas, blitzfolder + "/" + StripAll (url) + " - " + title + "." + blitzcode) DeleteFile saveas ' This will delete files that were orphaned with old name on ' last run, eg. in event of early program/thread termination... EndIf EndIf CloseStream stream EndIf EndIf EndIf Return Null End Function |
Comments
| ||
OK, you'll find that the code claims this particular entry, ironically enough, doesn't exist! It fails to download itself because I'm doing really dumb Instr () checks for strings found in the HTML here, but the code picks up those strings from the codebox's source too! (The page is downloaded as one large string due to the chunked encoding the server uses.) I thought I'd got it, but then ran into a really weird bug that made no sense at all. I could be lazy and hard-code it to skip this entry, but I think I'll split the string into separate lines and parse the page that way... but not tonight. For now, expect that this entry will be skipped as "non-existent" and try not to laugh too loudly... |
| ||
I dfon't know if it helps for this instance, Jim, but if it is any interest or use, I was working on something that (downlaoded web pages to a local file and then) scoured them for specific strings. In order to recognise the string itself, rather than look for the string in question (due to the formatting of HTML) I instead qualified prefix and suffix strings. These values were of varyig length dependign on the details of the website. For example, downloading the page from "http://www.ip-details.com/ip-search" the actual client's IP location coordinates can be identified in the downloaded html file as being between the prefix of "<span id="+Chr(34)+"latlongspan"+Chr(34)+">"and the suffix of "</span>" This is failproof rovided the prefix is long enough to ensure unique results, and the suffix is used to mark the end of the required string. Once the result has been obtained, it may be adviseable then to use Trim and/or remove/edit characters, such as html control codes. |
| ||
That might work, but I'm not sure -- I was trying something similar last night (realising that my simple strings could easily be in someone else's future contribution), but still ran into trouble. The problem is that as soon as you add code searching for a particular string, the code then contains that string, so when I post it here, that string (in the Blitz code here) gets detected! I'm probably going to have another go shortly -- thanks! |
| ||
the code then contains that string Not if the search string is dynamically generated at runtime via a series of calls to Chr(), or possibly by a couple of strategic Replace()es on a base string. e.g. If you want to search for "@@foo@@", store that in the code as Replace("##foo##", "#", "@"), and there will be no instances of "@@foo@@" in the original source. |
| ||
Yeah, I was getting myself confused on that point, but I may do that... I'm working on it now and have a much better angle, so hopefully will have it sorted soon. |
| ||
OK, looking good! I parsed for the start and end of the first 'codebox' and now only check for stuff outside that area; all seems to be working fine now. (I've re-uploaded the code and archives file at http://www.hi-toro.com/blitz/codearc/codearc.zip so feel free to download from there.) Email me via my profile if you find anything broken... |
| ||
Source and zip archive updated 22 July 2012. Either the site changed its "ERROR: Internal error" text to include a space, or I managed to delete the space at some point. Either way, it started crashing! |
Code Archives Forum