Trying to copy large text file

slenkar(Posted March) [#1]
I have a text file that is about 3 million lines long.
I read it one line at a time and, after every 10,000 lines read,
I write them to a file.
I am splitting the 3-million-line file into several files of
200,000 lines each.


It crashes after about 300,000 lines have been read and written, though.

Also, could it be made any faster?
Maybe by using some type of buffer?


	
Import mojo
Import brl

Global fs:FileStream

Function Main()
	fs=FileStream.Open("c:/Users/Chuck/Downloads/worldcitiespop.txt","r")

	Local document_counter:Int=0
	Local total_write$=""
	'Local readposf:=FileStream.Open("c:/Users/Chuck/Downloads/pos.txt","r")
	'Local total_write:String
	'fs.Seek(readposf.ReadInt)

	While Not fs.Eof
		document_counter=document_counter+1
		Local wf:=FileStream.Open("c:/Users/Chuck/Downloads/mydb/mydb"+document_counter+".txt","w")
		Local counter:Int
		Local tenthoucounter:Int
		For Local x=0 To 200000
			counter=counter+1
			Local line_done=False
			Local this_line$=""
			While line_done=False
				Local r$=fs.ReadString(1)
				If r="~n"
					'Print "newline"
					line_done=True

					Local arr:=this_line.Split(",")
					arr=arr[..4]
					this_line=",".Join(arr)
					'this_line=this_line[..this_line.Length-1]
					'Print this_line
					total_write=total_write+this_line+"#~n"
				Else
					this_line=this_line+r
				Endif
			Wend

			If counter>10000
				counter=0
				tenthoucounter=tenthoucounter+1
				Print "done ten thou"+(tenthoucounter*10000)
				wf.WriteString(total_write)
				total_write=""
			Endif
		Next
		tenthoucounter=0
		wf.Close()
	Wend

	'Local posf:=FileStream.Open("c:/Users/Chuck/Downloads/pos.txt","w")
	'posf.WriteInt(fs.Position())
	'posf.Close()
	'wf.WriteString(total_write)
	'wf.Close()
	fs.Close()
End Function



Here is the big text file:
https://www.maxmind.com/en/free-world-cities-database


It crashes on this line:
Method PokeString:Int( address:Int,str:String,encoding:String="utf8" )

Select encoding
Case "utf8"
Local p:=str.ToChars()
Local i:=0,e:=p.Length
Local q:=New Int[e*3],j:=0

This happens after finishing the first 200,000-line document and getting about halfway through the second one.

I can see the variables in the debug dialog, but there doesn't seem to be a way of copying and pasting them anywhere.


Pakz(Posted March) [#2]
I tried the code and it errors here too.

Here is the error log:



This is the line it crashes at:
wf.WriteString(total_write)



Pakz(Posted March) [#3]
You might want to create a text file yourself with Monkey and run the code against it, to rule out the possibility that the file is the cause. If I have some time today, I will do that.


slenkar(Posted March) [#4]
Thanks,
I set it up earlier so that the program writes 200,000 lines and then exits; I got about 5 files done this way.
I wrote the seek position of the big file into a text file and retrieved it on each run.

The code to do this is in there, but it's commented out.

Then I thought, why not let the computer create the files and write to them, instead of me having to restart the program 15 times to get 15 files?

I think it's something to do with running out of heap memory.
I think Monkey used to have a command called 'flush' or something.
The big strings that are being created have to be cleared out by garbage collection.
I haven't used Monkey in a while, so I'm not up to date with how it handles garbage.

The program does get quite slow after about 100,000 lines, so it could be the GC.


muddy_shoes(Posted March) [#5]
The GC won't be triggering because you're running the whole thing in Main. IIRC if you want to write a pure command line version then you have to call the GC yourself. Below is an "Appified" version that avoids the problem and does what I think you're trying to do. It should serve as a reference anyway. No guarantees though. Do your own testing.
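
A minimal sketch of what such an appified splitter could look like. This is not the listing originally attached to this post. It reuses the paths and field trimming from the first post; the class name SplitterApp and the two constants are hypothetical, and it assumes that Stream.ReadLine is available in the installed brl module.

```monkey
Import mojo
Import brl.filestream

' Hypothetical sizes chosen for illustration.
Const LINES_PER_FILE:Int=200000
Const LINES_PER_UPDATE:Int=10000

Class SplitterApp Extends App
	Field src:FileStream
	Field out:FileStream
	Field fileIndex:Int
	Field linesInFile:Int

	Method OnCreate()
		src=FileStream.Open("c:/Users/Chuck/Downloads/worldcitiespop.txt","r")
		If src=Null
			Error("Cannot open source file")
		Endif
		SetUpdateRate(60)
		Return 0
	End

	' Process a bounded batch per update, then hand control back to mojo
	' so the collector gets a chance to run between batches.
	Method OnUpdate()
		If src=Null
			Return 0
		Endif
		For Local i:=1 To LINES_PER_UPDATE
			If src.Eof()
				Finish()
				Return 0
			Endif
			' Open the next output file lazily, when there is a line to write.
			If out=Null
				fileIndex+=1
				linesInFile=0
				out=FileStream.Open("c:/Users/Chuck/Downloads/mydb/mydb"+fileIndex+".txt","w")
			Endif
			' Assumption: ReadLine returns the next line without its terminator.
			Local fields:=src.ReadLine().Split(",")
			' Keep the first four fields and write the line straight out,
			' so no large string accumulates in memory.
			out.WriteString(",".Join(fields[..4])+"#~n")
			linesInFile+=1
			If linesInFile>=LINES_PER_FILE
				out.Close()
				out=Null
			Endif
		Next
		Return 0
	End

	' Close whatever is still open once the source is exhausted.
	Method Finish()
		If out<>Null
			out.Close()
			out=Null
		Endif
		src.Close()
		src=Null
		Print "done"
	End
End

Function Main()
	New SplitterApp
End
```

Writing each line as soon as it is read avoids building the large total_write string that WriteString was choking on in the first post.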




muddy_shoes(Posted March) [#6]
A shadowy memory of Monkey past and some Googling dug up the CPP_GC_MODE preprocessor switch, which makes the GC trigger outside of mojo Apps. I think it's a bit slower because it causes more GC checks, but not by much. There's certainly far more of a speed-up to be found in the file processing. Anyway, here's an unappified version of the appified version.
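
A minimal sketch of an unappified splitter along these lines, not the listing originally posted. It assumes that the value 2 for CPP_GC_MODE is the mode that lets the collector run during allocation rather than only between mojo callbacks, and that Stream.ReadLine is available in the installed brl module. The constant LINES_PER_FILE is hypothetical; the paths come from the first post.

```monkey
' Assumption: mode 2 enables collection outside of mojo App callbacks.
#CPP_GC_MODE=2

Import brl.filestream

' Hypothetical constant; the first post used 200,000 lines per file.
Const LINES_PER_FILE:Int=200000

Function Main()
	Local src:=FileStream.Open("c:/Users/Chuck/Downloads/worldcitiespop.txt","r")
	If src=Null
		Error("Cannot open source file")
	Endif

	Local fileIndex:Int=0
	While Not src.Eof()
		fileIndex+=1
		Local out:=FileStream.Open("c:/Users/Chuck/Downloads/mydb/mydb"+fileIndex+".txt","w")
		Local written:Int=0
		While written<LINES_PER_FILE And Not src.Eof()
			' Read a whole line at once instead of one character per call.
			Local fields:=src.ReadLine().Split(",")
			' Keep the first four fields, as in the first post.
			out.WriteString(",".Join(fields[..4])+"#~n")
			written+=1
		Wend
		out.Close()
	Wend
	src.Close()
End
```

Reading by line removes the per-character string concatenation, which is likely where most of the time goes in the original loop.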




slenkar(Posted March) [#7]
The appified version is fine, thanks.

It seems stuck at 1,120,000 lines written.

Increasing lines per file to 30,000 does the trick and writes it all out, thanks.

I guess Mark might be interested in the freezing at 1,120,000 lines.

-Edit
The freezing could be due to how Monkey pauses when it is unfocussed.
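
If the pause on losing focus is the cause, one thing to try is turning off mojo's auto-suspend. A minimal config sketch, assuming MOJO_AUTO_SUSPEND_ENABLED is the app config setting that controls this on the target in use and that it can be overridden near the top of the main source file:

```monkey
' Assumption: this setting controls whether a mojo app suspends
' (stops receiving OnUpdate calls) when its window loses focus.
#MOJO_AUTO_SUSPEND_ENABLED=False

Import mojo
```

This has not been checked against the 1,120,000-line freeze, so treat it as something to test rather than a confirmed fix.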