Dec 11 2003, 02:11 PM
Post #1
Group: Members | Posts: 11 | Joined: 27-October 02 | Member No.: 5,380

My server had a primary hard drive failure. The generated HTML was on another drive and is fine, but the MySQL database was on the main drive and is gone.
Is there any way to import the monthly archives? Does anyone have a script to parse the HTML and convert it to an MT import file? Thanks.

Dec 19 2003, 08:21 AM
Post #2
Group: Members | Posts: 11 | Joined: 27-October 02 | Member No.: 5,380

Never mind. I wrote this. It works.
CODE
#!/usr/bin/perl
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
while (<>) { # for each file on the command line
    # read in entire file to $content, line feeds and all
    # using slurp mode
    { local $/; $content = <>;}
    # locate the fields we need using regex
    # some matches may include newlines
    ($author) =($content =~ m|<div class="posted">\s+(.+?)\s+/|s);
    ($title) = ($content =~ m|<span class="title">(.+)</span>|);
    ($text) = ($content =~ m|<p>(.+)<a name="more">|s);
    ($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
    ($date) = ($content =~ m|<div class="date">(.+)</div>|);
    ($time) = ($content =~ m|Posted\sat\s(.+)<br />|);
    # Read in all the comments as one big block of text
    ($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
    # Break up each comment into its own element in an array
    @comm = split (/<\/div>\n/, $comments);
    # convert the date to MM/DD/YYYY hh:mm:ss
    $datetime = "$date $time";
    $parsed = ParseDate($datetime);
    $datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");
    # Strip out the paragraph tags, MT will add them later anyways.
    $text =~ s|\<p\>||g;
    $text =~ s|\</p\>||g;
    $more =~ s|\<p\>||g;
    $more =~ s|\</p\>||g;
    # print out the fields in the proper format
    print "AUTHOR: $author\n";
    print "TITLE: $title\n";
    print "DATE: $datetime\n";
    print "-----\n";
    print "BODY:\n$text\n";
    print "-----\n";
    print "EXTENDED BODY:\n$more\n";
    foreach (@comm) { # For every comment in our array, print out the necessary formatting.
        if (length $_ > 7) { # this is here to ignore the last comment record.
            ($CText) = ($_ =~ m|<div class="comments-body">\n(.+)\n\<span|s);
            ($CDate) = ($_ =~ m|</a>\son\s(.+)</span>|);
            ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
            ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
            ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
            $CText =~ s|\<p\>||g;
            $CText =~ s|\</p\>||g;
            $parsed = ParseDate($CDate);
            $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");
            print "-----\n";
            print "COMMENT:\n";
            print "AUTHOR: $CAuthor\n";
            print "URL: $CURL\n";
            print "DATE: $CDate\n";
            print "$CText\n\n";
        }
    }
    print "--------\n";
}
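
A note on the read loop in the script above: `while (<>)` reads one line before the slurp runs, so the first line of each file never reaches `$content`. That is usually harmless for HTML pages, where the first line rarely holds an entry field. Below is a minimal sketch of a per-file slurp that keeps the whole file, assuming the file names arrive in `@ARGV`. The helper name `slurp_file` is made up for illustration and is not part of mtfix.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the full
# contents of one file as a single string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;                 # undefined record separator: read to end of file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Process each file named on the command line, one at a time.
for my $file (@ARGV) {
    my $content = slurp_file($file);
    # Same title pattern as the original script.
    my ($title) = ($content =~ m|<span class="title">(.+)</span>|);
    print "TITLE: ", (defined $title ? $title : ''), "\n";
}
```

The remaining field patterns from mtfix.pl could be applied to `$content` inside the `for` loop in the same way.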

Dec 22 2003, 07:03 AM
Post #3
Group: Members | Posts: 3 | Joined: 18-December 03 | Member No.: 18,864

Wow, elegant... so that just reads the HTMLified pages in and spits out the data in the correct import format. How very Perl of you.

Dec 22 2003, 05:03 PM
Post #4
Group: Members | Posts: 33 | Joined: 9-October 01 | From: Boston-area, Massachusetts | Member No.: 2,447

How exactly does this work? Where does one put this file and run it?

Dec 22 2003, 09:02 PM
Post #5
Group: Members | Posts: 1,622 | Joined: 23-June 03 | From: Abu Dhabi, UAE | Member No.: 12,575

You should publicise this script more. I've seen many cases where this script could be vital. Like mavenglobe said, where do you put it? How do you configure it, what permissions does it need, etc.?
--------------------
Movalog – All Things Movable Type
Movalog Plugins – Blogroll, Protect, CustomFields, InlineEditor, LivePreview, Comment Email Filter
Movable Type Style Generator – Create your own unique drop-in stylesheet

Dec 29 2003, 09:41 AM
Post #6
Group: Members | Posts: 11 | Joined: 27-October 02 | Member No.: 5,380

In order to get this to work, you'll need to install Perl.
You can get Perl free from www.perl.com; there are Windows/Mac/Unix/Linux distributions. Once you've installed Perl, copy that script into your favorite text editor and save it as mtfix.pl (or whatever). You'll need to use your individual archives for the import. Take a look at one of the HTML files to figure out what you need to change in the Perl script. The script starts reading the HTML from the start of the file and looks for a match, e.g.:
CODE
($title) = ($content =~ m|<span class="title">(.+)</span>|);
Here we are looking for the title of the entry. The '(.+)' represents the text we are going to capture; my titles are sitting in a span class. All you need to do to get it to work for yours is to figure out what surrounds each item you are looking for.
A little more complicated is the way I got the author information out of the comments. For me, the author name was a link.
CODE
<span class="comments-post">Posted by: <a target="_blank" href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>
So this was the code:
CODE
($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
It grabs that whole line into the temp variable and then looks in the temp variable for the name and URL. Notice how I had to use \s to represent spaces for the 'Posted by:' search. You may have to use a \n for new lines.
Okay, so you think you've made the changes you need, and want to go ahead and try it out. Drop to a command prompt, with your saved script and the HTML files in the same folder, and type:
CODE
perl mtfix.pl 0000001.html
Or whatever file name you choose. I recommend trying it out with just one of your files. It's going to print the results to the screen, so you can read through and make sure that it's finding the right text. If not, go back and figure out why. If it's perfect, then use this:
CODE
perl mtfix.pl *.html > import.mt
Hope that works out for you.
If you need more help, do a Google search on Perl and regular expressions. Learn something.
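
The two-step capture described in this post can be tried on its own before touching any real files. The sketch below runs it against the sample comment line quoted in the post. The helper name `parse_comment_footer` is made up for illustration, and the URL pattern uses `[^"]+` instead of the original greedy `.+`, a small swapped-in change so the match stops at the first closing quote.

```perl
use strict;
use warnings;

# Sample comment footer, copied from the post above.
my $line = '<span class="comments-post">Posted by: '
         . '<a target="_blank" href="http://jason.sdf1.net">Jason</a>'
         . ' at December 18, 2003 09:43 AM</span>';

# Hypothetical helper: pull name, URL and date out of one footer line.
sub parse_comment_footer {
    my ($html) = @_;
    # Step 1: capture everything between "Posted by: " and "/a>".
    my ($tmp) = ($html =~ m|Posted\sby:\s(.+)/a>|);
    return unless defined $tmp;
    # Step 2: look inside the captured text for the name and the link.
    my ($author) = ($tmp =~ m|>(.+)<|);
    my ($url)    = ($tmp =~ m|href="([^"]+)"|);
    # The date sits between the closing </a> and the closing </span>.
    my ($date)   = ($html =~ m|</a>\sat\s(.+)</span>|);
    return ($author, $url, $date);
}

my ($author, $url, $date) = parse_comment_footer($line);
print "AUTHOR: $author\nURL: $url\nDATE: $date\n";
```

Run as is, this prints `Jason`, `http://jason.sdf1.net` and `December 18, 2003 09:43 AM` on the three labelled lines. Swapping in a line from your own archive shows quickly whether the patterns need adjusting.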

Mar 12 2004, 11:18 PM
Post #7
Group: Members | Posts: 147 | Joined: 12-October 02 | Member No.: 5,052

This happened to me last weekend. I can't thank you enough for that Perl script, Anakin. Talk about a lifesaver.
I discovered a few more useful things while I was recovering stuff. (I still am, but I think I've automated everything I possibly could at this point.) Long post on the complete recovery process at my site.

May 29 2004, 03:50 PM
Post #8
Group: Members | Posts: 50 | Joined: 15-February 02 | From: Ohio | Member No.: 476

Is there a way to do this without having shell access? My hosting company is very strict about allowing shell access.

May 29 2004, 06:11 PM
Post #9
Group: Members | Posts: 50 | Joined: 15-February 02 | From: Ohio | Member No.: 476

OK, I installed Perl on my computer (lol, I have PHP, so why not Perl, right?), but the error I receive is:
QUOTE
Can't open *.html: no such file or directory at mtfix.pl line 6
which is
CODE
while (<>) { # for each file on the command line
(At least, I'm pretty sure that is line 6, lol.)

May 29 2004, 07:42 PM
Post #10
Group: Members | Posts: 50 | Joined: 15-February 02 | From: Ohio | Member No.: 476

Actually, would there be a way to do this using just the monthly archives?

Jun 7 2004, 07:16 AM
Post #11
Group: Members | Posts: 11 | Joined: 27-October 02 | Member No.: 5,380

QUOTE (StarryMom @ May 30 2004, 03:42 AM)
actually would there be a way to do this using just the monthly archives?
It's written primarily for individual archives. You can try passing the monthly archives to it without the "> import.mt" part; watch the output on the screen and see if it looks right. You will need to save all of the pages from your site locally in order for the script to read them.
Someone else was having problems with an extra blank comment being added to the end of each record. Simply play with this line:
CODE
if (length $_ > 7) { # this is here to ignore the last comment record.
Change the 7 to a higher number, like 20 or 35. Don't go too high: this is a threshold between junk for a comment and an actual comment. That text includes the name of the poster, the date, and their text, plus any email and URLs that may have been entered. Something under 40 is probably a safe bet.
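
The length test can be tried in isolation to see which chunks it lets through. This is an illustration only: the sample block is invented, the helper name `real_comments` is made up, and the extra check for the `comments-body` marker is a swapped-in alternative to tuning the number, not something the original script does.

```perl
use strict;
use warnings;

# Invented sample: two comments followed by leftover markup, shaped like
# the block the script captures between the two "Comments" headers.
my $comments = join '',
    qq{<div class="comments-body">\n<p>First!</p>\n<span class="comments-post">Posted by: A</span>\n</div>\n},
    qq{<div class="comments-body">\n<p>Nice post.</p>\n<span class="comments-post">Posted by: B</span>\n</div>\n},
    qq{\n</div>\n};

# Hypothetical helper: split into chunks as the script does, then keep
# only the chunks that look like real comments.
sub real_comments {
    my ($block, $min_length) = @_;
    my @chunks = split /<\/div>\n/, $block;
    my @kept;
    for my $chunk (@chunks) {
        next unless length($chunk) > $min_length;            # the script's threshold test
        next unless $chunk =~ /<div class="comments-body">/; # extra check: must look like a comment
        push @kept, $chunk;
    }
    return @kept;
}

my @kept = real_comments($comments, 7);
print scalar(@kept), " comments kept\n";
```

Checking for the marker avoids guessing a number, but it only helps if every real comment in your template contains that marker.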

Jun 7 2004, 07:52 AM
Post #12
Group: Members | Posts: 50 | Joined: 15-February 02 | From: Ohio | Member No.: 476

OK, here is a sample of the file I am trying to work with:
CODE
<div class="blog">
<a name="000008"></a>
<span class="title">Linky Love</span>
<p>As my first mini post using MT, here are a few of my favorite links! Give them love, comment on THEIR blogs, in their guestbooks, etc. *muah*</p>
<p><a href="http://ambienine.vectorstar.net/ambienine/">My Ambie Kitten *purrr*</a>, <a href="http://belle.rose-madder.net/blog.htm">Kim</a>, <a href="http://asnightfalls.net">Crystal</a>, <a href="http://eosrising.net/">CG</a>, <a href="http://www.love-buzz.org/">Sarah</a>.</p>
<div class="posted">Posted by Sarah at <a href="http://onestarrynight.com/musings/archives/000008.php#000008">07:59 AM</a></span>
</div>
Mind you, this is a VERY VERY old entry lol.
When I do it, one of two things happens:
1. It just puts the first title of the entry in the import file, that's it.
CODE
AUTHOR:
TITLE: Linky Love
DATE:
-----
2. It doesn't work at all. I swear it hates me lol.
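
Comparing this sample with the patterns in mtfix.pl suggests why only TITLE is filled in. The title pattern matches, but the author pattern expects whitespace right after `<div class="posted">`, the body pattern expects an `<a name="more">` anchor, and the snippet shows no `<div class="date">`. Below is a rough sketch of patterns adapted to this sample, assuming every entry on the page follows the same layout. The helper name `parse_entry` is made up, and the date is left out because the snippet only shows a time.

```perl
use strict;
use warnings;

# Entry markup abbreviated from the sample above (text and links shortened).
my $entry = <<'HTML';
<div class="blog">
<a name="000008"></a>
<span class="title">Linky Love</span>
<p>As my first mini post using MT, here are a few of my favorite links!</p>
<p><a href="http://asnightfalls.net">Crystal</a>, <a href="http://eosrising.net/">CG</a>.</p>
<div class="posted">Posted by Sarah at <a href="http://onestarrynight.com/musings/archives/000008.php#000008">07:59 AM</a></span>
</div>
HTML

# Hypothetical helper: extract the fields this layout exposes.
sub parse_entry {
    my ($html) = @_;
    my ($title)  = ($html =~ m|<span class="title">(.+?)</span>|);
    my ($author) = ($html =~ m|<div class="posted">Posted by (.+?) at <a|);
    my ($time)   = ($html =~ m|<div class="posted">.+?<a [^>]*>(.+?)</a>|s);
    # Body: everything between the title span and the "posted" footer.
    my ($body)   = ($html =~ m|</span>\s*(.+?)\s*<div class="posted">|s);
    $body =~ s|</?p>||g if defined $body;   # strip <p> tags, as the original script does
    return (title => $title, author => $author, time => $time, body => $body);
}

my %e = parse_entry($entry);
print "AUTHOR: $e{author}\nTITLE: $e{title}\n-----\nBODY:\n$e{body}\n-----\n";
```

For a monthly archive page holding several entries, one option is to split the page at each `<a name="000008"></a>`-style anchor first and run the patterns on each piece, assuming every entry starts with such an anchor.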

Jun 19 2004, 12:14 AM
Post #13
Group: Members | Posts: 7 | Joined: 27-March 03 | Member No.: 9,231

I try to run this script, but, I get an error on this line ...
#!/usr/bin/perl
use Date::Manip;
This is the error.
QUOTE
$ perl mtfix.pl index.php
Can't locate Date/Manip.pm in @INC (@INC contains: /etc/perl /usr/lib/perl5/site_perl/5.8.2/i686-linux /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.2/i686-linux /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.2/i686-linux /usr/lib/perl5/5.8.2 /usr/local/lib/site_perl .) at mtfix.pl line 2.
BEGIN failed--compilation aborted at mtfix.pl line 2.
I tried to install the file using CPAN.pm. But it does not exist. Thoughts?
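
Date::Manip is not part of the core Perl distribution, so it has to be installed separately (for example from CPAN). If that is not possible, the one comment date format quoted earlier in this thread, `December 18, 2003 09:43 AM`, can be converted with plain Perl. Below is a minimal sketch, assuming every date uses exactly that format. The name `to_mt_date` is made up, and this is a stand-in for the `ParseDate`/`UnixDate` calls, not a general date parser.

```perl
use strict;
use warnings;

my %MONTH = (
    January => 1, February => 2, March     => 3, April    => 4,
    May     => 5, June     => 6, July      => 7, August   => 8,
    September => 9, October => 10, November => 11, December => 12,
);

# Hypothetical helper: "December 18, 2003 09:43 AM" -> "12/18/2003 09:43:00".
# Returns undef (empty list in list context) when the text does not match.
sub to_mt_date {
    my ($text) = @_;
    my ($mon, $day, $year, $hour, $min, $ampm) =
        $text =~ /^\s*(\w+)\s+(\d{1,2}),\s+(\d{4})\s+(\d{1,2}):(\d{2})\s*([AP]M)\s*$/i
        or return;
    my $m = $MONTH{ ucfirst(lc($mon)) } or return;
    $hour = $hour % 12;                  # 12 AM becomes 0; 12 PM becomes 0, then 12 below
    $hour += 12 if uc($ampm) eq 'PM';
    return sprintf '%02d/%02d/%04d %02d:%02d:00', $m, $day, $year, $hour, $min;
}

print to_mt_date('December 18, 2003 09:43 AM'), "\n";   # prints 12/18/2003 09:43:00
```

Entry dates in a different format would need their own pattern, so installing Date::Manip remains the simpler route when it is available.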

Jun 19 2004, 10:09 AM
Post #14
Group: Members | Posts: 7 | Joined: 27-March 03 | Member No.: 9,231

Disregard, I got it working.

Jun 19 2004, 06:38 PM
Post #15
Group: Members | Posts: 7 | Joined: 27-March 03 | Member No.: 9,231

OK, I found some minor issues with Anakin513's script. It seemed to drop dates (at least on my archive) and it excluded categories and other useful stuff found in the RDF for each post. So, I tweaked his script and wrote a few directions you may find useful.
Thanks to fxn at irc.freenode.net#perl for the help. Here it is ...
CODE
#!/usr/bin/perl
#########################################
# Script: mt_recover_html.pl
# Written By: anakin513 (http://jason.sdf1.net/)
# Tweaked By: apakuni (http://apakuni.com)
# NOTE: This script assumes you want to publish
# your recovered posts and sets comments and pings
# on by default. Use this script at your own risk!
#########################################
# Instructions
#########################################
# Rename this file to mt_recover_html.pl
# Run Test:
# $ perl mt_recover_html.pl somemtfile.html
# Returns results to screen for review. If all is well,
# run the script against the entire archive.
# $ perl mt_recover_html.pl *.html > mt_import_file.txt
# Import into MT or WP from there.
#########################################
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
while (<>) { # for each file on the command line
    # read in entire file to $content, line feeds and all
    # using slurp mode
    { local $/; $content = <>;}
    # locate the fields we need using regex
    # some matches may include newlines
    ($author) =($content =~ m|<span class="posted">Posted by (.+?) at |s);
    ($title) = ($content =~ m|<h3 class="title">(.+)</h3>|);
    ($text) = ($content =~ m|</h3>(.+)<a name="more">|s);
    ($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
    ($date) = ($content =~ m|dc:date="(.+)" />|);
    ($excerpt) = ($content =~ m|dc:description="(.*)"|);
    ($primary_cat) = ($content =~ m|dc:subject="(.*)"|);
    # Read in all the comments as one big block of text
    ($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
    # Break up each comment into its own element in an array
    @comm = split (/<\/div>\n/, $comments);
    # convert the date to MM/DD/YYYY hh:mm:ss
    # (no separate $time is captured in this version, so only $date is parsed)
    $parsed = ParseDate($date);
    $datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");
    # Strip out the paragraph tags, MT will add them later anyways.
    $text =~ s|\<p\>||g;
    $text =~ s|\</p\>||g;
    $more =~ s|\<p\>||g;
    $more =~ s|\</p\>||g;
    # print out the fields in the proper format
    print "AUTHOR: $author\n";
    print "TITLE: $title\n";
    print "STATUS: Publish\n";
    print "ALLOW COMMENTS: 1\n";
    print "CONVERT BREAKS: 0\n";
    print "ALLOW PINGS: 1\n";
    print "PRIMARY CATEGORY: $primary_cat\n";
    print "DATE: $datetime\n";
    print "-----\n";
    print "BODY:\n$text\n";
    print "-----\n";
    print "EXTENDED BODY:\n$more\n";
    print "-----\n";
    print "EXCERPT:\n$excerpt\n";
    print "-----\n";
    foreach (@comm) { # For every comment in our array, print out the necessary formatting.
        if (length $_ > 7) { # this is here to ignore the last comment record.
            ($CText) = ($_ =~ m|<div class="comments-body">(.+)<span class="comments-post">|s);
            ($CDate) = ($_ =~ m|</a> at (.+)</span>|);
            ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
            ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
            ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
            $CText =~ s|\<p\>||g;
            $CText =~ s|\</p\>||g;
            $parsed = ParseDate($CDate);
            $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");
            print "-----\n";
            print "COMMENT:\n";
            print "AUTHOR: $CAuthor\n";
            print "URL: $CURL\n";
            print "DATE: $CDate\n";
            print "$CText\n\n";
        }
    }
    print "--------\n";
}

Jun 27 2004, 11:02 AM
Post #16
Group: Members | Posts: 1 | Joined: 26-June 04 | Member No.: 25,313

I'm trying to use this script, and after much trial and tribulation, I finally got the thing to run. However, when I attempt to run it with
CODE
*.html > import.mt
I get the error:
QUOTE
Can't open *.html: Invalid argument at mtfix.pl line 6.
This is the while line that was spoken of earlier:
CODE
while (<>) { # for each file on the command line
I'm not a perl guy at all. Any perl I know now, I've learned in the past 24 hours by messing with this script. Some help would be greatly appreciated. Thanks.
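
The `Can't open *.html` errors in this thread are what Perl reports when it receives the literal text `*.html` as a file name. A likely cause, assuming the script was run from a Windows command prompt, is that the shell there does not expand wildcards the way Unix shells do, so the script has to expand them itself. Below is a minimal sketch using Perl's built-in `glob`. The helper name `expand_args` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: expand wildcard patterns, leaving arguments that
# already name an existing file untouched.
sub expand_args {
    my @files;
    for my $arg (@_) {
        push @files, (-e $arg) ? $arg : glob($arg);
    }
    return @files;
}

# Placed before the read loop, this lets "perl mtfix.pl *.html" work
# even when the shell passes the pattern through unexpanded.
@ARGV = expand_args(@ARGV);
print "$_\n" for @ARGV;
```

In mtfix.pl only the `@ARGV = ...` line and the helper would be needed, ahead of `while (<>)`; the `print` line is there just to show which files were found.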