Import from Blog Archive HTMLs
Movable Type Community Forum > Installing and Upgrading > Importing and Exporting
anakin513
My server had a primary hard drive failure. The generated html was on another drive and is fine, but the MySQL database was on the main drive and is gone.

Is there any way to import the monthly archives?

Does anyone have a script to parse the HTML and convert it to an MT import file?

Thanks.
anakin513
Never mind. I wrote this. It works :)

CODE
#!/usr/bin/perl
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
while (<>) { # loops once per file; note this read consumes the file's first line
# read in the rest of the file to $content, line feeds and all
# using slurp mode
{ local $/; $content = <>;}
# locate the fields we need using regex
# some matches may include newlines
($author) =($content =~ m|<div class="posted">\s+(.+?)\s+/|s);
($title) = ($content =~ m|<span class="title">(.+)</span>|);
($text) = ($content =~ m|<p>(.+)<a name="more">|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|<div class="date">(.+)</div>|);
($time) = ($content =~ m|Posted\sat\s(.+)<br />|);

# Read in all the comments as one big block of text
($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
# Break up each comment into its own element in an array
@comm = split (/<\/div>\n/, $comments);

# convert the date to MM/DD/YYYY hh:mm:ss
$datetime = "$date $time";
$parsed = ParseDate($datetime);
$datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

# Strip out the paragraph tags, MT will add them later anyways.
$text =~ s|\<p\>||g;
$text =~ s|\</p\>||g;
$more =~ s|\<p\>||g;
$more =~ s|\</p\>||g;

# print out the fields in the proper format
print "AUTHOR: $author\n";
print "TITLE: $title\n";
print "DATE: $datetime\n";
print "-----\n";
print "BODY:\n$text\n";
print "-----\n";
print "EXTENDED BODY:\n$more\n";

foreach (@comm) { # For every comment in our array, print out the necessary formatting.
       if (length $_ > 7) { # this is here to ignore the last comment record.
               ($CText) = ($_ =~ m|<div class="comments-body">\n(.+)\n\<span|s);
               ($CDate) = ($_ =~ m|</a>\son\s(.+)</span>|);
               ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
               ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
               ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
               $CText =~ s|\<p\>||g;
               $CText =~ s|\</p\>||g;
               $parsed = ParseDate($CDate);
               $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

               print "-----\n";
               print "COMMENT:\n";
               print "AUTHOR: $CAuthor\n";
               print "URL: $CURL\n";
               print "DATE: $CDate\n";
               print "$CText\n\n";
       }
}
print "--------\n";
}
earlax
wow, elegant. ... so that just reads the HTMLified pages in and spits out the data in correct import format. how very perl of you :) I hope I never need this, but thanks just the same as if you had saved my butt too!
mavenglobe
How exactly does this work? Where does one put this file and run it?
arvind
You should publicise this script more. I've seen many cases where this script could be vital. Like mavenglobe said, where do you put it? How do you configure it, permissions etc.? ;)
anakin513
In order to get this to work, you'll need to install Perl.

You can get Perl free from www.perl.com; there are Windows/Mac/Unix/Linux distributions.

Once you've installed Perl, copy that script into your favorite text editor, save it as mtfix.pl (or whatever). You'll need to use your individual archives for the import. Take a look at one of the HTML files to figure out what you need to change in the perl script.

The Perl script works by reading the HTML from the start of the file and looking for a match.

For example:
CODE
($title) = ($content =~ m|<span class="title">(.+)</span>|);

Here we are looking for the title of the entry. The '(.+)' represents the text we are going to capture; my titles sit inside a span with class "title". All you need to do to get it to work for yours is to figure out what surrounds each item you are looking for.
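To see what the capture does before touching the real script, the same pattern can be tried on a short string. This is only a sketch: the sample HTML and variable names are made up for illustration, and `(.+?)` is used instead of `(.+)` so the match stops at the first closing tag on the line.

```perl
use strict;
use warnings;

# Made-up sample; paste a few lines from one of your own archive pages instead.
my $content = <<'HTML';
<a name="000008"></a>
<span class="title">Linky Love</span> <span class="date">May 30</span>
HTML

# Whatever sits between the two delimiters lands in $title.
# (.+?) is non-greedy, so it stops at the first </span> it finds.
my ($title) = ($content =~ m|<span class="title">(.+?)</span>|);

print defined $title ? "TITLE: $title\n" : "no match, adjust the delimiters\n";
```

With the greedy `(.+)` the capture would run on to the last `</span>` on that line, which only matters when a line holds more than one span.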

A little more complicated is the way I got the author information out of the comments. For me, the author name was a link.
CODE
<span class="comments-post">Posted by: <a target="_blank" href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>


So this was the code:
CODE
($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);


It grabs that whole line into the temp variable and then looks in the temp variable for the name and URL. Notice how I had to use \s to represent spaces for the 'Posted by:' search. You may have to use a \n for new lines ;)
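The same three values can also be pulled out in one step. The following is a sketch rather than a drop-in replacement: it assumes the comment line looks like the sample above (name wrapped in a link, followed by "at" and the date), and the variable names are made up here.

```perl
use strict;
use warnings;

# The sample comment line from the post above.
my $line = '<span class="comments-post">Posted by: <a target="_blank" href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>';

my ($url, $author, $when);
# [^"]* and [^<]* stop at the next quote or tag, so nothing over-matches.
if ($line =~ m{Posted\s+by:\s*<a\b[^>]*\bhref="([^"]*)"[^>]*>([^<]*)</a>\s+at\s+(.+?)</span>}s) {
    ($url, $author, $when) = ($1, $2, $3);
}

print "AUTHOR: $author\nURL: $url\nDATE: $when\n" if defined $author;
```

If some commenters left no URL, their names will not be links and this pattern will not match them; a second, simpler pattern would be needed for those lines.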

Okay, so you think you've made the changes you need, and want to go ahead and try it out. Drop to a command prompt, with your saved script and the html files in the same folder and type:

CODE
perl mtfix.pl 0000001.html


Or whatever file name you choose. I recommend trying it out with just one of your files. It's going to print the results to the screen, so you can read through and make sure that it's finding the right text. If not, go back and figure out why. If it's perfect, then use this:

CODE
perl mtfix.pl *.html > import.mt


Hope that works out for you. If you need more help, do a Google search on Perl and regular expressions. Learn something :D
scsmith
This happened to me last weekend. I can't thank you enough for that Perl script, Anakin. Talk about a lifesaver.

I discovered a few more useful things while I was recovering stuff. (I still am, but I think I've automated everything I possibly could at this point.)

Long post on the complete recovery process at my site.
StarryMom
is there a way to do this without having shell access? my hosting company is very strict on allowing shell access.
StarryMom
OK, I installed Perl on my computer (lol, I have PHP, so why not Perl, right?), but the error I receive is

Can't open *.html: no such file or directory at mtfix.pl line 6 which is


while (<>) { # for each file on the command line

(at least I'm pretty sure that is line 6, lol)
StarryMom
actually would there be a way to do this using just the monthly archives?
anakin513
QUOTE (StarryMom @ May 30 2004, 03:42 AM)
actually would there be a way to do this using just the monthly archives?

It's written primarily for individual archives. You can try passing the monthly archives to it without the "> import.mt" part; watch the output on the screen and see if it looks right.
You will need to save all of the pages from your site locally in order for the script to read them.
Someone else was having problems with an extra blank comment being added to the end of each record. Simply play with this line:
CODE
if (length $_ > 7) { # this is here to ignore the last comment record.

Change the 7 to a higher number, like 20 or 35, but don't go too high: it is a threshold between leftover junk and an actual comment. A real comment's text includes the name of the poster, the date, and their text, plus any email and URLs that may have been entered. Something under 40 is probably a safe bet.
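An alternative to tuning the number is to test for the comment markup itself. A sketch, assuming every real comment chunk contains the `<div class="comments-body">` marker used in the script above; the sample data is made up.

```perl
use strict;
use warnings;

# Made-up sample: two real comments followed by leftover markup.
my $comments = join '',
    qq{<div class="comments-body">\n<p>First!</p>\n<span class="comments-post">Posted by: A</span>\n</div>\n},
    qq{<div class="comments-body">\n<p>Second.</p>\n<span class="comments-post">Posted by: B</span>\n</div>\n},
    qq{\n\n<br />\n};

# Same split the script uses.
my @chunks = split /<\/div>\n/, $comments;

# Keep only chunks that actually contain a comment body,
# instead of guessing from the chunk's length.
my @real = grep { /<div class="comments-body">/ } @chunks;

printf "%d chunks, %d real comments\n", scalar @chunks, scalar @real;
```

Here the leftover chunk is longer than 7 characters, so the length test alone would let it through as a blank comment, while the marker test drops it.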
StarryMom
ok here is a sample of the file I am trying to work with,

CODE
<div class="blog">    



<a name="000008"></a>
<span class="title">Linky Love</span>

<p>As my first mini post using MT, here are a few of my favorite links! Give them love, comment on THEIR blogs, in their guestbooks, etc. *muah*</p>

<p><a href="http://ambienine.vectorstar.net/ambienine/">My Ambie Kitten *purrr*</a>, <a href="http://belle.rose-madder.net/blog.htm">Kim</a>, <a href="http://asnightfalls.net">Crystal</a>, <a href="http://eosrising.net/">CG</a>, <a href="http://www.love-buzz.org/">Sarah</a>.</p>



<div class="posted">Posted by Sarah at <a href="http://onestarrynight.com/musings/archives/000008.php#000008">07:59 AM</a></span>

</div>


Mind you this is a VERY VERY old entry lol

When I do it I either have one of two things happen

1. It just puts the first title of the entry in the import file, that's it.

CODE
AUTHOR:
TITLE: Linky Love
DATE:
-----


2. It doesn't work at all.

I swear it hates me lol
apakuni
I try to run this script, but, I get an error on this line ...

#!/usr/bin/perl
use Date::Manip;


This is the error.
QUOTE
$  perl mtfix.pl index.php
Can't locate Date/Manip.pm in @INC (@INC contains: /etc/perl /usr/lib/perl5/site_perl/5.8.2/i686-linux /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.2/i686-linux /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.2/i686-linux /usr/lib/perl5/5.8.2 /usr/local/lib/site_perl .) at mtfix.pl line 2.
BEGIN failed--compilation aborted at mtfix.pl line 2.


I tried to install the file using CPAN.pm. But it does not exist. Thoughts?
apakuni
Disregard, I got it working.
apakuni
OK, I found some minor issues with Anakin513's script. It seemed to drop dates (at least on my archive) and it excluded categories and other useful stuff found in the RDF for each post. So, I tweaked his script and wrote a few directions you may find useful.

Thanks to fxn at irc.freenode.net#perl for the help.

Here it is ...
CODE
#########################################
# Script:  mt_recover_html.pl
# Written By:    anakin513 (http://jason.sdf1.net/)
# Tweaked By:    apakuni (http://apakuni.com)
#
# NOTE: This script assumes you want to publish
# your recovered posts and sets comments and pings
# on by default.  Use this script at your own risk!
#########################################


# Instructions
#########################################
# Rename this file to mt_recover_html.pl
# Run Test:
# $ perl mt_recover_html.pl somemtfile.html
# Returns results to screen for review.  If all is well,
# run the script against the entire archive.
# $ perl mt_recover_html.pl *.html > mt_import_file.txt
# Import into MT or WP from there.  
#########################################


#!/usr/bin/perl
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
while (<>) { # loops once per file; note this read consumes the file's first line
# read in the rest of the file to $content, line feeds and all
# using slurp mode
{ local $/; $content = <>;}
# locate the fields we need using regex
# some matches may include newlines
($author) =($content =~ m|<span class="posted">Posted by (.+?) at |s);
($title) = ($content =~ m|<h3 class="title">(.+)</h3>|);
($text) = ($content =~ m|</h3>(.+)<a name="more">|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|dc:date="(.+)" />|);
($excerpt) = ($content =~ m|dc:description="(.*)"|);
($primary_cat) = ($content =~ m|dc:subject="(.*)"|);

# Read in all the comments as one big block of text
($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
# Break up each comment into its own element in an array
@comm = split (/<\/div>\n/, $comments);

# convert the date to MM/DD/YYYY hh:mm:ss
$datetime = $date; # $time is never captured in this version, so use the dc:date value alone
$parsed = ParseDate($datetime);
$datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

# Strip out the paragraph tags, MT will add them later anyways.
$text =~ s|\<p\>||g;
$text =~ s|\</p\>||g;
$more =~ s|\<p\>||g;
$more =~ s|\</p\>||g;

# print out the fields in the proper format
print "AUTHOR: $author\n";
print "TITLE: $title\n";
print "STATUS: Publish\n";
print "ALLOW COMMENTS: 1\n";
print "CONVERT BREAKS: 0\n";
print "ALLOW PINGS: 1\n";
print "PRIMARY CATEGORY: $primary_cat\n";
print "DATE: $datetime\n";
print "-----\n";
print "BODY:\n$text\n";
print "-----\n";
print "EXTENDED BODY:\n$more\n";
print "-----\n";
print "EXCERPT:\n$excerpt\n";
print "-----\n";

foreach (@comm) { # For every comment in our array, print out the necessary formatting.
      if (length $_ > 7) { # this is here to ignore the last comment record.
              ($CText) = ($_ =~ m|<div class="comments-body">(.+)<span class="comments-post">|s);
              ($CDate) = ($_ =~ m|</a> at (.+)</span>|);
              ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
              ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
              ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
              $CText =~ s|\<p\>||g;
              $CText =~ s|\</p\>||g;
              $parsed = ParseDate($CDate);
              $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

              print "-----\n";
              print "COMMENT:\n";
              print "AUTHOR: $CAuthor\n";
              print "URL: $CURL\n";
              print "DATE: $CDate\n";
              print "$CText\n\n";              
             
      }
}
print "--------\n";
}
mistaroblivion
I'm trying to use this script, and after much trial and tribulation, I finally got the thing to run. However, when I attempt to run it with
CODE
*.html > import.mt
I get the error:
QUOTE
Can't open *.html: Invalid argument at mtfix.pl line 6.


This is the while line that was spoken of earlier:
CODE
while (<>) { # for each file on the command line


I'm not a perl guy at all. Any perl I know now, I've learned in the past 24 hours by messing with this script. Some help would be greatly appreciated.

Thanks.
klint
apakuni, does this work with the default MT templates?
karlelvis
Will the same script work if you're writing out all your static content in .php?

We just lost our MySQL database, and while most of the other users were writing everything out in .html, I was being clever and using .php.
anakin513
As long as there is a file with the text in it to parse, it should work.
This_mp3
I've done quite a bit of customization to my templates, so I'm not sure how to configure the script to parse my particular code. Here's a typical snippet:

CODE
<div class="blog">
         <p class="footer">April 14, 2004</p>
         <p class="blogtitle">Brand, Spankin'</p>
         <div class="blogpost"><p>I swear, this is the last MT install i'll ever do. If this thing blows up again, by golly i'll....</p> <a name="more"></a>  
         </div>
         <p class="blogstat">Posted by Jeremiah at April 14, 2004 11:21 PM  </p>
       </div>
       
     
     <p class="body"><a name="comments"></a>Comments</p>
     
     <div class="blogpost"> <p>test</p></div>
     <p class="blogstat">Posted by: <a href="mailto:nonya@daom.bix">jeremiah</a>
       at April 14, 2004 11:24 PM</p>


I'm not sure I completely understand how to tweak mtfix.pl to get these imported...
This_mp3
ok, so I've toyed with the code in mtfix.pl (and chmod'd it to be executable) for my html.

Here's what I'm trying:

CODE
($author) =($content =~ m|Posted\sby\s+(.+?)\s+/|s);
($title) = ($content =~ m|<p class="blogtitle">(.+)</p>|);
($text) = ($content =~ m|<div class="blogpost">(.+)</p>|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|by\sJeremiah\sat(.+)</p>|);
($time) = ($content =~ m|2004\s(.+)</p>|);


I run the script by typing @ the command prompt:
CODE
perl mtfix.pl 000002.htm


What I get back is:

CODE
Can't locate Date/Manip.pm in @INC (@INC contains: /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0 /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0 .) at mtfix.pl line 2.
BEGIN failed--compilation aborted at mtfix.pl line 2.


Is there a known fix for this? Did I break it?
baggas
You need to get the Date::Manip module installed on your server. If you know how you could try doing it using cpan, otherwise just email your web host and ask them to do it for you.
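If getting Date::Manip installed is not possible, the date conversion itself can be done another way. A minimal sketch, assuming a Perl new enough to bundle Time::Piece (5.10 or later) and dates written like "April 14, 2004 11:21 PM"; the helper name `mt_date` is made up here, and the format string would need changing for other date layouts.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: convert "April 14, 2004 11:21 PM"
# to the MM/DD/YYYY hh:mm:ss form the import file expects.
sub mt_date {
    my ($raw) = @_;
    return unless defined $raw && $raw =~ /\S/;    # nothing was captured
    # strptime dies on input it cannot parse, so trap that.
    my $t = eval { Time::Piece->strptime($raw, '%B %d, %Y %I:%M %p') };
    return unless $t;
    return $t->strftime('%m/%d/%Y %H:%M:%S');
}

my $date = mt_date('April 14, 2004 11:21 PM');
print defined $date ? "DATE: $date\n" : "DATE: (could not parse)\n";
```

In the script this would stand in for the ParseDate/UnixDate pair; an undefined result means either the regex captured nothing or the format string does not fit the page.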

~ Baggas
This_mp3
By following the instructions at CPAN, I got Date::Manip installed (took about twenty seconds) and now the parser is working....I think.

CPAN install instructions.
This_mp3
Well, looks like it worked!

Couple of notes:

Always add posts in "DRAFT" and NOT in "publish" mode... you'll thank me for this advice.

This process requires a lot of command-line work, so if you're not someone who's particularly comfy with the CLI, you may want to find some help in doing this kind of blog-save.

The one thing that kept tripping me up was forgetting to use the \s to represent a space in the perl script.

Another thing that's not made clear is how to interpret your DATE:TIME data. When testing, mtfix.pl will display both the date and time after the DATE: field. If you see both date and time displayed, THIS IS CORRECT. The correct time/date will be reflected in your actual blog posts.

I spent two hours trying to figure out why the DATE field kept displaying both the date and time.

Make SURE to test a single post that has "MORE" text in it. If you're like me, only one in twenty posts actually uses the extended entry feature, but I lost a couple of these by not checking a couple of the black sheep before running my batch import.

TEST, TEST, TEST!!
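The draft advice can be built into the output step. A sketch only: `format_entry` is a made-up helper, the field values are samples, and it assumes MT accepts "Draft" in the STATUS line the same way it accepts the "Publish" value used in the script earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build one import record from a hash of fields.
sub format_entry {
    my (%f) = @_;
    # Default to Draft so nothing goes live before it has been checked.
    my $status = defined $f{status} ? $f{status} : 'Draft';
    return join '',
        "AUTHOR: $f{author}\n",
        "TITLE: $f{title}\n",
        "STATUS: $status\n",
        "DATE: $f{date}\n",
        "-----\n",
        "BODY:\n$f{body}\n",
        "-----\n",
        "--------\n";    # end-of-entry separator
}

print format_entry(
    author => 'Jeremiah',
    title  => "Brand, Spankin'",
    date   => '04/14/2004 23:21:00',
    body   => 'I swear, this is the last MT install...',
);
```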
anakin513
Also noteworthy is that when you delete entries in MT, the original post that was built does not get deleted, so you could end up with a bunch of old posts in your new import that you had previously deleted.

Cheers!
live2dive
I am trying this, but I have no perl experience to date. I have been able to get it to almost work. The error I get is:

ERROR: Date::Manip unable to determine TimeZone.
Date::Manip::Date_TimeZone called at C:/Perl/lib/Date/manip.pm line 661
Date::Manip::Date_Init() called at C:/Perl/lib/Date/manip.pm line 1395
Date::Manip::ParseDate(' ') called at mtfix.pl line 39

I have to think it has something to do with CPAN, but like I say, I am lost in perl.

Any help would be great.

Thanks to any of you Perl wizards out there.
Freakdog
I am at a loss here. I'm trying, desperately, to do the same thing using Apakuni's script, but while it is adding the date and time in comments, it's not doing it for the actual blog entry. This, of course, is causing all sorts of problems for the import function.

Any ideas for a perl-clueless person?
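One possible cause, and it is only a guess since the failing pages are not shown, is that the dc:date value never matched or is in a form the date parser did not accept. A sketch that converts such a value with a plain regex, assuming it looks like "2004-04-14T23:21:00-05:00"; the helper name `rdf_date_to_mt` is made up here.

```perl
use strict;
use warnings;

# Hypothetical helper: "2004-04-14T23:21:00-05:00" -> "04/14/2004 23:21:00".
# The timezone offset, if any, is ignored (assumption: local time is wanted).
sub rdf_date_to_mt {
    my ($raw) = @_;
    return unless defined $raw;
    my ($y, $mo, $d, $h, $mi, $s) =
        $raw =~ /(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})(?::(\d{2}))?/
        or return;    # no match: let the caller notice the missing date
    return sprintf '%02d/%02d/%04d %02d:%02d:%02d',
        $mo, $d, $y, $h, $mi, defined $s ? $s : 0;
}

my $date = rdf_date_to_mt('2004-04-14T23:21:00-05:00');
print defined $date ? "DATE: $date\n" : "DATE: (not found, check the regex)\n";
```

Printing a visible marker when the date is missing makes the bad entries easy to find before importing, rather than after.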
Freakdog
Well, unfortunately for me, I've had to go back and add in dates to all of the articles that were processed by this script, then import them.

Oy...what a mess.

I'm still trying to clean up the results (the author had uploaded the same article numerous times, etc. Oy).
dopderbeck
Floating this thread -- I'm having the same problem as described in one of the earlier posts -- my test .html file works, but when I try *.html > import.mt, I get an error -- "can't open *.html: Invalid argument at mt.pl line 8" -- line 8 is "{ local $/; $content = <>;}" Thanks.
live2dive
I'm working on this as I type. Stand by. I about have it fixed. I'll post the solution in a few...
live2dive
Ok, I have been able to get the date thing fixed, and if I use mtfix.pl on a single file, it works great. If I try and use it for a series of .html files, I get an error. I have a little over 200 files, and I really don't care to do them individually.

The error I get is:

CODE
Can't open *.html: Invalid argument at mtfix.pl line 7


That line reads:

CODE
while (<>) { # for each file on the command line
dopderbeck
Ok, here, I think, is the problem: the Windows NT command line interpreter doesn't recognize the "*" wildcard except for a limited set of commands. (See http://www.ss64.com/ntsyntax/wildcards.html for reference). Therefore, the "*.html" syntax doesn't work if you're running a Win NT command line interpreter. So it seems you're back to square one if you don't have access to a Unix shell?
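A Unix shell is not strictly required, because Perl can expand the wildcard itself. A sketch under a few assumptions: `expand_args` is a made-up helper name, the per-file processing is left as a placeholder comment, and patterns containing spaces need extra care because `glob` splits its argument on whitespace.

```perl
use strict;
use warnings;

# Expand wildcard patterns ourselves, since cmd.exe hands "*.html"
# to the script as a literal string.
sub expand_args {
    my @args = @_;
    return map { /[*?]/ ? glob($_) : $_ } @args;
}

@ARGV = expand_args(@ARGV);

# Open each file explicitly and slurp it whole, one file per pass.
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Can't open $file: $!\n"; next };
    my $content = do { local $/; <$fh> };
    close $fh;
    # ... run the field-matching regexes on $content and print the record ...
    printf STDERR "read %d bytes from %s\n", length $content, $file;
}
```

On a Unix shell the pattern is already expanded before Perl starts, so `expand_args` simply passes the file names through.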
dopderbeck
Further problem -- I tried running it on a Unix shell offered by my ISP. Now it works using the "*" wildcard, but only slurps up the first .html file listed in my archives directory, so the import.mt file has only one entry. Do I need to do something else to get it to slurp up all the entries? (If you hear something, it's the sound of me tearing my hair out!)
Cookie
I'd like to do this same thing for 5 months worth of entries but 1) I can't get Perl to work and 2) I can't get Perl to work.

I'd really like to find someone I could send my html files to for this conversion. :D

If anyone is willing to help me out, I'm sure we can make a little deal.

Thanks!
(I sure wish I could get Perl to work...) :(
Freakdog
QUOTE (Cookie @ Nov 8 2004, 01:29 PM)
I'd really like to find someone I could send my html files to for this conversion. :D

If anyone is willing to help me out, I'm sure we can make a little deal.

I think I've got Cookie covered, folks.
robbyb
QUOTE (dopderbeck @ Oct 24 2004, 08:15 PM)
Therefore, the "*.html" syntax doesn't work if you're running a Win NT command line interpreter.  So it seems you're back to square one if you don't have access to a Unix shell?

Am I really left finding a Unix shell? How would I find one?

Can anyone help? Maybe convert my files for me?

Contact me, if necessary, at rob.beuthling [at] gmail [dot] com


Thanks.
fanzing
Greetings!

I'm trying to use Apakuni's version of Anakin's script, but I'm having a bit of difficulty in adjusting one or two bits of it to fit my template. I was hoping someone here could help me figure out exactly what to put in my version of the script.

I'll need to modify the code in the section after this:
CODE
# locate the fields we need using regex
# some matches may include newlines


Now, here is the relevant portion of my template:

CODE
<a name="more"></a>


<span class="posted">Posted <span class="poster">by <a href="mailto:fanzing@fanzing.com?subject=Monitor Duty">Michael Hutchison</a>
</span> at April 21, 2005 11:15 AM

    | <a href="http://pub45.ezboard.com/ffanzingforumfrm1">Respond</a>  <br /></span>



I know, mine's a bit wacky, especially since I have it displaying the nickname instead of the username (and then we all use our full names as our nicknames). That's okay, I know it will use that name when it creates the user accounts during importing; we'll survive. :-)

By the way, you can ignore that last "respond" part; that's just the link to our forum.

If there's anything here that doesn't work, I can use site-wide find-and-replace to clean it up via my Dreamweaver program.
jmax
So far, everything works for me except for the minor little thing of recovering the text.

Anyway, I get the following error message:

Unrecognized character \xCA at mt_recover_html.pl line 70.

I don't know enough about perl to figure out how to debug this.

Any ideas?
je-b
Sorry for bringing up this rather old thread again (and if an easier way of importing the individual archives has been found, just let me know), but I really don't get it:

I downloaded and installed ActivePerl and tried to follow anakin's instructions as well as possible, yet when I place the modified mt_recover_html.pl file in my archives directory and enter the commands in the command prompt nothing really happens: a window pops up for a split second and that's all.
I have no clue about Perl - so if there's anyone to help me out a bit...

Any help greatly appreciated!
I have 400+ entries and I'd really love to see them back there again. :(
Thanks in advance!
taurianthebull
Hello Everyone!
Thanks a lot to you guys for creating and sharing such a useful script for the sake of the whole community.

Unfortunately, I have just run into the same issue you guys had. I tried using anakin513's script and it worked!
But it only gives me the first single entry and not all of them!
I have PHP files generated by MT 3.2, and my pages contain many comments, but it looks like the script finds only the first entry and nothing after it!

Any ideas why it is happening?


Please advise,
Thanks.
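Each regex in the script returns only its first match in a file, so a page holding several entries yields a single record. One way around that is to cut the page into per-entry pieces first. A sketch, assuming each entry starts with a numeric anchor such as `<a name="000008"></a>` as in the sample posted earlier in the thread; `split_entries` and the sample page are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: cut one page into per-entry chunks.
sub split_entries {
    my ($content) = @_;
    # Split just before each numeric entry anchor; the lookahead
    # keeps the anchor inside the chunk it introduces.
    my @chunks = split /(?=<a name="\d+"><\/a>)/, $content;
    # Drop anything before the first entry (page header and so on).
    return grep { /<a name="\d+"><\/a>/ } @chunks;
}

# Made-up page with two entries.
my $page = <<'HTML';
<div class="blog">
<a name="000008"></a>
<span class="title">Linky Love</span>
<p>First body</p>
<a name="000009"></a>
<span class="title">Second Post</span>
<p>Second body</p>
</div>
HTML

my @titles;
for my $entry (split_entries($page)) {
    # Run the usual field regexes on one entry at a time.
    my ($title) = ($entry =~ m|<span class="title">(.+?)</span>|);
    push @titles, $title if defined $title;
}
print "TITLE: $_\n" for @titles;
```

Anchors such as `<a name="more">` or `<a name="comments">` are not numeric, so they do not start a new chunk.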