> Import from Blog Archive HTMLs, DB is gone, can I import the Archives?
anakin513
post Dec 11 2003, 02:11 PM
Post #1





Group: Members
Posts: 11
Joined: 27-October 02
Member No.: 5,380



My server had a primary hard drive failure. The generated HTML was on another drive and is fine, but the MySQL database was on the main drive and is gone.

Is there any way to import the monthly archives?

Does anyone have a script to parse the HTML and convert it to an MT import file?

Thanks.
anakin513
post Dec 19 2003, 08:21 AM
Post #2





Group: Members
Posts: 11
Joined: 27-October 02
Member No.: 5,380



Never mind. I wrote this. It works :)

CODE
#!/usr/bin/perl
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
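while (<>) { # for each file on the command line
# NOTE: the while (<>) above has already read the file's first line into $_,
# so the slurp below only picks up what comes after that first line.
# read in the rest of the file to $content, line feeds and all
# using slurp mode
{ local $/; $content = <>;}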
# locate the fields we need using regex
# some matches may include newlines
($author) =($content =~ m|<div class="posted">\s+(.+?)\s+/|s);
($title) = ($content =~ m|<span class="title">(.+)</span>|);
($text) = ($content =~ m|<p>(.+)<a name="more">|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|<div class="date">(.+)</div>|);
($time) = ($content =~ m|Posted\sat\s(.+)<br />|);

# Read in all the comments as one big block of text
($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
# Break up each comment into its own element in an array
@comm = split (/<\/div>\n/, $comments);

# convert the date to MM/DD/YYYY hh:mm:ss
$datetime = "$date $time";
$parsed = ParseDate($datetime);
$datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

# Strip out the paragraph tags, MT will add them later anyways.
$text =~ s|\<p\>||g;
$text =~ s|\</p\>||g;
$more =~ s|\<p\>||g;
$more =~ s|\</p\>||g;

# printout the fields in the proper format
print "AUTHOR: $author\n";
print "TITLE: $title\n";
print "DATE: $datetime\n";
print "-----\n";
print "BODY:\n$text\n";
print "-----\n";
print "EXTENDED BODY:\n$more\n";

foreach (@comm) { # For every comment in our array, print out the necessary formatting.
       if (length $_ > 7) { # this is here to ignore the last comment record.
               ($CText) = ($_ =~ m|<div class="comments-body">\n(.+)\n\<span|s);
               ($CDate) = ($_ =~ m|</a>\son\s(.+)</span>|);
               ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
               ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
               ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
               $CText =~ s|\<p\>||g;
               $CText =~ s|\</p\>||g;
               $parsed = ParseDate($CDate);
               $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

               print "-----\n";
               print "COMMENT:\n";
               print "AUTHOR: $CAuthor\n";
               print "URL: $CURL\n";
               print "DATE: $CDate\n";
               print "$CText\n\n";
       }
}
print "--------\n";
}
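One thing to watch in the loop above: `while (<>)` reads the first line of each file before the slurp runs, so anything on that first line is never searched. A minimal sketch of reading each file whole instead, assuming the file names arrive in @ARGV (the helper name `slurp_file` is made up for illustration and is not part of the original script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the entire contents of one file as a string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;                 # undefined record separator = slurp mode
    my $content = <$fh>;      # reads from the first byte to end of file
    close $fh;
    return $content;
}

# One iteration per file named on the command line.
for my $file (@ARGV) {
    my $content = slurp_file($file);
    printf "%s: %d bytes read\n", $file, length $content;
    # ... run the field regexes against $content here ...
}
```

The regexes from the script would then run against `$content` exactly as before.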
earlax
post Dec 22 2003, 07:03 AM
Post #3





Group: Members
Posts: 3
Joined: 18-December 03
Member No.: 18,864



Wow, elegant... so that just reads the HTMLified pages in and spits out the data in the correct import format. How very Perl of you :) I hope I never need this, but thanks just the same as if you had saved my butt too!
mavenglobe
post Dec 22 2003, 05:03 PM
Post #4





Group: Members
Posts: 33
Joined: 9-October 01
From: Boston-area, Massachusetts
Member No.: 2,447



How exactly does this work? Where does one put this file and run it?
arvind
post Dec 22 2003, 09:02 PM
Post #5





Group: Members
Posts: 1,622
Joined: 23-June 03
From: Abu Dhabi, UAE
Member No.: 12,575



You should publicise this script more. I've seen many cases where it could be vital. Like mavenglobe said, where do you put it? How do you configure it, set permissions, etc.? ;)


anakin513
post Dec 29 2003, 09:41 AM
Post #6





Group: Members
Posts: 11
Joined: 27-October 02
Member No.: 5,380



In order to get this to work, you'll need to install Perl.

You can get Perl free from www.perl.com; there are Windows/Mac/Unix/Linux distributions.

Once you've installed Perl, copy that script into your favorite text editor and save it as mtfix.pl (or whatever). You'll need to use your individual archives for the import. Take a look at one of the HTML files to figure out what you need to change in the Perl script.

The script reads the HTML from the start of the file, looking for a match.

For example:
CODE
($title) = ($content =~ m|<span class="title">(.+)</span>|);

Here we are looking for the title of the entry. The '(.+)' represents the text we are going to capture; my titles sit in a span with class "title". All you need to do to get it to work for yours is figure out what surrounds each item you are looking for.

A little more complicated is the way I got the author information out of the comments. For me, the author name was a link.
CODE
<span class="comments-post">Posted by: <a target="_blank" href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>


So this was the code:
CODE
($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);


It grabs everything between 'Posted by:' and the closing /a> into the temp variable and then looks in the temp variable for the name and URL. Notice how I had to use \s to represent spaces for the 'Posted by:' search. You may have to use a \n for new lines ;)
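The greedy (.+) patterns work on the sample line, but `href=\"(.+)\"` would capture too much if another quoted attribute came after href. Here is a sketch of the same extraction with patterns that stop at the first closing quote or tag. The helper name `parse_comment_footer` is made up for illustration, and the sample line is the one quoted above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample comment footer, copied from the example line in this post.
my $line = '<span class="comments-post">Posted by: <a target="_blank" href="http://jason.sdf1.net">Jason</a> at December 18, 2003 09:43 AM</span>';

# Hypothetical helper: returns (author, url, date) from one footer line.
sub parse_comment_footer {
    my ($footer) = @_;
    # [^"]+ stops at the first closing quote, [^<]+ at the first tag.
    my ($url)    = $footer =~ m|href="([^"]+)"|;
    my ($author) = $footer =~ m|<a\b[^>]*>([^<]+)</a>|;
    # .+? is non-greedy, so it stops at the first </span>.
    my ($date)   = $footer =~ m|</a>\s+at\s+(.+?)</span>|;
    return ($author, $url, $date);
}

my ($author, $url, $date) = parse_comment_footer($line);
print "AUTHOR: $author\nURL: $url\nDATE: $date\n";
```

Whether your template says "at" or "on" before the date is something to check in your own HTML.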

Okay, so you think you've made the changes you need and want to go ahead and try it out. Drop to a command prompt, with your saved script and the HTML files in the same folder, and type:

CODE
perl mtfix.pl 0000001.html


Or whatever file name you choose. I recommend trying it out with just one of your files. It's going to print the results to the screen, so you can read through and make sure that it's finding the right text. If not, go back and figure out why. If it's perfect, then use this:

CODE
perl mtfix.pl *.html > import.mt
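A caveat on the wildcard: Unix shells expand *.html before Perl starts, but the Windows command prompt does not, so there the script receives the literal string "*.html" and fails with "Can't open *.html". A minimal sketch of expanding the patterns inside the script (the helper name `expand_args` is made up for illustration; note that Perl's glob splits patterns on whitespace, so file names containing spaces would need extra care):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: expand any wildcard patterns that the shell
# passed through unexpanded.
sub expand_args {
    my @args = @_;
    # glob() returns the file names matching a pattern; an argument
    # with no wildcard characters comes back unchanged.
    return map { glob($_) } @args;
}

# Put this before the while (<>) loop so <> sees real file names.
@ARGV = expand_args(@ARGV);
print "$_\n" for @ARGV;
```

On a shell that already expands wildcards this is harmless, since plain file names pass through as they are.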


Hope that works out for you. If you need more help, do a Google search on Perl and regular expressions. Learn something :D
scsmith
post Mar 12 2004, 11:18 PM
Post #7





Group: Members
Posts: 147
Joined: 12-October 02
Member No.: 5,052



This happened to me last weekend. I can't thank you enough for that Perl script, Anakin. Talk about a lifesaver.

I discovered a few more useful things while I was recovering stuff. (I still am, but I think I've automated everything I possibly could at this point.)

Long post on the complete recovery process at my site.
StarryMom
post May 29 2004, 03:50 PM
Post #8





Group: Members
Posts: 50
Joined: 15-February 02
From: Ohio
Member No.: 476



Is there a way to do this without having shell access? My hosting company is very strict about allowing shell access.
StarryMom
post May 29 2004, 06:11 PM
Post #9





Group: Members
Posts: 50
Joined: 15-February 02
From: Ohio
Member No.: 476



OK, I installed Perl on my computer (lol, I have PHP, why not Perl, right?), but the error I receive is

Can't open *.html: no such file or directory at mtfix.pl line 6

which is

while (<>) { # for each file on the command line

(at least I'm pretty sure that is line 6, lol)
StarryMom
post May 29 2004, 07:42 PM
Post #10





Group: Members
Posts: 50
Joined: 15-February 02
From: Ohio
Member No.: 476



Actually, would there be a way to do this using just the monthly archives?
anakin513
post Jun 7 2004, 07:16 AM
Post #11





Group: Members
Posts: 11
Joined: 27-October 02
Member No.: 5,380



QUOTE (StarryMom @ May 30 2004, 03:42 AM)
actually would there be a way to do this using just the monthly archives?

It's written primarily for individual archives. You can try passing the monthly archives to it without the "> import.mt" part; watch the output on the screen and see if it looks right.
You will need to save all of the pages from your site locally in order for the script to read them.
Someone else was having problems with an extra blank comment being added to the end of each record. Simply play with this line:
CODE
if (length $_ > 7) { # this is here to ignore the last comment record.

Change the 7 to a higher number, like 20 or 35. Don't go too high: this is a threshold between junk and an actual comment. The text being measured includes the name of the poster, the date, and their text, plus any email and URLs that may have been entered. Something under 40 is probably a safe bet.
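If tuning the number gets fiddly, another option is to test for the comment markup instead of the length. This is only a sketch, assuming every real comment chunk contains a div with class "comments-body" as in my template; the helper name `real_comments` and the sample data are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the chunks that contain a comment body.
sub real_comments {
    my @chunks = @_;
    return grep { m|<div class="comments-body">| } @chunks;
}

# Made-up sample: one real comment plus the leftover junk that the
# split on </div> tends to leave at the end.
my @comm = (
    qq{<div class="comments-body">\n<p>Nice post!</p>\n<span class="comments-post">Posted by: <a href="http://example.com">Ann</a></span>},
    qq{\n\n      \n},
);

my @kept = real_comments(@comm);
print scalar(@kept), " real comment(s)\n";
```

In the script this would replace the length check inside the foreach loop.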
StarryMom
post Jun 7 2004, 07:52 AM
Post #12





Group: Members
Posts: 50
Joined: 15-February 02
From: Ohio
Member No.: 476



ok here is a sample of the file I am trying to work with,

CODE
<div class="blog">    



<a name="000008"></a>
<span class="title">Linky Love</span>

<p>As my first mini post using MT, here are a few of my favorite links! Give them love, comment on THEIR blogs, in their guestbooks, etc. *muah*</p>

<p><a href="http://ambienine.vectorstar.net/ambienine/">My Ambie Kitten *purrr*</a>, <a href="http://belle.rose-madder.net/blog.htm">Kim</a>, <a href="http://asnightfalls.net">Crystal</a>, <a href="http://eosrising.net/">CG</a>, <a href="http://www.love-buzz.org/">Sarah</a>.</p>



<div class="posted">Posted by Sarah at <a href="http://onestarrynight.com/musings/archives/000008.php#000008">07:59 AM</a></span>

</div>


Mind you this is a VERY VERY old entry lol

When I do it, one of two things happens:

1. It puts only the entry's title in the import file, and that's it.

CODE
AUTHOR:
TITLE: Linky Love
DATE:
-----


2. It doesn't work at all.

I swear it hates me lol
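For what it's worth, that output is what the original patterns would give on this markup: the title pattern matches, but the author pattern expects whitespace right after `<div class="posted">`, the body pattern expects an `<a name="more">` anchor, and there is no `<div class="date">` in the sample. Below is a sketch of patterns fitted to the sample above. The helper name `parse_entry` is made up, the body text is shortened, and the entry date is not handled because it does not appear in the snippet:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the sample entry from this post.
my $html = <<'HTML';
<a name="000008"></a>
<span class="title">Linky Love</span>

<p>First paragraph.</p>

<div class="posted">Posted by Sarah at <a href="http://onestarrynight.com/musings/archives/000008.php#000008">07:59 AM</a></span>
HTML

# Hypothetical helper: pull title, body, author and time out of one entry.
sub parse_entry {
    my ($content) = @_;
    my %e;
    ($e{title})  = $content =~ m|<span class="title">(.+?)</span>|;
    # Body: everything between the title span and the "posted" div.
    ($e{body})   = $content =~ m|<span class="title">.+?</span>\s*(.+?)\s*<div class="posted">|s;
    ($e{author}) = $content =~ m|<div class="posted">Posted by (.+?) at |;
    # Time: the link text inside the "posted" div.
    ($e{time})   = $content =~ m|<div class="posted">.+?>([^<]+)</a>|;
    return \%e;
}

my $entry = parse_entry($html);
print "AUTHOR: $entry->{author}\nTITLE: $entry->{title}\nTIME: $entry->{time}\n";
print "BODY:\n$entry->{body}\n";
```

A monthly archive holds many entries per file, so the file would also have to be split into entries before running something like this on each piece.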
apakuni
post Jun 19 2004, 12:14 AM
Post #13





Group: Members
Posts: 7
Joined: 27-March 03
Member No.: 9,231



I try to run this script, but, I get an error on this line ...

#!/usr/bin/perl
use Date::Manip;


This is the error.
QUOTE
$  perl mtfix.pl index.php
Can't locate Date/Manip.pm in @INC (@INC contains: /etc/perl /usr/lib/perl5/site_perl/5.8.2/i686-linux /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.2/i686-linux /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.2/i686-linux /usr/lib/perl5/5.8.2 /usr/local/lib/site_perl .) at mtfix.pl line 2.
BEGIN failed--compilation aborted at mtfix.pl line 2.


I tried to install the file using CPAN.pm. But it does not exist. Thoughts?
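For anyone stuck at the same point who cannot get Date::Manip installed, the date conversion can be done in plain Perl when the format is known. This is a sketch, assuming the date string looks like the "December 18, 2003 09:43 AM" example quoted earlier in the thread; the helper name `mt_date` is made up, and any other date layout would need a different pattern:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %MONTH = (
    January => 1, February => 2,  March    => 3,  April    => 4,
    May     => 5, June     => 6,  July     => 7,  August   => 8,
    September => 9, October => 10, November => 11, December => 12,
);

# Hypothetical helper: "December 18, 2003 09:43 AM" -> "12/18/2003 09:43:00".
# Returns nothing if the string is not in that exact shape.
sub mt_date {
    my ($text) = @_;
    my ($mon, $day, $year, $hour, $min, $ampm) =
        $text =~ /^\s*([A-Za-z]+)\s+(\d{1,2}),\s+(\d{4})\s+(\d{1,2}):(\d{2})\s*([AaPp][Mm])\s*$/
        or return;
    my $mnum = $MONTH{ ucfirst(lc $mon) } or return;
    $hour = 0 if $hour == 12;            # 12 AM becomes 00; 12 PM is restored below
    $hour += 12 if uc($ampm) eq 'PM';    # convert to a 24-hour clock
    return sprintf '%02d/%02d/%04d %02d:%02d:00', $mnum, $day, $year, $hour, $min;
}

print mt_date('December 18, 2003 09:43 AM'), "\n";
```

The result would stand in for the ParseDate/UnixDate pair in the script.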
apakuni
post Jun 19 2004, 10:09 AM
Post #14





Group: Members
Posts: 7
Joined: 27-March 03
Member No.: 9,231



Disregard, I got it working.
apakuni
post Jun 19 2004, 06:38 PM
Post #15





Group: Members
Posts: 7
Joined: 27-March 03
Member No.: 9,231



OK, I found some minor issues with Anakin513's script. It seemed to drop dates (at least on my archive) and it excluded categories and other useful stuff found in the RDF for each post. So, I tweaked his script and wrote a few directions you may find useful.

Thanks to fxn at irc.freenode.net#perl for the help.

Here it is ...
CODE
#!/usr/bin/perl
#########################################
# Script:      mt_recover_html.pl
# Written By:  anakin513 (http://jason.sdf1.net/)
# Tweaked By:  apakuni (http://apakuni.com)
#
# NOTE: This script assumes you want to publish
# your recovered posts and sets comments and pings
# on by default.  Use this script at your own risk!
#########################################


# Instructions
#########################################
# Save this file as mt_recover_html.pl
# Run Test:
# $ perl mt_recover_html.pl somemtfile.html
# Returns results to screen for review.  If all is well,
# run the script against the entire archive.
# $ perl mt_recover_html.pl *.html > mt_import_file.txt
# Import into MT or WP from there.
#########################################


use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
while (<>) { # for each file on the command line
# read in entire file to $content, line feeds and all
# using slurp mode
{ local $/; $content = <>;}
# locate the fields we need using regex
# some matches may include newlines
($author) =($content =~ m|<span class="posted">Posted by (.+?) at |s);
($title) = ($content =~ m|<h3 class="title">(.+)</h3>|);
($text) = ($content =~ m|</h3>(.+)<a name="more">|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|dc:date="(.+)" />|);
($excerpt) = ($content =~ m|dc:description="(.*)"|);
($primary_cat) = ($content =~ m|dc:subject="(.*)"|);

# Read in all the comments as one big block of text
($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
# Break up each comment into its own element in an array
@comm = split (/<\/div>\n/, $comments);

# convert the date to MM/DD/YYYY hh:mm:ss
# ($time is no longer captured in this version; the dc:date value is used on its own)
$datetime = $date;
$parsed = ParseDate($datetime);
$datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

# Strip out the paragraph tags, MT will add them later anyways.
$text =~ s|\<p\>||g;
$text =~ s|\</p\>||g;
$more =~ s|\<p\>||g;
$more =~ s|\</p\>||g;

# printout the fields in the proper format
print "AUTHOR: $author\n";
print "TITLE: $title\n";
print "STATUS: Publish\n";
print "ALLOW COMMENTS: 1\n";
print "CONVERT BREAKS: 0\n";
print "ALLOW PINGS: 1\n";
print "PRIMARY CATEGORY: $primary_cat\n";
print "DATE: $datetime\n";
print "-----\n";
print "BODY:\n$text\n";
print "-----\n";
print "EXTENDED BODY:\n$more\n";
print "-----\n";
print "EXCERPT:\n$excerpt\n";
print "-----\n";

foreach (@comm) { # For every comment in our array, print out the necessary formatting.
      if (length $_ > 7) { # this is here to ignore the last comment record.
              ($CText) = ($_ =~ m|<div class="comments-body">(.+)<span class="comments-post">|s);
              ($CDate) = ($_ =~ m|</a> at (.+)</span>|);
              ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
              ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
              ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
              $CText =~ s|\<p\>||g;
              $CText =~ s|\</p\>||g;
              $parsed = ParseDate($CDate);
              $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

              print "-----\n";
              print "COMMENT:\n";
              print "AUTHOR: $CAuthor\n";
              print "URL: $CURL\n";
              print "DATE: $CDate\n";
              print "$CText\n\n";              
             
      }
}
print "--------\n";
}
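If the dc:date value in your RDF is an ISO 8601 timestamp, it can also be converted without Date::Manip. This is a sketch under that assumption: the helper name `iso_to_mt_date` and the sample value are made up for illustration, and the time zone offset is simply ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn an ISO 8601 value such as
# "2004-06-19T18:38:00-05:00" into MM/DD/YYYY hh:mm:ss.
# Anything after the seconds (the zone offset) is ignored.
sub iso_to_mt_date {
    my ($iso) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $iso =~ /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/
        or return;
    return "$mo/$d/$y $h:$mi:$s";
}

print iso_to_mt_date('2004-06-19T18:38:00-05:00'), "\n";
```

Check one of your own files first to confirm what the dc:date attribute actually contains.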
mistaroblivion
post Jun 27 2004, 11:02 AM
Post #16





Group: Members
Posts: 1
Joined: 26-June 04
Member No.: 25,313



I'm trying to use this script, and after much trial and tribulation, I finally got the thing to run. However, when I attempt to run it with
CODE
*.html > import.mt
I get the error:
QUOTE
Can't open *.html: Invalid argument at mtfix.pl line 6.


This is the while line that was spoken of earlier:
CODE
while (<>) { # for each file on the command line


I'm not a perl guy at all. Any perl I know now, I've learned in the past 24 hours by messing with this script. Some help would be greatly appreciated.

Thanks.