Wednesday, 28 August 2013

Multi-line Matching in Python

I've read all of the articles I could find and even understood a few of
them, but as a Python newbie I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest out of an
application-specific log file. Each line begins with a timestamp, which I
can match, and I can define two things to identify what I want to capture:
some partial content, and a string that marks the end of what I want to
extract.
My issue is multi-line entries. In most cases every log entry is terminated
with a newline, but some entries contain SQL that may have newlines within
it, which creates extra lines in the log.
So, in a simple case I may have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013
11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER =
(ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item
where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists
(select 1 from maximo.invvendor where (exists (select 1 from
maximo.companies where (( contains(name,' $AAAA ') > 0 )) and
(company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum
= item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select
value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM'))
order by itemnum asc (execution took 2083 milliseconds)
This all appears as one line, which I can match with this:
re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However, in some cases there may be line breaks in the SQL, and I still
want to capture the whole entry (and potentially replace the line breaks
with spaces). I am currently reading the file a line at a time, which
obviously isn't going to work for those entries, so...
Do I need to process the whole file in one go? The files are typically 20 MB
in size. How do I read the entire file and iterate through it looking for
single-line or multi-line blocks?
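One possible way to avoid loading the whole file is to keep reading line by line and treat any line that does not start with a timestamp as a continuation of the previous entry. The sketch below assumes every entry starts with a timestamp shaped like the sample above; the names ENTRY_START and iter_entries are made up for illustration.

```python
import re

# Assumed timestamp shape, e.g. "[8/21/13 11:30:33:557 PDT]"; adjust to the real log.
ENTRY_START = re.compile(
    r'^\[\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2}:\d{3} [A-Z]{3}\]'
)

def iter_entries(lines):
    """Yield one string per log entry, joining continuation lines with spaces."""
    current = []
    for line in lines:
        line = line.rstrip('\r\n')
        # A new timestamp closes the entry collected so far.
        if ENTRY_START.match(line) and current:
            yield ' '.join(current)
            current = []
        current.append(line)
    if current:
        yield ' '.join(current)

# Usage (file name is hypothetical):
# with open('SystemOut.log') as fh:
#     for entry in iter_entries(fh):
#         ...  # apply the single-line regex to entry
```

Because each yielded entry is already a single line, the existing single-line pattern could be applied to it unchanged.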
How would I write a multi-line RegEx that would match the whole entry
whether it is on one line or spread across multiple lines?
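For the regex route, a rough sketch: read the file into one string and combine re.MULTILINE (so ^ matches at every line start) with re.DOTALL (so . also matches newlines), using a non-greedy .*? to stop at the first terminator. The simplified date pattern and the name find_entries are assumptions for illustration, not a tested solution for the real logs.

```python
import re

# ^ anchors at each line start (MULTILINE); . crosses newlines (DOTALL);
# .*? stops at the first "milliseconds)" after the timestamp.
ENTRY = re.compile(
    r'^\[\d{1,2}/\d{1,2}/\d{2} .*?milliseconds\)',
    re.MULTILINE | re.DOTALL,
)

def find_entries(text):
    """Return each matched entry with embedded line breaks replaced by spaces."""
    return [re.sub(r'\r?\n', ' ', m.group(0)) for m in ENTRY.finditer(text)]

# Usage (file name is hypothetical):
# with open('SystemOut.log') as fh:
#     entries = find_entries(fh.read())
```

One caveat with this pattern: if an entry has no "milliseconds)" at all, the match keeps going into later entries until it finds one, so it only behaves well when every timestamped entry of interest ends with the terminator.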
My overall goal is to parameterize this so I can use it for extracting log
entries that match different patterns of the starting string (always the
start of a line), the ending string (where I want to capture to) and a
value that is between them as an identifier.
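To make the goal concrete, here is a minimal sketch of what such a parameterized helper might look like. It assumes the start is given as a regex, the identifier and ending are plain strings, and the whole file fits in memory; the function name extract and its parameters are hypothetical.

```python
import re

def extract(text, start_pattern, identifier, end_text):
    """Return entries that begin with start_pattern at a line start,
    contain identifier, and are cut off just after end_text."""
    # Offsets of every line that begins a new entry.
    starts = [m.start() for m in
              re.finditer('^(?:' + start_pattern + ')', text, re.MULTILINE)]
    results = []
    for begin, stop in zip(starts, starts[1:] + [len(text)]):
        chunk = text[begin:stop]
        end = chunk.find(end_text)
        if end == -1:
            continue  # entry has no terminator
        captured = chunk[:end + len(end_text)]
        if identifier in captured:
            # Flatten embedded line breaks into spaces.
            results.append(re.sub(r'\r?\n', ' ', captured))
    return results

# Example call with values taken from the sample entry above:
# extract(text, r'\[\d{1,2}/\d{1,2}/\d{2} ', 'BMXAA6720W', 'milliseconds)')
```

Splitting on the start pattern first keeps a match from running past the next timestamp when an entry lacks the identifier or the ending string.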
Thanks in advance for any help!
Chris.
