Apr 6, 2010

Regex Two Step

I spend a lot of time parsing and cleaning up text files with regular expressions, and there’s a problem that comes up fairly often that used to really puzzle me. How do you replace something that requires complicated, multi-step logic?

Everyone who has used regular expressions is pretty familiar with doing something like this:

>> "5553331234".gsub(/(\d{3})(\d{3})(\d{4})/, '(\1) \2-\3')
=> "(555) 333-1234"

And truth be told, you can do a lot of good stuff with that simple concept. The problem is that it isn’t always so easy as replace “this” with “that”. Many times you want to inject some logic into the replacement process. Thankfully we can can pass blocks to the regular expression methods in Ruby (.NET offers something similar by passing a delegate to Regex.Replace).

This allows us to do some really complicated matching pretty easily. Last week I needed to reformat hundreds of SQL statements that NHibernate was generating. Here’s what some of them looked like:

exec sp_executesql N'INSERT INTO Actions (CreatedAt, Description,
IsReversible, Name, ObjectClass, UserTransactionID) VALUES (@p0,
@p1, @p2, @p3, @p4, @p5)',N'@p0 datetime,@p1 nvarchar(9),@p2 bit,@p3
nvarchar(12),@p4 nvarchar(15),@p5 int',@p0='2010-04-03
00:36:25.1100000',@p1=N'logged
in',@p2=0,@p3=N'Authenticate',@p4=N'UsersController',@p5=5589

exec sp_executesql N'INSERT INTO UserTransactions (CreatedAt,
Description, UserID) VALUES (@p0, @p1, @p2)',N'@p0 datetime,@p1
nvarchar(30),@p2 int',@p0='2010-04-03 00:37:26.9600000',@p1=N'viewed
applicant Michael Smith',@p2=1

I needed these formatted as simple SQL statement without the exec sp_executesql with parameters nonsense. I needed to replace these:

(@p0, @p1, @p2)

With these:

@p0='2010-04-03 00:37:26.9600000', @p1=N'viewed applicant Michael
Smith',@p2=1

So it would look like this:

('2010-04-03 00:37:26.9600000', 'viewed applicant Michael Smith', 1)

Shew. As you can imagine a simple gsub like our first example wasn’t going to cut it. But, by using a block we can easily execute additional logic on each match. Here’s the code:

# Match each SQL statement that starts with exec sp_executesql
sql.gsub!(/exec sp_executesql.*?$/) do |match|
  # collect the params from the end of the statement into an array
  values = match.scan(/(@p\d)=N?(.*?)(?=(,@p\d)|$)/)

  # Remove the beginning and end of the statement
  match.gsub!(/exec sp_executesql N'/, "").gsub!(/',N'@p0.*?$/, "")

  # Replace each of the parameters with its corresponding value
  values.each { |v| match.gsub!(v[0], v[1]) }

  # Return our clean SQL statement
  match
end

You can see here that I’m using a block with gsub instead of providing a replacement string. This gives us a match object for each matching SQL statement where we can make additional replacements, and our statements end up looking like this:

INSERT INTO Actions (CreatedAt, Description, IsReversible, Name,
ObjectClass, UserTransactionID) VALUES ('2010-04-03
00:36:25.1100000', 'logged in', 0, 'Authenticate',
'UsersController', 5589)

INSERT INTO UserTransactions (CreatedAt, Description, UserID)
VALUES ('2010-04-03 00:37:26.9600000', 'viewed applicant Michael
Smith', 1)

While some additional explanation about what’s happening in the block would probably be useful, the point here is that this kind of complicated logic is possible and with very little code and hassle.

If your regular expressions are seeming impossible or overly complicated, see if you can make it happen with the regex 2 (or 3 or 4) step instead.

About
Be inspired and inspiring. Subscribe via RSS.