inverse regexp

Jean-Yves Tinevez - 25 Jul 2008 18:46 GMT
I was wondering whether anyone has written or come across a
function that inverts regexp.

For instance, let's suppose I want to extract tokens out of a
string. I can do that with regexp:

>> str = 'Johan and Mary';
>> rex = '(?<husband>\S+) and (?<wife>\S+)';
>> s = regexp(str,rex,'names')

s =

   husband: 'Johan'
      wife: 'Mary'

Now I would like to do the inverse transformation, and have
a function that would do this:

n = iregexp(rex,s)

and would output

n = 'Johan and Mary'

Has anyone ever come across something like this? Or does
anyone have an idea of how to do it?
jy
Walter Roberson - 25 Jul 2008 18:59 GMT
>>> str = 'Johan and Mary';
>>> rex = '(?<husband>\S+) and (?<wife>\S+)';

>Now I would like to do the inverse transformation, and have
>a function that would do this:
>n = iregexp(rex,s)
>and would output
>n = 'Johan and Mary'

This is impossible to do in the general case: the regular
expressions to be matched upon can include dynamic decision
elements that might (for example) base their decisions upon
random numbers.

Even not including such elements, sub-matches and backwards
references make this fairly complicated. For example, the regex
might have been

(?<husband>\S+([aeiou])\S+) and (?<wife>\S+$2\S+)

meaning that the pair matches only if the names share the same
vowel. It doesn't take much effort to make this more complicated
such that the proper reconstruction of the strings would require a
pattern-aware state machine.

If you are willing to restrict to very simple patterns, then
you can build the necessary routine relatively easily with a few
string replacements. Well, except for complications such as if
someone decided to make trouble and entered patterns right into
what was supposed to be the plain-text string str...
Signature

  "I was very young in those days, but I was also rather dim."
  -- Christopher Priest
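
For the restricted case Walter describes (only literal text and simple named tokens), the "few string replacements" might look roughly like the sketch below. It assumes that no token body contains parentheses and that the literal parts of the pattern contain no regexp escapes; the variable names are illustrative, not from the thread.

```matlab
% Sketch: rebuild a string from a pattern of simple named tokens.
% Assumes each token looks like (?<name>...) with no parentheses inside.
rex = '(?<husband>\S+) and (?<wife>\S+)';
s   = struct('husband', 'Johan', 'wife', 'Mary');

% Locate every named token in the pattern itself.
[tok, st, en] = regexp(rex, '\(\?<(\w+)>[^()]*\)', 'tokens', 'start', 'end');

% Substitute from the last token to the first so that the start/end
% indices of the tokens still to be replaced remain valid.
n = rex;
for k = numel(tok):-1:1
    name = tok{k}{1};
    n = [n(1:st(k)-1), s.(name), n(en(k)+1:end)];
end
disp(n)
```

Plain indexing is used instead of regexprep for the substitution so that characters such as $ or \ in the token values are not interpreted as replacement operators.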

Jean-Yves Tinevez - 25 Jul 2008 19:33 GMT
roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
message <g6d4al$mij$1@canopus.cc.umanitoba.ca>...

> If you are willing to restrict to very simple patterns, then
> you can build the necessary routine relatively easily with a few
> string replacements. Well, except for complications such as if
> someone decided to make trouble and entered patterns right into
> what was supposed to be the plain-text string str...

Okay, so we can restrict the regexp string to contain only
named tokens like in the example above, and nothing else.

I have tried to do it using regexp itself, to parse the
initial regexp string. I have been unsuccessful so far, for it
only takes the largest possible string out of it, instead of
multiple small ones.

Do you have any tips for this problem?

Cheers
jy
Praetorian - 25 Jul 2008 20:39 GMT
On Jul 25, 12:33 pm, "Jean-Yves Tinevez" <tinevez.spamprotect...@mpi-
cbg.de> wrote:
> rober...@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
> message <g6d4al$mi...@canopus.cc.umanitoba.ca>...
[quoted text clipped - 19 lines]
> Cheers
> jy

I'm not quite sure whether this will help you but look at the 'split'
option in the regexp documentation. The example that combines both the
'match' and 'split' options illustrates how to reconstruct the
original string. I think this is a relatively new option, I'm using
R2007b.

HTH,
Ashish.
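
A minimal sketch of the 'match' plus 'split' idea mentioned above, assuming a MATLAB release that supports the 'split' option. The pattern \S+ is only an example; the point is that interleaving the two outputs gives back the original string.

```matlab
% Sketch: 'split' returns the text between matches, so interleaving it
% with 'match' reconstructs the input string.
str = 'Johan and Mary';
[matched, between] = regexp(str, '\S+', 'match', 'split');

% 'between' has one more element than 'matched'; pad 'matched' with an
% empty string so both rows have the same length, then read column-wise.
pieces  = [between; [matched, {''}]];
rebuilt = [pieces{:}];
disp(rebuilt)
```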
Donn Shull - 26 Jul 2008 04:07 GMT
I don't know what the ultimate problem that you are trying
to solve is. But I am guessing that since you are using
regular expressions with the token option you are solving a
parsing problem. And since you are interested in reversing
the parsing you are probably trying to solve a text
generation problem. I am wondering if the real problem that
you are trying to solve is a translation problem. If that
is the case I would suggest looking at template engines as
a way to rewrite the information contained in your tokens.
If you need more information on this idea let me know.

Donn
Jean-Yves Tinevez - 26 Jul 2008 10:29 GMT
Thank you all for these very interesting answers! Here is
why I am trying to do it:

The goal I am pursuing here is just a portable filename
parsing system. Trying to analyze the data of a biological
screen, we have about 5e4 images to analyze. Each of them has
a naming scheme made of a pattern with tokens. For instance:

<run_nbr>_YI<chromosome> - <plate_nbr>-<row><col> - f<field> - c<channel>.tiff

Using regexp I can get all of the token values automatically.
Now I am trying to do the converse problem, that is: trying
to rebuild the filename using just the regexp filter and the
tokens, so that for instance, I could get all filenames for
a request such as "get me all field images for this plate,
this row, this col...".
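
One way such a request could be sketched: substitute the known token values into the regexp and use the result as a filter on the list of filenames. The token formats (\d+, \w+, [A-Z]) and the sample filenames below are assumptions made up for illustration; the thread does not say what each token actually looks like.

```matlab
% Sketch: turn "this plate, this row, this col" into a filename filter.
% The token formats and the file list are hypothetical.
rex = ['(?<run_nbr>\d+)_YI(?<chromosome>\w+) - (?<plate_nbr>\d+)-' ...
       '(?<row>[A-Z])(?<col>\d+) - f(?<field>\d+) - c(?<channel>\d+)\.tiff'];
files = {'12_YIA2 - 53-B07 - f1 - c1.tiff', ...
         '12_YIA2 - 53-B07 - f2 - c1.tiff', ...
         '12_YIA2 - 53-C01 - f1 - c1.tiff'};

% Parsing direction: named tokens of one file.
s = regexp(files{1}, rex, 'names');

% Query direction: replace the requested tokens by their literal values.
% The values here contain no regexp special characters; otherwise they
% would need escaping first.
query = struct('plate_nbr', '53', 'row', 'B', 'col', '07');
qrex  = rex;
names = fieldnames(query);
for k = 1:numel(names)
    qrex = regexprep(qrex, ['\(\?<' names{k} '>[^()]*\)'], query.(names{k}));
end

% Keep the files that match the specialised pattern from start to end.
hits = files(~cellfun('isempty', regexp(files, ['^' qrex '$'], 'once')));
disp(hits(:))
```

Tokens left out of the query keep their original sub-pattern, so they act as wildcards.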
Jason Breslau - 26 Jul 2008 03:49 GMT
Hi Jean-Yves,

Interesting problem.

To solve the simple case, in which there are only named tokens and
literal strings, you can use regexprep to reconstruct the match, but as
Ashish pointed out, you will need 'split' to recreate the portions of
the initial string that are not matched.

To make things simpler, you can restrict your pattern to named tokens
that do not contain nested parentheses:

>> str = 'Johan and Mary';
>> rex = '(?<husband>\S+) and (?<wife>\S+)';
>> s = regexp(str,rex,'names');
>> regexprep(rex, '\(\?<(\w*)>.*?\)', '${s.($1)}')

ans =

Johan and Mary

See how this fails with a pattern using nested parentheses:

>> str = 'Johan and Mary';
>> rex = '(?<husband>[a-z]([aeiou](?=[a-z])[^aeiou])*) and (?<wife>\S+)';
>> s = regexpi(str,rex,'names')

s =

    husband: 'Johan'
       wife: 'Mary'

>> regexprep(rex, '\(\?<(\w*)>.*?\)', '${s.($1)}')

ans =

Johan[^aeiou])*) and Mary

To handle nested parentheses, you need to use a recursive pattern:

>> str = 'Johan and Mary';
>> rex = '(?<husband>[a-z]([aeiou](?=[a-z])[^aeiou])*) and (?<wife>\S+)';
>> s = regexpi(str,rex,'names');
>> levelN = '\(([^()]|(??@levelN))*\)';
>> regexprep(rex, '\(\?<(\w*)>([^()]|(??@levelN))*\)', '${s.($1)}')

ans =

Johan and Mary

-=>J
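
To see the recursive part of the pattern above in isolation, here is a small sketch that applies levelN to a string with nested parentheses. The test string is made up; the intent is that each outermost group, including anything nested inside it, comes back as one match.

```matlab
% Sketch: levelN matches one balanced parenthesised group.
% (??@levelN) is a dynamic expression: MATLAB evaluates the variable
% levelN when the engine reaches that point, so the pattern can refer
% to itself and descend one nesting level at a time.
levelN = '\(([^()]|(??@levelN))*\)';
txt    = 'f(a(b)c) + g(d)';
groups = regexp(txt, levelN, 'match');
disp(groups)
```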
Bruno Luong - 26 Jul 2008 10:21 GMT
Jason Breslau <tendiamonds@mathworks.com> wrote in message

>  >> str = 'Johan and Mary';
>  >> rex = '(?<husband>[a-z]([aeiou](?=[a-z])[^aeiou])*) and (?<wife>\S+)';
>  >> s = regexpi(str,rex,'names');
>  >> levelN = '\(([^()]|(??@levelN))*\)';
>  >> regexprep(rex, '\(\?<(\w*)>([^()]|(??@levelN))*\)', '${s.($1)}')

Impressive! This kind of code reminds me now why I never get
around to understand fully regexp syntax.

Bruno
Jason Breslau - 26 Jul 2008 18:56 GMT
"Bruno Luong" <b.luong@fogale.fr.findthecountry> wrote in
message <g6eq9v$cii$1@fred.mathworks.com>...

> Impressive! This kind of code reminds me now why I never get
> around to understand fully regexp syntax.
>
> Bruno

I'm sorry if I turned you off of it.  Regular Expressions
are really fun, and can be useful, too, if used properly.

The recursive example I used was modified from an example in
Jeffrey Friedl's "Mastering Regular Expressions", which is a
fantastic book on regular expressions.  I couldn't recommend
it more: http://regex.info/

-=>J
Donn Shull - 26 Jul 2008 19:41 GMT
Here is a solution that is a little like swatting flies
with a sledge hammer.

Download antlrworks from http://www.antlr.org/download and
add the jar to your classpath. This will give you access to
the stringtemplate template engine.

You can use a template to solve your problem with the
following code:

% create filename template
t = org.antlr.stringtemplate.StringTemplate('$run_nbr$_YI$chromosome$ - $plate_nbr$-$row$$col$ - f$field$ - c$channel$.tiff');

% fake token data
run_nbr = '7';
chromosome = 'A2';
plate_nbr = '53';
row = '45';
col = '87';
field = 'myfield';
channel = '45';

% use template to reconstruct filename
t.setAttribute('run_nbr', run_nbr);
t.setAttribute('chromosome', chromosome);
t.setAttribute('plate_nbr', plate_nbr);
t.setAttribute('row', row);
t.setAttribute('col', col);
t.setAttribute('field', field);
t.setAttribute('channel', channel);

% use filename
fileName = char(t);
disp(fileName);

% reset template to use again
t.reset;

The first line defining the template is one line. The
potential advantage to this approach is that it may be
easier to maintain in the future.
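
If adding a jar is not an option, the same template idea can be sketched in plain MATLAB with strrep. This is a stand-in for the StringTemplate engine, not its API; it reuses the $name$ placeholder convention and the fake token data from the code above.

```matlab
% Sketch: fill $name$ placeholders from a struct, without any Java library.
template = '$run_nbr$_YI$chromosome$ - $plate_nbr$-$row$$col$ - f$field$ - c$channel$.tiff';
tokens = struct('run_nbr', '7', 'chromosome', 'A2', 'plate_nbr', '53', ...
                'row', '45', 'col', '87', 'field', 'myfield', 'channel', '45');

% Replace each placeholder by the value of the field of the same name.
fileName = template;
names = fieldnames(tokens);
for k = 1:numel(names)
    fileName = strrep(fileName, ['$' names{k} '$'], tokens.(names{k}));
end
disp(fileName)
```

strrep treats both the placeholder and the value as literal text, so no escaping is needed.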
Bruno Luong - 26 Jul 2008 19:47 GMT
"Jason Breslau" <tendiamonds@mathworks.com> wrote in message
<g6fofo$drn$1@fred.mathworks.com>...

> The recursive example I used was modified from an example in
> Jeffrey Friedl's "Mastering Regular Expressions", which is a
> fantastic book on regular expressions.  I couldn't recommend
> it more: http://regex.info/

Thank you Jason, I'll keep the reference in mind.

Bruno
 