Hello Ladies and Gentlemen,
Is there a way to replace an accented character with its non-accented equivalent using a regular expression?
I'm asking because the LaTeX snippets for sectioning (cha, sec, sub, subs, ...) automatically generate the label associated with a newly created environment, but unfortunately the regexp used for this keeps the accented characters. Because of this, the label has to be corrected manually before LaTeX will accept it for compilation.
Is it fixable?
Xavier Cambar
On 31 May 2007, at 17:07, Xavier Cambar wrote:
Is there a way to replace an accented character with its non-accented equivalent using a regular expression?
As far as I know it would be very tricky.
I use Perl for that myself:
Write a command:
Input: Selected Text or Document
Output: Replace Selected Text
Command:
perl -e'
use Unicode::Normalize;
use utf8;
no warnings;
binmode (STDIN, ":utf8");
binmode (STDOUT, ":utf8");
while(<>){
    $_=NFKD($_);
    s/[\x{0300}-\x{0362}]//g; # combining diacritics
    s/\x{3099}//g;s/\x{FF9E}//g;s/\x{309B}//g; # Japanese voiced mark
    s/\x{309A}//g;s/\x{309C}//g;s/\x{FF9F}//g; # Japanese semi-voiced mark
    print;
}
'
You can delete the Japanese stuff. The function NFKD decomposes any character with a diacritic into its base character plus the diacritics in combining form, according to the Unicode specification. The next step simply deletes all combining diacritics. Please note that this will delete ALL diacritics, e.g. cedilla, diaeresis, acute, grave, macron, hook, ogonek etc.!
I guess you have to install the Perl library Unicode::Normalize beforehand via CPAN, but I don't know this for certain.
I don't know how to apply this to the LaTeX snippets for sectioning, but maybe my hint helps.
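For readers without Perl at hand, the same two steps (NFKD decomposition, then dropping the combining marks) can be sketched in Python. This is only an illustration and not part of the command above: the function name strip_diacritics is made up here, and it removes marks by their Unicode combining class rather than by the code point ranges used in the Perl version.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0301 (acute).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("\u00c9douard \u00e0 la cr\u00e8me br\u00fbl\u00e9e"))
# prints: Edouard a la creme brulee
```

As with the Perl command, every diacritic is removed, including ones that change the letter's identity in some languages. Letters that have no decomposition (for example ø) are left untouched.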
Best,
Hans
While I'm at it, is there also a way to escape some commands at the same time?
Typically, when I type
\lastname{Kant}'s life as a title, I'd like it to be converted to kant_s_life
But when I type \Z is the integer set (where \Z is a macro that typesets Z in mathbf)
I'd rather get z_is_the_integer_set
Does anyone have an idea of some way to achieve that? I guess the first part should be too hard, but I'm far from being an expert in regexp.
For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate
On 1. Jun 2007, at 02:34, Édouard Gilbert wrote:
While I'm at it, is there also a way to escape some commands at the same time?
Typically, when I type
\lastname{Kant}'s life as a title, I'd like it to be converted to kant_s_life
But when I type \Z is the integer set (where \Z is a macro that typesets Z in mathbf)
I'd rather get z_is_the_integer_set
Does anyone have an idea of some way to achieve that? I guess the first part should be too hard, but I'm far from being an expert in regexp.
Not entirely sure of the context -- this is a snippet regexp transformation?
Presumably something like this will do:
${1:your text} → ${1/\\\w+\{(.*?)\}|\\(.)|(\w+)|(\W+)/(?4:_:$1$2$3)/g}
Here we match:
- a command of the form \command{«value»}, or
- an escaped character of the form \«char», or
- multiple word characters, or
- multiple non-word characters
If the last branch is taken, we insert an underscore, otherwise we insert the result of the other captures (we insert all 3, but since this is an OR, only one will actually have a value).
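To see which branch fires where, here is a rough emulation of that transformation in Python. It is a sketch under assumptions: Python's re module stands in for TextMate's regexp engine, the conditional insertion (?4:_:$1$2$3) is written out as a callback, and to_label is a hypothetical name.

```python
import re

# The same four alternatives as the snippet regexp, in the same order.
PATTERN = re.compile(r"\\\w+\{(.*?)\}|\\(.)|(\w+)|(\W+)")


def to_label(title: str) -> str:
    def replace(match):
        if match.group(4) is not None:
            # Last branch taken (non-word characters): insert an underscore.
            return "_"
        # Otherwise insert $1$2$3; only one of them took part in the match.
        return "".join(group or "" for group in match.group(1, 2, 3))

    return PATTERN.sub(replace, title)


print(to_label(r"\lastname{Kant}'s life"))  # prints: Kant_s_life
print(to_label(r"\Z is the integer set"))   # prints: Z_is_the_integer_set
```

The order of the alternatives matters: \Z is tried against the \command{...} branch first, fails there because no brace follows, and is then picked up by the escaped-character branch.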
Yes, that was it. Nearly perfect for me. I just need to lowercase the capital letters. Would that be something like:
${1/\\\w+\{(.*?)\}|\\(.)|(\w+)|(\W+)/(?4:_:\L$1\L$2\L$3)/g}
?
May I suggest including this in every section/subsection/paragraph snippet of the LaTeX bundle? Provided its behaviour matches other users' expectations, of course.
Édouard
On 2. Jun 2007, at 13:39, Édouard Gilbert wrote:
Yes, that was it. Nearly perfect for me. I just need to lowercase the capital letters. Would that be something like:
${1/\\\w+\{(.*?)\}|\\(.)|(\w+)|(\W+)/(?4:_:\L$1\L$2\L$3)/g}
?
Yes, though you only need one \L (it lowercases until it sees a \E).
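A toy model of that rule, in Python, may make the "until it sees a \E" part concrete. This is not how TextMate implements format strings: apply_case_escapes is a hypothetical helper that only understands \L, \U and \E, and it works on text in which the captures have already been filled in.

```python
import re


def apply_case_escapes(text: str) -> str:
    """Apply \\L (lowercase) and \\U (uppercase) until the next \\E."""
    out = []
    mode = None  # None, "L" or "U"
    # Splitting on a capturing group keeps the escapes as separate tokens.
    for token in re.split(r"(\\[LUE])", text):
        if token in (r"\L", r"\U"):
            mode = token[1]
        elif token == r"\E":
            mode = None
        elif mode == "L":
            out.append(token.lower())
        elif mode == "U":
            out.append(token.upper())
        else:
            out.append(token)
    return "".join(out)


# \L$1\L$2\L$3 with $1 = "Kant" and the other captures empty:
print(apply_case_escapes(r"\LKant\L\L"))    # prints: kant
# A single leading \L gives the same result:
print(apply_case_escapes(r"\LKant"))        # prints: kant
# \E switches the lowercasing off again:
print(apply_case_escapes(r"\LKant\E Life")) # prints: kant Life
```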
May I suggest including this in every section/subsection/paragraph snippet of the LaTeX bundle? Provided its behaviour matches other users' expectations, of course.
Seeing how it already tries to do a smart transform, I don’t see why not go all the way.
Though I’ll wait to see if Haris has any comments.
Sorry for the late response. It seems fine to me at first glance; I'll try it out over the next couple of days and then commit it.
Haris Skiadas Department of Mathematics and Computer Science Hanover College
Thanks.
While I'm at it, may I suggest changing the leading prefix (such as sec: or sub:) for paragraphs and subparagraphs? They are currently set to “ssub:”, the same as for subsections, unless there is a good reason for that, such as some standard.
And if someone has a suggestion about how to deal with accents in references... I'd like to keep them ASCII all the way, but having them coloured would be an acceptable alternative.
Édouard
Looking at this again, the current commands in the bundle use this:
${1/\\\w+\{(.*?)\}|\\(.)|(\w+)|([^\w\\]+)/(?4:_:\L$1$2$3)/g}
which differs from the above only in the (\W+) part. So I am confused: Which cases are not covered by the already existing commands? I.e. what is the label, and how is it transformed in the two cases?
As for the non-ASCII characters, we could probably create a command that would scan the entire document and try to fix all the sectioning commands, including adding labels if there are none, and changing the labels appropriately.
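A first rough cut of such a scanning command could look like the Python sketch below. Everything in it is an assumption made for illustration: the function names, the label prefixes in PREFIXES, and the simplistic slug rule, which is not the snippet transformation discussed above. Titles containing nested braces are not handled.

```python
import re

# Label prefixes are assumptions made for this sketch, not the bundle's list.
PREFIXES = {"chapter": "cha", "section": "sec",
            "subsection": "sub", "subsubsection": "ssub"}
SECTIONING = re.compile(
    r"\\(chapter|section|subsection|subsubsection)\*?\{([^{}]*)\}")


def slug(title: str) -> str:
    """Lowercase the title and turn every non-alphanumeric run into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def add_missing_labels(source: str) -> str:
    lines = source.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        match = SECTIONING.search(line)
        if match is None:
            continue
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if r"\label{" in line or following.lstrip().startswith(r"\label{"):
            continue  # already labelled, leave it alone
        kind, title = match.groups()
        out.append(r"\label{%s:%s}" % (PREFIXES[kind], slug(title)))
    return "\n".join(out)


document = "\\section{Main Results}\nSome text.\n\\subsection{Proof}\n\\label{sub:proof}"
print(add_missing_labels(document))
```

Only the first heading gains a label here (\label{sec:main_results}); the subsection already has one on the following line and is left as it is.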
Haris Skiadas
On 30. Jun 2007, at 04:00, Charilaos Skiadas wrote:
Hey, welcome back Charilaos!
[...]
Looking at this again, the current commands in the bundle use this:
${1/\\\w+\{(.*?)\}|\\(.)|(\w+)|([^\w\\]+)/(?4:_:\L$1$2$3)/g}
which differs from the above only in the (\W+) part. So I am confused: Which cases are not covered by the already existing commands? I.e. what is the label, and how is it transformed in the two cases?
That’s because I did commit the new transformations, though I had to make a minor change compared to what was discussed here.
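One input on which the last branch makes a visible difference is a command that follows a space. The Python sketch below compares a (\W+) last branch with a last branch that excludes the backslash; it assumes that Python's re module behaves like TextMate's engine for these patterns, and the names DISCUSSED, COMMITTED and to_label are made up for the comparison. Whether this is the case that motivated the change is not stated in the thread.

```python
import re

DISCUSSED = re.compile(r"\\\w+\{(.*?)\}|\\(.)|(\w+)|(\W+)")
COMMITTED = re.compile(r"\\\w+\{(.*?)\}|\\(.)|(\w+)|([^\w\\]+)")


def to_label(pattern, title):
    def replace(match):
        if match.group(4) is not None:
            return "_"
        # Emulates \L$1$2$3: lowercase whichever capture matched.
        return "".join(g or "" for g in match.group(1, 2, 3)).lower()

    return pattern.sub(replace, title)


title = r"The life of \lastname{Kant}"
print(to_label(DISCUSSED, title))  # prints: the_life_of_lastname_kant_
print(to_label(COMMITTED, title))  # prints: the_life_of_kant
```

With (\W+), the space before \lastname and the backslash itself are swallowed together as one run of non-word characters, so the command branch never gets a chance to match. Excluding the backslash from the last branch stops the run at the space.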
As for the non-ASCII characters, we could probably create a command that would scan the entire document and try to fix all the sectioning commands, including adding labels if there are none, and changing the labels appropriately.
If I understand correctly, they cause a LaTeX compile error if left in? In that case I think we should transform them to some dummy placeholder character (having to post-process the document sounds tedious).
When we can do recursive replacements (in the format string) we can add some humongous regexp to make it smarter with respect to the accents (i.e. effectively strip them, but in practice just handle all known accents).
On Jun 29, 2007, at 11:34 PM, Allan Odgaard wrote:
Hey, welcome back Charilaos!
Thanks! Sorry for the lack of support this month, my absence from emails kind of happened along the way, I hadn't planned it at the beginning. I have to say though that it was the most relaxing thing I've done in a while ;).
If I understand correctly, they cause a LaTeX compile error if left in? In that case I think we should transform them to some dummy placeholder character (having to post-process the document sounds tedious).
That sounds reasonable, and underscore would do in that case. The main point of the command I was talking about would be to automatically add labels to an old document that, for whatever reason, did not have any labels for its sectioning commands. (I am currently doing a lot of editing of old LaTeX documents, or ones automatically generated from Word, and am thinking of tools to make life easier with such files.) So the cleanup of existing labels would be just a side effect.
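The placeholder idea is simple enough to sketch in a few lines of Python. This is an illustration only, with a hypothetical function name; it replaces each non-ASCII character individually and does not try to recover the base letter.

```python
import re


def ascii_label(label: str, placeholder: str = "_") -> str:
    """Replace every non-ASCII character with the placeholder."""
    return re.sub(r"[^\x00-\x7F]", placeholder, label)


# "sec:théorème", written with escapes so the input is unambiguous.
print(ascii_label("sec:th\u00e9or\u00e8me"))  # prints: sec:th_or_me
```

The result is safe for LaTeX but less readable than stripping the accents, which is why the smarter accent handling mentioned in the thread would still be worth having.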
When we can do recursive replacements (in the format string) we can add some humongous regexp to make it smarter with respect to the accents (i.e. effectively strip them, but in practice just handle all known accents).
That would indeed be nice!
Haris Skiadas