苗苗 合作 交流

May 31

Perl compatible regular expression (PCRE) tutorial

Why this article?

Regular expressions are often considered a mystery one had better stay away from. A lot of people seem to prefer writing lines and lines of code to solve a problem with simple string functions rather than getting into regular expressions and doing it with just one statement. I admit that staring at a regular expression pattern like the one below (and there are worse) for the first time may actually scare someone away:

'/"[^"\\\\]*(\\\\.[^"\\\\]*)*"/'

However, regular expressions are such a powerful feature that once you have got used to them, you would probably miss them terribly if they were ever removed from PHP. PHP supports two different flavors of regular expressions. This article focuses on Perl compatible regular expressions (PCRE for short), because they are even more powerful and often said to be faster. Most of the examples are derived from questions asked in a PHP forum, so I hope they do not seem too artificial.

Content:
The PCRE functions and what they are typically used for
Writing regular expressions on your own
Evaluating the replacement argument

The PCRE functions and what they are typically used for

The following is only a brief description of what the different PCRE functions do. For a complete description including some more optional arguments, see the manual.

preg_grep (pattern, array)
This function extracts array elements that match the given pattern. It takes the array to investigate as an argument and returns an array consisting of the matching array elements, indexed with the keys from the input array.

preg_match (pattern, string[, matches])
Although it can take an optional third argument that saves the complete match (and the matches of sub-patterns enclosed in parentheses, if there are any) as an array for later use, preg_match() is most commonly used for validation purposes, e.g. to validate user input such as email addresses. It returns 1 if a match is found, 0 otherwise.
The corresponding simple string functions would be strstr() or stristr().

preg_match_all (pattern, string, matches)
Similar to preg_match() except that it continues searching until the end of the string, storing all matches in a multi-dimensional array. By default the first array element, indexed with 0, holds an array with all complete matches in the order they were found, and the following array elements hold arrays containing the matches of the sub-patterns, from left to right within the pattern. It returns the number of complete matches.

preg_replace (pattern, replacement, string[, limit])
This function replaces all occurrences of the pattern in a given string or array with what is passed to it as the replacement argument. It returns the modified string or array. The pattern and replacement arguments may be arrays as well. If both are arrays, each element of the pattern array is replaced with its counterpart in the replacement array, so the first pattern is replaced with the first replacement (this refers to the order in which they appear within the arrays, which is not necessarily the numeric index). The optional fourth argument allows you to limit the number of replacements in case you do not want all occurrences to be replaced. The corresponding simple string functions would be str_replace() or strtr().

preg_replace_callback (pattern, function, string)
Similar to preg_replace(), but it takes a function name as the second argument and replaces each match with the return value of that function. You will find an example of how to use preg_replace_callback() in the section Evaluating the replacement argument.

preg_split (pattern, string)
With preg_split() you can extract parts of the investigated string that are delimited by the pattern. It returns an array containing these substrings. The corresponding simple string function would be explode().

preg_quote (string[, delimiter])
This function inserts a backslash in front of every character that has a special meaning in the regular expression syntax. It is useful to modify a string e.g.
from user input in order to use it in a regex pattern, thus making sure every character has its literal meaning and does not cause unexpected results or errors because it is interpreted as a special character. The second argument is an additional character that needs to be escaped, which would usually be the delimiter (see below) you use. A similar simple string function would be addslashes(), which escapes quotes and backslashes.

Pattern syntax

All preg functions except preg_quote() take the pattern as the first argument. The pattern consists of delimiters, the actual search pattern inside them, and optional modifiers at the end. A simple and really silly pattern passed e.g. to preg_match() might look like this:

"/apple/i"

This would match, case-insensitively, any string that contains the letters 'a', 'p', 'p', 'l', 'e' in exactly this order, so it does nothing that could not be achieved with simple string functions like stristr().

Delimiters

The delimiters enclose the actual pattern, separating it from the modifiers that may follow. Any character that is not alphanumeric, not whitespace and not a backslash may be used as the delimiter. The forward slash, as in the above example, is fairly common. Since you need to escape any occurrence of the delimiter within the pattern, it can sometimes be convenient to choose another delimiter that occurs less frequently in the pattern.

Search pattern

Single characters

Any single character matches exactly one occurrence of this character, unless it has a special meaning within the regex pattern. These signs with special meaning, often referred to as meta-characters, are explained below. If you want to find any of them literally, you need to escape it with a backslash.
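When the text to be matched literally comes from somewhere else, the escaping can be left to preg_quote() as described above. The following is only a minimal sketch; the input string and the variable names are made up for illustration.

```php
<?php
// Hypothetical user input that happens to contain several meta-characters.
$userInput = 'price (USD): $9.99?';
$subject   = 'Total price (USD): $9.99? Yes.';

// Without escaping, "(", ")", "$", "." and "?" keep their special meaning,
// so the pattern does not find the literal text.
$unquoted = preg_match('/' . $userInput . '/', $subject);

// The second argument of preg_quote() also escapes the delimiter, here "/".
$quoted = preg_match('/' . preg_quote($userInput, '/') . '/', $subject);

var_dump($unquoted); // int(0)
var_dump($quoted);   // int(1)
```

The unquoted version still compiles, which is what makes this kind of mistake easy to overlook.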
The backslash itself is particularly awkward to handle: since patterns are strings in which the backslash is a meta-character as well, you need to escape it once for the regex machine and then escape each of these two backslashes again within the string (this may not be 100% accurate from a technical point of view, but I found it a good way to visualize what is going on). Therefore, a regex for a Windows-style path might look like this:

"/C:\\\\Windows\\\\Temp/"

Character classes

By enclosing them in square brackets, you can define character classes that hold a variety of characters; a class matches one occurrence of any of the characters inside the brackets. You can also use ranges such as [0-9], which would be equivalent to [0123456789]. In addition, you can negate a character class by inserting a ^ right after the opening square bracket. This would match a character that is not listed.

Predefined character classes

There are some predefined character classes in PCRE:

. the dot matches any character except the newline by default

You can use these predefined character classes in character classes you define yourself, e.g. [a-z\d] if you want something similar to \w, but without the underscore and with lowercase letters only. If you use the dot in a character class, it will lose its special meaning.

Quantifiers

Quantifiers allow you to define how often the previous character or character class may occur. Whenever you want the quantifier to refer to more than one element, you have to group these elements with parentheses.

? - 0 or 1 occurrences

Additionally, you can directly specify different lower and upper limits by enclosing them in curly braces. Examples:

{2} - exactly 2 occurrences

Alternation

The vertical bar indicates a choice between the parts of the pattern on either side of it.
Example:

"/apple|pear|banana/"

Parentheses

Parentheses are used for two purposes. The first one is to group parts of the pattern, either to apply quantifiers to a sequence of characters or character classes, or to limit alternation to a certain part of the pattern. Examples:

"/Hello+/"
"/(Hello)+/"

The first pattern would match "Helloooooo" and the second one "HelloHelloHello".

"/Hi|Hello/"
"/H(i|ello)/"

Here both patterns would match either "Hi" or "Hello", but if we omitted the parentheses in the latter, it would match "Hi" or "ello". The second use of parentheses is to mark sub-patterns you want to reuse. This is described in the section about backreferences later on.

Assertions

These signs do not consume any characters; they only tie the pattern (or sub-patterns) to specific positions in the investigated string, like an anchor.

^ and $ - These signs mark the beginning and end of the string (or of a line if used with the m-modifier)

You may wonder whether it is necessary to escape the dollar sign when using variables in your regex pattern, e.g. like this:

"/\$var/";

The answer is no, you actually should not. This is because it is a two-step process: before the pattern is passed to the regex compiler, the PHP parser evaluates the variable. Escaping the dollar sign would be the same as enclosing the pattern in single quotes: the variable would not be evaluated. The regex machine would therefore receive it as it is, and the pattern would be interpreted as "end of the string, followed by 'var'".

\b - Word boundary (unless within a character class, where it stands for a backspace character)

Unlike some other languages, PHP does not distinguish between word boundaries at the beginning and at the end of a word. A word boundary is simply the position where a "non-word-character", be it a space, a comma or a special character, sits next to a "word-character", or vice versa. Caution: If you develop for a non-English site, problems may occur with non-ASCII letters such as accented characters etc.
being interpreted as "non-word-character". There are a few more assertion characters that are less frequently used: \B for not a word boundary, \A for the beginning of the string, and \z and \Z for the end of the string (these make sense in combination with the m-modifier and ^ and $, in case you need additional assertion characters that do not match the beginning and end of each line).

Modifiers

The most common modifiers are the following:

i
Matches case-insensitively. Instead of our first example, you could just as well write:

"/[aA][pP][pP][lL][eE]/"
"/(a|A)(p|P)(p|P)(l|L)(e|E)/"

but this does not really increase readability.

s
By default the dot matches all characters except a newline character. If the s-modifier is set, it matches the newline as well.

m
When the m-modifier is set, ^ and $ match the beginning and end of each line in a multiline string. As said before, they are just anchors and do not consume anything, so "/$^/" would not match at a line break even with the m-modifier set, since there is a newline character between the end of one line and the beginning of the next.

U
Matches "ungreedy", i.e. each sub-pattern consumes as little as possible to make the whole pattern match. This applies to all sub-patterns. Another way to match ungreedy is to insert a question mark after the quantifier (see the section on ungreedy matching below for an example). Since this only applies to the preceding sub-pattern, it allows more fine-grained control of greedy and ungreedy matching.

e
The e-modifier evaluates the replacement argument passed to preg_replace() as PHP code and replaces the pattern with the result. Since it does not apply to the other PCRE functions, it is deferred to a later section. Note that this modifier was deprecated in PHP 5.5 and removed in PHP 7.0; preg_replace_callback() does the same job.

x
The x-modifier allows you to extend a pattern over several lines, arrange it in a pleasant way, and comment it. Our discouraging first example rewritten:

'/
" # doublequote
[^"\\\\]* # optional sequence of anything except doublequote or backslash
# (i.e. "normal" content of a text string)
( # start of sub-pattern
\\\\. # a backslash followed by any character
# (i.e. escape sequence)
[^"\\\\]* # followed by optional "normal" content again
)* # the sub-pattern is optional itself, but may repeat
" # closing doublequote
/x'
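To see that the commented version behaves like the compact one, it can be applied to a small piece of text. This is only a sketch: the sample content is invented, and the comments inside the pattern deliberately avoid the delimiter and the single quote, since either would end the pattern or the PHP string early.

```php
<?php
// Hypothetical sample: two doublequoted strings, one containing escaped quotes.
$content = 'echo "plain"; echo "She said: \"Hello!\"";';

$pattern = '/
    "              # opening doublequote
    [^"\\\\]*      # normal content: anything except doublequote or backslash
    (              # start of sub-pattern
        \\\\.      # escape sequence: a backslash followed by any character
        [^"\\\\]*  # followed by optional normal content again
    )*             # the sub-pattern is optional, but may repeat
    "              # closing doublequote
    /x';

preg_match_all($pattern, $content, $matches);
print_r($matches[0]);
```

With this input, $matches[0] holds the two strings including their enclosing doublequotes.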
Writing regular expressions on your own

With what we have up to now, you should be able to work through most regular expressions you find somewhere, sign by sign, to understand what they are doing. When you start writing your own, it is often helpful to break the problem down into pieces. Let's say we wanted to validate an email address (as a user enters it in a form, not the complete specification including aliases and so on). So for a start we define:
In a regex pattern these parts could be expressed as:
^[-_.a-zA-Z0-9]+
@
[-a-zA-Z0-9]+
\.
[a-zA-Z]{2,6}$
Putting this together, we get:
"/^[-_.a-zA-Z0-9]+@[-a-zA-Z0-9]+\.[a-zA-Z]{2,6}$/"
Or slightly shorter (but possibly slightly less efficient, too) when we decide to match case-insensitively:
"/^[-_.a-z0-9]+@[-a-z0-9]+\.[a-z]{2,6}$/i"
But with this, we would reject valid addresses, because we have not yet considered subdomains or additional suffixes as in 'someone@domain.co.uk'. Therefore, we break down part 3 further, regarding the occurrence of a dot as "special", since it separates subdomains and domain, which make up the "normal" parts. The pattern would be: "normal", optionally followed by "special" and "normal" again, where the optional part may repeat. Or as our new part 3:
[-a-z0-9]+(\.[-a-z0-9]+)*
Now our pattern looks like this:
"/^[-_.a-z0-9]+@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i"
Treating the first part the same way, to reject email addresses with a leading dot or a sequence of dots, our pattern becomes:
"/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i"
<?php
// We assume the email address a user submitted has been extracted from the
// request variables and assigned to $email.
// Since capitalization does not matter anyway, we may decide to set it to lowercase right away.
$email = strtolower($email);
if (!preg_match("/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/", $email)) {
    echo "Sorry, invalid email address.";
} else {
    echo ":)";
}
?>
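A quick way to check the step-by-step reasoning is to run the final pattern against a few addresses that each exercise one of the rules. The helper name looks_like_email() and the sample addresses are made up for this sketch; the pattern is the last one developed above.

```php
<?php
// Hypothetical helper wrapping the final pattern from the text.
function looks_like_email($email)
{
    $pattern = '/^[-_a-z0-9]+(\.[-_a-z0-9]+)*@[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]{2,6}$/i';
    return preg_match($pattern, $email) === 1;
}

$samples = array(
    'someone@domain.co.uk',   // subdomain plus suffix
    'first.last@example.com', // single dot in the local part
    '.someone@example.com',   // leading dot
    'some..one@example.com',  // sequence of dots
    'someone@example',        // no dot in the domain part
);

$results = array();
foreach ($samples as $sample) {
    $results[$sample] = looks_like_email($sample);
    echo $sample, ' => ', $results[$sample] ? 'accepted' : 'rejected', "\n";
}
```

The first two samples are accepted and the last three rejected, which is the behavior the pattern was built for.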
Another example might be extracting strings in a file that contains PHP code or in a CSV file. We will restrict this to doublequoted strings. The first most simplified definition of a string could be
or (note the pattern is enclosed in single quotes to save some escaping):

'/"[^"]*"/'

Now we need to take care of escaped quotes, which are perfectly legal in a string. Again, the second part can be broken down into a "normal" part (not a doublequote) and a "special" part (a backslash followed by a doublequote). Both the "normal" part and the "special"-"normal" sequence are optional. First try:

'/"[^"]*(\\\\"[^"]*)*"/'

This will not work yet: the backslash that escapes a doublequote would be caught by the first "normal" part, leaving the doublequote out there naked, and since the "special"-"normal" sequence is optional while the doublequote at the end is not, that doublequote would be interpreted as the closing one. Therefore, we disallow both backslashes and quotes in the "normal" part:

'/"[^"\\\\]*(\\\\"[^"\\\\]*)*"/'

With what we have now, a string like "She said: \"Hello!\"" would be matched correctly, but strings that contain backslashes not followed by a doublequote, like "item 1\nitem2" or "C:\\Windows", would not be found. Thus we need to make the "special" part more general, defining it as a backslash followed by any character:

'/"[^"\\\\]*(\\\\.[^"\\\\]*)*"/'
<?php
// we assume the file has been opened and read, and its content has been assigned to $content
preg_match_all('/"[^"\\\\]*(\\\\.[^"\\\\]*)*"/', $content, $matches);
for ($i=0; $i<count($matches[0]); $i++){
echo $matches[0][$i]."<br />";
}
?>
Backreferences

Backreferences come in handy when you need to reuse parts of the match, either in the pattern itself or in the replacement argument passed to preg_replace(). Within the pattern, \\1 is the syntax to reference the first sub-pattern in parentheses, \\2 the second and so on (counting opening parentheses from left to right). In the replacement argument, you can either use the same syntax or $1, $2 and so on, which is recommended. In addition, $0 holds the complete match. In the pattern itself, you would need backreferences e.g. to find corresponding HTML tags. There may be tags without attributes, like <h1>, or with attributes, like <font face=...>, so we need to capture the first alphanumeric sequence after the opening angle bracket in order to reuse it in the closing tag.
<?php
// if tags are nested, this would always catch the innermost
preg_match_all("%<([a-z0-9]+)[^>]*>[^<]*</\\1>%i", $text, $matches);
?>
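Applied to a small piece of markup, the backreference pattern above behaves as sketched below. The sample text is invented; the pattern is written in single quotes here, so a single backslash is enough for \1.

```php
<?php
// Hypothetical sample markup; the last element has a closing tag that does not fit.
$text = '<h1>Title</h1><p class="intro">First paragraph</p><em>mismatched</strong>';

preg_match_all('%<([a-z0-9]+)[^>]*>[^<]*</\1>%i', $text, $matches);

print_r($matches[0]); // complete matches
print_r($matches[1]); // tag names captured by the first sub-pattern
```

Only the first two elements are found, because the backreference insists that the closing tag repeats whatever the first sub-pattern captured.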
Backreferences can be useful in the replacement argument as well. A popular use would be highlighting certain words in the text without changing the original capitalization. Imagine the text in question contained the word repeatedly, but differently capitalized like this: <?php $text = "something TEST A test. Test something else tEsT."; $keyword = "test"; ?> It is no problem to search case-insensitive, but if we used "test" or "TEST" in the replacement argument, we would change the original capitalization. To avoid this, we use a backreference instead. <?php
echo preg_replace("/$keyword/i", "<b>$0</b>", $text);
?>
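Run against the sample text, the replacement keeps every spelling of the keyword as it was found. A small sketch follows; wrapping the keyword in preg_quote() is an addition of this example, not part of the original code, and only matters if the keyword may contain meta-characters.

```php
<?php
$text    = "something TEST A test. Test something else tEsT.";
$keyword = "test";

// $0 in the replacement is the complete match in its original capitalization.
$pattern     = '/' . preg_quote($keyword, '/') . '/i';
$highlighted = preg_replace($pattern, '<b>$0</b>', $text);

echo $highlighted;
// something <b>TEST</b> A <b>test</b>. <b>Test</b> something else <b>tEsT</b>.
```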
Sometimes it is helpful to mark parentheses as "grouping only" if you do not need to reuse that part. To do this, insert a question mark and a colon right after the opening parenthesis:

"/H(?:i|ello)/"

Ungreedy matching

By default quantifiers behave greedily, consuming as much as possible from the string. Example:
<?php
$str = "A really <b>important word</b> in a text that contains more <b>important stuff</b>";
echo preg_replace("#<b>(.*)</b>#si", "<i>$1</i>", $str);
?>
The above code would match everything from the first opening to the last closing tag, thus changing the text to "A really <i>important word</b> in a text that contains more <b>important stuff</i>", which is probably not what we wanted. This behavior can be changed with the U-modifier or an additional question mark following the quantifier, as explained in the syntax section.

Caution: If you apply the pattern below to a non-empty string, it would match only the first character at the beginning, as expected.

"/^.+?/s"

But you might expect the next pattern to match the last character only:

"/.+?$/s"

and this is not the case; it will still match the complete string. This is because matching always starts at the leftmost position in the string and continues until the last requirement of the pattern, in this case the end of the string, is fulfilled. The beginning of the match would only be moved to the second character from the left if there was no match at the current position.

Is alternation greedy?

No, alternation is neither greedy nor ungreedy; the regex machine simply works its way through the branches of the alternation from left to right until it detects a match or the matching fails. Thus, "/http|https/" would never match 'https' completely, because it is already satisfied with the 'http' within 'https'. This would only be different if the pattern continued with more requirements, e.g.

"#(http|https)://#"

Again, 'http' would be found first, but then the regex machine would compare the 's' in 'https' against the colon in the pattern, and since this fails, try the other branch of the alternation. The best way in both examples would probably be to spell out the 'http' sequence as single characters, followed by an optional 's'.

Lookaheads

Lookaheads are special regex constructs that allow you to check whether the following characters meet certain requirements without actually capturing them within the match. The syntax for positive and negative lookaheads is

(?=...)
(?!...)
Lookaheads can often be useful, e.g. if we were off to highlight certain words, but only if they are not within html tags. A simple approach to find something unless it is within an html tag might be: We assume that if our keyword is followed by an opening angular bracket without a closing angular bracket in between, it is not within a tag. This should be true for all keywords except 'html' in a well-formed html document. A first try to modify the highlighting example: <?php
$text = "<body>TEXT text <img src=\"text.gif\"> Text</body>";
$keyword = "text";
echo preg_replace("/($keyword)([^>]*<)/i", "<B>$1</B>$2", $text);
?>
When running this, you will find that it fails to mark the second occurrence of the keyword bold. This is because it was caught within the sub-pattern [^>]* of the first match. Using a positive lookahead eliminates the problem: <?php
echo preg_replace("/($keyword)(?=[^>]*<)/i", "<B>$0</B>", $text);
?>
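Putting the two attempts side by side makes the difference visible. This sketch reuses the sample text and both patterns from above; only the variable names for the results are new.

```php
<?php
$text    = "<body>TEXT text <img src=\"text.gif\"> Text</body>";
$keyword = "text";

// First try: the second sub-pattern consumes everything up to the next "<",
// including the second occurrence of the keyword.
$consuming = preg_replace("/($keyword)([^>]*<)/i", '<B>$1</B>$2', $text);

// Lookahead: the same condition is checked, but nothing is consumed.
$lookahead = preg_replace("/($keyword)(?=[^>]*<)/i", '<B>$0</B>', $text);

echo $consuming, "\n";
// <body><B>TEXT</B> text <img src="text.gif"> <B>Text</B></body>
echo $lookahead, "\n";
// <body><B>TEXT</B> <B>text</B> <img src="text.gif"> <B>Text</B></body>
```

In both cases the keyword inside the src attribute is left alone, since a closing angle bracket follows it before any opening one.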
An example of a negative lookahead might be finding relative links in an HTML document. We assume well-formed HTML again to keep things simple, but the pattern could easily be modified to allow single quotes as well, or additional spaces.

"%href=\"(?!https?://|ftp://|mailto:|news:|javascript:|#)([^\"]+)%i"

Lookbehinds

There is a syntax for lookbehinds as well, with a less-than sign inserted after the question mark:

(?<=...)
(?<!...)

This can be useful e.g. with preg_split(). Imagine you want to split a text into sentences, and you define a sentence as something followed by a punctuation mark, followed by a space at which you would like to split the string. The punctuation mark is required, but it belongs to the sentence and therefore should not be consumed by the pattern. Lookaheads will not get us far here, since a pattern like

'/(?=[.?!])\s+/'

can never match, and when we turn it the other way round, we split at the punctuation mark and not at the space as intended:

'/[.?!](?=\s+)/'

But with lookbehinds we finally achieve what we wanted:
<?php
$text = "This is a sentence. Is there more to come? I don't think so!";
$sentences = preg_split('/(?<=[.?!])\s+/', $text);
var_dump($sentences);
?>
Keep in mind though that tasks which seem to require a "lookbehind" can sometimes be expressed with negative lookaheads just as well. Let's look at a simplified example with a list of elements separated by a semicolon and a space. Evidently, elements beginning with group= categorize the elements that follow.
<?php
$string = "group=fruits; apple; banana; group=music; jazz; pop; rock; folk; group=numbers; one; two; tree;";
$keyword = "rock";
?>
Now we would like to look up the group that precedes a given keyword. In other words, we want
Translated into a regular expression this would be: <?php
preg_match("/group=([^;]+);\s((?!group=)([^;]*;\s))*$keyword/i", $string, $match);
echo "Group of ".$keyword." is ".$match[1].".";
?>
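Wrapped in a function, the same idea can be tried with several keywords. The name find_group() is hypothetical, and compared with the pattern above this sketch uses grouping-only parentheses and preg_quote(); the negative lookahead is unchanged.

```php
<?php
// Hypothetical helper: returns the group name, or null if the keyword is not in the list.
function find_group($list, $keyword)
{
    $pattern = '/group=([^;]+);\s(?:(?!group=)[^;]*;\s)*' . preg_quote($keyword, '/') . '/i';
    return preg_match($pattern, $list, $match) === 1 ? $match[1] : null;
}

$string = "group=fruits; apple; banana; group=music; jazz; pop; rock; folk; group=numbers; one; two; tree;";

foreach (array('banana', 'rock', 'two', 'cello') as $keyword) {
    $group = find_group($string, $keyword);
    echo $keyword, ': ', $group === null ? 'not found' : $group, "\n";
}
```

The lookahead stops the repeated part as soon as the next group= element begins, so a keyword is only ever reported together with the group it follows directly.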
Evaluating the replacement argument

Say we would like to automatically enclose URLs in <a href=...> </a> tags unless they are either within HTML tags or already surrounded by <a href="http://..."> </a>. We have already used lookaheads to find something that is not within HTML tags, but adding that new requirement to the previous example would be fairly complicated, so we are going to take a look at a different technique. First we write a pattern for each unwanted case and one for the general case (note that the first one only matches if the URL is directly within the tags without any spaces; change this if you like):

<a\s[^>]+>http://\S+</a>
<[^>]+http://[^>]+>
http://\S+

To keep things simple, we assume the URL to be anything up to the next space (which may often be incorrect, but that is not what we are looking at right now). Then we combine them in an alternation, capturing both unwanted cases within parentheses, and append the e-modifier to the pattern. Since both unwanted cases would start matching at the same position in the string if they encounter the text '<a href="http://...', we need to place the one that grabs more first in order to simulate greedy behavior:

"#(<a\s[^>]+>http://\S+</a>)|(<[^>]+http://[^>]+>)|http://\S+#ie"

What actually does the trick happens in the replacement argument. The logic is: if the complete match equals one of the unwanted matches, replace it with itself (i.e. do nothing), else add the <a href=...> </a> tags around it.
<?php
// Note: this example relies on the e-modifier and therefore only runs on PHP versions before 7.0.
$text = "http://www.domain.com this was the first url.\n";
$text .= "However there is more to come <img src=\"http://www.domain.com/pic.gif\">\n";
$text .= "image path is http://www.domain.com/pic.gif and here comes one enclosed in tags: <a href=\"http://www.domain.org\">http://www.domain.com</a>";
echo preg_replace("#(<a\s[^>]+>http://\S+</a>)|(<[^>]+http://[^>]+>)|http://\S+#ie",
'"$0"=="$1" || "$0"=="$2" ? "$0" : "<a href=\"$0\">$0</a>"',
$text);
?>
If you do not feel comfortable with the ternary operator, you can use preg_replace_callback() to have a function do the evaluation. It receives the function name as second argument. You do not need to pass any arguments, each match is automatically passed as an array with complete match and matches of the sub-patterns. <?php
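function check_url($matches)
{
    // Sub-patterns that did not take part in the match may be missing from the
    // array, so fall back to an empty string before comparing.
    $unwanted1 = isset($matches[1]) ? $matches[1] : '';
    $unwanted2 = isset($matches[2]) ? $matches[2] : '';
    if ($matches[0] == $unwanted1 || $matches[0] == $unwanted2) {
        return $matches[0];
    } else {
        return '<a href="'.$matches[0].'">'.$matches[0].'</a>';
    }
}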
echo preg_replace_callback('%(<a\s[^>]+>http://\S+</a>)|(<[^>]+http://[^>]+>)|http://\S+%i',
'check_url', $text);
?>
Though you can do a lot of things with just one call to a PCRE function, it is sometimes easier to split the task into two. Imagine you wanted to convert a sequence of lines beginning with a hyphen and a space into the proper HTML list format. Example string:
<?php
$str = "some text and a list
- apples
- bananas
some more text and the second list:
- green
- red
- yellow
- purple
and more text...
But what I always wanted to tell you ...
- oh no, I forgot!";
?>
A pattern to identify the complete lists is not too hard to write, and it is easy to wrap <ul> and </ul> around them, too. However, it is difficult to reference each single line at the point where we would like to put it into <li> </li> tags. We can solve this with preg_replace_callback(). The pattern identifies the lists and passes each of them to a function that uses preg_replace() to modify the single lines within the match. This even allows us to set up an additional requirement that a list must consist of at least two lines.
<?php
function format_list($matches)
{
return "<ul>\n".preg_replace("/^-\s(.*)/m", "<li>$1</li>", $matches[0])."\n</ul>";
}
echo preg_replace_callback("/(^-\s.*$[\r\n]*){2,}/m", 'format_list', $str);
?>
But in case we skip that extra requirement, two calls to preg_replace() would do it as well: <?php
$str = preg_replace("/(^-\s.*$[\r\n]*)+/m", "<ul>\n$0\n</ul>", $str);
$str = preg_replace("/^-\s(.*)/m", "<li>$1</li>", $str);
echo $str;
?>
When using preg_replace_callback(), you can alternatively create the function on the fly. The advantage is that you do not "waste" function names, but I would usually do this only if the function is rather simple (and not worth reusing it for other purposes of course). Here is an example that replaces matches with a text followed by a sequential number. <?php
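$input = '<img src="/path/img.gif">text text text <img name="test" src="http://www.domain.com/imgage.jpg" alt="test" /> text';
// An anonymous function is used here, since create_function() was removed in PHP 8.0.
$output = preg_replace_callback("/<img[^>]*>/i", function ($matches) {
    // The static variable keeps its value between calls, one call per match.
    static $counter = 0;
    $counter++;
    return "Image ".$counter;
}, $input);
echo $output;
?>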
Troubleshooting

Compilation failure

This is often caused either by missing delimiters or by unescaped meta-characters. Choose a delimiter that is not contained in the pattern if possible, and escape it where it occurs within the pattern. Carefully escape all other meta-characters you want to match literally. Too much escaping should not hurt in most situations, though it does not increase readability. And count the unescaped parentheses, square brackets and curly braces to see whether opening and closing elements match.

No matches

If your pattern fails to match anything at all, you need to identify the point where it fails. A simple but effective way to do this is to test parts of it with preg_match_all() and print the array with the matches:
<?php
preg_match_all("$pattern", $str, $matches);
print_r($matches);
?>
Replace $pattern with the first element of your pattern. Run that and check whether it matches what you think it should. Then add the next element, and so on. Of course you do not need to do this element by element; you can always use a group of elements to save time, and cut it down further only if it fails to match.

Also check the modifiers. If your pattern contains literal sequences or letters in character classes, make sure you have the i-modifier set if capitalization may differ in the investigated string. If you apply a pattern with the ^ or $ assertion characters to a multiline string, check whether you need to set the m-modifier to make your regex work. If you use the dot as a wildcard for any character, remember that it needs the s-modifier to be set in order to match newlines.

In some situations it may be a good idea to apply your pattern to some dummy data. If it works there but fails with data from another source, there must be a difference in the data at the point where the regex machine receives them, be it invisible characters or HTML entities like &nbsp; or &amp; that you are trying to match against their literal equivalents.

One huge match instead of several small ones

Most likely this results from greedy quantifiers. If it is possible in your particular situation, use negated character classes instead of .* or .+, or switch to ungreedy matching by using the U-modifier or inserting a question mark after the quantifier.

Some common mistakes

Wrong use of negated character classes

Sometimes negated character classes are mistaken for negative lookaheads, like this:

"#^[^http://]#i"

This does not mean that a string does not begin with 'http://'; it only says that the first character is not 'h', 't', 'p', a colon or a slash.
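The difference between the two constructs shows up as soon as a string merely starts with one of the listed letters. A small sketch with made-up link targets, comparing the negated class with the negative lookahead that was probably intended:

```php
<?php
// Hypothetical link targets.
$targets = array('http://example.com/', 'page.html', 'topic.html', 'index.html');

$results = array();
foreach ($targets as $target) {
    $results[$target] = array(
        // Negated class: the FIRST character is none of h, t, p, colon or slash.
        'class'     => preg_match('#^[^http://]#i', $target),
        // Negative lookahead: the string does not START WITH "http://".
        'lookahead' => preg_match('#^(?!http://)#i', $target),
    );
}
print_r($results);
```

Both patterns reject the absolute URL and accept index.html, but the negated class also rejects page.html and topic.html simply because they begin with 'p' and 't'.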
Missing escapes

Say you want to detect whether a given filename has the extension htm or html, but forgot to escape the dot:

"/.html?$/i"

This would match all valid filenames, but it would match file.phtml or files/htm just as well.

All elements optional

An example of this may be finding numbers, allowing the formats "46" and "45.999" but ".999" as well. One could be tempted to write the pattern like this:

"/\d*(\.\d+)?/"

That does match all numbers, but unfortunately it matches anything else too, including empty strings, because none of the parts is required.

Further reading

Unfortunately I am not aware of many online tutorials etc. on PCRE apart from the PHP manual. Though some find it hard to read, the manual definitely is a good and most accurate source of information. And especially if you would like to know how regex machines work internally, and learn about optimizations and efficiency, "Mastering Regular Expressions" by Jeffrey Friedl is a great book.

March 26

Douban chief architect Hong Qiangning (洪强宁) on Douban's architecture

The original was published in Programmer (《程序员》) magazine; I have excerpted one of the more important parts of the interview to keep for reference.
---- a very short divider
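Regarding Douban's system architecture diagram: first, we make a split at the web server level, dividing the site content into dynamic content and static content. On Douban all HTML is dynamic content and all images are static content. Splitting this into two web services lets us tune each of them differently.

For dynamic content we use a mix of nginx and lighttpd: nginx does the load balancing, and lighttpd connects to the application server via SCGI. The application server is written on top of the Quixote framework.

The application server takes the user's request, analyzes the URL and, drawing on external resources such as the database, assembles an HTML page and returns it. Reading from and writing to the database is relatively slow, since the database does a great deal of I/O, so we use a cache. What we use is Memcached, a distributed in-memory cache; for example, you can use many machines, each with two gigabytes of memory. We developed our own client to use it. In addition, if a user makes a search request, we use a search engine: Xapian is an open-source search engine written in C++, which we access through a web service. Beyond that, we provide further web service interfaces to respond to user requests, for example to access a particular file. Spread is a part we added recently; some user requests can be handled by this kind of asynchronous service.

The database setup is as follows: two MySQL servers form a pair, one master and one slave, partitioned by application so that the load does not get too high. The diagram shows two pairs; in reality there are three. There is also one more slave, which serves partly as a backup and partly for data mining, because we cannot operate directly on the live data.

For the static part we use nginx as well. You may have noticed that Douban now lets users attach pictures to diary entries, so users may upload a lot of images. The solution we adopted is MogileFS, a distributed file system that also handles backups, maintains high availability, and can greatly increase I/O throughput.

As for the application server, it is written entirely in Python. We use an MVC approach. For the Controller we use Quixote: it accepts the user's request and, based on the URL, finds a specific function of the Model to execute. It is a dispatcher, and along the way it checks the user's permissions and so on. The result is then passed to the View, which renders it according to a template to produce the web page. For the View templates we used to use PTL, which is very efficient; recently we introduced Mako, a fairly modern open-source template engine. Code written with it is easier to maintain than with PTL. While adopting Mako, our engineers also did a lot of work to speed it up, and much of the Mako code today was written by people at Douban.

If you have followed Python web development frameworks, you will have noticed that Python has three fairly well-known ones: Django, Pylons and TurboGears. Pylons' default template engine is Mako.

Below that is the Model, the business modules. The core class is User: since Web 2.0 is people-centered, we certainly have a User. But people alone cannot get anything done; there also have to be things. Douban's things are Subjects, such as books, reviews, groups and so on.

For connecting to the database we use a very lightweight layer, which is also an open-source project, SQL Farm Manager. As for web services, many things inside Douban are built on web services.

September 03

Adobe Flex Builder 2 download

As of 9/3 the links are valid.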
Adobe Flex Builder 2 下载 Flex Builder 2.0 License: 1307-1581-4356-2616-4951-7949 (Commercial Version) 1307-1581-4356-2939-1231-4484 (Education Version) Charting License: 1301-4581-4356-7349-9369-3351 (Commercial Version) March 19 最近要买电视机的可以留意看一下
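Lately many people online have been selling the 32F500DN from Panasonic's Jinan factory. How good is this set really?

I had had my eye on Panasonic's 50PV all along, but then something unexpected cut in: I bought a 32F500DN online. The set was delivered to my home the day before yesterday, and after two days of testing my conclusions are as follows:

1. The 32F500DN uses an original Japanese ultra-black, ultra-fine flat picture tube (100% MADE IN JAPAN), a world apart from the Beijing Matsushita (北松) tube of the previous-generation 32F500D: higher brightness, purer and more vivid colors, a clear and transparent picture...
2. Improved sound: upgraded from the 32F500D's 7W+7W to 9W+9W, very impressive...
3. Lower power consumption: down from the 32F500D's 179W to 165W... (presumably the original tube draws less power)
4. Net weight: up from the 32F500D's 54.5KG to 58KG for the 32F500DN... (you really get what you pay for)
5. Even the plastic housing has been upgraded, which is hard to believe!
6. Compared with SONY's HR32M90, the number-two king of CRT sets: the 32F500DN has higher and brighter color saturation, red and blue stand out more, and the white is purer. Only in the three-dimensional feel of the picture does it fall short of the TRINITRON tube, and its dot pitch is not as fine as that of the SONY SUPER FINE PITCH TRINITRON tube (the 32-inch one). The industrial design of the 32F500DN is cleaner, while the HR32 looks like a crab. For picture I would give the HR32M90 96 points and the 32F500DN 95.7; for looks, the HR32 75 points and the 32F500DN 88. Note: the picture of SONY's HR32M90 is even better than that of the HR36M90 (the 36 only gets 92 points), because the 32-inch tube is made in Japan and the 36-inch tube in the USA. The Japanese really do meticulous work!
7. The picture of the 32F500D is not in the same weight class as the two above, so I do not compare it.

March 05

The complete guide to Shanghai radio frequencies (new edition of December 1, 2006, valid for 2007)

The complete guide to Shanghai radio frequencies (new edition of December 1, 2006, valid for 2007):

On Monday, wanting to listen to Premier Wen's report, I remembered that my phone has a radio function. I hurriedly plugged in the earphones, only to find that I could not even remember the frequency of China National Radio; it took some searching to come up with the answer above. I did not take in more than a few sentences of Premier Wen's report on seeking truth, being pragmatic and governing for the people, but looking at this frequency table, a feeling slowly welled up in me. With a little idle time on a Monday, my thoughts turned to days gone by.

Once upon a time, a Walkman with a radio was one of the things I most longed to own as a student (the other was a pair of Nike sneakers; sadly that wish was only fulfilled after I started working). On my 18th birthday my parents, in a fit of generosity, bought me a SONY Walkman as a present. At the time my feelings could only be described as "sweet rain after a long drought". From then on, Ye Ying's (野营) "Music Breakfast" (《音乐早餐》) and "Chinese Golden Hits" (《中文金曲馆》) woke me together with the sun every day; at sunset Pei Zi'an's (裴子安) "Australian Music Flight" (《澳大利亚音乐航班》) and the "Pond's Pop Chart" (《旁氏流行歌曲排行榜》) kept me company; and the evening ended with Xiao Fan's (小凡) 《篇篇情》. In an age with no cable TV, let alone the Internet, this plain black-and-grey box accompanied me through my entire late adolescence. It became my main tool for gathering information and my window on the outside world. On Lu Yuenong's (陆悦农) "Tonight Is Not Too Late" (《今夜不太晚》) I heard the radio play of "The First Intimate Contact" (《第一次亲密接触》), which filled me with youthful fantasies and longing for the Internet. On 《白丽音乐万花筒》 I got to know Oasis, and to this day Oasis are still my favorite. Every Sunday I waited devotedly just to hear their decadent sound; the only annoyance was that the program always started exactly when dinner was served at home. For a young man whose metabolism was at its peak, what a painful choice that was.

Times have changed, and it has long been unnecessary to wait by the Walkman to hear Oasis. Discman, MP3 player and DVD player have filled up my living space to the utmost. But my state of mind never feels as devout as it used to. As the saying goes, "a book that is not borrowed does not get read". The age of craving information turned, in the blink of an eye, into an age of being surrounded by it. My Walkman, too, has completed its historic mission and lies quietly in a drawer, and lying with it are boxes of cassette tapes carrying the voices of the singers I admired most as a student, Panda Hsiung (熊天平) and Chyi Chin (齐秦). Closing the drawer, I close the floodgate of memory along with it; that is all I will write about the Walkman.

December 01

Reposted: a little knowledge about printing

Prepress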
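露白 (showing white): a white gap. Printing paper is mostly white; when colors that should join do not meet tightly in printing or platemaking, the white of the paper shows through.
打白 (flashing): a technique from the era of screened photographic platemaking. To make up for insufficient exposure in the dark areas of a screened image, the original is moved away for one flash, or a sheet of paper is put in place for a little extra exposure, or a flash lamp is fired directly, adding dots in the shadows of the original and softening the image.
爆肥 (fattening): overeating makes you fat, of course, and the silver grains in film likewise spread their territory when they get too much light. In manual register work, a thick transparent sheet is even inserted over the photosensitive film during exposure to fatten the image.
补漏白 (color trapping): deliberately spreading ("fattening") the areas where colors meet during color separation and platemaking, to reduce the effect of misregistration.
实地 (solid): areas of flat color without halftone dots, usually meaning full coverage.
反白 (reversed out): text or lines printed as a negative image, so that what shows is the white of the paper.
撞网 (screen clash): not a fishermen's matter. In AM-screen color separation, when the screen angles are wrongly assigned or two screen angles are less than 25° apart, moiré starts to become noticeable.
飞网 (removing the screen): a screening technique in camera platemaking: after the normal exposure the screen is taken off and a short additional exposure is given to increase contrast.
狗牙 (dog's teeth): a dog's teeth are jagged. When an image has too few pixels, its edges look like dog's teeth after enlargement.
玫花点 (rosette): a screen pattern like the spots of a sika deer. A worse one is called a mat-weave pattern (席纹), and a still worse one is moiré (龟纹).
齐头 (flush): a layout instruction that takes the beginning of the lines as the baseline. Extended to imposition and binding, it means taking the head of the page as the reference.
散尾 (ragged end): a style of text setting that only requires uniform letter spacing, not aligned line ends.
蒙片 (mask): not a knockout drug. It is the masking sheet used in manual color separation; it can be exposed from film or cut from red masking film, and is used for dropping out backgrounds or correcting colors.
蓝版 (blue plate): not playing basketball, nor the B (blue) of RGB, but the C (cyan) plate of CMYK.

Printing

鬼影 (ghosting): print marks or shadows of unknown origin, mostly caused by uneven ink supply on older presses.
瓜打: not "hitting a dog with a loofah". In the letterpress era, the positioning lead spacers, lower than the type face, used by the "black hands" who set the type.
打斗 (somersault): Sun Wukong's signature trick. A perfecting press has an automatic sheet-turning device: the gripper edge is held to print the front, then the tail of the sheet is gripped to print the back, all in one go.
自反 (self-backing): a printing method that saves plates. The sheet is first printed on one side; once dry, it is turned left to right and front to back, which is called 底面自反版, and when the tail of the sheet becomes the gripper edge as it is turned, it is called 牙口反版尾. The process prints the back of the sheet without changing the plate.
飞墨 (flying ink): the press runs fast and the ink is not viscous enough, so centrifugal force makes the ink spatter.
墨线 (ink line): a guide line drawn on the plate so that it prints exactly at the sheet's register position, making the lay easy to monitor at a glance.
浮污 (scumming): the plate does not take water well and becomes ink-receptive instead, so a thin film of ink scum appears; the problem mostly lies in the wrong acidity of the fountain solution.
起炮: 炮 is the colloquial name for a cylinder. The term refers to the blanket cylinder moving away from the impression cylinder.
夹炮: too many sheets are caught between the impression cylinder and the blanket cylinder, and the safety sensor stops the press.
哪渣: ink smudges that should not be printed on the sheet; this problem also comes from the ink-water balance.
打掣: the press stops turning, mostly because of poor feeding or a double sheet triggering the safety device.
针位 (lay): not where you get an injection. It is the edge of the sheet that rests against the guides. Sheets vary in length, so color registration and trimming need the lay edge for alignment.
连晒 (step-and-repeat): a film-saving platemaking process with consecutive exposures, shifting the exposure by means of register crosses.
过底 (set-off): a term for a printing fault: the ink layer is too thick to dry in time and soils the back of the sheet lying on top of it.
车头: pronounced with a rising tone; in the Philippines drivers are called 车头. In printing, 车头 does not mean the press operator but the revolution count of the press.
石数: the term for print quantity in the era of stone lithography. One color pressed onto the sheet once counts as one 石 (stone).
二手: not second-hand goods. It is the press operator's assistant, also called 睇掣.
打稿 (proofing): not a grudge against the manuscript, but printing, on a proofing press, an advance sample of what the actual print run will look like.
飞达 (feeder): not express delivery; it is the device that feeds paper into the press.

Binding

出血 (bleed): being cut makes you bleed, of course. Printing and binding require that background colors or images on a page extend 3mm beyond the trim line; this is called bleed.
飞边: 飞 means to cut away or remove. 飞边 means trimming off the bleed margin; it is a binding term.
切斜 (skewed cut): distortion; the cut is crooked and the right-angled book turns into a rhombus, mostly caused by uneven clamp pressure on the guillotine or a misaligned paper guide.
磨光 (calendering): treating printed sheets with a calender roller so that the surface becomes smooth; a surface finishing process.
反手摺 (reverse fold): a fold on Japanese folding machines. In a 32-page fold the 4th fold must be a reverse fold.
正版: not about software. The side of the sheet carrying the first page number is called 正版, the side carrying the next page number 反版; the two together are called a set, a signature (帖) or a frame.
纸闸 (guillotine): not a gate for paper, but the machine that cuts paper.
骑马钉 (saddle stitching): a bookbinding method whose movement resembles mounting a horse. Thin books (6 signatures or fewer) are nested, placed astride an iron frame, and stapled through with wire.
猪肠卷 (roll fold): a way of folding signatures, like rolling up a rice noodle roll; with 3 upper and 2 lower 梭 a 32-page signature can be folded.
风琴摺 (accordion fold): a way of folding signatures. The folded signature opens out like a folding screen.
反封面 (reverse cover): a manual method of putting on the cover: first glue and position the back cover edge, then glue the spine and front cover edge, and finally perform a "reverse" cover movement.
毛书: not a book growing hair; it is a book block that has been sewn but not yet covered and trimmed.
笃头布 (headband): a strip of cloth at the top and bottom of a hardcover's spine that connects to the case. It adds strength and looks good.
火印: a finishing operation on hardcover covers, similar to hot foil stamping, at a relatively high humidity.

Typesetting

高调 (high key): image areas that receive a lot of light are of course bright white; the Japanese style of color separation prefers highlights without halftone dots, to stretch the tonal range of the picture.
低调 (low key): not a faint voice; it means the image is dark, also called 暗调 (dark tone).
爆机 (crash): not a terrorist incident. Insufficient memory or disk space will make the computer stall.
磅 (point): not a weight. It is the unit of measure in typesetting; the smallest unit for Latin letters is the point, and 1 inch is divided into 72 points.
级: not a social class; in the phototypesetting era it denoted type size, 4 级 being 1 mm.
号: not a trumpet; it denoted the size of metal type in the letterpress era. The largest, 特号, is 72 points; the smallest, size 8, is 5 points.
平体: not a hairstyle, but optically distorting a square character with a lens so that it becomes flattened; 平1 is 10%, 平2 is 20%, 平3 is 30%, 平4 is 40%.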
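长体: not a body shape, but condensed type; 长1 is 10% narrower, 长2 20% narrower, 长3 30% narrower, 长4 40% narrower.
喷笔 (airbrush): a color spray pen driven by compressed air, used for painting with an air jet. It was the manual way of producing color gradients before the DTP era.
字节 (byte): not a festival of writing; it is the Byte, a unit of computer machine language. 8 bits equal one byte.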