However compelling the title of this blogpost, I attended some great talks today. Not all of them were among the best (anyone who attended the opening keynote would probably agree *ahem* ;)), but I’ll highlight some of the stuff I heard today.
Ian Barber – Document Classification in PHP
I think Ian is a talented speaker. It was very easy listening and a very clear presentation of the subject, which probably isn’t the easiest to talk about. Basically, document classification can be done automatically by labeling your documents based on their contents. Whether the document is an entire book, a blog post or even one line of text, it can be labeled (or tagged) and therefore linked to a taxonomy, which gives the document a relational value with respect to the tag (or label). I’m not going to explain it in-depth here, but a few notes might be interesting:
If you ever run into the problem of document classification (is a document of a certain type, with an email being either “spam” or “ham” posing a great example), there are some keywords to look for: the evaluation of rules based on an “F-beta” or “E-measure”, the Vector Space Model and “TFIDF”, dimensionality reduction by stemming and stop words, the χ (chi) square, overfitting and K nearest neighbour. The TFIDF and K nearest neighbour stuff was really interesting. I think that if you get Ian’s slides from slideshare (he’ll probably put them up there), they will be pretty self-explanatory, since the slides contain a basic example and graphical representations of some models. BTW, Xapian was mentioned as well. Ring a bell, anyone? 🙂
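To make the TFIDF and K nearest neighbour keywords a bit more concrete, here is a minimal sketch in Python (not taken from Ian's slides). It assumes whitespace tokenizing, a plain tf × log(N/df) weighting and cosine similarity; the function names and the tiny training set are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace (no stemming, no stop words).
    return text.lower().split()


def build_idf(token_lists):
    # Inverse document frequency: log(N / number of documents containing the term).
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(tokens, idf):
    # Sparse TF-IDF vector; terms never seen in training are ignored.
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, training, k=3):
    # Label a text by majority vote among its k most similar training documents.
    token_lists = [tokenize(doc) for doc, _ in training]
    idf = build_idf(token_lists)
    vectors = [(vectorize(tokens, idf), label)
               for tokens, (_, label) in zip(token_lists, training)]
    query = vectorize(tokenize(text), idf)
    nearest = sorted(vectors, key=lambda pair: cosine(query, pair[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training data, purely for illustration.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("win money now cheap", "spam"),
    ("meeting notes for the project", "ham"),
    ("lunch with the project team", "ham"),
    ("notes from the team meeting", "ham"),
]

if __name__ == "__main__":
    print(classify("buy cheap pills", TRAINING))
    print(classify("project meeting notes", TRAINING))
```

A real classifier would add the stemming and stop word steps mentioned above before vectorizing, and would evaluate the result on documents it has not seen during training.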
Sebastian Bergmann – PHP Compiler Internals
First of all, Sebastian gave a very decent basic outline of the PHP architecture: how the Zend Engine relates to the PHP core, the SAPI and the modules. Nothing really new there. Next was a brief introduction on how compilers (in general) work, by lexing, parsing and bytecode generation. This wasn’t really new to me, but for a newbie to compiler internals, this definitely was a welcome start of the presentation.
The lexing (or scanning, or tokenizing) part basically means defining the spelling of your language, and naming each part of the language for later use. For instance: the “if” keyword is tokenized to an internal T_IF token. The PHP team switched from flex to re2c for this.
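As a rough illustration of what a lexer does, here is a toy tokenizer in Python. It is a sketch only: the real PHP scanner is generated by re2c, and the rule list below is a small made-up subset. The token names are modelled on PHP's (T_IF, T_VARIABLE and so on), except CHAR, which is a hypothetical catch-all for single characters.

```python
import re

# Ordered rules: keywords come before the generic identifier rule.
TOKEN_SPEC = [
    ("T_IF", r"\bif\b"),
    ("T_ECHO", r"\becho\b"),
    ("T_VARIABLE", r"\$[A-Za-z_]\w*"),
    ("T_LNUMBER", r"\d+"),
    ("T_STRING", r"[A-Za-z_]\w*"),
    ("T_WHITESPACE", r"\s+"),
    ("CHAR", r"."),  # hypothetical catch-all for punctuation such as ( ) ;
]

# One big alternation with a named group per rule.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    # Return (token_name, text) pairs, dropping whitespace.
    tokens = []
    for match in MASTER.finditer(source):
        if match.lastgroup != "T_WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
    return tokens


if __name__ == "__main__":
    for token in tokenize("if ($a) echo 1;"):
        print(token)
```

Note that the lexer only names the pieces. It has no idea yet that an “if” needs a condition and a statement.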
The syntax analysis (or parser) correlates these tokens into a parse tree (so the “if” has a “condition” parameter of some sort, and a following statement). A parser generator is used for this task, and here too a switch was made, from bison to lemon.
Finally, the bytecode generator translates this parse tree into a linear set of opcode instructions, comparable to assembly. bytekit was mentioned as Sebastian’s own bytecode analyzer, and VLD as an alternative. The latter’s acronym actually stands for Vulcan Logic Disassembler, which makes you wonder how pointy the ears of the developer were and if he has been suppressing his emotions for long, now. Let it out, man! A list of these opcodes can be found at zapt.info/opcodes.html.
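To give an idea of what “translating a parse tree into a linear set of instructions” means, here is a small sketch in Python. The tree shape (nested tuples), the two opcodes and the little interpreter are all assumptions made for this example; the names ECHO and JMPZ are borrowed from Zend opcode names, but this is not how the Zend Engine represents or executes its opcodes.

```python
def compile_node(node, ops):
    # Append instructions for one parse tree node to the flat list `ops`.
    kind = node[0]
    if kind == "block":
        for child in node[1:]:
            compile_node(child, ops)
    elif kind == "echo":
        ops.append(("ECHO", node[1]))
    elif kind == "if":
        _, condition, body = node
        jump_index = len(ops)
        ops.append(None)  # placeholder, patched once the body length is known
        compile_node(body, ops)
        # Jump past the body when the condition is falsy.
        ops[jump_index] = ("JMPZ", condition, len(ops))
    else:
        raise ValueError(f"unknown node type: {kind}")
    return ops


def value(operand, variables):
    # Operands are ("const", literal) or ("var", name).
    return operand[1] if operand[0] == "const" else variables.get(operand[1])


def run(ops, variables):
    # Tiny interpreter for the two opcodes above; returns the echoed output.
    output, pc = [], 0
    while pc < len(ops):
        op = ops[pc]
        if op[0] == "ECHO":
            output.append(str(value(op[1], variables)))
            pc += 1
        else:  # JMPZ
            pc = pc + 1 if value(op[1], variables) else op[2]
    return "".join(output)


# Hand-built parse tree for: if ($a) echo "yes"; echo "done";
TREE = ("block",
        ("if", ("var", "a"), ("echo", ("const", "yes"))),
        ("echo", ("const", "done")))

if __name__ == "__main__":
    for index, op in enumerate(compile_node(TREE, [])):
        print(index, op)
```

The nesting of the tree is gone in the output: the “if” has become a conditional jump over the instructions of its body, which is the same flattening you see when you dump real opcodes with bytekit or VLD.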
Sebastian introduced us to the new keyword “unless”, which was merely for presentation reasons (phew!), to show how you could change bits and pieces of the parser and test the result using bytekit. Very cool stuff, read into it, and try out the bytekit tool 🙂
Maybe someone out there will find the time to actually build a smarty lexer + parser to hook into the Zend Engine, which eliminates the endless “why compile a template engine with a template engine”-discussion 😉
Eli White – Code & Release management
Eli started off with a crash course in VCS terminology. I’m not getting into that here; I’m assuming you’re familiar with VCS and if not, you probably should be reading a version control book or blogpost right now, instead of this one 😉
First of all, Eli encouraged us to set up a set of guidelines or policies to incorporate into your development team’s everyday work. Very good tip I think, and easily overlooked.
Then, using Subversion as an example, he showed a few approaches to version and release management. Stage branches: basically the trunk for development, plus a branch each for the staging and production environments. I think the cons of this approach are obvious: messy merging and no fixed-state copies of your releases (tags). Feature branches: the trunk is for development, every feature gets its own branch and the features are merged back into trunk before testing. The main problem here is that the branches could easily get a “life of their own”, so it will be a lot of work merging these branches either back into the trunk or into another branch. Last was release branches: the trunk is development, you test against a branch, merge fixes from the branch back into trunk and tag a release from the branch.
I agree with Eli that a nice mix between the last two would be the most flexible approach. Of course, you need to keep reviewing your policies mentioned earlier to match this way of development.
Unfortunately, Eli didn’t get into the source code management for a common code base or external sources. I’ll write a blogpost on that some day, to explain how I think you can do this best, which is for all you framework-building, library-using, never-build-from-scratch guys out there 🙂
Finally, Eli elaborated on how to put things live. This wasn’t actually very different from Lorna Jane Mitchell’s talk from last year, but nevertheless an interesting notion. Basically: write a script. Use this script to put things “live” in your testing environment as well, and use SVN exports in combination with rsync, which is the most reliable and flexible. At the live server, either use a symlinked DocumentRoot and maintain the previous (few) versions for rollback purposes, or put a “Work-In-Progress” sign up. Of course, let the script take care of that as well.
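The symlinked DocumentRoot idea could be scripted roughly like this. It is a sketch under a few assumptions of my own, not Eli's actual script: every release already sits in its own directory (put there by your SVN export and rsync step, which is left out here), release directory names sort chronologically, and the server is a POSIX system where renaming one symlink over another is atomic. The function names are made up.

```python
import os
import shutil


def activate_release(releases_dir, release_name, current_link):
    # Point the DocumentRoot symlink at a release by swapping symlinks.
    target = os.path.join(releases_dir, release_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # Renaming the new link over the old one replaces it in a single step.
    os.replace(tmp_link, current_link)


def prune_releases(releases_dir, current_link, keep=3):
    # Delete all but the newest `keep` releases, never the active one.
    active = os.path.realpath(current_link)
    names = sorted(os.listdir(releases_dir))
    for name in names[:max(len(names) - keep, 0)]:
        path = os.path.join(releases_dir, name)
        if os.path.realpath(path) != active:
            shutil.rmtree(path)
```

Rolling back is then just calling activate_release again with the name of the previous release, which is exactly why you keep a few old versions around.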
On updating your database: basically never do any destructive updates that are hard to roll back until you are a few versions further along.
Lorenzo Alberton – Trees in the database
I think this might have been Lorenzo Alberton’s first presentation at a conference. He was a bit shaky and hasty, but fortunately his enthusiasm made up for some part of that. Most of the presentation wasn’t really interesting to me, but that probably has to do with the fact that I already know some stuff about it; not necessarily everyone does.
He covered the adjacency list model (why not to use it), the materialized path model (why it doesn’t really solve the adjacency list model’s deficits) and the nested set model. If you don’t know the last, please learn it, it’s great. Lorenzo showed a few modified approaches to implementing the nested set model, with some clever usage of rational numbers, which I had never heard of. It didn’t really register with me at that time, because I was tired and becoming fed up with 3 days of conference :P. I’ll figure it out some time. Then came a proprietary Oracle solution, which had some cool features, but hey, who’s using Oracle? (not me). Finally the SQL ’99 standards approach, which basically implements a dynamic view that generates the joins needed to get the children and descendants of a node in the tree. Great concept, but not very widely implemented. I would really need to see the slides again to reproduce the rest, so I’m gonna stop here 😛
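For those who haven't seen the nested set model: every node stores a left and a right number, and all descendants of a node have numbers that fall between those two. Below is a minimal sketch using Python's sqlite3 module with a made-up five-node tree. The second query is my guess at what the SQL ’99 approach boils down to, a recursive common table expression over a plain adjacency list; I haven't checked that against Lorenzo's slides. Table and column names are assumptions.

```python
import sqlite3

# (id, name, parent_id, lft, rgt): the same tree stored both ways.
ROWS = [
    (1, "Food", None, 1, 10),
    (2, "Fruit", 1, 2, 7),
    (3, "Apple", 2, 3, 4),
    (4, "Banana", 2, 5, 6),
    (5, "Meat", 1, 8, 9),
]


def make_tree():
    # In-memory database holding the hypothetical example tree.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, "
                 "parent_id INTEGER, lft INTEGER, rgt INTEGER)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def descendants_nested_set(conn, name):
    # Nested set: one self-join, no recursion, any depth.
    sql = ("SELECT child.name FROM nodes AS parent "
           "JOIN nodes AS child ON child.lft > parent.lft AND child.rgt < parent.rgt "
           "WHERE parent.name = ? ORDER BY child.lft")
    return [row[0] for row in conn.execute(sql, (name,))]


def descendants_recursive(conn, name):
    # Adjacency list walked with a recursive common table expression.
    sql = ("WITH RECURSIVE subtree(id, name) AS ("
           "  SELECT id, name FROM nodes WHERE name = ? "
           "  UNION ALL "
           "  SELECT n.id, n.name FROM nodes AS n JOIN subtree AS s ON n.parent_id = s.id"
           ") SELECT name FROM subtree WHERE name != ? ORDER BY name")
    return [row[0] for row in conn.execute(sql, (name, name))]


if __name__ == "__main__":
    conn = make_tree()
    print(descendants_nested_set(conn, "Fruit"))
    print(descendants_recursive(conn, "Food"))
```

The trade-off is in the writes: the nested set makes reads cheap, but inserting a node means renumbering part of the tree, while the adjacency list is trivial to update and needs the recursion to read a whole subtree.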
Well, that’s about it for today. There are some ideas that popped into my head today, but I won’t post them until they’ve matured 😉 Hope you enjoyed the read of this and the previous post and of course feel free to comment.