%HTMLlat1;
]>
For years, comments on the blog were systematically rejected because most of them were spam and I had no good way of filtering them out. A CAPTCHA would have been a solution, but I read that the ones based on warped text are easily defeated, and I didn't want to bother with images anyway.
I recently rediscovered the idea of a CAPTCHA based on arithmetic, which I have now implemented. To post a comment, you must solve a simple arithmetic operation involving two single-digit numbers and one operator. It should not be difficult to defeat, but arithmetic CAPTCHAs are apparently uncommon, so most bots probably don't implement such solvers. It has already repelled a couple of spam comments today.
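A minimal sketch of such an arithmetic challenge, in Python rather than the blog engine's Perl. The function names and the choice of `+`, `-` and `*` are assumptions for illustration; the post only says that there are two single-digit numbers and one operator.

```python
import operator
import random

# Hypothetical operator set; the post does not say which operators are used.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_challenge(rng=random):
    """Return (question, expected_answer) built from two single-digit numbers."""
    a = rng.randint(0, 9)
    b = rng.randint(0, 9)
    symbol = rng.choice(sorted(OPERATORS))
    return f"{a} {symbol} {b}", OPERATORS[symbol](a, b)


def check_answer(submitted, expected):
    """Accept the comment only if the submitted text is the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```

In a real form the expected answer would have to stay on the server (or be signed), otherwise a bot could simply read it from the page.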
]]>?all=1
dans l'URL. Ceci s'applique
aussi aux flux RSS.
Summaries and beers are not interesting to most people, so I replaced the optional filter with a mandatory “second page”. To disable this filter, add ?all=1 to the URL. The same applies to RSS feeds.
Yhteenvedot ja oluet eivät kiinnosta useimpia lukijoita, joten korvasin
vapaaehtoisen suodattimen pakollisella “toisella sivulla”. Suodattimen voi
pysäyttää lisäämällä ?all=1
URL-osoitteen loppuun. Sama koskee
RSS-syötteitä.
subset=life
à l'URL du blog. Pour les plus fainéants, voici des
liens directs vers les version
HTML et
RSS du blog.
To filter out the “less interesting” content of the blog, i.e., the book and movie summaries, the beers and the chocolates, you just have to append the subset=life parameter to the blog's URL. For the laziest of you, here are direct links to the HTML and RSS versions of the blog.
Jos haluat suodattaa “vähemmän kiinnostavan” sisällön pois, eli kirjojen ja
elokuvien yhteenvedot, oluet ja suklaat, sinun tarvitsee vain lisätä subset=life
blogin URL:iin. Tässä vielä laiskimille suoria linkkejä
HTML- ja
RSS- versioihin.
I also made the latest entry of said microblog appear in the blog's header, to give it a bit more visibility.
]]>http://users.jyu.fi/~mweber/blog/index.rss?lang=en as the RSS feed's URL. You can replace the en at the end with fr or fi for French and Finnish respectively.
I changed my RSS feed to display the stories in HTML instead of the old plain-text version. In addition, I added to each RSS item a link to automatic translation services (Google Translate and Babelfish). This link appears only if the RSS feed is requested with a language parameter. For French, for example, use http://users.jyu.fi/~mweber/blog/index.rss?lang=fr as the RSS feed's URL. You can replace the fr at the end with en or fi for English or Finnish, respectively.
I cannot be held responsible for death by fatal hilarity when reading the automatic translations.
Je ne peux être tenu responsable si vous mourez de rire en lisant les traductions automatiques.
]]>I implemented menus for the category list, because it was uselessly long. Items that contain subitems are in bold font. Indenting subcategories with non-breaking spaces had been annoying me for a long time, but I never had a good reason to rewrite the categorytree module to get rid of them. It seems that making a menu was a good enough reason.
The mechanism for the menu (using purely CSS) is adapted from Eric Meyer's Pure CSS Menus. I have no idea if it works with IE (at the time Meyer implemented his menu in 2002, IE's CSS engine was too crappy to render them properly), but who cares about IE users anyway.
]]>I removed the RANDOM spam filter (which compared the letter frequencies of a comment with statistics for English, French and Finnish). It wasn't catching much spam anyway, and the spam it did catch was not random but contained lots of medicine names, which don't follow the statistical patterns of English, French or Finnish. It also filtered out one legitimate comment where the author's name was Finnish and the content was in English (thus matching neither Finnish nor English statistics), which is not nice.
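For illustration, a letter-frequency comparison of this kind could look like the sketch below. It is not the removed filter: the distance measure is an assumption, and the table holds rounded, approximate English letter frequencies (in percent) instead of the statistics the blog actually used.

```python
from collections import Counter

# Approximate English letter frequencies in percent (rounded; an assumption,
# not the table used by the blog's RANDOM filter).
ENGLISH = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.1, "z": 0.07,
}


def letter_distance(text, reference=ENGLISH):
    """Sum of absolute differences between observed and reference frequencies.

    0.0 means a perfect match, 2.0 is the largest possible value.
    """
    letters = [c for c in text.lower() if c in reference]
    if not letters:
        return 2.0  # nothing to compare: treat as maximally different
    counts = Counter(letters)
    total = len(letters)
    ref_total = sum(reference.values())
    return sum(
        abs(counts[letter] / total - reference[letter] / ref_total)
        for letter in reference
    )
```

A comment would be flagged when its distance to every supported language exceeds some threshold. Short texts, proper names and mixed-language comments give noisy distances, which fits the false positive described above.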
]]>A closer look at the spams shows that in the past 3 weeks, 100% of the comment spam has been caught. The messages that I needed to moderate manually were all trackback spam. The latter is harder to spot because there is no form that should be fetched prior to posting and that could be used for laying traps.
Comment spam represents 64.1% of the spam, whereas trackback spam represents 35.9% (with a total of 139303 spams in 795 days, i.e., an average of 175 spams per day).
Given the number of legitimate trackbacks I get (exactly 0 in two years), maybe I should simply disable trackback?
]]>Out of the last 6370 spams, 5798 (91.0%) were blocked based on the IP address of the sender (IPBLACKLIST).
Out of the other 572, 394 (68.9%) were blocked by a simple trap (COIN, a field that should be left empty), 83 (14.5%) were blocked because they contained the same URL more than twice (SAMEURL), 49 (8.57%) had too many URLs per word (TOOMANYURL), 15 (2.62%) were blocked by keyword (KEYWORD), 7 (1.22%) had the same values for title, blog name and excerpt (SAMETITLE), 5 (0.874%) had more than 4 URLs pointing to the same server (SAMESERVER), 3 (0.524%) contained random data (RANDOM; none of them actually did, but they were spam nonetheless) and 2 (0.350%) contained only hex data (HEXDATA).
Overall, 14 spams had to be hand moderated, which makes a false negative percentage of 0.22%.
The false positives I've had were because of the TOOMANYURL filter, but it also catches a lot of spams. In most of these, the URLs were not real but made of random letters.
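Some of the rules listed above can be sketched as follows. This is only an illustration under assumptions: the URL pattern, the order of the checks, the `max_url_ratio` threshold and the minimum length for HEXDATA are all hypothetical, since the post gives only the rule names and the limits "more than twice" and "more than 4".

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical URL pattern, good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def classify(comment, trap_field="", max_url_ratio=0.5):
    """Return the name of the first rule that fires, or None."""
    if trap_field:
        return "COIN"  # the trap field must be left empty
    urls = URL_RE.findall(comment)
    if urls:
        if max(Counter(urls).values()) > 2:
            return "SAMEURL"  # same URL more than twice
        hosts = Counter(urlsplit(url).hostname for url in urls)
        if max(hosts.values()) > 4:
            return "SAMESERVER"  # more than 4 URLs on one server
        if len(urls) / len(comment.split()) > max_url_ratio:
            return "TOOMANYURL"  # threshold is an assumption
    # The minimum length (an assumption) avoids flagging words like "bad".
    if len(comment) >= 16 and re.fullmatch(r"[0-9a-fA-F\s]+", comment):
        return "HEXDATA"
    return None
```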
]]>Exemple: http://users.jyu.fi/~mweber/blog/index.rss?filterlang=en+fr n'affiche que les articles écrits en anglais ou en français dans le flux RSS.
-----
Articles can be filtered based on language, simply by adding the filterlang parameter to the URL.
Example: http://users.jyu.fi/~mweber/blog/index.rss?filterlang=en+fi would display only the articles written in English or in Finnish in the RSS feed.
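A sketch of how such a parameter could be interpreted, in Python. The function names and the `lang` field of an article are assumptions for illustration, not the blog engine's actual code.

```python
from urllib.parse import parse_qs, urlsplit


def requested_languages(url):
    """Return the set of language codes given in the filterlang parameter."""
    query = parse_qs(urlsplit(url).query)
    # In a query string "+" decodes to a space, so "en+fi" arrives as "en fi".
    values = query.get("filterlang", [])
    return {code for value in values for code in value.split()}


def filter_articles(articles, url):
    """Keep the articles whose (hypothetical) "lang" field was requested."""
    wanted = requested_languages(url)
    if not wanted:
        return list(articles)  # no filter given: keep everything
    return [article for article in articles if article["lang"] in wanted]
```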
]]>I dug again into my hacked blosxom and found more ugliness, such as opening files for all the stories, including the ones that are not going to be displayed. Fixing this divided the time spent in blosxom::generate by 3.
Surprisingly, replacing the look-behind assertions in the regexps of interpolate_fancy::__ANON__ didn't make as big a difference as in textrite::rite (1.11 ms against 0.813 ms per call), probably because there weren't that many of them and they weren't called that often.
Profiling results are now like this:
Total Elapsed Time = 1.236275 Seconds
  User+System Time = 1.066275 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 10.2   0.109  0.127    134   0.0008 0.0010  interpolate_fancy::__ANON__
 7.50   0.080  0.178     13   0.0061 0.0137  blosxom::BEGIN
 5.06   0.054  0.428      1   0.0540 0.4283  blosxom::generate
 4.41   0.047  0.098    368   0.0001 0.0003  entries_index::__ANON__
 3.75   0.040  0.040      7   0.0057 0.0057  CGI::_compile
 3.75   0.040  0.054      9   0.0044 0.0061  CGI::import
 3.75   0.040  0.040     40   0.0010 0.0010  textrite::rite
 3.66   0.039  0.039    690   0.0001 0.0001  File::Basename::fileparse
 2.81   0.030  0.060      8   0.0037 0.0074  Net::SMTP::BEGIN
 2.81   0.030 -0.000     62   0.0005      -  Exporter::import
 2.81   0.030  0.030     40   0.0007 0.0007  magiclink::story
 1.88   0.020  0.020      5   0.0040 0.0040  autotrack::BEGIN
 1.88   0.020  0.030      7   0.0029 0.0042  IO::File::BEGIN
 1.88   0.020  0.089      8   0.0025 0.0111  writeback::BEGIN
 1.88   0.020  0.020     41   0.0005 0.0005  writeback::real_path
The User+System Time is lower, but for some reason the Total Elapsed Time is about the same.
]]>Profiling results for the new blog engine (main page):
Total Elapsed Time = 1.245757 Seconds
  User+System Time = 1.235757 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 14.9   0.185  0.586      1   0.1851 0.5857  blosxom::generate
 12.0   0.149  0.187    134   0.0011 0.0014  interpolate_fancy::__ANON__
 6.47   0.080  0.158     14   0.0057 0.0113  blosxom::BEGIN
 5.66   0.070  0.070     40   0.0017 0.0017  textrite::rite
 4.53   0.056  0.095    367   0.0002 0.0003  entries_index::__ANON__
 4.05   0.050  0.059      8   0.0062 0.0074  Net::SMTP::BEGIN
 3.24   0.040  0.108      8   0.0050 0.0135  writeback::BEGIN
 2.43   0.030  0.030      7   0.0043 0.0043  CGI::_compile
 2.43   0.030  0.043      9   0.0033 0.0048  CGI::import
 2.27   0.028  0.037    317   0.0001 0.0001  interpolate_fancy::_resolve_nested
 1.62   0.020  0.020      1   0.0200 0.0200  archives::filter
 1.62   0.020  0.019      4   0.0050 0.0048  entries_index::BEGIN
 1.62   0.020  0.030      5   0.0040 0.0060  autotrack::BEGIN
 1.62   0.020  0.020     27   0.0007 0.0007  vars::import
 1.62   0.020  0.010     62   0.0003 0.0002  Exporter::import
The problem in textrite::rite came from the use of look-behind assertions in Perl regexps. I've changed all of them so that look-behind is not needed anymore. The same problem occurs in interpolate_fancy::__ANON__, but I'm not going to look into it yet.
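To show what removing a look-behind can look like, here is the same kind of rewrite in Python rather than Perl. The markup rule (`*word*` becomes emphasis unless the opening star is escaped with a backslash) is a made-up example, not one of textrite's actual rules, and the sketch says nothing about speed in either language.

```python
import re

# Version with a negative look-behind: the opening "*" must not follow "\".
LOOKBEHIND = re.compile(r"(?<!\\)\*(\w+)\*")

# Version without look-behind: the preceding character (or the start of the
# string) is captured and written back by the replacement.
NO_LOOKBEHIND = re.compile(r"(^|[^\\])\*(\w+)\*")


def with_lookbehind(text):
    return LOOKBEHIND.sub(r"<em>\1</em>", text)


def without_lookbehind(text):
    return NO_LOOKBEHIND.sub(r"\1<em>\2</em>", text)
```

The two versions are not strictly equivalent: the second one consumes the preceding character, so in `*a**b*` it only converts the first word. Whether that matters depends on the markup.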
The other problem was that textrite::rite was called for all the stories, including the ones that were not displayed (which was most of them). I changed the internal structure of the main script so that the processing of the story happens after discarding the non-displayed ones.
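The reordering can be sketched like this in Python. The function and parameter names are hypothetical; the figures 253 and 40 are the call counts of textrite::rite in the profiles before and after the change.

```python
def render_page(stories, is_displayed, render):
    """Discard the stories that will not be shown, then render the rest."""
    selected = [story for story in stories if is_displayed(story)]
    return [render(story) for story in selected]


def count_render_calls(total=253, shown=40):
    """Count how often the expensive step runs for `shown` stories out of `total`."""
    calls = []

    def render(story):
        calls.append(story)  # stands in for the expensive conversion
        return f"<p>story {story}</p>"

    render_page(range(total), lambda story: story < shown, render)
    return len(calls)
```

With the defaults, `count_render_calls()` returns 40 instead of the 253 calls made when every story is processed before filtering.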
Profiling results for the blog engine (main page):
Total Elapsed Time = 3.74 Seconds
         User Time = 2.99 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 26.0   0.780  0.780    253   0.0031 0.0031  textrite::rite
 18.0   0.540  0.800    876   0.0006 0.0009  interpolate_fancy::__ANON__
 9.03   0.270  2.540      1   0.2700 2.5400  blosxom::generate
 7.02   0.210  0.260   1911   0.0001 0.0001  interpolate_fancy::_resolve_nested
 3.68   0.110  0.110   3663   0.0000 0.0000  IO::File::open
 2.34   0.070  0.160     11   0.0064 0.0145  blosxom::BEGIN
 2.34   0.070  0.070  11867   0.0000 0.0000  UNIVERSAL::can
 2.34   0.070  0.140    741   0.0001 0.0002  lang::__ANON__
 2.34   0.070  0.290    253   0.0003 0.0011  writeback::story
 1.67   0.050  0.070    253   0.0002 0.0003  magiclink::story
 1.34   0.040  0.070      9   0.0044 0.0078  CGI::import
 1.34   0.040  0.040    254   0.0002 0.0002  writeback::real_path
 1.34   0.040  0.040    253   0.0002 0.0002  translate::story
 1.00   0.030  0.030   1052   0.0000 0.0000  UNIVERSAL::isa
 1.00   0.030  0.080      8   0.0037 0.0100  writeback::BEGIN
That's what you get when you abuse Perl regexps. Maybe caching the HTMLized stories would help a bit?
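One way such a cache might work, sketched in Python: keep the rendered HTML in a file next to the story and regenerate it only when the story is newer. The `.cache.html` suffix and the function name are assumptions. The cache is on disk because a CGI script starts from scratch on every request, so an in-memory cache would not survive.

```python
import os


def cached_html(story_path, render, cache_suffix=".cache.html"):
    """Return the rendered story, reusing the cached file when it is fresh."""
    cache_path = story_path + cache_suffix  # hypothetical naming scheme
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(story_path)):
        with open(cache_path, encoding="utf-8") as cache_file:
            return cache_file.read()
    with open(story_path, encoding="utf-8") as story_file:
        html = render(story_file.read())
    with open(cache_path, "w", encoding="utf-8") as cache_file:
        cache_file.write(html)
    return html
```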
]]>I submitted my collection of spam to Spamhaus's Zen service, but only 40% of my spammers' IP addresses were recognized. Checking for the existence of the URLs in these spams takes much more time, since most of them don't exist anymore and you waste time waiting for DNS to time out. Testing the 40 most recent ones didn't return any match; maybe they are too recent to be in the RBL already.
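For reference, a DNS blocklist such as Zen is queried by reversing the octets of the IPv4 address and resolving the result under the list's zone. A minimal Python sketch, with hypothetical function names; treating only answers in 127.0.0.x as a listing is an assumption made to skip error replies.

```python
import ipaddress
import socket


def dnsbl_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="zen.spamhaus.org"):
    """Return True if the lookup succeeds with an answer in 127.0.0.x."""
    try:
        answer = socket.gethostbyname(dnsbl_name(ip, zone))
    except OSError:
        return False  # no such name (not listed) or the lookup failed
    return answer.startswith("127.0.0.")
```

For example, `dnsbl_name("192.0.2.99")` gives `99.2.0.192.zen.spamhaus.org`. Each call to `is_listed` costs one DNS round trip, so results are worth caching when checking a large collection.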
]]>More and more spam was slipping through my spam filter, so in mid-December I decided to disable comments until I found a better solution. From now on it is possible to post comments again, but their content will not be displayed immediately if it contains URLs, or if a URL was entered in the URL field.
Joulukuun puolivälissä estin blogin kommentit, koska sain liikaa roskapostia suodattimesta huolimatta. Nyt on taas mahdollista lähettää kommentteja, mutta niiden sisältö ei näy heti, jos kommentissa tai sen URL-kentässä on linkkejä.
]]>© spam-UK.com
Got my first spam comment today. Those bastards are fast: the blog's been up for only a month and a half and I have fewer than ten visitors a day (search engines included)…
J'ai eu mon premier spam aujourd'hui. Ces enfoirés sont rapides, le blog n'existe que depuis un mois et demi et j'ai moins de dix visiteurs par jour (en comptant les moteurs de recherche)…
Sain minun ensimmäisen spamin tänään. Ne paskiaiset ovat nopeita, blogi on ollut olemassa vain puolitoista kuukautta ja minulla on alle kymmenen kävijää päivässä (sisältää hakukoneet)…
]]>Most pictures of the blog come from the Web. I made local copies of them, which in turn act as links to the originating website. The name of the source is printed below the picture.
Suurin osa blogin kuvista on peräisin muualta webistä. Olen tehnyt kuvista paikalliset kopiot, ja kuva toimii linkkinä alkuperäiselle sivustolle. Kuvan alla mainitaan sen lähde.
]]>© Alexandre Alapetite
That's it, I gave in to the sirens of Web 2.0 and opened my blog. I'm not sure it will be useful or lively, but we'll see.
]]>Blog sections containing more than 40 entries are divided into pages. This required changing the way Blosxom generates the HTML. I guess that my blog engine is not really Blosxom anymore, although it is compatible with Blosxom V2.
Blogin osat joissa on yli 40 artikkelia on jaettu sivuihin. Sitä varten täytyi muuttaa tapaa jolla Blosxom tuottaa HTML koodia. Luulen että minun blog-moottori ei enää ole Blosxom, mutta se on yhteensopiva Blosxomin kanssa.
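The paging itself amounts to slicing the list of entries. A Python sketch with hypothetical function names; using 40 as the page size is an assumption based on the threshold mentioned above.

```python
def page_count(entry_count, per_page=40):
    """Number of pages needed; an empty section still has one (empty) page."""
    return max(1, -(-entry_count // per_page))  # ceiling division


def paginate(entries, page, per_page=40):
    """Return the entries of the given 1-based page."""
    start = (page - 1) * per_page
    return entries[start:start + per_page]
```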
]]>The blog at last supports several languages (French, English, Finnish).
Blogi on vihdoin monikielinen (ranska, englanti, suomi).
]]>MaxMind provides a database that maps IP addresses to countries, or even cities, and supplies other geographic information as well. The country and the city (when available) of people who post comments are now displayed next to the person's name.
MaxMind tarjoaa ilmaisen tietokannan joka yhdistää IP osoiteet maihin (ja mahdollisesti myös kaupunkeihin). Kommentteja postavan henkilön maa ja kaupunki näkyvät blogissa nimen jälkeen.
]]>I found a logo for the blog: the wannabe hacker emblem. It's black & white and pretty cryptic.
Löysin blogille logon: wannabe-hackerin tunnuskuva. Se on mustavalkoinen ja sopivan salaperäinen.
]]>Switched from RSS 0.91 to 2.0 to get guid elements. The feeds are now valid (fingers crossed).
RSS 0.91:stä 2.0:an, niin että niissä on guid elementteja. Syötöt ovat nyt oikeellisia (pidetään peukkuja pystyssä).
]]>The RSS feeds (0.91 and 1.0) are valid; each item has its own publication date.
RSS (0.91 ja 1.0) syötöt ovat nyt luultavasti oikeellisia; nimikkeillä on nyt omat julkaisupäivämäärät.
]]>Articles can be associated with multiple categories by using symbolic links (Unix only).
Artikkelit voivat kuulua useaan kategoriaan, symbolisia linkkeja käyttämällä (vain Unixissa).
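As an illustration, filing an existing article under a second category could be done as below. The sketch is in Python, the function name is hypothetical, and it assumes that categories are directories and articles are files inside them.

```python
import os


def add_to_category(article_path, category_dir):
    """Create a symbolic link to the article inside another category directory."""
    os.makedirs(category_dir, exist_ok=True)
    link_path = os.path.join(category_dir, os.path.basename(article_path))
    # A relative target keeps the link valid if the whole tree is moved.
    target = os.path.relpath(article_path, category_dir)
    os.symlink(target, link_path)
    return link_path
```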
]]>