Friday, 19 October 2012

Twitter Character Count

【Update: I do not know when it happened, but Twitter no longer differentiates between BMP and non-BMP characters with regard to character count. All characters now have a count of 1. I may, at some stage, delete this article but, for the time being, I will leave it here as an historical record of the evolution of Twitter.】

In a previous article I examined the Sina Wēibó 新浪微博 character count for a user post (schappo.blogspot.co.uk/2012/10/weibo-character-count.html). Let's now examine Twitter. The stated and generally understood limit is 140 characters for a tweet. This is not strictly true. The effective limit is variable and ranges from 70 to 140 characters, inclusive, because different characters have different counts, as follows:

  • Characters from Unicode range U+0000➜U+FFFF have a count of 1
  • Characters from Unicode range ≥ U+010000 have a count of 2
Or, to put it another way: characters in the Basic Multilingual Plane (BMP) have a count of 1 and characters in the other planes have a count of 2. The 2 Mahjong Tile characters used in the example below are from the Supplementary Multilingual Plane (SMP).

Let's illustrate with a made-up posting that contains characters from the 2 Unicode ranges above. The following text has a tweet character count of 17.
  • one two 🀂一二三四五🀀
  • 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 = 17
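The counting rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this article, not Twitter's own code; the function name `tweet_count` is made up for the example, and the two Mahjong tiles are written as escapes so the string survives copy and paste.

```python
def tweet_count(text: str) -> int:
    """Count characters under the rule described in this article.

    Hypothetical helper: code points in the BMP (U+0000 to U+FFFF)
    count 1, code points at or above U+10000 count 2.
    """
    return sum(2 if ord(ch) >= 0x10000 else 1 for ch in text)


# The example posting: 8 Latin characters and spaces, a Mahjong tile,
# 5 Chinese characters, and a second Mahjong tile.
example = "one two \U0001F002一二三四五\U0001F000"
print(tweet_count(example))  # 17
```

Under this rule the count happens to equal the number of UTF-16 code units in the text, since characters outside the BMP are exactly those that need a surrogate pair.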

Saturday, 6 October 2012

Weibo Character Count

Like all the other microblog systems I have encountered, Sina Wēibó 新浪微博 has a stated 140 character limit for a user post. This is not strictly accurate. The effective limit is variable and ranges from 70 to 280 characters, inclusive, depending on which characters are included. Different characters have different counts, as follows:
  1. Characters from Unicode range U+0000➜U+00FF have a count of 0.5
  2. Characters from Unicode range U+0100➜U+FFFF have a count of 1
  3. Characters from Unicode range ≥ U+010000 have a count of 2
Some of the consequences of these differing counts are:

  • If one writes in everyday English, one has up to 280 characters, as these will be Latin characters in the Unicode blocks Basic Latin and Latin-1 Supplement (U+0000➜U+00FF). The Latin script does, though, occur in several other Unicode blocks (en.wikipedia.org/wiki/Latin_characters_in_Unicode). Latin characters in blocks other than Basic Latin and Latin-1 Supplement have counts of 1 or 2, and using them reduces the 280 limit.
  • For a Chinese-only post, if all the Chinese characters used are in the Unicode Basic Multilingual Plane (BMP), the limit will be the accepted 140 characters. There are many Chinese characters outside the BMP and, because they have a count of 2, using them reduces the 140 limit. The extreme case is a limit of 70 if every character used is a Chinese character outside the BMP.
  • In recent releases of OS X and iOS, Apple incorporated Emoji characters (en.wikipedia.org/wiki/Emoji). The majority of these Emoji characters are outside the BMP (i.e. ≥ U+010000) and so have a count of 2.
Let's illustrate with a nonsensical posting that contains characters from the 3 Unicode ranges above. The following text has a Weibo character count of 13.

  • one two 🀂一二三四五🀀
  • 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 2 + 1 + 1 + 1 + 1 + 1 + 2 = 13
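The three-range rule can be written as a short Python function. Treat it as a sketch of the rule as stated in this article rather than Sina's actual implementation; the name `weibo_count` is invented for the illustration.

```python
def weibo_count(text: str) -> float:
    """Count characters under the three-range rule listed above.

    Hypothetical helper, based only on the ranges given in this article.
    """
    total = 0.0
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFF:        # Basic Latin and Latin-1 Supplement
            total += 0.5
        elif code_point <= 0xFFFF:    # rest of the BMP
            total += 1
        else:                         # supplementary planes
            total += 2
    return total


# Same nonsensical posting as above, with the Mahjong tiles as escapes.
example = "one two \U0001F002一二三四五\U0001F000"
print(weibo_count(example))  # 13.0
```

With this function, 280 Basic Latin characters come to exactly 140, and 70 characters from outside the BMP also come to 140, which matches the two extremes of the range.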

Tuesday, 24 July 2012

My Adopted Chinese Name

In China there is a very famous Canadian by the name of Mark Rowswell (dashan.com). One of the reasons he is so famous in China is that his Chinese is very, very good. His adopted Chinese name is 大山 (dàshān), which means great or large mountain.

Several years ago I decided to also adopt a Chinese name. One day a name popped into my mind. Mark's Chinese is very good but my Chinese is only basic. Consequently, I chose the name 小山 (xiǎoshān) which means little mountain 😀

An advantage of having an adopted name is that one can change it, and I can change mine to reflect my progress in mastering the Chinese language. So, as my Chinese improves, I can change it to 中山 (zhōngshān), which means middle mountain, then to 大山 (dàshān) and finally, if I ever reach this level of proficiency, to 巨山 (jùshān), which means gigantic mountain.

There are, though, some days when I think my Chinese is so poor that maybe my adopted name should be 微山 (wēishān), as this means micro mountain.

Thursday, 14 June 2012

Rethinking data

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." — Sherlock Holmes in The Adventure of the Copper Beeches.

Data may be the preeminent obsession of our age[1]. We marvel at the ever-growing quantity of data on the Internet, and fortunes are made when Google sells shares for the first time on the stock market. We worry about how corporations and governments collect, protect, and share our personal information. A beloved character on a television science fiction show is named Data. We spend billions of dollars to convert the entire human genome into digital data, and having completed that, barely pause for breath before launching similar and even larger bioinformatic endeavours. All this attention being paid to data reflects a real societal transformation as ubiquitous computing and the Internet refashion our economy and, in some respects, our lives. However, as with other important transformations—think of Darwin's theory of natural selection, and the revolutionary advances in genetics and neuroscience—misinterpretation, misapplication, hype, and fads can develop. In this post, I'd like to examine the current excitement about data and where we may be going astray.

Big Data

Writing in the New York Times, Steve Lohr points out that larger and larger quantities of data are being collected—a phenomenon that has been called "Big Data":
In field after field, computing and the Web are creating new realms of data to explore — sensor signals, surveillance tapes, social network chatter, public records and more. And the digital data surge only promises to accelerate, rising fivefold by 2012, according to a projection by IDC, a research firm. 
Widespread excitement is being generated by the prospect of corporations, governments, and scientists mining these immense data sets for insights. In 2008, a special issue of the journal Nature was devoted to Big Data. Microsoft Research's 2009 book, The Fourth Paradigm: Data-Intensive Scientific Discovery, includes these reflections by Craig Mundie (p.223):
Computing technology, with its pervasive connectivity via the Internet, already underpins almost all scientific study. We are amassing previously unimaginable amounts of data in digital form—data that will help bring about a profound transformation of scientific research and insight. 
The enthusiasm in the lay press is even less restrained. Last November, Popular Science had a special issue all about data. It has a slightly breathless feel—one of the articles is titled "The Glory of Big Data"—which is perhaps not so surprising in a magazine whose slogan is "The Future Now".

Data Science

Along with the growth in data, there has been a tremendous growth in analytical and computational tools for drawing inferences from large data sets. Most prominently, techniques from computer science, in particular data mining and machine learning, have frequently been applied to big data. These approaches can often be applied automatically, which is to say without the need to make much in the way of assumptions, and without explicitly specifying models. What is more, they tend to be scalable: it is feasible (in terms of computing resources and time) to apply them to enormous data sets. Such approaches are sometimes seen as black boxes in that the link between the inputs and the outputs is not entirely clear. To some extent these characteristics stand in contrast with statistical techniques, which have been less optimized for use with very large data sets and which make more explicit assumptions based on the nature of the data and the way they were collected. Fitted statistical models are interpretable, if sometimes rather technical.

In an article on big data, Sameer Chopra suggests that organizations should "embrace traditional statistical modeling and machine learning approaches". Some have argued that a new discipline is forming, dubbed data science[2], which combines these and other techniques for working with data. In 2010, Mike Loukides at O'Reilly Media wrote a good summary of data science, except for this odd claim:
Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis.
Leaving aside the confusion between statistics and actuarial science (not to mention stereotyped notions of typical attire), what is curious is the suggestion that "traditional statistics" has little role to play in the effective use of data. Chopra is more diplomatic: machine learning "lends itself better to the road ahead". Now, in many cases, a fast and automatic method may indeed be just what's needed. Consider the recommendations we have come to expect from online stores. They may not be perfect, but they can be quite convenient. Unfortunately, the successes of computing-intensive approaches for some applications have encouraged some grandiose visions. In an emphatic piece titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Chris Anderson, the editor in chief of Wired magazine, writes:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity.
Furthermore:
We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Anderson proposes that instead of taking a scientific approach, we can just "throw the numbers" into a machine and through computational alchemy transform data into knowledge. (Similar thinking shows up in commonplace references to "crunching" numbers, a metaphor I have previously criticized.) The suggestion that we should "forget" the theory developed by experts in the relevant field seems particularly unwise. Theory and expert opinion are always imperfect, but that doesn't mean they should be casually discarded.

Anderson's faith in big data and blind computing power can be challenged on several grounds. Take selection bias, which can play havoc with predictions. As an example, consider the political poll conducted by The Literary Digest magazine, just before the 1936 presidential election. The magazine sent out 10 million postcard questionnaires to its subscribers, and received about 2.3 million back. In 1936, that was big data. The results clearly pointed to a victory by the Republican challenger, Alf Landon. In fact, Franklin Delano Roosevelt won by a landslide. The likely explanation for this colossal failure: for one thing, subscribers to The Literary Digest were not representative of the voting population of the United States; for another, the 23% who responded to the questionnaire were likely quite different from those who did not. This double dose of selection bias resulted in a very unreliable prediction. Today, national opinion polls typically survey between 500 and 3000 people, but those people are selected randomly and great efforts are expended to avoid bias. The moral of this story is that, contrary to the hype, bigger data is not necessarily better data. Carefully designed data collection can trump sheer volume of data. Of course it all depends on the situation.
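A toy simulation can make the point concrete. The sketch below is not a reconstruction of the 1936 poll: the 60% true share, the population size, and the inclusion rates are all made-up assumptions, and `simulate_polls` is a hypothetical helper. It compares a large self-selected sample with a simple random sample of 1000 drawn from the same population.

```python
import random


def share(sample):
    """Fraction of a sample that supports the incumbent."""
    return sum(sample) / len(sample)


def simulate_polls(true_share=0.6, population_size=300_000, seed=1936):
    """Compare a large biased sample with a small random one.

    All numbers are illustrative assumptions, not historical data.
    """
    rng = random.Random(seed)
    # True = supports the incumbent, False = supports the challenger.
    population = [rng.random() < true_share for _ in range(population_size)]

    # "Big data" sample: challenger supporters are assumed to be three
    # times as likely to end up in it as incumbent supporters.
    include_prob = {True: 0.1, False: 0.3}
    big_biased = [v for v in population if rng.random() < include_prob[v]]

    # Small simple random sample from the same population.
    small_random = rng.sample(population, 1000)

    return share(population), share(big_biased), share(small_random), len(big_biased)


truth, biased, randomised, n_biased = simulate_polls()
print(f"true share of incumbent supporters: {truth:.3f}")
print(f"biased sample (n={n_biased}): {biased:.3f}")
print(f"random sample (n=1000): {randomised:.3f}")
```

Under these assumptions the biased sample is dozens of times larger than the random one, yet its estimate sits far from the true share, while the small random sample lands close to it. Making the biased sample bigger would only make it more confidently wrong.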

Selection biases can also be induced during data analysis when cases with missing data are excluded, since the pattern of missingness often carries information. More generally, bias can creep into results in any number of ways, and extensive lists of biases have been compiled. One important source of bias is the well-known principle of Garbage In Garbage Out. Anderson refers to measurements taken with "unprecedented fidelity". It is true that in some areas, impressive technical improvements in certain measurements have been made, but data quality issues are much broader and are usually problematic. Data quality issues can never be ignored, and can sometimes completely derail an analysis.

Another limitation of Anderson's vision concerns the goals of data analysis. When the goal is prediction, it may be quite sufficient to algorithmically sift through correlations between variables. Notwithstanding the previously noted hazards of prediction, such an approach can be very effective. But data analysis is not always about prediction. Sometimes we wish to draw conclusions about the causes of phenomena. Such causal inference is best achieved through experimentation, but here a problem arises: big data is mostly observational. Anderson tries to sidestep this by claiming that with enough data "Correlation is enough":
Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
But on the contrary, investigations of cause and effect (mechanistic explanations) are central to both natural and social science. And in applied fields such as government policy, it is often of fundamental importance to understand the likely effect of interventions. Correlations alone don't answer such questions. Suppose, for example, there is a correlation between A and B. Does A affect B? Does B affect A? Does some third factor C affect both A and B? This last situation is known as confounding (for a good introduction, see this article [pdf]). A classic example concerns a positive correlation between the number of drownings each month and ice cream sales. Of course this is not a causal relationship. The confounding factor here is the season: during the warmer periods of the year when people consume more ice cream, there are far more water activities and hence drownings. When a confounding factor is not taken into account, estimates of the effect of one factor on another may be biased. Worse, this bias does not go away as the quantity of data increases—big data can't help us here. Finally, confounding cannot be handled automatically; expert input is indispensable in any kind of causal analysis. We can't do without theory.
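The ice cream example can be illustrated with a small simulation. In the sketch below, temperature drives both ice cream sales and drownings, and neither affects the other; the coefficients and noise levels are invented for the illustration, and the helper names are hypothetical. The raw correlation is compared with the correlation that remains after the linear effect of temperature has been removed from both variables.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, ts):
    """Residuals of ys after a least-squares straight-line fit on ts."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return [y - (my + slope * (t - mt)) for t, y in zip(ts, ys)]


def simulate(n, seed=0):
    """Return (raw, adjusted) correlations between sales and drownings."""
    rng = random.Random(seed)
    temperature = [rng.uniform(0, 30) for _ in range(n)]          # the confounder
    ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
    drownings = [0.5 * t + rng.gauss(0, 5) for t in temperature]
    raw = pearson(ice_cream, drownings)
    adjusted = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))
    return raw, adjusted


for n in (100, 10_000):
    raw, adjusted = simulate(n)
    print(f"n={n}: raw correlation {raw:.2f}, adjusted for temperature {adjusted:.2f}")
```

With these made-up settings the raw correlation stays clearly positive however many observations are generated, while the adjusted correlation is close to zero. The adjustment is only possible because we knew to measure temperature in the first place, which is exactly the kind of subject-matter knowledge the data cannot supply on their own.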

Big data affords many new possibilities. But just being big does not eliminate the problems that have plagued the analysis of much smaller data sets. Appropriate use of data still requires careful thought—about both the content area of interest and the best tool for the job.

Thinking about Data

It is also useful to think more broadly about the concept of data. Let's start with an examination of the word data itself, to see what baggage it carries.

We are inconsistent in how we talk about data. The words data and information are often used synonymously (think of "data processing" and "information processing"). Notions of an information hierarchy have been around for a long time. One model goes by the acronym DIKW, representing an ordered progression from Data to Information to Knowledge and eventually Wisdom. Ultimately, these are epistemological questions, and easy answers are illusory.

Nevertheless, if what we mean by data is the kind of thing stored on a memory stick, then data can be meaningless noise, the draft of a novel, a pop song, the genome of a virus, a blurry photo taken by a cellphone, or a business's sales records. Each of these types of information and an endless variety of others can be stored in digital memory: on one level all data are equivalent. Indeed the mathematical field of information theory sets aside the meaning or content of data, and focuses entirely on questions about encoding and communicating information. In the same spirit, Chris Anderson argues that we need "to view data mathematically first and establish a context for it later."

But when we consider the use of data, it makes no sense to think of all data as equivalent. The complete lyrics of all of the songs by the Beatles are not the same as a CT scan. Data are of use to us when they are "about" something. In philosophy this is the concept of intentionality, which is an aspect of consciousness. By themselves, the data on my memory stick have no meaning. A human consciousness must engage with the data for them to be meaningful. When this takes place, a complex web of contextual elements comes into play. Depending on who is reading them, the Beatles' lyrics may call to mind the music, the cultural references, the history of rock and roll, and diverse personal associations. A radiologist who examines a CT scan will recognize various anatomical features and perhaps concerning signs of pathology. Judgements of quality may also arise, whether in mistranscribed lyrics or a poorly performed CT scan.

The word data is the plural of the Latin word datum, meaning "something given". So the data are the "givens" in a problem. But in many cases, it might be helpful to think of data as taken rather than given. For example, when you take a photograph, you have a purpose in mind, you actively choose a scene, include some features and exclude others, and adjust the settings of the camera. The quality of the resulting image depends on how steady your hand is and how well you know the principles of photography. Even when a photograph is literally given to you by someone else, it was still taken by somebody. The camera never lies, but the photograph may be misunderstood or misrepresented.

When a gift is given to you, it is easy to default to the passive role of recipient. The details of how the gift was selected and acquired may be entirely unknown to you. A dealer in fine art would carefully investigate a newly acquired work to determine its provenance and authenticity. Similarly, when you receive data from an outside source, it is important to take an active role. At the very least, you should ask questions. Chris Anderson claims that "With enough data, the numbers speak for themselves." But on their own, the numbers never speak for themselves, any more than a painting stolen during WWII will whisper the secret of its rightful ownership. One common source of received data today is administrative data, that is, data collected as part of an organization's routine operations. Rather than taking such data at face value, it is important to investigate the underlying processes and context.

It is also possible to make use of received data to design a study. For example, to investigate the effect of a certain exposure, cases of a rare outcome may be selected from a data set and matched with controls, that is, individuals who are similar except that they did not experience that outcome. (This is a matched case-control study.) Appropriate care must be taken in how the cases and controls are selected, and in ensuring that any selection effects in the original database do not translate into bias in the analysis. Tools for the valid and efficient analysis of such observational studies have been investigated by epidemiologists and statisticians for over 50 years.

When we collect the data ourselves, we have an opportunity to take an active role from the start. In an experiment, we manipulate independent variables and measure the resulting values of dependent variables. Careful experimental design lets us accurately and efficiently obtain results. In many cases, however, true experiments are not possible. Instead, observational studies, where there is no manipulation of independent variables, are used. Numerous designs for observational studies exist, including case-control (as mentioned above), cohort, and cross-sectional. Again, careful design is vital to avoid bias, and to efficiently obtain results.

Conclusion

Excitement over a new development (be it a discovery, a trend, or a way of thinking) can sometimes spill over, like popcorn jumping from a popper. This may give rise to related, but nevertheless distinct ideas. In the heat of the excitement (and not infrequently a good deal of hype), it's important to evaluate the quality of the ideas. Exaggerated claims may not be hard to identify, but they are also frequently pardoned as merely an excess of enthusiasm.

Still, the underlying bad idea may, in subtler form, gradually gain acceptance. The costs may only be appreciated much later. Today it is easy to see how damaging ideas like social Darwinism (the malignant offspring of a very good idea) proved to be. But at the time, it may have seemed like a plausible extrapolation from a brilliant new theory.

The role of data in our societies and our own lives is becoming increasingly central. We live in a world where the quantity of data is exploding and truly gargantuan data sets are being generated and analyzed. But it is important that we not become hypnotized by their immensity. It is all too easy to see data as somehow magical, and to imagine that big data combined with computational brute force will overcome all obstacles.

Let's enjoy the popcorn, but turn down the heat a little.

_____________________________ 
1. ^ In this post, I won't worry too much about whether to treat data as singular or plural. It strikes me as a little bit like the question of whether to talk about bacteria or a bacterium. While the distinction is sometimes important, people can get awfully hung up on it, with little benefit.
2. ^ See this interesting history of data science

Monday, 27 February 2012

Western Brands on Weibo

The purpose of this article is to list some of the Western Companies/Brands that are using China's Sina Wēibó 新浪微博. The text in the square brackets is the Sina Wēibó 新浪微博 name. This article is a continuation of schappo.blogspot.co.uk/2011/08/companies-on-sina.html
  1. 7 For All Mankind [@7ForAllMankind] weibo.com/7forallmankind
  2. Abercrombie & Fitch [@Abercrombie] weibo.com/abercrombieny
  3. Accenture [@埃森哲中国] weibo.com/accenture
  4. Accor Hotels [@雅高酒店AccorHotels] weibo.com/accorchina
  5. Air Liquide [@液空中国] weibo.com/airliquidechina
  6. AKG [@雅登-AKG中国] weibo.com/akgchina
  7. AkzoNobel [@阿克苏诺贝尔中国] weibo.com/akzonobelinchina
  8. Alberta Ferretti [@AlbertaFerretti] weibo.com/albertaferretti
  9. ALDO [@ALDO1972] weibo.com/n/ALDO1972
  10. Alexander McQueen [@Alexander-McQueen] weibo.com/alexandermcqueen
  11. Allen Edmonds [@AllenEdmonds中国] weibo.com/allenedmondschina
  12. Allianz Insurance [@安联保险-Allianz] weibo.com/allianzone
  13. Alpenliebe [@微有爱] weibo.com/alpenliebekindness
  14. American Express [@美国运通中国官方微博] weibo.com/amexchina
  15. Anya Hindmarch [@Anya_Hindmarch_Official] weibo.com/anyahindmarchlondon
  16. Argos [@Argos爱顾商城] weibo.com/2720491021
  17. ASOS [@ASOS] weibo.com/asosofficial
  18. Aspinal of London [@Aspinal-of-London] weibo.com/aspinaloflondonltd
  19. Associated Press [@美联社] weibo.com/apimages
  20. Aston Martin [@阿斯顿马丁拉共达] weibo.com/astonmartinlagondacn
  21. Aston Villa FC [@阿斯顿维拉足球俱乐部] weibo.com/AVFCOfficial
  22. AVIS [@AVIS安飞士租车] weibo.com/avischina
  23. Balenciaga [@Balenciaga] weibo.com/officialbalenciaga
  24. Balmain [@瑞士宝曼手表] weibo.com/balmainwatches
  25. Barbie [@Barbie芭比官方微博] weibo.com/barbieofficial
  26. BASF [@巴斯夫大中华] weibo.com/basfinchina
  27. Bayer [@拜耳中国官方微博] weibo.com/bayerchina
  28. Bentley Motors [@宾利BentleyMotors] weibo.com/bentleymotorsuk
  29. Bergdorf Goodman [@Bergdorfs] weibo.com/bergdorfs
  30. Best Buy [@BestBuy百思买] weibo.com/bestbuycn
  31. Bloomingdaleʼs [@Bloomingdales_USA] weibo.com/bloomingdalesusa
  32. Blue Nile Inc [@BlueNileInc] weibo.com/bluenileinc
  33. Bobbi Brown [@BobbiBrownChina] weibo.com/bobbibrownchina
  34. Bonpoint [@Bonpoint-中国] weibo.com/bonpoint
  35. Bosch [@博世中国] weibo.com/boschauto
  36. Boucheron [@Boucheron宝诗龙微博] weibo.com/boucheronparis
  37. Breitling [@百年灵BREITLING] weibo.com/breitlingchina
  38. Bremont [@Bremont宝名表] weibo.com/bremont
  39. British Airways [@英国航空] weibo.com/britishairways
  40. Brompton Bicycle [@Brompton_bicycle_伯龙腾] weibo.com/bromptonbicycle
  41. BVLGARI [@BVLGARI宝格丽] weibo.com/bulgari
  42. BVLGARI Perfume [@宝格丽香水] weibo.com/bulgariperfume
  43. Cambridge Satchel Co. [@The_Cambridge_Satchel_Company] weibo.com/jianqiaobao
  44. Campo Marzio Design [@CampoMarzio中国区] weibo.com/campomarzio
  45. Camus [@卡慕CAMUS] weibo.com/camuschina
  46. CARAT London [@CARAT官方微博] weibo.com/caratlondon
  47. Caterpillar [@Caterpillar官方微博] weibo.com/caterpillarinchina
  48. Cath Kidston [@CathKidstonChina] weibo.com/cathkidstonchina
  49. Champagne Taittinger [@泰亭哲香槟] weibo.com/champagnetaittinger
  50. Cheerios [@雀巢脆谷乐] weibo.com/nestlecheerios
  51. Chopard [@萧邦Chopard] weibo.com/chopardchina
  52. Christian Louboutin [@ChristianLouboutin官方微博] weibo.com/LouboutinWorld
  53. Christie's [@佳士得国际] weibo.com/christies
  54. Clarisonic [@Clarisonic科莱丽-欧莱雅] weibo.com/clarisonicchina
  55. Club Monaco [@Club_Monaco] weibo.com/clubmonaco
  56. CME Group [@CMEGroup] weibo.com/cmegroup
  57. Cows Creamery [@COWS冰激凌] weibo.com/cowscreamery
  58. Decanter [@Decanter醇鉴] weibo.com/decantercn
  59. Ducati [@杜卡迪中国] weibo.com/ducatichina
  60. Dulux [@多乐士Lets_Colour] weibo.com/letscolor
  61. DuPont [@杜邦公司] weibo.com/dupont
  62. eBay [@eBay] weibo.com/ebay
  63. Elizabeth Arden [@伊丽莎白雅顿美丽沙龙] weibo.com/elizabetharden
  64. EMC Corporation [@EMC中国-云计算] weibo.com/emcgreatchina
  65. Eppendorf [@eppendorf官方微博] weibo.com/eppendorfchina
  66. Ernst & Young [@安永EY] weibo.com/eyernstyoung
  67. Etro [@ETRO艾绰] weibo.com/etrochina
  68. Eurostar [@欧洲之星_Eurostar] weibo.com/eurostarchina
  69. Fairmont Hotels & Resorts [@费尔蒙酒店] weibo.com/fairmonthotels
  70. Fendi [@FENDI] weibo.com/fendi
  71. Financial Times [@FT中文网] weibo.com/ftchinese
  72. Finnair [@芬兰航空Finnair] weibo.com/finnaircom
  73. Firefox [@火狐] weibo.com/firefox
  74. Fisher-Price [@费雪中国官方微博] weibo.com/fisherprice
  75. Fisherman's Friend [@渔夫之宝官方微博] weibo.com/ffgfwb
  76. Fissler [@德国菲仕乐] weibo.com/fisslerchina2013
  77. Flipboard [@Flipboard] weibo.com/flipboard
  78. Freescale Semiconductor [@飞思卡尔] weibo.com/freescale
  79. Furla [@Furla_孚勒] weibo.com/furlaofficial
  80. G-Star RAW [@G-STARCHINA] weibo.com/gstarchina
  81. Geox [@健乐士GEOX] weibo.com/jianleshigeox
  82. Girard-Perregaux [@GP芝柏表] weibo.com/gpchina
  83. Glenmorangie [@格兰杰单一麦芽威士忌] weibo.com/glenmorangiechina
  84. GNC [@GNCLiveWell] weibo.com/gnclivewell
  85. GRAFF [@格拉夫GRAFF] weibo.com/graff
  86. Gregory Mountain Products [@Gregory官方微博] weibo.com/gregory1977
  87. Grey Goose [@法国灰雁GreyGoose] weibo.com/greygoosechina
  88. Guinevere Launcelot [@Guinevere_Launcelot] weibo.com/gltlondon
  89. Gymboree [@金宝贝国际早教微课堂] weibo.com/gymboree
  90. H2O+ [@H2O水芝澳官方微博] weibo.com/h2ochina
  91. Hackett London [@Hackett-London] weibo.com/hackettlondon
  92. Halma [@HALMA中国] weibo.com/halma
  93. Hardy Amies [@HardyAmies赫迪雅曼] weibo.com/HardyAmies
  94. Harry Winston [@海瑞温斯顿HarryWinston] weibo.com/harrywinston
  95. Hasbro [@孩之宝中国] weibo.com/hasbrochina
  96. Holland & Barrett [@HollandAndBarrett] weibo.com/hollandandbarrett
  97. Hollister [@Hollister] weibo.com/hollister
  98. Hooters [@美国猫头鹰餐厅-中国] weibo.com/hooterschina
  99. Hublot [@宇舶表] weibo.com/hublothanhan
  100. Hyatt [@凯悦酒店集团HYATT] weibo.com/hyatthotelscorp
  101. IBM [@IBM中国] weibo.com/ibm100
  102. IMAX [@IMAX] weibo.com/imax
  103. Irregular Choice [@IrregularChoice香港] weibo.com/irregularchoicehk
  104. IWC [@IWC万国表] weibo.com/iwcchina
  105. J.Lindeberg [@JLINDEBERG林德伯格] weibo.com/jlindeberg
  106. Jack Wolfskin [@JackWolfskin官方微博] weibo.com/jackwolfskingermany
  107. Jaeger-LeCoultre [@积家官方微博] weibo.com/jaegerlecoultrechina
  108. Jo Malone [@JoMaloneLondon祖玛珑] weibo.com/jomalonelondon
  109. Juniper Networks [@瞻博网络] weibo.com/junipernetworks
  110. Kate Spade New York [@katespade官方微博] weibo.com/katespadeny
  111. Kipsta [@KIPSTA中国] weibo.com/kipstachina
  112. Kleenex [@舒洁kleenex] weibo.com/n/舒洁kleenex
  113. Lagostina [@拉歌蒂尼] weibo.com/lagostina
  114. Lana Marks [@LANA-MARKS-CHINA] weibo.com/lanamarks
  115. Lancaster [@兰嘉丝汀] weibo.com/lancasterchina
  116. Le Coq Sportif [@lecoqsportif中国] weibo.com/lecoqsportif
  117. Lindt [@Lindt瑞士莲巧克力] weibo.com/lindtchina
  118. Lonely Planet [@LonelyPlanet] weibo.com/lonelyplanet
  119. Luisa Via Roma [@LUISAVIAROMA官方微博] weibo.com/luisaviaroma
  120. MAC Cosmetics [@MAC魅可] weibo.com/maccosmetics
  121. Macy's [@美国梅西百货] weibo.com/MacysChina
  122. Manchester City FC [@曼城足球俱乐部MCFC] weibo.com/mcfcofficial
  123. Manchester United FC [@曼联足球俱乐部] weibo.com/manchesterunited
  124. Mango [@MANGO中国官网] weibo.com/mangofashion
  125. Marc Jacobs [@MarcJacobsIntl莫杰] weibo.com/marcjacobsintl
  126. Maria Luisa [@MARIA_LUISA] weibo.com/marialuisa
  127. Marimekko [@MARIMEKKO_玛莉美歌] weibo.com/marimekkoofficial
  128. Marmot [@Marmot中国] weibo.com/marmot001
  129. Marni [@MARNI] weibo.com/officialmarni
  130. Marvin Watches [@Marvin-瑞士摩纹表] weibo.com/marvinwatch
  131. MasterCard [@万事达人] weibo.com/mastercardchina
  132. Maxi-Cosi [@Maxi-Cosi] weibo.com/maxicosi
  133. McLaren [@迈凯伦汽车] weibo.com/mclarenchina
  134. Media Markt [@万得城电器] weibo.com/mediamarktchina
  135. Medtronic [@美敦力中国] weibo.com/medtronicchina
  136. Meltwater Group [@Meltwater] weibo.com/meltwater
  137. Mettler Toledo [@梅特勒-托利多中国] weibo.com/mettlertoledo
  138. Michael Kors [@Michael-Kors] weibo.com/michaelkors
  139. Monster Cable [@Monster-魔声中国] weibo.com/monsterchina
  140. Mothercare [@mothercare官方微博] weibo.com/mothercarechina
  141. Movado [@摩凡陀Movado] weibo.com/movado
  142. MTV [@MTV中文频道] weibo.com/mtvchina
  143. Mulberry [@Mulberry_Official] weibo.com/mulberryofficial
  144. NASDAQ OMX [@纳斯达克交易所] weibo.com/nasdaqomx
  145. Neiman Marcus [@NeimanMarcus尼曼] weibo.com/neimanmarcuschina
  146. NERF [@孩之宝NERF-热火] weibo.com/ilovenerf
  147. New Balance [@新百伦newbalance] weibo.com/newbalanceofficial
  148. New York Times [@纽约时报中文网] weibo.com/nytchinese
  149. Nuxe Paris [@Nuxe欧树] weibo.com/nuxe
  150. Old Navy [@OldNavyChina] weibo.com/oldnavychina
  151. Ovaltine [@阿华田Ovaltine] weibo.com/ovaltine001
  152. Oxford University Press [@牛津大学出版社全球学术出版] weibo.com/oupacademic
  153. Pandora [@PANDORA珠宝] weibo.com/pandorajewellery
  154. Papa John's Pizza [@棒约翰PapaJohns] weibo.com/papachina
  155. Paul Smith [@PaulSmith保罗史密斯] weibo.com/paulsmithofficial
  156. Paula's Choice [@PaulasChoice宝拉珍选] weibo.com/paulaschoice01
  157. PayPal [@PayPal_China] weibo.com/paypalmarketing
  158. Penguin Books [@企鹅出版社] weibo.com/penguinbooks
  159. Perficient [@博克软件] weibo.com/perficientchina
  160. Peugeot Scooters [@标致摩托] weibo.com/peugeotscooters
  161. Pfizer [@辉瑞中国] weibo.com/pfizerchina
  162. Piaget [@PIAGET] weibo.com/piaget
  163. Piaggio [@比亚乔机车] weibo.com/piaggio1884
  164. Pineider [@彼耐德Pineider] weibo.com/pineider
  165. Pizza Hut [@必胜客欢乐餐厅] weibo.com/pizzahut
  166. Pomellato [@Pomellato宝曼兰朵] weibo.com/pomellatoinchina
  167. Pony [@ponychina] weibo.com/ponychina
  168. Printemps [@春天百货Printemps] weibo.com/printempsparis
  169. Pull-in [@PULLIN内衣] weibo.com/pullinasia
  170. Razorfish [@RazorfishChina] weibo.com/razorfish
  171. Ritz-Carlton [@丽思卡尔顿酒店] weibo.com/ritzcarlton
  172. Rockport [@ROCKPORT美国乐步] weibo.com/rockportchina
  173. Roger Dubuis [@罗杰杜彼RogerDubuis] weibo.com/rogerdubuis
  174. Roger Vivier [@RogerVivier_罗杰维维亚] weibo.com/rogervivier
  175. Rovio Entertainment [@Rovio娱乐] weibo.com/rovioentertainment
  176. Rupert Sanderson [@RupertSanderson] weibo.com/rupertsanderson
  177. Schneider Electric [@施耐德电气中国] weibo.com/schneidercn
  178. SELECTED [@SELECTED中国官方微博] weibo.com/selectedchina
  179. Selfridges [@Selfridges] weibo.com/selfridgesuk
  180. Sergio Rossi [@sergio_rossi] weibo.com/sergiorossi
  181. Shell [@壳牌中国集团] weibo.com/shellinchina
  182. Sheraton Hotels & Resorts [@喜来登酒店及度假村Sheraton] weibo.com/sheratonhotels
  183. Shopbop [@shopbop] weibo.com/shopbopchina
  184. Sigma-Aldrich [@SigmaAldrich] weibo.com/sigmaaldrich
  185. Skechers [@SKECHERS斯凯奇] weibo.com/skechers
  186. Skyscanner [@Skyscanner天巡] weibo.com/skyscannertx
  187. South Coast Plaza [@SouthCoastPlaza] weibo.com/southcoastplaza
  188. Standard Chartered Bank [@渣打银行中国] weibo.com/scbmainlandchina
  189. Stickhouse [@Stickhouse] weibo.com/stickhouse
  190. Stiebel Eltron [@斯宝亚创StiebelEltron] weibo.com/stiebeleltron
  191. Stroili Oro [@StroiliOro] weibo.com/stroilioro
  192. TAG Heuer [@豪雅TAGHeuer] weibo.com/tagheuerchina
  193. Ted Baker [@TedBakerLondon] weibo.com/tedbakerlondon
  194. Tesco [@乐购中国官方微博] weibo.com/TESCOofficial
  195. The Glenlivet [@格兰威特威士忌] weibo.com/theglenlivet
  196. Thermo Fisher Scientific [@赛默飞] weibo.com/thermofishercn
  197. Times Higher Education [@泰晤士报高等教育期刊] weibo.com/timeshighereducation
  198. TLD Registry [@域通联达] weibo.com/tldregistry
  199. Toblerone [@瑞士三角巧克力] weibo.com/toblerone
  200. Tom & Jerry [@华纳兄弟-猫和老鼠] weibo.com/tomandjerryoffical
  201. Topshop Shēnzhèn [@TOPSHOP深圳] weibo.com/topshopsz
  202. Tottenham Hotspur [@热刺TottenhamHotspur] weibo.com/tottenhamhotspur
  203. Truefitt & Hill [@TRUEFITT-HILL-CHINA] weibo.com/truefittandhill
  204. Unisys [@优利中国] weibo.com/unisyschina
  205. Valentino [@Valentino官方微博] weibo.com/valentinoofficial
  206. Van Cleef & Arpels [@VanCleefArpels梵克雅宝] weibo.com/vancleefarpelschina
  207. Vichy Laboratoires [@薇姿医生] weibo.com/vichybrand
  208. Visa [@Visa中国] weibo.com/visachina
  209. VMware [@VMware中国] weibo.com/vmware
  210. Volvo [@沃尔沃集团中国] weibo.com/volvogroupchina
  211. Wall Street Journal [@华尔街日报中文网] weibo.com/chinesewsj
  212. Wallpaper* Magazine [@WallpaperMagazine] weibo.com/wallpapermag
  213. Walmart [@沃尔玛中国官方微博] weibo.com/wmcsr
  214. West Bromwich Albion [@西布朗足球俱乐部官微] weibo.com/westbrom
  215. Westin Hotels & Resorts [@Westin] weibo.com/westinhotels
  216. Wiggle [@Wiggle中国] weibo.com/wigglechina
  217. William & Son [@WilliamandSon] weibo.com/williamandson
  218. Wolfram [@WolframChina] weibo.com/wolframchina
  219. YOOX [@YOOX网络概念店] weibo.com/yooxcn
  220. Yves Rocher [@Yves-Rocher伊夫黎雪] weibo.com/yvesrocher1959
  221. Zatchels [@Zatchels] weibo.com/zatchelsuk
  222. Zenith [@ZENITH真力时] weibo.com/zenithchina

Monday, 20 February 2012

Language Characteristics

In this article I list some characteristics of natural languages and scripts as they are manifested and used in modern-day IT. Languages always have exceptions, so there will be some exceptions to these characteristics. I will not delve into linguistic technicalities such as the distinction between mora and syllable or between logogram and ideogram; I will take a broad-brush approach.

Arabic

  1. Arabic is written in the Arabic script
  2. Written from right to left
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms
  6. A Keyboard Mapping is sufficient in order to write Arabic
  7. The Arabic script is inherently cursive and hence is presented/displayed in its cursive form.
  8. Letters change shape according to their position within a word. These different shapes are named Initial, Medial, Final and Isolated forms. en.wikipedia.org/wiki/Arabic_alphabet#Letter_forms
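
The positional forms in item 8 can be inspected with Python's standard `unicodedata` module. This is a minimal sketch, assuming the legacy presentation-form code points for the letter BEH (U+FE8F to U+FE92) as the example. Normal Arabic text stores only the base letter (U+0628) and leaves shaping to the font and rendering engine, so the presentation forms serve here purely as an illustration. The helper name `describe` is my own.

```python
import unicodedata

# Isolated, final, initial and medial presentation forms of
# ARABIC LETTER BEH (U+0628), from Arabic Presentation Forms-B.
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

def describe(ch):
    """Return (code point, Unicode name, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

for form in BEH_FORMS:
    # NFKC folds each positional form back to the base letter.
    base = unicodedata.normalize("NFKC", form)
    print(describe(form), "->", describe(base))
```

The bidi class reported for these characters is `AL` (Arabic Letter), which is what tells a renderer to lay the text out from right to left.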

Chinese

  1. Chinese is written in the Chinese script, which consists of hànzì (汉字) characters, of which there are tens of thousands
  2. Written from left to right. Once browsers implement CSS3 Writing Modes we may well see some return to the traditional vertical text in webpages dev.w3.org/csswg/css-writing-modes/#vertical-intro
  3. There is no space character separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+3002 IDEOGRAPHIC FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms
  6. An Input Method is required in order to write Chinese
  7. All characters, including punctuation, are monospaced. Thus, for example, the list items separator in the text string "北京，南京，东京" is the single character U+FF0C FULLWIDTH COMMA. The text string "北京、南京、东京" uses the single character U+3001 IDEOGRAPHIC COMMA as the list items separator.
  8. With respect to the number of characters required to communicate, Chinese is much more compact than English. Given a sentence written in English, the same sentence written in Chinese would require far fewer characters. This compactness gives Chinese a significant advantage over English for IDNs and when microblogging.
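
The punctuation characters named in items 4 and 7 can be confirmed with Python's `unicodedata` module. The sketch below is illustrative only, and the function name `punctuation_report` is my own. It lists the Unicode name and East Asian Width property of each non-letter character in a string, where `F` means fullwidth and `W` means wide.

```python
import unicodedata

def punctuation_report(text):
    """Return (char, Unicode name, East Asian Width) for each non-letter char."""
    return [
        (ch, unicodedata.name(ch), unicodedata.east_asian_width(ch))
        for ch in text
        if not ch.isalpha()
    ]

# The two list strings from item 7, plus a full stop from item 4.
samples = [
    "北京\uff0c南京\uff0c东京",   # separated by U+FF0C
    "北京\u3001南京\u3001东京",   # separated by U+3001
    "北京\u3002",                 # terminated by U+3002
]

for sample in samples:
    for entry in punctuation_report(sample):
        print(entry)
```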

English

  1. English is written in the Latin script
  2. Written from left to right
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Has uppercase and lowercase letter forms
  6. A Keyboard Mapping is sufficient in order to write English

Japanese

  1. Japanese is written in the Japanese scripts Kanji (漢字), Hiragana (ひらがな) and Katakana (カタカナ)
  2. Written from left to right. Once browsers implement CSS3 Writing Modes we may well see some return to the traditional vertical text in webpages dev.w3.org/csswg/css-writing-modes/#vertical-intro
  3. There is no space character separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+3002 IDEOGRAPHIC FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms. Uppercase is sometimes used for emphasis in English; similarly, Katakana is sometimes used for emphasis in Japanese.
  6. An Input Method is required in order to write Japanese
  7. In general, Japanese, like Chinese, is monospaced. The exception is that there are half-width forms of Katakana and some punctuation characters. The half-width forms are in the Unicode block Halfwidth and Fullwidth Forms, U+FF00 ➤ U+FFEF.
  8. With respect to the number of characters required to communicate, Japanese is much more compact than English. Given a sentence written in English, the same sentence written in Japanese would require far fewer characters. This compactness gives Japanese a significant advantage over English for IDNs and when microblogging.
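
The half-width Katakana mentioned in item 7 can be examined in Python. This is a sketch rather than a recommendation: it spells カタカナ with half-width code points, reports their East Asian Width (`H` for halfwidth), and shows that NFKC compatibility normalization maps them to the ordinary Katakana letters. The function name `to_standard_width` is my own.

```python
import unicodedata

# "カタカナ" written with half-width Katakana (U+FF76, U+FF80, U+FF76, U+FF85).
HALFWIDTH = "\uff76\uff80\uff76\uff85"

def to_standard_width(text):
    """Fold half-width/full-width compatibility forms via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

for ch in HALFWIDTH:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.east_asian_width(ch))

print(to_standard_width(HALFWIDTH))
```

NFKC also folds other compatibility characters, such as fullwidth Latin letters, so it is not a Katakana-only conversion.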

Korean

  1. Korean is written in the Hangeul (한글) script
  2. Written from left to right
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms
  6. An Input Method is required in order to write Korean
  7. The individual Korean letters (jamo/자모) are grouped into and displayed as syllabic blocks, e.g. the individual jamo ㅎ ㅏ ㄴ ㄱ ㅜ ㄱ are combined to form the two Korean characters 한국
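
The grouping of jamo into syllabic blocks in item 7 can be reproduced with Unicode normalization. The sketch below decomposes 한국 into six conjoining jamo with NFD and recomposes them with NFC. It also applies the arithmetic composition formula from the Unicode Standard. The function name `compose_syllable` and the index arguments are my own. Note that the jamo as printed in the list above are compatibility jamo (the U+3131 block), which NFC does not compose, so the code works with conjoining jamo (the U+1100 block) instead.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

def compose_syllable(lead, vowel, tail=0):
    """Compose one Hangul syllable from jamo indices (tail 0 = no final consonant)."""
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail)

word = "한국"
jamo = unicodedata.normalize("NFD", word)        # six conjoining jamo
recomposed = unicodedata.normalize("NFC", jamo)  # back to two syllable blocks

print(len(word), len(jamo), recomposed == word)
print([unicodedata.name(j) for j in jamo])
print(compose_syllable(18, 0, 4) + compose_syllable(0, 13, 1))
```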

Russian

  1. Russian is written in the Cyrillic (Кириллица) script
  2. Written from left to right
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Has uppercase and lowercase letter forms
  6. A Keyboard Mapping is sufficient in order to write Russian
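
The case characteristic (item 5 in each list) can be checked programmatically. This is a rough sketch, with the helper name `has_case` being my own: it treats a script sample as cased if any of its characters changes under `str.upper()` or `str.lower()`.

```python
def has_case(text):
    """Return True if any character has distinct upper and lower case forms."""
    return any(ch.lower() != ch.upper() for ch in text)

samples = {
    "Arabic": "عربي",
    "Chinese": "汉字",
    "English": "English",
    "Japanese": "ひらがなカタカナ",
    "Korean": "한글",
    "Russian": "Русский",
}

for language, sample in samples.items():
    print(language, "cased" if has_case(sample) else "unicase")
```

Of the six samples, only the English and Russian ones are reported as cased, matching the lists above.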