Blog moved to codeahoy.com

I have moved this blog to www.codeahoy.com. Please join me over there.


See these tax mistakes businesses make. Or don’t, and pay Uncle Sam.

Very good tax advice for startups.

Paperistic Blog

Money Mistake

Tax deadline is just around the corner. Sometimes even a straight-forward individual return can be quite a challenge. Things get more interesting when you add the business complexities associated with a small business or a startup. For many new entrepreneurs, filing taxes can be a real learning curve.

Most of us are aware of IRS audits and the associated pains. But what most entrepreneurs don’t realize is that they often overpay Uncle Sam by committing small mistakes and overlooking various tax deductions.

Keep reading to discover the common tax mistakes that lead startups to leave hundreds, even thousands, of dollars on the table, and how to avoid making them yourself.

Don’t Overpay Your Taxes – Mistakes 1 to 5

In the SlideShare presentation below, we’ve listed 5 common tax mistakes startups (and small-businesses) make. Enjoy the slides and save yourself money, time and headaches.

So the main takeaway from the presentation is, and we cannot…


Paperless Techniques to Stop Wasting Away Your Time

Paperistic Blog

Forest

A couple of days ago, I received a link from a colleague to an article. The author talked about how his company declared war on paper by taking extreme measures such as hiding printers in hard-to-find places and getting rid of toilet paper in the bathrooms (they installed Aqua Clean WC, which cleans with water).

Gotta love the modern workplace.

While I cannot think of a scenario where the use of paper should not be avoided, the reality is that paper is an important part of our modern society. It touches our lives in so many different forms – receipts, contracts, checks, warranties, packaging, lecture notes, passports to name a few.

Dumb use of paper: Verizon sent 53 letters in the mail thanking a customer for subscribing to eco-friendly paperless billing.

Whether you need to store receipts for expense or tax purposes, or you love the tangible feel of taking lecture notes on paper, here…


4 Great Mobile Scanning Apps Android Users Need to See

Must have apps for scanning and digitizing those receipts you’d need later. Scan it and toss it in the bin.

Paperistic Blog

Scanning a document used to be tricky in the past – you had to access an actual scanner, load the paper and wait 10 seconds or more for the scanner to finish scanning.

Scanner

These days, converting a document to PDF is almost hassle-free and doesn’t require a clumsy scanner. The apps we’re going to review in this post make scanning a breeze. The principal concept for all the apps is the same: you snap a photo of the document from the app, it detects document edges and enhances the image to look and feel like paper and exports to PDF for sharing.

Let’s start.

1) CamScanner

It’s no surprise that CamScanner is at the top of this list — they have the largest number of users (about 50 million). Although there are lots of great features, the two best that I absolutely love are: 1) Excellent algorithm for detecting document edges inside…


Top 10 Amazing Books I Read in 2013 & Recommend

Here’s my top ten list of the books I managed to read in 2013.

1. Don’t Make Me Think by Steve Krug – One of the best UI/UX design books I have ever read. Short, to the point, and easy to read, with lots of examples. While an excellent book, it first came out in 2001 and then in 2005, so some web design suggestions may be dated. It’s not about how to add pizzazz to your site by applying lipstick but how to make websites that are usable and don’t force users to think.

2. Endurance – Shackleton’s Incredible Voyage by Alfred Lansing – A compelling account of Shackleton’s incredible, but doomed voyage from England to the southern Antarctica. Heartwarming story of Shackleton’s courageous leadership skills that allowed his lost crew to survive bitter cold, darkness, constant danger for months. Highly recommended for anyone aspiring to be a leader.

3. The Snowball: Warren Buffett and the Business of Life by Alice Schroeder – I picked this up at a local Chapters without knowing what to expect. It turned out to be a very good find. It covers Buffett’s early life in Omaha, his childhood, his influences, and the decisions that made him the best investor in the world. My only complaint is that the book is huge!

4. Peopleware by Tom DeMarco – Amazing, fantastic, mind-blowing guide on managing software teams. Non-conventional, no-nonsense approach to management. Couldn’t recommend it more for anyone who is a manager or aspiring to be one. In fact, it should be made compulsory for all managers to read this book every quarter.

5. Salem’s Lot by Stephen King – Not a big fan of horror fiction, but this was a great story. Spoiler Alert: Dracula in a small, sleepy New England town.

6. The Five Dysfunctions of a Team: A Leadership Fable by Patrick M. Lencioni – An excellent leadership book written as an easy-to-read story about a dysfunctional team in an imaginary tech company in the Valley. Short and sweet: could be read in two or three sittings.

7. Hadoop: The Definitive Guide, 3rd Edition by Tom White – An excellent resource for anyone wanting to learn Apache Hadoop. It is hard to read cover to cover, but the first few chapters are an excellent introduction to Hadoop. Staying on my reference shelf.

8. How to Win Friends & Influence People by Dale Carnegie – I’ve been meaning to read this book for years now. Reading about it in “The Snowball”, and learning that Warren Buffett went as far as enrolling in Carnegie’s seminars, convinced me to finally buy it. I’ve not much to say other than it is a good book.

9. Programming in Scala by Martin Odersky – I’m into functional programming and like Scala. This is the best book on Scala written by the same homely genius who created Scala. He also runs a free online course on Scala on Coursera which I also highly recommend.

10. Enterprise Integration Patterns by Gregor Hohpe – Patterns on integrating enterprise applications using messaging and asynchronous communications. Kind of old, but very good book for understanding and building solid concepts.

Trip to Santiago

Last month I visited Santiago, Chile. It was a short business trip that lasted 5 days. I stayed at the Holiday Inn located in the financial district of Santiago. The neighbourhood was very well developed relative to other parts of the city and a lot of people were very formally dressed. The trip was excellent overall except for an unfortunate event on the last day, but we’ll get to that story later.

In this post, I will share a few pictures and stories from the trip.


Outside the hotel


Overall, people were nice and helpful. Don’t expect everyone to speak or even understand English, though; the language barrier becomes annoying. In meetings, I noticed that several locals were able to understand English but were either shy or lacked the confidence to hold conversations. Luckily, we had people with us who spoke both languages.

The food was ok. On my first day, I said yes to a raw meat dish since I’m a fan of tartare (and its Vietnamese version). To my surprise, I received an *enormous* plate of minced raw meat with absolutely nothing on it, plus some (very hot) chillies, lemon juice and mayo on the side. It wasn’t that bad, but I wouldn’t try it again. A local burger joint called Mr. Jack has fantastic burgers. Not to mention Peruvian ceviche, which I enjoyed as well. This discussion is incomplete without mentioning Pisco Sour. I normally watch my sugar intake, but those drinks were just amazing.

Gran Torre Santiago is the tallest building in South America, which is an amazing feat considering earthquakes are common occurrences in Chile.

The weather was perfect. Not too hot, nor cold. Santiago has lots of hills and presents a nice view of snow covered Andes Mountains which looked similar to the Rockies. They have long underground tunnels for traffic.


Hills outside Telco office

The day my colleague and I were supposed to leave Santiago, our luggage got stolen from the vehicle. It probably happened in the parking lot of the mall we visited to buy souvenirs. We lost our luggage and everything in it. I had taken my passport out of my luggage for whatever reason before I left the hotel, so I was lucky to at least not have that stolen. In the end, I was happy to board the flight back home, having learned an expensive lesson: don’t leave your valuables in your vehicle, especially if you are going to park it in a shady-looking parking lot in Santiago 🙂

Unicode isn’t harmful for health – Unicode Myths debunked and encodings demystified

if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

This infamous threat was first published a decade ago by Joel Spolsky. Unfortunately, a lot of people thought he was merely kidding and as a result, many of us still don’t fully understand Unicode and for that matter the difference between Unicode, UTF-8 and UTF-16. And that is the main motivation behind this article.

Without further ado, let us jump straight into action. Say, one fine afternoon, you receive an email from a long lost friend from high school with an attachment in .txt, or, as it is often called, the “plain text” format. The attachment consists of the following sequence of bits:

0100100001000101010011000100110001001111

The email itself is empty, adding to the mystery. Before you kickstart your favorite text editor and open the attachment, have you ever wondered how the text editor interprets the bit pattern to display characters? Specifically, how does your computer know the following two things:

  1. How the bytes are grouped (E.g. 1 or 2-byte characters?)
  2. How to map byte or bytes to characters?

The answer to these questions lies in the document’s Character Encoding. Loosely speaking, an encoding defines two things:

  1. How the bits are grouped, for example into units of 8 bits or 16 bits. Each such unit is known as a Code Unit.
  2. The mapping of Code Units to Characters (e.g. in ASCII, decimal 65 maps to the letter A).
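The two steps above can be sketched in a few lines of Python. This is only an illustration, assuming 8-bit code units and the ASCII mapping; the helper name `decode_bits` is made up for this post.

```python
def decode_bits(bits: str, encoding: str = "ascii") -> str:
    """Group a string of 0s and 1s into 8-bit code units, then map them to characters."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length is not a multiple of 8")
    # Step 1: group the bits into code units of 8 bits each.
    units = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    # Step 2: map the code units to characters using the chosen encoding.
    return bytes(units).decode(encoding)


attachment = "0100100001000101010011000100110001001111"
message = decode_bits(attachment)
print(message)
```

Change either assumption (the unit size or the mapping) and the same bits can turn into different text, or into no valid text at all.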

Character Encodings are a tiny bit different from Character Sets, but that really isn’t relevant to you unless you are designing a low-level library.

One of the most popular encoding schemes of the last century, at least in the Western World, was known as ASCII. The table below shows how code units map to characters in ASCII.

US ASCII Chart

There is a common misconception even amongst seasoned developers that “plain text” uses ASCII and that each character is 8-bits.

 Truth be told, there is no such thing as “plain text”.  If you have a string in memory or disk and you do not know its encoding, you cannot interpret it or display it. There is absolutely no other way around it.

How can your computer interpret the attachment you just received when it doesn’t specify an encoding? Does this mean you can never read what your long lost friend really wanted to tell you? Before we get to the answer, we must travel back in time to the dark ages… where a 29 MB hard disk was the best money (and a lot of it) could buy!

Historical Perspective

A long, long time ago, computer manufacturers had their own ways of representing characters. They didn’t bother to talk to one another and came up with whatever algorithm they liked to render “glyphs” on screens. As computers became more and more popular and the competition intensified, people got sick and tired of this “custom” mess as data transfer between different computer systems became a pain in the butt.

Eventually, computer manufacturers got their heads together and came up with a standard way of describing characters. “Lo and behold”, they declared, “the low 7 bits in a byte represent a character”. And they created a table like the one shown in the first figure to map each 7-bit value to a character. For example, the letter A was 65, c was 99, ~ was 126 and so on. And ASCII was born. The original ASCII standard defined characters from 0 to 127, which is all you can fit in 7 bits. Life was good and everyone was happy. That is, for a while…

Why did they pick 7 bits and not 8? I don’t exactly care. But a byte holds 8 bits. This means 1 whole bit was left completely unused, and the range from 128 to 255 was left unregulated by the ASCII guys, who, by the way, were Americans who knew nothing, or even worse, didn’t care about the rest of the world.

People in other countries jumped at this opportunity and started using the 128-255 range to represent characters in their languages. For example, 144 was گ in the Arabic flavour of ASCII, but in the Russian one, it was ђ. Even in the United States of America, there were many different interpretations of the unused range. The IBM PC came out with the “OEM font” or “Extended ASCII”, which provided fancy graphical characters for drawing boxes and supported some European characters like the pound (£) symbol.


 A “cool” looking DOS splash screen made using IBM’s Extended ASCII Charset.

To recap: the problem with ASCII was that while everyone agreed what to do with codes up to 127, the range 128-255 had many, many different interpretations. You had to tell your computer the flavor of ASCII to display characters in the 128-255 range correctly.
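To see the problem in action, here is a small Python sketch that decodes the same byte under three code pages. It assumes that Python’s bundled `cp1256` (Arabic), `cp1251` (Cyrillic) and `cp437` (original IBM PC) codecs are reasonable stand-ins for the “flavours” described above.

```python
raw = bytes([144])  # a single byte from the contested 128-255 range

# The same byte turns into a different character under each code page.
for code_page in ("cp1256", "cp1251", "cp437"):
    print(code_page, raw.decode(code_page))

# The pound sign lives at 156 in the IBM PC code page.
pound = bytes([156]).decode("cp437")
print(pound)
```

Nothing in the byte itself says which interpretation is the right one; that information has to come from somewhere else.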

This wasn’t a problem for North Americans and people of the British Isles, since no matter which ASCII flavor was being used, the Latin letters stayed the same. The British had to live with the fact that the original ASCII didn’t include their currency symbol. “Blasphemy! Those arseholes.” But that’s water under the bridge.

Meanwhile, in Asia, there was even more madness going on. Asian languages have a lot of characters and shapes that need to be stored. 1 byte isn’t enough. So they started using 2 bytes for some characters in their documents. This was known as DBCS (Double Byte Character Set). In DBCS, string manipulation using pointers was a pain: how could you do str++ or str--?
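The pointer pain is easy to reproduce. The sketch below uses Shift JIS, a Japanese double-byte encoding that ships with Python, as one example of a DBCS; the helper `can_decode` is made up for this illustration.

```python
text = "Aあ"  # one Latin letter followed by one Japanese hiragana letter
data = text.encode("shift_jis")

# Two characters, but three bytes: "A" takes one byte and "あ" takes two.
print(len(text), len(data))


def can_decode(chunk: bytes) -> bool:
    """Return True if the chunk is a complete, valid Shift JIS byte sequence."""
    try:
        chunk.decode("shift_jis")
        return True
    except UnicodeDecodeError:
        return False


split_on_boundary = can_decode(data[:1])  # stops right after "A"
split_in_middle = can_decode(data[:2])    # stops halfway through "あ"
print(split_on_boundary, split_in_middle)
```

Stepping through such a string one byte at a time can land you in the middle of a character, which is exactly why a plain str++ was not safe.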

All this craziness caused nightmares for system developers. For example, MS-DOS had to support every single flavour of ASCII since Microsoft wanted its software to sell in other countries. They came up with a concept called “Code Pages”. For example, you had to tell DOS that you wished to use the Bulgarian Code Page to display Bulgarian letters, using the “chcp” command. A Code Page change was applied system wide. This posed a problem for people working in multiple languages (e.g. English and Turkish) as they had to constantly change back and forth between code pages.

While Code Pages were a good idea, they weren’t a clean approach. They were rather a hack or “quick” fix to make things work.

Enter Unicode

Eventually, Americans realized that they needed to come up with a standard scheme to represent all characters in all languages of the world, to alleviate some of the pain software developers were feeling and to prevent a Third World War over Character Encodings. And out of this need, Unicode was born.

The idea behind Unicode is very simple, yet widely misunderstood. Unicode is like a phone book: a mapping between characters and numbers. Joel called them magic numbers since they may seem to be assigned at random and without explanation. The official term is code points and they are always written with a U+ prefix. Every single letter of every single language is (theoretically) assigned a “magic number” by the Unicode Consortium. For example, the Hebrew letter Aleph, א, is U+05D0, while the letter A is U+0041.

Unicode doesn’t say how characters are represented in bytes. Not at all. It just assigns magic numbers to characters. Nothing else.

Other common myths include: Unicode can only support 65,536 characters. Or that all Unicode characters must fit in 2 bytes. Whoever told you that must immediately get a brain transplant!

Remember, Unicode is just a standard way to map characters to magic numbers. The numbers run from U+0000 to U+10FFFF, which leaves room for more than a million characters, far beyond 65,536. And no, Unicode characters don’t have to fit in 2, 3, 4 or any particular number of bytes; that is up to the encoding.
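You can look up these magic numbers yourself. The Python sketch below prints the code point and official name of a few characters using the standard `unicodedata` module; the helper name `code_point` and the choice of sample characters are mine.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "א", "€", "😀"]:
    print(code_point(ch), unicodedata.name(ch))

# The emoji's code point does not fit in 16 bits, so much for the 65,536 myth.
beyond_16_bits = ord("😀") > 0xFFFF
print(beyond_16_bits)
```

Notice that nothing here mentions bytes. A code point is just a number; how it ends up in memory or on disk is a separate question.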

How Unicode characters are “encoded” as bytes in memory is a separate topic. One that is very well defined by “Unicode Transformation Formats” or UTFs.

Unicode Encodings

Two of the most popular Unicode encodings remain UTF-8 and UTF-16. Let’s look at them in detail.

UTF-8

UTF-8 was an amazing concept: it single-handedly and brilliantly handled backward compatibility with ASCII, making sure that Unicode would be adopted by the masses. Whoever came up with it must at least receive the Nobel Peace Prize.

In UTF-8, every character from 0-127 is represented by 1 byte, using the same encoding as US-ASCII. This means that an ASCII document written in the 1980s can be opened as UTF-8 without any problem. Only characters from 128 and above are represented using 2, 3, or 4 bytes. For this reason, UTF-8 is called a variable width encoding.

Going back to our example at the beginning of this post, the attachment from your long lost high school friend had the following byte stream:

0100100001000101010011000100110001001111

The byte stream in both ASCII and UTF-8 displays the same characters: HELLO.
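Both properties are easy to check in Python. In the sketch below, the sample characters are my own picks, chosen so that each one needs a different number of bytes.

```python
# Variable width: the number of bytes grows with the code point.
widths = {}
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII compatibility: the attachment's five bytes decode the same either way.
attachment = bytes([0x48, 0x45, 0x4C, 0x4C, 0x4F])
same_text = attachment.decode("ascii") == attachment.decode("utf-8")
print(same_text)
```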

UTF-16

Another popular variable width encoding for Unicode characters: It uses either 2 bytes or 4 bytes to store characters. However, people are now slowly realizing that UTF-16 may be wasteful and not such a good idea. But that’s another topic.
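Here is a quick Python sketch of the “2 bytes or 4 bytes” behaviour. The helper `utf16_units` is made up for this post; it encodes a character as big-endian UTF-16 and splits the result into 16-bit code units.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for the given text."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ["A", "😀"]:
    units = utf16_units(ch)
    print(ch, len(units) * 2, "bytes:", [f"{u:04X}" for u in units])
```

The letter A fits in a single 16-bit unit, while the emoji needs two units (a so-called surrogate pair), which is why UTF-16 is variable width too.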

Little Endian or Big Endian

Endian is pronounced “End-ian” or “Indian”. The term traces its origin to Gulliver’s Travels.

Little or Big Endian is just a convention for storing and reading groups of bytes (called words) from memory. This means that when you give your computer the letter A to store in memory in UTF-16 as two bytes, it decides, based on the endianness scheme it uses, whether to place the first byte ahead of the second byte or the other way around. Ah, this is getting confusing. Let’s look at an example. Say you want to save the attachment you downloaded from your long lost friend using UTF-16. You could end up with the following bytes, depending on the computer system you are on:

00 48  00 45  00 4C  00 4C  00 4F (big end, the high order byte is stored first, hence Big Endian)

OR,

48 00  45 00  4C 00  4C 00  4F 00 (little end, the low order byte is stored first, hence Little Endian)

Endianness is just a matter of preference by microprocessor architecture designers. For example, Intel uses Little Endian, while Motorola uses Big Endian.
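If you want to see both byte orders without hunting down two different machines, Python lets you ask for either one explicitly. A minimal sketch:

```python
import sys

text = "HELLO"
big = text.encode("utf-16-be")     # high order byte first
little = text.encode("utf-16-le")  # low order byte first

print(big.hex(" "))
print(little.hex(" "))

# The byte order of the machine running this script: "little" or "big".
print(sys.byteorder)
```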

Byte Order Mark

If you regularly transfer documents between Little and Big Endian systems and wish to specify endianness, there is a weird convention known as the Byte Order Mark, or BOM, for that. A BOM is a cleverly designed character which is placed at the beginning of the document to inform the reader about the endianness of the encoded text. In UTF-16, this is achieved by placing the character U+FEFF at the very start of the document. Depending on the endianness the document was written with, the first two bytes appear as either FE FF (Big Endian) or FF FE (Little Endian), giving the parser an immediate hint of the endianness.
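A parser can sniff the BOM with a couple of comparisons. The sketch below uses the BOM constants from Python’s `codecs` module; the function name `detect_utf16_byte_order` is hypothetical, and it assumes the data is UTF-16 in the first place.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return "unknown"  # no BOM, so the reader has to be told or has to guess


sample = codecs.BOM_UTF16_LE + "HELLO".encode("utf-16-le")
order = detect_utf16_byte_order(sample)
print(order)

# Python's generic "utf-16" codec reads the BOM itself and strips it.
decoded = sample.decode("utf-16")
print(decoded)
```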

BOM, while useful, isn’t neat since people have been using a similar concept called “Magic Byte” to indicate the File Type for ages. The relation between BOM and Magic Byte isn’t well defined and may confuse some parsers.

Alright, that is all folks. Congratulations on making it this far: You must be an endurance reader.

Remember the bit about there being no such thing as “plain text”, introduced at the beginning of this post, that left you wondering how your text editor or web browser displays the correct text every time? The answer is that the software deceives you, and that is why a lot of people don’t know about encodings: when the software cannot detect the encoding, it guesses. Most of the time, it guesses UTF-8, of which ASCII is a proper subset, or a common legacy encoding such as ISO-8859-1. Since the Latin alphabet used in English is encoded the same way in almost all encodings, including UTF-8, you still see your English characters displayed correctly even if the encoding guess is wrong.

But, every now and then, you may see the � symbol while surfing the web… a clear sign that the encoding is not what your browser thought it was. Time to click on the View->Encoding menu option of your web browser and start experimenting with encodings.
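You can reproduce a wrong guess in Python. In this sketch, a word is saved as ISO-8859-1 and then read back as if it were UTF-8; the sample word is my own.

```python
original = "café"
latin1_bytes = original.encode("iso-8859-1")  # "é" becomes the single byte E9

# Wrong guess: E9 on its own is not valid UTF-8, so it gets replaced.
wrong_guess = latin1_bytes.decode("utf-8", errors="replace")
# Right guess: the text comes back intact.
right_guess = latin1_bytes.decode("iso-8859-1")

print(wrong_guess)
print(right_guess)
```

The plain English letters survive the wrong guess; only the accented one is lost, which is why the problem often goes unnoticed.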

Summary

If you didn’t have time to read the entire document or you skimmed through it, that is okay. But make sure that you understand the following points at all costs; otherwise, you will miss out on some of the finest pleasures this life has to offer.

  • There is no such thing as plain text. You must know the encoding of every String you want to read.
  • Unicode is simply a standard way of mapping characters to numbers. The Brave Unicode people deal with all the politics behind including new characters and assigning numbers.
  • Unicode does NOT say how characters are represented as bytes. This is dictated by Encodings and specified by Unicode Transformation Formats (UTF’s).

And, most importantly,

  • Always, and I mean always, indicate the encoding of your document, either with the Content-Type header or the meta charset tag. By doing this, you stop web browsers from guessing and tell them exactly which encoding to use to render the page.
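On the receiving end, honouring that declaration can look like the Python sketch below. It leans on the standard library’s `email.message.Message` class to parse the header value; the function name `charset_from_content_type` is made up, and real HTTP clients usually do this for you.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or the default.
    return msg.get_content_charset(default)


declared = charset_from_content_type("text/html; charset=UTF-8")
missing = charset_from_content_type("text/html")
print(declared, missing)
```

When the header carries no charset, the reader is back to guessing, which is exactly the situation the last bullet tells you to avoid.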

The inspiration and ideas for this article came from the best article on Unicode, by Joel.