On the Digitalisation of Indian Languages

 

This is a writeup on the digitalisation of Indian languages - a development which allows millions to use intelligent devices in their native language!

(This one is close to my heart, because it involves pioneering work done at my alma mater @IITKanpur, and my first employer CMC Ltd).



Warning: This is a long Post. 

It starts with a bit of a deep dive into the common features of Indian scripts, identifies the issues for digitalisation, and then describes pioneering solutions and the pioneers behind them.

I have focused more on the people and work I was privileged to see first-hand. But there were many other significant contributors. Collective work has brought us where we are today.

Let's roll!

When you type Indian languages into your phone today, you probably do so on an onscreen "soft" keyboard. It could be a Roman keyboard (the system transliterates to the Indian script as you type), or an Indian language keyboard.

Let us take a Devanagari keyboard example.




Say you want to type "सात्विक" (Satvik). You probably enter it as:

  • Sa (the consonant स modified by the "matra" or vowel sound ा),
  • Tvi (half letter त्, followed by व, and the resulting composite character त्व modified by the "matra" ि),
  • K (क)


Notice something? You entered the word syllable-by-syllable. Or - to use the name for it in Indian languages - akshara (अक्षर)-by-akshara.

(अक्षर : From Sanskrit अ + क्षर - that which cannot be destroyed; imperishable. In this context - the unit of sound).

Note:

"Akshara" does not mean letter. That's varna (वर्ण), and the alphabet set is "varnamala" (वर्णमाला). "Akshara" has to do with sound. It means a syllable. Or, in phonetics - a "phoneme".

The "aksharas" in Satvik (सात्विक) are sa (सा), tvi (त्वि), k ().

In fact when native speakers of Indian languages first learn to write, they are taught to do so phonetically. But on manual typewriters till a few years ago, typing was not done phonetically. It was done by visual order - in the order in which characters appear in printed form.

On such a typewriter typing त्वि meant that matra ि would be typed before त्व.

And typing अर्ध would mean typing अ, ध, backspacing half a character to the centre of ध, and adding the half र् above it. Typing would not be in the phonetic sequence, where half र् comes before ध.

All nine Indic scripts (Devanagari, Bengali, Oriya, Gujarati, Gurumukhi, Tamil, Telugu, Kannada, Malayalam), which originated from the Brahmi script over 2,500 years ago, are phonetic.

The way a word is written in these scripts specifies how it is to be spoken.

(Image: Wikipedia)




All nine scripts represent an identical set of basic sounds, though with different sets of symbols (characters).

So if typing/storage were phonetic, then it should be possible to convert the original text into any script simply by changing symbols to that of the new script.

Sounds interesting, doesn't it?

In fact not only are all nine Indic scripts phonetic, they also follow several other common rules.

This commonality can be exploited to come up with a uniform approach common to all Indic scripts.
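This is not hypothetical: Unicode later inherited exactly this parallel layout (via ISCII, described further below), so the Indic blocks run in step and, for the common letters, a fixed code-point offset transliterates one script into another. A sketch for Devanagari to Bengali (common letters only; each script's special characters need real handling):

```python
# The Devanagari block starts at U+0900 and the Bengali block at U+0980,
# with letters laid out in parallel, so a fixed offset maps the shared
# sounds. This is a sketch, not a complete transliterator.
DEVANAGARI_START, BENGALI_START = 0x0900, 0x0980
OFFSET = BENGALI_START - DEVANAGARI_START  # 0x80

def devanagari_to_bengali(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:      # inside the Devanagari block
            out.append(chr(cp + OFFSET))
        else:
            out.append(ch)              # leave other characters as-is
    return "".join(out)

print(devanagari_to_bengali("कमल"))  # কমল
```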


The man who laid the foundation for digitalising Indian languages, R Mahesh Kumar Sinha, referred to the “common script grammar" of all Indic scripts, and used it as the basis for formulating fundamental concepts on which today's edifice is built.




An alumnus of Allahabad Univ (M.Sc Tech), IIT Kharagpur (M Tech) and IIT Kanpur (PhD), Sinha's tenure as Faculty at IIT Kanpur (starting 1975) saw many developments in the area. Several of his associates would later go on to other organisations and build on this work.

Sinha became interested in Optical Character Recognition (OCR) for Devanagari as a PhD student in 1969, following a talk with his professor H N Mahabala, in which Mahabala described an OCR-based system for the blind he had seen in the US.

Multiple conversations between Sinha and Putcha Narasimham, a Telugu student pursuing his Masters at IIT Kanpur, led to a realisation of the features that unite all Indic scripts. Narasimham was specially interested in the problem of keyboarding of Indian scripts.

One thing led to another.

Driven by the belief that digital inclusion was necessary for India, and that for this digitalisation of Indian scripts was a must, an exercise was set in motion, which, largely, is responsible for where we are today.

Let us run through some more features of Indian scripts. This will help us understand the challenges faced and solved by these pioneers.

In this description we will continue to use Devanagari as representative of all nine Indic scripts.

All Indic scripts have, broadly, the same vowels (“swara”) and consonants (“vyanjan”).

“Swaras” are the sounds, in pronouncing which the breath passes out without being obstructed by any part of the mouth.


In Devanagari, the “Swaras” (vowels) are:

अ आ इ ई उ ऊ ऋ ए ऐ ओ औ


To make a "Vyanjan" (consonant) sound, on the other hand, some part of the mouth has to be placed in the way of the breath.

  • क ख ग घ ङ
  • च छ ज झ ञ
  • ट ठ ड ढ ण
  • त थ द ध न
  • प फ ब भ म

  • य र ल व
  • श ष स ह

The grouping above into "vargas" is done based on how the sound corresponding to the first four letters of each group is produced.

The first four letters of "कवर्ग" are articulated using the back of the tongue and roof of the mouth ("velum"). This group is the "velar" group.

  • "चवर्ग" is the "palatal" group-the tongue touches the palate in pronouncing the first four sounds.
  • "टवर्ग" is "retroflex" - the tongue first curls back, and is then flexed to utter the sound.
  • "तवर्ग" is "dental" (tongue touches teeth)
  • "पवर्ग" is "labial" (the lips touch).

Also notice that the 2nd and 4th letters in each group are "aspirated" versions (pronounced with an explicit forcing out of breath) of the 1st and 3rd letters, respectively.

(e.g., ख is क with the breath forced out. Same with घ and ग; छ and च; etc.)


What about the 5th letter in each group?

This is the "anunasik". It is used when two consonants are combined with a nasalisation sound inserted between them. The rule is that the "anunasik" used is that of the group (वर्ग) to which the second consonant belongs.

E.g., when द of तवर्ग is to be combined with ड of टवर्ग with nasalisation, "ण्", the "anunasik" symbol of टवर्ग, is used, forming "दण्ड".

(Today however, "anunasik" is mostly replaced with "anuswar" - a dot above the first consonant, as in पंख, तंग, चिंता).

The "chandrabindu" is used to denote nasalisation without explicit "n" sound - धुआँ, गाँव, साँप.

“Visarg” (the 2 dots in अतः) indicates the aspirated “h”.

A "nukta" (dot below the letter) is used for some sounds found in Urdu. e.g., the nukta in बाज़ार.


A standalone consonant is always pronounced with an inherent "अ" sound at the end. The corresponding "pure consonant sound" (without the trailing "अ") is denoted by using a "halant" – e.g., क्. Or, alternatively, a special symbol is used, such as the "half क" symbol in क्ल.

When a pure consonant sound is combined with any vowel other than "अ", the vowel is indicated not by its full symbol but by a "matra". No "matra" is needed for "अ" because every standalone consonant is anyway inherently assumed to end with "अ".

However the position of the "matra" relative to the consonant - whether it appears above, below, before or after the consonant - varies with the "matra" and the language.
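In a phonetic encoding these positional rules live in the renderer, not in storage: the matra is always stored after its consonant, even a ि that is drawn before it. A small illustration using Unicode code points (my example, not from the original work):

```python
# Phonetic storage: matra and halant always FOLLOW the consonant they
# modify, regardless of where they are drawn. The rendering engine
# reorders and positions the glyphs.
KA, VIRAMA = "\u0915", "\u094D"       # क, ् (halant)
SIGN_AA, SIGN_I = "\u093E", "\u093F"  # ा, ि

ka_aa  = KA + SIGN_AA   # का : k + aa-matra, drawn to the right
ka_i   = KA + SIGN_I    # कि : i-matra drawn to the LEFT, stored after
pure_k = KA + VIRAMA    # क् : halant strips the inherent अ

print(ka_aa, ka_i, pure_k)
```

Even though कि displays the matra first, the stored sequence is consonant-then-matra: phonetic order, not visual order.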


I previously mentioned the concept of "akshara". This is very important. Let's revisit it.

Formally, an "akshara" is described as a group of one or more pure consonants ending with a vowel sound. It is important to understand that the "akshara" ends with that vowel.

Here are some more examples:

  • In समय (samay), each of the consonants स, म and य is an "akshara" (since each ends in the inherent vowel sound अ). Each of them also happens to be a letter ("varna") as well. 
  • However “aksharas” need not always be “varnas”. In प्रार्थना, the “aksharas” are प्रा, र्थ and ना. प्रा combines 2 consonants - प् and र् - with the vowel sound आ. र्थ has 2 consonants - र् and थ् - and the inherent vowel sound अ. ना has a single consonant - न् - and the vowel sound आ.

"Aksharas" like प्रा and र्थ - which have a combination of multiple consonants ending in a vowel - are called "samyuktaksharas" or "conjoints". 

The visual form ("glyph") of प्र is different from the glyphs for प् and र्. This illustrates the general point that the visual form of a "samyuktakshara" is different from that of its individual letters. Also, while entering an “akshara”, its form may change as its constituents are added (till the vowel ends it)!

So a "samyuktakshara" cannot be rendered (displayed) unless all its constituents are known.

Let's see more examples.

  • In वरदान, the letters र and द appear in their standalone forms.
  • Now see र्द (also formed from र and द) in सर्द, वर्दी, पर्दा.
  • Finally, see र्द्ध in अर्द्ध.

It is clear that the visual form of each "samyuktakshara" is known only after the final vowel is entered.

The way in which the form of the "samyuktakshara" changes as its constituent letters are added, is language dependent.

For example, in Devanagari, the “samyuktakshara” क्त in युक्त is formed by the glyph for the “half क" symbol combined with “त".

In Telugu the rule is slightly different. The first consonant “k” (క) appears in toto, while the second, soft “t” (త) is appended below it as a "half letter": క్త. So యుక్త.

Which letter becomes half, and where the other letter is placed - depend on the script.

Some “samyuktaksharas” have been given their own unique symbols. e.g., क्ष (क् + ष), त्र (त् + र), ज्ञ (ज् + ञ), श्र (श् + र). This is because these sounds are used often.


But enough background. On to the problem formulation now, and its solution.

The goals of digitalisation were defined early on: the system should have a common approach adaptable to all scripts, and should be suitable for all applications (with only minor tweaks if necessary).

This leads to the following basic questions:

  1. How should a keyboard for entering Indian scripts work? (This problem specially interested Putcha Narasimham).
  2. How should Indian scripts be "encoded" digitally for processing via intelligent devices?


1. Keyboarding:

Typists on manual typewriters were familiar with the visual order scheme. The design team considered it, but it was clear that visual order would not allow a common scheme across scripts.

A phonetic scheme would fit better with the "akshara" concept that all Indic scripts follow.

  • But what keys should be there on the keyboard? 
  • Should there be keys for both "full" and "half" consonants? 
  • Should there be separate keys for "matras"? 
  • Could one switch from Indian scripts to Roman? How? 
  • What about numerals? 
  • What should be the keyboard layout? Should it be possible to use the standard QWERTY keyboard?

And many more such questions.

RMK Sinha suggested that "matras" could be entered as consonant + full vowel (द + आ for दा). But since full vowels occur much less frequently than "matras", it was finally decided that keys should be provided for the "matras" (a full vowel being entered as अ + the corresponding "matra") and not for the full vowels.

Similarly it was decided not to have separate keys for "half" consonants, but to enter them using a "halant" key. For numerals it was decided to retain Roman symbols.

With this it became possible to fit all symbols on the standard QWERTY keyboard, using an overlay.

The SHIFT key was used to access lesser-used symbols such as full vowels and the aspirated consonants. Keys for switching between scripts (including Roman) were provided. This allowed mixed text.

This keyboard was named the "INSCRIPT" keyboard.
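The entry model can be sketched as a keystroke map. The key assignments below are hypothetical, chosen for readability - they are not the actual INSCRIPT layout - but the mechanism is the one described above: consonant keys emit full consonants, a halant key turns the preceding consonant into its half form, and matra keys emit vowel signs.

```python
# Toy phonetic keyboard in the INSCRIPT spirit. Key assignments here are
# made up for illustration; only the halant/matra mechanism is the point.
KEYMAP = {
    "s": "\u0938",  # स
    "t": "\u0924",  # त
    "v": "\u0935",  # व
    "k": "\u0915",  # क
    "d": "\u094D",  # ् halant: makes the preceding consonant "half"
    "A": "\u093E",  # ा aa-matra
    "i": "\u093F",  # ि i-matra
}

def type_keys(keys):
    # Each keystroke appends one code; phonetic order does the rest,
    # and the renderer produces the composed akshara shapes.
    return "".join(KEYMAP[k] for k in keys)

print(type_keys("sAtdvik"))  # सात्विक
```

Seven keystrokes, in speaking order, yield सात्विक - no visual-order backspacing needed.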

(https://bit.ly/3QSOmFG)



2. Indian language character encoding

The next issue was how Indian language characters would be encoded. Roman script has only 26 letters (52, counting upper and lower case), digits 0-9 and punctuation marks which need to be assigned codes. Add a few codes for system tasks - about 80-90 symbols are needed in all. Indian scripts have up to 55-60 standalone vowels and consonants. Then there are "matras", "pure consonants" represented with halants or special symbols, hundreds of "samyuktaksharas", digits, punctuation marks. Should each have a unique code?

The computing world had already developed codes to represent the Roman alphabet, numbers and punctuation marks digitally. The widely accepted code for this purpose was ASCII – American Standard Code for Information Interchange.

ASCII uses 7 bits to represent characters, giving 2^7 = 128 possible unique binary codes. The first standard code for Indian scripts also used 7 bits. Announced in 1982, ISSCII-7 assigned half its code space to control characters, numerals and punctuation marks.



The remaining 64 codes were assigned to full consonants and "matras". Half consonants were obtained by adding halant to the corresponding full consonant. Full vowels were obtained by adding the symbol "अ" to the corresponding "matra".

In 1983, 8-bit ISSCII was announced. Now 2^8 or 256 codes were available. The first 128 were retained for the ASCII codes for Roman characters, numerals and punctuation marks. Codes for Indian scripts occupied the other 128 slots.


It now became possible to accommodate all the full consonants, full vowels and the "matras". A "link" symbol was defined to denote the "halant".
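Splitting the code space at the high bit meant software could tell ASCII and Indic bytes apart with a single comparison. A sketch of separating mixed text into runs (the specific Indic byte values used in testing are arbitrary, not actual ISSCII assignments):

```python
# 8-bit ISSCII-style mixed text: bytes 0x00-0x7F are ASCII,
# bytes 0x80-0xFF are Indic codes. Split a byte string into
# alternating runs of each kind.
def split_runs(data: bytes):
    runs = []
    for b in data:
        kind = "indic" if b >= 0x80 else "ascii"
        if runs and runs[-1][0] == kind:
            # Extend the current run with this byte.
            runs[-1] = (kind, runs[-1][1] + bytes([b]))
        else:
            # Start a new run of the other kind.
            runs.append((kind, bytes([b])))
    return runs
```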

A key was defined to specify script change - so transliteration between scripts was easy.

Several products started to take shape based on these standards.

Putcha Narasimham moved to CMC Ltd Secunderabad, where he initiated development of an Indian language word processor named "Lipi", aimed at providing processing and printing in Indian languages.

The "Lipi" team overcame several challenges. No readymade hardware existed. This was much before the IBM PC was launched. A group led by G L Narasimham literally designed a computer with an operating system and built word processor software over it.

Basic fonts had to be created for Indian scripts, whose characters have beautiful curvatures and strokes. These needed to be captured and rendered at various font sizes. Systems to enable artists to create fonts in Indian languages did not exist back then. Sagar Anisingaraju led a project to develop software for converting font artworks into cubic spline equations so that shapes could be retained at various font sizes. The splines were converted to Bezier curves and PostScript to overcome processing power constraints.

"Kerning" (spacing between characters) was a big challenge. To preserve the scripts' beauty "matras" and half forms must meet exactly at the top and bottom of each "grapheme". Kerning was terrible in Indian script typewriters. Sagar and Sanjeev Chadda nailed it.



Each grapheme would have a "Tip & Toe" or "Cap & Shoe" ("TiToCaSh" - coined by G L Narasimham). Software would ensure alignment. Hats off to them and the entire Lipi team which also included G Murali Dhar, CS Moghe, PJ Narayanan, Kannan and others!


The printing technology developed for "Lipi" was later reused for applications such as railway reservation (the software for which was also developed by @CMCLtd) for printing charts in Indian languages.

Meanwhile Mohan Tambe, also from @IITKanpur, moved to @cdacindia, and led the design of System-on-Chip solutions for Indian languages, leading to GIST (Graphical Indian Script Technology) based solutions for Indian language printing and other applications.

CDAC India also developed codes for Persian derived scripts like Urdu, and Brahmi derived scripts like Thai, Sinhalese, Bhutanese and Tibetan. At BITS, Pilani, Pravin Dhyani and Aditya Mathur developed a multilingual computer which DCM Data Products later manufactured. Kalyana Krishna at IIT Madras developed software for character generation. Later the Institute also worked on computing with Indian scripts, developing a font editor, and text-to-speech and Braille systems.

Prof RMK Sinha's band from IIT Kanpur thus brought many practical projects to fruition, helping move India closer towards the original goal of digital inclusion for all.

In 1988 the Unicode consortium adopted ISSCII-8 as the base for Indian scripts in its emerging 16-bit Universal Code. It was no longer necessary to conserve codes. With 16 bits at its disposal, Unicode had enough space (2^16) to code every conceivable letter form.

This began the move away from ease of transliteration to ease of composition, which is the hallmark of today's soft keyboards in smartphones and other devices. Half letters, conjoints, even emojis, could now have their own codes.
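With Unicode, every constituent of a conjunct carries its own code point, and each can be inspected by its official name. For example, क्ष stored phonetically decomposes into KA + VIRAMA + SSA:

```python
import unicodedata

# List the code points and official Unicode names of the
# constituents of the conjunct क्ष (ksha).
for ch in "क्ष":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0915 DEVANAGARI LETTER KA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0937 DEVANAGARI LETTER SSA
```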

I hope you enjoyed this journey into the history of digitalisation of Indian scripts.

Thanks for dropping by.

Cheerio!  


Ref: A Journey from Indian Scripts Processing to Indian Language Processing by R. Mahesh K. Sinha https://bit.ly/3QSOmFG







