
If you know what to do, Microsoft Azure Machine Learning makes it really easy to create a model, train it, and deploy it to production. Prior to the official preview, the following video tutorials were posted on MSDN Channel 9; below is a compilation of them all in one place.


File operations in HDFS using Java

by Mahananda Badaik | Mar 21, 2014

I am using HDP for Windows (single node) and Eclipse as the development environment. Below are a few samples showing how to read from and write to HDFS.

  • Create a new Java Project in Eclipse.
  • In Java Settings, go to Libraries and add External JARs. Browse to the Hadoop installation folder and add the JAR file below.

  • Go into the lib folder and add the JAR files below.


Quick notes on YARN (Hadoop 2.0)

by Mahananda Badaik | Mar 11, 2014

Problems we had before YARN:

  • The JobTracker is solely responsible for handling both resources and task progress.
  • Scalability limitation: maximum cluster size is 4,000 nodes.
  • Maximum number of concurrent tasks is 40,000.
  • On failure of one job execution, the complete job queue is killed and the user needs to resubmit all the jobs.
  • Restarting is complex.

I am working on a BI system for a social care project in a local government authority, where I need to achieve cell-based security in an SSAS Tabular Model.


  • One semantic model needs to be published for all the reporting/analytics needs of the project.
  • No security is required for measures, so everyone who has access to the cube can see all the actual figures.
  • There are end users (report consumers), and there is a separate reporting team responsible for building/publishing ad-hoc reports as the business needs. For some very sensitive records, only a few in the reporting team have view permission.

I recently defined the deployment process for an SSAS tabular model for one of the projects I am working on. Here it is.

Deployment Procedure

As the development team won't have any access to other environments like Test or PROD, all the deployables will be handed over to the DBA team, who can then perform the deployment using the procedure described below.


Installing Stinger Technical Preview in HDP 2.0 Sandbox

by Mahananda Badaik | Jan 17, 2014

Yesterday I tried to install Stinger on the Hortonworks HDP 2.0 Sandbox. Below are the steps I followed. I used Sandbox 2 for Hyper-V.

Installing Stinger phase 3 preview

Import the Sandbox 2 VM and make sure that it can access the internet.

Start the VM and log into it using the Alt+F5 keys. Download the Stinger Quickstart Bundle using wget. Remember that the URL is case sensitive.


A Revolution That Will Transform How We Live, Work and Think

Recently I read the book Big Data: A Revolution That Will Transform How We Live, Work and Think and found it really informative. I would recommend it to anyone curious about big data and its impact. The book does not assume a technology background.

Below are some of the insights that the book provides:

Big data is one of the consequences of a change taking place now; the authors describe it as datafication - a concept that refers to taking information about all things under the sun - including things we never used to think of as information at all, such as a person's location, the vibration of an engine, or the stress on a bridge - and transforming it into a data format so it can be quantified. This allows us to use information in new ways, such as predictive analysis: detecting that an engine is prone to breakdown based on the heat or vibration it produces. As a result, we can unlock the implicit, latent value of the information.


Extracting Keywords Using Map/Reduce

by Mahananda Badaik | Jan 05, 2014

In my last blog post I discussed using Map/Reduce to find co-authors in PubMed data on HDP for Windows. In this post I will explain how to extract keywords from PubMed abstracts. I am going to use the API provided by BjutCS on CodeProject, which extracts keywords based on entropy difference. For more details you can check the article on CodeProject.
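The actual extraction here is done by BjutCS's API, so I won't reproduce its algorithm. But to give a feel for the general idea behind entropy-based scoring - not the library's exact method - here is a minimal Python sketch I put together: it scores each word by how much removing it shifts the Shannon entropy of the word-frequency distribution, weighted by the word's frequency.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a word-frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def keyword_scores(text):
    """Score each word by how much its removal shifts the distribution's
    entropy, weighted by the word's own frequency."""
    counts = Counter(text.lower().split())
    base = entropy(counts)
    scores = {}
    for word, count in counts.items():
        rest = {w: c for w, c in counts.items() if w != word}
        shift = abs(base - entropy(rest)) if rest else base
        scores[word] = count * shift
    return scores

scores = keyword_scores("hadoop hadoop mapreduce pubmed hadoop keywords")
top = max(scores, key=scores.get)  # "hadoop" - frequent and distribution-shaping
```

This is only a toy heuristic for intuition; the CodeProject article describes the real entropy-difference technique in detail.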


After a little research on creating Power View reports in SharePoint, I am going to share my experience and list the prerequisites you must configure on your system before creating Power View reports.

I had the following configuration on the server side:


Finding Co-Authors using Map/Reduce

by Mahananda Badaik | Dec 16, 2013

I was trying to write a map/reduce job for Hadoop using Visual Studio 2012 in an HDP for Windows environment. Looking for a suitable practical scenario, I got hold of some PubMed data and decided to find, for each individual author, their co-authors and the number of PubMed articles they published together. Something like below:

PubMedArticle 1              Authors {"A, B, X"}

PubMedArticle 2              Authors {"B, X, Y"}

PubMedArticle 3              Authors {"A, K"}

PubMedArticle 4              Authors {"M"}
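The post itself uses C# in Visual Studio 2012; purely to show the map/reduce logic on the sample data above, here is a small Python sketch. The map step emits a ((author, co-author), 1) pair for every ordered author pair in an article; the reduce step sums the counts per pair.

```python
from collections import Counter
from itertools import permutations

def map_coauthors(article_authors):
    """Map step: for each article, emit ((author, co-author), 1) for
    every ordered pair of authors on that article."""
    for authors in article_authors:
        for a, b in permutations(authors, 2):
            yield (a, b), 1

def reduce_counts(mapped):
    """Reduce step: sum the 1s emitted for each (author, co-author) key."""
    return Counter(key for key, _ in mapped)

# The four sample articles from the table above.
articles = [["A", "B", "X"], ["B", "X", "Y"], ["A", "K"], ["M"]]
coauthor_counts = reduce_counts(map_coauthors(articles))
# coauthor_counts[("B", "X")] == 2  (B and X co-authored articles 1 and 2)
# "M" has no co-authors, so M appears in no pair
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework groups the keys between them; this sketch just shows the same data flow in one process.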


Installing HDP for Windows

by Mahananda Badaik | Dec 02, 2013

Today, Hadoop has become synonymous with big data, as it has been the platform of choice for big data processing. Apache™ Hadoop® is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.

Hadoop started as an open source initiative (and it still is!) and was soon adopted and nurtured by Yahoo! to support its web applications. Then many others came forward to embrace it. According to Wikipedia, as of 2013, Hadoop adoption is widespread; for example, more than half of the Fortune 50 use Hadoop.


While Hadoop 1.0 (the current distributions) is driving the world with increasing speed, Hadoop 2.0 has already made its debut with a bigger promise of overcoming some of the limitations of Hadoop 1.0, such as scalability, cluster utilisation, agility, and data processing without Map Reduce.

Hadoop 1.0 does what it promises brilliantly. Map Reduce is the backbone of Hadoop 1.0. It is very good for batch processing but not much help for real-time and near-real-time processing. Moreover, to run on Hadoop 1.0, a job has to be (or be converted into) a Map Reduce job.


Last week I was importing a bunch of CSV files into a database using an SSIS package. I found that the CSV files were error free and as per my expectations, except that a few address fields were enclosed in double quotes and contained commas. I did not know this at first, as the CSV files were huge and there were only a few such instances. I discovered it only after I got some unexpected values in some DB columns.
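The SSIS side of the fix is typically to set the text qualifier to a double quote on the flat-file connection; to see why naive comma-splitting goes wrong in the first place, here is a small Python illustration (the sample row is made up) contrasting a plain split with a CSV-aware parser:

```python
import csv
import io

# A row where the quoted address field itself contains commas.
raw = 'John,"12 High St, Leeds, LS1 1AA",2013-11-20\n'

# Naive splitting breaks the address across three columns -
# exactly the kind of unexpected values I saw in the DB.
naive = raw.strip().split(",")
# -> ['John', '"12 High St', ' Leeds', ' LS1 1AA"', '2013-11-20']

# A CSV-aware parser honours the double quotes and keeps the field intact.
parsed = next(csv.reader(io.StringIO(raw)))
# -> ['John', '12 High St, Leeds, LS1 1AA', '2013-11-20']
```

Any proper CSV parser (or the text qualifier setting in SSIS) treats the quoted span as one field, which is what the file format intends.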


Yesterday, I was talking with my younger brother, who works as a petrophysicist. He suddenly stopped me: "Hey, wait. I am hearing a lot about this big data. What makes data 'big', and how is it different from the data we deal with?" I face this question quite often from clients as well. As I started my five-minute lecture to him on big data, I decided to compose a post with some nice collections to give a beginner a head start on big data. And here I am...

From a layman's point of view, big data is a collection of massive data sets that are too large or complex to store and/or analyse with traditional (existing) computer systems within acceptable limits - economically and within the time constraint.


Hadoop mostly deals with unstructured data, while all your structured data lives in relational databases. After you have done the necessary processing on the Hadoop cluster, you may need to bring your analysis into your data warehouse or RDBMS tables for further analysis, so that the unstructured data can complement the structured database.

As I was playing around with HDInsight (Microsoft's implementation of Apache Hadoop) on Azure, I thought it would be useful to compile a step-by-step guide to integrating data from a Hadoop cluster (HDInsight) using SQL Server Integration Services.


Often I have been asked by a client, or a friend working with a client who is busy making a business case for knowledge management/enterprise social software, for some useful pointers. Usually, I draft a mail with some quick links to help build a sound business case. So I thought it would be a good idea to compile a little blog post with those pointers...

It is easy to put a value on something tangible, but not on something abstract like organisational knowledge. When you are building a case to justify a new KM system in your organisation, it is not simple, and convincing senior management is difficult, as they cannot see an immediate value in it. Building a solid business case that justifies a significant ROI is one of the better ways to get management to give the green light.


Finally we are launching our website with a Metro-style design!

After lots of buzz, Microsoft now seems to have dropped the name "Metro", for legal or other reasons, and is calling it the 'Windows 8 style UI'/'Modern UI'. Well, that really does not matter; what is important is that it has brought a revolution! (For convenience and simplicity, let me stick to the name 'Metro' in this post.)

So Metro is the new design principle for UI. Through the bold use of colour, typography and motion, the Metro design style brings a fresh new approach to user experience. The principle holds that the content is more important than the chrome.


The "data" behind "Big Data"!

by Bikramaditya Singhal | December 12, 2011

Once there were no computers. In offices and organisations, work was done with pen and paper. The pace of data collection and the pace of data processing were almost directly proportional, and information was never an overload.

Slowly, computers were introduced that could process more and more data. This resulted in more data being created and collected, but in reality, data never grew very large. If we took the example of a person's data, we considered his name, parents' names, date of birth, height, etc. If we wanted to find a person's buying behaviour, we only considered his purchases over the last week or month.


20 years ago the Web (which many perceive as the Internet!) was born and soon became a part of our lives. It changed the way we did business to a large extent and also had quite an impact on the organisational workplace.

Many organisations caught the wave of the Web and responded quickly. It started in the consumer space as people spent more and more time on the web. Organisations tried to take advantage of the situation, and the concept of e-business emerged. As a result, many IT firms providing e-business solutions did huge business. As this continued, enterprise IT started becoming web-centric, hence the change in the organisational workplace.


This week, Microsoft finally launched Office 365: a suite of cloud-based offerings of everything a business would need - email; office tools like Word, Excel and PowerPoint (online versions); and communication and collaboration tools like Lync and SharePoint Online. The real beauty is that all these products integrate seamlessly, which makes life easier and much more productive.

When I look at competing products like Google Apps for Business, which comprises similar cloud offerings - Gmail, Google Talk, and apps including Sites, Calendar, Documents, etc. - Office 365 is way ahead of it.


What comes to mind when someone says Cloud Computing? We clearly understand the meaning of "cloud", but when we combine "computing" with it, it becomes a little fuzzy. Before we define it, let us consider a few scenarios.

You are assigned a computer, and hence its computing power. When you are not using it, that computing power sits idle. In the Cloud Computing context, all processing is done centrally at the cloud service provider's end, so resources are not idle when one user is not using them; they are being used by someone else.
