Data Science for Everyone (Part 3)

Now that we have reviewed the basic categories of data-driven decision making and discussed how data science relies on data processing, we are ready to jump into a small subset of data mining techniques that are foundational to the data science process.

Following are brief descriptions of data mining techniques:

  • Regression or Estimation: Generally, you would use regression to predict the value of a numeric variable (such as the readmission probability for a patient). This technique is quite useful when you are trying to estimate a single, reliable value for a variable.
  • Similarity matching: Often used to match an individual or group with another individual or group, given a finite set of measurable attributes. Organizations can often use this to identify customer groups or peer groups.
  • Classification: This technique is useful when you are attempting to segment or categorize a population of candidates or things. It is often used by marketers to identify the positioning and targeting of segments.
  • Clustering: Typically used to identify “natural” groups in the data. The fundamental difference from similarity matching is that similarity matching serves a specific purpose, while clustering lets the groups emerge from the data itself (a minimal code sketch of all four techniques follows this list).
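
To make these techniques concrete, here is a minimal Python sketch using scikit-learn. It is only an illustration under assumptions of my own: the toy data, column meanings, and model choices are invented for demonstration and are not a prescription for any particular project.

```python
# Illustrative sketch of the four techniques using scikit-learn.
# The toy data and model choices are assumptions for demonstration only.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

# Toy patient-like records: [age, prior_admissions, length_of_stay]
X = np.array([[65, 2, 5], [72, 4, 8], [50, 0, 2],
              [81, 5, 10], [45, 1, 3], [60, 3, 6]])
readmitted = np.array([1, 1, 0, 1, 0, 1])             # class labels
cost = np.array([12.0, 20.5, 4.2, 27.0, 5.1, 15.3])   # numeric target

# Regression / estimation: predict a numeric value for a new record.
reg = LinearRegression().fit(X, cost)
print("estimated cost:", reg.predict([[70, 2, 6]]))

# Classification: predict a category (here, a readmission probability).
clf = LogisticRegression().fit(X, readmitted)
print("readmission probability:", clf.predict_proba([[70, 2, 6]])[0, 1])

# Similarity matching: find the records most similar to a given one.
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors([[70, 2, 6]])
print("most similar records:", idx[0])

# Clustering: discover "natural" groups without using any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters)
```

The point is not the specific models but the shape of each problem: estimation and classification need a known target, similarity matching needs a reference record, and clustering needs neither.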

There are additional, less commonly used techniques such as co-occurrence grouping, profiling, and link prediction, but more on those in the next post.

Stay tuned to find out how you can implement these techniques to quickly drive data-based decisions…

Data Science for Everyone (Part 2)

Moving on to the next topic, which is mostly about data processing. It is important to understand that data processing and data science are two separate yet related disciplines, and data processing is critical to the maturation of data science.

We previously identified two separate classes of data-based decisions.

  1. “Discover” or understand data: This group requires somewhat traditional approaches to data processing. Generally speaking, data have to be sourced from a wide variety of applications and/or systems, and they arrive in a wide array of formats (though mostly structured), which makes them difficult to process. In the past, data warehouses were typically used for data discovery; now, with Big Data, a wider variety of toolsets is available for data processing.
  2. Decisions that repeat: This type of decision requires a slightly different approach to data processing. Generally, reporting, monitoring, and alerting tools are required and should be used for repeating decisions based on well-understood data, although data warehouses, data lakes, or other architectural approaches can be used as well. These decisions are also often based on data in motion (as opposed to data at rest); a minimal sketch contrasting the two approaches follows this list.
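
To illustrate the difference between the two, here is a minimal Python sketch. It is a hypothetical example (the table name, event fields, and threshold are assumptions of mine) contrasting a discovery-style query over data at rest with a simple alerting rule applied to data in motion.

```python
# Hypothetical sketch contrasting the two classes of decisions.
# Table, field names, and the threshold are illustrative assumptions.
import sqlite3
from typing import Dict, Iterable

def discover(db_path: str) -> list:
    """Discovery: an ad-hoc query over data at rest (e.g., a warehouse table)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT region, COUNT(*) FROM orders GROUP BY region"
        ).fetchall()

def alert_on_stream(events: Iterable[Dict], threshold: float) -> None:
    """Repeating decision: monitor data in motion and alert on a known rule."""
    for event in events:
        if event.get("order_value", 0.0) > threshold:
            print(f"ALERT: large order {event['order_id']}")

# Toy usage with an in-memory "stream" of events.
alert_on_stream(
    [{"order_id": 1, "order_value": 50.0},
     {"order_id": 2, "order_value": 950.0}],
    threshold=500.0,
)
```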

With this basic distinction between data processing and data science in mind, it will be interesting to explore data science approaches and what can be done to fulfill the promise of purely data-based decision making.

I will summarize the data science segments (and a few solutions) in the next post. Stay tuned….

Data Science for Everyone (Part 1)

Data science seems like a brand-new term, but it isn’t. We have always had data science, typically defined as the principles, processes, and techniques used to understand the world around us through the analysis of data.

Sometimes, data analysis does not necessarily result in decision making. So what do we need to do to become a data-driven decision-making organization? The first step is to understand what is generally involved in data science and data-driven decision making.

I would have to say that there are generally two groups of data-based decisions:

  1. “Discover” or understand data: This group is often ignored or not identified as a key element by most organizations. This probably comes from a place of hubris: “well, we know our data well!” However, with more data available than ever, the new norm is to continuously discover data.
  2. Decisions that repeat: This group is a very popular candidate when it comes to data-driven decisions. Customer churn, for example, is an age-old problem that has haunted even the best marketers.

During the past few years, we have seen tremendous improvements in technology and the natural rise of “Big Data”. So how can we make use of these advances, think analytically at a massive scale and process giant volumes of data on a daily basis?

I will summarize the data processing challenge (and a few solutions) in the next post. Stay tuned….

Dashboard Design Principles Using Jaspersoft

Jaspersoft is gaining ground rapidly, and as users get accustomed to using it on a daily basis, the problem of designing optimal dashboards and visualizations becomes urgent.

Having designed dashboards and other BI artifacts for a number of years, I have come to adopt a few simple fundamental principles that have helped me a great deal.

The five core principles are described below:

  1. Data complexity: Generally, it is important to identify the complexity of the data at the very beginning. The complexity of data usually depends on the system of record as well as the use cases attached to the data. As an example, an accountant will understand accounting data (and KPIs) far more easily than the average person. So if you are designing a dashboard for data sourced from an accounting system, it is better to “simplify” the data for general consumption based on the user groups. This leads directly to principle #2…
  2. User Expertise: You should make users’ expertise with the data the primary yardstick for your dashboard design. I have often found that, depending on end-user expertise, even a simple combo chart with two Y axes can be difficult for some users to read. The user-expertise problem is sometimes multiplied by the volume of data and the refresh frequency, which gets us to principle #3…
  3. Data Refresh: Providing timestamp context to users as you design the dashboards is fairly important. Most organizations would like to see data refreshed in real time or near real time, BUT a key consideration is to determine WHO is monitoring the data refreshes and toward what end (a small sketch of this idea follows the list).
  4. Screen Resolution: Screen size and resolution should also play a critical role in your considerations. I have seen requirements from customers where dashboards needed to be displayed on shop floors, in manufacturing plants, in retail spaces, etc. Clearly, 20-inch monitors would not work in these venues. Having access to more “real estate” makes the job of designing dashboards a little bit easier.
  5. Dashboard Delivery: Knowing the technology you have to use for dashboard delivery is also important. Some technologies make it easier to distribute dashboards on mobile devices, versus others that are more geared toward desktop delivery.
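
As a small illustration of principle #3 (not specific to Jaspersoft; the database, table, and column names are assumptions of mine), one simple way to give users timestamp context is to record a refresh timestamp with each data load and expose it as a “data as of” field the dashboard can display:

```python
# Hypothetical illustration of principle #3: record when data were last
# refreshed so a dashboard can show that context to users.
# Database, table, and column names are assumptions for this sketch.
import sqlite3
from datetime import datetime, timezone

def refresh_data(conn: sqlite3.Connection, rows: list) -> None:
    """Reload the reporting table and stamp the load time."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS refresh_log (refreshed_at TEXT)")
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.execute("INSERT INTO refresh_log VALUES (?)",
                 (datetime.now(timezone.utc).isoformat(),))
    conn.commit()

def last_refreshed(conn: sqlite3.Connection) -> str:
    """Value a dashboard can surface as 'data as of ...'."""
    row = conn.execute("SELECT MAX(refreshed_at) FROM refresh_log").fetchone()
    return row[0] or "never"

# Toy usage with an in-memory database.
conn = sqlite3.connect(":memory:")
refresh_data(conn, [("East", 120.0), ("West", 95.5)])
print("Data as of:", last_refreshed(conn))
```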

Hope this provides you with a good starting point. Do not hesitate to reach out to us at [email protected] if you have further questions.