MS_XXX

Trainee

  • "MS_XXX" started this thread

Posts: 1

Date of registration: Aug 5th 2010

1

Thursday, August 5th 2010, 7:05pm

Palo suitable for very large data sets?

Hello,

I understand that Palo uses a purely "in-memory" technique for storing the aggregated data. I wonder how this works with very large data sets which are too large to be stored in memory?

For instance let's say you have a fact table with about 100 billion rows, and you are generating an aggregate table from that which has about 1 billion rows. This is the level of aggregation you need for real-time analysis, so you cannot aggregate further without reducing your requirements.

Let's say the 1 billion rows of aggregated data take about 1 TB to store - there is no way to have that much memory, so how are you going to do it? Does Palo simply not work for these requirements, or is there some way to get around it?

Is there any other (preferably open-source) way to have real-time analysis capabilities with <1 second response time for (aggregate) data sets of 1 TB and larger?

Thanks.

  • "realquo" is male

Posts: 255

Date of registration: Mar 11th 2009

Location: Italy

Occupation: BI Consultant

2

Thursday, August 5th 2010, 7:35pm

Hi,
I don't have an answer to those questions; I'm just reporting figures from my real-world experience to feed the discussion:

- source table (facts): 1,796,330 records
- dimensions: about 20 (including, for example, a customer dimension of over 17,000 records)

From that schema, a single (for the moment) Palo cube is physically loaded. The result is a memory footprint of about 385 MB (see attached), of which about 60 MB is occupied by the engine service itself (my estimate, based on the memory use of an almost empty Palo Server instance).

Beyond this, I'm quite worried about the time required to load cubes: our loader, based on the Palo Java APIs, takes about 1h30m to load that cube (full load). Is Palo ETL performance significantly different?
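As a back-of-envelope sketch only (not a Palo benchmark), the figures reported above can be turned into a per-record memory estimate and a load rate, then scaled to the 1 billion aggregated rows from the original question. The linear scaling is an assumption made purely for illustration, as is treating 1 MB as 10^6 bytes; real in-memory OLAP usage depends heavily on sparsity and dimension structure, and source records are not the same thing as aggregated rows.

```python
# Figures reported in the post above.
FACT_RECORDS = 1_796_330
TOTAL_MB = 385            # server memory with the cube loaded
ENGINE_MB = 60            # poster's estimate for an almost empty server
LOAD_SECONDS = 90 * 60    # full load: about 1h30m

# Target size taken from the original question (aggregated rows).
TARGET_ROWS = 10**9


def bytes_per_record(total_mb, engine_mb, records):
    """Memory attributable to the cube, divided by the number of source records.

    Assumption: 1 MB = 10**6 bytes.
    """
    return (total_mb - engine_mb) * 10**6 / records


def records_per_second(records, seconds):
    """Average load throughput over the whole run."""
    return records / seconds


per_record = bytes_per_record(TOTAL_MB, ENGINE_MB, FACT_RECORDS)
rate = records_per_second(FACT_RECORDS, LOAD_SECONDS)

# Assumption: memory and load time both grow linearly with row count.
est_gb = per_record * TARGET_ROWS / 10**9
est_days = TARGET_ROWS / rate / 86_400

print(f"memory per source record: about {per_record:.0f} bytes")
print(f"load rate: about {rate:.0f} records/s")
print(f"naive estimate for {TARGET_ROWS:,} rows: {est_gb:.0f} GB, {est_days:.0f} days to load")
```

Under these assumptions the load time, not only the memory, becomes a limiting factor at that scale, which is why the loader throughput question matters.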

That said, the question posed by MS_XXX is not trivial IMO, and it would be interesting to have some data from Jedox technical people as well as from other serious users.

Kind regards,
RQ

This post has been edited 1 times, last edit by "realquo" (Aug 5th 2010, 7:37pm)


Posts: 163

Date of registration: Feb 4th 2009

Location: berlin

3

Friday, August 6th 2010, 9:30am

theory in general says:
m-olap for small datasets
r-olap for large datasets

palo: in-memory m-olap database

hope palo can answer your question.
sivgin

Withnail

Professional

Posts: 34

Date of registration: Mar 30th 2009

Location: Australia

Occupation: BI Consultant

4

Sunday, August 8th 2010, 1:29pm

IMO, with datasets around 1 TB, you have 2 options:

1. Seriously investigate the GPU technology from Jedox. All indications suggest this is the only practicable way of dealing with Big Data in an OLAP environment. The best way to contact them would be an email to the Helpdesk.

2. Throw ROLAP, MOLAP and relational databases out the window, and start looking at MapReduce technologies like Apache Hadoop (open source): http://hadoop.apache.org/. This stuff is for serious number crunching (terabytes to petabytes and beyond).

HTH,

Withnail.
Naked Data
Business Intelligence & Performance Management
Level 23 40 City Road Southbank 3006 VIC Australia
national: +61 1300 406 334
www.nakeddata.com

This post has been edited 1 times, last edit by "Withnail" (Aug 8th 2010, 1:30pm)


v_malicevic

Palo Team

  • "v_malicevic" is male

Posts: 454

Date of registration: Oct 26th 2005

Location: Germany

5

Sunday, August 8th 2010, 6:31pm

Hi MS XXX,
Option 1) is worth evaluating. Send us a note on:

http://www.jedox.com/en/about-jedox/Contact/Contact-us.html

and we will have a look at what you are trying to do.

Often, "you have a fact table with about 100 billion rows" turns out to be full of duplicates and empty values, making it far fewer than 100 billion. When you add sparsity, and on top of that do a really good evaluation of "do I really need to analyze all of those records?", you get into an area where maybe Palo can help. Let us have a look at it.
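To illustrate the point about duplicates and empty values, here is a minimal sketch in plain Python (it does not use any Palo API). The function name `compact_facts` and the sample rows are hypothetical, and it assumes that rows sharing the same coordinates should be summed into one cell; whether duplicates should be summed or simply dropped depends on the data.

```python
from collections import defaultdict


def compact_facts(rows):
    """Collapse fact rows into sparse cells.

    rows: iterable of (coordinates, value) pairs, where coordinates is a
    tuple of dimension elements and value may be None or 0 for empty rows.
    Rows with the same coordinates are summed (an assumption, see above).
    """
    cells = defaultdict(float)
    for coords, value in rows:
        if value is None or value == 0:
            continue  # empty values need no storage in a sparse cube
        cells[coords] += value
    return dict(cells)


# Hypothetical sample: 6 fact rows, 2 of them empty, 2 sharing one coordinate.
facts = [
    (("2010-08", "DE", "widget"), 10.0),
    (("2010-08", "DE", "widget"), 5.0),
    (("2010-08", "IT", "widget"), None),
    (("2010-08", "AU", "gadget"), 0),
    (("2010-08", "AU", "widget"), 7.5),
    (("2010-07", "DE", "gadget"), 2.0),
]

cells = compact_facts(facts)
print(len(facts), "fact rows ->", len(cells), "stored cells")
```

Running the same kind of count against a sample of the real fact table gives a first idea of how much smaller the data to be loaded actually is.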
Mit freundlichen Gruessen/ With kind Regards / Meilleures salutations

Vladislav Malicevic
Head of Research and Development

Jedox AG
