ENTRIES TAGGED "real-time data"

Expanding options for mining streaming data

New tools make it easier for companies to process and mine streaming data sources

Stream processing was in the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks, are behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces, and libraries, recently released tools make it easier for companies to process and mine streaming data sources.

Of the distributed stream processing systems that are part of the Hadoop ecosystem0, Storm is by far the most widely used (more on Storm below). I’ve written about Samza, a new framework from the team that developed Kafka (an extremely popular messaging system). Many companies who use Spark express interest in using Spark Streaming (many have already done so). Spark Streaming is distributed, fault-tolerant, stateful, and boosts programmer productivity (the same code used for batch processing can, with minor tweaks, be used for realtime computations). But it targets applications that are in the “second-scale latencies”. Both Spark Streaming and Samza have their share of adherents and I expect that they’ll both start gaining deployments in 2014.

Read more…

Comment: 1 |

Strata Week: Big data gets warehouse services

AWS Redshift and BitYota launch, big data's problems could shift to real time, and NYPD may be crossing a line with cellphone records.

Here are a few stories from the data space that caught my attention this week.

Amazon, BitYota launch data warehousing services

Amazon announced the beta launch of its Amazon Web Services data warehouse service Amazon Redshift this week. Paul Sawers at The Next Web reports that Amazon hopes to democratize data warehousing services, offering affordable options to make such services viable for small businesses while enticing large companies with cheaper alternatives. Depending on the service plan, customers can launch Redshift clusters scaling to more than a petabyte for less than $1,000 per terabyte per year.

So far, the service has drawn in some big players — Sawers notes that the initial private beta has more than 20 customers, including NASA/JPL, Netflix, and Flipboard.

Brian Proffitt at ReadWrite took an in-depth look at the service, noting its potential speed capabilities and the importance of its architecture. Proffitt writes that Redshift’s massively parallel processing (MPP) architecture “means that unlike Hadoop, where data just sits cheaply waiting to be batch processed, data stored in Redshift can be worked on fast — fast enough for even transactional work.”

Proffitt also notes that Redshift isn’t without its red flags, pointing out that a public cloud service not only raises issues of data security, but of the cost of data access — the bandwidth costs of transferring data back and forth. He also raises concerns that this service may play into Amazon’s typical business model of luring customers into its ecosystem bits at a time. Proffitt writes:

“If you have been keeping your data and applications local, shifting to Redshift could also mean shifting your applications to some other part of the AWS ecosystem as well, just to keep the latency times and bandwidth costs reasonable. In some ways, Redshift may be the AWS equivalent of putting the milk in the back of the grocery store.”

In related news, startup BitYota also launched a data warehousing service this week. Larry Dignan reports at ZDNet that BitYota is built on a cloud infrastructure and uses SQL technology, and that service plans will start at $1,500 per month for 500GB of data. As to competition with AWS Redshift, BitYota co-founder and CEO Dev Patel told Dignan that it’s a non-issue: “[Redshift is] not a competitor to us. Amazon is taking the traditional data warehouse and making it available. We focus on a SaaS approach where the hardware layer is abstracted away,” he said.

Read more…

Comments: 2 |