Feed aggregator

The Tools of Java Development (Part 3): CI, Testing, and the Web

Javalobby Syndicated Feed - Fri, 28-Jul-17 13:01

Welcome back! If you're just joining us in this series on the tools available for Java development, feel free to check out Part 1, which covers general tools and IDEs, and Part 2, which talks about code coverage, APM, and logging. If you're all caught up, take a look at the best tools out there for continuous integration, testing, web frameworks, app servers, and app management for your Java projects.

Continuous Integration Tools

32. Bamboo

@atlassian

Categories: Java

Escape Analysis

Javalobby Syndicated Feed - Fri, 28-Jul-17 09:01

Escape Analysis (EA) is a very important technique that the just-in-time Java compiler can use to analyze the scope of a new object and decide whether it needs to be allocated on the Java heap at all.

Many resources available on the Internet say that EA allows objects to be allocated on the method stack. While, technically, they could be allocated on the stack, they are not in the current version of Java 8; instead, eligible allocations are eliminated through scalar replacement.
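
To make this concrete, here is a minimal, hypothetical sketch (the Point class and the benchmark loop are invented for illustration, not taken from the article): because the Point instance never leaves distanceSquared(), the JIT can prove it does not escape and eliminate the allocation via scalar replacement. Comparing a run with the default -XX:+DoEscapeAnalysis against -XX:-DoEscapeAnalysis makes the effect visible.

public class EscapeAnalysisDemo {

    // Small value object used only inside distanceSquared().
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // The Point never escapes this method, so the JIT can replace the
    // allocation with two plain int locals (scalar replacement).
    static long distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return (long) p.x * p.x + (long) p.y * p.y;
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) {
            sum += distanceSquared(i, i + 1);
        }
        System.out.println(sum);
        // Compare runs:
        //   java -XX:+DoEscapeAnalysis EscapeAnalysisDemo   (default behaviour)
        //   java -XX:-DoEscapeAnalysis EscapeAnalysisDemo   (forces heap allocation)
    }
}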

Categories: Java

JAX-RS 2.1 SSE Client API Using GlassFish 5 on Docker

Javalobby Syndicated Feed - Fri, 28-Jul-17 03:01

Along with the Server API for SSE, JAX-RS 2.1 (part of Java EE 8) also has an equivalent client-side API.

Here is a quick peek – you can grab the project from GitHub.
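
For orientation, a minimal client-side sketch against the JAX-RS 2.1 SseEventSource API might look like the following; the endpoint URL and class name are hypothetical and not taken from the linked GitHub project.

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.WebTarget;
import javax.ws.rs.sse.SseEventSource;

public class SseClientSketch {
    public static void main(String[] args) throws InterruptedException {
        Client client = ClientBuilder.newClient();
        // Hypothetical SSE endpoint; replace with the resource exposed by your server.
        WebTarget target = client.target("http://localhost:8080/api/events");
        try (SseEventSource eventSource = SseEventSource.target(target).build()) {
            // Register handlers for incoming events and errors, then open the connection.
            eventSource.register(
                    event -> System.out.println("received: " + event.readData()),
                    throwable -> throwable.printStackTrace());
            eventSource.open();
            Thread.sleep(10_000); // listen for ten seconds
        }
        client.close();
    }
}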

Categories: Java

Akka Monitoring and Telemetry

Javalobby Syndicated Feed - Fri, 28-Jul-17 00:01

The behavior of actors running in an actor system can be very dynamic. In some cases, the behavior is more organic than mechanical. Some actor systems are relatively simple and lightweight. On the other end of the spectrum, other actor systems are composed of hundreds to thousands to millions of individual running actors.

The lifespan of different actors varies as well. Some actors live for a long time while others come and go in milliseconds. While some actors are relatively inactive, others may be extremely chatty, sending and receiving messages as quickly as possible. Things get even more fun and interesting when actor systems are spread across a cluster of networked nodes.

Categories: Java

Database design using Anchor Modeling

codecentric Blog - Thu, 27-Jul-17 23:22

Anchor modeling offers agile database design, immutable data storage, and enables temporal queries using a regular relational database. This catchy excerpt certainly piqued my interest two years ago at the Data Modeling Zone conference in Hamburg.

I enjoy crossing over between disciplines, and in this article I would like to discuss the concept of Anchor modeling in relation to software development. Anchor modeling has become a trendy subject in the field of business intelligence (BI). However, not everybody will be familiar with what's buzzing in BI land, so I will provide some background on Anchor modeling and then discuss its potential merits for developers.

A lot of companies now favour an iterative approach over a big up-front design approach, although it might not feel like that every day. In general I would say we are certainly improving our capability to change the things we built in the past and to adapt our plan for what we will build in the future. However, the big improvements have generally been at the source code level of our software and, in recent years, in the deployment of our software. While the refactorability of our code might have improved a great deal, I feel industry practices regarding incremental database design changes are not up to par in most development teams. This is exactly what Anchor modeling promises to address.

A word of warning: to keep this article readable, I may have stereotyped software development a little.

Where did Anchor Modeling originate? 

Anchor modeling originated in the field of data warehouses ten years ago and was formally presented in academia in 2009. The basic premise of Anchor Modeling is that it offers non-destructive model iterations while avoiding null values and eliminating redundancies. Anchor modeling is an extension of normalisation practices where the result is mostly in the 6th normal form. But don’t be alarmed or put off. A major part of this design approach is that Anchor modeling gives you the means to make this design manageable.

In recent years, these practices have found their way from academia into the business intelligence discipline in particular. I am especially interested in how I, as a developer, can learn from these advances and where I can apply them in practice.

Anchor modeling basics

Anchor modeling has four basic building blocks: anchors, attributes, knots and ties. Each of these building blocks is implemented as its own table. The central concept of any anchor model is the anchor. The four concepts are explained below.

  • An anchor holds the identity of an entity using a generated surrogate key rather than its natural key. Examples could be “Tenant” and “House”.
  • A tie defines the relation between usually two – but potentially more – entities (thus anchors). An example would be a relation from Tenant to House that defines ownership.
  • An attribute – representing some piece of information – holds the attribute value and is related to an anchor. Attributes are strongly typed. An example could be “Rent”.
  • A knot defines the limited set of possible states of a tie or attribute. Effectively it provides context, for example a fixed set of genders for an attribute, or an active (yes/no) state for a tie.

Let’s visualise this using the free online modeling tool on www.anchormodeling.com.

Anchor model (diagram)

House and Tenant are anchors here, connected using a tie. AmountOfRooms, Rent and PhoneNumber are attributes connected using knots. Exported to PostgreSQL, this simple model gives us six tables, twelve views to process this information, and 21 stored procedures. This might look complex, but the tables are our main focus, and you could easily follow Anchor modeling principles without the online tool and SQL generation. When you start out, it's just an easy quick start, and you get highly optimised views to query information out of the box.

Why Anchor modeling in online transaction processing (OLTP) software?

As a developer, your first question at this point could be: why blog about a relational database technique in 2017? Well, it may not be the most popular topic to write about in the development scene, but it is still the most popular storage technology. On the topic of agility, many NoSQL solutions offer relaxed schemas as a way of dealing with changes in requirements, which is nice of course. But NoSQL technology has its own downsides – which are really product specific (and beyond the scope of this article). I also personally think that the richness of an SQL interface to your data storage is as yet unmatched in NoSQL technology and is still highly appreciated. Moreover, initiatives like the development of the highly potent CockroachDB seem to support that idea and could potentially unleash a new wave of RDBMS adoption. So it seems that keeping some of your eggs in the relational basket and some of them in the latest-practices basket will go a long way.

As developers we often only deal with databases as 'the system of record' in systems that support a business function. The storage behind our applications, which we store into and retrieve from, is often aimed at producing information about business operations. Information is mostly stored in a state and structure that is very closely related to the software that uses it. So having a highly normalised database might be a scary thought for a developer. At university you are probably taught the basic principles of database design and implementation. If you are like me, you probably haven't encountered the same amount of academic purity in the databases you use in your working life. A common – potentially outdated – understanding is that highly normalised databases have many practical, but also performance-related, downsides on the operational front. Another problem is that, as programmers, we have to deal with the structure of information in our storage, and this tends to reflect on the representation of our models in our code. One could say that in many cases the design of the database leaks through. Of course this also happens the other way around: a customer object that consists of ten fields could end up in a database as a single table with ten columns. We are used to creating an abstraction layer in our software to protect the business side of our application from changes in the database and vice versa. But I would argue that protection is not good enough.

This is the case I am specifically interested in. I would like to discuss how Anchor modeling can aid an agile team in properly designing, building, and maintaining a relational storage layer. Secondly, my goal is to show that we can do proper design iterations in an agile team: if storage is easier to modify, people are less inclined to take a get-it-right-in-one-go approach.

Agility and temporality 

One of the defining features of anchor modeling is the capacity for non-destructive schema evolution. In other words, the characteristics of the storage can differ over time without invasive redesigns and large migrations. This is certainly desirable if you are designing and maintaining a huge data lake and have to satisfy business wishes on a weekly basis, like in the position of business intelligence analyst. So let’s see how these rules apply to business software development.

Anchor modeling has been designed with temporality in mind. Attributes and ties can both be of a temporal nature, though this is optional: ties and attributes can carry metadata timestamps that signal the lifespan of a value or relation. I will show an example later on.

Let’s enumerate the more invasive database changes that we think about day to day, given that we are used to dealing with databases that are designed foremost to remove redundancy and are often in 3rd normal form.

A new column
  • Conventional database: This changes the structure of a table. Adding a column is still moderately easy: we alter the existing table. Sometimes you can keep your application alive, but it is most likely unable to query the table for some time. Rollbacks are possible but require some effort.
  • Anchor modeling: You add an attribute table and, if required, a knot table to your database. Existing tables are left untouched, so most RDBMSs will stay up and running.

Removal of a column
  • Conventional database: This changes the structure of a table. A column removal is difficult. The table is locked during the transaction. The application needs a migration to be able to handle the new table design. Rollbacks are not possible or require a lot of preparation.
  • Anchor modeling: Removing an attribute of an entity is substantially easier. By default everything is designed to be immutable, so removing the relation between an entity and an attribute will not cause an update of the database schema; inserts into the knot table are enough to couple or decouple a relation. There is also the option to use the temporal aspect of an anchor model design.

Removal of a table
  • Conventional database: This changes the structure of a schema. A table removal is difficult. The table is locked during the transaction, and related tables often are as well. It is destructive in nature. The application needs a migration. Rollbacks are not possible or require a lot of preparation.
  • Anchor modeling: The removal of an entity is substantially easier. By default everything is designed to be immutable, so the removal of a table will not cause an update of the database schema; the relations can simply be invalidated.

In the example above the rent and telephone number are marked as temporal (indicated by a dot in the design), as well as the relation between a tenant and a house. This means that when things change, the current reality can be replaced by a new one. By leveraging the power of SQL select statements, combined with parameterized views, one can even travel in time to see what the state of our database looked like on Christmas 2015! And thanks to the dispersed character of the data, this is agnostic of both the structure and the content of your tables.

Drawbacks

So, enough with all those nice features. Let’s get some drawbacks on the table. Well, for starters there’s of course the mental effort it takes to fill your head with the number of tables used by a 6th normal form database. I think one of the biggest contributions to the enormous success of RDBMS is the degree to which the solution fits in with our mental (3rd normal form) model of the world.

It is perfectly possible (I have seen this first-hand) for people to adopt these design principles and write down properly structured tables. However, most people will need a tool for this. Luckily, most relational database design tools will work, and the more advanced ones like Vertica offer features to support this work for use cases where the landscape would otherwise be too complex to grasp.

Performance behaviour will vary. I am not a professional DBA, nor do I profile databases on a daily basis, so I will refrain from making bold performance claims. One thing to note is that Anchor modeling is built with SQL execution optimization in mind. It will use many advanced features that modern systems offer, one of the most important being table elimination. This limits the number of products you can use (the support matrix below is taken from www.anchormodeling.com).

Database engine – support:

  • Microsoft SQL Server 2005 – full
  • Microsoft SQL Server 2008 – full
  • Oracle 10gR2 Express Edition* – partial
  • Oracle 11gR1 Enterprise/Express Edition – full
  • IBM DB2 v9.5 – full
  • PostgreSQL v8.4 beta – full
  • Teradata v12** – partial
  • MySQL v5.0.70 – none
  • MySQL v6.0.10 alpha – none
  • MariaDB v5.1 – full
  • Sybase – not tested

Conclusions

So far we have gone over the concepts of Anchor modeling and seen some nice features that Anchor-modeled databases can offer. By leveraging the temporal nature of a database design, a development team is potentially better able to adapt to change. The temporal nature of a relation in this design makes schema evolution non-disruptive: you will no longer be restricted by the decisions of your past self (or a colleague). Furthermore, the concept of Anchor modeling delivers a set of rules that implicitly work towards a proper design.

A second important lesson I took from looking into this is that I, as a 30-year-old developer, have become biased by the maturity of RDBMSs. I normally work with technology that helps me abstract away the knowledge I need when working with databases – like Hibernate – and I have in general lost interest in what problems databases can solve for me. Added to this is, of course, a movement towards the hip and buzzing NoSQL solutions over the past years. In the development scene I just don't see many developers blogging, for example, about what's new in the latest PostgreSQL release, or how a new release opens up nice technological opportunities. So this new kind of modeling definitely renewed my interest.

The third big takeaway is that the world beyond 3rd normal form has come within reach and should be part of our toolset when we face problems in practice. Anchor modeling is not something that you have to adopt wholesale. It is a design approach that helps you deal with change. You could, for example, only apply it to the parts of the system that have a high rate of change, like your product entity.

In the end it's a matter of making a proper design decision, and Anchor modeling, too, is not a silver bullet.

Categories: Agile, Java, TDD & BDD

Typeclass in Scala Libraries and the Compiler (Part 2)

Javalobby Syndicated Feed - Thu, 27-Jul-17 21:01

Type classes, strictly speaking, are a library feature rather than a compiler feature; they are used not only in Scala application design, but also in library design. The compiler wires the pieces together through type-level programming and implicits, achieving polymorphism without object inheritance. Type classes are an integral part of the Scala language. In Part 1 of this series, we discussed the TypeTag and CanBuildFrom type classes. We continue this discussion in Part 2.

Generalized Type Constraint Type Classes

The generalized type constraints <:<[A, B] and =:=[A, B] are infix types. The purpose of these type classes is to instruct the compiler to perform additional type checking.

Categories: Java

XML Sitemap Generation in Java

Javalobby Syndicated Feed - Thu, 27-Jul-17 13:01

XML sitemaps are a great way to expose your site's content to search engines, especially when you do not have an internal or external linking structure built out yet. An XML sitemap, in its simplest form, is a directory of every unique URL your website contains. This gives Google and other search engines a one-stop shop for all pages they should index. XML sitemaps are restricted to 10MB or 50k links per sitemap, but this limitation can be circumvented with sitemap indexes that link to multiple sitemaps. Sitemaps can also include additional metadata, such as how frequently pages get updated or when a page was last updated. After you design a site with HTML/CSS templates, make sure you include sitemaps so the pages get indexed more quickly.

XML Sitemap With Java

The SitemapGen4J library provides a nice object model for generating all the URLs required to build out a sitemap. Most likely, you will need to write code that can generate all possible URLs for your website. An alternative is to build a generic crawler that can build a sitemap for any website. It's not too difficult to build all of the custom URLs, so we can create a method for each page type. We section them all out because we plan on making a sitemap index later.
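
As a rough sketch of that per-page-type approach – assuming SitemapGen4J's WebSitemapGenerator and WebSitemapUrl API, with a hypothetical base URL, output directory, and product slugs – the code could look roughly like this:

import com.redfin.sitemapgenerator.ChangeFreq;
import com.redfin.sitemapgenerator.WebSitemapGenerator;
import com.redfin.sitemapgenerator.WebSitemapUrl;

import java.io.File;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

public class SitemapBuilder {

    public static void main(String[] args) throws Exception {
        // Generated files are written to this (hypothetical) directory.
        WebSitemapGenerator generator =
                WebSitemapGenerator.builder("https://www.example.com", new File("target/sitemaps")).build();

        addProductPages(generator, Arrays.asList("red-widget", "blue-widget"));

        // Writes sitemap.xml (and additional files if the 50k-link limit is exceeded).
        generator.write();
    }

    // One method per page type; here: product detail pages.
    static void addProductPages(WebSitemapGenerator generator, List<String> slugs) throws Exception {
        for (String slug : slugs) {
            generator.addUrl(new WebSitemapUrl.Options("https://www.example.com/products/" + slug)
                    .lastMod(new Date())
                    .changeFreq(ChangeFreq.WEEKLY)
                    .build());
        }
    }
}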

Categories: Java

Debugging: Filtering Arrays and Collections in IntelliJ IDEA

Javalobby Syndicated Feed - Thu, 27-Jul-17 09:01

As usual, the newest version of IntelliJ IDEA contains updates to help you debug applications. Given that we are working more and more with large data sets, IntelliJ IDEA 2017.2 has added the ability to filter arrays and collections in our variables or watches.

In this example, I have a variable, allWords, that’s a list of Strings. This is a large list, so when I’m debugging my application, I may want to look for specific values, or types of values, in this list. I can do this by right-clicking on the list and selecting “filter”.

Categories: Java

Lobotomize Your OO Thinking: “Elegant Objects, Vol. 1” Book Review

Javalobby Syndicated Feed - Thu, 27-Jul-17 03:01
Step one in the transformation of a successful procedural developer into a successful object developer is a lobotomy. (by David West)

This is the first sentence in the “Elegant Objects, volume 1” book by Yegor Bugayenko, and after reading it from cover to cover, I could not agree more. This book will not leave you neutral, you will either strongly agree or disagree with claims stated there, but it is definitely worth your time. It will challenge what you know about programming, it will challenge what you think a proper object-oriented design is, and it will challenge many old, well-established so-called “good practices” you have seen during your career. Fasten your seatbelts, move your coffee mug away from your keyboard, and keep reading.

Content

“Elegant Objects, vol. 1”, in over 200 pages, gives you 23 practical tips to write more object-oriented, thus more maintainable code. The author uses a very interesting allegory by treating every object as a human being and splitting these suggestions into four anthropomorphized chapters: birth, school, employment, and retirement.

Categories: Java

Akka Monitoring: Telemetry OpenTracing

Javalobby Syndicated Feed - Thu, 27-Jul-17 00:01

In April 2017, Lightbend Telemetry version 2.4 was released. One of the most significant changes in this release is the addition of OpenTracing integration with support for Jaeger and Zipkin. OpenTracing is a “vendor-neutral open standard for distributed tracing.” Being that Akka is, by its nature, a platform for building distributed systems, tracing is a much-needed and often-requested feature that is a valuable tool for Akka developers.

This monitoring tip focuses on the required changes that are necessary to install and configure tracing in an existing Akka project. In future tips, we will dive deeper into how to use tracing to track activity in Akka systems.

Categories: Java

Did Someone Say Java 9?

Javalobby Syndicated Feed - Wed, 26-Jul-17 21:01

In a previous blog, we discussed the features that were added in Java 8.

More than three years after the release of Java 8, the next version is now just around the corner, with a tentative release date of Sept. 21, 2017.

Categories: Java

Upgrading to Vaadin Framework 8 (Part 1 of 2)

Javalobby Syndicated Feed - Wed, 26-Jul-17 13:01

With each major release of a framework, Vaadin in this case, you usually expect major core modifications. But this time, migration is not too complicated. Not only because of the migration tool provided to make a smooth transition from Framework 7 to Framework 8, but also because of the similarity in many of the components’ APIs.

A good upgrade strategy is needed though, and I summarize it under the following headings:

Categories: Java

This Week in Spring: Java 9, REST, and Microservices

Javalobby Syndicated Feed - Wed, 26-Jul-17 09:01

Hi, Spring fans! This week I’m in Istanbul, Turkey, talking to customers and speaking at the Spring and Java meetups tonight. I hope you’ll join me and we’ll talk about cloud native Java! As usual, we’ve got a lot to cover this week so let’s get to it!

  • Spring Integration contributor Artem Bilan just announced Spring Integration 4.3.11. This release includes security updates as well as bug fixes.
  • Spring Batch lead Michael Minella just announced Spring Batch 3.0.8, which is mainly a maintenance release and a bugfix release.
  • Spring ninja Stéphane Nicoll has announced Spring Framework 4.3.0 which is a maintenance release for the upcoming Spring Boot 1.5.5 maintenance release. It includes 25 fixes and improvements.
  • Spring Cloud co-founder Spencer Gibb just announced Spring Cloud Dalston SR2. The release is primarily a bugfix. Also of note, this release marks the end of life for Spring Cloud Angel and Spring Cloud Brixton.
  • Spring Framework lead Juergen Hoeller just announced Spring Framework 5.0 RC3. The new release, part of an extended release candidate phase to allow Reactor 3.1, JUnit 5.0, Jackson 2.9 and so much more, includes API refinements in Spring WebFlux, Kotlin support, refined nullability declarations, and updated JDK9 support.
  • Spring Security contributor Joe Grandja just announced Spring Security 5.0.0.M3. The new release includes support for JSON Web Tokens (JWT), JSON Web Signatures (JWS) and integrated ID Token support for OpenID Connect authentication flows.
  • Spring integration and messaging ninja Artem Bilan just announced Spring AMQP 2.0.0 milestone 5. This release includes security fixes and a number of nice new features.
  • Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics. This project will serve as the metrics collection subsystem for Spring Boot 2.0 and will be backported to Spring Boot 1.0. Of note, though, is that Micrometer does not require Spring Boot to work. Check it out. There’s so much cool stuff here, one hardly knows where to start.
  • Gyula Lakatos put together an inspired, and detailed, blog detailing his first, and fresh, look at Spring Cloud Function on AWS Lambda. Nice job Gyula!
  • This is an oldie-but-a-goodie: Antonio Simoes talks about moving his company's architecture from a monolith to a non-blocking Spring Cloud-based architecture.
  • Microservices were driven by vendors. This isn’t strictly related to Spring, but I liked Stephen O’Grady’s discussion about microservices — why people adopt them and what differentiates them from SOA.
  • James Governor follows up on Stephen O’Grady’s post on microservices.
  • And in the absolutely-great-news-that-has-nothing-to-do-with-Spring-per-se column: 10 years ago, 2,600 female students took the AP Computer Science exam. In 2017, 29,000 female students took the exam. The growth among female students has been incredible, increasing participation in AP CS exams by 135% since 2016. Not to be outdone, underrepresented minorities have increased participation by nearly 170% over last year! Things are trending in the right direction. I sincerely hope our ecosystem sees increased diversity, more inclusion and participation, and that it continues to grow.
  • Spring Boot user Nicky Mølholm chimes in to share that the new Lego Life app is, behind the scenes, powered by Spring Boot. As a fan of both Legos and Spring Boot, I thought this was super cool! Congrats to the Lego Life team!
  • Move fast and don’t break things, by Rod Johnson — Spring creator and Atomist CEO Rod Johnson details how his company’s main offering can be a boon to teams using Spring Boot. This is a really cool offering that simplifies the end-to-end story for creating new services (which could and should be in Spring Boot), supporting collaboration, detecting breaking-and-outage-inducing changes, and managing the path to production (in, for example, Cloud Foundry). I’ve signed up!
  • Oracle Java Magazine, this month, has a nice look at some of the new features in Java 9 which is, of course, just around the corner.
  • This is an oldie-but-a-goodie from Spring ninja Greg Turnquist that looks at some issues people think they have with REST and how to get around them. It’s a good read.

Categories: Java

JIT Inlining [Snippets]

Javalobby Syndicated Feed - Wed, 26-Jul-17 03:01

Among all Just-In-Time Java compiler optimizations, method inlining is a particularly powerful one. Usually, when we write code following good object-oriented practices, we end up having lots of small objects with well-encapsulated attributes – most of them accessible via getters. But there is an overhead to making these additional calls and deepening the call stack. Fortunately, with JIT inlining, we can follow good practices and still benefit from performant code.
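
As a small, hypothetical illustration (the Holder class and the loop are invented for this sketch, not taken from the article): a tiny getter called in a hot loop is a typical inlining candidate, and HotSpot's diagnostic flags let you watch the decision being made.

public class InliningDemo {

    static final class Holder {
        private final int value;
        Holder(int value) { this.value = value; }
        // A tiny getter: a prime candidate for JIT inlining on a hot path.
        int getValue() { return value; }
    }

    public static void main(String[] args) {
        Holder holder = new Holder(21);
        long sum = 0;
        // After enough iterations the JIT compiles this loop and typically
        // inlines the getValue() call into the caller, removing the call overhead.
        for (int i = 0; i < 10_000_000; i++) {
            sum += holder.getValue();
        }
        System.out.println(sum);
        // Observe the decisions with:
        //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InliningDemo
    }
}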

A method is eligible for inlining if:

Categories: Java

Spring Cloud Config (Part 1)

Javalobby Syndicated Feed - Wed, 26-Jul-17 00:01

I’m a big fan of the Spring Cloud set of tools. I’ve been using them for a while both in a professional capacity (though quite limited) and a personal one. If you have not used it before, you should try it out. You’ll see how convenient it is to set up a microservices environment where your applications can follow the Twelve Factor App Manifesto.

Introduction to Spring Cloud Config

Store config in the environment.

That is the Third Factor of the Manifesto. In a Continuous Delivery world, it becomes more important to manage our apps' configuration in a way that de-couples changing configuration from deploying our apps, since we want to be able to react to certain events as quickly as possible. For example, changing the timeout for an HTTP call should not mean we need to deploy an application. It should be something you can do pretty quickly if you see that your environment is having some temporary issues.
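
As a hedged sketch of what this looks like on the client side – the property name http.client.timeout, the application name, and the config server URL are hypothetical, not taken from the article – a @RefreshScope bean picks up a changed timeout from a Spring Cloud Config server on refresh, without a redeploy:

// src/main/resources/bootstrap.properties (points the client at a config server):
//   spring.application.name=catalog-service
//   spring.cloud.config.uri=http://localhost:8888

import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.stereotype.Component;

@Component
@RefreshScope // the bean is rebuilt with fresh property values when a refresh is triggered
public class HttpClientSettings {

    // Hypothetical property served by the config server; falls back to 2000 ms.
    @Value("${http.client.timeout:2000}")
    private int timeoutMillis;

    public int getTimeoutMillis() {
        return timeoutMillis;
    }
}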

Categories: Java

Lookup additional data in Spark Streaming

codecentric Blog - Tue, 25-Jul-17 22:00

When processing streaming data, the raw data from the events is often not sufficient. In most cases additional data must be added, for example metadata for a sensor of which only the ID is sent in the event.

In this blog post I would like to discuss various ways to solve this problem in Spark Streaming. The examples assume that the additional data initially lives outside the streaming application and can be read over the network – for example from a database. All samples and techniques refer to Spark Streaming and not to Spark Structured Streaming. The main techniques are:

  • broadcast: static data
  • mapPartitions: for volatile data
  • mapPartitions + connection broadcast: effective connection handling
  • mapWithState: speed up by a local state

Broadcast

Spark has an integrated broadcasting mechanism that can be used to transfer data to all worker nodes when the application is started. This has the advantage, in particular with large amounts of data, that the transfer takes place only once per worker node and not with each task.

However, because the data can not be updated later, this is only an option if the metadata is static. This means that no additional data, for example information about new sensors, may be added, and no data may be changed. In addition, the transferred objects must be serializable.

In this example, each sensor type, stored as a numerical ID (1, 2, …), is to be replaced by a plain-text name in the stream processing (tire temperature, tire pressure, …). It is assumed that the mapping from type ID to name is fixed.

val namesForId: Map[Long, String] = Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")
stream.map(typId => (typId, namesForId(typId)))

A lookup without broadcast. The map is serialized for each task and transferred to the worker nodes, even if tasks were previously executed on that worker.

val namesForId: Map[Long, String] = Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")
val namesForIdBroadcast = sc.broadcast(namesForId)
stream.map(typId => (typId, namesForIdBroadcast.value(typId)))

The map is distributed to the workers via a broadcast and no longer has to be transferred for each task.

MapPartitions

The first way to read non-static data is in a map() operation. However, map() should not be used here, but rather mapPartitions(). mapPartitions() is not called for every single element, but once for each partition, which contains several elements. This allows connecting to the database only once per partition and then reusing the connection for all of the partition's elements.

There are two different ways to query the data: use a bulk API to process all elements of the partition together, or use an asynchronous variant, in which a non-blocking query is issued for each entry and the results are then collected.

wikiChanges.mapPartitions(elements => {
  val session: Session = ??? // create the database connection and session once per partition
  val preparedStatement: PreparedStatement = ??? // prepare the statement, if supported by the database
  elements
    .map(element => {
      // extract the key from the element and bind it to the prepared statement
      val boundStatement: BoundStatement = preparedStatement.bind(???)
      session.asyncQuery(boundStatement) // returns a Future
    })
    .map(future => ???) // retrieve the value from the future
})

An example of a lookup on data stored in Cassandra using mapPartitions and asynchronous queries.

The above example shows a lookup using mapPartitions: expensive operations like opening the connection are only done once per partition. An asynchronous, non-blocking query is issued for each element, and then the values are determined from the futures. Some libraries for reading from databases mainly use this pattern, such as the joinWithCassandraTable from the Spark Cassandra Connector.

Why is the connection not created at the beginning of the job and then used for each partition? For this purpose, the connection would have to be serialized and then transferred to the workers for each task. The amount of data would not be too large, but most connection objects are not serializable.

Broadcast Connection + MapPartitions

However, it is a good idea not to rebuild the connection for each partition, but to build it only once per worker node. To achieve this, the connection itself is not broadcast, because it is not serializable (see above); instead, a factory is broadcast that builds the connection on the first call and then returns this connection on all subsequent calls. This factory is then called in mapPartitions() to get the connection to the database.

In Scala it is not necessary to use a function for this. Here a lazy val can be used. The lazy val is defined within a wrapper class. This class can be serialized and broadcasted. On the first call, an instance of the non-serializable connection class is created on the worker node and then returned for every subsequent call.

class DatabaseConnection extends Serializable {
  // @transient keeps the non-serializable connection out of the serialized form;
  // the lazy val is initialized on first access on each worker JVM.
  @transient lazy val connection: AConnection = {
    // all the stuff to create the connection
    new AConnection(???)
  }
}
val connectionBroadcast = sc.broadcast(new DatabaseConnection)
incomingStream.mapPartitions(elements => {
  val connection = connectionBroadcast.value.connection
  // see above
})
A connection creation object is broadcasted and then used to retrieve the actual connection on the worker node.

MapWithState()

All of the approaches shown so far retrieve the data from a database when needed. This usually means a network call for each entry, or at least for each partition. It would be more efficient to have the data available directly in memory.

Stateful stream processing within an operation

With mapWithState(), Spark itself offers a way to transform data by means of a state and, in turn, to adjust that state. The state is managed per key. This key is used to distribute the data in the cluster, so that not all of the data has to be kept on every worker node. An incoming stream must therefore also be constructed as key-value pairs.

This keyed state can also be used for a lookup. By means of initialState(), an RDD can be passed as an initial state. However, any updates can only be performed based on a key. This also applies to deleting entries. It is not possible to completely delete or reload the state.

To update the state, additional notification events must be present in the stream. These can, for example, come from a separate Kafka topic and must be merged with the actual data stream (union()). The amount of data sent can range from a simple notification with an ID, which is then used to read the new data, to the complete new data set.

Messages are published to the Kafka topic, for example, if metadata is updated or newly created. In addition, timed events can be published to the Kafka topic or can be generated by a custom receiver in Spark itself.

Data Lookup in Spark Streaming using mapWithState

A simple implementation can look like this. First, the Kafka topics are read and the keys are additionally supplemented with a marker for the data type (data or notification). Then both streams are merged into a common stream and processed in mapWithState(). The state handling is specified beforehand by passing the state function to StateSpec.

val kafkaParams = Map("metadata.broker.list" -> brokers)
val notifications = notificationsFromKafka
  .map(entry => ((entry._1, "notification"), entry._2))
val data = dataFromKafka
  .map(entry => ((entry._1, "data"), entry._2))
val lookupState = StateSpec.function(lookupWithState _)
notifications
  .union(data)
  .mapWithState(lookupState)

The lookupWithState function describes the processing in the state. The following parameters are passed:

  • batchTime: the start time of the current microbatch
  • key: the key, in this case the original key from the stream, together with the type marker (data or notification)
  • valueOpt: the value for the key in the stream
  • state: the value stored in the state for the key

A tuple consisting of the original key and the original value as well as a number will be returned. The number is taken from the state or – if not already present in the state – is chosen randomly.

import org.apache.spark.streaming.{State, Time}
import scala.util.Random

def lookupWithState(batchTime: Time, key: (String, String), valueOpt: Option[String], state: State[Long]): Option[((String, String), Long)] = {
  key match {
    case (originalKey, "notification") =>
      // retrieve the new value from the notification or an external system
      val newValue = Random.nextLong()
      state.update(newValue)
      None // no downstream processing for notifications
    case (originalKey, "data") =>
      valueOpt.map(value => {
        val stateVal = state.getOption() match {
          // check if there is already a state for the key
          case Some(stateValue) => stateValue
          case None =>
            val newValue = Random.nextLong()
            state.update(newValue)
            newValue
        }
        ((originalKey, value), stateVal)
      })
  }
}

In addition, the timeout mechanism of mapWithState() can be used to remove entries from the state if they have not been updated for a certain time.

Conclusion

Loading additional information is a common problem in streaming applications. With Spark Streaming, there are a number of ways to accomplish this.

The easiest way is to broadcast static data at the start of the application. For volatile data, a read per partition is easy to implement and provides solid performance. With the use of Spark's state handling, the speed can be increased further, but it is more complex to develop.

Optimally, the data is always directly present on the worker node on which it is processed. This is the case, for example, when Spark's state handling is used. Kafka Streams pursues this approach even more consistently: here, a table is treated as a stream and – provided the streams are identically partitioned – distributed in the same way as the original stream. This makes local lookups possible.

Apache Flink is also working on efficient lookups, here under the title Side Inputs.


Categories: Agile, Java, TDD & BDD

Recognizing Patterns to Understand and Transform Apps

Javalobby Syndicated Feed - Tue, 25-Jul-17 21:01

Code is arguably the most valuable asset for many organizations. However, the value trapped in the code is not easy to use.

Why?

Categories: Java

FP in Scala for OOP Programmers (Part 1)

Javalobby Syndicated Feed - Tue, 25-Jul-17 14:16

Have you ever been to a Scala conference and told yourself, "I have no idea what this guy is talking about?" Did you nervously look around and see everyone smiling, saying, "Yeah, that's obvious?" If so, this post is for you. Otherwise, just skip it. You already know FP in Scala.

This post is optimistic, and although I'm not going to say functional programming in Scala is easy, our target is to understand it, so bear with me. Let's face the truth: functional programming in Scala is difficult if you are just another working class programmer coming mainly from a Java background. If you came from a Haskell background, then, hell, it's easy. If you come from a heavy math background, then hell yes it's easy! But if you are a standard working-class Java backend engineer with an OOP design background, then hell yeah it's difficult.

Categories: Java

5 Elements of a Perfect Pull Request

Javalobby Syndicated Feed - Tue, 25-Jul-17 13:01

Raise your hand if you remember the days of in-person code reviews. You may recall entire afternoons spent checking out changes from SVN, running them locally, and making notes of areas that could be improved. Next, you’d spend another hour or two in a room with your team discussing suggestions live. Once changes were incorporated, the whole process would begin again until it was finally time to merge. Ah, merging… never fun and oftentimes a total nightmare.

Quality time with the team is great and all, but we sure are glad those days are over. Thanks to the rise of distributed version control systems (DVCS) like Git, the peer feedback process has vastly improved. Git’s ability to branch and merge easily has made it possible to review smaller sets of changes more often. This type of code review is based on a concept known as pull requests. Pull requests provide a forum to discuss proposed changes to your codebase before they’re merged into shared branches (e.g. before merging a feature branch into master).

Categories: Java

State of the Dev Ecosystem: Java and JavaScript

Javalobby Syndicated Feed - Tue, 25-Jul-17 09:01

The team at JetBrains decided to poll developers from 2016-2017 in order to publish what they called "The State of Developer Ecosystem in 2017" and the results were made available in July. While over 5,000 respondents were noted as participating in the research, the source of the respondent pool used for the survey was not made available.

About JetBrains

JetBrains was initially founded in 2000 as IntelliJ and gained renown for making an Integrated Development Environment (IDE) that was just as functional as other IDEs without the cost of utilizing a majority of the RAM on the developer's workstation. Back in the days when IDEs were using somewhere between 1-2 MB of RAM upon launching, IntelliJ started with only 128k, which was amazing and very much appreciated.

Categories: Java
