HOW CAN WE TEACH STUDENTS TO ESTIMATE VERTICAL JUMP HEIGHTS USING GROUND REACTION FORCE DATA
The purpose of this study was to estimate vertical jump heights using ground reaction force (GRF) data and to suggest a practical example of applying biomechanical theory to real human motion. Jump heights estimated by the impulse and flight-time methods were statistically smaller than those obtained by the three-dimensional video method. The height differences appeared to arise mainly from the fact that part of the impulse moved the jumper horizontally as well as vertically. Other important factors for accurate height calculation are the jumper's mass and the threshold value used in GRF data collection. Calculating vertical jump height from GRF data thus provides a practical example of applying biomechanical theory to human motion and demonstrates how GRF equipment can be used for effective biomechanics education.
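For reference, the two GRF-based estimates mentioned above reduce to standard formulas (a sketch in our notation, not taken from the paper; m is the jumper's mass, F_z the vertical GRF, t_f the flight time, and g the gravitational acceleration; the flight-time method additionally assumes equal takeoff and landing heights):

    v_{\mathrm{to}} = \frac{1}{m} \int_{t_0}^{t_{\mathrm{to}}} \bigl(F_z(t) - mg\bigr)\,dt,
    \qquad h_{\mathrm{impulse}} = \frac{v_{\mathrm{to}}^{2}}{2g},
    \qquad h_{\mathrm{flight}} = \frac{g\,t_f^{2}}{8}

The impulse method integrates the net vertical force up to takeoff to obtain the takeoff velocity v_to, which is why the jumper's mass enters directly; the GRF threshold matters because it defines the takeoff and landing instants in both methods.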
Two new species of the similis-subgroup of Triconia Böttger-Schnack, 1999 (Copepoda, Oncaeidae) and a redescription of T. denticula Wi, Shin & Soh, 2011 from the northeastern equatorial Pacific
Three species of the similis-subgroup of the genus Triconia Böttger-Schnack, 1999 in the family Oncaeidae Giesbrecht, 1893 ["1892"] are described based on specimens collected using a fine-mesh net in the northeastern equatorial Pacific Ocean. One species is newly recorded in the equatorial Pacific, and the other two species are new to science. Triconia komo n. sp. is closely related to T. hawii (Böttger-Schnack & Boxshall, 1990), but differs distinctly in the relative length of the outer basal seta on P5 in the female, as well as slightly in the relative length of seta VI on the caudal ramus in both sexes. Triconia onnuri n. sp. closely resembles T. similis (Sars, 1918), but females can be distinguished by the relative lengths of the outer exopodal seta and the outer basal seta on P5. Both sexes differ from T. similis in the relative lengths of the endopodal spines on swimming legs 3 and 4 as well as in the form of caudal seta VI. The female of Triconia denticula Wi, Shin & Soh, 2011, which is newly recorded in the equatorial Pacific, is redescribed, including morphological details and differences compared to the original description from Korean waters. The type material of T. denticula deposited in the National Institute of Biological Resources, Incheon (NIBR) was re-examined and found to be inconclusive for taxonomic purposes, because the deposited copepod material and its labelling do not correspond to the description of the species. A fundamental revision of the type material of T. denticula is required. The present account includes an indication of the intraspecific variation in the endopodal spine lengths on swimming legs 2 to 4 for all three species, which is essential for assessing the usefulness of these characters for unequivocal identification of Triconia species. The spine lengths on exopodal segments 1 and 2 of swimming legs 3 and 4 are proposed as new morphometric characters for the identification of males of Triconia species, which are otherwise very similar in morphology.
Investigation of reliability of the cutoff probe by a comparison with Thomson scattering in high density processing plasmas
A “cutoff probe” uses microwaves to measure the electron density in a plasma. It is particularly attractive because it is easy to fabricate and use, its measurement is immune to surface contamination by dielectric materials, and its straightforward analysis yields the electron density in real time. In this work, we experimentally investigate the accuracy of the cutoff probe through a detailed comparison with Thomson scattering in a low-temperature, high-density processing plasma. The results show that the electron density measured by the cutoff probe is lower than that measured by Thomson scattering, and that the discrepancy between the two becomes smaller as the gap between the two tips increases and/or the neutral gas pressure decreases. The underestimate by the cutoff probe can be explained by the influence of the probe holder, which becomes important as the pressure increases and the gap narrows.
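For context, the cutoff probe infers density from the frequency at which microwave transmission between its two tips is cut off. Assuming, as is standard for this diagnostic, that the cutoff occurs at the electron plasma frequency f_pe, the density follows directly (notation ours, not from the paper):

    f_{\mathrm{pe}} = \frac{1}{2\pi} \sqrt{\frac{n_e e^{2}}{\varepsilon_0 m_e}}
    \quad \Longrightarrow \quad
    n_e = \frac{4\pi^{2} \varepsilon_0 m_e}{e^{2}} \, f_{\mathrm{pe}}^{2}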
RTS noise reduction of CMOS image sensors using amplifier-selection pixels
This paper describes an RTS (random telegraph signal) noise reduction technique for an active-pixel CMOS image sensor (CIS) with in-pixel selectable dual source-follower amplifiers. In this image sensor, the lower-noise transistor in each pixel is selected during the readout operation using a table that records the lower-noise transistor of every pixel. A prototype image sensor with 65×290 pixels demonstrating the effectiveness of this technique has been implemented in a 0.18 µm CMOS image sensor technology with a pinned-photodiode option. The measured results show that the amplifier-selection technique reduces the maximum noise to 9.6 e− from 17.2 e−, the maximum noise of the same array when a fixed one of the two amplifiers is used in each pixel without selection.
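A minimal software model of the selection-table idea described above (our own illustration only; in the actual sensor the selection is done by the pixel readout circuitry, and all names here are hypothetical):

    // Calibration: measure the RTS noise of both in-pixel source followers,
    // then record which one is quieter in a per-pixel table.
    #include <cstdint>
    #include <vector>

    struct AmpSelectTable {
        std::vector<uint8_t> useAmpB;  // one flag per pixel: 0 = amp A, 1 = amp B
    };

    AmpSelectTable calibrate(const std::vector<float>& noiseA,
                             const std::vector<float>& noiseB) {
        AmpSelectTable t;
        t.useAmpB.resize(noiseA.size());
        for (size_t i = 0; i < noiseA.size(); ++i)
            t.useAmpB[i] = (noiseB[i] < noiseA[i]) ? 1 : 0;  // pick the quieter amplifier
        return t;
    }

    // Readout: route each pixel through its quieter amplifier via the table.
    float readPixel(size_t i, const AmpSelectTable& t,
                    const std::vector<float>& viaA, const std::vector<float>& viaB) {
        return t.useAmpB[i] ? viaB[i] : viaA[i];
    }

Because RTS noise is a fixed property of each transistor, the table needs to be built only once and can then be applied to every frame.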
Chapter 13: On the Efficient Implementation of a Real-time Kd-tree Construction Algorithm
Abstract: The kd-tree is one of the most commonly used spatial data structures in a variety of graphics applications because of its reliably high acceleration performance. Several years ago, Zhou et al. devised an effective kd-tree construction algorithm that runs entirely on a GPU. In this chapter, we present improved GPU programming techniques for implementing the algorithm more efficiently on current GPUs. One of the major ideas is to reduce the number of necessary kernel functions by replacing the essential segmented-scan and reduction computations with simpler per-block atomic operations, thereby alleviating the overheads of multiple synchronous kernel calls. Combined with an efficient implementation of intrablock scan and reduction using recently introduced intrinsic functions, these changes achieve a remarkable performance enhancement in the kd-tree construction process. Through an example of real-time ray tracing for dynamic scenes of nontrivial complexity, we demonstrate that the proposed GPU techniques can be exploited effectively in various real-time applications.

Background and our contribution

For many important applications in computer graphics, such as ray tracing and those relying on particle-based computations, adopting a proper acceleration structure greatly affects run-time performance. Among the variety of spatial data structures, the kd-tree is frequently used because of its reliably high acceleration performance. Compared to other structures such as grids and bounding-volume hierarchies, its relatively high construction cost has been regarded as a drawback, despite efforts to develop optimized algorithms (e.g., [1] and Wu et al.). In this chapter, we present enhanced CUDA programming techniques for implementing the GPU method of Zhou et al. more efficiently on current GPUs.

Optimizations for the large-node stage

In Zhou et al.'s method, the upper levels of the kd-tree are constructed using a node-splitting scheme that combines spatial median splitting and empty-space maximizing. In particular, based on the observation that the assumptions made in the surface area heuristic (SAH) may often be inaccurate for large nodes, this stage of computation, called the large-node stage, simply selects the spatial median of the longest axis of the axis-aligned bounding box (AABB) of a node as its split position (a short sketch of this rule appears below). For efficient parallel implementation on a GPU, all triangles in each large node are grouped into chunks of fixed size (i.e., 256), parallelizing the computation over the triangles in the chunks. (Note that the triangles and chunks are mapped to threads and blocks, respectively, in the CUDA implementation.)

Triangle sorting with respect to splitting planes

The large-node stage iterates the node-splitting process until no large node is left. In Algorithm 2 [11], the most time-consuming parts of each iteration are the fourth and fifth steps, corresponding to lines 24-34 and 35-40, respectively, where the triangles of each large node are first sorted with respect to the splitting plane and the triangle counts of the resulting two child nodes are then computed. In this subsection, we present two different approaches to implementing these two steps on a GPU. We then analyze their performance in the section on experimental results.
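Before detailing the two approaches, the node-splitting rule used throughout this stage is simple enough to state in a few lines (our own sketch, with hypothetical names and data layout):

    // Spatial-median split for a large node: choose the longest axis of the
    // node's AABB and split at its midpoint (empty-space maximizing omitted).
    __device__ void chooseMedianSplit(const float lo[3], const float hi[3],
                                      int* splitAxis, float* splitPos)
    {
        float ext[3] = { hi[0] - lo[0], hi[1] - lo[1], hi[2] - lo[2] };
        int axis = 0;
        if (ext[1] > ext[axis]) axis = 1;
        if (ext[2] > ext[axis]) axis = 2;
        *splitAxis = axis;
        *splitPos  = 0.5f * (lo[axis] + hi[axis]);
    }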
Implementation using standard data-parallel primitives

As was done in the original implementation, the key issue is how to efficiently calculate, for each triangle in a large node (mapped to a CUDA thread), its address(es) in parallel in the new triangle index list (next list), whose production is complicated by the simultaneous subdivision of all large nodes in the current list (active list). For this, a kernel is first executed over every thread block corresponding to a chunk of triangles, classifying each triangle against the respective splitting plane and generating two bit-flag sequences of size 256 per chunk (the triangle bit flags). Then, for each of these, an exclusive scan is performed using the shared memory of the GPU, resulting in the local triangle offset sequences. In addition, the kernel counts the number of triangles in each bit-flag sequence by simple addition and places this number in an array in the global memory.

Implementation using atomic operations

The triangle-sorting technique described in the previous subsection requires a segmented scan to be carried out twice on the data sequences stored in the global memory, and can easily be implemented using the data-parallel primitive functions provided by, for example, the CUDPP library [2]. Although very effective, such an approach forces the run-time execution to be split into a sequence of synchronous kernel calls, whose overheads adversely impact run-time performance. To address this, observe that a side effect of using a standard segmented-scan method is that the relative order of triangle indices within a large node made of multiple chunks is retained in the respective child nodes. Such a property is important when the order of elements is essential, as in a radix sort, for example. However, retaining the strict order is unnecessary in the kd-tree construction algorithm, because the order of triangles within a kd-tree's leaf node is not critical in the later ray-tracing stage. This observation allows us to implement the triangle-sorting computation with a single faster-running kernel, replacing the segmented-scan operations with simpler per-chunk atomic operations supported by the CUDA API. In the new implementation, the memory configuration for the triangle index lists is slightly different. For each chunk of triangle indices in the current list, the new kernel repeats the same computation until the triangle numbers are calculated in the array [A]. A representative thread then carries out two atomic additions, one per child node, each fetching the local offset from the corresponding atomic variable while simultaneously adding the triangle count to it; this tells us where to start storing the sorted triangle indices of that chunk in the child nodes. Then, once per child node, each thread checks the corresponding bit flag in the triangle bit flags array and, if it is set, puts its triangle index in the proper place in the next triangle index list, whose location can easily be deduced from the fetched offset and the offset in the triangle offsets array. In this implementation, the two segmented scans over the arrays in the global memory have been replaced by two atomic-add operations per thread block (see the sketch below). While the computation time is already reduced markedly by this change, two per-block scans, one for each child, must still be carried out per chunk to compute the triangle offsets.
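The following is a minimal sketch of the per-chunk atomic-add idea just described (our own naming, and with a single counter for one child shown for brevity); it combines a naive shared-memory scan with one atomicAdd that reserves the chunk's slots in the child's index list:

    // One 256-thread block per chunk. leftFlag[i] is 1 if triangle i goes to
    // the left child (a symmetric pass handles the right child).
    __global__ void scatterLeft(const int* triIdx,             // current list
                                const unsigned char* leftFlag, // triangle bit flags
                                int numTris,
                                int* leftCounter,              // atomic variable of the child
                                int* nextList)                 // next list (left child part)
    {
        __shared__ int offset[256];   // per-block scan workspace
        __shared__ int baseOffset;    // chunk's start position in nextList

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int flag = (i < numTris) ? leftFlag[i] : 0;

        // Inclusive per-block scan of the flags (naive Hillis-Steele version).
        offset[threadIdx.x] = flag;
        __syncthreads();
        for (int d = 1; d < blockDim.x; d <<= 1) {
            int v = (threadIdx.x >= d) ? offset[threadIdx.x - d] : 0;
            __syncthreads();
            offset[threadIdx.x] += v;
            __syncthreads();
        }
        int myOffset = offset[threadIdx.x] - flag;  // convert to exclusive scan

        // One atomic add per block replaces a global segmented scan: it
        // reserves 'chunk total' slots and returns the chunk's base offset.
        if (threadIdx.x == blockDim.x - 1)
            baseOffset = atomicAdd(leftCounter, offset[threadIdx.x]);
        __syncthreads();

        if (i < numTris && flag)
            nextList[baseOffset + myOffset] = triIdx[i];
    }

Launched with 256-thread blocks, this reproduces the result of the segmented-scan version except that the triangle order across chunks is no longer preserved, which, as noted above, is harmless here.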
While such per-block scans can be performed effectively in the shared memory using a standard scan method, they can be implemented even more efficiently with recently introduced intrinsic functions, as illustrated next for the closely related reduction computation.

AABB computations for active large nodes

Another time-consuming part of the large-node stage is the second step (lines 9 to 14 of Algorithm 2), in which the AABB of all triangles in each node is calculated. The optimization techniques described in the previous subsection can also be applied to this AABB computation. The standard shared-memory reduction for computing per-chunk bounding boxes can be implemented more efficiently on the GPU by a simple modification of the scan implementation using the intrinsic shuffle function __shfl_up(). Then, via three pairs of atomic min and max operations, the result of each chunk reduction is written in parallel to the location in the global memory that corresponds to the large node to which the chunk belongs. Although such atomic operations are still regarded as expensive on current GPUs, we observe that our single-kernel implementation based on atomic operations runs significantly faster than the original implementation, which needed to perform segmented reductions six times.

Optimizations for the small-node stage

After all large nodes have been split into nodes whose triangle counts do not exceed 64, the small-node stage starts. Because sufficient nodes are now available, the computation in this stage is parallelized over nodes instead of triangles, evaluating the precise SAH metric to find the best splitting plane for each small node. The key to an efficient implementation of this stage is a preprocessed data structure that facilitates the iterative node-splitting process. For each initial small node, called the small root node, up to 384 (= 64 triangles × 3 axes (x, y, z) × 2 (min/max)) splitting-plane candidates are first collected from the triangles in the node. Then, for each candidate, two 8-byte bit masks are generated to represent the triangle sets contained on either side. Representing this information requires 20 bytes of memory per candidate, including the 4 bytes used to store the location of the splitting plane, implying that up to 7,680 (= 20 × 384) bytes of memory may be necessary for each small root node. It is important to choose an appropriate memory layout for this representation, because a nontrivial amount of data is accessed in parallel during the small-node stage. Although several different configurations are possible, we observed that the combination of a 4-byte access from the global memory for the splitting-plane location and a 16-byte access from the texture memory for the triangle sets incurred the lowest memory latency on the GPU tested. (Our analysis of the generated PTX code showed that 16 bytes of data were fetched from the texture memory even for a 4-byte access command.) With this representation, the SAH cost evaluation and the triangle sorting in the subsequent node-splitting step can be performed efficiently using simple bitwise operations, in which a parallel bit-counting operation is carried out very frequently to obtain the numbers of triangles in the child nodes (a sketch follows below).
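As an illustration of that bit-mask counting (our own sketch; the cost function and data layout used in the chapter may differ), each small node can be encoded as a subset of its small root's at most 64 triangles in one 64-bit word, so the per-side triangle counts reduce to a bitwise AND plus a population count:

    // Small-node SAH sketch: nodeMask encodes which of the small root's
    // triangles remain in this node; leftMask/rightMask are the precomputed
    // per-candidate masks of triangles falling on each side of the plane.
    __device__ float evaluateCandidate(unsigned long long nodeMask,
                                       unsigned long long leftMask,
                                       unsigned long long rightMask,
                                       float leftArea, float rightArea,
                                       float invNodeArea)
    {
        int nLeft  = __popcll(nodeMask & leftMask);   // triangles in left child
        int nRight = __popcll(nodeMask & rightMask);  // triangles in right child
        // Simplified SAH cost (traversal/intersection constants omitted).
        return (leftArea * nLeft + rightArea * nRight) * invNodeArea;
    }

Because __popcll() maps to a few hardware instructions, evaluating all candidates of a node stays cheap even though it is repeated in every node-splitting iteration.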
Experimental results

To measure the performance improvement achieved by the optimization techniques presented here, we first implemented the kd-tree construction algorithm of Zhou et al. on an NVIDIA GeForce GTX 680 GPU, effectively as described in the original paper. In doing this, we used the scan and reduction techniques described earlier.

Concluding remarks

In this chapter, we have presented efficient GPU programming techniques for implementing the well-known kd-tree construction algorithm.